slow on NERSC compute nodes #86

Closed
wagmanbe opened this issue Feb 4, 2021 · 7 comments

@wagmanbe

wagmanbe commented Feb 4, 2021

Hi,
My E3SM diagnostics jobs aren't running. Could the E3SM-Unified environment be bogging them down?

Interactive jobs on NERSC knl and haswell slow to a crawl after I load the E3SM-Unified environment, e.g.:

`salloc --nodes=1 --partition=debug --time=00:30:00 -C knl`

`source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh`

After this, everything slows down and my diagnostic script hangs on the import statements.

These problems do not occur on the login node.

@chengzhuzhang

I wonder whether this might not be an e3sm-unified problem. I just tried e3sm-diags from e3sm-unified through interactive jobs on haswell, and it ran well. However, knl has been problematic; it's a known issue: E3SM-Project/e3sm_diags#314. @wagmanbe would you try it again on haswell? If it still gives trouble, could you share your run script and I will try to reproduce it.

@xylar
Contributor

xylar commented Feb 5, 2021

Are you not seeing these problems on knl when you use an E3SM_Diags development environment? I have always found Python packages to run slowly on knl, so I would be surprised if this were specific to E3SM-Unified, but I can investigate if it appears to be. I agree that haswell is the recommended option for all Python codes.

@wagmanbe
Author

wagmanbe commented Feb 5, 2021

It's affecting both knl and haswell. Maybe it's a NERSC issue?
salloc --nodes=1 --partition=debug --time=00:20:00 -C haswell
source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh <-- Hangs for minutes.
python <--slow
import os
from acme_diags.parameter.core_parameter import CoreParameter <--hangs for minutes.
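
One way to put numbers on "hangs for minutes" is to time the same import on the login node and on the compute node. The following is a minimal sketch under stated assumptions, not part of e3sm_diags or E3SM-Unified: `time_import` is a hypothetical helper, and the standard-library module names are placeholders to be swapped for `acme_diags.parameter.core_parameter` once the environment is loaded.

```python
import importlib
import time


def time_import(module_name):
    """Return the seconds spent importing module_name.

    Hypothetical helper for illustration. A module that is already
    imported is served from the sys.modules cache and returns almost
    immediately, so run this in a fresh interpreter for a fair number.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Placeholder module names (assumption): replace with
# "acme_diags.parameter.core_parameter" inside the loaded environment.
for name in ["os", "json"]:
    print(f"{name}: {time_import(name):.3f} s")
```

Running the script once on a login node and once inside the `salloc` session gives two timings that can be compared directly. CPython 3.7 and later also accept `python -X importtime -c "import <module>"`, which prints a per-module breakdown to stderr.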

@darincomeau
Member

NERSC had very slow compute node performance yesterday afternoon/evening. A few of us experienced it, and it was posted on their status page: https://www.nersc.gov/live-status/motd/
There's no notice now, so I'd recommend trying again.

@wagmanbe
Author

wagmanbe commented Feb 5, 2021

Thank you, but this problem is occurring just the same today.

@chengzhuzhang

In this case, I suspect that the compute node problem is still there. I tried commands similar to those quoted below yesterday afternoon and got the same behavior, but when I tried again much later yesterday, everything looked fine...

> It's affecting both knl and haswell. Maybe it's a NERSC issue?
> salloc --nodes=1 --partition=debug --time=00:20:00 -C haswell
> source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh <-- Hangs for minutes.
> python <--slow
> import os
> from acme_diags.parameter.core_parameter import CoreParameter <--hangs for minutes.

@wagmanbe
Author

wagmanbe commented Feb 5, 2021

It's at least 10x faster this afternoon.

@wagmanbe wagmanbe closed this as completed Mar 8, 2021