
Feature/one step save baseline #193

Merged

merged 83 commits into develop-one-steps from feature/one-step-save-baseline on Mar 26, 2020

Conversation

nbren12
Contributor

@nbren12 nbren12 commented Mar 19, 2020

This adds several features to the one-step pipeline

  • big zarr. Everything is stored as one zarr file
  • saves physics outputs
  • some refactoring of the job submission.

Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67

TODO

  • I think the forecast_time dimension is still broken.
  • raise errors if the jobs fail.

@nbren12
Contributor Author

nbren12 commented Mar 25, 2020

mpi_comm=MPI.COMM_WORLD,
)

begin_monitor = fv3gfs.ZarrMonitor(
Contributor

@AnnaKwa AnnaKwa Mar 25, 2020

Is the "physics" referred to here in "begin_physics" the same physics that's being stepped in L261? I'm a bit confused by the naming and by the difference between "begin physics" and "before physics", since the former stores the state right before the dynamics step.

Edit:
Never mind, I see what the names refer to. Could the intermediate zarr names use the same names as the step dim ["begin", "after_dynamics", "after_physics"]?

Contributor Author

Okay. I rewrote the code to pass the paths of the monitors as a dictionary whose keys are the "step".
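For illustration, the dictionary-keyed setup described above might look like the sketch below. The step names come from this thread, but the output paths and the make_monitor stand-in (in place of fv3gfs.ZarrMonitor, whose signature is not shown here) are assumptions:

```python
# Sketch only: map each "step" name to its own zarr path and build one
# monitor per step. make_monitor is a hypothetical stand-in for
# fv3gfs.ZarrMonitor; it just returns the path for illustration.
STEPS = ("begin", "after_dynamics", "after_physics")

def make_monitor(path):
    return path

monitor_paths = {step: f"output/{step}.zarr" for step in STEPS}
monitors = {step: make_monitor(path) for step, path in monitor_paths.items()}
```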

@@ -0,0 +1,30 @@
import fsspec
Contributor

Is this file supposed to stay in the final PR?

Contributor Author

I guess we can remove it. I thought it was a helpful script though.

Contributor

@AnnaKwa AnnaKwa left a comment

Thanks Noah, this PR is really useful. Minor comments/requests and one spot in one_step.py that is breaking on my end.



def rename_sfc_dt_atmos(sfc: xr.Dataset) -> xr.Dataset:
DIMS = {"grid_xt": "x", "grid_yt": "y", "time": "forecast_time"}
Contributor

Not really in the scope of this PR but a note for future PRs to the develop branch: need to make sure all references to the old set of variable names in downstream workflows are updated to work with this output.

Contributor Author

Okay, maybe we could move this note to the PR that will be created from the develop branch to master.

import time

# avoid out of memory errors
# dask.config.set(scheduler='single-threaded')
Contributor

Could you add some info to readme instructing people to use this if this is an issue?

Contributor Author

added a section to the README
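For reference, the workaround in the commented-out line above is a one-liner. This is a generic Dask configuration setting, not code specific to this PR:

```python
import dask

# Force Dask to run tasks serially instead of in parallel threads,
# which reduces peak memory usage at the cost of speed.
dask.config.set(scheduler="single-threaded")
```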

@@ -1,89 +0,0 @@
import logging
Contributor

With the deletion of this file, is the intention for all one step jobs to use orchestrate_submit_jobs.py? Could you update the README for this change?

Contributor Author

Indeed. It doesn't seem like the original script was used by anything.

Contributor Author

@nbren12 nbren12 Mar 26, 2020

@frodre let me know if I should add it back.

I updated the README and deleted the help string, since that is bound to get out of date.

Contributor

@brianhenn brianhenn Mar 26, 2020

Thanks for updating the README. If the entry point is the orchestrator, can you update the one_step config section in workflows/end_to_end/full_workflow_config.yaml with a set of working parameters? FYI I was able to run this using _run_steps.sh, but when I tried to use the orchestrator I was having some issues.

curr_output_url,
job_labels=job_labels,
**kube_config,
model_config_url, "/tmp/null", job_labels=labels, **kube_kwargs
Contributor

@AnnaKwa AnnaKwa Mar 25, 2020

Is /tmp/null supposed to be the output url in this line?

Contributor Author

Yes. This job no longer needs to upload the run directory, since all the useful info is in the big zarr. Setting outdir=/tmp/null was the easiest way to do this.

Contributor Author

I added a comment to make this clear.

Contributor

@AnnaKwa AnnaKwa Mar 26, 2020

When I run a test (writing to my own test dir), it complains that the output /tmp/null needs to be a remote path. Any idea if I'm doing something wrong? Running the test script seems to work for others.

INFO:fv3net.pipelines.kube_jobs.one_step:Number of input times: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Number of completed times: 0
INFO:fv3net.pipelines.kube_jobs.one_step:Number of times to process: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Running the first time step to initialize the zarr store
Traceback (most recent call last):
  File "/home/AnnaK/fv3net/workflows/one_step_jobs/orchestrate_submit_jobs.py", line 116, in <module>
    local_vertical_grid_file=local_vgrid_file,
  File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 330, in submit_jobs
    run_job(index=k, init=True, wait=True, timesteps=timestep_list)
  File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 322, in run_job
    model_config_url, local_tmp_dir, job_labels=labels, **kube_kwargs
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 74, in run_kubernetes
    job_labels,
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 92, in _get_job
    _ensure_locations_are_remote(config_location, outdir)
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 113, in _ensure_locations_are_remote
    _ensure_is_remote(location, description)
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 119, in _ensure_is_remote
    f"{description} must be a remote path when running on kubernetes, "
ValueError: output directory must be a remote path when running on kubernetes, instead is /tmp/null

Contributor

I ran into this too, and in my case it was because I hadn't pulled and installed the latest fv3config that includes Noah's changes. Once I did that it worked.

Contributor

Oh I see, thanks

Contributor Author

Another good reason to run these kinds of tests in CI.

logger.info(f"Writing {variable} to {group}")
dims = group[variable].attrs["_ARRAY_DIMENSIONS"][1:]
dask_arr = ds[variable].transpose(*dims).data
dask_arr.store(group[variable], regions=(index,))
Contributor

I think index needs to get passed?

Contributor Author

Whoops... how did that not get passed? I'm surprised this didn't cause an error.

Contributor Author

I guess index is defined in the __main__ block... anyway, I fixed this.
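The region-write pattern in the snippet above can be sketched with a plain NumPy array standing in for the zarr variable (the shapes and index value here are made up):

```python
import numpy as np
import dask.array as da

# Preallocated target, standing in for group[variable] in the zarr store.
store = np.zeros((3, 4))

# One dataset's data, written into its slot along the first axis.
index = 1
arr = da.ones((1, 4), chunks=(1, 2))
arr.store(store, regions=(slice(index, index + 1),))
```

Without `index` in scope (the bug noted above), the `regions=` argument would raise a NameError rather than silently writing to the wrong slot.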

state = fv3gfs.get_state(names=VARIABLES + (TIME,))
if rank == 0:
logger.info("Beginning steps")
for i in range(fv3gfs.get_step_count()):
Contributor

The output zarr has a forecast_time length of 15, whereas our previous approach produced a length of 16 (initial conditions + results after 15 steps). Is the thinking that we could extract the 16 values from the current approach using the "begin" and "after_physics" points in the step dimension, or only use 15 values and compute 14 tendencies (instead of 15 as of now)? Not sure this matters much, but it's slightly different from before.

Contributor Author

Indeed, the "begin" and "after_physics" entries are redundant. The idea is that we can compute the total model tendency by taking differences along the "step" dimension (e.g.
ds.sel(step="after_physics") - ds.sel(step="begin")) rather than along the "forecast_time" dimension.
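A toy version of that difference, with made-up values and a single variable T:

```python
import numpy as np
import xarray as xr

# Toy dataset with the "step" and "forecast_time" dims from this PR;
# the values are invented for illustration.
ds = xr.Dataset(
    {"T": (("step", "forecast_time"), np.array([[1.0, 2.0], [1.5, 2.6]]))},
    coords={"step": ["begin", "after_physics"], "forecast_time": [0, 1]},
)

# Total model tendency per forecast step, differenced along "step".
tendency = ds.sel(step="after_physics") - ds.sel(step="begin")
```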

Contributor

@brianhenn brianhenn Mar 26, 2020

Oh OK that makes sense. So after_physics at time i is the same as begin at i + 1?

Contributor Author

That's right. It's redundant, but does simplify the logic IMO.
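The redundancy can be checked on toy data (values assumed): "after_physics" at forecast_time i should equal "begin" at i + 1:

```python
import numpy as np
import xarray as xr

# Rows: begin, after_physics; invented values chosen so the model state
# at the end of one step is the state at the start of the next.
data = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
ds = xr.Dataset(
    {"T": (("step", "forecast_time"), data)},
    coords={"step": ["begin", "after_physics"], "forecast_time": [0, 1, 2]},
)

# after_physics at times [0..n-2] should equal begin at times [1..n-1].
same = np.array_equal(
    ds["T"].sel(step="after_physics").isel(forecast_time=slice(0, -1)).values,
    ds["T"].sel(step="begin").isel(forecast_time=slice(1, None)).values,
)
```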

@@ -0,0 +1,14 @@
workdir=$(pwd)
Contributor

What's up with having this script and the empty run_steps.sh script?

Contributor Author

@nbren12 nbren12 Mar 26, 2020

These were for prototyping only, so I deleted these and pasted them into the readme as a quick example.


Contributor

@brianhenn brianhenn left a comment

Thanks, this worked great overall. See comments about using the orchestrator and the forecast time dim.

Contributor

@brianhenn brianhenn left a comment

LGTM thanks

Contributor

@AnnaKwa AnnaKwa left a comment

Looks good to me, approving

@nbren12
Contributor Author

nbren12 commented Mar 26, 2020

Great thanks a lot guys!

@nbren12 nbren12 merged commit f929f75 into develop-one-steps Mar 26, 2020
@nbren12 nbren12 deleted the feature/one-step-save-baseline branch March 26, 2020 23:32
nbren12 added a commit that referenced this pull request Apr 13, 2020
* Feature/one step save baseline (#193)

This adds several features to the one-step pipeline

- big zarr. Everything is stored as one zarr file
- saves physics outputs
- some refactoring of the job submission.

Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67

* save lat/lon grid variables from sfc_dt_atmos (#204)

* save lat/lon grid variables from sfc_dt_atmos

* Feature/use onestep zarr train data (#207)

Use the big zarr from the one step workflow as input to the create training data pipeline

* One-step sfc variables time alignment (#214)

This makes the diagnostics variables appended to the big zarr have the appropriate step and forecast_time dimensions, just as the variables extracted by the wrapper do.

* One step zarr fill values (#223)

This accomplishes two things: 1) preventing true model 0s from being cast to NaNs in the one-step big zarr output, and 2) initializing big zarr arrays with NaNs via full so that if they are not filled in due to a failed timestep or other reason, it is more apparent than using empty which produces arbitrary output.

* adjustments to be able to run workflows in dev branch (#218)

Remove references to hard coded dims and data variables or imports from vcm.cubedsphere.constants, replace with arguments.
Can provide coords and dims as args for mappable var

* One steps start index (#231)

Allows for starting the one-step jobs at the specified index in the timestep list to allow for testing/avoiding spinup timesteps

* Dev fix/integration tests (#234)

* change order of required args so output is last

* fix arg for onestep input to be dir containing big zarr

* update end to end integration test ymls

* prognostic run adjustments

* Improved fv3 logging (#225)

This PR introduces several improvements to the logging capability of our prognostic run image

- include upstream changes to disable output capturing in `fv3config.fv3run`
- Add `capture_fv3gfs_func` function. When called this capture the raw fv3gfs outputs and re-emit it as DEBUG level logging statements that can more easily be filtered.
- Refactor `runtime` to `external/runtime/runtime`. This was easy since it did not depend on any other module in fv3net. (except implicitly the code in `fv3net.regression` which is imported when loading the sklearn model with pickle).
- updates fv3config to master

* manually merge in the refactor from master while keeping new names from develop (#237)

* lint

* remove logging from testing

* Dev fix/arg order (#238)

* update history

* fix positional args

* fix function args

* update history

* linting

Co-authored-by: Anna Kwa <annak@vulcan.com>
Co-authored-by: brianhenn <brianhenn@gmail.com>