Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workflow to run model with supplied nudging tendency #215

Merged
merged 56 commits into from
Apr 9, 2020

Conversation

oliverwm1
Copy link
Contributor

@oliverwm1 oliverwm1 commented Apr 6, 2020

Major changes

  • add workflow for running FV3GFS model with some externally specified nudging tendency (in this case, monthly-mean tendencies from separate nudged-to-observations runs). In the future I plan to extend this workflow to handle making prognostic ML predictions of nudging tendencies.
  • refactor code that handles configuration of FV3GFS runs which are being nudged towards GFS analysis. This code was previously in the single_fv3gfs_runs workflow but has now been moved to fv3net/pipelines/kube_jobs and tests have been added. This code move goes against our new guideline (I did the refactor before that guideline was discussed). I think the correct ultimate solution is to move the entire kube_jobs set of modules to an external package, as they now being used be four workflows (one_step_jobs, prognostic_c48_run, single_fv3gfs_run, and the new run_with_learned_nudging). I'd rather not do that now, since I don't want to mess with the core ML workflow steps.

Minor changes

  • add import of get_protocol to vcm.cloud.__init__.py
  • pin pandas to 1.0.1 in prognostic_run image since there is a bug in later versions which causes issues when doing time operations with Julian datetimes.
  • added dataset of low-res GFS analysis data to catalog
  • change API of diagnostics-to-zarr workflow so that by default it saves Zarr stores of diagnostic output in the rundir of the given experiment instead of in the parent of the rundir.

@oliverwm1 oliverwm1 marked this pull request as ready for review April 6, 2020 23:27
Copy link
Contributor

@mcgibbon mcgibbon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some over-arching points:

  • There's some configuration currently provided using global constants which should probably be provided in a yaml file somewhere. This workflow could take them in as command-line arguments, or via the fv3config.yml.
  • The "nudging" runfile seems to actually be a "forcing" runfile. I would suggest refactoring the names to reflect that you are forcing a model using saved nudging data, rather than nudging the model against a reference.
  • Some of the helper methods around config dictionaries should probably be added to fv3config rather than fv3net.

fv3net/pipelines/diagnostics_to_zarr.py Outdated Show resolved Hide resolved
fv3net/pipelines/kube_jobs/nudge_to_obs.py Outdated Show resolved Hide resolved
tests/test_kube_jobs_nudge_to_obs.py Outdated Show resolved Hide resolved
fv3net/pipelines/kube_jobs/nudge_to_obs.py Outdated Show resolved Hide resolved
workflows/run_with_learned_nudging/mean_nudging_runfile.py Outdated Show resolved Hide resolved
workflows/run_with_learned_nudging/submit_to_kubernetes.py Outdated Show resolved Hide resolved
Comment on lines 19 to 23
def get_config():
"""Return fv3config dictionary"""
with open("fv3config.yml") as f:
config = yaml.safe_load(f)
return config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be added to fv3config as a config_from_yaml routine to match config_to_yaml. I would imagine fv3config.yml should be a global constant somewhere, possibly in fv3net.runtime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a config_to_yaml function to fv3config (will put up PR in a moment) and added a TODO to refactor to use that once fv3net points to an updated fv3config version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just bump the patch version on fv3config also, and point fv3net to the updated fv3config? You could do the version bumping and new features in the same PR as described in the release instructions. Would avoid having to keep track of this and fix it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

fv3net/pipelines/kube_jobs/nudge_to_obs.py Outdated Show resolved Hide resolved
Comment on lines 30 to 32
def get_timestep():
"""Return model timestep in seconds"""
return get_namelist()["coupler_nml"]["dt_atmos"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be added as a helper method in fv3config (in derive).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a function to fv3config, and put in TODO to refactor that usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored workflow to use fv3config version of get_timestep().

@mcgibbon
Copy link
Contributor

mcgibbon commented Apr 7, 2020

pin pandas to 1.0.1 in prognostic_run image since there is a bug in later versions which causes issues when doing time operations with Julian datetimes

Is it a workflow which has issues with this, or a package like fv3net or vcm? If it's a package, can this also be added to the setup.py for that package?

@oliverwm1
Copy link
Contributor Author

pin pandas to 1.0.1 in prognostic_run image since there is a bug in later versions which causes issues when doing time operations with Julian datetimes

Is it a workflow which has issues with this, or a package like fv3net or vcm? If it's a package, can this also be added to the setup.py for that package?

Noah ran into this issue in the new one-step workflow, I ran into it in this workflow. These both use the prognostic run image, so just pinning the version there should handle it. Looks like it's already been pinned in fv3net's environment.yml.

@spencerkclark
Copy link
Member

FYI I think the bug you are referring to has been accommodated for temporarily in the latest release of xarray (0.15.1), pydata/xarray#3764, and will ultimately be fixed in pandas in their next release (1.1.0), pandas-dev/pandas#32905.

Copy link
Contributor

@nbren12 nbren12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Oli. This looks like a lot of work, and overall quite clean.

Apart from some minor comments and FYIs below, my main concern is the number of different scripts used by this workflow. There are scripts for preparing, submitting, and preprocessing the jobs, and it is not 100% clear what the entry point of this workflow is, since it is not documented in the README. My preference would be that running make -C workflows/run_with_learned_nudging from the fv3net root would do everything that needs to happen (including preprocessing) for this job to complete successfully.

Also please add some explanation to HISTORY.rst.

@@ -106,6 +106,23 @@ sources:
access: read_only
urlpath: "gs://vcm-ml-data/2020-02-25-additional-november-C3072-simulation-C384-diagnostics/atmos_8xdaily_C3072_to_C384.zarr"

GFS_analysis_T85_2015_2016:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this. I think this will be very helpful for us.

@@ -63,9 +62,9 @@ def _parse_categories(diagnostic_categories, rundir):
return diagnostic_categories


def _parse_diagnostic_dir(diagnostic_dir, rundir):
def _get_diagnostic_dir(diagnostic_dir, rundir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This function does a very general thing, but has an overly specific name. I would just move this logic to the run function

diagnostic_dir = rundir if args.diagnostic_dir is None else args.diagnostic_dir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

]


def _get_and_write_nudge_files_description_asset(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has an "and" in the name. I would move the listing and writing into two separate function calls in update_config_for_nudging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1,7 +1,7 @@
## Diagnostics-to-zarr workflow
This workflow takes a path/url to a run directory as an input and saves zarr stores
of the diagnostic model output to a specified location. This workflow requires a
specific xarray version (0.14.0) and so to run locally, one must ensure your
specific xarray version (0.15.0) and so to run locally, one must ensure your
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should mention this in the README since it is bound to get our of sync, and you already have it in the setup.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the reference to a particular version number.

@@ -0,0 +1,5 @@
This workflow allows an external nudging tendency be applied to FV3GFS runs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a title with at the ## level so that this README is correctly interpreted by the sphinx docs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise please mention the path workflows/run_with_learned_nudging so people will know how to get to this folder when looking at the sphinx docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,28 @@
IMAGE = us.gcr.io/vcm-ml/prognostic_run:mean_nudging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear where this docker image is built from this PR and what is in it. Are you planning to adjust this to point at a versioned tag of the prognostic_run image before mergining?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, added another Dockerfile in fv3net/docker. I didn't add this new docker image to the build_images rule in fv3net's Makefile, since it's not used by any of the steps in the main end-to-end pipeline. There are instructions in the README about building the image though.


# constants
ROOT_URL=gs://vcm-ml-data/2020-03-30-learned-nudging-FV3GFS-runs
LATLON_VAR_LIST=DLWRFsfc,DSWRFsfc,DSWRFtoa,LHTFLsfc,PRATEsfc,SHTFLsfc,ULWRFsfc,ULWRFtoa,USWRFsfc,USWRFtoa,TMP2m,TMPsfc,SOILM,PRESsfc,ucomp,vcomp,temp,sphum,ps_dt_nudge,delp_dt_nudge,u_dt_nudge,v_dt_nudge,t_dt_nudge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some improvements in #153 that could avoid explicitly listing these in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know.

if [ "$DO_REGRID" = true ]; then
for RUN in $RUNS; do
# regrid certain monthly-mean variables to lat-lon grid
argo --cluster $ARGO_CLUSTER submit workflows/fregrid_cube_netcdfs/pipeline.yaml \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you pass -w to argo submit this command will block until the job completes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. postprocess.sh is the last step, so no need to block, but good to know.

config_url = os.path.abspath(config_url)
config = prepare_config(template, base_config, args.nudge_label, config_url)
fs = vcm.cloud.get_fs(config_url)
fs.mkdirs(config_url, exist_ok=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the new run_k8s can be passed the config dictionary directly, without needing it to be uploaded first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for reminder. Already uploading other assets, so I'll leave this be.

@@ -0,0 +1,47 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include this postprocessing as a dependency in the makefile someplace? It is currently not clear how it fits into the rest of the pipeline. Ideally, typing make should result in everything being run that needs to be run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to the README as we discussed.

@oliverwm1
Copy link
Contributor Author

FYI I think the bug you are referring to has been accommodated for temporarily in the latest release of xarray (0.15.1), pydata/xarray#3764, and will ultimately be fixed in pandas in their next release (1.1.0), pandas-dev/pandas#32905.

Thanks for the heads up Spencer!

Copy link
Contributor

@mcgibbon mcgibbon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the changes!

Copy link
Contributor

@nbren12 nbren12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

This workflow allows an external nudging tendency be applied to FV3GFS runs.
## Run with learned nudging workflow

This workflow (in `workflows/run_with_learned_nudging`) allows an external nudging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving the readme.

@oliverwm1 oliverwm1 merged commit 541ae51 into master Apr 9, 2020
@oliverwm1 oliverwm1 deleted the feature/apply-mean-nudging branch April 9, 2020 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants