
Feature/one step save baseline #193

Merged

merged 83 commits into develop-one-steps from feature/one-step-save-baseline on Mar 26, 2020

Conversation

nbren12
Contributor

@nbren12 nbren12 commented Mar 19, 2020

This adds several features to the one-step pipeline

  • big zarr. Everything is stored as one zarr file
  • saves physics outputs
  • some refactoring of the job submission.

Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67

TODO

  • I think the forecast_time dimension is still broken.
  • raise errors if the jobs fail.

@nbren12
Contributor Author

nbren12 commented Mar 25, 2020

mpi_comm=MPI.COMM_WORLD,
)

begin_monitor = fv3gfs.ZarrMonitor(
Contributor

@AnnaKwa AnnaKwa Mar 25, 2020

Is the "physics" referred to here in "begin_physics" the same physics that's being stepped in L261? I'm a bit confused by the naming and by the difference between "begin physics" and "before physics", since the former stores the state right before the dynamics step.

Edit:
Never mind, I see what the names refer to. Could the intermediate zarr names use the same names as the step dim ["begin", "after_dynamics", "after_physics"]?

Contributor Author

Okay. I rewrote the code to pass the paths of the monitors as a dictionary whose keys are the "step".
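For illustration, the dictionary-keyed setup described above might look like the sketch below. The step names come from this thread, but the output paths and the make_monitor stand-in (in place of fv3gfs.ZarrMonitor, whose signature is not shown here) are assumptions:

```python
# Sketch only: map each "step" name to its own zarr path and build one
# monitor per step. make_monitor is a hypothetical stand-in for
# fv3gfs.ZarrMonitor; it just returns the path for illustration.
STEPS = ("begin", "after_dynamics", "after_physics")

def make_monitor(path):
    return path

monitor_paths = {step: f"output/{step}.zarr" for step in STEPS}
monitors = {step: make_monitor(path) for step, path in monitor_paths.items()}
```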

@@ -0,0 +1,30 @@
import fsspec
Contributor

Is this file supposed to stay in the final PR?

Contributor Author

I guess we can remove it. I thought it was a helpful script though.

Contributor

@AnnaKwa AnnaKwa left a comment

Thanks Noah, this PR is really useful. Minor comments/requests and one spot in one_step.py that is breaking on my end.



def rename_sfc_dt_atmos(sfc: xr.Dataset) -> xr.Dataset:
DIMS = {"grid_xt": "x", "grid_yt": "y", "time": "forecast_time"}
Contributor

Not really in the scope of this PR but a note for future PRs to the develop branch: need to make sure all references to the old set of variable names in downstream workflows are updated to work with this output.

Contributor Author

Okay, maybe we could move this note to the PR that will be created from the develop branch to master.

import time

# avoid out of memory errors
# dask.config.set(scheduler='single-threaded')
Contributor

Could you add some info to readme instructing people to use this if this is an issue?

Contributor Author

added a section to the README
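For reference, the workaround in the commented-out line above is a one-liner. This is a generic Dask configuration setting, not code specific to this PR:

```python
import dask

# Force Dask to run tasks serially instead of in parallel threads,
# which reduces peak memory usage at the cost of speed.
dask.config.set(scheduler="single-threaded")
```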

@@ -1,89 +0,0 @@
import logging
Contributor

With the deletion of this file, is the intention for all one step jobs to use orchestrate_submit_jobs.py? Could you update the README for this change?

Contributor Author

Indeed. It doesn't seem like the original script was used by anything.

Contributor Author

@nbren12 nbren12 Mar 26, 2020

@frodre let me know if I should add it back.

I updated the README and deleted the help string, since that is bound to get out of date.

Contributor

@brianhenn brianhenn Mar 26, 2020

Thanks for updating the README. If the entry point is the orchestrator, can you update the one_step config section in workflows/end_to_end/full_workflow_config.yaml with a set of working parameters? FYI I was able to run this using _run_steps.sh, but when I tried to use the orchestrator I was having some issues.

curr_output_url,
job_labels=job_labels,
**kube_config,
model_config_url, "/tmp/null", job_labels=labels, **kube_kwargs
Contributor

@AnnaKwa AnnaKwa Mar 25, 2020

Is /tmp/null supposed to be the output url in this line?

Contributor Author

Yes. This job no longer needs to upload the run directory, since all the useful info is in the big zarr. Setting outdir=/tmp/null was the easiest way to do this.

Contributor Author

I added a comment to make this clear.

Contributor

@AnnaKwa AnnaKwa Mar 26, 2020

When I run a test (writing to my own test dir), it complains that the output /tmp/null needs to be a remote path. Any idea if I'm doing something wrong? Running the test script seems to work for others.

INFO:fv3net.pipelines.kube_jobs.one_step:Number of input times: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Number of completed times: 0
INFO:fv3net.pipelines.kube_jobs.one_step:Number of times to process: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Running the first time step to initialize the zarr store
Traceback (most recent call last):
  File "/home/AnnaK/fv3net/workflows/one_step_jobs/orchestrate_submit_jobs.py", line 116, in <module>
    local_vertical_grid_file=local_vgrid_file,
  File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 330, in submit_jobs
    run_job(index=k, init=True, wait=True, timesteps=timestep_list)
  File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 322, in run_job
    model_config_url, local_tmp_dir, job_labels=labels, **kube_kwargs
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 74, in run_kubernetes
    job_labels,
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 92, in _get_job
    _ensure_locations_are_remote(config_location, outdir)
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 113, in _ensure_locations_are_remote
    _ensure_is_remote(location, description)
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 119, in _ensure_is_remote
    f"{description} must be a remote path when running on kubernetes, "
ValueError: output directory must be a remote path when running on kubernetes, instead is /tmp/null

Contributor

I ran into this too, and in my case it was because I hadn't pulled and installed the latest fv3config that includes Noah's changes. Once I did that it worked.

Contributor

Oh I see, thanks

Contributor Author

Another good reason to run these kinds of tests in CI.

logger.info(f"Writing {variable} to {group}")
dims = group[variable].attrs["_ARRAY_DIMENSIONS"][1:]
dask_arr = ds[variable].transpose(*dims).data
dask_arr.store(group[variable], regions=(index,))
Contributor

I think index needs to get passed?

Contributor Author

Whoops... how did that not get passed? I'm surprised this didn't cause an error.

Contributor Author

I guess index is defined in the __main__ block... anyway, I fixed this.
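The region-write pattern in the snippet above can be sketched with a plain NumPy array standing in for the zarr variable (the shapes and index value here are made up):

```python
import numpy as np
import dask.array as da

# Preallocated target, standing in for group[variable] in the zarr store.
store = np.zeros((3, 4))

# One dataset's data, written into its slot along the first axis.
index = 1
arr = da.ones((1, 4), chunks=(1, 2))
arr.store(store, regions=(slice(index, index + 1),))
```

Without `index` in scope (the bug noted above), the `regions=` argument would raise a NameError rather than silently writing to the wrong slot.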

state = fv3gfs.get_state(names=VARIABLES + (TIME,))
if rank == 0:
logger.info("Beginning steps")
for i in range(fv3gfs.get_step_count()):
Contributor

The output zarr has a forecast_time length of 15, whereas our previous approach produced a length of 16 (initial conditions + results after 15 steps). Is the thinking that we could extract the 16 values from the current approach using the "begin" and "after_physics" points in the step dimension, or only use 15 values and compute 14 tendencies (instead of 15 as of now)? Not sure this matters much, but it's slightly different from before.

Contributor Author

Indeed, the "begin" and "after_physics" entries are redundant. The idea is that we can compute the total model tendency by taking differences along the "step" dimension (e.g.
ds.sel(step="after_physics") - ds.sel(step="begin")) rather than along the "forecast_time" dimension.
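A toy version of that difference, with made-up values and a single variable T:

```python
import numpy as np
import xarray as xr

# Toy dataset with the "step" and "forecast_time" dims from this PR;
# the values are invented for illustration.
ds = xr.Dataset(
    {"T": (("step", "forecast_time"), np.array([[1.0, 2.0], [1.5, 2.6]]))},
    coords={"step": ["begin", "after_physics"], "forecast_time": [0, 1]},
)

# Total model tendency per forecast step, differenced along "step".
tendency = ds.sel(step="after_physics") - ds.sel(step="begin")
```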

Contributor

@brianhenn brianhenn Mar 26, 2020

Oh OK that makes sense. So after_physics at time i is the same as begin at i + 1?

Contributor Author

That's right. It's redundant, but does simplify the logic IMO.
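The redundancy can be checked on toy data (values assumed): "after_physics" at forecast_time i should equal "begin" at i + 1:

```python
import numpy as np
import xarray as xr

# Rows: begin, after_physics; invented values chosen so the model state
# at the end of one step is the state at the start of the next.
data = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
ds = xr.Dataset(
    {"T": (("step", "forecast_time"), data)},
    coords={"step": ["begin", "after_physics"], "forecast_time": [0, 1, 2]},
)

# after_physics at times [0..n-2] should equal begin at times [1..n-1].
same = np.array_equal(
    ds["T"].sel(step="after_physics").isel(forecast_time=slice(0, -1)).values,
    ds["T"].sel(step="begin").isel(forecast_time=slice(1, None)).values,
)
```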

@@ -0,0 +1,14 @@
workdir=$(pwd)
Contributor

What's up with having this script and the empty run_steps.sh script?

Contributor Author

@nbren12 nbren12 Mar 26, 2020

These were for prototyping only, so I deleted these and pasted them into the readme as a quick example.


Contributor

@brianhenn brianhenn left a comment

Thanks, this worked great overall. See comments about using the orchestrator and the forecast time dim.

Contributor

@brianhenn brianhenn left a comment

LGTM thanks

Contributor

@AnnaKwa AnnaKwa left a comment

Looks good to me, approving

@nbren12
Contributor Author

nbren12 commented Mar 26, 2020

Great thanks a lot guys!

@nbren12 nbren12 merged commit f929f75 into develop-one-steps Mar 26, 2020
@nbren12 nbren12 deleted the feature/one-step-save-baseline branch March 26, 2020 23:32
nbren12 added a commit that referenced this pull request Apr 13, 2020
* Feature/one step save baseline (#193)

This adds several features to the one-step pipeline

- big zarr. Everything is stored as one zarr file
- saves physics outputs
- some refactoring of the job submission.

Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67

* save lat/lon grid variables from sfc_dt_atmos (#204)

* save lat/lon grid variables from sfc_dt_atmos

* Feature/use onestep zarr train data (#207)

Use the big zarr from the one step workflow as input to the create training data pipeline

* One-step sfc variables time alignment (#214)

This makes the diagnostics variables appended to the big zarr have the appropriate step and forecast_time dimensions, just as the variables extracted by the wrapper do.

* One step zarr fill values (#223)

This accomplishes two things: 1) preventing true model 0s from being cast to NaNs in the one-step big zarr output, and 2) initializing big zarr arrays with NaNs via full so that if they are not filled in due to a failed timestep or other reason, it is more apparent than using empty which produces arbitrary output.

* adjustments to be able to run workflows in dev branch (#218)

Remove references to hard coded dims and data variables or imports from vcm.cubedsphere.constants, replace with arguments.
Can provide coords and dims as args for mappable var

* One steps start index (#231)

Allows for starting the one-step jobs at the specified index in the timestep list to allow for testing/avoiding spinup timesteps

* Dev fix/integration tests (#234)

* change order of required args so output is last

* fix arg for onestep input to be dir containing big zarr

* update end to end integration test ymls

* prognostic run adjustments

* Improved fv3 logging (#225)

This PR introduces several improvements to the logging capability of our prognostic run image

- include upstream changes to disable output capturing in `fv3config.fv3run`
- Add `capture_fv3gfs_func` function. When called this capture the raw fv3gfs outputs and re-emit it as DEBUG level logging statements that can more easily be filtered.
- Refactor `runtime` to `external/runtime/runtime`. This was easy since it did not depend on any other module in fv3net. (except implicitly the code in `fv3net.regression` which is imported when loading the sklearn model with pickle).
- updates fv3config to master

* manually merge in the refactor from master while keeping new names from develop (#237)

* lint

* remove logging from testing

* Dev fix/arg order (#238)

* update history

* fix positional args

* fix function args

* update history

* linting

Co-authored-by: Anna Kwa <annak@vulcan.com>
Co-authored-by: brianhenn <brianhenn@gmail.com>