
Improved fv3 logging #225

Merged: 34 commits merged into develop-one-steps on Apr 13, 2020
Conversation


@nbren12 (Contributor) commented Apr 8, 2020

This PR introduces several improvements to the logging capability of our prognostic run image:

  • Include upstream changes to disable output capturing in fv3config.fv3run.
  • Add a capture_fv3gfs_func function. When called, this captures the raw fv3gfs output and re-emits it as DEBUG-level logging statements that can be filtered more easily (a sketch of the idea follows below).
  • Refactor runtime to external/runtime/runtime. This was easy since it did not depend on any other module in fv3net (except, implicitly, the code in fv3net.regression, which is imported when loading the sklearn model with pickle).
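
A minimal sketch of the capture-and-re-emit idea (the name capture_fv3gfs_func and the four entry points match the diff discussed below, but the body is an assumption; real fv3gfs output comes from Fortran, so the actual implementation has to redirect at the file-descriptor level rather than relying on the Python-level redirection this simplified version uses):

```python
import contextlib
import functools
import io
import logging


def _reemit_as_debug(func, logger_name="fv3gfs"):
    """Wrap func so anything it prints is re-emitted as DEBUG log records."""
    logger = logging.getLogger(logger_name)

    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        buf = io.StringIO()
        # Python-level redirection keeps this sketch self-contained; the
        # real code needs an fd-level redirect (see the capture_stream
        # sketch later) because Fortran writes bypass sys.stdout entirely.
        with contextlib.redirect_stdout(buf):
            result = func(*args, **kwargs)
        for line in buf.getvalue().splitlines():
            logger.debug(line)
        return result

    return wrapped


def capture_fv3gfs_func():
    """Patch the noisy fv3gfs entry points in place."""
    import fv3gfs

    for name in ["step_dynamics", "step_physics", "initialize", "cleanup"]:
        setattr(fv3gfs, name, _reemit_as_debug(getattr(fv3gfs, name)))
```

With this sketch, raising the "fv3gfs" logger's level above DEBUG silences the model chatter while keeping it recoverable for debugging.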

@nbren12 changed the base branch from master to develop-one-steps on April 8, 2020 at 01:39
@frodre (Contributor) left a comment:


There's a lot of good stuff in here, thanks! I had some tests failing on my machine with fv3config.fv3run, but that's probably something about the installation on my system, and it looks like those are skipped on CI anyway.

I think the main comment worth thinking about is whether the list of functions whose output we're adjusting should be defined within fv3net (which could break if function names change upstream, or be incomplete) or designated in fv3gfs (which loses the explicit definition here of what's being altered).

Beyond that, there are a few clean-up items to attend to and some general comments/questions. Otherwise, LGTM!

@@ -3,7 +3,7 @@
#################################################################################
# GLOBALS #
#################################################################################
-VERSION ?= v0.1.0-a1
+VERSION ?= $(shell git rev-parse HEAD)
Contributor:

Just thinking through this change: building locally, we would get a unique version tied to our current commit. That makes it a bit tricky to re-use a previous revision, but no more tricky than looking up the tag. It doesn't affect the CI builds, because those set VERSION on their own. Seems fine to me, and I like that the build tag links uniquely to git commits.

@nbren12 (Author) commented Apr 13, 2020:

Yah. I think we should also be tagging our CI images with the commit. I remember seeing in various places online that this is the recommended workflow for maximizing reproducibility, since tags can change. I am more worried about building images from a dirty working tree.
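
A small sketch of a guard against that dirty-tree case (a hypothetical helper, not something in this PR; the Makefile change only runs `git rev-parse HEAD`):

```python
import subprocess


def image_version() -> str:
    """Tag images with the current commit, marking dirty working trees."""
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    # `git diff --quiet HEAD` exits non-zero when tracked files have
    # uncommitted changes, i.e. the image would not be reproducible.
    dirty = subprocess.run(["git", "diff", "--quiet", "HEAD"]).returncode != 0
    return f"{sha}-dirty" if dirty else sha
```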

@@ -47,6 +47,7 @@ dependencies:
- bokeh=1.4
- backoff
- pip:
- poetry
Contributor:

Starting to use poetry! If we eventually expand its use, we can adjust the pytest_default caching a bit, since having the lockfile available potentially enables partial updates. Nice step in that direction, with a working example in the runtime package.

@nbren12 (Author) commented Apr 13, 2020:

Yah. It would be great to use poetry in the images too.

from runtime import __version__


def test_version():
Contributor:

I haven't seen a test like this before. Is it just for the sake of testing an import, or was there weird stuff in the versions?

On a second note, maybe we should include the version-bump integration in setup.cfg from the get-go, so stuff like this is easy to catch and update automatically.

@nbren12 (Author):

I don't think we have decided whether the versions in external packages should match fv3net's. vcm isn't anywhere in the setup.cfg either.

"""Surpress stderr and stdout from all fv3gfs functions"""
import fv3gfs # noqa

for func in ["step_dynamics", "step_physics", "initialize", "cleanup"]:
Contributor:

Perhaps we'd want to keep track of the functions that have outputs within fv3gfs and reference that list here? That would prevent this list from getting out of sync with the package's function names.

OTOH, it's nice to have an explicit list here of what's being decorated. We could add logging to get the same information.

@nbren12 (Author) commented Apr 13, 2020:

I think this is OK at the moment, since those functions are part of the public API of fv3gfs, and that package has a pretty clear versioning strategy. Ultimately, I think this logic should become part of fv3gfs-python, which would mitigate your concern about this code becoming out of date.
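
To make the trade-off concrete, a sketch of the list-referencing alternative with an explicit fallback (FUNCTIONS_WITH_OUTPUT is hypothetical; fv3gfs exports no such attribute today), reusing the _reemit_as_debug wrapper sketched in the PR description above:

```python
import fv3gfs  # noqa

# Explicit fallback, matching the list in the diff above.
_NOISY_FUNCS = ["step_dynamics", "step_physics", "initialize", "cleanup"]

# Prefer a canonical list exported by fv3gfs itself, if one ever exists,
# so the names here cannot drift from the package.
for name in getattr(fv3gfs, "FUNCTIONS_WITH_OUTPUT", _NOISY_FUNCS):
    setattr(fv3gfs, name, _reemit_as_debug(getattr(fv3gfs, name)))
```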

@@ -286,6 +183,16 @@ def submit_jobs(
logger.info(pprint.pformat(locals()))
# kube kwargs are shared by all jobs
kube_kwargs = get_run_kubernetes_kwargs(one_step_config["kubernetes"], config_url)
kube_kwargs["capture_output"] = False
kube_kwargs["memory_gb"] = 6.0
Contributor:

These look like some new defaults? If so, there's a defaults dictionary earlier in this file that you can put them in:

https://github.com/VulcanClimateModeling/fv3net/blob/60795240f4d2147132df44fca48fdbfe17440116/fv3net/pipelines/kube_jobs/one_step.py#L24

@nbren12 (Author):

Okay. Thanks, I added it.
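
For illustration, the defaults-dictionary pattern being suggested might look like this (the dictionary and function names here are assumptions, not excerpts from one_step.py):

```python
# Keep all job defaults in one module-level dict so new knobs are added
# in a single place instead of being patched onto kube_kwargs afterwards.
ONE_STEP_KUBERNETES_DEFAULTS = {
    "capture_output": False,
    "memory_gb": 6.0,
}


def merged_kube_kwargs(user_config: dict) -> dict:
    """Merge user-supplied kubernetes settings over the defaults."""
    return {**ONE_STEP_KUBERNETES_DEFAULTS, **user_config}
```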

@@ -152,6 +174,7 @@ def mock_batch_api():
return num_jobs, num_success, mock_api, labels


@pytest.mark.skip()
Contributor:

Probably good to get rid of the mock monstrosity I wrote, given the new testing strategy.

@nbren12 (Author):

Haha, I remember being pretty into mocks at the time that was written. Now I think a much better strategy for testing K8s resources would be to spin up a small cluster with minikube or something.

@nbren12 (Author):

Deleted the mock.

Comment on lines +9 to +13
def capture_stream(stream):
"""fork is not compatible with mpi so this won't work

The wait line below hangs indefinitely.
"""
Contributor:

Worth keeping around if it doesn't work?

@nbren12 (Author):

I kind of want to. It took me a full work day to figure out how to do it without a tempfile! :(

@frodre (Contributor) commented Apr 10, 2020:

I meant for that to be a general review comment without approval or requested changes, but I think it stands fine as is.

@@ -33,30 +32,19 @@


def timesteps_to_process(
Contributor:

The merge conflict here and in orchestrate_submit_jobs.sh is with an additional argument I added to specify the start index in addition to the number of steps. I can send you the resolved code I got working locally.

logger = logging.getLogger(logger_name)
out.seek(0)
for line in out:
logger.debug(line.strip().decode("UTF-8"))
Contributor:

This worked great to simplify the output in the GCS k8s UI.

@nbren12 (Author):

I will make a GitHub issue to refactor this code into fv3gfs-python.
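
Putting the hunks above together, a self-contained sketch of the fd-level capture (the tempfile-plus-seek shape is inferred from the diff above; the exact fv3net implementation may differ):

```python
import contextlib
import logging
import os
import sys
import tempfile


@contextlib.contextmanager
def capture_stream(stream=sys.stdout, logger_name="fv3gfs"):
    """Capture a stream at the file-descriptor level, then re-emit each
    captured line as a DEBUG log record.

    dup2 redirects the descriptor itself, so output written by Fortran/C
    code (which never touches sys.stdout) is captured too, and no fork is
    needed, which keeps the approach MPI-safe.
    """
    fd = stream.fileno()
    saved = os.dup(fd)  # remember the original destination
    out = tempfile.TemporaryFile()
    try:
        stream.flush()
        os.dup2(out.fileno(), fd)  # the stream now writes to the tempfile
        yield
    finally:
        stream.flush()
        os.dup2(saved, fd)  # restore the original destination
        os.close(saved)
        logger = logging.getLogger(logger_name)
        out.seek(0)
        for line in out:
            logger.debug(line.strip().decode("UTF-8"))
        out.close()
```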

@nbren12 merged commit 332eb79 into develop-one-steps on Apr 13, 2020
@nbren12 deleted the feature/improved-one-step-logging branch on April 13, 2020 at 20:46
nbren12 added a commit that referenced this pull request on Apr 13, 2020:
* Feature/one step save baseline (#193)

This adds several features to the one-step pipeline

- big zarr. Everything is stored as one zarr file
- saves physics outputs
- some refactoring of the job submission.

Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67

* save lat/lon grid variables from sfc_dt_atmos (#204)

* save lat/lon grid variables from sfc_dt_atmos

* Feature/use onestep zarr train data (#207)

Use the big zarr from the one step workflow as input to the create training data pipeline

* One-step sfc variables time alignment (#214)

This makes the diagnostics variables appended to the big zarr have the appropriate step and forecast_time dimensions, just as the variables extracted by the wrapper do.

* One step zarr fill values (#223)

This accomplishes two things: 1) preventing true model 0s from being cast to NaNs in the one-step big zarr output, and 2) initializing big zarr arrays with NaNs via `full`, so that arrays left unfilled (due to a failed timestep or another reason) are more apparent than with `empty`, which produces arbitrary output.

* adjustments to be able to run workflows in dev branch (#218)

Removes references to hard-coded dims and data variables, and imports from vcm.cubedsphere.constants, replacing them with arguments. Coords and dims can now be provided as args for the mappable var.

* One steps start index (#231)

Allows the one-step jobs to start at a specified index in the timestep list, for testing and for avoiding spin-up timesteps.

* Dev fix/integration tests (#234)

* change order of required args so output is last

* fix arg for onestep input to be dir containing big zarr

* update end to end integration test ymls

* prognostic run adjustments

* Improved fv3 logging (#225)

This PR introduces several improvements to the logging capability of our prognostic run image

- include upstream changes to disable output capturing in `fv3config.fv3run`
- Add a `capture_fv3gfs_func` function. When called, this captures the raw fv3gfs output and re-emits it as DEBUG-level logging statements that can be filtered more easily.
- Refactor `runtime` to `external/runtime/runtime`. This was easy since it did not depend on any other module in fv3net (except, implicitly, the code in `fv3net.regression`, which is imported when loading the sklearn model with pickle).
- updates fv3config to master

* manually merge in the refactor from master while keeping new names from develop (#237)

* lint

* remove logging from testing

* Dev fix/arg order (#238)

* update history

* fix positional args

* fix function args

* update history

* linting

Co-authored-by: Anna Kwa <annak@vulcan.com>
Co-authored-by: brianhenn <brianhenn@gmail.com>