Feature/one step save baseline #193

Merged: 83 commits, Mar 26, 2020
Changes from 54 commits
Commits (83)
8ca8819
Refactor online_modules to fv3net
nbren12 Mar 17, 2020
5909c73
black
nbren12 Mar 17, 2020
cee28fd
Create us.gcr.io/vcm-ml/prognostic_run:v0.1.0
nbren12 Mar 17, 2020
305629f
Refactor us.gcr.io/vcm-ml/fv3net image build code
nbren12 Mar 17, 2020
ff9f938
Add build_images makefile target
nbren12 Mar 17, 2020
b31a099
Add __version__ to fv3net init
nbren12 Mar 17, 2020
8f9cb3c
update prognostic_run_diags configuration
nbren12 Mar 17, 2020
31930d9
black
nbren12 Mar 17, 2020
f994766
update readme
nbren12 Mar 17, 2020
871be05
Fix table in README
nbren12 Mar 17, 2020
db16a39
fix yaml bug
nbren12 Mar 17, 2020
47d69e3
pin pandas version to 1.0.1
nbren12 Mar 17, 2020
cfe4523
save and process outputs at different stages
nbren12 Mar 18, 2020
1b120ad
post process runfile
nbren12 Mar 18, 2020
61c2036
Initialize the zarr store
nbren12 Mar 18, 2020
46b4f0f
write code to insert output in zarr store
nbren12 Mar 18, 2020
f6827c0
get data for one time-step saved in the cloud
nbren12 Mar 18, 2020
7eb6314
test for full set of steps
nbren12 Mar 18, 2020
2981824
update fv3config submodule
nbren12 Mar 19, 2020
3f6edda
update fv3config to master
nbren12 Mar 19, 2020
a735a2e
Refactor config generation
nbren12 Mar 19, 2020
8a7a49c
print more info in zarr_stat
nbren12 Mar 19, 2020
ff62ff3
remove some files
nbren12 Mar 19, 2020
a152ed7
Separate kube_config and fv3config handling
nbren12 Mar 19, 2020
e397c13
black and add workflow to job labels
nbren12 Mar 19, 2020
71f7300
linter
nbren12 Mar 19, 2020
74d5c9b
save coordinate information and attrs to the big zarr
nbren12 Mar 19, 2020
27cb13d
add coordinate info in runfile
nbren12 Mar 19, 2020
b8a9c04
debug
nbren12 Mar 19, 2020
0cf8cff
move writing into runfile
nbren12 Mar 19, 2020
8f443f7
black
nbren12 Mar 19, 2020
fdac3bc
log zarr creation stuff
nbren12 Mar 19, 2020
c6ad40b
initialize array with timesteps
nbren12 Mar 19, 2020
b7de07f
save many more outputs
nbren12 Mar 19, 2020
32ff298
parallelize opening writing across variables
nbren12 Mar 19, 2020
c8bff73
write data with only the master thread
nbren12 Mar 20, 2020
cad7e08
debug out of memory error
nbren12 Mar 20, 2020
49334f5
fix store call
nbren12 Mar 20, 2020
d0c77d5
change versioning info in makefile
nbren12 Mar 20, 2020
cfed20f
add tracer variables
nbren12 Mar 20, 2020
ae2e383
change image version and chunking
nbren12 Mar 20, 2020
97e4627
change chunks size
nbren12 Mar 20, 2020
34426c0
fix out of memory errors
nbren12 Mar 20, 2020
fba67f5
show dataset size in summary script
nbren12 Mar 20, 2020
4da489b
Add surface variables
nbren12 Mar 21, 2020
dc1f502
merge in two d data
nbren12 Mar 24, 2020
669df7e
reformat
nbren12 Mar 24, 2020
5a2d1a1
Merge branch 'master' into feature/one-step-save-baseline
nbren12 Mar 24, 2020
6f698bc
remove get_runfile_config
nbren12 Mar 24, 2020
56fe153
update fv3config to master
nbren12 Mar 24, 2020
e57c543
fix time dimension
nbren12 Mar 24, 2020
b85d0e9
add test of runfile
nbren12 Mar 24, 2020
92f2353
comment on need for fill_value=NaN
nbren12 Mar 24, 2020
1bc45b5
black
nbren12 Mar 24, 2020
b72a670
add provenance information to file
nbren12 Mar 24, 2020
08a7dc5
add git provenance info
nbren12 Mar 25, 2020
f2a4f0a
change order of arguments following upstream changes
nbren12 Mar 25, 2020
7a0d7b0
lint
nbren12 Mar 25, 2020
09cc500
fix runfile
nbren12 Mar 25, 2020
669bc68
comment out git logging, discovering the repo fails
nbren12 Mar 25, 2020
513562d
another runfile fix
nbren12 Mar 25, 2020
cf0e5ff
another bug 30 mins later...ugh
nbren12 Mar 25, 2020
f685dc9
delete submit_jobs.py
nbren12 Mar 25, 2020
597aab2
add some type hints
nbren12 Mar 25, 2020
732128f
lint
nbren12 Mar 25, 2020
e524d4e
Merge branch 'develop-one-steps' into feature/one-step-save-baseline
nbren12 Mar 25, 2020
558011d
unify the naming of the monitors and step names
nbren12 Mar 26, 2020
9f41b4b
Add out of memory troubleshooting info
nbren12 Mar 26, 2020
a962ef9
Update info about submission
nbren12 Mar 26, 2020
0e537bb
add comment clarifying the local upload dir
nbren12 Mar 26, 2020
3805291
lint
nbren12 Mar 26, 2020
9873e54
fix typo
nbren12 Mar 26, 2020
eb11110
Fix another bug
nbren12 Mar 26, 2020
3dd8cf0
another typo
nbren12 Mar 26, 2020
c144a5b
fix another bug
nbren12 Mar 26, 2020
2c5f2fe
fix key
nbren12 Mar 26, 2020
643757f
pass index to write_zarr_store
nbren12 Mar 26, 2020
d40b044
remove prototyping functions
nbren12 Mar 26, 2020
4a953da
lint
nbren12 Mar 26, 2020
92df973
bake the runfile into the submission script
nbren12 Mar 26, 2020
d4fa24f
print logging information
nbren12 Mar 26, 2020
71cc1ec
lint
nbren12 Mar 26, 2020
5a6fa36
update yaml with brian's code
nbren12 Mar 26, 2020
21 changes: 13 additions & 8 deletions Makefile
@@ -3,7 +3,7 @@
#################################################################################
# GLOBALS #
#################################################################################
VERSION = v0.1.0
VERSION = v0.1.0-a1
ENVIRONMENT_SCRIPTS = .environment-scripts
PROJECT_DIR := $(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))
BUCKET = [OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')
@@ -24,21 +24,26 @@ endif
# COMMANDS #
#################################################################################
.PHONY: wheels build_images push_image
wheels:
pip wheel --no-deps .
pip wheel --no-deps external/vcm
# wheels:
# pip wheel --no-deps .
# pip wheel --no-deps external/vcm
# pip wheel --no-deps external/fv3config

# pattern rule for building docker images
build_image_%:
docker build -f docker/$*/Dockerfile . -t us.gcr.io/vcm-ml/$*:$(VERSION)

build_image_prognostic_run: wheels
enter_%:
docker run -ti -w /fv3net -v $(shell pwd):/fv3net us.gcr.io/vcm-ml/$*:$(VERSION) bash

build_image_prognostic_run:

build_images: build_image_fv3net build_image_prognostic_run

push_image:
docker push us.gcr.io/vcm-ml/fv3net:$(VERSION)
docker push us.gcr.io/vcm-ml/prognostic_run:$(VERSION)
push_images: push_image_prognostic_run push_image_fv3net

push_image_%:
docker push us.gcr.io/vcm-ml/$*:$(VERSION)

enter: build_image
docker run -it -v $(shell pwd):/code \
14 changes: 10 additions & 4 deletions docker/prognostic_run/Dockerfile
@@ -1,8 +1,14 @@
FROM us.gcr.io/vcm-ml/fv3gfs-python:v0.2.1
FROM us.gcr.io/vcm-ml/fv3gfs-python:v0.3.1


COPY docker/prognostic_run/requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt
COPY fv3net-0.1.0-py3-none-any.whl /wheels/fv3net-0.1.0-py3-none-any.whl
COPY vcm-0.1.0-py3-none-any.whl /wheels/vcm-0.1.0-py3-none-any.whl
RUN pip3 install --no-deps /wheels/fv3net-0.1.0-py3-none-any.whl && pip3 install /wheels/vcm-0.1.0-py3-none-any.whl
RUN pip3 install wheel

# cache external package installation
COPY external/fv3config /fv3net/external/fv3config
COPY external/vcm /fv3net/external/vcm
RUN pip3 install -e /fv3net/external/vcm -e /fv3net/external/fv3config

COPY . /fv3net
RUN pip3 install --no-deps -e /fv3net
3 changes: 2 additions & 1 deletion docker/prognostic_run/requirements.txt
@@ -2,4 +2,5 @@ scikit-learn==0.22.1
dask
joblib
zarr
scikit-image
scikit-image
google-cloud-logging
2 changes: 1 addition & 1 deletion external/fv3config
112 changes: 54 additions & 58 deletions fv3net/pipelines/kube_jobs/one_step.py
@@ -1,12 +1,13 @@
import logging
import os
import fsspec
from toolz import assoc
import uuid
import yaml
import re
from copy import deepcopy
from multiprocessing import Pool
from typing import List, Tuple
from typing import List, Dict

import fv3config
from . import utils
@@ -197,20 +198,16 @@ def _update_config(
workflow_name: str,
base_config_version: str,
user_model_config: dict,
user_kubernetes_config: dict,
input_url: str,
config_url: str,
timestep: str,
) -> Tuple[dict]:
) -> Dict:
"""
Update kubernetes and fv3 configurations with user inputs
to prepare for fv3gfs one-step runs.
"""
base_model_config = utils.get_base_fv3config(base_config_version)
model_config = utils.update_nested_dict(base_model_config, user_model_config)
kubernetes_config = utils.update_nested_dict(
deepcopy(KUBERNETES_CONFIG_DEFAULT), user_kubernetes_config
)

model_config = fv3config.enable_restart(model_config)
model_config["experiment_name"] = _get_experiment_name(workflow_name, timestep)
@@ -225,27 +222,19 @@
}
)

return model_config, kubernetes_config
return model_config


def _upload_config_files(
model_config: dict,
kubernetes_config: dict,
config_url: str,
local_vertical_grid_file=None,
local_vertical_grid_file,
upload_config_filename="fv3config.yml",
) -> Tuple[dict]:
) -> str:
"""
Upload any files to remote paths necessary for fv3config and the
fv3gfs one-step runs.
"""

if "runfile" in kubernetes_config:
runfile_path = kubernetes_config["runfile"]
kubernetes_config["runfile"] = utils.transfer_local_to_remote(
runfile_path, config_url
)

model_config["diag_table"] = utils.transfer_local_to_remote(
model_config["diag_table"], config_url
)
@@ -259,38 +248,22 @@ def _upload_config_files(
with fsspec.open(config_path, "w") as config_file:
config_file.write(yaml.dump(model_config))

return model_config, kubernetes_config
return config_path


def prepare_and_upload_config(
workflow_name: str,
input_url: str,
config_url: str,
timestep: str,
one_step_config: dict,
base_config_version: str,
**kwargs,
) -> Tuple[dict]:
"""Update model and kubernetes configurations for this particular
timestep and upload necessary files to GCS"""

user_model_config = one_step_config["fv3config"]
user_kubernetes_config = one_step_config["kubernetes"]

model_config, kube_config = _update_config(
workflow_name,
base_config_version,
user_model_config,
user_kubernetes_config,
input_url,
config_url,
timestep,
)
model_config, kube_config = _upload_config_files(
model_config, kube_config, config_url, **kwargs
def get_run_kubernetes_kwargs(user_kubernetes_config, config_url):

kubernetes_config = utils.update_nested_dict(
deepcopy(KUBERNETES_CONFIG_DEFAULT), user_kubernetes_config
)

return model_config, kube_config
if "runfile" in kubernetes_config:
runfile_path = kubernetes_config["runfile"]
kubernetes_config["runfile"] = utils.transfer_local_to_remote(
runfile_path, config_url
)

return kubernetes_config


def submit_jobs(
@@ -305,28 +278,51 @@ def submit_jobs(
local_vertical_grid_file=None,
) -> None:
"""Submit one-step job for all timesteps in timestep_list"""
for timestep in timestep_list:

zarr_url = os.path.join(output_url, "big.zarr")
# kube kwargs are shared by all jobs
kube_kwargs = get_run_kubernetes_kwargs(one_step_config["kubernetes"], config_url)

def config_factory(**kwargs):
timestep = timestep_list[kwargs["index"]]
curr_input_url = os.path.join(input_url, timestep)
curr_output_url = os.path.join(output_url, timestep)
curr_config_url = os.path.join(config_url, timestep)

model_config, kube_config = prepare_and_upload_config(
config = deepcopy(one_step_config)
kwargs["url"] = zarr_url
config["fv3config"]["one_step"] = kwargs

model_config = _update_config(
workflow_name,
base_config_version,
config["fv3config"],
curr_input_url,
curr_config_url,
timestep,
one_step_config,
base_config_version,
local_vertical_grid_file=local_vertical_grid_file,
)
return _upload_config_files(
model_config, curr_config_url, local_vertical_grid_file
)

def run_job(wait=False, **kwargs):
"""Run a run_kubernetes job

jobname = model_config["experiment_name"]
kube_config["jobname"] = jobname
kwargs are passed workflows/one_step_jobs/runfile.py:post_process

"""
uid = str(uuid.uuid4())
labels = assoc(job_labels, "jobid", uid)
model_config_url = config_factory(**kwargs)
fv3config.run_kubernetes(
os.path.join(curr_config_url, "fv3config.yml"),
curr_output_url,
job_labels=job_labels,
**kube_config,
model_config_url, "/tmp/null", job_labels=labels, **kube_kwargs
@AnnaKwa (Contributor), Mar 25, 2020:
Is /tmp/null supposed to be the output url in this line?

@nbren12 (Contributor, Author):
Yes, this job does not need to upload the run directory any longer since all the useful info is in the big zarr. Setting outdir=/tmp/null was the easiest way to do this.

@nbren12 (Contributor, Author):
I added a comment to make this clear.

@AnnaKwa (Contributor), Mar 26, 2020:
When I run a test (writing to my own test dir), it complains that the output /tmp/null needs to be a remote path; any idea if I'm doing something wrong? Running the test script seems to work for others.

INFO:fv3net.pipelines.kube_jobs.one_step:Number of input times: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Number of completed times: 0
INFO:fv3net.pipelines.kube_jobs.one_step:Number of times to process: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Running the first time step to initialize the zarr store
Traceback (most recent call last):
  File "/home/AnnaK/fv3net/workflows/one_step_jobs/orchestrate_submit_jobs.py", line 116, in <module>
    local_vertical_grid_file=local_vgrid_file,
  File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 330, in submit_jobs
    run_job(index=k, init=True, wait=True, timesteps=timestep_list)
  File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 322, in run_job
    model_config_url, local_tmp_dir, job_labels=labels, **kube_kwargs
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 74, in run_kubernetes
    job_labels,
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 92, in _get_job
    _ensure_locations_are_remote(config_location, outdir)
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 113, in _ensure_locations_are_remote
    _ensure_is_remote(location, description)
  File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 119, in _ensure_is_remote
    f"{description} must be a remote path when running on kubernetes, "
ValueError: output directory must be a remote path when running on kubernetes, instead is /tmp/null

(Contributor):
I ran into this too, and in my case it was because I hadn't pulled and installed the latest fv3config that includes Noah's changes. Once I did that it worked.

(Contributor):
Oh I see, thanks.

@nbren12 (Contributor, Author):
Another good reason to have run these kinds of tests in CI.

)
logger.info(f"Submitted job {jobname}")
if wait:
utils.wait_for_complete(job_labels, sleep_interval=10)

for k, timestep in enumerate(timestep_list):
if k == 0:
logger.info("Running the first time step to initialize the zarr store")
run_job(index=k, init=True, wait=True, timesteps=timestep_list)
else:
logger.info(f"Submitting job for timestep {timestep}")
run_job(index=k, init=False)
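
To make the new plumbing concrete, here is a rough sketch of the one_step block that config_factory embeds in the uploaded fv3config.yml for each job; only the index/init/timesteps/url keys come from the diff above, while the bucket path and timestamps are invented for illustration.

    # Illustrative only: roughly what config_factory stores under the "one_step"
    # key of the model config for the first (initializing) job. Values are made up.
    one_step_kwargs = {
        "index": 0,                                    # position of this job's timestep in timestep_list
        "init": True,                                  # only the first job initializes big.zarr
        "timesteps": ["20160801.001500", "20160801.003000"],  # hypothetical list, passed to the init job only
        "url": "gs://<output-bucket>/one-step/big.zarr",       # shared store that every job writes into
    }

The first job is submitted with wait=True so this store exists before the remaining jobs, which run with init=False and only insert their own time slice.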
2 changes: 1 addition & 1 deletion fv3net/runtime/__init__.py
@@ -1,3 +1,3 @@
from . import sklearn_interface as sklearn
from .state_io import init_writers, append_to_writers, CF_TO_RESTART_MAP
from .config import get_runfile_config, get_namelist
from .config import get_namelist, get_config
4 changes: 2 additions & 2 deletions fv3net/runtime/config.py
@@ -10,10 +10,10 @@ class dotdict(dict):
__delattr__ = dict.__delitem__


def get_runfile_config():
def get_config():
with open("fv3config.yml") as f:
config = yaml.safe_load(f)
return dotdict(config["scikit_learn"])
return config


def get_namelist():
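
For callers, the practical effect of this change is that get_config() returns the whole parsed fv3config.yml instead of just the scikit_learn section wrapped in a dotdict. A minimal usage sketch follows; the key names looked up below, other than scikit_learn, are assumptions based on the one-step changes in this PR.

    from fv3net import runtime

    # Before (removed): runtime.get_runfile_config() returned only
    # config["scikit_learn"] as a dotdict, so callers used attribute access.
    # After: the full config is returned as a plain dict.
    config = runtime.get_config()                       # entire fv3config.yml contents
    one_step = config.get("one_step", {})               # block injected by config_factory (assumed key)
    sklearn_section = config.get("scikit_learn", {})    # the old behaviour is one lookup away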
14 changes: 14 additions & 0 deletions workflows/one_step_jobs/_run_steps.sh
@@ -0,0 +1,14 @@
workdir=$(pwd)
(Contributor):
What's up with having this script and the empty run_steps.sh script?

@nbren12 (Contributor, Author), Mar 26, 2020:
These were for prototyping only, so I deleted them and pasted them into the README as a quick example.

src=gs://vcm-ml-data/orchestration-testing/test-andrep/coarsen_restarts_source-resolution_384_target-resolution_48/
output=gs://vcm-ml-data/testing-noah/one-step
image=us.gcr.io/vcm-ml/prognostic_run:v0.1.0-a1
yaml=$PWD/deep-conv-off.yml

gsutil -m rm -r $output > /dev/null
(
cd ../../
python $workdir/orchestrate_submit_jobs.py \
$src $output $yaml $image -o \
--config-version v0.3
)

1 change: 1 addition & 0 deletions workflows/one_step_jobs/deep-conv-off.yml
@@ -1,5 +1,6 @@
kubernetes:
docker_image: us.gcr.io/vcm-ml/fv3gfs-python:v0.2.1
runfile: workflows/one_step_jobs/runfile.py
fv3config:
diag_table: workflows/one_step_jobs/diag_table_one_step
namelist:
5 changes: 4 additions & 1 deletion workflows/one_step_jobs/orchestrate_submit_jobs.py
@@ -83,7 +83,10 @@ def _create_arg_parser():
one_step_config = yaml.load(file, Loader=yaml.FullLoader)
workflow_name = Path(args.one_step_yaml).with_suffix("").name
short_id = get_alphanumeric_unique_tag(8)
job_label = {"orchestrator-jobs": f"{workflow_name}-{short_id}"}
job_label = {
"orchestrator-jobs": f"{workflow_name}-{short_id}",
"workflow": "one_step_jobs",
}

if not args.config_url:
config_url = os.path.join(args.output_url, CONFIG_DIRECTORY_NAME)
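
The extra "workflow" label, together with the per-job "jobid" that submit_jobs attaches with toolz.assoc, is what lets the orchestrator find and wait on exactly its own jobs. A hedged sketch of the label flow is below; the workflow-name/short-id value is invented, and the commented-out calls mirror the ones shown in the one_step.py diff above.

    import uuid
    from toolz import assoc

    # built once per orchestrator invocation (value is hypothetical)
    job_labels = {
        "orchestrator-jobs": "deep-conv-off-a1b2c3d4",
        "workflow": "one_step_jobs",
    }

    # each submitted job additionally gets a unique jobid label
    labels = assoc(job_labels, "jobid", str(uuid.uuid4()))

    # fv3config.run_kubernetes(model_config_url, "/tmp/null", job_labels=labels, **kube_kwargs)
    # utils.wait_for_complete(job_labels, sleep_interval=10)   # block on all jobs sharing these labels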
Empty file.