# 1D: Saving the full state of the model to restart from later

Sometimes we may not want to run a full-length simulation all at once.  For that reason it is useful to be able to checkpoint the state of the model at a given time and restart from that state later.  This notebook illustrates how to do that with the Python wrapper; note that one could also do it using the conventional Fortran approach.

I have found that, within this notebook environment, re-initializing the model tends to hang, requiring that the Docker container be restarted.  For that reason I have included the boilerplate code that starts the ipyparallel cluster at the top of each section where I expect you will need to pick the notebook back up after restarting the Docker container.

## Recursively copy the contents of the example run directory to a clean folder in the docker container

In [1]:
import os
import shutil

In [2]:
REFERENCE = "reference_rundir"
RUNDIR = "rundir_1D"

if os.path.isdir(RUNDIR):
    shutil.rmtree(RUNDIR)
shutil.copytree(REFERENCE, RUNDIR);
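
If you expect to reset the run directory more than once, the copy step above can be wrapped in a small helper.  This is a minimal sketch using only the standard library; the name `fresh_copy` is hypothetical and not part of `fv3gfs`.

```python
import os
import shutil


def fresh_copy(reference, rundir):
    """Replace `rundir` with a clean recursive copy of `reference`.

    Hypothetical helper: any existing `rundir` is deleted first.
    """
    if not os.path.isdir(reference):
        raise FileNotFoundError(f"reference directory not found: {reference}")
    if os.path.isdir(rundir):
        shutil.rmtree(rundir)
    # copytree returns the destination path
    return shutil.copytree(reference, rundir)
```

Checking that the reference directory exists before deleting anything avoids losing an existing run directory when the reference path is mistyped.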

## Start the ipyparallel session for the notebook

This involves running a couple of shell commands (hence the `%%bash` cell magic at the top of the following cell); note that this means these commands are executed in a shell rather than in the notebook's Python kernel.

In [3]:
%%bash
# if you get a crash, add --debug to this command to put more info in logs
# logs are in /root/.ipython/profile_mpi/log
ipcluster start --profile=mpi -n 6 --daemonize
sleep 10  # command is asynchronous, so let's wait to avoid an error in the next cell

In [4]:
import ipyparallel as ipp
rc = ipp.Client(profile='mpi', targets='all', block=True)
dv = rc[:]
dv.activate()
dv.block = True
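
The `sleep 10` in the shell cell is a fixed guess at how long the engines need to register.  A minimal sketch of a polling alternative follows, using only the standard library.  `wait_until` is a hypothetical helper, and using `len(rc.ids)` as the readiness check is an assumption based on how this notebook counts engines; it only helps once the controller itself is reachable.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll `predicate()` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in this notebook (requires the `rc` client from above):
# ready = wait_until(lambda: len(rc.ids) >= 6, timeout=60)
```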

In [5]:
print("Running IPython Parallel on {0} MPI engines".format(len(rc.ids)))
print("Commands in the following cells will be executed in parallel (disable with %autopx)")
%autopx

Running IPython Parallel on 6 MPI engines
Commands in the following cells will be executed in parallel (disable with %autopx)
%autopx enabled


## Use `mpi4py` to gain access to the communicator for the notebook

In [6]:
from mpi4py import MPI

comm = MPI.COMM_WORLD

## Enter the run directory

Next we move into the run directory we created.  Note that we need to re-import `os` and re-define any variables we created before we started the cluster, because cells now execute on the engines rather than in the local kernel.  `fv3gfs.wrapper` requires that its routines are called from within a valid run directory.

In [7]:
import os
RUNDIR = "rundir_1D"
os.chdir(RUNDIR)

## Writing the state of the model to disk

To checkpoint the state of the model it can be useful to write its state to disk.  We can do this using `fv3gfs.util.write_state`.  The names of the fields necessary for restarting the model can be found using `fv3gfs.wrapper.get_restart_names()`.  Let's take a look at what that returns.

In [8]:
import os
import fv3gfs.util

from fv3gfs import wrapper

In [9]:
wrapper.initialize()

In [10]:
if comm.rank == 0: print(wrapper.get_restart_names())

[stdout:0] ['time', 'x_wind', 'y_wind', 'accumulated_x_mass_flux', 'accumulated_y_mass_flux', 'accumulated_x_courant_number', 'accumulated_y_courant_number', 'eastward_wind', 'northward_wind', 'x_wind_on_c_grid', 'y_wind_on_c_grid', 'air_temperature', 'pressure_thickness_of_atmospheric_layer', 'vertical_wind', 'vertical_pressure_velocity', 'vertical_thickness_of_atmospheric_layer', 'surface_geopotential', 'atmosphere_hybrid_a_coordinate', 'atmosphere_hybrid_b_coordinate', 'eastward_wind_at_surface', 'northward_wind_at_surface', 'total_condensate_mixing_ratio', 'surface_pressure', 'interface_pressure', 'logarithm_of_interface_pressure', 'interface_pressure_raised_to_power_of_kappa', 'layer_mean_pressure_raised_to_power_of_kappa', 'dissipation_estimate_from_heat_source', 'specific_humidity', 'cloud_water_mixing_ratio', 'rain_mixing_ratio', 'cloud_ice_mixing_ratio', 'snow_mixing_ratio', 'graupel_mixing_ratio', 'ozone_mixing_ratio', 'cloud_amount', 'air_temperature_after_physics', 'northwa

We'll run the model forward 10 timesteps and then checkpoint the state of all of these variables.

In [11]:
for i in range(10):
    wrapper.step_dynamics()
    wrapper.step_physics()

To save the state after this, we can first get the state of all the variables necessary for restarting the model, and then write it out.  We need to provide `fv3gfs.util.write_state` the state dictionary as well as a path to the file.  Note each rank will write to its own file, so we should name the files uniquely per rank.

In [12]:
restart_state = wrapper.get_state(wrapper.get_restart_names())
filename = os.path.join(os.getcwd(), "RESTART", f"ten-step-run.rank{comm.rank}.nc")
fv3gfs.util.write_state(restart_state, filename)
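
Because each rank writes its own file, the restart filename has to encode the rank.  A minimal sketch of building and parsing such a name, following the `ten-step-run.rank{N}.nc` pattern used above; the helper names are hypothetical and not part of `fv3gfs`.

```python
import os
import re

# Matches names such as "ten-step-run.rank3.nc"
_RANK_PATTERN = re.compile(r"^(?P<label>.+)\.rank(?P<rank>\d+)\.nc$")


def rank_filename(directory, label, rank):
    """Build the per-rank restart path for a given label and MPI rank."""
    return os.path.join(directory, f"{label}.rank{rank}.nc")


def parse_rank_filename(path):
    """Return (label, rank) for a per-rank restart path, or None if it does not match."""
    match = _RANK_PATTERN.match(os.path.basename(path))
    if match is None:
        return None
    return match.group("label"), int(match.group("rank"))
```

Keeping the naming rule in one place means the writing and reading steps cannot drift apart.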

We can see that we wrote some "restart" files in the `RESTART` directory.

In [13]:
if comm.rank == 0: print(os.listdir("RESTART/"))

[stdout:0] ['ten-step-run.rank0.nc', 'ten-step-run.rank1.nc', 'ten-step-run.rank2.nc', 'ten-step-run.rank3.nc', 'ten-step-run.rank4.nc', 'ten-step-run.rank5.nc']
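
Rather than inspecting the listing by eye, one could check that every rank produced its file.  This is a minimal sketch assuming the `{label}.rank{N}.nc` naming used above; `missing_rank_files` is a hypothetical helper.

```python
import os


def missing_rank_files(directory, label, n_ranks):
    """Return the ranks in range(n_ranks) that have no restart file in `directory`."""
    present = set(os.listdir(directory))
    return [
        rank
        for rank in range(n_ranks)
        if f"{label}.rank{rank}.nc" not in present
    ]
```

For the listing shown above, `missing_rank_files("RESTART", "ten-step-run", 6)` would return an empty list.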


We'll shut the model down now to illustrate how we can restart the model from that written state.

In [14]:
wrapper.cleanup()

## Restart the model from where we left off

Note that at this point you will likely need to shut down your container and restart it.  When coming back to this notebook, do not run any of the code above; instead, pick back up here.

To restart the model from where we left off, we can use `fv3gfs.util.read_state` to load the checkpointed state from above.  We can then use `fv3gfs.wrapper.set_state` to force the state of the model to match the restart state.  From there, we can run the model forward from the same place the previous simulation ended.

In [1]:
%%bash
# if you get a crash, add --debug to this command to put more info in logs
# logs are in /root/.ipython/profile_mpi/log
ipcluster start --profile=mpi -n 6 --daemonize
sleep 10  # command is asynchronous, so let's wait to avoid an error in the next cell

In [2]:
import ipyparallel as ipp
rc = ipp.Client(profile='mpi', targets='all', block=True)
dv = rc[:]
dv.activate()
dv.block = True

In [3]:
print("Running IPython Parallel on {0} MPI engines".format(len(rc.ids)))
print("Commands in the following cells will be executed in parallel (disable with %autopx)")
%autopx

Running IPython Parallel on 6 MPI engines
Commands in the following cells will be executed in parallel (disable with %autopx)
%autopx enabled


In [4]:
from mpi4py import MPI

comm = MPI.COMM_WORLD

In [5]:
import os
RUNDIR = "rundir_1D"
os.chdir(RUNDIR)

In [6]:
import fv3gfs.util

from fv3gfs import wrapper

In [7]:
wrapper.initialize()

In [8]:
filename = os.path.join(os.getcwd(), "RESTART", f"ten-step-run.rank{comm.rank}.nc")
state = fv3gfs.util.read_state(filename)
wrapper.set_state(state)

We can get the time of the model to show that its state reflects that it has been run forward since 2016-08-01 00:00:00 (the original start date of the simulation from above).

In [9]:
wrapper.get_state(["time"])

Out[0:6]: {'time': cftime.DatetimeJulian(2016, 8, 1, 2, 30, 0, 0)}

Out[1:6]: {'time': cftime.DatetimeJulian(2016, 8, 1, 2, 30, 0, 0)}

Out[2:6]: {'time': cftime.DatetimeJulian(2016, 8, 1, 2, 30, 0, 0)}

Out[3:6]: {'time': cftime.DatetimeJulian(2016, 8, 1, 2, 30, 0, 0)}

Out[4:6]: {'time': cftime.DatetimeJulian(2016, 8, 1, 2, 30, 0, 0)}

Out[5:6]: {'time': cftime.DatetimeJulian(2016, 8, 1, 2, 30, 0, 0)}
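
The restored time is consistent with the ten timesteps taken before the checkpoint.  The short calculation below works out the implied timestep length; it is a sketch that uses the standard-library `datetime` in place of `cftime.DatetimeJulian`, which is an assumption that holds here because both times fall on the same day.

```python
from datetime import datetime, timedelta

start = datetime(2016, 8, 1, 0, 0, 0)          # original start date of the run
after_restart = datetime(2016, 8, 1, 2, 30, 0)  # time reported after set_state
n_steps = 10                                    # steps taken before the checkpoint

elapsed = after_restart - start
timestep = elapsed / n_steps

print(elapsed)   # 2:30:00
print(timestep)  # 0:15:00
```

Two and a half hours over ten steps gives a 15-minute timestep.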

## Exercise: run a segmented simulation and check whether it reproduces the unsegmented run

Note that at this point you will likely need to shut down your container and restart it.  When coming back to this notebook, do not run any of the code above; instead, pick back up here.

Starting from 2016-08-01 00:00:00 (the default start date for our example run directory), run the model for 10 timesteps in two segments.  Run the first segment for five timesteps and the second segment for another five, restarting the second from the state written by the first.  Write out the state of the model after each of these segments.  Is the state of the model after the tenth timestep in the segmented run identical to that in the unsegmented run?
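
One way to answer the final question is to compare the two states field by field.  The sketch below is not the wrapper's API: it assumes each field has already been flattened into a plain sequence of floats (how to extract those from the restart state is left to you), and that non-numeric entries such as `time` are compared separately.  The name `compare_states` is hypothetical.

```python
import math


def compare_states(state_a, state_b):
    """Compare two {name: flat sequence of floats} mappings.

    Returns a dict mapping each differing field name to its maximum
    absolute difference; fields missing from one side or with mismatched
    lengths map to infinity.  An empty dict means the states are identical.
    NaN values are not treated specially in this sketch.
    """
    differences = {}
    for name in sorted(set(state_a) | set(state_b)):
        if name not in state_a or name not in state_b:
            differences[name] = math.inf
            continue
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            differences[name] = math.inf
            continue
        max_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
        if max_diff != 0.0:
            differences[name] = max_diff
    return differences
```

Reporting the maximum difference per field, rather than a single yes or no, shows whether any mismatch is at round-off level or something larger.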

In [1]:
%%bash
# if you get a crash, add --debug to this command to put more info in logs
# logs are in /root/.ipython/profile_mpi/log
ipcluster start --profile=mpi -n 6 --daemonize
sleep 10  # command is asynchronous, so let's wait to avoid an error in the next cell

In [2]:
import ipyparallel as ipp
rc = ipp.Client(profile='mpi', targets='all', block=True)
dv = rc[:]
dv.activate()
dv.block = True

In [3]:
print("Running IPython Parallel on {0} MPI engines".format(len(rc.ids)))
print("Commands in the following cells will be executed in parallel (disable with %autopx)")
%autopx

Running IPython Parallel on 6 MPI engines
Commands in the following cells will be executed in parallel (disable with %autopx)
%autopx enabled


In [4]:
from mpi4py import MPI

comm = MPI.COMM_WORLD

In [5]:
import os
RUNDIR = "rundir_1D"
os.chdir(RUNDIR)

Note that at this point you will likely need to shut down your container and restart it.  When coming back to this notebook, do not run any of the code above; instead, pick back up here.

In [1]:
%%bash
# if you get a crash, add --debug to this command to put more info in logs
# logs are in /root/.ipython/profile_mpi/log
ipcluster start --profile=mpi -n 6 --daemonize
sleep 10  # command is asynchronous, so let's wait to avoid an error in the next cell

In [2]:
import ipyparallel as ipp
rc = ipp.Client(profile='mpi', targets='all', block=True)
dv = rc[:]
dv.activate()
dv.block = True

In [3]:
print("Running IPython Parallel on {0} MPI engines".format(len(rc.ids)))
print("Commands in the following cells will be executed in parallel (disable with %autopx)")
%autopx

Running IPython Parallel on 6 MPI engines
Commands in the following cells will be executed in parallel (disable with %autopx)
%autopx enabled


In [4]:
from mpi4py import MPI

comm = MPI.COMM_WORLD

In [5]:
import os
RUNDIR = "rundir_1D"
os.chdir(RUNDIR)

In [6]:
import os
import fv3gfs.util

from fv3gfs import wrapper