How to do a model analysis

Matt Graham edited this page Jan 25, 2023 · 2 revisions

Introduction

This document explains how to do a simple model analysis, beginning with thinking of a model "experiment" and ending with the presentation of results as a plot.

Pre-requisites

Your system is set-up correctly if you can achieve the following two things:

  • Run all tests from the tip of the master branch without errors. If this is not the case, make sure you have run through all the steps on the Installation page.
  • At the command line run tlo batch-list. You should see a list of jobs. If not, make sure you have run all the steps on the Setting up Azure Batch page.

Overview of TLOmodel simulation structure

A high-level description of the key object types in TLOmodel and how they relate to each other is given in the entity relationship diagram below:

erDiagram
    Simulation ||--|{ Module : registers
    Module ||--|{ Event: schedules
    Module ||--|{ Parameter : defines
    Module ||--|{ Property : defines
    EventQueue ||--|{ Event: prioritizes
    Event }|--|| Population: updates
    Simulation ||--|| Population: contains
    Population ||--|{ Property: contains
    Simulation ||--|| EventQueue: controls

For the purposes of running a model analysis, the three key object types we need to know about are the simulation, modules and parameters:

erDiagram
    Simulation ||--|{ Module : registers
    Module ||--|{ Parameter : defines
    Simulation {
        int seed
        int initial_pop_size
        Date date
        Date end_date
        dict log_config 
    }
    Module {
        str name
        RandomState rng
    }
    Parameter {
        Type type
        str description
    }

Scenarios, draws and samples

While we can interact directly with Simulation objects, typically we instead create a Scenario which encapsulates multiple simulation runs across one or more sets of parameters corresponding to different settings of interest. By running simulations multiple times for each set of parameters we can capture statistical variability in the simulation outcomes due to the randomness that enters the simulation from drawing from (pseudo-)random number generators.

erDiagram
    Scenario ||--|{ Draw : contains
    Draw ||--|{ Sample: contains
    Scenario {
        int seed
        Date start_date
        Date end_date
        int pop_size
        dict log_configuration
        list modules
    }
    Draw {
        int draw_number
        int draw_seed
        dict parameters
    }
    Sample {
        int sample_number
        int simulation_seed
    }

The top-level Scenario object controls the global set up of the simulations - for example the simulation start and end dates, initial population size, set of modules to register and configuration for the loggers used to record the simulation output.

A Scenario will constitute one or more draws, with each draw corresponding to a specific set of model parameters (associated with each module) to run simulations for. Each module sets default values for the parameters it defines (often by reading in values from external resource files), with each draw defining overrides to these default values (with any parameters not specified left as their default values).

Each draw itself is associated with one or more samples or runs corresponding to running the simulation with the parameters for a particular draw with a specific seed for the random number generators used in the simulation. By having each run use a different random seed, we sample from the distribution of simulation outcomes / outputs given a set of parameter values.

For example, for a scenario with number_of_draws = 2 and runs_per_draw = 3 we would have something like the following:

graph TD;
    subgraph S[ ]
      Scenario --> d1[Draw 1];
      Scenario --> d2[Draw 2];
      subgraph D1[ ]
        d1 --> s11[Sample 1/1];
        d1 --> s12[Sample 1/2];
        d1 --> s13[Sample 1/3];
      end
      subgraph D2[ ]
        d2 --> s21[Sample 2/1];
        d2 --> s22[Sample 2/2];
        d2 --> s23[Sample 2/3];
      end
    end
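This draw/sample structure can be sketched as nested loops over draw and sample numbers, with each run getting its own derived seed. Note this is an illustrative sketch only - the simple additive seed derivation shown here is not TLOmodel's actual seeding scheme:

```python
from itertools import product

number_of_draws = 2
runs_per_draw = 3
base_seed = 0

# One entry per (draw, sample) pair: 2 draws x 3 samples = 6 runs in total.
runs = [
    {
        "draw_number": draw,
        "sample_number": sample,
        # A simple way to give every run a distinct seed; the real
        # per-run seeding in TLOmodel may differ.
        "simulation_seed": base_seed + draw * runs_per_draw + sample,
    }
    for draw, sample in product(range(number_of_draws), range(runs_per_draw))
]

print(len(runs))  # 6
```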

Azure Batch

Azure Batch is a service offered as part of the Microsoft Azure cloud platform for running batches of jobs in parallel on a pool of compute nodes.

The TLOmodel package contains a command line interface (CLI) tool tlo for submitting jobs corresponding to the runs (samples) of a particular Scenario to Azure Batch, checking the status of jobs running on Azure Batch and retrieving the results of completed jobs.

On Azure Batch, a submitted job is executed in parallel across a collection of nodes organized into a pool.

graph TD
    subgraph P["Pool"]
      subgraph N1["Node 1"]
      end
      subgraph N2["Node 2"]
      end
      subgraph N3["Node 3"]
      end
      subgraph N4["Node 4"]
      end
      subgraph N...["..."]
      end
    end

An Azure Batch job consists of a list of one or more tasks which are distributed across the nodes in a pool. The tlo CLI creates one task per scenario sample and requests a pool with as many nodes as the total number of samples, so that one sample runs on each node.

graph TD
    subgraph P["Pool"]
      subgraph N1["Node 1"]
      end
      subgraph N2["Node 2"]
      end
      subgraph N3["Node 3"]
      end
      subgraph N4["Node 4"]
      end
      subgraph N...["..."]
      end
    end
    subgraph Tasks
      s11
      s12
      s13
      s21
      s...[...]
    end
    N1 -.- s11[Sample 1/1]
    N2 -.- s12[Sample 1/2]
    N3 -.- s13[Sample 1/3]
    N4 -.- s21[Sample 2/1]

An example model analysis

Motivation

Say we wish to understand the loss in health outcomes in a population through some consumables being unavailable. We could examine this in the model by running the model under two different settings:

  • A setting which mimics the real-world situation where some consumables are sometimes not available.
  • A counterfactual setting in which all consumables are always available.

The difference in the health outcomes in the model between these two scenarios will be an estimate of the effect of consumables sometimes not being available.

Code

The example code is given below and is also found on the branch hallett/example-analysis in src/scripts/analysis_example/.

Step 0: create a new Git branch

The file defining a scenario is accessed on the Azure Batch nodes by fetching a Git commit from the TLOmodel repository on GitHub. To be able to submit a scenario job to Azure Batch we therefore need to create a new local Git branch which will be used to track the files associated with the scenario on your computer, add and commit the changes to the scenario files to this branch, and push these changes to GitHub.

Typically we would want to create a branch from the latest commit on the master branch.

%%{init: { 'gitGraph': { 'mainBranchName': 'master'}} }%%
gitGraph
  commit
  commit
  branch user-prefix/name-of-branch
  checkout user-prefix/name-of-branch
  commit

In this case we would first switch to (checkout) the master branch locally. In PyCharm this can be done by going to Git > Branches... from the main menu bar (or clicking the branch selector on the status bar at the bottom of the interface), selecting the master branch from the Local branches section of the resulting list and then selecting the Checkout option.

Once we have checked out our local master branch we need to make sure it is up to date with any changes on the corresponding master branch on the remote repository on GitHub since we last updated. To do this select Git > Pull... from the main menu bar, ensure that the dialog box shows git pull origin master (pull is the name of the Git command to fetch changes from a remote branch and merge them into the local branch; origin is the default name for the remote repository) and then click the Pull button.

Now that our master branch is up to date with the latest commits on GitHub we are ready to create a new branch. To do this, select Git > New Branch... from the main menu bar. In the resulting dialog enter the new branch name in the text field. As a convention, it can be useful to give all your branches a common prefix separated from the rest of the branch name by a forward slash, for example user-prefix/name-of-branch, as this will make it easier to find your branches in the PyCharm branch list and on GitHub. Once you have entered the branch name, ensure the Checkout branch tick box is ticked and then click the Create button.

For the purposes of this tutorial, rather than creating a new branch from master we will instead create a branch from the hallett/example-analysis branch as this already contains the example scenario files we will use here.

%%{init: { 'gitGraph': { 'mainBranchName': 'master'}} }%%
gitGraph
  commit
  commit
  branch hallett/example-analysis
  checkout master
  commit
  commit
  checkout hallett/example-analysis
  commit
  commit
  branch user-prefix/example-analysis
  commit

To do this in PyCharm, first select Git > Update Project... from the main menu bar (or click the ↙️ icon on the Git toolbar), ensure the Merge incoming changes into the current branch option is selected in the dialog box that appears, then click OK. This will ensure changes on all branches on the remote repository are fetched to the local repository.

Then select Git > Branches... from the main menu bar, and under the Remote branches section in the resulting list select the origin/hallett/example-analysis branch. In the New branch name field in the resulting dialog box enter the name {user-prefix}/example-analysis, where {user-prefix} is replaced with a short prefix unique to you - for example your initials or surname - then click the Create button.

Step 1: create a scenario file

Once we have created a new branch, the next step will typically be to create a Python file defining the scenario for the analysis we wish to run. Within the top-level TLOmodel directory, the src directory (shorthand for source code) contains two sub-directories: tlo (which contains the source code for the tlo Python package defining the model framework) and scripts. Any analysis-specific files you create, for example defining a scenario or a script to analyze the results of a scenario run, should be placed in a relevantly named subdirectory of scripts. Here we will use the src/scripts/analysis_example directory already created on the hallett/example-analysis branch.

To navigate to the src/scripts/analysis_example directory we can use the tree navigator in the Project tab on the left-hand side of the interface. Within the analysis_example directory you should see two .py Python source files:

  • analysis_impact_of_consumables_availability.py
  • scenario_impact_of_consumables_availability.py

It is the latter of these two files we will be interested in for now, as it defines the scenario class we will use in our analysis. If we were creating a new analysis scenario from scratch we could instead right-click on the relevant subdirectory within the src/scripts directory in the Projects tree explorer in PyCharm, select New > Python file from the context menu, and then give the file an appropriate name (typically all lower case with underscore characters _ as word separators).

We will now break down how we define a scenario programmatically for the analysis described above by explaining each part of the scenario_impact_of_consumables_availability.py file.

from tlo import Date, logging
from tlo.methods.fullmodel import fullmodel
from tlo.scenario import BaseScenario

We first import the names of objects we need to use from other Python modules (libraries). Here we import various names from the tlo package which contains the TLOmodel code -

  • Date - a type for representing dates in the simulation.
  • logging - a module which defines the types and functions for logging simulation output.
  • fullmodel - a function which returns a list of all Module instances that should be registered in a run of the full model.
  • BaseScenario - an abstract type that custom scenario classes should inherit from that implements some shared functionality.
class ImpactOfConsumablesAvailability(BaseScenario):

We define a new custom subclass of BaseScenario called ImpactOfConsumablesAvailability - the name we use for the scenario class should be descriptive of what the scenario is meant to help analyze. Python class names should generally use the CapWords convention (each word capitalized, with no separators).

    def __init__(self):
        super().__init__(
            seed=0,
            start_date=Date(2010, 1, 1),
            end_date=Date(2015, 1, 1),
            initial_population_size=10_000,
            number_of_draws=2,
            runs_per_draw=2,
        )

The initializer method __init__ of the class performs an initial set up of the class, including setting class attribute values. The self argument here will be the instance of the newly created class. The block starting with the super().__init__( line calls the __init__ method of the superclass of the object - here BaseScenario. The BaseScenario.__init__ method accepts several keyword arguments which allow us to set attributes which need to be defined for any scenario:

  • seed: The top-level integer seed to use for generating per run simulation seeds for random number generators (here zero),
  • start_date: Date to start simulation at (here January 1st 2010).
  • end_date: Date to end simulation at (here January 1st 2015).
  • initial_population_size: Number of individuals to initialize population with (here 10 000).
  • number_of_draws: Number of draws (distinct parameter sets) over which to run simulation (here two).
  • runs_per_draw: Number of independent model runs to perform per draw (here two).
    def log_configuration(self):
        return {
            'filename': 'impact_of_consumables_availability',
            'directory': './outputs',
            'custom_levels': {
                '*': logging.WARNING,
                'tlo.methods.demography': logging.INFO,
                'tlo.methods.healthburden': logging.INFO,
            }
        }

The log_configuration method should return a dictionary configuring the loggers used to record the simulation output. A structured logging system is used with a hierarchy of loggers for the different modules, and it is possible to control the verbosity of the output of each logger by specifying custom levels - for more details on how to configure logging see the Structured logging wiki page. Here we specify the default logging level to be WARNING (a relatively high level which will give sparse output for most modules, as only log messages at level ERROR or WARNING will be output), but specify a lower INFO logging level for two key modules of interest, Demography and HealthBurden, to ensure we record the output we need for our analysis from these modules. As a general principle it is good to keep log levels as high as possible while still retaining the log output needed for a specific analysis: log files can grow very large for long simulations when log levels are set too low, which can lead to simulations exceeding the resources on nodes and jobs failing.

    def modules(self):
        return fullmodel(resourcefilepath=self.resources)

The modules method should return a list of modules (instances of subclasses of the tlo.Module class corresponding to self-contained model components) to be registered in the simulation. This list can be constructed explicitly by importing module classes from the relevant Python modules (with module here instead referring to a single importable Python file) in tlo.methods, which can be useful if you wish to run a simulation with only a subset of the available modules included. If you wish to run a scenario with all currently 'complete' modules, a helper function fullmodel is provided in tlo.methods.fullmodel, which will return a list of instances of the relevant modules. The fullmodel function has one required argument - the path to the directory containing the resource files to use when initializing the modules, which is here set to the resources attribute automatically defined for instances of BaseScenario as a relative path ./resources. There are also two optional arguments to fullmodel:

  • use_simplified_births: A boolean argument specifying whether to use the SimplifiedBirths module in place of the full pregnancy related modules. This reduces the computational cost (and so run time) of the simulation at the cost of using a much simpler and less realistic model of birth events.
  • module_kwargs: A dictionary mapping from module class names to dictionaries of keyword argument names and values to set for the module.
    def draw_parameters(self, draw_number, rng):
        return {
            'HealthSystem': {'cons_availability': ['default', 'all'][draw_number]}
        }

The draw_parameters method should, given an integer draw number draw_number between 0 and number_of_draws - 1 (inclusive) and a seeded random number generator rng, return a dictionary mapping from module names to dictionaries specifying override values for the parameters in that module for this particular draw. This method is what controls the different parameter sets used for each draw of the scenario. The dictionaries specifying the parameter overrides for each module should map from string parameter names to the parameter value to use in place of the default for this draw.

In the example here, a single parameter cons_availability (controlling consumable availability) in the HealthSystem module is overridden, with draw number 0 using the string value 'default' (corresponding to the default behaviour of consumables not always being available) and draw number 1 using the string value 'all' (corresponding to all consumables always being available). Indexing into a list of values to use for a parameter for each draw number is a simple way to set parameter values when only one parameter is being changed across draws. An alternative approach for generating grids corresponding to Cartesian products of sets of values for multiple parameters is demonstrated below. The rng random number generator can also be used to generate random values for parameters to use within a Monte Carlo method.
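To illustrate the grid approach, the following sketch builds one override dictionary per draw as the Cartesian product of per-parameter value lists. The second parameter name, beds_availability, is hypothetical here and only serves to show the pattern:

```python
from itertools import product

# Value lists for each parameter to vary; `beds_availability` is a
# hypothetical second parameter used only to illustrate the pattern.
cons_availability_values = ["default", "all"]
beds_availability_values = ["default", "all"]

# One override dictionary per draw; number_of_draws should equal len(grid).
grid = [
    {"HealthSystem": {"cons_availability": cons, "beds_availability": beds}}
    for cons, beds in product(cons_availability_values, beds_availability_values)
]

def draw_parameters(draw_number, rng):
    # Look up the pre-computed parameter set for this draw.
    return grid[draw_number]

print(len(grid))  # 4 draws: the 2 x 2 Cartesian product of the value lists
```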

Exercise: update the scenario to make it quicker to run for testing

Currently the scenario is set to run for 5 years with an initial population of 10 000 and with the full set of pregnancy related modules, with this configuration taking tens of minutes for each simulation run. To allow the scenario to be run more quickly for testing purposes, update the definition of the scenario so that

  • the initial population size is 1000,
  • the simulation is run for 1 month from January 1st 2010,
  • the use_simplified_births argument to fullmodel is set to True to use the SimplifiedBirths module in place of the full pregnancy related modules.
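Taken together, these edits would look something like the following (one possible solution - try making the changes yourself first; this sketch assumes the BaseScenario keyword arguments and fullmodel signature shown earlier on this page):

```python
from tlo import Date
from tlo.methods.fullmodel import fullmodel
from tlo.scenario import BaseScenario

class ImpactOfConsumablesAvailability(BaseScenario):
    def __init__(self):
        super().__init__(
            seed=0,
            start_date=Date(2010, 1, 1),
            end_date=Date(2010, 2, 1),  # 1 month rather than 5 years
            initial_population_size=1_000,  # 1000 rather than 10 000
            number_of_draws=2,
            runs_per_draw=2,
        )

    def modules(self):
        # Use the SimplifiedBirths module in place of the full
        # pregnancy related modules to reduce run time.
        return fullmodel(
            resourcefilepath=self.resources, use_simplified_births=True
        )
```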

Once you have made the necessary changes, use the Commit tab in PyCharm (typically in the left sidebar) to add these changes to the staging area (select the scenario_impact_of_consumables_availability.py using the tick box within the Default Changelist tree view) and commit to your personal {user-prefix}/example-analysis branch (enter a descriptive commit message in the text field below the change list and then click the Commit button).

Step 2: run the scenario

To run scenarios we can use the tlo command line tool. PyCharm has a built-in terminal which can be used to run tlo commands - it should be available from the Terminal tab at the bottom of the interface or by going to View > Tool Windows > Terminal from the main menu bar.

We can run the scenario file we just edited

  • locally by running

    tlo scenario-run src/scripts/analysis_example/scenario_impact_of_consumables_availability.py
  • or as a batch job on Azure by running

    tlo batch-submit 

To use the latter command, the changes we made to the scenario file need to be pushed to GitHub. Before pushing these changes, however, we should first test that they work locally.

Exercise: do a test run of the scenario and submit to Azure Batch

Use the tlo scenario-run command to run the edited scenario locally from the terminal in PyCharm. If the scenario runs without errors you should see output like the following (some details such as file paths will differ)

Found class ImpactOfConsumablesAvailability in src/scripts/analysis_example/scenario_impact_of_consumables_availability.py
{"uuid": "8159d34ec0", "type": "header", "module": "tlo.scenario", "key": "message", "level": "INFO", "columns": {"message": "str"}, "description": null}
{"uuid": "8159d34ec0", "date": "0000-00-00T00:00:00", "values": ["Loaded scenario using /tmp/tmpqlv7e9xs"]}
{"uuid": "8159d34ec0", "date": "0000-00-00T00:00:00", "values": ["Found 2 draws; 2 runs/draw"]}
{"uuid": "8159d34ec0", "date": "0000-00-00T00:00:00", "values": ["Running draw 0, sample 0"]}
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/0/impact_of_consumables_availability__2023-01-24T093713.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/0
Finished writing module-specific log files.
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/1/impact_of_consumables_availability__2023-01-24T093821.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/1
Finished writing module-specific log files.
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/0/impact_of_consumables_availability__2023-01-24T093927.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/0
Finished writing module-specific log files.
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/1/impact_of_consumables_availability__2023-01-24T094031.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/1
Finished writing module-specific log files.

Once the scenario has completed successfully locally, we can push the changes on the current branch to GitHub. To do this in PyCharm we can either select Git > Push... from the main menu bar or click the ↗️ button on the Git toolbar. This will then show a dialog which should show a single commit with the message you entered earlier on your {user-prefix}/example-analysis branch in the left pane, and a single file with changes (scenario_impact_of_consumables_availability.py) showing in the right pane. If this all looks correct click the Push button to push these changes to GitHub.

Once all changes on our local branch are pushed to GitHub we are ready to submit a job to Azure Batch using the tlo batch-submit command. Do this now.

Step 3: check job status

When running scenarios on Azure we can see a list of currently running (active) jobs by running

tlo batch-list --active

To show a list of recently completed jobs instead run

tlo batch-list --completed

For a job identifier {job-id} (which has the general format {stem_of_scenario_filename}-{timestamp}) we can get more details about the progress of the job's tasks by running

tlo batch-job --tasks {job-id}

Step 4: download results

Once a job has completed we can download the scenario outputs by running

tlo batch-download {job-id}

where {job-id} is the relevant identifier for the job. By default this command will look for jobs run under the username specified in your tlo.conf configuration file. To download the results from a job run by a different user we can instead run

tlo batch-download --username {username} {job-id}

where {username} is the relevant user name (typically an email) and {job-id} is the job identifier as previously.

Exercise: download results from a previous run

As a job submitted with the above scenario file will take a while to complete, for now we can use the results of a previous completed run by Tim (tbh03@ic.ac.uk) with job identifier scenario_impact_of_consumables_availability-2023-01-12T131247Z. Retrieve the results of this job by running

tlo batch-download --username tbh03@ic.ac.uk scenario_impact_of_consumables_availability-2023-01-12T131247Z

in the terminal in PyCharm.

Step 5: write an analysis script

Once we have the results from a scenario run we need to write code to extract the subset of the logged output we wish to use for our analysis, compute the summary statistics of interest and plot or otherwise visualise them. Typically in TLOmodel we place this analysis code in a separate script in the same directory as the Python file defining the scenario which generates the results to be analyzed. In the current example, the relevant script is src/scripts/analysis_example/analysis_impact_of_consumables_availability.py.

The script extracts and plots key summary statistics from the scenario results. Below we describe some of the key parts of the script.

import argparse
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from tlo.analysis.utils import (
    extract_params,
    extract_results,
    get_scenario_info,
    get_scenario_outputs,
    load_pickled_dataframes,
    make_age_grp_lookup,
    make_age_grp_types,
    summarize,
)

We first import the objects we will need. We will use the argparse module from the Python standard library to make a simple command line interface. Matplotlib, NumPy and Pandas are all third-party libraries that are key parts of the scientific Python ecosystem, providing plotting, numerical array and dataframe functionality respectively. Within the tlo package, the tlo.analysis.utils module defines various useful helper functions for use in analysis scripts; we import them here and will explain each as we use it below.

def extract_total_deaths(results_folder):

    def extract_deaths_total(df: pd.DataFrame) -> pd.Series:
        return pd.Series({"Total": len(df)})

    return extract_results(
        results_folder,
        module="tlo.methods.demography",
        key="death",
        custom_generate_series=extract_deaths_total,
        do_scaling=True
    )

We first define a function to get coarse total death summary statistics (accumulated across the whole population and simulation time). The key function we use here is extract_results, which parses the log output for a specific module (here tlo.methods.demography), extracts the values associated with a particular key (here "death") and applies a user-provided function to generate the summary statistics of interest from the parsed log output dataframe (here the extract_deaths_total function). The boolean do_scaling argument, when True, scales the values associated with each simulation sample (run) by a per-run scale factor recorded in the log which adjusts for the relative size of the initial simulated population.
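As a rough illustration of the effect of this scaling (with invented numbers, not values taken from any real run):

```python
# Invented, illustrative numbers: the scale factor maps counts from the
# simulated population up to the size of the real population it represents.
simulated_initial_population = 10_000
real_population = 14_000_000  # e.g. a population of around 14 million

scale_factor = real_population / simulated_initial_population  # 1400.0

raw_total_deaths = 1_234  # deaths counted in one simulation run
scaled_total_deaths = raw_total_deaths * scale_factor

print(scaled_total_deaths)  # 1727600.0
```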

def plot_summarized_total_deaths(summarized_total_deaths, param_strings):
    fig, ax = plt.subplots()
    number_of_draws = len(param_strings)
    statistic_values = {
        s: np.array(
            [summarized_total_deaths[(d, s)].values[0] for d in range(number_of_draws)]
        )
        for s in ["mean", "lower", "upper"]
    }
    ax.bar(
        param_strings,
        statistic_values["mean"],
        yerr=[
            statistic_values["mean"] - statistic_values["lower"], 
            statistic_values["upper"] - statistic_values["mean"]
        ]
    )
    ax.set_ylabel("Total number of deaths")
    fig.tight_layout()
    return fig, ax

The function here uses the Matplotlib bar function to generate a bar plot summarizing the difference in the total deaths summary statistics across the different scenario draws. The variation across the samples of each draw is shown as error bars.

def compute_difference_in_deaths_across_runs(total_deaths, scenario_info):
    deaths_difference_by_run = [
        total_deaths[0][run_number]["Total"] - total_deaths[1][run_number]["Total"]
        for run_number in range(scenario_info["runs_per_draw"])
    ]
    return np.mean(deaths_difference_by_run)

The function here gets the total deaths associated with the runs for each draw (set of parameters) and computes the mean across the runs using the NumPy mean function.

def extract_deaths_by_age(results_folder):

    def extract_deaths_by_age_group(df: pd.DataFrame) -> pd.Series:
        _, age_group_lookup = make_age_grp_lookup()
        df["Age_Grp"] = df["age"].map(age_group_lookup).astype(make_age_grp_types())
        df = df.rename(columns={"sex": "Sex"})
        return df.groupby(["Age_Grp"])["person_id"].count()

    return extract_results(
        results_folder,
        module="tlo.methods.demography",
        key="death",
        custom_generate_series=extract_deaths_by_age_group,
        do_scaling=True
    )

To give a finer grained view of the difference between the different settings, the function here extracts total deaths for each age group in the population again using the extract_results helper function on the logger outputs with the "death" key from the tlo.methods.demography module, but using a different function to extract the necessary data from the parsed logger output dataframe.

def plot_summarized_deaths_by_age(deaths_summarized_by_age, param_strings):
    fig, ax = plt.subplots()
    for i, param in enumerate(param_strings):
        central_values = deaths_summarized_by_age[(i, "mean")].values
        lower_values = deaths_summarized_by_age[(i, "lower")].values
        upper_values = deaths_summarized_by_age[(i, "upper")].values
        ax.plot(
            deaths_summarized_by_age.index, central_values,
            color=f"C{i}",
            label=param
        )
        ax.fill_between(
            deaths_summarized_by_age.index, lower_values, upper_values,
            alpha=0.5,
            color=f"C{i}",
            label="_"
        )
    ax.set(xlabel="Age-Group", ylabel="Total deaths")
    ax.set_xticks(deaths_summarized_by_age.index)
    ax.set_xticklabels(labels=deaths_summarized_by_age.index, rotation=90)
    ax.legend()
    fig.tight_layout()
    return fig, ax

This function uses the Matplotlib line plot function plot and filled area plot function fill_between to show the variation in the total deaths across age groups for the two different scenario settings, with the mean across the runs shown as a solid line and the variation across runs represented by the filled area.

if __name__ == "__main__":

This conditional if statement is a common pattern in Python scripts: it allows the file to also be imported as a module, without executing the script body, so that any functions or other objects defined in it can be reused. The code inside this block will only be executed when the file is run as a script.
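A minimal standalone illustration of this idiom:

```python
# Functions defined at module level can be imported from other files
# without triggering the script body below.
def compute_summary(values):
    return sum(values) / len(values)

if __name__ == "__main__":
    # Only runs when executed as a script (e.g. `python this_file.py`),
    # not when the file is imported with `import this_file`.
    print(compute_summary([1, 2, 3]))  # 2.0
```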

    # Parse command line arguments
    parser = argparse.ArgumentParser(
        "Analyse scenario results for testing impact of consumables availability"
    )
    parser.add_argument(
        "--scenario-outputs-folder",
        type=Path,
        required=True,
        help="Path to folder containing scenario outputs",
    )
    parser.add_argument(
        "--show-figures",
        action="store_true",
        help="Whether to interactively show generated Matplotlib figures",
    )
    parser.add_argument(
        "--save-figures",
        action="store_true",
        help="Whether to save generated Matplotlib figures to results folder",
    )
    args = parser.parse_args()

Here we use the argparse module from the Python standard library to define some command line arguments that can be used to control the script, and parse the arguments passed when invoking the script.

    # Find results_folder associated with a given batch_file and get most recent
    results_folder = get_scenario_outputs(
        "scenario_impact_of_consumables_availability.py", args.scenario_outputs_folder
    )[-1]

The get_scenario_outputs helper function allows getting all the scenario results associated with a given scenario script, ordered by the timestamp of when the scenario was submitted. Here we get the last set of results (that is the most recent).

    # Load log (useful for checking what can be extracted)
    log = load_pickled_dataframes(results_folder)

    # Get basic information about the results
    scenario_info = get_scenario_info(results_folder)

    # Get the parameters that have varied over the set of simulations
    params = extract_params(results_folder)

We use three helper functions from tlo.analysis.utils to get various information about the scenario run:

  • load_pickled_dataframes is used to (lazily) load the dataframes corresponding to the full log output. Having access to the full logging data can be useful when initially exploring the scenario outputs in the first stages of analysis.
  • get_scenario_info is used to get basic information about the scenario - specifically the number of draws number_of_draws and number of runs per draw runs_per_draw as a dictionary.
  • extract_params is used to get the overridden parameters for each draw of the scenario as a dataframe.

    # Create a list of strings summarizing the parameter values in the different draws
    param_strings = [f"{row.module_param}={row.value}" for _, row in params.iterrows()]

To give meaningful labels to the plots, we construct strings of the form {module_name}:{parameter_name}={value} for each value of the single parameter overridden here (HealthSystem:cons_availability) across the draws of the scenario.
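The comprehension's behaviour can be seen on a small stand-in dataframe. The layout below (a module_param column and a value column, one row per draw) is an assumption made for illustration, not necessarily the exact dataframe extract_params returns:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe returned by extract_params,
# with one row per overridden parameter value per draw
params = pd.DataFrame(
    {
        "module_param": [
            "HealthSystem:cons_availability",
            "HealthSystem:cons_availability",
        ],
        "value": ["default", "all"],
    }
)

# Same comprehension as in the analysis script
param_strings = [f"{row.module_param}={row.value}" for _, row in params.iterrows()]
```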

    # We first look at total deaths in the scenario runs
    total_deaths = extract_total_deaths(results_folder, True)

    # Compute and print the difference between the deaths across the scenario draws
    mean_deaths_difference_by_run = compute_difference_in_deaths_across_runs(
        total_deaths, scenario_info
    )
    print(f"Mean difference in total deaths = {mean_deaths_difference_by_run:.3g}")

Using the helper functions defined previously, we extract the total death statistics across the scenario draws and runs, then compute and print the mean difference across runs.
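The underlying computation can be illustrated with made-up numbers. The array layout below is a hypothetical simplification, not the actual implementation of compute_difference_in_deaths_across_runs:

```python
import numpy as np

# Made-up total deaths: one row per draw, one column per run
total_deaths_by_draw_and_run = np.array([
    [1000., 1010., 990.],  # draw 0, e.g. cons_availability='default'
    [950., 945., 960.],    # draw 1, e.g. cons_availability='all'
])

# Difference between the two draws within each run, then the mean across runs
differences_by_run = (
    total_deaths_by_draw_and_run[0] - total_deaths_by_draw_and_run[1]
)
mean_difference = differences_by_run.mean()
```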

    # Plot the total deaths across the two scenario draws as a bar plot with error bars
    fig_1, ax_1 = plot_summarized_total_deaths(summarize(total_deaths), param_strings)

We then plot the total deaths for the two scenario draws as a bar plot with error bars to represent the variability across runs.

    # Now we look at things in more detail with an age breakdown
    deaths_by_age = extract_deaths_by_age(results_folder)

    # Plot the deaths by age across the two scenario draws as line plot with error bars
    fig_2, ax_2 = plot_summarized_deaths_by_age(summarize(deaths_by_age), param_strings)

Using the previously defined helper functions here we extract and plot a breakdown of total deaths per age group for the two scenario draws.

    if args.show_figures:
        plt.show()

    if args.save_figures:
        fig_1.savefig(results_folder / "total_deaths_across_scenario_draws.pdf")
        fig_2.savefig(results_folder / "death_by_age_across_scenario_draws.pdf")

Depending on the command line arguments passed we potentially show the interactive figure windows and/or save the figures to the results folder.

Step 6: run the analysis

To run the analysis script on the scenario results we downloaded earlier we can run

python src/scripts/analysis_example/analysis_impact_of_consumables_availability.py --scenario-outputs-folder outputs/tbh03@ic.ac.uk --show-figures

in the terminal in PyCharm. This will show the figures as interactive windows. To instead save the figures to files in the results folder we can run

python src/scripts/analysis_example/analysis_impact_of_consumables_availability.py --scenario-outputs-folder outputs/tbh03@ic.ac.uk --save-figures

Useful utility functions

The following functions in the tlo.analysis.utils module can be useful for helping with analyzing the output from a scenario run:

  • extract_params - gets the overridden parameters from scenario runs in the form of a dataframe summarizing the parameters that change across the draws.
  • extract_results - unpacks the results of the successful runs in the form of a dataframe with the column multi-index for the draw/run.
  • summarize - computes summary statistics: the mean value and 95% interval across the runs for each draw.
  • compute_mean_across_runs - computes the mean across scenario runs of a dictionary of counters keyed by draw and run.
  • plot_stacked_bar_chart - plots a stacked bar chart using count data binned over two levels of grouping.
  • get_grid - creates the arrays needed to plot a heatmap, returning the grid as a dictionary.
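As an illustration of the kind of summary that summarize computes, the mean and a 95% interval across runs for each draw can be obtained with pandas. This is a sketch on made-up data with a (draw, run) column multi-index mimicking the layout returned by extract_results; it is not the actual implementation in tlo.analysis.utils:

```python
import pandas as pd

# Made-up results: two draws with three runs each
columns = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["draw", "run"])
results = pd.DataFrame([[10., 12., 11., 8., 7., 9.]], columns=columns)

# Mean and 95% interval across runs, separately for each draw
summary = pd.concat(
    {
        "mean": results.T.groupby(level="draw").mean().T,
        "lower": results.T.groupby(level="draw").quantile(0.025).T,
        "upper": results.T.groupby(level="draw").quantile(0.975).T,
    },
    axis=1,
)
```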

Extensions

The above describes a very basic workflow. Here are some other things that you can do.

Create a grid of parameters to explore

The make_cartesian_parameter_grid helper function in tlo.scenario can be used to generate a parameter grid corresponding to the Cartesian product of a set of values for each of a collection of module specific parameters. This can be useful to create a scenario which uses draws corresponding to all combinations of a (small) collection of parameter values.

The function accepts a single dictionary argument, which should map from string module names to dictionaries mapping from string parameter names (for that module) to iterables (for example lists, sets, arrays) of parameter values to iterate across. It returns a list of nested dictionaries corresponding to the module specific parameter sets formed as the Cartesian product of all the sets of values specified for each parameter.
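The behaviour can be sketched with itertools.product. The function below is a simplified re-implementation for illustration only, not the actual TLOmodel code:

```python
from itertools import product


def cartesian_parameter_grid(module_parameter_values):
    """Sketch of a Cartesian parameter grid over module-specific parameters."""
    # Flatten the nested dict into ((module, parameter), values) pairs
    keys_and_values = [
        ((module, parameter), values)
        for module, parameter_values in module_parameter_values.items()
        for parameter, values in parameter_values.items()
    ]
    keys = [key for key, _ in keys_and_values]
    value_iterables = [values for _, values in keys_and_values]
    grid = []
    # One parameter set per element of the Cartesian product of all value sets
    for combination in product(*value_iterables):
        parameter_set = {}
        for (module, parameter), value in zip(keys, combination):
            parameter_set.setdefault(module, {})[parameter] = value
        grid.append(parameter_set)
    return grid


grid = cartesian_parameter_grid(
    {
        "HealthSystem": {
            "cons_availability": ["default", "all"],
            "beds_availability": ["default", "all"],
        }
    }
)
```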

For example, if we wished to extend our previous analysis to also consider the impact of bed availability and its interaction with consumable availability, we might wish to run a scenario studying four different settings

  • A setting where both consumables and beds are sometimes not available.
  • A setting where consumables are always available but beds are sometimes not.
  • A setting where beds are always available but consumables are sometimes not.
  • A setting where both consumables and beds are always available.

We could use the make_cartesian_parameter_grid in a scenario to help generate the 'grid' of four different parameter sets corresponding to these settings as follows

from tlo import Date, logging
from tlo.methods.fullmodel import fullmodel
from tlo.scenario import BaseScenario, make_cartesian_parameter_grid


class ImpactOfConsumablesAndBedAvailability(BaseScenario):

    def __init__(self):
        self._parameter_grid = make_cartesian_parameter_grid(
            {
                "HealthSystem": {
                    "cons_availability": ["default", "all"], 
                    "beds_availability": ["default", "all"],
                }
            }
        )
        super().__init__(
            seed=0,
            start_date=Date(2010, 1, 1),
            end_date=Date(2015, 1, 1),
            initial_population_size=10_000,
            number_of_draws=len(self._parameter_grid),
            runs_per_draw=2
        )

    def log_configuration(self):
        return {
            'filename': 'impact_of_consumables_and_bed_availability',
            'directory': './outputs',
            'custom_levels': {
                '*': logging.WARNING,
                'tlo.methods.demography': logging.INFO,
                'tlo.methods.healthburden': logging.INFO,
            }
        }

    def modules(self):
        return fullmodel(resourcefilepath=self.resources)

    def draw_parameters(self, draw_number, rng):
        return self._parameter_grid[draw_number]

Nicer plots

A breadth of information, including tutorials at beginner, intermediate and advanced levels, can be found at matplotlib.org.
