Computational research guide #888

Merged · 24 commits · Jul 28, 2020
2 changes: 1 addition & 1 deletion docs/sections/cli_features.md
@@ -1,4 +1,4 @@
-# CLI feautures
+# CLI features

## New workflow initialization

Expand Down
6 changes: 6 additions & 0 deletions docs/sections/cn_workflows.md
@@ -45,6 +45,7 @@ attribute.
| `secrets` | **optional** A list of strings representing the names of secret variables to define<br>in the environment of the container for the step. For example,<br>`secrets: ["SECRET1", "SECRET2"]`. |
| `skip_pull` | **optional** A boolean value that determines whether to pull the image before<br>executing the step. By default this is `false`. If the given container<br>image already exists (e.g. because it was built by a previous step in<br>the same workflow), assigning `true` skips downloading the image from<br>the registry. |
| `dir` | **optional** A string representing an absolute path inside the container to use as the<br>working directory. By default, this is `/workspace`. |
| `options` | **optional** Container configuration options. For instance:<br>`options: {ports: {8888:8888}, interactive: True, tty: True}`. Currently only<br>supported for the Docker runtime. See the parameters of `client.containers.run()`<br>in the [Docker Python SDK](https://docker-py.readthedocs.io/en/stable/containers.html?highlight=inspect) for the full list of options. |

### Referencing images in a step

@@ -400,6 +401,11 @@ question (see [here][engconf] for more).

[engconf]: ./cli_features#customizing-container-engine-behavior


Alternatively, to restrict a configuration to a specific step in a workflow, set the desired parameters in the step's `options` attribute, as in the sketch below.
**Note**: this is currently only supported for the Docker runtime.
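For example, a step that publishes a port and runs interactively might look like the following sketch (the step id, image, and command are illustrative placeholders, not from the original docs):

```yaml
steps:
  - id: "server"
    uses: "docker://python:3.8"
    runs: ["python", "-m", "http.server", "8888"]
    options:
      ports:
        8888/tcp: 8888
      interactive: True
      tty: True
```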


## Resource Managers

Popper can execute steps in a workflow through other resource managers
368 changes: 368 additions & 0 deletions docs/sections/guides.md
@@ -147,3 +147,371 @@ run.

[search]: https://hub.docker.com
[create]: https://docs.docker.com/get-started/part2/



## Computational research with Python

This guide explains how to use Popper to develop and run reproducible workflows
for computational research in fields such as physics, machine learning, or
bioinformatics.

### Getting started

[TODO: introduction to reproducibility and major concepts in Popper?]

#### Pre-requisites

Basic knowledge of Git, the command line, and Python is assumed. It is also
recommended to read through the rest of the
[documentation](https://popper.readthedocs.io/en/latest/sections/getting_started.html)
for Popper.

To adapt the recommendations of this guide to your own workflow, fork this
[template repository]() or use the [Cookiecutter template](). (TODO: fix links)

#### Case study

Throughout this guide, the
[Flu Shot Learning](https://www.drivendata.org/competitions/66/flu-shot-learning/)
research competition on DrivenData is used as an example project for developing the workflow.
To follow along, see the final [repository]() for this workflow.
This example comes from machine learning, but knowledge of the field is not essential to this guide.

Initial project structure:
```
├── environment.yml        <- The file defining the conda Python environment.
├── LICENSE
├── README.md              <- The top-level README.
├── data
│   ├── processed          <- The final, canonical data sets for modeling.
│   └── raw                <- The original, immutable data dump.
├── results
│   ├── models             <- Serialized models, predictions, model summaries.
│   └── figures            <- Graphics created during analysis.
├── paper                  <- Generated analysis as PDF, LaTeX.
└── src                    <- Source code for this project.
    ├── notebooks          <- Jupyter notebooks.
    ├── get_data.sh
    ├── models.py
    ├── predict.py
    ├── evaluate_model.py
    └── __init__.py        <- Makes this a Python package.
```

### Downloading data

A computational workflow should automate the acquisition of data to ensure
that the correct version of the data is used.
In our example, this can be done with a shell script (`src/get_data.sh`):

```sh
#!/bin/sh
cd data/raw

wget "https://s3.amazonaws.com/drivendata-prod/data/66/public/test_set_features.csv"
wget "https://s3.amazonaws.com/drivendata-prod/data/66/public/training_set_labels.csv"
wget "https://s3.amazonaws.com/drivendata-prod/data/66/public/training_set_features.csv"

echo "Files downloaded:"
ls
```
Now, wrap this step using a Popper workflow. In `wf.yml`:
```yaml
steps:
- id: "dataset"
uses: "docker://jacobcarlborg/docker-alpine-wget"
runs: ["sh"]
args: ["src/get_data.sh"]
```
Remarks:
- it is important to ensure that the Docker image contains the necessary utilities.
For instance, a default Alpine image does not include `wget`, which is why the step above uses an Alpine image that bundles it.
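If you would rather build on a plain Alpine base image, you could install the utility yourself in a custom `Dockerfile`. A minimal sketch (the base image tag is an assumption):

```dockerfile
FROM alpine:3.12
# apk is Alpine's package manager; install wget for the download script
RUN apk add --no-cache wget
```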


### Interactive development

Computational research usually has an exploratory phase.
To make it easier to adapt exploratory work to a final workflow, it is recommended
to do both in the same environment.

Computational notebooks are a great tool for exploratory work. This section covers how to
launch a Jupyter notebook using Popper.

Add a new step to the workflow in `wf.yml`:
```yaml
- id: "notebook"
uses: "./"
args: ["sh"]
options:
ports:
8888/tcp: 8888
```

Remarks:
- `uses` is set to `./` (the current directory), as this step uses an image built from the
`Dockerfile` in the local workspace directory.
- `ports` is set to `{8888/tcp: 8888}`, which allows the host machine to connect to the notebook server in the container.

In your local shell, execute the step in interactive mode:
```sh
popper sh -f wf.yml notebook
```
In the Docker container's shell, run:
```sh
jupyter lab --ip 0.0.0.0 --no-browser --allow-root
```
Skip this second command if you only need the shell interface.

Remarks:
- `--ip 0.0.0.0` allows the user to access JupyterLab from outside the container (by default,
Jupyter only allows access from `localhost`).
- `--no-browser` tells Jupyter not to expect to find a browser in the Docker container.
- `--allow-root` allows us to run JupyterLab as the root user (the default user in our Docker
image), which Jupyter does not enable by default.

Copy and paste the generated link in the browser on your host machine to access the JupyterLab
environment.


### Package management

It can be difficult to guess in advance which software libraries will be needed.
Instead of trying to specify them all upfront, we recommend updating the workflow
requirements as you go using one of the package managers available for Python.

#### conda

Conda is recommended for managing packages, due to its superior dependency
management and support for data analysis work.
While executing the `notebook` step interactively, extra packages can be installed as
needed using:
```bash
conda install PACKAGE [PACKAGE ...]
```
Then save the resulting requirements using:
```bash
conda env export > environment.yml
```
The next time Popper executes this step, it will rebuild the Docker image with
these new requirements (this is done by copying `environment.yml` into the image in our `Dockerfile`).
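For reference, a minimal sketch of what such a `Dockerfile` might look like (the base image and the exact `conda` invocation are assumptions, since the template repository is not shown here):

```dockerfile
FROM continuumio/miniconda3
WORKDIR /workspace
# Copy the exported requirements and update the conda environment from them
COPY environment.yml .
RUN conda env update -n base -f environment.yml
```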

#### pip

The workflow described for `conda` can easily be adapted to `pip`.

```bash
pip install PACKAGE [PACKAGE ...]
pip freeze > requirements.txt
```
Modify the `RUN` command in the provided `Dockerfile` to:
```dockerfile
RUN pip install -r requirements.txt
```
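Note that the requirements file must be copied into the image before it can be installed. A minimal sketch of the surrounding `Dockerfile` (the base image is an assumption):

```dockerfile
FROM python:3.8
WORKDIR /workspace
# Copy the frozen requirements and install them with pip
COPY requirements.txt .
RUN pip install -r requirements.txt
```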

### Models and visualization

Following the above advice, wrap your code for data processing, modeling, and generating
figures in workflow steps.

In this example, we generate model diagnostic plots and predictions on the
hold-out test set.

Exploratory work yielded the following model (`src/models.py`):
```python
from sklearn import impute, preprocessing, compose, pipeline, linear_model, multioutput

def _get_preprocessor(num_features, cat_features):
    # Scale numerical features and impute missing values with KNN
    num_transformer = pipeline.Pipeline([
        ("scale", preprocessing.StandardScaler()),
        ("impute", impute.KNNImputer(n_neighbors=10)),
    ])

    # Impute missing categorical values with a constant, then one-hot encode
    cat_transformer = pipeline.Pipeline([
        ("impute", impute.SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", preprocessing.OneHotEncoder(drop="first")),
    ])

    preprocessor = compose.ColumnTransformer([
        ("num", num_transformer, num_features),
        ("cat", cat_transformer, cat_features),
    ])
    return preprocessor


def get_lr_model(num_features, cat_features, C=1.0):
    # L1-regularized logistic regression, one classifier per target label
    model = pipeline.Pipeline([
        ("pre", _get_preprocessor(num_features, cat_features)),
        ("model", multioutput.MultiOutputClassifier(
            linear_model.LogisticRegression(penalty="l1", C=C, solver="saga")
        )),
    ])
    return model

```
A second script calls this model to generate predictions:

```python
import os

import pandas as pd

from models import get_lr_model

DATA_PATH = "data/raw"
PRED_PATH = "results/predictions"

if __name__ == "__main__":
    # Load features and labels, dropping the identifier column
    X_train = pd.read_csv(os.path.join(DATA_PATH, "training_set_features.csv")).drop(
        "respondent_id", axis=1
    )
    X_test = pd.read_csv(os.path.join(DATA_PATH, "test_set_features.csv")).drop(
        "respondent_id", axis=1
    )
    y_train = pd.read_csv(os.path.join(DATA_PATH, "training_set_labels.csv")).drop(
        "respondent_id", axis=1
    )
    sub = pd.read_csv(os.path.join(DATA_PATH, "submission_format.csv"))

    num_features = X_train.columns[X_train.dtypes != "object"].values
    cat_features = X_train.columns[X_train.dtypes == "object"].values

    model = get_lr_model(num_features, cat_features, C=1.0)
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_test)

    # predict_proba returns one array per target; keep the positive-class column
    sub["h1n1_vaccine"] = preds[0][:, 1]
    sub["seasonal_vaccine"] = preds[1][:, 1]
    os.makedirs(PRED_PATH, exist_ok=True)
    sub.to_csv(os.path.join(PRED_PATH, "baseline_pred.csv"), index=False)
```

Add this script as a step in the Popper workflow. This step must come after the `dataset`
step:
```yaml
- id: "predict"
uses: "./"
args: ["python", "src/predict.py"]
```
The same Docker image as for the `notebook` step is used.
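While iterating, a single step can also be executed on its own by passing its id (a usage sketch, assuming Popper's single-step invocation):

```sh
popper run -f wf.yml predict
```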


Similarly, the script for generating model plots (`src/evaluate_model.py`) is added to the workflow:
```python
import os

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_validate

from models import get_lr_model

DATA_PATH = "data/raw"
FIG_PATH = "results/figures"

if __name__ == "__main__":
    mpl.rcParams.update({"figure.autolayout": True, "figure.dpi": 150})
    sns.set()

    X_train = pd.read_csv(os.path.join(DATA_PATH, "training_set_features.csv")).drop(
        "respondent_id", axis=1
    )
    y_train = pd.read_csv(os.path.join(DATA_PATH, "training_set_labels.csv")).drop(
        "respondent_id", axis=1
    )

    num_features = X_train.columns[X_train.dtypes != "object"].values
    cat_features = X_train.columns[X_train.dtypes == "object"].values

    # Cross-validate the model over a grid of regularization strengths
    Cs = np.logspace(-2, 1, num=10, base=10)
    means = []
    stds = []
    best_auc = 0
    for C in Cs:
        cv = cross_validate(
            estimator=get_lr_model(num_features, cat_features, C),
            X=X_train,
            y=y_train,
            cv=5,
            n_jobs=-1,
            scoring="roc_auc",
        )
        means.append(np.mean(cv["test_score"]))
        stds.append(np.std(cv["test_score"]))
        if means[-1] > best_auc:
            best_C = C
            best_auc = means[-1]

    # Plot mean AUC against C, marking the best value found
    fig, ax = plt.subplots()
    ax.plot(Cs, means)
    ax.vlines(best_C, ymin=0.82, ymax=0.86, colors="r", linestyle="dotted")
    ax.annotate("$C = 0.464$ \n ROC AUC = 0.843", xy=(0.5, 0.835))
    ax.set_xscale("log")
    ax.set_xlabel("$C$")
    ax.grid(axis="x")
    ax.legend(["AUC", "best $C$"])
    ax.set_title("AUC for different values of $C$")
    os.makedirs(FIG_PATH, exist_ok=True)
    fig.savefig(os.path.join(FIG_PATH, "lr_reg_performance.png"))
```

With the following step:

```yaml
- id: "figures"
uses: "./"
args: ["python", "src/evaluate_model.py"]
```

### Building a paper using LaTeX

It is easy to wrap the generation of the final paper in a Popper workflow.
This is useful to ensure that the paper is always built with the most up-to-date data and figures.

```yaml
- id: "paper"
uses: "docker://blang/latex:ctanbasic"
args: ["pdflatex", "paper.tex"]
dir: "/workspace/paper"
```

Remarks:
- This step uses a basic LaTeX installation. For more sophisticated needs,
use a full [TeX Live image](https://hub.docker.com/r/blang/latex/tags).
- `dir` is set to `/workspace/paper` so that Popper looks for, and outputs, files in the `paper` folder.


### Conclusion

This is the final workflow:
```yaml
steps:
- id: "dataset"
uses: "docker://jacobcarlborg/docker-alpine-wget"
runs: ["sh"]
args: ["src/get_data.sh"]

- id: "notebook"
uses: "./"
args: ["sh"]
options:
ports:
8888/tcp: 8888

- id: "predict"
uses: "./"
args: ["python", "src/predict.py"]

- id: "figures"
uses: "./"
args: ["python", "src/evaluate_model.py"]

- id: "paper"
uses: "docker://blang/latex:ctanbasic"
args: ["pdflatex", "paper.tex"]
dir: "/workspace/paper"
```
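
With the workflow complete, the entire analysis, from data download to the final PDF, can be regenerated with a single command run from the repository root:

```sh
popper run -f wf.yml
```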