# Using DVC experiments to build a predictor of diabetes progression

This notebook walks through a toy example of using DVC experiments during model development to try out different parameter values in a modeling pipeline.

### Data

The toy dataset used here is included in `scikit-learn`. It records ten baseline variables for 442 diabetes patients, along with a measure of disease progression one year after baseline.

In [2]:
from sklearn.datasets import load_diabetes

In [3]:
data = load_diabetes()
print(data.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature va

### Model

The model will try to predict the disease progression from the provided variables. A `scikit-learn` [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) model will be used, and model performance will be evaluated based on the R-squared (R2) value. An Elastic Net model is a linear regression model that balances two different types of regularization: Lasso (which penalizes the L1-norm or absolute values of the coefficients) and Ridge (which penalizes the L2-norm or squares of the coefficients).
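
To make the balance between the two penalties concrete, here is a minimal sketch of the regularization term as it appears in the objective documented for `scikit-learn`'s `ElasticNet`. The function name `elastic_net_penalty` and the coefficient values are hypothetical and chosen only for illustration; the squared-error part of the loss is left out.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Regularization term added to the squared-error loss (illustrative helper)."""
    l1_norm = sum(abs(w) for w in coefs)    # Lasso part: sum of absolute values
    l2_norm_sq = sum(w * w for w in coefs)  # Ridge part: sum of squares
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1 - l1_ratio) * l2_norm_sq


coefs = [3.0, -4.0]  # hypothetical coefficients, not fitted values

mixed = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5)    # both penalties
lasso_only = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=1.0)
ridge_only = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.0)
no_penalty = elastic_net_penalty(coefs, alpha=0.0, l1_ratio=0.5)

print(mixed, lasso_only, ridge_only, no_penalty)
```

With `alpha=0` the penalty vanishes regardless of `l1_ratio`, so only the squared-error loss remains.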

Let's take a look at a simple model training script:

In [6]:
!cat train.py

import joblib
import yaml
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split


# Load params.
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Load data.
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# Fit model.
regr = ElasticNet(alpha=params["alpha"], l1_ratio=params["l1_ratio"])
regr.fit(X_train, y_train)

# Evaluate model.
metrics = {}
r2 = regr.score(X_test, y_test)
metrics["r2"] = float(r2)  # Plain Python float so the YAML output stays clean.
with open("metrics.yaml", "w") as f:
    yaml.dump(metrics, f)

# Save model.
joblib.dump(regr, "model.joblib")


Note that the script reads and writes a few other files. `params.yaml` contains the model parameters:

In [7]:
!cat params.yaml

alpha: 1
l1_ratio: 0.5


In addition, the `train.py` script writes out the R2 score to `metrics.yaml` and saves the model to `model.joblib`.
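
For reference, the R2 score that `regr.score` reports is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it in plain Python. The helper name `r_squared` and the sample values are hypothetical and only illustrate the formula.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # spread around the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # hypothetical targets

perfect = r_squared(y_true, y_true)                     # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])     # always predict the mean

print(perfect, mean_only)
```

A score of 1 means perfect predictions, while a score near 0 means the model does little better than always predicting the mean of the targets.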

### DVC Pipeline

All of this can be tracked in `dvc.yaml` to establish a DVC pipeline:

In [9]:
!cat dvc.yaml

stages:
  train:
    cmd: python train.py
    params:
    - alpha
    - l1_ratio
    outs:
    - model.joblib
    metrics:
    - metrics.yaml


Let's run the pipeline as is for an initial experiment:

In [10]:
!dvc exp run

Running stage 'train':
> python train.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add params.yaml .gitignore dvc.yaml dvc.lock

Reproduced experiment(s): exp-e1fcc
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp>

Looking at the output, this command ran the training script and reproduced experiment `exp-e1fcc`. Let's view the experiment results:

In [13]:
!dvc exp show --no-pager

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │ 0.008472 │ 1     │ 0.5      │
│ main          │ 04:40 PM │        - │ 1     │ 0.5      │
│ └── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1     │ 0.5      │
└───────────────┴──────────┴──────────┴───────┴──────────┘

The table above shows the experiment results, as well as those at the tip of the current branch and workspace.

### Experimenting with parameter values

Next, let's try different parameter values, since the initial R2 value is very weak.

`alpha` is a constant multiplier of the regularization term in Elastic Net. In other words, a higher `alpha` increases regularization. By setting `alpha=0`, Elastic Net becomes ordinary least squares regression, since the regularization term is set to have no weight. Let's try that:

In [14]:
!dvc exp run --params alpha=0

Running stage 'train':
> python train.py
  regr.fit(X_train, y_train)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock params.yaml

Reproduced experiment(s): exp-3ad02
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp>

NOTE: `scikit-learn` gives a warning when using Elastic Net with `alpha=0`, since the underlying algorithm used may not converge, so it's generally better to use `LinearRegression` if not using any regularization.

Let's compare to the initial experiment with the default parameters:

In [16]:
!dvc exp show --no-pager

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │   0.3594 │ 0     │ 0.5      │
│ main          │ 04:40 PM │        - │ 1     │ 0.5      │
│ ├── exp-3ad02 │ 05:47 PM │   0.3594 │ 0     │ 0.5      │
│ └── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1     │ 0.5      │
└───────────────┴──────────┴──────────┴───────┴──────────┘

The model without any regularization scores much better, which may not be surprising given the simplicity of the dataset.

Let's try smaller nonzero `alpha` values to see whether any amount of regularization improves R2:

In [17]:
%%bash
dvc exp run --params alpha=0.1
dvc exp show --no-pager

Running stage 'train':
> python train.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml params.yaml dvc.lock

Reproduced experiment(s): exp-77494
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp>

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │ 0.096341 │ 0.1   │ 0.5      │
│ main          │ 04:40 PM │        - │ 1     │ 0.5      │
│ ├── exp-77494 │ 05:54 PM │ 0.096341 │ 0.1   │ 0.5      │
│ ├── exp-3ad02 │ 05:47 PM │   0.3594 │ 0     │ 0.5      │
│ └── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1     │ 0.5      │
└───────────────┴──────────┴──────────┴───────┴──────────┘


The R2 score is still much worse than `alpha=0`. Let's try an even smaller `alpha`:

In [18]:
%%bash
dvc exp run --params alpha=0.01
dvc exp show --no-pager

Running stage 'train':
> python train.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock params.yaml

Reproduced experiment(s): exp-f3e7f
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp>

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │  0.32473 │ 0.01  │ 0.5      │
│ main          │ 04:40 PM │        - │ 1     │ 0.5      │
│ ├── exp-f3e7f │ 05:55 PM │  0.32473 │ 0.01  │ 0.5      │
│ ├── exp-77494 │ 05:54 PM │ 0.096341 │ 0.1   │ 0.5      │
│ ├── exp-3ad02 │ 05:47 PM │   0.3594 │ 0     │ 0.5      │
│ └── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1     │ 0.5      │
└───────────────┴──────────┴──────────┴───────┴──────────┘


That experiment was much closer to `alpha=0`, so let's try even smaller:

In [19]:
%%bash
dvc exp run --params alpha=0.001
dvc exp show --no-pager

Running stage 'train':
> python train.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock params.yaml dvc.yaml

Reproduced experiment(s): exp-20200
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp>

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │   0.3751 │ 0.001 │ 0.5      │
│ main          │ 04:40 PM │        - │ 1     │ 0.5      │
│ ├── exp-20200 │ 05:55 PM │   0.3751 │ 0.001 │ 0.5      │
│ ├── exp-f3e7f │ 05:55 PM │  0.32473 │ 0.01  │ 0.5      │
│ ├── exp-77494 │ 05:54 PM │ 0.096341 │ 0.1   │ 0.5      │
│ ├── exp-3ad02 │ 05:47 PM │   0.3594 │ 0     │ 0.5      │
│ └── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1     │ 0.5      │
└───────────────┴──────────┴──────────┴───────┴──────────┘


That experiment actually beat `alpha=0`, so let's see if an even smaller `alpha` is even better:

In [20]:
%%bash
dvc exp run --params alpha=0.0001
dvc exp show --no-pager

Running stage 'train':
> python train.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock params.yaml

Reproduced experiment(s): exp-86a4c
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

	dvc exp branch <exp>

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha  ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │  0.35707 │ 0.0001 │ 0.5      │
│ main          │ 04:40 PM │        - │ 1      │ 0.5      │
│ ├── exp-86a4c │ 05:56 PM │  0.35707 │ 0.0001 │ 0.5      │
│ ├── exp-20200 │ 05:55 PM │   0.3751 │ 0.001  │ 0.5      │
│ ├── exp-f3e7f │ 05:55 PM │  0.32473 │ 0.01   │ 0.5      │
│ ├── exp-77494 │ 05:54 PM │ 0.096341 │ 0.1    │ 0.5      │
│ ├── exp-3ad02 │ 05:47 PM │   0.3594 │ 0      │ 0.5      │
│ └── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1      │ 0.5      │
└───────────────┴──────────┴

The R2 score is going back down, so it looks like `alpha=0.001` was best. Let's revert to that experiment by passing its experiment ID, shown in the left column of the table, to `dvc exp apply`:

In [22]:
!dvc exp apply -f exp-20200

Changes for experiment 'exp-20200' have been applied to your current workspace.

Let's check what that did by looking at the parameters and metrics in the workspace:

In [23]:
%%bash
cat params.yaml
cat metrics.yaml

alpha: 0.001
l1_ratio: 0.5
r2: 0.3751029973603025


Everything in the workspace now matches the state of the applied experiment.
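
As a sanity check on that choice, the comparison can also be done in a few lines of Python rather than by eye. This is only a sketch: the records below are copied by hand from the `dvc exp show` tables above, and the variable names are hypothetical.

```python
# (experiment name, alpha, r2) copied from the experiment tables above.
experiments = [
    ("exp-e1fcc", 1, 0.008472),
    ("exp-3ad02", 0, 0.3594),
    ("exp-77494", 0.1, 0.096341),
    ("exp-f3e7f", 0.01, 0.32473),
    ("exp-20200", 0.001, 0.3751),
    ("exp-86a4c", 0.0001, 0.35707),
]

# Pick the record with the highest R2 score.
best_name, best_alpha, best_r2 = max(experiments, key=lambda record: record[2])

# Print all experiments from best to worst.
for name, alpha, r2 in sorted(experiments, key=lambda record: record[2], reverse=True):
    print(f"{name}  alpha={alpha}  r2={r2}")
```

The top-ranked record is the experiment that was applied to the workspace.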

### Experimenting with multiple parameters

The `l1_ratio` is a mixing parameter, which controls whether to weight the L1 regularization term more or less than the L2 term. `l1_ratio=1` is equivalent to Lasso (L1) regression, and `l1_ratio=0` is equivalent to Ridge (L2) regression.

Let's queue up experiments for multiple `l1_ratio` values at once:

In [24]:
%%bash
dvc exp run --params l1_ratio=0 --queue
dvc exp run --params l1_ratio=0.2 --queue
dvc exp run --params l1_ratio=0.4 --queue
dvc exp run --params l1_ratio=0.6 --queue
dvc exp run --params l1_ratio=0.8 --queue
dvc exp run --params l1_ratio=1 --queue

Queued experiment '3a2c083' for future execution.
Queued experiment '5536299' for future execution.
Queued experiment '9b487f6' for future execution.
Queued experiment 'a68621b' for future execution.
Queued experiment '9015752' for future execution.
Queued experiment '80e504c' for future execution.
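
The six queue commands follow a regular pattern, so they could also be generated in a loop. The sketch below only builds and prints the command lines, using the same `--params` and `--queue` flags as the cell above; the variable names are hypothetical.

```python
import shlex

# The l1_ratio values queued above.
l1_ratios = [0, 0.2, 0.4, 0.6, 0.8, 1]

# One argument list per experiment to queue.
commands = [
    ["dvc", "exp", "run", "--params", f"l1_ratio={value}", "--queue"]
    for value in l1_ratios
]

for command in commands:
    print(shlex.join(command))
```

Each argument list could be handed to `subprocess.run(command, check=True)` to queue the experiment from Python instead of from the shell.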


All of those experiments have been saved for future execution. Let's see what that looks like:

In [25]:
!dvc exp show --no-pager

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha  ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │   0.3751 │ 0.001  │ 0.5      │
│ main          │ 04:40 PM │        - │ 1      │ 0.5      │
│ ├── exp-86a4c │ 05:56 PM │  0.35707 │ 0.0001 │ 0.5      │
│ ├── exp-20200 │ 05:55 PM │   0.3751 │ 0.001  │ 0.5      │
│ ├── exp-f3e7f │ 05:55 PM │  0.32473 │ 0.01   │ 0.5      │
│ ├── exp-77494 │ 05:54 PM │ 0.096341 │ 0.1    │ 0.5      │
│ ├── exp-3ad02 │ 05:47 PM │   0.3594 │ 0      │ 0.5      │
│ ├── exp-e1fcc │ 05:32 PM │ 0.008472 │ 1      │ 0.5      │
│ ├── *80e504c  │ 06:52 PM │        - │ 0.001  │ 1        │
│ ├── *9015752  │ 06:52 PM │        - │ 0.001  │ 0.8      │
│ ├── *a68621b  │ 06:52 PM │        - │ 0.001  │ 0.6      │
│ ├── *9b487f6  │ 0

Now let's run all of the queued experiments, using 4 parallel jobs to speed up execution:

In [28]:
!dvc exp run --run-all -j 4

Running stage 'train':
> python train.py
Running stage 'train':
> python train.py
Running stage 'train':
> python train.py
Running stage 'train':
> python train.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Running stage 'train':
> python train.py
Running stage 'train':
> python train.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
  model = cd_fast.enet_coordinate_descent(
Generatin

NOTE: When using `-j` to run multiple experiment jobs in parallel, the order of experiments may change from the queue order since there is no guarantee which experiments will complete first.
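
The same effect shows up in any parallel execution. The sketch below is only an analogy using Python's standard `concurrent.futures` module, not a description of how DVC schedules jobs internally; `fake_experiment` and its sleep times are hypothetical stand-ins for training runs of different lengths.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_experiment(l1_ratio, seconds):
    """Stand-in for a training run: sleep, then report which parameter it used."""
    time.sleep(seconds)
    return l1_ratio


# (l1_ratio, run time in seconds); the run times are made up for illustration.
jobs = [(0, 0.03), (0.2, 0.01), (0.4, 0.02)]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_experiment, ratio, seconds) for ratio, seconds in jobs]
    # as_completed yields futures as they finish, not in submission order.
    completion_order = [future.result() for future in as_completed(futures)]

print(completion_order)          # depends on which job finished first
print(sorted(completion_order))  # sorting by parameter gives a stable order
```

Sorting or filtering by parameter value, rather than relying on row order, keeps comparisons stable across runs.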

In [29]:
!dvc exp show --no-pager

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃       r2 ┃ alpha  ┃ l1_ratio ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ workspace     │ -        │   0.3751 │ 0.001  │ 0.5      │
│ main          │ 04:40 PM │        - │ 1      │ 0.5      │
│ ├── exp-60146 │ 06:57 PM │  0.37954 │ 0.001  │ 0.2      │
│ ├── exp-a70e0 │ 06:57 PM │  0.38037 │ 0.001  │ 0        │
│ ├── exp-a54b4 │ 06:57 PM │  0.37243 │ 0.001  │ 0.6      │
│ ├── exp-b23ef │ 06:57 PM │  0.37711 │ 0.001  │ 0.4      │
│ ├── exp-32e05 │ 06:57 PM │  0.36461 │ 0.001  │ 0.8      │
│ ├── exp-84d20 │ 06:57 PM │  0.35875 │ 0.001  │ 1        │
│ ├── exp-44136 │ 06:57 PM │        - │ 0.001  │ 0.5      │
│ ├── exp-86a4c │ 05:56 PM │  0.35707 │ 0.0001 │ 0.5      │
│ ├── exp-20200 │ 05:55 PM │   0.3751 │ 0.001  │ 0.5      │
│ ├── exp-f3e7f │ 0

The best experiment is `l1_ratio=0`, which is pure Ridge regression. Let's apply that experiment to the workspace and check the metrics:

In [44]:
%%bash
dvc exp apply -f exp-a70e0
dvc metrics show

Changes for experiment 'exp-a70e0' have been applied to your current workspace.
Path          r2
metrics.yaml  0.38037


Once you have applied the experiment you want to keep as part of your pipeline, commit the changes to Git, and the result will be preserved in both Git and DVC.