# Example of versioning ML experiments using DVC

This notebook aims to be a guideline for versioning your ML projects using DVC, from a Jupyter notebook.

This notebook allows you to experiment as much as you like, and when you are in a state that you would like to preserve for future reference as a git commit, use the DVC cells to version all your relevant files. 

The cells marked with a green markdown box are responsible for creating a snapshot of your raw data, processed data, and trained models.

This snapshot is implemented as md5 hashes of the respective files saved as text in the `.dvc` files. The hashes in the .dvc files will be part of the git commit.

## Imports and global declarations

In [1]:
from sklearn import datasets
import sklearn
from sklearn import preprocessing
import joblib
from sklearn import metrics
from sklearn import model_selection
import numpy as np
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
import json
import os

<div class="alert alert-block alert-success">
<h2>Download and version raw data</h2>
</div>

In [6]:
raw_data = datasets.fetch_california_housing(data_home="../data/raw")
# Save the raw input data for reproducibility
!dvc commit -f ../data/raw.dvc

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to ../data/raw


[0m                                                                            

## Data preprocessing

In [7]:
def to_dataframe(X, y):
    return pd.concat([
            pd.DataFrame(data=X, columns=raw_data.feature_names),
            pd.DataFrame(data=y, columns=['Value'])
        ],
        axis=1)

In [10]:
raw_df = to_dataframe(raw_data.data, raw_data.target)
raw_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [11]:
raw_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


### Test train split

In [12]:
train_X, test_X, train_y, test_y = model_selection.train_test_split(raw_df[raw_df.columns[:-1]], raw_df['Value'])

In [13]:
train_X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
19665,4.4083,17.0,6.551181,0.996063,736.0,2.897638,37.51,-120.82
343,1.9167,36.0,7.506667,1.753333,482.0,3.213333,37.75,-122.19
14604,3.6371,34.0,6.002639,0.989446,1021.0,2.693931,32.81,-117.16
10559,5.0686,19.0,5.973737,0.993939,1639.0,3.311111,33.62,-117.7
18872,5.2957,18.0,6.483932,1.019849,3205.0,3.029301,38.09,-122.2


### Normalize feature columns by training data only

In [14]:
scaler = preprocessing.StandardScaler()
train_X_scaled = pd.DataFrame(scaler.fit_transform(train_X), index=train_X.index, columns=train_X.columns)
train_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
19665,0.283428,-0.926676,0.439312,-0.202711,-0.619374,-0.015934,0.885741,-0.630227
343,-1.024246,0.582858,0.811759,1.328249,-0.850159,0.012053,0.998126,-1.314789
14604,-0.121324,0.42396,0.225491,-0.216089,-0.360423,-0.033994,-1.315136,1.1986
10559,0.629975,-0.767778,0.214226,-0.207004,0.201093,0.020722,-0.935836,0.928773
18872,0.749164,-0.847227,0.413099,-0.154624,1.623961,-0.004262,1.157338,-1.319786


In [15]:
test_X_scaled = pd.DataFrame(scaler.transform(test_X), index=test_X.index, columns=test_X.columns)
test_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
16389,-0.154703,-0.291083,0.009588,-0.189152,-0.347702,-0.018647,1.138608,-0.84509
3885,0.660362,0.42396,0.184377,-0.072877,-0.59575,-0.033267,-0.659555,0.519036
2754,-1.251971,-0.370532,-0.458338,-0.350788,-0.304089,0.049368,-1.380694,2.033065
15011,-0.5953,-0.211634,-0.469069,-0.140707,0.709909,-0.079802,-1.333867,1.258562
11864,-0.900543,-0.608879,0.048317,0.100511,0.075706,-0.04418,2.187536,-0.835096


In [16]:
train_df = pd.concat([train_X_scaled, train_y], axis=1)
test_df = pd.concat([test_X_scaled, test_y], axis=1)

<div class="alert alert-block alert-success">
<h2>Optional: Version the processed data with DVC for efficiency and/or reproducibility</h2>
</div>

In [19]:
train_df.to_csv('../data/processed/california_households_train.csv', index_label='Index')
test_df.to_csv('../data/processed/california_households_test.csv', index_label='Index')
joblib.dump(scaler, '../data/processed/california_households_scaler.pkl')
!dvc commit -f ../data/process.dvc

[0m                                                                            

### Use this cell to reload processed data, after switching branches

In [21]:
train_df = pd.read_csv('../data/processed/california_households_train.csv', index_col=0)
test_df = pd.read_csv('../data/processed/california_households_train.csv', index_col=0)
scaler = joblib.load('../data/processed/california_households_scaler.pkl')

## Training

In [22]:
model = LinearRegression()
X = train_df[train_df.columns[:-1]]
y = train_df['Value']
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<div class="alert alert-block alert-success">
<h2>Save the trained model for reproducibility</h2>
</div>

In [23]:
joblib.dump(model, '../models/california_households.pkl')
!dvc commit -f ../models.dvc

[0m                                                                            

### Use this cell to reload the model, after switching branches

In [25]:
model = joblib.load('../models/california_households.pkl')

## Evaluate the model

In [26]:
predictions = model.predict(test_df[test_df.columns[:-1]])
truth = test_df['Value']
metrics_dict = {}
metrics_dict['R2'] = metrics.r2_score(truth, predictions)
metrics_dict['MAE'] = metrics.mean_absolute_error(truth, predictions)
metrics_dict['MSE'] = metrics.mean_squared_error(truth, predictions)
metrics_dict['median_absolute_error'] = metrics.median_absolute_error(truth, predictions)
metrics_dict['loss'] = metrics_dict['MSE']
pd.DataFrame(metrics_dict, index=[0])

Unnamed: 0,R2,MAE,MSE,median_absolute_error,loss
0,0.603582,0.535047,0.528588,0.420226,0.528588


<div class="alert alert-block alert-success">
<h2>Save the computed metrics for easy display in DVC and DAGsHub</h2>
</div>

In [28]:
with open('../metrics/metrics.json', 'w') as f:
    json.dump(metrics_dict, f, indent=2)
!dvc commit -f ../eval.dvc

[0m                                                                  core[39m>

<div class="alert alert-block alert-success">
<h2>Versioning section - use the following cells to create a full commit of your current state</h2>
</div>

### Make sure all data and models are committed to DVC
The output of the following cell should be: `Pipeline is up to date. Nothing to reproduce.`

If you get something else, then maybe you forgot to `dvc commit` earlier in the notebook.
We recommend to make sure that the current contents in the data and models directories are to your liking,
and if so, use the commit cell below to automatically commit all current files to DVC.

In [29]:
!dvc status

Data and pipelines are up to date.                                    core[39m>
[0m

In [30]:
# Use this if dvc status is not up-to-date and you're sure the current state is OK.
!dvc commit -f

[0m                                                                            

In [31]:
!dvc push

[31mERROR[39m: failed to push data to the cloud - Unable to find AWS credentials. <[36mhttps://error.dvc.org/no-credentials[39m>: Unable to locate credentials
[0m

In [2]:
import os
import sys
sys.path.append(os.getcwd() + '/..')