## Data Version Control (DVC) exercise - creating a pipeline

This exercise uses DVC to capture a data pipeline that recreates the steps performed in training a periodic spline model on historic temperature range measurements in [_Lecture 22 (Data Pipelines)_](Lecture22_DataPipelines.ipynb). It assumes you have some familiarity with using Git from the command line. If you are not familiar with the Git command-line interface, you can run just the DVC commands without also tracking the changes with Git. In that case you will need [to pass the `--no-scm` option to `dvc init`](https://dvc.org/doc/command-reference/init#initializing-dvc-without-git) and should ignore the exercise parts <span style="color: red;">in red</span>.

Running the cell below creates a new `dvc-pipeline-example` directory in the current working directory and makes it the working directory. This keeps the files we create to construct the example pipeline isolated.

In [1]:
import os
os.mkdir("dvc-pipeline-example")
os.chdir("dvc-pipeline-example")

To use DVC you will need to have [DVC installed](https://dvc.org/doc/install) and to be able to run the `dvc` command from a [shell](https://en.wikipedia.org/wiki/Shell_(computing)) (command-line interface to the operating system). <span style="color: red;">You will also need to have Git installed and be able to run `git` for the Git parts of the exercises.</span> If you are running the Jupyter notebook server from a Linux or MacOS system (or from a [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install) shell on Windows) then you can use the [`%%sh`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-bash) IPython cell magic command to run shell commands directly from cells in the notebook - just write `%%sh` as the first line of the cell followed by one or more lines corresponding to the commands to execute. On Windows, you can launch an Anaconda Prompt terminal and run the commands from there (ensuring `dvc` and `git` are installed in the activated `conda` environment and that you are within the `dvc-pipeline-example` directory).
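
Before starting, it can help to confirm that both executables can be found on the `PATH`. The following is a minimal sketch using only the Python standard library; `missing_tools` is a hypothetical helper name chosen for illustration and is not part of DVC or Git.

```python
import shutil


def missing_tools(tool_names):
    """Return the names in tool_names that cannot be found on the PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Check for the two command-line tools used in this exercise.
missing = missing_tools(["dvc", "git"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Both dvc and git are available")
```

If you are skipping the Git parts of the exercise, only `dvc` needs to be found.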

## Initialising Git and DVC

As a first step:
  1. <span style="color: red;">Create a new empty Git repository in the `dvc-pipeline-example` directory.</span>
  2. Initialise a new DVC project in the `dvc-pipeline-example` directory.
  3. <span style="color: red;">Commit the files created by DVC to the current Git branch.</span>

In [2]:
%%sh
git init
dvc init
git commit -m "Initialize DVC"

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /home/runner/work/course_mlbd/course_mlbd/Lectures/dvc-pipeline-example/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
[master (root-commit) 4a853be] Initialize DVC
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

## Importing the remote data file

  1. Use the `dvc import-url` command to import the gzipped CSV file `1800.csv.gz`, containing the historical climate records for the year 1800, from the `noaa-ghcn-pds` Amazon Web Services (AWS) S3 bucket into the local project.
  2. <span style="color: red;">Add the files generated by DVC on importing the data file to the staging area and commit these changes with Git.</span>
  
_Hint_: To avoid having to provide Amazon Web Services credentials (which are required when using DVC's built-in support for accessing AWS S3 storage) you can instead use HTTP to access data on the bucket, using a URL of the form `http://{bucket-name}.s3.amazonaws.com/{path-to-file}` where `{bucket-name}` is the name of the (publicly accessible) S3 bucket and `{path-to-file}` is the path to the file on the bucket.
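
As a small illustration of the URL pattern in the hint, the sketch below fills in the template for the file used in this exercise. The function name `s3_http_url` is hypothetical, and the path `csv.gz/1800.csv.gz` is the one used in the import command in the solution cell.

```python
def s3_http_url(bucket_name, path_to_file):
    """Build the HTTP URL for a file in a publicly accessible S3 bucket."""
    return f"http://{bucket_name}.s3.amazonaws.com/{path_to_file.lstrip('/')}"


url = s3_http_url("noaa-ghcn-pds", "csv.gz/1800.csv.gz")
print(url)
```

The resulting string is what gets passed as the first argument to `dvc import-url`.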

In [3]:
%%sh
dvc import-url http://noaa-ghcn-pds.s3.amazonaws.com/csv.gz/1800.csv.gz 1800.csv.gz
git add .gitignore 1800.csv.gz.dvc
git commit -m "Initial data import"

Importing 'http://noaa-ghcn-pds.s3.amazonaws.com/csv.gz/1800.csv.gz' -> '1800.csv.gz'

To track the changes with git, run:

	git add 1800.csv.gz.dvc .gitignore

To enable auto staging, run:

	dvc config core.autostage true
[master 0a9f8af] Initial data import
 2 files changed, 13 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 1800.csv.gz.dvc


## Writing a Python script to prepare data for model training

Create a Python script `prepare.py` in the `dvc-pipeline-example` directory which recreates the preprocess, transform and serve stages of the example pipeline in the [_Lecture 22 (Data Pipelines)_ notebook](Lecture22_DataPipelines.ipynb). That is, the script should

  1. Extract the CSV data from the downloaded `1800.csv.gz` file.
  2. Select the records corresponding to temperature (`TMIN` and `TMAX`) measurements for station `ITE00100554`.
  3. Transform the extreme temperature measurements to temperature ranges in degrees Celsius.
  4. Write out the transformed data as a feature matrix (containing the day of year of the measurement as integers) and targets array (containing the temperature ranges in degrees Celsius) ready to feed into a scikit-learn regression model.

The script should take as arguments the path to the input data file and the path to write the prepared training data to.

_Hint 1_: Most of the code you need is already in the [_Lecture 22 (Data Pipelines)_ notebook](Lecture22_DataPipelines.ipynb).  
_Hint 2_: The [`numpy.savez`](https://numpy.org/doc/stable/reference/generated/numpy.savez.html) function may be useful for writing NumPy array data to a file.  
_Hint 3_: You can use [the `%%writefile` cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-writefile) to create a Python script file from the contents of a cell within the notebook.
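
To make the extract, select and transform steps concrete, here is a dependency-free sketch that uses only the standard library on a handful of made-up records. It is not the pandas-based approach from the lecture. The rows are synthetic and contain only the station, date, quantity and value columns, with values in tenths of a degree Celsius.

```python
import csv
import gzip
import io

# Synthetic records in station, date, quantity, value order. These rows are
# made up for illustration and are not taken from the real data file.
rows = [
    ["ITE00100554", "18000101", "TMAX", "-75"],
    ["ITE00100554", "18000101", "TMIN", "-148"],
    ["GM000010962", "18000101", "PRCP", "0"],
    ["ITE00100554", "18000102", "TMAX", "-60"],
    ["ITE00100554", "18000102", "TMIN", "-125"],
]

# Write the rows to an in-memory gzipped CSV, standing in for 1800.csv.gz.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", newline="") as f:
    csv.writer(f).writerows(rows)
buffer.seek(0)

# Extract the CSV, keep the station's TMIN/TMAX records and group by date.
extremes = {}
with gzip.open(buffer, "rt", newline="") as f:
    for station, date, quantity, value in csv.reader(f):
        if station == "ITE00100554" and quantity in ("TMIN", "TMAX"):
            extremes.setdefault(date, {})[quantity] = int(value)

# Convert tenths of a degree into a daily range in degrees Celsius.
temperature_range = {
    date: (values["TMAX"] - values["TMIN"]) / 10
    for date, values in extremes.items()
}
print(temperature_range)
```

The pandas version does the same grouping with a pivot on the date column, which scales better to the full file.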

In [4]:
%%writefile prepare.py
"""Prepare historical temperature records data for model training"""

import argparse
import gzip
import numpy
import pandas
from pathlib import Path


def preprocess(gzipped_data_path):
    with gzip.open(gzipped_data_path, "rb") as f:
        input_data = pandas.read_csv(
            f, usecols=range(4), names=["station", "date", "quantity", "value"]
        )
    return input_data.query(
        'station == "ITE00100554" and quantity in ["TMIN", "TMAX"]'
    )


def transform(preprocessed_data):
    pivotted_data = preprocessed_data.pivot(
        index="date", columns="quantity", values="value"
    )
    pivotted_data.index = pandas.to_datetime(pivotted_data.index, format="%Y%m%d")
    return pandas.DataFrame(
        {"temperature_range": (pivotted_data.TMAX - pivotted_data.TMIN) / 10}
    )


def prepare_for_sklearn(transformed_data):
    return {
        "features": transformed_data.index.day_of_year.array[:, None],
        "targets": transformed_data.temperature_range.array
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Prepare training data")
    parser.add_argument("--input-data-file", type=Path, required=True)
    parser.add_argument("--output-file", type=Path, required=True)
    args = parser.parse_args()
    preprocessed_data = preprocess(args.input_data_file)
    transformed_data = transform(preprocessed_data)
    training_data = prepare_for_sklearn(transformed_data)
    numpy.savez(args.output_file, **training_data)


Writing prepare.py


## Adding data preparation pipeline stage to DVC

Add a stage to the DVC data pipeline corresponding to preparing the data for model training using the script you just created. The stage should

  * Be named `prepare`.
  * Have as dependencies the `prepare.py` script and the `1800.csv.gz` data file.
  * Have an output file containing the prepared training data feature matrix and target array.
  * Execute the `prepare.py` script using `python` as the pipeline command, passing in the input data file and output file name as arguments.

You can add the stage either by directly creating a `dvc.yaml` file or by using the `dvc stage add` command. <span style="color: red;">Once you have created the stage, add the relevant files for the stage to the Git staging area and create a new commit.</span>
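
The stage properties listed above map onto `dvc stage add` options: `-n` for the name, `-d` for each dependency, `-o` for each output and `-p` for parameters, followed by the command to run. The sketch below assembles such a command line from a stage description. `stage_add_command` is a hypothetical helper, `training_data.npz` is just one possible output file name, and `shlex.join` requires Python 3.8 or later.

```python
import shlex


def stage_add_command(name, deps, outs, cmd, params=()):
    """Assemble a `dvc stage add` command line from a stage description."""
    args = ["dvc", "stage", "add", "-n", name]
    for param in params:
        args += ["-p", param]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    # The stage command goes last, after all of the options.
    return shlex.join(args + shlex.split(cmd))


command = stage_add_command(
    name="prepare",
    deps=["prepare.py", "1800.csv.gz"],
    outs=["training_data.npz"],
    cmd=(
        "python prepare.py --input-data-file 1800.csv.gz "
        "--output-file training_data.npz"
    ),
)
print(command)
```

The printed command can be pasted into a shell; the sketch only builds the string and does not run DVC.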

In [5]:
%%sh
dvc stage add \
  -n prepare \
  -d prepare.py -d 1800.csv.gz \
  -o training_data.npz \
  python prepare.py --input-data-file 1800.csv.gz --output-file training_data.npz
git add .gitignore dvc.yaml prepare.py
git commit -m "Adding data preparation stage of pipeline"

Added stage 'prepare' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml .gitignore

To enable auto staging, run:

	dvc config core.autostage true
[master e45ea95] Adding data preparation stage of pipeline
 3 files changed, 54 insertions(+)
 create mode 100644 dvc.yaml
 create mode 100644 prepare.py


## Writing a Python script to train scikit-learn regression model

Create a Python script `train.py` in the `dvc-pipeline-example` directory which recreates the model training stage of the example pipeline in the [_Lecture 22 (Data Pipelines)_ notebook](Lecture22_DataPipelines.ipynb). That is, the script should

  1. Create a scikit-learn pipeline corresponding to [a periodic spline regression model with L2 regularization](https://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#periodic-splines).
  2. Fit the model to the training data feature matrix and targets array prepared by the previous pipeline stage.
  3. Serialize the trained model to a file.
  
The number of knots to use in the spline model, the degree of the spline polynomial and the [L2 regularization coefficient $\alpha$](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) should all be configurable parameters read from an external file (for example in JSON or YAML format) so that they can be [captured as parameters of the pipeline stage](https://dvc.org/doc/start/data-management/metrics-parameters-plots#defining-stage-parameters). The script should accept as arguments the path to the file containing the prepared training data (the output of the previous pipeline stage), the path to the parameters file and the path to write the serialized trained model to.

_Hint 1_: Most of the code you need is again in the [_Lecture 22 (Data Pipelines)_ notebook](Lecture22_DataPipelines.ipynb).  
_Hint 2_: The [`json` module in the Python standard library](https://docs.python.org/3/library/json.html) or [third-party PyYAML package](https://pyyaml.org/wiki/PyYAMLDocumentation) may be useful for reading from the parameter file.
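
As one possible way of reading the parameters, the sketch below parses a JSON document and checks that the expected entries are present before they are used. The section name `train_model`, the key names and the helper `load_train_parameters` are assumptions made for this example; any consistent naming would work.

```python
import json

# Assumed parameter names; choose whatever names suit your own script.
EXPECTED_KEYS = {"num_knots", "spline_degree", "ridge_alpha"}


def load_train_parameters(json_text):
    """Parse JSON text and return the validated train_model parameters."""
    parameters = json.loads(json_text)["train_model"]
    missing = EXPECTED_KEYS - set(parameters)
    if missing:
        raise KeyError(f"Missing parameters: {sorted(missing)}")
    return parameters


example = (
    '{"train_model": {"num_knots": 7, "spline_degree": 3, "ridge_alpha": 0.01}}'
)
parameters = load_train_parameters(example)
print(parameters)
```

Failing early on a missing key gives a clearer error than a `TypeError` raised later when the values are unpacked into a function call.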

In [6]:
%%writefile train.py
"""Train periodic spline model on prepared data"""

import argparse
import json
import numpy
import pickle
from pathlib import Path
from sklearn import pipeline, preprocessing, linear_model


def train_model(training_data, num_knots, spline_degree, ridge_alpha):
    knots = numpy.linspace(1, 365, num_knots)[:, None]
    model = pipeline.make_pipeline(
        preprocessing.SplineTransformer(
            degree=spline_degree, knots=knots, extrapolation="periodic"
        ),
        linear_model.Ridge(ridge_alpha)
    )
    model.fit(training_data["features"], training_data["targets"])
    return model


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train model")
    parser.add_argument("--training-data-file", type=Path, required=True)
    parser.add_argument("--parameters-file", type=Path, required=True)
    parser.add_argument("--output-file", type=Path, required=True)
    args = parser.parse_args()
    with open(args.parameters_file, "r") as f:
        parameters = json.load(f)
    training_data = numpy.load(args.training_data_file)
    model = train_model(training_data, **parameters["train_model"])
    with open(args.output_file, "wb") as f:
        pickle.dump(model, f)


Writing train.py


## Adding a model training pipeline stage

Add a stage to the DVC data pipeline corresponding to training the model using the script you just created. The stage should

  * Be named `train`.
  * Have as parameters the number of spline knots, spline polynomial degree and L2 regularization coefficient $\alpha$.
  * Have as dependencies the `train.py` script and training data file outputted by the previous `prepare` stage.
  * Have an output file corresponding to the serialized trained model.
  * Execute the `train.py` script using `python` as the pipeline command, passing in the training data file, parameters file and output file name as arguments.
  
You will also need to create the parameter file. For the parameter values we suggest using 7 spline knots, a spline degree of 3 and an L2 regularization coefficient $\alpha$ of 0.01. <span style="color: red;">Once you have created the stage, add the relevant files for the stage to the Git staging area and create a new commit.</span>

In [7]:
%%writefile params.json
{
    "train_model": {"num_knots": 7, "spline_degree": 3, "ridge_alpha": 1e-2}
}

Writing params.json


In [8]:
%%sh
dvc stage add \
  -n train \
  -p params.json:train_model.num_knots,train_model.spline_degree,train_model.ridge_alpha \
  -d train.py -d training_data.npz \
  -o model.pkl \
  python train.py --training-data-file training_data.npz --parameters-file params.json --output-file model.pkl
git add .gitignore dvc.yaml params.json train.py
git commit -m "Adding model training stage of pipeline"

Added stage 'train' in 'dvc.yaml'

To track the changes with git, run:

	git add .gitignore dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true
[master a33a1f4] Adding model training stage of pipeline
 4 files changed, 51 insertions(+)
 create mode 100644 params.json
 create mode 100644 train.py


## Writing a Python script to plot model predictions

Create a Python script `plot.py` in the `dvc-pipeline-example` directory which plots the trained model's predictions as in the example pipeline in the [_Lecture 22 (Data Pipelines)_ notebook](Lecture22_DataPipelines.ipynb). That is, the script should

  1. Load the serialized trained model and training data from files.
  2. Compute the model predictions on the training data input features.
  3. Plot the model predictions and training target values (daily temperature range in degrees Celsius) against the corresponding input feature values (day of the year) as separate line plots on the same axes using Matplotlib.
  4. Save the generated plot figure to a file.

The script should accept as arguments the path to the file containing the training data, the path to the file containing the serialized trained model and the path to output the saved figure to.

_Hint 1_: Most of the code you need is again in the [_Lecture 22 (Data Pipelines)_ notebook](Lecture22_DataPipelines.ipynb).  
_Hint 2_: The [Matplotlib `Figure.savefig` method](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.savefig) may be useful for saving the plot figure to a file.
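
The loading step can be sketched with the standard library `pickle` module. The example below uses a plain dictionary of made-up coefficients as a stand-in for a trained model and an in-memory buffer in place of a file on disk, so it illustrates the round trip only and not the scikit-learn model itself.

```python
import io
import pickle

# Stand-in for a trained model: a dictionary of made-up coefficients.
model = {"intercept": 6.0, "slope": 0.01}

# Serialize to an in-memory buffer, standing in for the model file on disk.
buffer = io.BytesIO()
pickle.dump(model, buffer)

# Rewind and load the object back, as the plotting script would from a file.
buffer.seek(0)
restored = pickle.load(buffer)

# Use the restored object to compute predictions for a few days of the year.
days = [1, 100, 200]
predictions = [restored["intercept"] + restored["slope"] * day for day in days]
print(predictions)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.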

In [9]:
%%writefile plot.py
"""Plot trained model predictions"""

import argparse
import matplotlib.pyplot
import numpy
import pickle
from pathlib import Path


def plot_data_and_model_predictions(training_data, model, **fig_kwargs):
    fig, ax = matplotlib.pyplot.subplots(**fig_kwargs)
    features = training_data["features"]
    # Pass the features with each series so that both lines are plotted against
    # day of year rather than the predictions falling back to the array index.
    ax.plot(features, training_data["targets"])
    ax.plot(features, model.predict(features))
    ax.set(xlabel="Day of year", ylabel=r"Daily temperature range / $^\circ C$")
    ax.legend(["Data", "Spline fit"])
    return fig, ax


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Plot model predictions")
    parser.add_argument("--training-data-file", type=Path, required=True)
    parser.add_argument("--model-file", type=Path, required=True)
    parser.add_argument("--output-file", type=Path, required=True)
    args = parser.parse_args()
    training_data = numpy.load(args.training_data_file)
    with open(args.model_file, "rb") as f:
        model = pickle.load(f)
    fig, ax = plot_data_and_model_predictions(training_data, model, figsize=(8, 3))
    fig.tight_layout()
    fig.savefig(args.output_file)


Writing plot.py


## Adding a plotting pipeline stage

Add a stage to the DVC data pipeline corresponding to plotting the model predictions using the script you just created. The stage should

  * Be named `plot`.
  * Have as dependencies the `plot.py` script and the training data and serialized model files output by the previous stages.
  * Have an output file corresponding to the generated figure.
  * Execute the `plot.py` script using `python` as the pipeline command, passing in the training data file, trained model file and output file name as arguments.
  
<span style="color: red;">Once you have created the stage, add the relevant files for the stage to the Git staging area and create a new commit.</span>

In [10]:
%%sh
dvc stage add \
  -n plot \
  -d plot.py -d training_data.npz -d model.pkl \
  -o predictions.pdf \
  python plot.py --training-data-file training_data.npz --model-file model.pkl --output-file predictions.pdf
git add .gitignore dvc.yaml plot.py
git commit -m "Adding prediction plotting stage of pipeline"

Added stage 'plot' in 'dvc.yaml'

To track the changes with git, run:

	git add .gitignore dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true
[master 93d3c6e] Adding prediction plotting stage of pipeline
 3 files changed, 43 insertions(+)
 create mode 100644 plot.py


## Visualizing and running the pipeline

Visualize the DVC data pipeline you have created as a directed acyclic graph and test running the pipeline.


_Hint_: The [DVC Data Pipelines documentation](https://dvc.org/doc/start/data-management/data-pipelines) explains the commands you need here.
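
Because the stages form a directed acyclic graph, a valid execution order is any topological ordering of that graph. The sketch below illustrates the idea with the standard library `graphlib` module (Python 3.9 or later) and the stage names from this exercise. It is an illustration of the concept only and makes no claim about how DVC computes the order internally.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages (or imports) whose outputs it depends on.
dependencies = {
    "prepare": {"1800.csv.gz.dvc"},
    "train": {"prepare"},
    "plot": {"prepare", "train"},
}

# A topological ordering lists every stage after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

For this pipeline the ordering is unique, since each stage depends on the one before it.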

In [11]:
%%sh
dvc dag
dvc repro

    +-----------------+
    | 1800.csv.gz.dvc |
    +-----------------+
              *
              *
              *
        +---------+
        | prepare |
        +---------+
         *        *
       **          *
      *             **
+-------+             *
| train |           **
+-------+          *
         *        *
          **    **
            *  *
          +------+
          | plot |
          +------+
'1800.csv.gz.dvc' didn't change, skipping
Running stage 'prepare':
> python prepare.py --input-data-file 1800.csv.gz --output-file training_data.npz
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'train':
> python train.py --training-data-file training_data.npz --parameters-file params.json --output-file model.pkl
Updating lock file 'dvc.lock'

Running stage 'plot':
> python plot.py --training-data-file training_data.npz --model-file model.pkl --output-file predictions.pdf
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
