# Using DVC for Data and Model version within a Git repo
Based on [RealPython's Tutorial](https://realpython.com/python-data-version-control/)

## Set up DVC and its environment

In [None]:
!conda create -n dvc python=3.8.2

In [None]:
#activate dvc
!conda config --add channels conda-forge
!python -m pip install dvc scikit-learn scikit-image pandas numpy

## Clone forked repo

In [None]:
!git clone https://github.com/carbaro/data-version-control.git 

In [3]:
!cd data-version-control

In [5]:
!echo %CD%

f:\AI\tools\DVC\realpython_tutorial


In [8]:
DVC = 'data-version-control'
!echo $DVC

data-version-control


## setup git and dvc

In [None]:
!git checkout -b "first_experiment"
!dvc init
!dvc config core.analytics false

## Setup "Remote" storage

In [None]:
!dvc remote add -d remote_storage F:\data\dvc\dvc_remote_realpy

## Track files

In [None]:
!dvc add data/raw/train
!dvc add data/raw/val
!git add --all
!git commit -m "First commit with setup and DVC files"

In [None]:
!dvc push
!git push --set-upstream origin first_experimen

## Downloading Files

In [None]:
# delete e.g. raw/val
!dvc checkout /data/raw/val.dvc #or
!dvc pull

## Train a Model

In [None]:
!python src/prepare.py
!python src/train.py
!dvc add model/model.joblib
!git add --all
!git commit -m "Trained SGD Classifier"


## Evaluate

In [None]:
!python src/evaluate.py

In [None]:
!git add --all
!git commit -m "Evaluate SGD accuracy"

## Versioning Dataset and Models 

In [None]:
!git push
!dvc push

### Tagging Commits (marks significant point in history of repo)

In [None]:
!git tag -a sgd-classifier -m "SGDClassifier with accuracy 67.06%"

In [None]:
!git push origin --tags

# Creating one git branch per experiment

In [None]:
!git checkout -b "sgd-100-iters"
!python src/train.py
!python src/evaluate.py

In [None]:
# save new model.joblib
!dvc commit
# confirm with y

In [None]:
!git add --all
!git commit -m "Change SGD max_iter to 100"

!git tag -a sgd-100-iter -m "Trained an SGD Classifier for 100 iterations"
!git push origin --tags

!git push --set-upstream origin sgd-100-iter
!dvc push


# Create Reproducible Pipelines

In [None]:
!git checkout -b sgd-pipeline

## define  pipeline stages
### Note: [RealPython Tutorial](https://realpython.com/python-data-version-control/#create-reproducible-pipelines) advises to use 

```bash
dvc run -n stage_name \
    -d dep1 -d dep2
    -o output1
    python script.py
```

### However, DVC (3.28.0) does not have an option called run. Instead, the method used was that advised on [DVC Docs](https://dvc.org/doc/user-guide/pipelines/defining-pipelines)

```bash
dvc stage add --name train \
                --deps src/model.py \
                --deps data/clean.csv \
                --outs data/predict.dat \
                python src/model.py data/clean.csv
```

In [1]:
flatten_stage_cmd = lambda cmd: cmd.replace('\\n    ', '').replace('\n','')

prepare_cmd = """
dvc stage add -n prepare 
    -d src/prepare.py -d data/raw 
    -o data/prepared/train.csv -o data/prepared/test.csv 
    python src/prepare.py
    """
train_cmd = """
dvc stage add -n train
    -d src/train.py -d data/prepared/train.csv 
    -o model/model.joblib 
    python src/train.py
    """
evaluate_cmd = """
dvc stage add -n evaluate
    -d src/evaluate.py -d model/model.joblib
    -M metrics/accuracy.json
    python src/evaluate.py
    """

flat_stage_cmds = [flatten_stage_cmd(cmd) for cmd in (prepare_cmd,train_cmd,evaluate_cmd)]

In [2]:
_=[print(fcmd) for fcmd in flat_stage_cmds]

dvc stage add -n prepare     -d src/prepare.py -d data/raw     -o data/prepared/train.csv -o data/prepared/val.csv     python src/prepare.py    
dvc stage add -n train    -d src/train.py -d data/prepared/train.csv     -o model/model.joblib     python src/train.py    
dvc stage add -n evaluate    -d src/evaluate.py -d model/model.joblib    -M metrics/accuracy.json    python src/evaluate.py    


### Since the actual pipline was not run, we now need to run the pipeline as an experiment

### See [DVC Docs](https://dvc.org/doc/user-guide/pipelines/running-pipelines) for details

In [None]:
!dvc exp run

# ------------------ TODO: use params to adjust iters ------------------------

## commit changes

In [None]:
!git add --all
!git commit -m "Rerun SGD as pipeline"
!dvc commit
!git push --set-upstream origin sgd-pipeline
!git tag -a sgd-pipeline -m "Trained SGD as DVC pipeline."
!git push origin --tags
!dvc push

I would like to make use of the DVC VSCode Extension, where I can track experiments and compare plots within the same UI 


(No need to start up tensorboard on the browser, manually delete unwanted folders, manage colours, etc.)

So, as an interlude, I've included the code necessary to integrate DVC Extension, which depends on  DVCLive. See details at the [Iterative Blog post on DVC Tracking](https://iterative.ai/blog/exp-tracking-dvc-python?tab=General-Python-API)


Install the extension and pick the correct (DVC) environment in Setup.

Following that, modify train.py to integrate DVC Live