# What is DVC?
- DVC (Data Version Control) is a tool for managing data and ML experiments

# Install dvc

In [2]:
# For mac
# Any command that works at the command line can be run in a notebook by prefixing it with "!"
!brew install dvc
#pip install dvc

# For Windows
#choco install dvc
#pip install dvc

# For Linux
#pip install dvc
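
After installing, it can help to confirm that the notebook's environment can actually find the `dvc` executable, since a Homebrew or pip install may land in a directory that is not on the kernel's PATH. The following is a minimal sketch using only the Python standard library; the helper name `find_executable` is illustrative and not part of DVC.

```python
import shutil
from typing import Optional


def find_executable(name: str) -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


# Hypothetical check, run in a notebook cell after the install step.
dvc_path = find_executable("dvc")
if dvc_path is None:
    print("dvc not found on PATH; re-run the install cell or restart the kernel")
else:
    print(f"dvc found at {dvc_path}")
```

If the check reports that `dvc` is missing, later cells that call `! dvc ...` will likely fail with a "command not found" error.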


# Initialize dvc
dvc init creates a .dvc/ directory. A few of its internal files should be added to Git (such as .dvc/config and .dvc/.gitignore); the cache and temporary directories are excluded by .dvc/.gitignore and stay out of Git:
- .dvc/config: The configuration file. It can be edited by hand or with the dvc config command.
- .dvc/cache: Default location of the cache directory, which stores the project data in a special structure (not committed to Git).
- .dvc/cache/runs: Default location of the run-cache.
- .dvc/plots: Directory for plot templates.
- .dvc/tmp: Directory for miscellaneous temporary files (not committed to Git).
- and more...


In [None]:
! dvc init
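
`dvc init` expects to run inside a Git repository (unless it is given the `--no-scm` option), so it can be worth checking where the notebook's working directory sits before running the cell above. A minimal sketch, assuming a conventional `.git` entry marks the repository root; `find_git_root` is a hypothetical helper, not a DVC or Git API.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing .git."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


root = find_git_root(Path.cwd())
if root is None:
    print("No Git repository found; run `git init` before `dvc init`")
else:
    print(f"Git repository root: {root}")
```

If the reported root differs from the notebook's working directory, changing into the root first is usually the simplest option.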

# DVC features
DVC's features can be grouped into functional components. You can explore them in two independent trails:
- Data Management Trail:
    - Data and model versioning - The base layer of DVC for large files, datasets, and machine learning models. Use a regular Git workflow, but without storing large files in the repo (think "Git for data"). Data is stored separately, which allows for efficient sharing.

- Experiments Trail:
    - Experiments versioning - Enable exploration, iteration, and comparison across many ML experiments. Track your experiments with automatic versioning and checkpoint logging. Compare differences in parameters, metrics, code, and data. Apply, drop, roll back, resume, or share any experiment.

# Get a sample dataset

In [None]:
! dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

# Use dvc add to start tracking a file or directory 
- DVC stores information about the added file in a special .dvc file named data/data.xml.dvc, a small text file in a human-readable format. This metadata file is a placeholder for the original data and can be versioned like source code with Git.
- dvc add also lists the data file in data/.gitignore so that Git ignores the data itself, which is why both files are committed below:

In [None]:
! dvc add data/data.xml

In [None]:
! git add data/data.xml.dvc data/.gitignore
! git commit -m "Add raw data"

# Remote Storage
- dvc push uploads DVC-tracked data or model files to remote storage so they can be retrieved later in other environments with dvc pull.

In [1]:
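# Set up storage location (s3://mybucket/dvcstore is an example; use a bucket you can write to)
# -d marks this remote as the default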
! dvc remote add -d storage s3://mybucket/dvcstore
! git add .dvc/config
! git commit -m "Configure remote storage"



In [None]:
# Push to remote storage
! dvc push

dvc push copied the data cached locally to the remote storage we set up earlier. The remote storage directory should look like this:

    .../dvcstore
    └── 22
        └── a1a2931c8370d3aeedd7183606fd7f
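
The directory and file names in that listing come from a content hash: the first two characters become the directory and the rest becomes the file name. The sketch below illustrates the idea, assuming a plain MD5 over the file's bytes; DVC's own hashing and layout can differ between versions (newer releases may nest these paths under an extra prefix), and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def storage_relpath(md5_hex: str) -> str:
    """Split a hash into '<first two chars>/<remaining chars>'."""
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


# The hash implied by the listing: directory "22" plus the file name.
print(storage_relpath("22a1a2931c8370d3aeedd7183606fd7f"))
# → 22/a1a2931c8370d3aeedd7183606fd7f
```

Because the path depends only on the content, two identical files map to the same location and are stored once.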

# Pipelines
- DVC pipelines capture data pipelines so you can keep track of the data processes that produce a final result: how is data filtered, transformed, or used to train ML models?
- When you add a stage with dvc stage add, a dvc.yaml file is generated. This file includes information about the command we want to run (python src/prepare.py data/data.xml), its dependencies, and its outputs.
- DVC uses these metafiles to track the data used and produced by the stage, so there's no need to use dvc add on data/prepared manually.

In [None]:
! dvc stage add -n prepare \
                -p prepare.seed,prepare.split \
                -d src/prepare.py -d data/data.xml \
                -o data/prepared \
                python src/prepare.py data/data.xml

    -n prepare specifies a name for the stage. If you open the dvc.yaml file you will see a section named prepare.

    -p prepare.seed,prepare.split defines a special type of dependency: parameters. They are covered in more detail in the Metrics, Parameters, and Plots page of the DVC documentation, but the idea is that the stage can depend on field values from a parameters file (params.yaml by default):

prepare:
  split: 0.20
  seed: 20170428

    -d src/prepare.py and -d data/data.xml mean that the stage depends on these files to work. Notice that the source code itself is marked as a dependency. If any of these files change later, DVC will know that this stage needs to be reproduced.

    -o data/prepared specifies an output directory for this script, which writes two files in it. This is how the workspace should look after the run:

     .
     ├── data
     │   ├── data.xml
     │   ├── data.xml.dvc
    +│   └── prepared
    +│       ├── test.tsv
    +│       └── train.tsv
    +├── dvc.yaml
    +├── dvc.lock
     ├── params.yaml
     └── src
         ├── ...

    The last line, python src/prepare.py data/data.xml, is the command to run in this stage, and it's saved to dvc.yaml as the stage's cmd entry.


Once you have added a stage, you can run the pipeline with dvc repro. Next, you can use dvc push if you wish to save all the data to remote storage (usually along with git commit to version the DVC metafiles).
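
The idea that a stage needs to be reproduced when one of its dependencies changes can be sketched as a comparison between recorded and current file hashes. This is a simplified illustration only: DVC keeps its own records in dvc.lock and handles directories, parameters, and outputs as well. The function names and the temporary file below are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def snapshot(paths: Iterable[Path]) -> Dict[str, str]:
    """Record an MD5 digest for each dependency file."""
    return {str(p): hashlib.md5(p.read_bytes()).hexdigest() for p in paths}


def changed_dependencies(recorded: Dict[str, str], paths: Iterable[Path]) -> List[str]:
    """Return the dependencies whose current digest differs from the recorded one."""
    current = snapshot(paths)
    return sorted(name for name, digest in current.items() if recorded.get(name) != digest)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "prepare.py"  # hypothetical stand-in for src/prepare.py
    script.write_text("print('v1')\n")
    recorded = snapshot([script])
    print(changed_dependencies(recorded, [script]))  # nothing changed yet
    script.write_text("print('v2')\n")
    print(changed_dependencies(recorded, [script]))  # the edited script is reported
```

An empty list corresponds to a stage that is up to date; any reported path means the stage's command would need to run again.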