# Generate `dvc`-ready data
This notebook shows how to create `dvc`-ready training data for use in `with-context` custom SageMaker containers.

In [2]:
"""Get the MNIST dataset."""

import boto3
from torchvision import datasets, transforms

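# Use the region of the current AWS session to pick the regional example-data bucket.
region = boto3.Session().region_name

# Point torchvision at the SageMaker example-files mirror instead of its default hosts.
datasets.MNIST.mirrors = [
    f"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/"
]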

train_set = datasets.MNIST(
    "../data/",
    download=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    ),
)

Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz
Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:02<00:00, 4738816.33it/s]


Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz
Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 1495336.24it/s]

Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz
Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 4044811.43it/s]


Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz
Downloading https://sagemaker-example-files-prod-us-east-1.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 1690525.23it/s]

Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
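
Before handing the directory to DVC, it can help to confirm that the four extracted files named in the log above are actually on disk. The following is a minimal sketch, assuming the `../data/MNIST/raw` layout shown in that log; the helper name `missing_mnist_files` is hypothetical and is not part of torchvision or DVC.

```python
from pathlib import Path

# File names taken from the "Extracting ..." lines in the download log above.
EXPECTED_RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def missing_mnist_files(raw_dir):
    """Return the expected raw MNIST files that are absent from raw_dir.

    Hypothetical helper for illustration only.
    """
    raw_dir = Path(raw_dir)
    return [name for name in EXPECTED_RAW_FILES if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    # Assumes the notebook's relative data path; adjust if your layout differs.
    missing = missing_mnist_files("../data/MNIST/raw")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All raw MNIST files are present.")
```

An empty list means all four extracted files were found and the directory is ready to be tracked.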






In the following cell, we run a bash script that registers the S3 bucket as the default DVC remote, tracks the dataset with `dvc add`, and uploads it with `dvc push`. Running `dvc add` writes or updates the `data.dvc` file, which records the current version of the dataset; if this is the first time the dataset is tracked, the file is created. This is similar to committing a file in Git, and the `data.dvc` file itself is meant to be tracked with Git.

In [4]:
%%bash
dvc remote add -d mnist s3://with-context-sagemaker/datasets/mnist/
dvc add ../data/
dvc push

Setting 'mnist' as a default remote.


To track the changes with git, run:

	git add ../data.dvc

To enable auto staging, run:

	dvc config core.autostage true
Everything is up to date.
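
To make the versioning step more concrete: DVC identifies data by content hash (MD5 by default), so the hash recorded in `data.dvc` changes whenever any tracked file changes. The sketch below is a simplified illustration of that idea, not DVC's actual implementation or on-disk format; `file_md5` and `directory_fingerprint` are hypothetical helpers, and the value they produce will not match the hash DVC writes.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_fingerprint(root):
    """Hash a directory via a sorted manifest of (relative path, file MD5) pairs.

    Hypothetical helper: it mirrors the idea of content addressing only and
    does not reproduce the hash that DVC stores in data.dvc.
    """
    root = Path(root)
    manifest = sorted(
        (path.relative_to(root).as_posix(), file_md5(path))
        for path in root.rglob("*")
        if path.is_file()
    )
    payload = json.dumps(manifest).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

Calling `directory_fingerprint("../data")` before and after modifying any file under `../data/` gives two different values, which is the same reason re-running `dvc add` on changed data produces an updated `data.dvc`.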

