# Generate `dvc`-ready data
This notebook shows how to create `dvc`-ready training data for use in `with-context` custom SageMaker containers.

In [4]:
"""Generate a dataset of documents from the 20 newsgroups dataset."""

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

data = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

documents = data['data']

In [5]:
"""Write this dataset to a file, one document per line."""
with open('../data/training_data.txt', mode='w', encoding='utf-8') as file:
    # Escape the newlines inside each document so that it occupies exactly one line
    for document in documents:
        file.write(
            document.replace('\n', '\\n') + '\n'
        )
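
The cell above stores one document per line by replacing each newline with the two characters `\n`. Code that consumes the file has to reverse that step. The following is a minimal sketch of such a reader, not part of the original notebook: the helper name `load_documents` and the sample documents are hypothetical, and the file is assumed to be UTF-8 encoded. It writes a small sample to a temporary directory in the same format and reads it back.

```python
import tempfile
from pathlib import Path


def load_documents(path):
    """Read one document per line and restore the escaped newlines.

    Hypothetical helper: it assumes the file was written with the
    newline-escaping scheme used in the cell above and is UTF-8 encoded.
    """
    with open(path, mode="r", encoding="utf-8") as file:
        return [line.rstrip("\n").replace("\\n", "\n") for line in file]


# Made-up sample documents, used only to demonstrate the round trip.
sample_documents = ["first line\nsecond line", "single line"]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "training_data.txt"

    # Write the sample in the same format as the training data file.
    with open(sample_path, mode="w", encoding="utf-8") as file:
        for document in sample_documents:
            file.write(document.replace("\n", "\\n") + "\n")

    restored_documents = load_documents(sample_path)
```

Note that this escaping scheme is not fully reversible: a document that already contains a literal backslash followed by `n` would be decoded with an extra newline. If that matters for your data, a format such as JSON Lines may be a safer choice.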

In the following cell, we run a bash script that puts the dataset under DVC version control and uploads it to the S3 bucket configured as the DVC remote. `dvc add` creates the `data.dvc` file the first time the dataset is tracked and updates it whenever the data changes, which is how DVC knows that there is a new version of the dataset. `dvc push` then uploads the data itself to the bucket. This works much like committing a file in git, except that git only tracks the small `data.dvc` file while the data lives in the remote.

In [6]:
%%bash
# dvc remote add -d bert-topic s3://with-context-sagemaker/datasets/bert-topic/
dvc add ../data/
dvc push

To track the changes with git, run:

	git add ../data.dvc

To enable auto staging, run:

	dvc config core.autostage true
1 file pushed