# Prepare a dataset for use in a training task

**Goals**: show the steps necessary to download, preprocess, and register a dataset for use with AzureML. 

Working with AzureML registered datasets allows you to avoid bundling the dataset with the context you send to each training node (slow) or setting up a persistent mount on your DeepSpeed Docker images (more complex and brittle). Additionally, using registered datasets offers the opportunity to log what version of a dataset a model was trained with. This means you can monitor the effects of retraining over time as your data increases or changes. 

A note on terminology: I'll endeavor to use the term *dataset* for the AzureML dataset object stored on AzureML, and *data* for the same information stored locally.

Here we'll be working with the [CoLA](https://nyu-mll.github.io/CoLA/) dataset, on which we'll later be fine-tuning a model. We'll perform the following steps:

- Download the data appropriate for the training task
- Preprocess (tokenize) it
- Save it locally
- Upload the preprocessed data to the AzureML datastore where our AzureML datasets are located
- Register the resulting files as a new AzureML dataset

First, the imports:

In [1]:
import yaml
import azureml.core
import datasets
import transformers


## Using an external config.yml

Now we'll load a `src/config.yml` file that tells us the task name and data locations. The training functions read the same file, which keeps these settings in one place. More information about it is available in the companion `Train model` notebook.

In [2]:
with open('src/config.yml', 'r') as f:
    config = yaml.safe_load(f)
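
The rest of this notebook reads three keys from the loaded config: `task`, `model`, and `data_dir`. A minimal sketch of a check that fails early if one is missing is shown below. The helper name `validate_config` and the example values are assumptions for illustration, not part of the original project; in particular the model name is hypothetical.

```python
# Keys that later cells in this notebook look up in the config
REQUIRED_KEYS = ("task", "model", "data_dir")


def validate_config(config):
    """Raise KeyError if a key used later in the notebook is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"config.yml is missing required keys: {missing}")
    return config


# Hypothetical values for illustration; the real ones live in src/config.yml
example_config = {
    "task": "cola",
    "model": "distilbert-base-uncased",  # assumed model name
    "data_dir": "data/glue/cola",
}
validate_config(example_config)
```

Calling `validate_config(config)` right after `yaml.safe_load` would surface a typo in the file before any download or upload starts.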

## Load or mount data 

The data itself will be downloaded by the `datasets` library from [Hugging Face](https://huggingface.co/). Loading this public dataset is quite simple, but this is the step where you would mount or otherwise access a proprietary dataset locally.

In [3]:
data = datasets.load_dataset("glue", config['task'])

Downloading builder script: 100%|██████████| 28.8k/28.8k [00:00<00:00, 48.5MB/s]
Downloading metadata: 100%|██████████| 28.7k/28.7k [00:00<00:00, 52.0MB/s]
Downloading readme: 100%|██████████| 27.9k/27.9k [00:00<00:00, 45.5MB/s]
Downloading data: 100%|██████████| 377k/377k [00:00<00:00, 37.3MB/s]
Generating train split: 100%|██████████| 8551/8551 [00:00<00:00, 18672.50 examples/s]
Generating validation split: 100%|██████████| 1043/1043 [00:00<00:00, 32312.97 examples/s]
Generating test split: 100%|██████████| 1063/1063 [00:00<00:00, 30586.16 examples/s]


## Preprocess data

We will preprocess our data before we save it. In this case it means tokenizing the data to prepare it for the model specified in the `src/config.yml` file.

In [4]:
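# Load the tokenizer that matches the model named in src/config.yml
tokenizer = transformers.AutoTokenizer.from_pretrained(config['model'])


def tokenizer_function(examples):
    # Pad every sentence to the tokenizer's maximum length and truncate longer ones
    return tokenizer(examples['sentence'], padding="max_length", truncation=True)


# Tokenize all splits in batches, then write the result to local disk
data = data.map(tokenizer_function, batched=True)
data.save_to_disk(config['data_dir'])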

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 18.1kB/s]
config.json: 100%|██████████| 483/483 [00:00<00:00, 269kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 14.6MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 58.1MB/s]
Map: 100%|██████████| 8551/8551 [00:01<00:00, 6947.50 examples/s]
Map: 100%|██████████| 1043/1043 [00:00<00:00, 9578.07 examples/s]
Map: 100%|██████████| 1063/1063 [00:00<00:00, 9646.73 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 8551/8551 [00:00<00:00, 50101.27 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1043/1043 [00:00<00:00, 13579.87 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1063/1063 [00:00<00:00, 11974.01 examples/s]
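
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the `transformers` implementation: the function name `pad_or_truncate`, the token ids, and the pad id of `0` are made up for illustration, and a real tokenizer also preserves special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus a mask marking the real tokens."""
    kept = token_ids[:max_length]    # truncation: drop anything past max_length
    n_pad = max_length - len(kept)   # padding: how many slots remain to fill
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Hypothetical token ids: one short sequence and one that is too long
short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([101, 2023, 2003, 1037, 3231, 102], max_length=5)
```

Every row ends up the same length, which is convenient for batching. The trade-off is that short sentences are stored with a lot of padding.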


## Connect to AzureML workspace and upload data

Now we'll connect to the AzureML workspace and Azure datastore where our dataset will live. We instantiate a connection to the workspace using a configuration file that is automatically provided on the AzureML compute instances but which [we must create](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#workspace) if we run this notebook on our own desktop or laptop machine. 

[Azure datastores](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py) can be blob storage, file shares, data lakes, and more. Here we'll use the workspace's default datastore, which is typically a blob container, but any of the other backing services can be used after first [registering them with AzureML](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data#create-and-register-datastores) through either the studio UI or the Python SDK.

In [5]:
workspace = azureml.core.Workspace.from_config()
datastore = workspace.get_default_datastore()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
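
The warning above is harmless here, and its own text names a way to silence it: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch follows, assuming it runs near the top of the notebook before the tokenizer is first used. Choosing `"false"` is an assumption that favors quiet logs over parallel tokenization.

```python
import os

# Set the value explicitly so later forks do not trigger the warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```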


Upload the data stored on local disk to the datastore. This upload overwrites any prior version at the same path. To retain prior versions within the datastore, upload to a UUID-named subdirectory or use another versioning mechanism.

In [6]:
datastore.upload(src_dir=config['data_dir'], 
                 target_path=config['data_dir'], 
                 overwrite=True
                )

"Datastore.upload" is deprecated after version 1.0.69. Please use "Dataset.File.upload_directory" to upload your files from a local directory and create FileDataset in single method call. See Dataset API change notice at https://aka.ms/dataset-deprecation.


Uploading an estimated of 10 files
Uploading data/glue/cola/dataset_dict.json
Uploaded data/glue/cola/dataset_dict.json, 1 files out of an estimated total of 10
Uploading data/glue/cola/test/data-00000-of-00001.arrow
Uploaded data/glue/cola/test/data-00000-of-00001.arrow, 2 files out of an estimated total of 10
Uploading data/glue/cola/test/dataset_info.json
Uploaded data/glue/cola/test/dataset_info.json, 3 files out of an estimated total of 10
Uploading data/glue/cola/test/state.json
Uploaded data/glue/cola/test/state.json, 4 files out of an estimated total of 10
Uploading data/glue/cola/train/dataset_info.json
Uploaded data/glue/cola/train/dataset_info.json, 5 files out of an estimated total of 10
Uploading data/glue/cola/train/state.json
Uploaded data/glue/cola/train/state.json, 6 files out of an estimated total of 10
Uploading data/glue/cola/validation/data-00000-of-00001.arrow
Uploaded data/glue/cola/validation/data-00000-of-00001.arrow, 7 files out of an estimated total of 10
Upl

$AZUREML_DATAREFERENCE_5b1633259dfc44b08d94174913652951
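
The upload above overwrites whatever was previously stored at the target path. One way to keep earlier copies is to upload each run into its own uniquely named subdirectory. The sketch below shows that idea and uses `Dataset.File.upload_directory`, the method the deprecation notice recommends. The helper names `versioned_target_path` and `upload_versioned` are hypothetical, and passing the target as a `(datastore, path)` tuple is an assumption about the SDK version in use.

```python
import posixpath
import uuid


def versioned_target_path(base_dir, version_id=None):
    """Return base_dir with a unique subdirectory appended."""
    version_id = version_id or uuid.uuid4().hex
    return posixpath.join(base_dir.rstrip("/"), version_id)


def upload_versioned(datastore, src_dir, base_dir):
    """Upload src_dir to a fresh subdirectory; return (dataset, target_path)."""
    # Imported lazily so the path helper above works without azureml installed
    from azureml.core import Dataset

    target_path = versioned_target_path(base_dir)
    dataset = Dataset.File.upload_directory(
        src_dir=src_dir,
        target=(datastore, target_path),  # assumed tuple form of the target
        overwrite=False,
    )
    return dataset, target_path
```

With this approach the call would look like `upload_versioned(datastore, config['data_dir'], config['data_dir'])`. The returned dataset is not yet registered, so the registration step is still needed.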

## Register the data as a dataset

Register the dataset with associated metadata describing the task for which it is meant and the model used to tokenize it.

In [7]:
name = config['task']
description = f"Glue dataset for {config['task']}, tokenized for {config['model']}"
tags = {"task":config['task'], "model": config['model']}
path = datastore.path(config['data_dir'])

amldataset = azureml.core.Dataset.File.from_files(path)
amldataset = amldataset.register(workspace=workspace,
                                 name=name,
                                 description=description,
                                 tags=tags,
                                 create_new_version=True)

We'll also need to know which version this created in order to use it going forward.

In [8]:
amldataset.version

1
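
For reference, a registered dataset can later be fetched by name and version with `Dataset.get_by_name`. The wrapper below is a minimal sketch; the name `load_registered_dataset` is hypothetical, and the import is deferred so the helper can be defined even where `azureml` is not installed.

```python
def load_registered_dataset(workspace, name, version="latest"):
    """Fetch a registered AzureML dataset by name and version."""
    # Imported lazily so this helper can be defined without azureml installed
    from azureml.core import Dataset

    return Dataset.get_by_name(workspace, name=name, version=version)
```

In a training script this might be called as `load_registered_dataset(workspace, config['task'], version=1)`, so the run is pinned to the exact data it was trained on.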

In this notebook we walked through the steps to:

- Prepare data before moving it to Azure
- Push data to an Azure datastore accessible by AzureML
- Register the data with AzureML as a dataset