For ease of use, we advise opening this notebook on an Amazon SageMaker notebook instance and using the conda_pytorch_latest_p36 kernel.

In [None]:
# Install the required libraries
!pip install datasets
!pip install py7zr

### Preparing the dataset

One way to prepare your dataset for training on Amazon SageMaker is to save the training, validation and test datasets separately. This decouples data preparation from training and ensures, for example, that different models can reuse the same datasets with the same split. In this example we download the [samsum dataset](https://arxiv.org/pdf/1911.12237.pdf) and prepare it for Hugging Face using the [datasets](https://github.com/huggingface/datasets) library. Any dataset containing text and summaries could work here.

We first import the required packages and define the S3 prefix under which the data will be saved:

In [None]:
import os
import json
import io, boto3, sagemaker
import pandas as pd

from datasets import load_dataset, filesystems, DatasetDict


s3_resource = boto3.resource('s3')
session = sagemaker.Session()
session_bucket = session.default_bucket()

s3_prefix = 'samsum-dataset'

Download the samsum dataset using curl. If you are using your own custom dataset, you do not need to run this cell.

In [None]:
%%sh
mkdir -p corpus && cd corpus
curl https://arxiv.org/src/1911.12237v2/anc/corpus.7z --output corpus.7z
py7zr x corpus.7z
rm corpus.7z

In [None]:
# Convert the JSON files to JSON Lines so they can be loaded with load_dataset
# and later saved in Hugging Face dataset format

data_path = 'corpus/'

for file in os.listdir(data_path):
    if file.endswith('.json'):
        # Each source file holds a single JSON array of records
        with open(os.path.join(data_path, file)) as f_in:
            records = json.load(f_in)
        # Write one JSON object per line to a sibling .jsonl file
        with open(os.path.join(data_path, file.replace('.json', '.jsonl')), 'w') as f_out:
            f_out.write('\n'.join(map(json.dumps, records)))
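
The conversion above can be tried on a small file before running it on the full corpus. The following is a minimal, self-contained sketch: the helper name `json_array_to_jsonl` and the two sample records are made up for illustration (they only mimic the `id` / `dialogue` / `summary` fields of samsum), and everything is written to a temporary directory.

```python
import json
import os
import tempfile


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a JSON file holding a list of records as one JSON object per line.

    Hypothetical helper for illustration; it is not part of the datasets library.
    """
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line
            dst.write(json.dumps(record) + "\n")
    return len(records)


# Made-up records shaped like samsum entries (assumption for this sketch)
sample = [
    {"id": "0", "dialogue": "Amy: Lunch?\nBen: Sure, at noon.", "summary": "Amy and Ben will have lunch at noon."},
    {"id": "1", "dialogue": "Cara: Are you home?\nDan: Not yet.", "summary": "Dan is not home yet."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "toy.json")
    dst_path = os.path.join(tmp_dir, "toy.jsonl")
    with open(src_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    written = json_array_to_jsonl(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

# Parsing every line back should reproduce the original records
round_tripped = [json.loads(line) for line in lines]
print(written, "records written;", "round trip ok:", round_tripped == sample)
```

Multi-line dialogues are the main thing to watch for: because the line breaks are escaped inside the JSON string, each record still occupies exactly one line of the output file.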


In [None]:
# TO USE YOUR OWN CUSTOM DATASET, UNCOMMENT THE CODE BELOW
# If you have a single CSV/JSON file, you can use the datasets.Dataset.train_test_split() method to shuffle and split your data.
# The splits are shuffled by default. You can deactivate this behavior by setting shuffle=False.


# # For a single JSON file
# dataset_json = load_dataset('json', data_files='path_to_your_file', split='train') # path to your file

# # Replace 'json' with 'csv' if you are using a single CSV file; the rest of the steps are exactly the same
# # dataset_csv = load_dataset('csv', data_files='path_to_your_file', split='train') # path to your file


# # Split into 70% train, 30% test + validation
# train_test_validation = dataset_json.train_test_split(test_size=0.3)

# # Split 30% test + validation into half test, half validation
# test_validation = train_test_validation['test'].train_test_split(test_size=0.5)

# # Gather the splits  to have a single DatasetDict

# train_test_valid_dataset = DatasetDict({
#     'train': train_test_validation['train'],
#     'validation': test_validation['train'],
#     'test': test_validation['test'],})
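
To make the arithmetic of the two-stage split concrete, here is a plain-Python sketch of the same idea: hold out 30% of the records, then cut the held-out part in half. It is an illustrative analogue only, not how `train_test_split` is implemented; the function name `two_stage_split`, the seed and the rounding rule are assumptions made for this example.

```python
import random


def two_stage_split(records, holdout_fraction=0.3, seed=42):
    """Shuffle, hold out a fraction, then halve the held-out part.

    Hypothetical helper mirroring the 70/15/15 recipe in plain Python.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    # Stage 1: separate the training rows from the held-out rows
    n_holdout = round(len(shuffled) * holdout_fraction)
    train = shuffled[: len(shuffled) - n_holdout]
    holdout = shuffled[len(shuffled) - n_holdout:]

    # Stage 2: divide the held-out rows into validation and test
    n_test = len(holdout) // 2
    validation = holdout[: len(holdout) - n_test]
    test = holdout[len(holdout) - n_test:]

    return {"train": train, "validation": validation, "test": test}


splits = two_stage_split(range(100))
sizes = {name: len(rows) for name, rows in splits.items()}
print(sizes)  # {'train': 70, 'validation': 15, 'test': 15}
```

Fixing the seed is what lets different models reuse exactly the same split, which is the point of preparing the data once and storing it separately from training.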

In [None]:
# If you are using the samsum dataset that is already split, you can simply load the separate files

dataset = load_dataset('json', data_files={'train': 'corpus/train.jsonl',
                                           'validation': 'corpus/val.jsonl',
                                           'test': 'corpus/test.jsonl'})

In [None]:
dataset

In [None]:
print('DIALOGUE\n{dialogue}'.format(dialogue=dataset['train']['dialogue'][0]))
print('\nSUMMARY\n{summary}'.format(summary=dataset['train']['summary'][0]))

Finally, we save the training, validation and test splits in Hugging Face dataset format and upload them to S3.

The saved dataset will then be used in the 02_finetune_deploy.ipynb notebook for model training.

##### Use the save_to_disk method to save your dataset directly to S3 in Hugging Face dataset format. The format is backed by Apache Arrow, which enables memory-mapped, zero-copy reads so that large datasets can be processed without loading them fully into memory. You can use the load_from_disk method in your training script to load the dataset in the format in which it was saved.

In [None]:
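# Note: the whole DatasetDict (train, validation and test splits) is saved under the train/ prefix.
# filesystems.S3FileSystem and the fs argument belong to older releases of datasets;
# newer releases may expect the storage_options argument instead, so check your installed version.
s3 = filesystems.S3FileSystem()
dataset.save_to_disk(f's3://{session_bucket}/{s3_prefix}/train/', fs=s3)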