# SageMaker Processing Script: HuggingFace

This notebook shows a basic example of using SageMaker Processing to create train, test, and validation datasets, which are then written back to S3.

In a nutshell, we will create a `HuggingFaceProcessor` object, passing the version of HuggingFace Transformers we want to use along with our managed infrastructure requirements.

For our use case, we will download a well-known, publicly available dataset called the [Amazon Customer Reviews dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html). This dataset is composed of 130+ million customer reviews. The data is available as TSV files in the `amazon-reviews-pds` S3 bucket in the AWS US East Region. Each line in the data files corresponds to an individual review (tab-delimited, with no quote or escape characters). Samples of the data are available in English and French, and we will use both in our demo.

In [None]:
!mkdir -p .data .output
!aws s3 cp s3://amazon-reviews-pds/tsv/sample_us.tsv .data/ 
!aws s3 cp s3://amazon-reviews-pds/tsv/sample_fr.tsv .data/ 
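
Because the files are tab-delimited with no quote or escape characters, a parser should be told not to interpret quotation marks. The following is a minimal sketch using only the standard library; the helper name `read_reviews` and the three column names in the in-memory sample are assumptions for illustration, not taken from this notebook.

```python
import csv
import io


def read_reviews(handle):
    """Yield one dict per review from a tab-delimited file object."""
    # QUOTE_NONE: fields carry no quote or escape characters, so a literal "
    # inside a review must be kept as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Tiny in-memory stand-in for .data/sample_us.tsv (illustrative columns only).
sample = io.StringIO(
    'review_id\tstar_rating\treview_body\n'
    'R1\t5\tGreat 27" monitor\n'
    'R2\t2\tStopped working\n'
)
rows = list(read_reviews(sample))
print(len(rows), rows[0]["review_body"])  # prints: 2 Great 27" monitor
```

To read one of the downloaded samples instead, pass an open file, for example `open('.data/sample_us.tsv', encoding='utf-8', newline='')`.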

In [None]:
from sagemaker import Session

session = Session()
bucket = session.default_bucket()
key_prefix = 'frameworkprocessors/huggingface-example'

source_path = session.upload_data('.data', bucket=bucket, key_prefix=f'{key_prefix}/data')
source_path

## Create the script containing your Processing logic

This script is executed by Amazon SageMaker.

The `main` function performs the core operations: it reads and parses the arguments passed as parameters, unpacks the model file, loads the model, and then preprocesses the data, runs predictions, and postprocesses the results. Remember to write the data locally in the final step so that SageMaker can copy it to S3.

In [None]:
!pygmentize huggingface-processing.py
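
The cell above prints the actual script. As a rough illustration of the general shape such a script can take, here is a simplified sketch that only parses arguments, reads TSV files from the local input folder, and writes train, test, and validation splits to local output folders. It is not the contents of `huggingface-processing.py`: the argument names, the 80/10/10 split, the seed, and the helper `split_lines` are all assumptions, and the model-related steps are left out.

```python
import argparse
import random
from pathlib import Path


def split_lines(lines, seed=42, train_frac=0.8, test_frac=0.1):
    """Shuffle rows and split them into train/test/val lists (hypothetical helper)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "val": shuffled[n_train + n_test:],
    }


def main(argv=None):
    # Defaults mirror the container paths used for inputs and outputs in this notebook.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", default="/opt/ml/processing/input/data")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    args = parser.parse_args(argv)

    for tsv in sorted(Path(args.input_dir).glob("*.tsv")):
        with tsv.open(encoding="utf-8") as handle:
            lines = [line.rstrip("\n") for line in handle]
        if not lines:
            continue  # skip empty files
        header, *rows = lines
        for name, part in split_lines(rows).items():
            # Write locally; SageMaker copies these folders to S3 when the job ends.
            out_dir = Path(args.output_dir) / name
            out_dir.mkdir(parents=True, exist_ok=True)
            text = "\n".join([header, *part]) + "\n"
            (out_dir / tsv.name).write_text(text, encoding="utf-8")
```

In a standalone script, `main()` would be called under the usual `if __name__ == "__main__":` guard.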

## Create the SageMaker Processor

Once the data has been uploaded to S3, we can create the `HuggingFaceProcessor` object. We specify the version of the framework we want to use, the Python version, the role with the correct permissions to read the dataset from S3, and the instances we plan to use for our processing job.

In [None]:
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

hfp = HuggingFaceProcessor(
    role=get_execution_role(), 
    instance_count=2,
    instance_type='ml.g4dn.xlarge',
    transformers_version='4.4.2',
    pytorch_version='1.6.0', 
    base_job_name='frameworkprocessor-hf'
)

All that's left to do is to `run()` the Processing job. The `code` argument takes the Python script that contains the transformation logic, `source_dir` takes the folder holding its dependencies, and `inputs` and `outputs` describe the data the job reads and writes.

Note: the folder given in the `source_dir` argument may contain a `requirements.txt` file listing the dependencies of our script. SageMaker Processing then installs the packages specified in it by running `pip install -r requirements.txt` before launching the job itself.

In [None]:
hfp.run(
    code='huggingface-processing.py',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=source_path,
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train/'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test/'),
        ProcessingOutput(output_name='val', source='/opt/ml/processing/output/val/'),
    ],
    logs=False
)

We can now check the results of our processing job, and list the outputs from S3.

In [None]:
output_path = hfp.latest_job.outputs[0].destination

!aws s3 ls --recursive $output_path
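
Indexing `outputs[0]` relies on the order in which the outputs were declared. As a small sketch, the outputs can instead be looked up by name, assuming each entry exposes `output_name` and `destination` attributes as `ProcessingOutput` does. The helper `destinations_by_name` is hypothetical, and the stand-in objects and S3 URIs below are made up so that the example runs without a processing job.

```python
from types import SimpleNamespace


def destinations_by_name(outputs):
    """Map each output's name to its S3 destination."""
    return {out.output_name: out.destination for out in outputs}


# Stand-ins for hfp.latest_job.outputs; the S3 URIs are invented for illustration.
outputs = [
    SimpleNamespace(output_name="train", destination="s3://example-bucket/job/output/train"),
    SimpleNamespace(output_name="test", destination="s3://example-bucket/job/output/test"),
    SimpleNamespace(output_name="val", destination="s3://example-bucket/job/output/val"),
]
by_name = destinations_by_name(outputs)
print(by_name["val"])
```

With a completed job, `destinations_by_name(hfp.latest_job.outputs)["val"]` would give the prefix to pass to `aws s3 ls`.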