# Hugging Face SageMaker SDK - Run a batch transform inference job with 🤗 Transformers


In this lab, we will deploy one of the 10 000+ Hugging Face Transformers models from the [Hub](https://huggingface.co/models) to Amazon SageMaker for batch inference.

1. [Setup](#Setup)
2. [Run Batch Transform Inference Job with a fine-tuned model using `jsonl`](#Run-Batch-Transform-Inference-Job-with-a-fine-tuned-model-using-jsonl)
3. [Download Dataset](#Download-Dataset)
4. [Data Pre-Processing](#Data-Pre-Processing)
5. [Download pre-trained model](#Download-pre-trained-model)
6. [Package pre-trained model into .tar.gz format](#Package-pre-trained-model-into-.tar.gz-format)
7. [Upload model to s3](#Upload-model-to-s3)
8. [Run batch transform job for offline scoring](#Run-batch-transform-job-for-offline-scoring)

## Setup

In [None]:
!pip install torch

In [None]:
!pip install "sagemaker>=2.48.0" --upgrade
!pip install "transformers<=4.12.2" -q
!pip install ipywidgets -q

In [None]:
!pip install datasets

In [None]:
# restart kernel after installing the packages
from IPython.display import display_html
def restartkernel():
    display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)
restartkernel()

In [None]:
import sagemaker
sagemaker.__version__

In [None]:
import torch
torch.__version__

## Run Batch Transform Inference Job with a fine-tuned model using `jsonl`

### Download Dataset
Download the `tweet_eval` dataset from the datasets library.

In [None]:
from datasets import load_dataset
dataset = load_dataset("tweet_eval", "sentiment")

In [None]:
tweet_text = dataset['validation'][:]['text']

## Data Pre-Processing

The validation split contains ~2000 tweets. We will write them to a `jsonl` file (one JSON object per line) and upload it to S3. Because text can contain commas, quotes, and line breaks, only `jsonl` files are supported for batch/async inference.

_**NOTE**: While preprocessing you need to make sure that your `inputs` fit the `max_length`._
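
As a rough illustration of the note above, here is a minimal sketch of trimming a text to a token budget before writing it to the `jsonl` file. The helper `truncate_text`, the `WhitespaceTokenizer` stand-in, and the `reserved=2` allowance for special tokens are assumptions made for this example and are not part of the notebook. In the notebook you would pass the Hugging Face tokenizer instead, whose `encode` and `decode` methods have the same shape, and pick a `max_length` that matches your model.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text, add_special_tokens=False):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_text(text, tokenizer, max_length=512, reserved=2):
    """Trim text so it encodes to at most max_length - reserved tokens.

    reserved is an assumed allowance for special tokens such as [CLS] and [SEP].
    """
    budget = max_length - reserved
    tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(tokens) <= budget:
        return text  # already short enough, keep the original string
    return tokenizer.decode(tokens[:budget])


tok = WhitespaceTokenizer()
short = truncate_text("a b c d e f", tok, max_length=5, reserved=2)
print(short)
```

Decoding a truncated token list does not always reproduce the original characters exactly, so treat the result as an approximation of the input.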

In [None]:
import json
import sagemaker
from sagemaker.s3 import S3Uploader,s3_path_join

# get the s3 bucket
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
sagemaker_session_bucket = sess.default_bucket()

# dataset files
dataset_jsonl_file="tweet_data.jsonl"
data_json = []
with open(dataset_jsonl_file, "w+") as outfile:
    for row in tweet_text:
        # remove @
        row = row.replace("@","")
        json.dump({
            'inputs': str(row)
        }, outfile)
        data_json.append({
            'inputs': str(row)
        })
        outfile.write('\n')

In [None]:
# uploads a given file to S3.
input_s3_path = s3_path_join("s3://",sagemaker_session_bucket,"batch_transform/input")
output_s3_path = s3_path_join("s3://",sagemaker_session_bucket,"batch_transform/output")
s3_file_uri = S3Uploader.upload(dataset_jsonl_file,input_s3_path)

print(f"{dataset_jsonl_file} uploaded to {s3_file_uri}")

The created file looks like this:

```json
{"inputs": "Dark Souls 3 April Launch Date Confirmed With New Trailer: Embrace the darkness."}
{"inputs": "\"National hot dog day, national tequila day, then national dance day... Sounds like a Friday night.\""}
{"inputs": "When girls become bandwagon fans of the Packers because of Harry.   Do y'all even know who Aaron Rodgers is?  Or what a 1st down is?"}
{"inputs": "user I may or may not have searched it up on google"}
{"inputs": "Here's your starting TUESDAY MORNING Line up at  Gentle Yoga with Laura 9:30 am to 10:30 am..."}
{"inputs": "VirginAmerica seriously would pay $30 a flight for seats that didn't h...."}
{"inputs": "user F-Main, are you in the office tomorrow if I send over some Curtis proofs c/o you, for you and a few colleagues?\""}
{"inputs": "#US 1st Lady Michelle Obama speaking at the 2015 Beating the Odds Summit to over 130 college-bound students at the pentagon office."}
{"inputs": "Omg this show is so predictable even for the 3rd ep. Rui En\\u2019s ex boyfriend was framed for murder probably\\u002c by a guy."}
{"inputs": "\"What a round by Paul Dunne, good luck tomorrow and I hope you win the Open.\""}
{"inputs": "Irving Plaza NYC Blackout Saturday night. Got limited spots left on the guest list. Tweet me why you think you deserve them"}
....
```
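
Before uploading, it can help to confirm that every line of the file is a standalone JSON object with an `inputs` string, since a stray trailing comma or a missing quote on one line would make that record unparseable. The following is a minimal sketch; the helper name `validate_jsonl` and the sample lines are made up for illustration.

```python
import json


def validate_jsonl(lines, required_key="inputs"):
    """Return (line_number, problem) tuples; an empty list means all lines are valid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict) or not isinstance(record.get(required_key), str):
            problems.append((number, f"missing string field '{required_key}'"))
    return problems


sample = [
    '{"inputs": "user I may or may not have searched it up on google"}',
    '{"inputs": "a trailing comma is not valid jsonl"},',
    '{"text": "wrong key"}',
]
problems = validate_jsonl(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check the real file, pass an open file handle, for example `validate_jsonl(open("tweet_data.jsonl"))`.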

## Download pre-trained model

We use the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model to run our batch transform job.


In [None]:
### Download Hugging Face Pretrained Model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
MODEL = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model.save_pretrained('model_token')
tokenizer.save_pretrained('model_token')

## Package pre-trained model into .tar.gz format

In [None]:
# package pre-trained model into .tar.gz format
!cd model_token && tar zcvf model.tar.gz * 
!mv model_token/model.tar.gz ./model.tar.gz
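
The shell commands above run `tar` from inside `model_token`, so the model files end up at the root of the archive rather than under a `model_token/` folder. A minimal Python sketch of the same idea, using the standard `tarfile` module, is shown below. The helper name `package_model` and the placeholder files are assumptions for this example; the demo writes to a temporary directory and does not touch the real model.

```python
import os
import tarfile
import tempfile


def package_model(model_dir, archive_path):
    """Archive every entry of model_dir at the archive root and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname drops the directory prefix so files sit at the archive root
            tar.add(os.path.join(model_dir, name), arcname=name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo with placeholder files standing in for the saved model and tokenizer
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "model_token")
    os.makedirs(model_dir)
    for name in ("config.json", "pytorch_model.bin", "vocab.txt"):
        with open(os.path.join(model_dir, name), "w") as handle:
            handle.write("placeholder")
    members = package_model(model_dir, os.path.join(workdir, "model.tar.gz"))
    print(members)
```

Listing the member names after writing is a quick way to spot an archive that accidentally nests the files one level too deep.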

## Upload model to s3

In [None]:
# upload pre-trained model to s3 bucket
model_url = s3_path_join("s3://",sagemaker_session_bucket,"batch_transform/model")
print(f"Uploading Model to {model_url}")
model_uri = S3Uploader.upload('model.tar.gz',model_url)
print(f"Uploaded model to {model_uri}")

## Run batch transform job for offline scoring

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=model_uri, # S3 URI of the packaged model.tar.gz
   role=role, # IAM role with permissions to create the model and run the transform job
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version='py36', # python version used
)

# create Transformer to run our batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_s3_path, # S3 prefix for the results (batch_transform/output, separate from the input prefix)
    strategy='SingleRecord')

# starts batch transform job and uses s3 data as input
batch_job.transform(
    data=s3_file_uri,
    content_type='application/json',    
    split_type='Line')
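
To make the combination of `split_type='Line'` and `strategy='SingleRecord'` more concrete, here is a small local simulation: the input file is split on newlines and each line is sent as its own request. The functions `split_records` and `fake_predict` are hypothetical stand-ins written for this sketch; they do not call SageMaker, and the labels they return are made up. The sketch also shows why the output file can contain JSON arrays written back to back when `assemble_with` is not set on the transformer, which is the layout the parsing step later in this notebook works around.

```python
import json


def split_records(jsonl_text):
    """Return one payload per non-empty line, mimicking split_type='Line'."""
    return [line for line in jsonl_text.splitlines() if line.strip()]


def fake_predict(payload):
    """Hypothetical stand-in for the model container; returns a JSON array per request."""
    text = json.loads(payload)["inputs"]
    label = "POSITIVE" if "good" in text.lower() else "NEGATIVE"
    return json.dumps([{"label": label, "score": 0.99}])


jsonl_text = (
    '{"inputs": "Good luck tomorrow"}\n'
    '{"inputs": "This show is so predictable"}\n'
)

payloads = split_records(jsonl_text)            # one request per record
responses = [fake_predict(p) for p in payloads]

concatenated = "".join(responses)               # results written back to back
line_assembled = "\n".join(responses)           # one result per line, for comparison

print(len(payloads))
print(concatenated)
```

If one result per line is preferable, passing `assemble_with='Line'` to `transformer()` should produce that layout; the parsing code then needs to be adapted accordingly.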

In [None]:
import json
from sagemaker.s3 import S3Downloader
from ast import literal_eval
# creating s3 uri for result file -> input file + .out
output_file = f"{dataset_jsonl_file}.out"
output_path = s3_path_join(output_s3_path,output_file)

# download file
S3Downloader.download(output_path,'.')

batch_transform_result = []
with open(output_file) as f:
    for line in f:
        # converts back-to-back json arrays on this line into one list
        line = "[" + line.replace("[","").replace("]",",") + "]"
        # extend instead of assign so results from earlier lines are kept
        batch_transform_result.extend(literal_eval(line))
        
# print results 
batch_transform_result[:3]
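
The string-replace parsing above depends on the predictions never containing `[` or `]` inside a value. As an alternative, here is a minimal sketch that decodes back-to-back JSON values with `json.JSONDecoder.raw_decode` and then pairs each prediction with its input. The helper name `parse_concatenated_json` and the sample data are assumptions for this example, and the pairing assumes the results come back in the same order as the input lines.

```python
import json


def parse_concatenated_json(text):
    """Decode JSON values written back to back and flatten any lists into one list."""
    decoder = json.JSONDecoder()
    results, position = [], 0
    while position < len(text):
        if text[position].isspace():
            position += 1  # raw_decode does not skip leading whitespace itself
            continue
        value, position = decoder.raw_decode(text, position)
        if isinstance(value, list):
            results.extend(value)
        else:
            results.append(value)
    return results


# Made-up output in the same shape as the .out file
raw_output = '[{"label": "POSITIVE", "score": 0.98}][{"label": "NEGATIVE", "score": 0.91}]\n'
predictions = parse_concatenated_json(raw_output)

# Pair predictions with their inputs (assumes matching order)
inputs = ["first tweet", "second tweet"]
paired = [{"inputs": text, **pred} for text, pred in zip(inputs, predictions)]
print(paired)
```

In the notebook, `raw_output` would be the contents of the downloaded `.out` file and `inputs` would come from `data_json`.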