# AWS Machine Learning Purpose-built Accelerators Tutorial
## Learn how to use [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) with [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to optimize your ML workloads
## Part 1/3 - Preparing a SPAM/NOT SPAM dataset for text classification

**SageMaker Studio kernel: PyTorch 1.13 Python 3.9 CPU - ml.t3.medium**

This exercise is part of an end-to-end tutorial that shows how to:
  - 1) Prepare a SPAM/NOT SPAM dataset for text classification
  - 2) Fine-tune a BERT base model for text classification (binary: 0=NOT_SPAM, 1=SPAM) on a Trn1 instance
  - 3) Compile and deploy the trained model to an Inf2 instance

## 1) Install some required packages

In [None]:
%pip install -r requirements.txt

## 2) Download and visualize a sample of the raw dataset

In [None]:
import os
from datasets import load_dataset
from transformers import AutoTokenizer

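max_sen_len = 256  # every sample is padded or truncated to this many tokens
model_id = 'bert-base-uncased'
dataset_path = os.path.join('datasets', 'spam')  # local output directory

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = 0  # [PAD] already has id 0 in bert-base-uncased; set explicitly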

In [None]:
dataset = load_dataset('Deysi/spam-detection-dataset')
dataset['test'].to_pandas().head()
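
Before tokenizing, it can help to check how balanced the two classes are and that every label string is one the mapping in the next step expects. The sketch below uses a small hypothetical list (`sample_labels`) in place of the real `label` column; the helper `count_labels` and its `expected` parameter are illustrative names, not part of the tutorial.

```python
from collections import Counter

def count_labels(label_column, expected=("not_spam", "spam")):
    """Count each label and fail early on values outside `expected`."""
    counts = Counter(label_column)
    unknown = set(counts) - set(expected)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return counts

# Hypothetical stand-in for the dataset's `label` column
sample_labels = ["spam", "not_spam", "spam", "not_spam", "not_spam"]
label_counts = count_labels(sample_labels)
print(label_counts)
```

On the real data, the same function could be called as `count_labels(dataset['train']['label'])`, assuming the column holds the strings `spam` and `not_spam`.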

## 3) Convert the dataset to padded tokens and visualize a sample

In [None]:
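import torch  # backend required by with_format("torch") below

# Map the dataset's string labels to the integer classes the model expects
labels = {'not_spam': 0, 'spam': 1}

def preprocess_function(examples):
    # Pad or truncate every text to exactly max_sen_len tokens
    inp = tokenizer(examples["text"], padding='max_length', truncation=True, max_length=max_sen_len)
    inp['labels'] = [labels[e] for e in examples['label']]
    return inp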

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset['train'].column_names,
)
tokenized_dataset = tokenized_dataset.with_format("torch")

# Save both splits to disk; the test split is stored as "eval"
tokenized_dataset['train'].save_to_disk(os.path.join(dataset_path,"train"))
tokenized_dataset['test'].save_to_disk(os.path.join(dataset_path,"eval"))

In [None]:
tokenized_dataset['train'].to_pandas().head()
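
To make the padded columns above easier to read, here is a minimal pure-Python sketch of what `padding='max_length'` combined with `truncation=True` amounts to for a single sequence. It is a simplification, not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the content ids are made up, and it assumes the usual `bert-base-uncased` special-token ids (`[CLS]`=101, `[SEP]`=102, `[PAD]`=0) and a `max_length` of at least 2.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids for bert-base-uncased

def pad_or_truncate(content_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], then cut the content if needed
    content = list(content_ids)[: max_length - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remainder with padding that the attention mask marks as ignorable
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# Hypothetical content token ids; max_length=8 instead of 256 for readability
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=8)
long_ids, long_mask = pad_or_truncate(range(1000, 1020), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

A short text ends in a run of zeros in both `input_ids` and `attention_mask`, while a long text fills every position and its mask is all ones.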

## 4) Upload the dataset to S3

In [None]:
import sagemaker
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
print(f"Bucket: {bucket}")
sagemaker_session.upload_data(dataset_path, bucket=bucket, key_prefix=dataset_path)
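
A later step will likely need the S3 locations of the two splits. Assuming `upload_data` stores a directory's files under `s3://<bucket>/<key_prefix>/` with their relative paths preserved, the locations can be composed as in the sketch below. The helper `s3_uri` and the bucket name are hypothetical; the sketch joins key parts with `/` because S3 keys do not use the operating system's path separator.

```python
import posixpath

def s3_uri(bucket, *key_parts):
    """Join key parts with '/' and prepend the s3:// scheme and bucket."""
    cleaned = (p.replace("\\", "/").strip("/") for p in key_parts)
    return f"s3://{bucket}/{posixpath.join(*cleaned)}"

# Hypothetical bucket name; in the notebook it comes from default_bucket()
example_bucket = "sagemaker-us-east-1-111122223333"
train_uri = s3_uri(example_bucket, "datasets/spam", "train")
eval_uri = s3_uri(example_bucket, "datasets/spam", "eval")
print(train_uri)
print(eval_uri)
```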

## 5) (Optional) Check how the Collator will create batches for the model

In [None]:
import torch
from datasets import load_from_disk
from transformers import DefaultDataCollator

dataset = load_from_disk('datasets/spam/eval')
collator = DefaultDataCollator(return_tensors="pt")
it = iter(dataset)
batch = [next(it) for _ in range(5)]  # take the first 5 samples
batch = collator(batch)  # combine the samples into one tensor per field
print("\n".join([f"{k}:\t{v.shape}" for k,v in batch.items()]))
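
Conceptually, the collator turns a list of per-sample dictionaries into one dictionary whose values carry a leading batch dimension. The following is a rough pure-Python sketch of that idea using nested lists instead of tensors; `collate` and `shape` are hypothetical helpers and the samples are made up, so this is not the library's implementation.

```python
def collate(examples):
    """Group a list of per-sample dicts into one dict of per-field lists."""
    keys = examples[0].keys()
    return {k: [ex[k] for ex in examples] for k in keys}

def shape(value):
    """Shape of a non-empty nested list, e.g. 5 rows of 8 ints -> (5, 8)."""
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        value = value[0]
    return tuple(dims)

# Five hypothetical samples with sequence length 8 instead of 256
samples = [
    {"input_ids": [101, 7592, 102, 0, 0, 0, 0, 0],
     "attention_mask": [1, 1, 1, 0, 0, 0, 0, 0],
     "labels": i % 2}
    for i in range(5)
]
toy_batch = collate(samples)
batch_shapes = {k: shape(v) for k, v in toy_batch.items()}
print(batch_shapes)
```

With 5 samples and `max_sen_len=256`, the notebook cell above should analogously report a shape of `[5, 256]` for each token field and `[5]` for `labels`.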

## 6) Now it is time to fine-tune our model

[Open Training Notebook](02_ModelFineTuning.ipynb)