# About

This notebook runs the model training code in the `sagemaker_train` directory, which contains the following folders and scripts:

*   `src` (directory)
  * `__init__.py`
  * `model.py`: entry point used to train the model; uses input arguments
      * includes `ClassifierDataset` class that converts a `csv` or `json` file into a torch dataset, which is used as input data for the model training step
      * `def main` parses the input arguments with argparse, then trains and evaluates the model
      * `def preprocess_data` splits the training dataset into train/test (default test size is 0.2)
  * `utils.py`: converts an object (`csv` or `json` file) into a Pandas dataframe, which is used in `model.py` as input data
*   `data` (directory)
  * includes training data files that can be used for the model
  * `wiki_attacks.csv` is the recommended file for training a reasonably well-performing model
* `model` (directory): where the trained model will be saved
* `eval_results` (directory): where the evaluation results (`json` format) will be saved
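
The train/test split described for `preprocess_data` can be pictured with a minimal, standard-library sketch. The helper name `split_rows` and the fixed seed are illustrative assumptions, not the code in `model.py`, which may well rely on `sklearn.model_selection.train_test_split` instead.

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper for illustration; not part of model.py.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(1000), test_size=0.2)
print(len(train), len(test))
```

Every row lands in exactly one of the two lists, so the evaluation data is never seen during training.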

# Prepare Environment

If you are running the code in a notebook, be sure to install the following libraries:

In [1]:
!pip install torch # version 1.10.0
!pip install transformers # version 4.15.0

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

Make sure the working directory is set to the `src` directory

In [8]:
import os

os.chdir('/content/drive/MyDrive/sm_training/sagemaker_train/src')
!pwd

/content/drive/MyDrive/sm_training/sagemaker_train/src


# Defining Training Arguments & Hyperparameters

The following parameters can be specified and passed to the training script:

```
    # Hyperparameters from launch_training_job.py get passed in as command line args.
    parser.add_argument('--input_path', type=str)
    parser.add_argument('--train_size', type=float, default=.85)
    parser.add_argument('--adam_epsilon', type=float, default=1e-8)
    parser.add_argument('--epochs', type=int, default=2)
    parser.add_argument('--learning_rate', type=float, default=5e-5)
    parser.add_argument('--weight_decay', type=float, default=0.0)
    parser.add_argument('--max_data_rows', type=int, default=None)
    parser.add_argument('--max_sequence_length', type=int, default=128)
    parser.add_argument('--model_name', type=str, default='distilbert-base-uncased')
    parser.add_argument('--train_batch_size', type=int, default=16)
    parser.add_argument('--valid_batch_size', type=int, default=128)
    parser.add_argument('--file_type', type=str, default='csv') # specify whether input file is csv or json (has to be one of the two)
    parser.add_argument('--eval_dir', type=str, default='../eval_results') # set this to SM's model_dir path when using in SageMaker
    parser.add_argument('--model_dir', type=str, default='../model') # where trained model is saved when running the script locally (outside of SageMaker)
```
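
To see how these arguments behave, the sketch below rebuilds a subset of the parser and parses a sample argument list without touching the real command line. The `build_parser` wrapper and the `choices` restriction on `--file_type` are additions for illustration; the argument names and defaults are copied from the listing above.

```python
import argparse

def build_parser():
    """Rebuild a subset of the training script's parser (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_path', type=str)
    parser.add_argument('--epochs', type=int, default=2)
    parser.add_argument('--learning_rate', type=float, default=5e-5)
    parser.add_argument('--model_name', type=str, default='distilbert-base-uncased')
    parser.add_argument('--train_batch_size', type=int, default=16)
    # Assumption: `choices` is added here to enforce the csv/json restriction;
    # the original listing only states it in a comment.
    parser.add_argument('--file_type', type=str, default='csv', choices=['csv', 'json'])
    return parser

# Passing a list to parse_args() avoids reading sys.argv, which is handy in a notebook.
args = build_parser().parse_args([
    '--input_path', 'data/wiki_attacks.csv',
    '--epochs', '4',
    '--model_name', 'distilroberta-base',
])
print(args.epochs, args.learning_rate, args.train_batch_size, args.file_type)
```

Arguments that are not supplied fall back to their defaults, so only the values that differ from the defaults need to appear in the training command.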

The `TrainingArguments` in the `model.py` script also set parameters that are not exposed as command-line arguments, such as `warmup_steps` (500) and `logging_steps` (10). Edit these values in the script to change the warmup length or the logging frequency. Because `evaluation_strategy="steps"` is set without `eval_steps`, evaluation runs at the same interval as logging.

```
    training_args = TrainingArguments(
        output_dir=os.path.join(args.model_dir, "output"),
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.valid_batch_size,
        learning_rate=args.learning_rate,
        adam_epsilon=args.adam_epsilon,
        warmup_steps=500,
        weight_decay=args.weight_decay,
        logging_dir=os.path.join(args.model_dir, "logs"),
        logging_steps=10,
        evaluation_strategy="steps",
        load_best_model_at_end=True
    )
```
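
The effect of `warmup_steps` is easier to judge with numbers. The sketch below assumes the `Trainer` default of a linear schedule (linear warmup, then linear decay to zero); `linear_warmup_lr` is a hypothetical helper written for this illustration, and the step count assumes 850 training rows, a batch size of 16, 4 epochs, and a single device.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Illustrative numbers only (see the assumptions in the text above).
steps_per_epoch = math.ceil(850 / 16)
total_steps = steps_per_epoch * 4
peak = max(linear_warmup_lr(s, 5e-5, 500, total_steps) for s in range(total_steps + 1))
print(total_steps, peak)
```

Under these assumptions the run finishes before the 500 warmup steps are over, so the learning rate never reaches the configured `5e-5`. For small datasets it may be worth lowering `warmup_steps` accordingly.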

These are sample argument values:
```
# model_name = 'distilbert-base-uncased'
model_name = 'distilroberta-base'
max_sequence_length = 128
input_file = '/content/drive/MyDrive/datasets/wiki_attacks.csv'
output_dir = 'results'
epochs = 2
train_batch_size = 32
valid_batch_size = 128
learning_rate = 5e-5
adam_epsilon = 1e-8
weight_decay = 0.0
logging_dir = 'logs'
```

# Begin Training Job

Run the following command to start training in the notebook:

In [15]:
!python model.py --model_name distilroberta-base --input_path /content/drive/MyDrive/sm_training/sagemaker_train/data/wiki_attacks_sample.csv --file_type csv --test_size .15 --epochs 4 --train_batch_size 16 --model_dir /content/drive/MyDrive/sm_training/sagemaker_train/distilroberta_model --eval_dir /content/drive/MyDrive/sm_training/sagemaker_train/distilroberta_model/output

Data contains 1000 rows
{'input_ids': tensor([    0, 25194,   328,  1437, 20920,     6,     8,  2814,     7, 28274,
            6, 27785,    38,  1034,    47,   101,     5,   317,     8,  2845,
            7,   489,  8216,     4,  1773,    38,   192,    47,   348,   416,
           57,  2171,   259,     6,   905,   162,    95,   492,    47,    10,
          367,  5678,    14,    32,   460,  5616,    25,    10, 14732,  5135,
         4704,    35,  1009,  6179,     7,  3116,    10,   372,  1566,  1009,
         5404,  2838,     6,   714,     6,  2883,     6,     8,  3184, 35950,
         1009,  2264, 28274,    16,    45,  1009, 45445, 45836,  1009, 47681,
           18, 26266, 13565,  1986,  1009, 42124,  4583,    38,  1034,    47,
         2254,  5390,   259,     8,   145,    10, 40823, 34740,   811,     4,
          318,    47,    33,   143,  1142,     6,   192,     5,   244,  6052,
            6,  1606,    10,   864,    23,     5,  3375,  9296,    50,   619,
          481,     7,  139

# Inference Code (in progress)

Resources:
* https://github.com/aws-samples/amazon-sagemaker-bert-pytorch/blob/master/code/deploy_ei.py
* https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/master/src/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py
* https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/default_inference_handler.py

Import dependencies

Note: working directory should still be the `src` directory

In [24]:
import argparse
import os
import json

# import boto3
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, average_precision_score, roc_auc_score, f1_score
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler
from transformers import AdamW, AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

from utils import read_object

Set variables for inference code

In [20]:
model_dir = '/content/drive/MyDrive/sm_training/sagemaker_train/distilroberta_model'

In [21]:
def model_fn(model_dir):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("================ objects in model_dir ===================")
    print(os.listdir(model_dir))
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    print("================ model loaded ===========================")
    return model.to(device)

In [None]:
model_fn(model_dir)

In [None]:
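# Assumptions: the tokenizer matches the model trained above and MAX_LEN matches
# the --max_sequence_length default; adjust both if training used other values.
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
MAX_LEN = 128

def input_fn(request_body, request_content_type):
    """Parse a JSON request body into padded input IDs and an attention mask."""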
    if request_content_type == "application/json":
        data = json.loads(request_body)
        print("================ input text ===============")
        print(data)
        
        if isinstance(data, str):
            data = [data]
        elif isinstance(data, list) and len(data) > 0 and isinstance(data[0], str):
            pass
        else:
            raise ValueError("Unsupported input type. Input must be a string or a non-empty list of strings. "
                             "I got {}".format(data))
                       
        #encoded = [tokenizer.encode(x, add_special_tokens=True) for x in data]
        #encoded = tokenizer(data, add_special_tokens=True) 
        
        # for backward compatibility use the following way to encode 
        # https://github.com/huggingface/transformers/issues/5580
        input_ids = [tokenizer.encode(x, add_special_tokens=True) for x in data]
        
        print("================ encoded sentences ==============")
        print(input_ids)

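        # pad shorter sentences with the tokenizer's pad token; 0 is not the pad ID
        # for every model (in RoBERTa-style vocabularies, 0 is the <s> start token)
        padded = torch.full((len(input_ids), MAX_LEN), tokenizer.pad_token_id, dtype=torch.long)
        # create mask: 1 for real tokens, 0 for padding
        mask = torch.zeros(len(input_ids), MAX_LEN, dtype=torch.long)
        for i, p in enumerate(input_ids):
            p = p[:MAX_LEN]  # truncate sequences longer than MAX_LEN
            padded[i, :len(p)] = torch.tensor(p)
            mask[i, :len(p)] = 1
        
        print("================= padded input and attention mask ================")
        print(padded, '\n', mask)

        return padded, mask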
    raise ValueError("Unsupported content type: {}".format(request_content_type))
    

def predict_fn(input_data, model):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    input_id, input_mask = input_data
    input_id = input_id.to(device)
    input_mask = input_mask.to(device)
    print("============== encoded data =================")
    print(input_id, input_mask)
    with torch.no_grad():
        y = model(input_id, attention_mask=input_mask)[0]
        print("=============== inference result =================")
        print(y)
    return y
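
`predict_fn` returns raw logits, one row per input text. A minimal sketch of turning one row of logits into class probabilities is shown below in plain Python; in the notebook itself, `torch.softmax(y, dim=-1)` would do the same on the whole tensor. The logit values are made up for illustration, and which class index corresponds to which label depends on how the training data was encoded.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input text with two classes.
probs = softmax([-1.2, 2.3])
# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```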