In this tutorial, we'll show how to fine-tune a simple sentiment analysis model on the IMDb dataset using Hugging Face's Transformers library.

IMDB: https://huggingface.co/transformers/custom_datasets.html

Deploy predict() function with Docker & AWS Lambda & AWS CDK: https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face/blob/main/app.py

With Docker & FastAPI: https://github.com/ramsrigouthamg/GPU_Docker_Deployment_HuggingFace_Summarization/blob/main/app.py
https://towardsdatascience.com/containerizing-huggingface-transformers-for-gpu-inference-with-docker-and-fastapi-on-aws-d4a83edede2f

# Data Exploration

In [1]:
FNAME = "data/imdb_kaggle.csv"

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv(FNAME)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [4]:
df.review.nunique()

49582

In [5]:
# Drop duplicate reviews; .copy() gives an independent DataFrame so the
# column assignment in the next cell does not raise SettingWithCopyWarning
df = df.drop_duplicates(subset="review").copy()

In [6]:
def cat2num(value):
    return 1 if value == "positive" else 0
    
df["sentiment"] = df["sentiment"].apply(cat2num)

# 49,582 unique reviews: 40,000 train / 5,000 validation / 4,582 test
train = df[:40000]
val = df[40000:45000]
test = df[45000:]
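The slices above take rows in file order. If a CSV happened to be sorted by label, that would skew the splits. A minimal stdlib sketch of drawing shuffled, non-overlapping row positions with a fixed seed follows; the function name `split_indices` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import random

def split_indices(n_rows, n_train, n_val, seed=42):
    """Return shuffled, disjoint (train, val, test) lists of row positions."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded, so the split is reproducible
    train_pos = positions[:n_train]
    val_pos = positions[n_train:n_train + n_val]
    test_pos = positions[n_train + n_val:]
    return train_pos, val_pos, test_pos

train_pos, val_pos, test_pos = split_indices(49582, 40000, 5000)
print(len(train_pos), len(val_pos), len(test_pos))  # 40000 5000 4582
```

With pandas, position lists like these could be passed to `df.iloc[...]` to select the rows.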


In [7]:
import time
import torch

In [8]:
def timeit(fn): 
    # *args and **kwargs are to support positional and named arguments of fn
    def get_time(*args, **kwargs): 
        start = time.time() 
        output = fn(*args, **kwargs)
        print(f"Time taken in {fn.__name__}: {time.time() - start:.3f} seconds.")
        return output  # make sure that the decorator returns the output of fn
    return get_time
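
As a quick check of the decorator, the sketch below applies an equivalent `timeit` to a toy function. `slow_square` is a hypothetical example function. `functools.wraps` and `time.perf_counter` are optional refinements that are not in the original cell: the first keeps the wrapped function's name and docstring, the second uses a monotonic clock.

```python
import functools
import time

def timeit(fn):
    @functools.wraps(fn)  # keep fn.__name__ and docstring on the wrapper
    def get_time(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        print(f"Time taken in {fn.__name__}: {time.perf_counter() - start:.3f} seconds.")
        return output  # pass the wrapped function's result through
    return get_time

@timeit
def slow_square(x, delay=0.01):
    """Toy function: sleep briefly, then return x squared."""
    time.sleep(delay)
    return x * x

result = slow_square(4)
print(result)  # 16
```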

In [9]:
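# Wrap a DataFrame split as a torch Dataset.
# Relies on the global `tokenizer`, which is created in the Data Tokenizer section below.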
class IMDbDataset(torch.utils.data.Dataset):
    @timeit
    def __init__(self, split):
        self.encodings = tokenizer(list(split.review), truncation=True, padding=True)
        self.labels = list(split.sentiment)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
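
The tokenizer output is column-oriented: one list per field, each holding one entry per review. `__getitem__` turns that into a per-example dict. A plain-Python sketch of the same indexing, using made-up token ids (illustrative values, not real tokenizer output) and no tensors:

```python
# Column-oriented encodings for two toy "reviews" (values are made up)
encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
}
labels = [1, 0]

def get_item(idx):
    """Pick entry idx from every column and attach its label."""
    item = {key: values[idx] for key, values in encodings.items()}
    item["labels"] = labels[idx]
    return item

print(get_item(1))
```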

In [10]:
small_train = df[:20]
small_eval = df[40000:40010]

# Data Tokenizer

In [11]:
base_model = "distilbert-base-uncased"

In [12]:
# Load the fast DistilBERT tokenizer that matches the base checkpoint
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained(base_model)

In [13]:
val_dataset = IMDbDataset(val)
test_dataset = IMDbDataset(test)
# train_dataset = IMDbDataset(train)  # uncomment for a full run (tokenizes all 40,000 training reviews)
small_train_dataset = IMDbDataset(small_train)
small_eval_dataset = IMDbDataset(small_eval)

Time taken in __init__: 2.395 seconds.
Time taken in __init__: 2.019 seconds.
Time taken in __init__: 0.009 seconds.
Time taken in __init__: 0.004 seconds.


In [14]:
print(small_train_dataset[0]["input_ids"].shape)
print(small_train_dataset[0]["input_ids"][:10])
print(small_train_dataset[0]["input_ids"][407:417])

torch.Size([512])
tensor([  101,  2028,  1997,  1996,  2060, 15814,  2038,  3855,  2008,  2044])
tensor([2064, 2131, 1999, 3543, 2007, 2115, 9904, 2217, 1012,  102])


In [15]:
example = train.review[0]
print(len(example))
print(len(example.split()))

1761
307


In [16]:
encoding = tokenizer(example, truncation=True)
print(len(encoding["input_ids"]))
print(encoding["input_ids"][:10])
print(encoding["input_ids"][-10:])
print(encoding.keys())

417
[101, 2028, 1997, 1996, 2060, 15814, 2038, 3855, 2008, 2044]
[2064, 2131, 1999, 3543, 2007, 2115, 9904, 2217, 1012, 102]
dict_keys(['input_ids', 'attention_mask'])
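
The outputs of In [14] and In [16] are consistent with each other: the single review encodes to 417 tokens, starting with 101 (`[CLS]`) and ending with 102 (`[SEP]`), while the dataset item has 512 entries because `padding=True` pads every review to the longest one in its split, which truncation caps at 512. A sketch of what padding does, assuming pad id 0 and using placeholder ids in place of the real tokens; `pad_to_length` is a hypothetical helper, not a tokenizer API:

```python
def pad_to_length(input_ids, target_len, pad_id=0):
    """Right-pad token ids and build the matching attention mask."""
    n_pad = target_len - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence longer than target_len; truncate first")
    padded = input_ids + [pad_id] * n_pad
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return padded, attention_mask

# Stand-in for the 417 real token ids: [CLS] + 415 placeholder ids + [SEP]
ids = [101] + [2000] * 415 + [102]
padded, mask = pad_to_length(ids, 512)
print(len(padded), sum(mask), padded[416], padded[417])  # 512 417 102 0
```

This is why `input_ids[407:417]` of the padded dataset item ends in 102: position 416 holds the last real token, and everything after it is padding that the attention mask tells the model to ignore.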


In [17]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

In [18]:
training_args = TrainingArguments(
    output_dir="./results",          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir="./logs",            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

model = DistilBertForSequenceClassification.from_pretrained(base_model)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

In [19]:
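def train(mode="small"):
    # Note: defining this function rebinds the name `train`, which held the
    # training DataFrame until now, so build any datasets from it beforehand.
    if mode == "small":
        train_data = small_train_dataset
        eval_data = small_eval_dataset
    else:
        # Needs train_dataset, which is commented out in In [13]
        train_data = train_dataset
        eval_data = val_dataset
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=eval_data
    )

    # With small_train_dataset at 1000 samples, training takes about 37 minutes
    # on a Mac (here it holds only 20 samples)
    trainer.train()
    trainer.save_model("checkpoints/saved_model_v0")
    trainer.evaluate()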
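
`trainer.evaluate()` reports the evaluation loss by default. To also get accuracy, `Trainer` accepts a `compute_metrics` callable that receives the predictions and labels. A minimal sketch, written in plain Python so it works on nested lists as well as NumPy arrays; the function body and the example logits are assumptions for illustration, not part of the original notebook.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); eval_pred unpacks like a 2-tuple."""
    logits, labels = eval_pred
    # argmax over the class dimension for each example
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}

# Hypothetical logits for three reviews (columns: negative, positive)
example = ([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.4]], [0, 1, 0])
metrics = compute_metrics(example)
print(metrics)
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which the dict returned by `trainer.evaluate()` should include the accuracy alongside the loss.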

In [20]:
train(mode="small")

***** Running training *****
  Num examples = 20
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2



Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to checkpoints/saved_model_v0
Configuration saved in checkpoints/saved_model_v0/config.json
Model weights saved in checkpoints/saved_model_v0/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 10
  Batch size = 64


In [21]:
model = DistilBertForSequenceClassification.from_pretrained("checkpoints/saved_model_v0")

loading configuration file checkpoints/saved_model_v0/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.12.3",
  "vocab_size": 30522
}

loading weights file checkpoints/saved_model_v0/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at checkpoints/saved_model_v0

In [22]:
model.eval()
encoding = tokenizer("This movie sucks", padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    output = model(**encoding)
# Softmax over the class dimension; omitting dim triggers a deprecation warning
probs = torch.nn.functional.softmax(output["logits"], dim=-1)
pred = torch.argmax(probs, dim=-1).item()
sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
sentiment


'NEGATIVE'
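
The prediction step reduces to a softmax over the two logits followed by an argmax. A plain-Python sketch with hypothetical logit values (the notebook does not print the actual logits), using index 0 for negative and 1 for positive as in `cat2num`:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["NEGATIVE", "POSITIVE"]
logits = [0.8, -0.6]  # hypothetical values, not the model's real output
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(LABELS[pred], round(probs[pred], 3))
```

Keep in mind that after only two optimization steps on 20 reviews the classifier head is barely trained, so a correct answer on one sentence is not strong evidence of model quality.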

In [23]:
small_train.sentiment[0]

1