# Sentiment Analysis with DistilBERT using Hugging Face | NLP with Hugging Face Tutorials


**GITHUB REPO:**

https://github.com/laxmimerit/NLP-Tutorials-with-HuggingFace

NLP Playlist: https://www.youtube.com/watch?v=NLvQ5oj-Sg4&list=PLc2rvfiptPSTGfTp0nhC71ksTY1p5ooCW

## What is Sentiment Analysis?
Sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative, or neutral. It is also known as opinion mining: inferring the opinion or attitude of a speaker or writer.

### Transformers Architecture

## DistilBERT
- The IMDB dataset contains 25,000 movie reviews labeled by sentiment for training a model and 25,000 movie reviews for testing it.

- DistilBERT is a smaller, faster, and cheaper version of BERT. It is 40% smaller than BERT and runs 60% faster while preserving over 95% of BERT’s performance.

**Introduction to DistilBERT:** DistilBERT, short for "distilled BERT," is a compact version of the renowned BERT (Bidirectional Encoder Representations from Transformers) model, trained with knowledge distillation.

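**Model Architecture:** DistilBERT keeps BERT's Transformer encoder design but halves the number of layers (6 instead of 12 for the base model) and drops the token-type embeddings and the pooler. The hidden size and the number of attention heads per layer stay the same, so the savings come from depth, resulting in a smaller and faster model.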
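
The size difference can be sanity-checked with a back-of-the-envelope parameter count. The sketch below assumes the published base configurations (vocabulary of 30,522 tokens, 512 positions, hidden size 768, feed-forward size 3,072, 12 versus 6 layers) and counts only the encoder, without any classification head. The helper names are hypothetical.

```python
# Rough parameter count for a BERT-style encoder (no task head).
# The sizes below are the published base configurations, assumed here.
VOCAB, MAX_POS, HIDDEN, FFN = 30522, 512, 768, 3072

def layer_norm(h):
    return 2 * h  # scale + bias

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias

def encoder_layer(h, ffn):
    attention = 4 * linear(h, h)  # query, key, value, and output projections
    feed_forward = linear(h, ffn) + linear(ffn, h)
    return attention + feed_forward + 2 * layer_norm(h)

def count_params(n_layers, token_types, pooler):
    embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + layer_norm(HIDDEN)
    if token_types:
        embeddings += 2 * HIDDEN  # segment (token-type) embeddings
    total = embeddings + n_layers * encoder_layer(HIDDEN, FFN)
    if pooler:
        total += linear(HIDDEN, HIDDEN)
    return total

bert = count_params(n_layers=12, token_types=True, pooler=True)
distilbert = count_params(n_layers=6, token_types=False, pooler=False)
print(bert, distilbert, round(1 - distilbert / bert, 3))
```

Under these assumptions the count comes to roughly 109M parameters for BERT and 66M for DistilBERT, a reduction of about 40%.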

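**Parameter Reduction:** One of DistilBERT's key features is its parameter reduction strategy, achieved by knowledge distillation. The smaller student model (DistilBERT) is trained to reproduce the output distribution of the larger teacher model (BERT), so the student learns from the teacher's soft predictions rather than from hard labels alone.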
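
To make the teacher/student idea concrete, here is a minimal sketch of the soft-target part of a distillation loss: the KL divergence between temperature-softened teacher and student distributions. The logits and the temperature of 2.0 are hypothetical values for illustration. DistilBERT's actual training objective combines a term like this with a masked language modeling loss and a cosine embedding loss, which are not shown.

```python
import math

def softmax(logits, temperature=1.0):
    # A higher temperature flattens the distribution and exposes the
    # teacher's relative preferences among the non-top classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]  # student that roughly agrees
far_student = [0.5, 1.0, 4.0]    # student that disagrees

print(soft_target_loss(teacher, close_student))
print(soft_target_loss(teacher, far_student))
```

The loss is near zero when the student matches the teacher and grows as the two distributions diverge, which is the signal the student is trained to minimize.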

**Efficiency and Speed:** By reducing the model's size and complexity, DistilBERT achieves a significant speedup during both training and inference.

# Coding
- https://github.com/laxmimerit/preprocess_kgptalkie

In [1]:
# sentiment analysis with the pipeline
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")

data = ['i love you', 'i hate you']
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]
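
The pipeline returns one dictionary per input, each holding a `label` and a `score` for that label. A small sketch of how such output might be post-processed into a single signed number per text follows; `signed_score` is a hypothetical helper, and the results are copied from the output above rather than recomputed.

```python
# Output copied from the pipeline call above.
data = ['i love you', 'i hate you']
results = [{'label': 'POSITIVE', 'score': 0.9998656511306763},
           {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

def signed_score(result):
    """Map a pipeline result to [-1, 1]: positive above zero, negative below."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']

for text, result in zip(data, results):
    print(f"{text!r}: {signed_score(result):+.4f}")
```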

## Data Loading and Preprocessing

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import preprocess_kgptalkie as ps

In [3]:
df = pd.read_csv("./IMDB.csv")
df = df.sample(10_000)
df.head()

| | review | sentiment |
|---|---|---|
| 24316 | This movie is simply incredible! I had expecte... | positive |
| 19308 | The views of Earth that are claimed in this fi... | negative |
| 12128 | I enjoy the show Surface very much. The show i... | positive |
| 9523 | While studying the differences between religio... | negative |
| 26873 | I would have given this movie a 1, but I laugh... | negative |


In [4]:
df.shape
df.isnull().sum()

review       0
sentiment    0
dtype: int64

## Data Preparation for ML

In [5]:
# custom dataset -> evaluation/compute metrics -> training arguments -> trainer -> training -> testing

In [6]:
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split

In [7]:
class CustomDataset(Dataset):
  """Tokenizes one review per item and returns the fields Trainer expects."""

  def __init__(self, texts, labels, tokenizer, max_len=512):
    self.texts = texts
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_len = max_len  # 512 is the longest sequence DistilBERT accepts

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    text = str(self.texts[idx])
    label = torch.tensor(self.labels[idx])

    # truncate long reviews and pad short ones so every item has max_len tokens
    encoding = self.tokenizer(text, truncation=True, padding="max_length",
                              max_length=self.max_len)

    return {
        'input_ids': encoding['input_ids'],
        'attention_mask': encoding['attention_mask'],
        'labels': label
    }
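
What `truncation=True` together with `padding="max_length"` does to a single sequence can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the token ids are hypothetical, the pad id of 0 is an assumption about the uncased vocabulary, and the real tokenizer also keeps the closing special token when it truncates.

```python
PAD_ID = 0  # assumption: [PAD] has id 0 in the uncased BERT/DistilBERT vocabulary

def pad_or_truncate(token_ids, max_len):
    """Cut a sequence to max_len, then right-pad it to exactly max_len."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)        # 1 marks real tokens
    pad = max_len - len(ids)
    return {
        'input_ids': ids + [PAD_ID] * pad,
        'attention_mask': mask + [0] * pad,  # 0 marks padding the model should ignore
    }

short = pad_or_truncate([101, 1045, 2293, 102], max_len=6)  # hypothetical ids
long = pad_or_truncate(list(range(1, 11)), max_len=6)

print(short)
print(long)
```

Every item ends up with the same length, which is what lets the default collator stack them into a batch. The cost is that short reviews are padded all the way to 512 tokens.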


In [8]:
# prepare tokenizer and model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased'
# device = "cuda"
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
X = df['review'].tolist()

label2id = {'positive': 1, 'negative': 0}
id2label = {1: 'positive', 0: 'negative'}

y = df['sentiment'].map(label2id).tolist()

dataset = CustomDataset(X, y, tokenizer)

In [10]:
dataset[0].keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [11]:
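# 80/20 split. Indexing the Dataset through train_test_split tokenizes every review
# up front and returns plain Python lists of feature dicts, which Trainer accepts.
train_dataset, test_dataset = train_test_split(dataset, test_size=0.2, random_state=42)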

In [12]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
  # eval_pred is the EvalPrediction object Trainer passes in after each evaluation
  labels = eval_pred.label_ids
  preds = eval_pred.predictions.argmax(-1)  # highest logit per row -> predicted class id

  f1 = f1_score(labels, preds, average="weighted")
  acc = accuracy_score(labels, preds)

  return {'accuracy': acc, "f1": f1}
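
For readers who want to see what these two metrics compute, here is a dependency-free sketch of the same steps: take the argmax of each logit row, then compute accuracy and a support-weighted F1. The logits and labels are hypothetical, and the functions are illustrative stand-ins for the scikit-learn calls, not their implementation.

```python
from collections import Counter

def argmax_rows(logits):
    # index of the largest logit in each row
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * n / len(labels)
    return total

# hypothetical logits for six reviews, columns = [negative, positive]
logits = [[0.2, 1.5], [2.0, -1.0], [0.9, 0.1], [-0.3, 0.8], [1.1, 0.4], [0.0, 0.6]]
labels = [1, 0, 1, 1, 0, 0]

preds = argmax_rows(logits)
print(preds, accuracy(labels, preds), weighted_f1(labels, preds))
```

With balanced classes, as in this toy example, weighted F1 and accuracy land close together, which matches the nearly identical accuracy and F1 values reported during training.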

In [13]:
from transformers import Trainer, TrainingArguments

batch_size = 16
model_name = "distilbert_finetuned_sentiment"  # directory the fine-tuned model is saved to

args = TrainingArguments(
    output_dir = "output",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size = batch_size,
    learning_rate = 2e-5,
    num_train_epochs = 1,
    # evaluate once per epoch; newer transformers releases name this argument eval_strategy
    evaluation_strategy = 'epoch'
)
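
The `learning_rate` above is only the starting value. Assuming the `Trainer` defaults of a linear schedule with no warmup (the arguments above do not override either), the rate decays to zero over the run. A minimal sketch of that schedule follows; `linear_lr` is a hypothetical helper, and the 500 total steps assume 8,000 training examples at batch size 16 for one epoch.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 500  # assumption: 8,000 examples / batch size 16 * 1 epoch
for step in (0, 125, 250, 375, 500):
    print(step, linear_lr(step, total_steps))
```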



In [14]:
trainer = Trainer(model=model,
                  args=args,
                  train_dataset = train_dataset,
                  eval_dataset = test_dataset,
                  compute_metrics=compute_metrics,
                  tokenizer = tokenizer)



In [15]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2935 | 0.232524 | 0.9135 | 0.913485 |


TrainOutput(global_step=500, training_loss=0.29353445434570313, metrics={'train_runtime': 448.9976, 'train_samples_per_second': 17.817, 'train_steps_per_second': 1.114, 'total_flos': 1059739189248000.0, 'train_loss': 0.29353445434570313, 'epoch': 1.0})
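
The numbers in `TrainOutput` can be checked against the setup with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so one optimizer step consumes one batch of 16 examples.

```python
import math

n_rows, test_size, batch_size, epochs = 10_000, 0.2, 16, 1

n_test = round(n_rows * test_size)   # 20% held out for evaluation
n_train = n_rows - n_test
steps = math.ceil(n_train / batch_size) * epochs

runtime = 448.9976  # train_runtime reported above, in seconds
print(n_train, n_test, steps)
print(round(n_train * epochs / runtime, 3), round(steps / runtime, 3))
```

Under these assumptions the 8,000 training reviews give 500 steps, and the derived throughput agrees with the reported `train_samples_per_second` and `train_steps_per_second`.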

In [16]:
trainer.save_model(model_name)

## Model Testing

In [17]:
text = "i love this product"
pipe = pipeline('text-classification', model_name)
pipe(text)

Device set to use cuda:0


[{'label': 'LABEL_1', 'score': 0.9621106386184692}]

In [18]:
id2label

{1: 'positive', 0: 'negative'}
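
The pipeline reports `LABEL_1` because the saved model config carries no label names, so the `id2label` dictionary has to be applied by hand. A small sketch of that translation follows; `readable` is a hypothetical helper, and the raw result is copied from the pipeline output above. Passing `id2label` and `label2id` to `from_pretrained` when the model is first created is the more usual route, since the names are then stored in the config.

```python
id2label = {1: 'positive', 0: 'negative'}

def readable(result, id2label):
    """Translate a default 'LABEL_<id>' name into the matching id2label entry."""
    idx = int(result['label'].rsplit('_', 1)[1])
    return {'label': id2label[idx], 'score': result['score']}

raw = [{'label': 'LABEL_1', 'score': 0.9621106386184692}]  # copied from the output above
print([readable(r, id2label) for r in raw])
```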

In [19]:
# load model
tok = AutoTokenizer.from_pretrained(model_name)
mod = AutoModelForSequenceClassification.from_pretrained(model_name)

In [20]:
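def get_prediction(text):
  # tokenize, truncating anything longer than the model's maximum input length
  inputs = tok(text, return_tensors='pt', truncation=True)

  # inference only, so skip gradient tracking
  with torch.no_grad():
    output = mod(**inputs)

  # convert logits to class probabilities
  preds = torch.nn.functional.softmax(output.logits, dim=-1)

  prob = torch.max(preds).item()

  idx = torch.argmax(preds).item()
  sentiment = id2label[idx]

  return {'sentiment':sentiment, 'prob':prob}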

In [35]:
text = "did understand the movie"
get_prediction(text)

{'sentiment': 'positive', 'prob': 0.6008546352386475}

In [29]:
text = "i hate this product"
get_prediction(text)

{'sentiment': 'negative', 'prob': 0.9248220920562744}
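
The two predictions differ mainly in confidence. With only two classes, the softmax probability of the winning class is a sigmoid of the gap between the two logits, so a reported probability can be converted back into that gap. The sketch below does this for the two probabilities above; the helper names are hypothetical.

```python
import math

def logit_gap(prob):
    """Logit difference (winner minus loser) implied by a two-class softmax probability."""
    return math.log(prob / (1.0 - prob))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for p in (0.6008546352386475, 0.9248220920562744):
    gap = logit_gap(p)
    # sigmoid(gap) recovers the original probability
    print(f"p={p:.4f} -> gap={gap:.3f} -> back to p={sigmoid(gap):.4f}")
```

A probability near 0.60 corresponds to a small gap between the logits, meaning the model barely prefers one class. That is a reasonable cue to treat the ambiguous first sentence as uncertain rather than clearly positive.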