**Sentiment Analysis with DistilBERT using Hugging Face**


#### What is Sentiment Analysis?

Sentiment Analysis is the process of computationally determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining, because it derives the opinion or attitude of the speaker or writer.

#### DistilBERT

DistilBERT is a smaller, faster and cheaper version of BERT. It is 40% smaller than BERT and runs 60% faster while preserving over 95% of BERT’s performance.

What is DistilBERT: DistilBERT, short for "distilled BERT", is a compact version of the renowned BERT (Bidirectional Encoder Representations from Transformers) model.

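Model Architecture: DistilBERT keeps BERT's general architecture but halves the number of Transformer layers (6 instead of the 12 in BERT-base) and drops the token-type embeddings and the pooler, resulting in a smaller and faster model.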

Parameter Reduction: The smaller model is trained with knowledge distillation: a student (DistilBERT) learns to reproduce the output distribution of a larger, already trained teacher (BERT), so it can keep most of the teacher's accuracy with far fewer parameters.

Efficiency and Speed: By reducing the model's size and complexity, DistilBERT achieves a significant speedup during both training and inference.
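
To make the distillation idea above concrete, here is a minimal sketch of the soft-target term only: the student is penalized by the KL divergence between the teacher's and its own temperature-softened output distributions. The function names, logits and temperature are illustrative assumptions. The actual DistilBERT training objective also combines this term with a masked language modeling loss and a cosine embedding loss, both omitted here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Made-up logits: the loss is zero when the student matches the teacher
# and grows as the two distributions diverge.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, [2.0, 1.0, 0.1]))
print(distillation_loss(teacher, [0.1, 1.0, 2.0]))
```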

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/IMDB-Dataset.csv")
df = df.sample(10_000)
df.head()

Unnamed: 0,review,sentiment
47758,"As you all may know, JIGSAW did not make its w...",negative
19842,This love story between an American journalist...,negative
41593,Batman Returns is more Gothic and somber than ...,positive
5298,***SPOILERS*** ***SPOILERS*** Juggernaut is a ...,negative
4263,Hitchcock made at least 11 films about the ord...,positive


### Data Preprocessing

In [3]:
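# Strip HTML markup and non-letter characters from a raw review
def preprocess_text(review):
    # Replace HTML tags (e.g. <br />) with a space so adjacent words are not fused
    review = re.sub(r'<.*?>', ' ', review)
    # Remove special characters and numbers
    review = re.sub(r'[^a-zA-Z\s]', '', review)
    return review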

In [4]:
# Apply preprocessing to each text in the dataframe
df['review'] = df['review'].apply(preprocess_text)
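
As a quick check of what the two substitutions do, the sketch below runs them on a made-up review. The helper name `strip_markup`, its `tag_replacement` parameter and the final whitespace collapse are assumptions added for illustration; they are not part of the notebook. It also shows why the choice of replacement for HTML tags matters: with an empty string, words separated only by `<br />` end up fused.

```python
import re

def strip_markup(review, tag_replacement=" "):
    """Illustrative variant of the preprocessing step (hypothetical helper)."""
    # Remove HTML tags, substituting `tag_replacement` for each tag
    review = re.sub(r"<.*?>", tag_replacement, review)
    # Remove everything except letters and whitespace
    review = re.sub(r"[^a-zA-Z\s]", "", review)
    # Collapse runs of whitespace (extra step, not in the notebook)
    return re.sub(r"\s+", " ", review).strip()

sample = "Great movie.<br /><br />Loved it, 10/10!"
print(strip_markup(sample))                      # tags replaced by a space
print(strip_markup(sample, tag_replacement=""))  # tags simply deleted
```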

### Creating Torch Dataset for Model
##### custom dataset -> evaluation/compute metrics -> training arguments -> trainer -> training -> testing

In [6]:
import torch
from torch.utils.data import Dataset

In [7]:
class CustomDataset(Dataset):

    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):

        text = str(self.texts[idx])
        label = torch.tensor(self.labels[idx])

        encoding = self.tokenizer(text, truncation=True, padding="max_length",
                                  max_length=self.max_len)

        return {
            'input_ids': encoding['input_ids'],
            'attention_mask': encoding['attention_mask'],
            'labels': label
        }
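
To see what `truncation=True` and `padding="max_length"` produce, here is a toy sketch with a hypothetical whitespace tokenizer. The function `toy_encode`, the tiny vocabulary and the ids are all made up for illustration. The real DistilBERT tokenizer uses WordPiece and adds special tokens such as `[CLS]` and `[SEP]`, which this sketch leaves out.

```python
def toy_encode(text, vocab, max_len, pad_id=0, unk_id=1):
    """Hypothetical tokenizer mimicking truncation and max_length padding."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_len]                 # truncation: drop tokens past max_len
    n_pad = max_len - len(ids)          # padding: fill up to exactly max_len
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

vocab = {"good": 2, "movie": 3, "bad": 4}
short_enc = toy_encode("Good movie", vocab, max_len=4)
long_enc = toy_encode("bad bad bad very bad movie", vocab, max_len=4)
print(short_enc)
print(long_enc)
```

Every item has the same length, which is why the examples can be stacked into a batch without a custom collate function. The cost is that short reviews are still padded to the full `max_len`.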

### Prepare Tokenizer and Model

In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased'
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
X = df['review'].tolist()

label2id = {'positive': 1, 'negative': 0}
id2label = {1: 'positive', 0: 'negative'}

y = df['sentiment'].map(label2id).tolist()

data = CustomDataset(X, y, tokenizer)

In [10]:
data[0].keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [11]:
train_dataset, test_dataset = train_test_split(data, test_size=0.2, random_state=42)
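
When `train_test_split` is handed a custom `Dataset`, it appears to index every item to build the two splits, so all reviews are tokenized up front. One alternative, sketched here under that assumption, is to split the indices instead and leave the dataset lazy. The helper `split_indices` is hypothetical and uses only the standard library; it will not reproduce the exact shuffle that scikit-learn generates for `random_state=42`.

```python
import random

def split_indices(n_items, test_size=0.2, seed=42):
    """Shuffle 0..n_items-1 with a fixed seed and cut off a test fraction."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = int(round(n_items * test_size))
    return indices[n_test:], indices[:n_test]  # (train, test)

train_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(test_idx))
```

The two index lists could then be wrapped with `torch.utils.data.Subset(data, train_idx)` and `torch.utils.data.Subset(data, test_idx)`, so that an item is tokenized only when it is requested.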

### Function for Evaluation

In [12]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(example):

    labels = example.label_ids
    preds = example.predictions.argmax(-1)

    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)

    return {'accuracy': acc, "f1": f1}
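
For intuition about what `compute_metrics` returns, the sketch below recomputes accuracy and support-weighted F1 by hand on four made-up examples. The helper names and the logits are assumptions for illustration; in the notebook these values come from `sklearn.metrics`.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row (same role as argmax(-1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * labels.count(cls)   # weight by number of true examples
    return total / len(labels)

logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 1.0]]  # made-up model outputs
labels = [0, 1, 0, 0]
preds = argmax_rows(logits)
print(preds, accuracy(labels, preds), weighted_f1(labels, preds))
```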

### Build Model

In [13]:
from transformers import Trainer, TrainingArguments

In [25]:
batch_size = 16

args = TrainingArguments(
    output_dir = "output",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size = batch_size,
    learning_rate = 2e-5,
    num_train_epochs = 3,
    evaluation_strategy = 'epoch'
)

In [26]:
trainer = Trainer(model=model,
                  args=args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer)

In [27]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.335,0.239815,0.904,0.903827
2,0.1755,0.259968,0.9155,0.915499
3,0.1027,0.305195,0.9195,0.9195


TrainOutput(global_step=1500, training_loss=0.20442059580485025, metrics={'train_runtime': 1175.254, 'train_samples_per_second': 20.421, 'train_steps_per_second': 1.276, 'total_flos': 3179217567744000.0, 'train_loss': 0.20442059580485025, 'epoch': 3.0})
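
The `global_step=1500` and `train_samples_per_second` values in this output can be sanity-checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

n_samples = 10_000       # rows kept by df.sample(10_000)
test_size = 0.2          # fraction held out by train_test_split
batch_size = 16          # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
train_runtime = 1175.254 # seconds, taken from the TrainOutput above

n_train = n_samples - math.ceil(n_samples * test_size)
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = n_train * num_epochs / train_runtime

print(n_train, steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

Both numbers line up with the reported `global_step` and `train_samples_per_second`, which suggests the 8,000 training reviews were each seen once per epoch.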

### Save Model

In [28]:
model_name = "distilbert_finetuned_sentiment"

In [29]:
trainer.save_model(model_name)


### Inference

In [30]:
from transformers import pipeline

text = "i love this product"
pipe = pipeline('text-classification', model_name)
pipe(text)


[{'label': 'LABEL_1', 'score': 0.9907051920890808}]

In [31]:
id2label

{1: 'positive', 0: 'negative'}
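
The pipeline reports `LABEL_1` because the saved model config has no label names. A small post-processing step, sketched here with a hypothetical helper `to_readable`, can translate the generic labels using `id2label`.

```python
id2label = {1: 'positive', 0: 'negative'}

def to_readable(results, id2label):
    """Map generic 'LABEL_<n>' names in pipeline output to readable labels."""
    readable = []
    for item in results:
        class_id = int(item['label'].rsplit('_', 1)[-1])  # 'LABEL_1' -> 1
        readable.append({'label': id2label[class_id], 'score': item['score']})
    return readable

raw = [{'label': 'LABEL_1', 'score': 0.9907051920890808}]  # output shown above
print(to_readable(raw, id2label))
```

Another option, not used in this notebook, is to pass `id2label=id2label` and `label2id=label2id` to `AutoModelForSequenceClassification.from_pretrained` before training, so the mapping is stored in the model config and the pipeline returns the names directly.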

In [32]:
tok = AutoTokenizer.from_pretrained(model_name)
mod = AutoModelForSequenceClassification.from_pretrained(model_name)

In [33]:
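def get_prediction(text):
    # Truncate to the model's maximum length so long inputs do not raise an error
    input_ids = tok.encode(text, return_tensors='pt', truncation=True)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        output = mod(input_ids)

    # Convert logits to class probabilities
    preds = torch.nn.functional.softmax(output.logits, dim=-1)

    prob = torch.max(preds).item()
    idx = torch.argmax(preds).item()
    sentiment = id2label[idx]

    return {'sentiment': sentiment, 'prob': prob}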

In [34]:
text = "I bought a mobile. Mobile is awesome, with its cool design, fast performance, great camera, and long battery life, it's way better than expected."
get_prediction(text)

{'sentiment': 'positive', 'prob': 0.9918524622917175}
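
The step from logits to the returned dictionary is a softmax followed by an argmax. The pure-Python sketch below mirrors it on made-up logits; the `decode` helper and the numbers are assumptions for illustration, not values produced by the fine-tuned model.

```python
import math

id2label = {1: 'positive', 0: 'negative'}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Pick the most probable class and report its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {'sentiment': id2label[idx], 'prob': probs[idx]}

result = decode([-2.3, 2.4])  # made-up logits for [negative, positive]
print(result)
```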

### The End