# Sentiment Analysis using BERT

The following figure shows how we fine-tune the pre-trained BERT model for a sentiment analysis task:

![image.png](attachment:ff9c807a-0853-46e5-91d2-dccdd8da8e2b.png)    

As the preceding figure shows, we feed the tokens to the pre-trained BERT model and obtain an embedding for every token. We then take the embedding of the [CLS] token and feed it to a feedforward network with a softmax function to perform classification.
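
The classification step described above can be sketched in plain Python. This is a toy illustration rather than the library's implementation: the 4-dimensional embedding, the weights, and the bias are hypothetical values chosen for the example (BERT-base embeddings have 768 dimensions).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_embedding, weights, bias):
    """Apply one linear layer to the [CLS] embedding, then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical [CLS] embedding (4 dimensions instead of 768).
cls_embedding = [0.5, -1.0, 0.25, 2.0]
# Hypothetical weights: one row per class (negative, positive).
weights = [[0.1, 0.2, -0.3, 0.0],
           [-0.2, 0.1, 0.4, 0.3]]
bias = [0.0, 0.1]

probs = classify(cls_embedding, weights, bias)
predicted_class = probs.index(max(probs))  # index of the most likely class
print(probs, predicted_class)
```

During fine-tuning, the weights of this layer and of BERT itself are updated together so that the predicted probabilities match the sentiment labels.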

In [1]:
!pip install nlp==0.4.0
!pip install transformers==4.30.0

In [21]:
!pip install dill==0.3.5.1

In [32]:
!pip install --upgrade datasets

# Importing the necessary libraries

In [2]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset
import torch
import numpy as np

# Loading the dataset

In [3]:
!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
dataset = load_dataset('csv', data_files='./imdbs.csv', split='train')

In [4]:
type(dataset)

In [5]:
import nlp
nlp.arrow_dataset.Dataset

# Creating train and test sets

In [6]:
dataset = dataset.train_test_split(test_size=0.3)

In [7]:
dataset

{'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 70),
 'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 30)}
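
To make the 70/30 split above concrete, here is a minimal sketch of what a shuffled train/test split does. It is not the `nlp` library's implementation: the helper below, its `seed` parameter, and the stand-in rows are all hypothetical.

```python
import random

def train_test_split(rows, test_size=0.3, seed=0):
    """Shuffle the rows, then hold out a `test_size` fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_train = len(shuffled) - n_test
    return {'train': shuffled[:n_train], 'test': shuffled[n_train:]}

# Hypothetical stand-in for the 100 reviews: (text, label) pairs.
rows = [(f'review {i}', i % 2) for i in range(100)]
split = train_test_split(rows, test_size=0.3)
print(len(split['train']), len(split['test']))  # 70 30
```

Shuffling before cutting matters: if the CSV were sorted by label, an unshuffled split could put only one class in the test set.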

In [8]:
train_set = dataset['train']
test_set = dataset['test']

# Loading the BERT model and tokenizer

In [9]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification mo

In [10]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Preprocessing the dataset

In [11]:
sentence = 'I love Paris'

In [12]:
tokens = tokenizer.tokenize(sentence)
print(tokens)

['i', 'love', 'paris']


In [13]:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)

['[CLS]', 'i', 'love', 'paris', '[SEP]']


In [14]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[101, 1045, 2293, 3000, 102]


In [15]:
token_type_ids = [0, 0, 0, 0, 0]

In [16]:
attention_mask = [1, 1, 1, 1, 1]
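
The manual steps above can be gathered into one small function. This is only a sketch: the `vocab` dictionary is a toy lookup holding just the five ids printed above (the real vocabulary has roughly 30,000 entries), and `encode` is a hypothetical helper, not part of the `transformers` API.

```python
# Toy vocabulary containing only the ids printed above.
vocab = {'[CLS]': 101, '[SEP]': 102, 'i': 1045, 'love': 2293, 'paris': 3000}

def encode(words):
    """Build the three model inputs for one already-tokenized sentence."""
    tokens = ['[CLS]'] + words + ['[SEP]']
    input_ids = [vocab[token] for token in tokens]
    return {
        'input_ids': input_ids,
        'token_type_ids': [0] * len(input_ids),  # single sentence: all segment 0
        'attention_mask': [1] * len(input_ids),  # no padding: attend to every token
    }

encoded = encode(['i', 'love', 'paris'])
print(encoded)
```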

# Preprocessing the dataset using a tokenizer

In [17]:
tokenizer('I love Paris')

{'input_ids': [101, 1045, 2293, 3000, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [18]:
tokenizer(['I love Paris', 'birds fly', 'snow fall'], padding=True, truncation=True, max_length=5)



{'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}
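
The padding and attention mask in the output above can be reproduced by hand. The sketch below assumes the padding id is 0, as the output suggests; `pad_batch` is a hypothetical helper, not a `transformers` function.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Token ids taken from the tokenizer output above, with the padding removed.
batch = [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102], [101, 4586, 2991, 102]]
padded = pad_batch(batch)
print(padded['input_ids'])
print(padded['attention_mask'])
```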

In [19]:
def preprocess(data):
    return tokenizer(data['text'], padding=True, truncation=True)

# Preprocessing the train and test sets

In [20]:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))

# Training the model

In [21]:
batch_size = 8
epochs = 2
warmup_steps = 500
weight_decay = 0.01

In [22]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    logging_dir='./logs',
)

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=test_set
)

In [24]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.686378 |
| 2 | No log | 0.682015 |


TrainOutput(global_step=18, training_loss=0.6940735181172689, metrics={'train_runtime': 1276.8239, 'train_samples_per_second': 0.11, 'train_steps_per_second': 0.014, 'total_flos': 36835547750400.0, 'train_loss': 0.6940735181172689, 'epoch': 2.0})
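
The `global_step=18` reported above follows from the dataset size and batch size, and it is far below `warmup_steps = 500`. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, a linear warmup, and the default `learning_rate` of `TrainingArguments` (5e-5); none of these are stated in the notebook.

```python
import math

train_rows = 70      # size of the training split
batch_size = 8
epochs = 2
warmup_steps = 500
base_lr = 5e-5       # assumption: default learning_rate of TrainingArguments

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs

def warmup_lr(step):
    """Learning rate during linear warmup (the decay phase afterwards is not modelled)."""
    return base_lr * min(1.0, step / warmup_steps)

print(total_steps)             # 18
print(warmup_lr(total_steps))  # learning rate reached at the final step
```

Under these assumptions the run ends while the learning rate is still under 4% of its target value, which is one plausible reason the loss stays near 0.69. With a dataset this small, a much lower `warmup_steps` value may be worth trying.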

In [25]:
trainer.evaluate()

{'eval_loss': 0.6820146441459656,
 'eval_runtime': 71.5173,
 'eval_samples_per_second': 0.419,
 'eval_steps_per_second': 0.056,
 'epoch': 2.0}
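
The evaluation above reports only the loss. A metric function along the following lines could be passed to `Trainer` through its `compute_metrics` argument to report accuracy as well. This is a sketch: it assumes the argument unpacks into a `(logits, labels)` pair, and the toy logits and labels are hypothetical.

```python
import math

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair; the pair layout is an assumption."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(predictions)}

# Hypothetical logits for four reviews (columns: class 0, class 1) and their labels.
toy_logits = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [0.7, 0.6]]
toy_labels = [1, 0, 0, 0]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)

# Reference point: a two-class model that always predicts 50/50
# has a cross-entropy loss of ln(2).
chance_loss = math.log(2)
print(chance_loss)
```

The reported `eval_loss` of about 0.682 is close to ln 2 (about 0.693), which suggests that after this short run the model is still predicting near chance.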