
In [1]:
!pip install transformers



In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [3]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [23]:
pd.set_option('max_colwidth', 100)
df.head(5)

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horro... | 1 |
| 1 | apparently reassembled from the cutting room floor of any given daytime soap | 0 |
| 2 | they presume their audience wo n't sit still for a sociology lesson , however entertainingly pre... | 0 |
| 3 | this is a visually stunning rumination on love , memory , history and the war between art and co... | 1 |
| 4 | jonathan parker 's bartleby should have been the be all end all of the modern office anomie films | 1 |


For this example, we'll use only the first 2,000 sentences from the dataset.

In [9]:
dataset = df[:2000]

We can ask pandas how many sentences are labeled "positive" (value 1) and how many are labeled "negative" (value 0).

In [10]:
dataset[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained model. The cell below loads DistilBERT by default; a commented-out line shows how to switch to BERT.

In [11]:
from transformers import DistilBertModel, DistilBertTokenizer
# or for BERT
# from transformers import BertModel, BertTokenizer

# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
# model_class, tokenizer_class, pretrained_weights = (BertModel, BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


Right now, the variable `model` holds a pretrained DistilBERT model, a version of BERT that is smaller and much faster and requires a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences: break them up into words and subwords in the format BERT expects.

In [15]:
tokenized = dataset[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))


In [39]:
# Printing the first sentence in the dataset
print("First sentence in the dataset:", dataset[0][0])

# Extracting tokens and their corresponding numeric encodings
BP_tokens = tokenizer.convert_ids_to_tokens(tokenized[0])
numeric_encodings = tokenized[0]

# Displaying the result
print("\nTokenized Sequence:")
for token, encoding in zip(BP_tokens, numeric_encodings):
    print(f"{token:<15} : {encoding}")


First sentence in the dataset: a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films

Tokenized Sequence:
[CLS]           : 101
a               : 1037
stirring        : 18385
,               : 1010
funny           : 6057
and             : 1998
finally         : 2633
transporting    : 18276
re              : 2128
imagining       : 16603
of              : 1997
beauty          : 5053
and             : 1998
the             : 1996
beast           : 6841
and             : 1998
1930s           : 5687
horror          : 5469
films           : 3152
[SEP]           : 102
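
To make the word/subword splitting concrete, here is a minimal sketch of greedy longest-match-first WordPiece tokenization. It is a toy illustration, not the tokenizer's actual implementation: the function names, the tiny `toy_vocab`, and its IDs are all hypothetical and do not match the real `distilbert-base-uncased` vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subwords, always taking the longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(sentence, vocab):
    """Wrap the subword tokens in [CLS] ... [SEP] and map them to IDs."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


# Hypothetical toy vocabulary (assumption, for illustration only)
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "a": 3,
             "re": 4, "imagin": 5, "##ing": 6, "funny": 7}

tokens, ids = encode("a funny re imagining", toy_vocab)
print(tokens)
print(ids)
```

In this toy vocabulary "imagining" is not a whole-word entry, so it is split into `imagin` and `##ing`; the real vocabulary happens to contain it as a single token, as the output above shows.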


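### Padding
After tokenization, each sentence is a list of token IDs, and the lists have different lengths. To process all sentences as one batch, we pad the shorter lists with the ID 0 until they all match the longest one.

In [48]:
# Find the length of the longest tokenized sentence
max_len = max(len(ids) for ids in tokenized.values)

# Right-pad every shorter sentence with 0, the ID of the [PAD] token
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])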

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [49]:
padded.shape

(2000, 59)
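
The same padding step can be tried on a handful of short sequences, where the result is easy to read. This is a small sketch; the three sequences in `toy_tokenized` are made up for illustration.

```python
import numpy as np

# Three hypothetical tokenized sentences of different lengths
toy_tokenized = [[101, 1037, 6057, 102], [101, 102], [101, 1037, 102]]

# Length of the longest sequence
toy_max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad each sequence with zeros so all rows have the same length
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

print(toy_padded.shape)
print(toy_padded)
```

Every row now has `toy_max_len` entries, which is what lets NumPy build a rectangular 2-D array.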

### Masking
If we send `padded` directly to BERT, the model would treat the padding zeros as real tokens and attend to them. We need a second variable that tells it to ignore (mask) the padding when it processes the input. That's what `attention_mask` is:

In [50]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)
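
To see what the mask changes, here is a simplified sketch of a masked softmax, the step where attention weights are computed. It is an illustration of the idea only, not DistilBERT's internal code; the function `masked_softmax` and the uniform `scores` are assumptions made for the example.

```python
import numpy as np


def masked_softmax(scores, mask):
    """Softmax over the last axis that gives masked-out positions zero weight."""
    # Replace scores at padded positions with a very large negative number
    scores = np.where(mask == 1, scores, -1e9)
    # Subtract the row maximum for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


toy_padded = np.array([[101, 1037, 6057, 102],
                       [101, 102, 0, 0]])
toy_mask = np.where(toy_padded != 0, 1, 0)

# Pretend every token gets the same raw attention score
scores = np.zeros(toy_padded.shape)
weights = masked_softmax(scores, toy_mask)
print(weights)
```

In the second row the two padded positions receive zero weight, so the two real tokens share all of the attention between them.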

In [51]:
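# Convert the NumPy arrays to PyTorch tensors
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Run the model without tracking gradients, since we only need its outputs
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)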

In [70]:
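# The model returns an output object; last_hidden_state has shape (sentences, tokens, hidden units)
outputs = last_hidden_states
last_layer = outputs.last_hidden_state
# Keep only the vector at position 0, the [CLS] token, as the embedding of each sentence
embedding = last_layer[:, 0, :].numpy()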

In [72]:
embedding.shape

(2000, 768)
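
The slice `[:, 0, :]` is what turns a 3-D tensor of per-token vectors into one vector per sentence. The sketch below shows the same indexing on a small NumPy array; the shape `(2, 3, 4)` is a made-up stand-in for `(2000, 59, 768)`.

```python
import numpy as np

# Hypothetical hidden states: 2 sentences, 3 tokens each, 4 hidden units
toy_hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Take every sentence, only the first token position, and all hidden units
toy_cls = toy_hidden[:, 0, :]

print(toy_cls.shape)
print(toy_cls)
```

The token axis disappears from the result, which is why the notebook's embedding has two dimensions rather than three.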

The labels indicating whether each sentence is positive or negative now go into the `labels` variable.

In [76]:
labels = dataset[1]
features = embedding

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set (even though we're using 2,000 sentences from the SST2 training set).

In [77]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
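
Called without extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing. The sketch below mimics that behaviour on row indices only; it is a simplified stand-in, not scikit-learn's implementation, and `simple_split` and its `seed` parameter are hypothetical names.

```python
import math

import numpy as np


def simple_split(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = math.ceil(n_samples * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = simple_split(2000)
print(len(train_idx), len(test_idx))
```

With 2,000 sentences this gives 1,500 training rows and 500 test rows. Because the split is random, scores can vary from run to run; passing `random_state` to `train_test_split` makes the split reproducible.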

In [None]:
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(train_features, train_labels)

# print('best parameters: ', grid_search.best_params_)
# print('best scores: ', grid_search.best_score_)
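
Before running the commented-out search it can help to look at the candidate values it would try. This short sketch only builds the grid from the cell above; `c_grid` and `step` are names introduced here for illustration.

```python
import numpy as np

# The same 20 evenly spaced candidates for C used in the grid search cell
c_grid = np.linspace(0.0001, 100, 20)

# Distance between neighbouring candidates
step = c_grid[1] - c_grid[0]

print(len(c_grid), round(step, 4))
print(c_grid[:3])
```

Assuming `GridSearchCV` keeps its default of 5-fold cross-validation, searching 20 candidates means 100 model fits plus a final refit, which is one reason the cell is left commented out.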

We now train the LogisticRegression model. If you've chosen to run the grid search in the commented-out cell above, you can plug the best value of C into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [78]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

In [79]:
lr_clf.score(test_features, test_labels)

0.834
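
For a classifier, `score` reports accuracy: the fraction of test sentences whose predicted label matches the true label. Here is a minimal sketch of that calculation; the `accuracy` helper and the toy labels are hypothetical and not taken from the dataset.

```python
def accuracy(predictions, targets):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical toy labels, for illustration only
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
toy_accuracy = accuracy(toy_pred, toy_true)  # 4 of 5 predictions match
print(toy_accuracy)

# Assuming the default 25% split of 2,000 sentences, the test set has 500 rows,
# so a score of 0.834 corresponds to this many correct predictions:
correct_on_test = round(0.834 * 500)
print(correct_on_test)
```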

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [80]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.518 (+/- 0.00)
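
A dummy score this close to 0.5 follows from the class balance. The sketch below assumes the dummy classifier always predicts the most frequent label (which is how scikit-learn's default strategy predicts in recent versions) and uses the label counts printed earlier for the full 2,000 sentences.

```python
from collections import Counter

# Label counts reported by value_counts() earlier: 1041 positive, 959 negative
all_labels = [1] * 1041 + [0] * 959

counts = Counter(all_labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the most frequent label
baseline = majority_count / len(all_labels)
print(majority_label, baseline)
```

On all 2,000 sentences the majority-class baseline is about 0.52; the 0.518 above is computed on the training split only, so it differs slightly. Either way, 0.834 is well above what guessing the majority label achieves.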
