**<span style="color:red">Warning</span>**: Do "File -> Save a copy in Drive" before you start modifying the notebook, otherwise your modifications will not be saved.

# BERT for Sentiment Analysis 

In [None]:
!pip install transformers
!pip install jupyter_black

In [None]:
import jupyter_black

jupyter_black.load(lab=False, line_length=100)

In [None]:
import transformers
import tensorflow as tf

## Downloading the large movie review dataset (25000 reviews in train, 25000 reviews in test)

In [None]:
!wget https://thome.isir.upmc.fr/classes/RITAL/json_pol

--2023-02-17 13:34:28--  https://thome.isir.upmc.fr/classes/RITAL/json_pol
Resolving thome.isir.upmc.fr (thome.isir.upmc.fr)... 134.157.18.247
Connecting to thome.isir.upmc.fr (thome.isir.upmc.fr)|134.157.18.247|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66110051 (63M) [application/octet-stream]
Saving to: ‘json_pol.1’


2023-02-17 13:34:35 (11.8 MB/s) - ‘json_pol.1’ saved [66110051/66110051]



In [None]:
import json
from collections import Counter

# Loading json
with open("./json_pol",encoding="utf-8") as f:
    data = f.readlines()
    json_data = json.loads(data[0])
    train = json_data["train"]
    test = json_data["test"]
    

# Quick Check
counter_train = Counter((x[1] for x in train))
counter_test = Counter((x[1] for x in test))
print("Number of train reviews : ", len(train))
print("----> # of positive : ", counter_train[1])
print("----> # of negative : ", counter_train[0])
print("")
print(train[0])
print("")
print("Number of test reviews : ",len(test))
print("----> # of positive : ", counter_test[1])
print("----> # of negative : ", counter_test[0])

print("")
print(test[0])
print("")



Number of train reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

["The undoubted highlight of this movie is Peter O'Toole's performance. In turn wildly comical and terribly terribly tragic. Does anybody do it better than O'Toole? I don't think so. What a great face that man has!<br /><br />The story is an odd one and quite disturbing and emotionally intense in parts (especially toward the end) but it is also oddly touching and does succeed on many levels. However, I felt the film basically revolved around Peter O'Toole's luminous performance and I'm sure I wouldn't have enjoyed it even half as much if he hadn't been in it.", 1]

Number of test reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old 

## Getting the Tokenizer

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")



### Experimenting with the tokenizer on the first training review

In [None]:
maxL = 512  # Max length of the sequence

string_tokenized = tokenizer.encode_plus(
    train[0][0],
    return_tensors="pt",
    add_special_tokens=True,  # add '[CLS]' and '[SEP]'
    max_length=maxL,  # set max length
    truncation=True,  # truncate longer messages
    # pad_to_max_length=True
    padding="max_length",  # add padding
    return_attention_mask=True,
)


The tokenizer output string_tokenized (class BatchEncoding) contains two elements that we use here:


*   string_tokenized['input_ids']: the index of each token in the vocabulary
*   string_tokenized['attention_mask']: a binary mask (0 to ignore the token, 1 to consider it). The mask is needed because tensors must have a fixed length, whereas reviews have a variable number of words; padding positions get a mask of 0
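
To make the role of the mask concrete, here is a minimal sketch of padding and masking in plain Python, independent of the `transformers` library. The helper name `pad_with_mask` and the short token sequence are made up for illustration; the padding id 0 matches the `padding_idx=0` reported by the model's embedding layer. The real tokenizer performs this step internally when called with `padding='max_length'`.

```python
def pad_with_mask(token_ids, max_len, pad_id=0):
    """Pad a list of token ids to max_len and build the matching attention mask."""
    if len(token_ids) > max_len:
        raise ValueError("sequence longer than max_len; truncate it first")
    n_pad = max_len - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 for real tokens, 0 for padding positions
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical short sequence: 101 = [CLS] and 102 = [SEP] in the BERT vocabulary
ids, mask = pad_with_mask([101, 1996, 3185, 102], max_len=8)
print(ids)   # [101, 1996, 3185, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The number of real tokens in a padded sequence is simply the sum of its mask.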



In [None]:
print("Index:\n", string_tokenized['input_ids'])
print("Mask:\n", string_tokenized['attention_mask'])

Index:
 tensor([[  101,  1996, 25672, 12083,  3064, 12944,  1997,  2023,  3185,  2003,
          2848,  1051,  1005,  6994,  2063,  1005,  1055,  2836,  1012,  1999,
          2735, 13544, 29257,  1998, 16668, 16668, 13800,  1012,  2515, 10334,
          2079,  2009,  2488,  2084,  1051,  1005,  6994,  2063,  1029,  1045,
          2123,  1005,  1056,  2228,  2061,  1012,  2054,  1037,  2307,  2227,
          2008,  2158,  2038,   999,  1026,  7987,  1013,  1028,  1026,  7987,
          1013,  1028,  1996,  2466,  2003,  2019,  5976,  2028,  1998,  3243,
         14888,  1998, 14868,  6387,  1999,  3033,  1006,  2926,  2646,  1996,
          2203,  1007,  2021,  2009,  2003,  2036, 15056,  7244,  1998,  2515,
          9510,  2006,  2116,  3798,  1012,  2174,  1010,  1045,  2371,  1996,
          2143, 10468,  7065, 16116,  2105,  2848,  1051,  1005,  6994,  2063,
          1005,  1055, 25567,  2836,  1998,  1045,  1005,  1049,  2469,  1045,
          2876,  1005,  1056,  2031,  5632, 

# Let's download a BERT model for word embedding

In [None]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")


In [None]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

**You can use the BERT model for directly predicting polarity.** Let us apply that on the first review which has been tokenized with string_tokenized.

In [None]:
# Some preliminary test
import torch
import numpy as np

b_input_ids = string_tokenized['input_ids']
b_input_mask = string_tokenized['attention_mask']

model.eval()  # inference mode (disables dropout)

output = model(input_ids=b_input_ids, attention_mask=b_input_mask, output_hidden_states=True)
print(output.logits)  # The logits of the two classes (polarity pos/neg)
# The last layer before the class prediction: tensor of size nBatch (1 here) x maxL (512) x temb (768)
last_hidden_states = output.hidden_states[-1]
print(last_hidden_states.shape)  # batch size (1) x length of sequence (512) x length of embedding (768)
print(last_hidden_states[0, 0, 1:10])  # Values 1 to 9 (out of 768) of the first token (= [CLS] token)
print(f" norm cls token = {np.linalg.norm(last_hidden_states.detach().numpy()[0,0,:])}")


tensor([[ 4.4636, -4.0851]], grad_fn=<AddmmBackward0>)
torch.Size([1, 512, 768])
tensor([ 0.8941, -0.4308,  0.6871, -0.2124,  0.0930,  1.1323, -0.7455, -0.1118,
        -0.4200], grad_fn=<SliceBackward0>)
 norm cls token = 17.2490291595459


Logit = the value of a score before it is turned into a probability.
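
As an illustration of this definition, the sketch below applies a softmax to the two logits printed above to obtain class probabilities. It uses NumPy only; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import numpy as np


def softmax(logits):
    """Turn a vector of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)


# Logits printed above for the first training review
logits = np.array([4.4636, -4.0851])
probs = softmax(logits)
print(probs)  # two probabilities that sum to 1
print(np.argmax(probs))  # index of the predicted class (same as the argmax of the logits)
```

Since softmax is monotonic, the predicted class can be read directly from the logits; the probabilities are only needed when a confidence value is wanted.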

# Let's tokenize the whole dataset 

Now we do the same thing, but for the whole dataset.
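
The cell that performs this step is not shown here. A minimal sketch, assuming it reuses the same `encode_plus` call as for the first review and collects the results in the lists `inputs_tokens_train` and `attention_masks_train` that the following cells concatenate. The helper name `tokenize_reviews` is hypothetical.

```python
def tokenize_reviews(tokenizer, reviews, max_len=512):
    """Tokenize every [text, label] pair; return two lists (input ids, attention masks)."""
    input_ids, attention_masks = [], []
    for text, _label in reviews:
        encoded = tokenizer.encode_plus(
            text,
            return_tensors="pt",
            add_special_tokens=True,  # add '[CLS]' and '[SEP]'
            max_length=max_len,
            truncation=True,  # truncate longer reviews
            padding="max_length",  # pad shorter reviews
            return_attention_mask=True,
        )
        input_ids.append(encoded["input_ids"])
        attention_masks.append(encoded["attention_mask"])
    return input_ids, attention_masks


# In the notebook (assuming `tokenizer`, `train`, `test` and `maxL` are defined as above):
# inputs_tokens_train, attention_masks_train = tokenize_reviews(tokenizer, train, maxL)
# inputs_tokens_test, attention_masks_test = tokenize_reviews(tokenizer, test, maxL)
```

Each list element has shape 1 x maxL, so concatenating along dimension 0 yields one tensor of shape nReviews x maxL.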

# Let's create a 'TensorDataset' **for the training samples**

Each element is a triplet composed of the token indices, the attention mask, and the label

In [None]:
from torch.utils.data import TensorDataset, random_split, DataLoader, RandomSampler, SequentialSampler

# Converting input tokens to torch tensors
inputs_tokens_train = torch.cat(inputs_tokens_train, dim=0)
attention_masks_train = torch.cat(attention_masks_train, dim=0)

# Converting labels to numpy then torch tensor
y_train = np.zeros((len(train),))
for i in range(len(train)):
    y_train[i] = train[i][1]
y_train = torch.from_numpy(y_train)

train_dataset = TensorDataset(inputs_tokens_train, attention_masks_train, y_train)


# Let's do the same **for the test samples**

In [None]:
# Converting input tokens to torch tensors 
inputs_tokens_test = torch.cat(inputs_tokens_test, dim=0)
attention_masks_test = torch.cat(attention_masks_test, dim=0)
  
y_test = np.zeros((len(test),))
for i in range(len(test)):
    y_test[i] = test[i][1]
y_test = torch.from_numpy(y_test)

test_dataset = TensorDataset(inputs_tokens_test, attention_masks_test, y_test)


In [None]:
# If you need to clean GPU memory
import gc
gc.collect()
torch.cuda.empty_cache()


# Most important STEP

We want to extract the [CLS] representation (1st token of the last layer before logits) for each review, and store it in train and test.  

In [None]:
!pip install tqdm

In [None]:
from tqdm import tqdm

In [None]:
# Create the DataLoader (no shuffling, so the features stay aligned with the labels)
tb = 100  # batch size
temb = 768  # embedding size (assumed here: not defined in the cells shown; see the hidden-state shape above)
train_dataloader = DataLoader(train_dataset, batch_size=tb, shuffle=False)
nbTrain = len(train)  # number of training examples
f_train = np.zeros((nbTrain, temb))  # training features: the [CLS] vector of each review
nbtach = len(train_dataloader)
print(f"nb batches = {nbtach}")

# Computing CLS features
model.cuda()
for idx, batch in enumerate(tqdm(train_dataloader)):
    # Unpack this training batch from our dataloader:
    # `batch` contains three pytorch tensors:
    #   [0]: input ids
    #   [1]: attention masks
    #   [2]: labels
    # if(idx%10==0):
    #     print(f"batch {idx} / {nbtach}")
    b_input_ids = batch[0].cuda()
    b_input_mask = batch[1].cuda()
    b_labels = batch[2].cuda().long()

    # We are not training, so gradients are not computed (faster and saves memory)
    with torch.no_grad():
        # forward propagation (evaluate model on training batch)
        output = model(
            input_ids=b_input_ids,
            attention_mask=b_input_mask,
            # labels=b_labels,
            output_hidden_states=True,
        )
        # WARNING: it is now a batch of size tbatch x nToken x embsize (100 x 512 x 768)
        last_hidden_states = output.hidden_states[-1]
        # Keep the first token ([CLS]) of each of the tb reviews in the batch
        f_train[idx * tb : idx * tb + tb, :] = last_hidden_states.detach().cpu().numpy()[:, 0, :]



nb batches=250
batch 0 / 250
batch 10 / 250
batch 20 / 250
batch 30 / 250
batch 40 / 250
batch 50 / 250
batch 60 / 250
batch 70 / 250
batch 80 / 250
batch 90 / 250
batch 100 / 250
batch 110 / 250
batch 120 / 250
batch 130 / 250
batch 140 / 250
batch 150 / 250
batch 160 / 250
batch 170 / 250
batch 180 / 250
batch 190 / 250
batch 200 / 250
batch 210 / 250
batch 220 / 250
batch 230 / 250
batch 240 / 250


# Extract [CLS] token in TEST
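
The test-side cell is not shown here. A minimal sketch, assuming it mirrors the training loop above and produces the `f_test` array that is saved to disk afterwards. The helper name `extract_cls_features` and its `device` argument are hypothetical.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader


def extract_cls_features(model, dataset, batch_size=100, device="cuda"):
    """Return an array (n_samples x hidden_size) holding the [CLS] vector of each sample."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    model.to(device)
    model.eval()
    chunks = []
    with torch.no_grad():  # inference only, no gradients needed
        for batch in loader:
            output = model(
                input_ids=batch[0].to(device),
                attention_mask=batch[1].to(device),
                output_hidden_states=True,
            )
            # Last layer, first token of every sequence: batch_size x hidden_size
            chunks.append(output.hidden_states[-1][:, 0, :].cpu().numpy())
    return np.concatenate(chunks, axis=0)


# In the notebook (assuming `model` and `test_dataset` are defined as above):
# f_test = extract_cls_features(model, test_dataset, batch_size=100)
```

Collecting per-batch chunks and concatenating them avoids having to know the embedding size in advance, and it handles a final batch that is smaller than `batch_size`.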

# Now save the embedding of each review into disk!

In [None]:
# Saving the features and labels
import pickle
# Open a file and use dump()
with open('train-data.pkl', 'wb') as file:
    # A new file will be created
    pickle.dump([f_train,y_train], file)

with open('test-data.pkl', 'wb') as file:
    # A new file will be created
    pickle.dump([f_test,y_test], file)  


In [None]:
import pickle

# Open the file in binary mode
with open('train-data.pkl', 'rb') as file:
    # Call the load method to deserialize
    [feature_train, ytrain] = pickle.load(file)

# Open the file in binary mode
with open('test-data.pkl', 'rb') as file:
    # Call the load method to deserialize
    [feature_test, ytest] = pickle.load(file)
    


In [None]:
import numpy as np
print(feature_train.shape[0])
print(feature_test.shape)

print(ytrain)
print(ytest)
print(np.linalg.norm(feature_train[10]))


25000
(25000, 768)
tensor([1., 1., 1.,  ..., 0., 0., 0.], dtype=torch.float64)
tensor([1., 1., 1.,  ..., 0., 0., 0.], dtype=torch.float64)
15.1259890017741


# Finally: train a logistic regression model on top of the extracted embeddings. Conclude on the performance of BERT for the sentiment classification task
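
The training cell is not shown here. A minimal sketch, assuming scikit-learn's `LogisticRegression` is fitted on the [CLS] features loaded from the pickle files. The helper name `fit_and_score` and the `max_iter` value are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_and_score(x_train, y_train, x_test, y_test, max_iter=1000):
    """Fit a logistic regression on fixed features and return (model, test accuracy)."""
    clf = LogisticRegression(max_iter=max_iter)
    # Labels may arrive as float arrays or CPU tensors; convert them to integer classes
    clf.fit(x_train, np.asarray(y_train).astype(int))
    accuracy = clf.score(x_test, np.asarray(y_test).astype(int))
    return clf, accuracy


# In the notebook (assuming the pickled arrays loaded above):
# clf, acc = fit_and_score(feature_train, ytrain, feature_test, ytest)
# print(f"Test accuracy: {acc:.3f}")
```

Only the linear classifier is trained here; the BERT weights stay frozen, which is what makes this a pure transfer setting.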

Nearly 90% is pretty good for pure transfer. We could also fine-tune: take BERT and optimize it on the labels of the target task, which should give even better results.

The model is very heavy and very powerful, but its advantage over other models is most pronounced on much finer-grained tasks.