**BERT Model for Sentiment Analysis**

Sentiment analysis is a core task in Natural Language Processing (NLP). It determines whether people feel positive, negative, or neutral about products, movies, and similar items. Companies and other organizations use it to understand how their products and services are received and to act on that feedback.

## **Importing the libraries**

In [None]:
!pip install -q transformers

In [None]:
import tensorflow as tf
import pandas as pd
import tensorflow_datasets as tfds
from transformers import TFBertForSequenceClassification

In [None]:
from google.colab import drive
# mount Google Drive; mount() returns None, so do not reassign `drive`
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


## **Loading the dataset**

In [None]:
df_train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DeepL2/dataset_textClassification/train.csv')

In [None]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,film-url,review,polarity
0,0,http://www.allocine.fr/film/fichefilm-135259/c...,Si vous cherchez du cinéma abrutissant à tous ...,0
1,1,http://www.allocine.fr/film/fichefilm-172430/c...,"Trash, re-trash et re-re-trash...! Une horreur...",0
2,2,http://www.allocine.fr/film/fichefilm-15105/cr...,"Et si, dans les 5 premières minutes du film, l...",0
3,3,http://www.allocine.fr/film/fichefilm-188629/c...,Mon dieu ! Quelle métaphore filée ! Je suis ab...,0
4,4,http://www.allocine.fr/film/fichefilm-23514/cr...,"Premier film de la saga Kozure Okami, ""Le Sabr...",1


In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  160000 non-null  int64 
 1   film-url    160000 non-null  object
 2   review      160000 non-null  object
 3   polarity    160000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 4.9+ MB


In [None]:
df_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DeepL2/dataset_textClassification/test.csv')

In [None]:
df_train = df_train[['review','polarity']]

In [None]:
df_test = df_test[['review','polarity']]

In [None]:
df_train.shape

(160000, 2)

In [None]:
df_test.shape

(20000, 2)

Here we shuffle the train and test sets and keep a small subset of each (10,000 and 2,000 rows, respectively).

In [None]:
df_train_small = df_train.sample(frac=1)[:10000]

In [None]:
df_test_small = df_test.sample(frac=1)[:2000]

We convert the train and test dataframes to TensorFlow datasets so that they can be fed to the model.

In [None]:
# convert pandas df to tensorflow

ds_train = tf.data.Dataset.from_tensor_slices((df_train_small['review'], df_train_small['polarity'])).prefetch(10)
ds_train = ds_train.map(lambda x, y: (x,y))

ds_test = tf.data.Dataset.from_tensor_slices((df_test_small['review'], df_test_small['polarity'])).prefetch(10)
ds_test = ds_test.map(lambda x, y: (x,y))

In [None]:
type(ds_train)

tensorflow.python.data.ops.map_op._MapDataset

In [None]:
next(iter(ds_train))

(<tf.Tensor: shape=(), dtype=string, numpy=b'Si vous cherchez du cin\xc3\xa9ma abrutissant \xc3\xa0 tous les \xc3\xa9tages,n\'ayant aucune peur du clich\xc3\xa9 en castagnettes et moralement douteux,"From Paris with love" est fait pour vous.Toutes les productions Besson,via sa fili\xc3\xa8re EuropaCorp ont de quoi faire na\xc3\xaetre la moquerie.Paris y est encore une fois montr\xc3\xa9e comme une capitale exotique,mais attention si l\'on se dirige vers la banlieue,on y trouve tout plein d\'int\xc3\xa9gristes musulmans pr\xc3\xaats \xc3\xa0 faire sauter le caisson d\'une ambassadrice am\xc3\xa9ricaine.Naus\xc3\xa9eux.Alors on se dit qu\'on va au moins pouvoir appr\xc3\xa9cier la d\xc3\xa9connade d\'un classique buddy-movie avec le jeune agent aux dents longues oblig\xc3\xa9 de faire \xc3\xa9quipe avec un vieux lou compl\xc3\xa8tement timbr\xc3\xa9.Mais d\'un c\xc3\xb4t\xc3\xa9,on a un Jonathan Rhys-meyers fayot au possible,et de l\'autre un John Travolta en total d\xc3\xa9lire narcissi

## **Loading the BERT tokenizer**

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
"""
The encode_plus function of the tokenizer tokenizes the raw input,
adds the special tokens, and pads the sequence to max_length (which we set below).
"""
def convert_example_to_feature(review):
  return tokenizer.encode_plus(review,
                add_special_tokens = True, # add [CLS], [SEP]
                max_length = max_length, # max length of the text that can go to BERT
                padding = 'max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length)
                truncation = True, # explicitly truncate reviews longer than max_length
                return_attention_mask = True, # add attention mask to not focus on pad tokens
              )
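
To make the roles of the special tokens, padding, and attention mask concrete, here is a minimal pure-Python sketch of what the tokenizer returns for one example once the text has already been turned into token ids. The helper `pad_and_mask` and the two word ids are hypothetical placeholders; only the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) match the `bert-base-uncased` vocabulary. The real tokenizer also performs WordPiece splitting, which this sketch leaves out.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_and_mask(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad, and build the masks."""
    # Reserve two positions for the special tokens, then truncate the rest.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask value 1, padding positions get 0.
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    # Single-sentence input: every position belongs to segment 0.
    token_type_ids = [0] * max_length
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

features = pad_and_mask([7592, 2088], max_length=6)  # placeholder word ids
print(features["input_ids"])       # [101, 7592, 2088, 102, 0, 0]
print(features["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

With `max_length = 512`, every review becomes three lists of 512 integers, however short the original text is.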

In [None]:
# can be up to 512 for BERT
max_length = 512
batch_size = 8

In [None]:
"""
The following helper functions will help us to transform our raw data to an appropriate format ready to feed into the BERT model
"""
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

In [None]:
def encode_examples(ds, limit=-1):
  # prepare list, so that we can build up final TensorFlow dataset from slices.
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []
  if (limit > 0):
      ds = ds.take(limit)
  for review, label in tfds.as_numpy(ds):
    bert_input = convert_example_to_feature(review.decode())
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append([label])
  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

Now let's build the train and test datasets.

In [None]:
# train dataset
ds_train_encoded = encode_examples(ds_train).batch(batch_size)

# test dataset
ds_test_encoded = encode_examples(ds_test).batch(batch_size)
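
As a rough illustration of what `.batch(batch_size)` does, the sketch below groups a list into consecutive chunks in plain Python. The helper name `batched` and the toy data are assumptions for illustration; `tf.data` applies the same grouping to tensors and likewise emits a smaller final batch unless `drop_remainder=True` is passed.

```python
import math

def batched(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

examples = list(range(10))  # stand-in for 10 encoded reviews
batches = list(batched(examples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# The number of batches (training steps per epoch) is ceil(n / batch_size).
print(len(batches) == math.ceil(len(examples) / 4))  # True
```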



## **Hyperparameter settings and model initialization**

In [None]:
# recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# more epochs may help, as long as the model does not overfit
number_of_epochs = 5
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# labels are integer class ids, not one-hot vectors, so we use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
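
The loss and metric chosen above work on integer class ids rather than one-hot vectors. As a minimal sketch of what they compute, the plain-Python functions below (hypothetical helpers, not part of Keras) take raw logits and an integer label: the loss is the negative log of the softmax probability assigned to the true class, and accuracy is the fraction of examples whose largest logit matches the label. Keras additionally averages the loss over the batch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Loss for one example: -log p(true class), with an integer label."""
    return -math.log(softmax(logits)[label])

def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose argmax logit equals the integer label."""
    hits = sum(
        1
        for logits, label in zip(batch_logits, labels)
        if max(range(len(logits)), key=logits.__getitem__) == label
    )
    return hits / len(labels)

batch_logits = [[2.0, -1.0], [0.2, 0.9], [-0.5, 1.5]]  # toy logits, 2 classes
labels = [0, 1, 0]                                      # toy integer labels

# Equal logits mean a 50/50 guess, so the loss is log(2).
print(round(sparse_categorical_crossentropy([0.0, 0.0], 1), 4))  # 0.6931
print(sparse_categorical_accuracy(batch_logits, labels))         # 2 of 3 correct
```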

## **Model training**

In [None]:
bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_test_encoded)

Epoch 1/5
Epoch 2/5
Epoch 3/5
 300/2500 [==>...........................] - ETA: 33:03 - loss: 0.1498 - accuracy: 0.9429

KeyboardInterrupt: ignored

**After 2 epochs, the model reaches over 93% accuracy on the validation set.**

## **Test on random sample**

In [None]:
test_sentence = "la voiture a connu un accident"  # French for "the car had an accident"

predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
tf_output = model.predict(predict_input)[0]
tf_prediction = tf.nn.softmax(tf_output, axis=1)
labels = ['Negative','Positive'] #(0:negative, 1:positive)
label = tf.argmax(tf_prediction, axis=1)
label = label.numpy()
print(labels[label[0]])

Negative
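
The prediction step can be read as two separate operations: softmax turns the two logits into probabilities, and argmax picks the class. The small plain-Python sketch below uses made-up logits (the values and the helper `predict_label` are illustrative assumptions, not model output). It shows that the chosen label depends only on which logit is largest, so softmax is needed only when the probability itself is of interest.

```python
import math

LABELS = ["Negative", "Positive"]  # 0: negative, 1: positive

def predict_label(logits):
    """Return (label, probability) for one example's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Softmax preserves ordering, so this equals the argmax of the logits.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, prob = predict_label([1.2, -0.8])  # made-up logits
print(label)  # Negative
```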


In [None]:
# save model
model.save('/content/model_textClassification')

## **Reference:**

https://www.analyticsvidhya.com/blog/2021/12/fine-tune-bert-model-for-sentiment-analysis-in-google-colab/