# Sentiment classification - close to the state of the art

- The task of classifying sentiments of texts (for example movie or product reviews) has high practical significance in online marketing as well as financial prediction. This is a non-trivial task, since the concept of sentiment is not easily captured.

- For this assignment you have to use the larger [IMDB sentiment](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) benchmark dataset from Stanford and achieve close to state-of-the-art results.

- In the previous notebook we applied statistical models that were not pre-trained. In this notebook we use the pre-trained BERT model for model building.

- The notebook was executed in Google Colab because it provides GPU support. Please enable the GPU before running it.






# Data download

In [1]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz
!ls

--2022-12-22 20:19:22--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-12-22 20:19:24 (34.2 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

aclImdb  aclImdb_v1.tar.gz  sample_data


# Alternative with tensorflow_datasets

In [2]:
!pip install tensorflow-datasets > /dev/null

In [3]:
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd


#model
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences


In [4]:
# Connecting to the google drive to save the temporary files for the execution
from google.colab import drive
drive.mount('/content/drive')
from IPython.display import clear_output

Mounted at /content/drive


In [5]:
(ds_train,ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [6]:
ds_info

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset.
    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num_shards=1>,
        '

In [7]:
df = tfds.as_dataframe(ds_train, ds_info)
df.head()

| | label | text |
|---|---|---|
| 0 | 0 | b"This was an absolutely terrible movie. Don't... |
| 1 | 0 | b'I have been known to fall asleep during film... |
| 2 | 0 | b'Mann photographs the Alberta Rocky Mountains... |
| 3 | 1 | b'This is the kind of film for a snowy Sunday ... |
| 4 | 1 | b'As others have mentioned, all the women that... |


In [8]:


[(train_features, label_batch)] = ds_train.take(1)
print(np.array(train_features))

b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."


# Data Extraction 

Convert the TensorFlow dataset into a pandas DataFrame for further processing.

In [9]:
# tfds.as_dataframe (used above) also exists; this helper additionally
# decodes the review text from bytes to str
def tensors_to_pandas(dataset):
  """Convert a (text, label) tf.data.Dataset into a pandas DataFrame."""
  texts = []
  labels = []
  # Each element is a (text, label) pair of scalar tensors
  for text, label in dataset:
    texts.append(text.numpy().decode('utf-8'))  # bytes -> str
    labels.append(label.numpy())

  return pd.DataFrame({'Text': texts, 'Label': labels})

In [10]:
train_df = tensors_to_pandas(ds_train)
test_df = tensors_to_pandas(ds_test)

In [11]:
train_df.head()

| | Text | Label |
|---|---|---|
| 0 | This was an absolutely terrible movie. Don't b... | 0 |
| 1 | I have been known to fall asleep during films,... | 0 |
| 2 | Mann photographs the Alberta Rocky Mountains i... | 0 |
| 3 | This is the kind of film for a snowy Sunday af... | 1 |
| 4 | As others have mentioned, all the women that g... | 1 |


In [12]:
train_df.shape

(25000, 2)

In [13]:
test_df.head()

| | Text | Label |
|---|---|---|
| 0 | There are films that make careers. For George ... | 1 |
| 1 | A blackly comic tale of a down-trodden priest,... | 1 |
| 2 | Scary Movie 1-4, Epic Movie, Date Movie, Meet ... | 0 |
| 3 | Poor Shirley MacLaine tries hard to lend some ... | 0 |
| 4 | As a former Erasmus student I enjoyed this fil... | 1 |


In [14]:
test_df.shape

(25000, 2)

In [15]:
train_df['Label'].value_counts()

0    12500
1    12500
Name: Label, dtype: int64

In [16]:
test_df['Label'].value_counts()

1    12500
0    12500
Name: Label, dtype: int64

The train and test sets are the same size (25,000 reviews each), and each contains the same number of positive and negative reviews (12,500 per class).

# Transformers BERT 

In [17]:
!pip install transformers 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [18]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


In [19]:
# BERT accepts at most 512 tokens (not characters) per input sequence
max_length = 512
batch_size = 6

In [20]:
def map_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

In [30]:
def encode_text_input(pandas_df):
  # Collect per-example lists so the final TensorFlow dataset can be built from slices.
  input_ids = []
  token_ids = []
  attention_mask = []
  label_list = []

  for text, label in zip(pandas_df.Text, pandas_df.Label):
    bert_input = tokenizer.encode_plus(text,
                add_special_tokens = True, # add the special tokens [CLS] and [SEP]
                max_length = max_length, # maximum sequence length passed to BERT
                padding = 'max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length)
                truncation = True, # explicitly cut reviews longer than max_length
                return_attention_mask = True, # mask so the model ignores [PAD] tokens
              )

    input_ids.append(bert_input['input_ids'])
    token_ids.append(bert_input['token_type_ids'])
    attention_mask.append(bert_input['attention_mask'])
    label_list.append([label])

  return tf.data.Dataset.from_tensor_slices((input_ids, attention_mask, token_ids, label_list)).map(map_to_dict)
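
To make the padding and attention-mask comments in the encoding function concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token IDs. The helper `pad_and_mask` and the example IDs are illustrative assumptions, not part of the `transformers` API, and the truncation here is simplified: it just cuts the list, whereas the real tokenizer also keeps the closing `[SEP]` token.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token IDs and build its attention mask.

    Hypothetical helper for illustration only; pad_id=0 is an assumed [PAD] ID.
    """
    kept = token_ids[:max_length]        # simplified truncation of long inputs
    n_pad = max_length - len(kept)       # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad  # right-pad up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative IDs standing in for a short tokenized review
ids, mask = pad_and_mask([101, 2023, 3185, 102], max_length=6)
print(ids)   # [101, 2023, 3185, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every encoded review ends up with exactly `max_length` entries, which is what allows the lists to be stacked into fixed-shape tensors by `from_tensor_slices`.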

In [28]:
semi_train = train_df[:22000]
valid_df = train_df[22000:]
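
The slice above keeps the first 22,000 rows for training and the remaining 3,000 for validation, which is fine as long as the rows are already in random order. If that were in doubt, a shuffled split would be one alternative. The sketch below is an assumption-laden illustration using only the standard library; `shuffled_split` and the seed value are hypothetical and not part of the notebook.

```python
import random


def shuffled_split(n_rows, n_train, seed=42):
    """Return (train_indices, valid_indices) after a reproducible shuffle.

    Hypothetical helper; the seed is an arbitrary choice for reproducibility.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state alone
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = shuffled_split(25000, 22000)
print(len(train_idx), len(valid_idx))  # 22000 3000
```

With pandas, these index lists could be passed to `train_df.iloc[train_idx]` and `train_df.iloc[valid_idx]` to obtain the two frames.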


In [None]:
# train dataset
ds_train_encoded = encode_text_input(semi_train).shuffle(10000).batch(batch_size)


# valid dataset
df_valid_encoded = encode_text_input(valid_df).batch(batch_size)



In [32]:
# test dataset
ds_test_encoded = encode_text_input(test_df).batch(batch_size)

In [33]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

# setting the Learning Rate
learning_rate = 2e-5

# Train for a single epoch: training is slow, and more epochs risk overfitting
number_of_epochs = 1
# model initialization
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [34]:
# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

# using sparse categorical cross entropy 
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
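
As a rough illustration of what `SparseCategoricalCrossentropy(from_logits=True)` computes for one example, the sketch below evaluates the negative log-probability of the true class directly from raw logits. It is a pure-Python approximation for intuition, not the Keras implementation; the function name and the logits are made up. Keras additionally averages this value over the batch.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label]).

    Hypothetical helper; uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


logits = [2.0, -1.0]  # made-up logits for (negative, positive)
loss_correct = sparse_ce_from_logits(logits, 0)  # true label agrees with the larger logit
loss_wrong = sparse_ce_from_logits(logits, 1)    # true label disagrees
print(loss_correct, loss_wrong)
```

The loss is small when the larger logit belongs to the true class and grows as the model becomes confidently wrong. "Sparse" means the label is an integer class index rather than a one-hot vector, which matches the `Label` column used here.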

In [35]:
bert_history = bert_model.fit(ds_train_encoded, 
                         epochs=number_of_epochs, 
                         validation_data=df_valid_encoded)



In [36]:

# Predict the sentiment for the test set
# The model output contains logits
predict_values = bert_model.predict(ds_test_encoded)



In [37]:
# Using the Softmax to extract the probabilities from the logits 
final_results = tf.nn.softmax(predict_values[0], axis=1)

In [38]:
# Finally applying the argmax to get the final labels 
label = tf.argmax(final_results, axis=1)
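
The two post-processing steps above (softmax, then argmax) can be sketched in plain Python on made-up logits. The helper names and the values are illustrative assumptions; the point is that softmax turns each row of logits into probabilities that sum to 1, and that the argmax is the same before and after softmax because softmax preserves ordering.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for two reviews: columns are (negative, positive)
batch_logits = [[-1.2, 2.3], [0.8, -0.5]]
predicted = []
for row in batch_logits:
    probs = softmax(row)
    assert argmax(probs) == argmax(row)  # softmax does not change the winner
    predicted.append(argmax(probs))

print(predicted)  # [1, 0]
```

So the softmax step is only needed when the probabilities themselves are of interest; for hard labels, taking the argmax of the logits gives the same result.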

In [39]:
from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(test_df['Label'], label.numpy() ))

              precision    recall  f1-score   support

           0       0.92      0.95      0.93     12500
           1       0.95      0.91      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000
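
For readers unfamiliar with the columns of the report, the following sketch shows how precision, recall and F1 for one class follow from counts of true positives, false positives and false negatives. It is a hand-rolled illustration on toy labels, not the scikit-learn implementation, and `binary_report` is a hypothetical name.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # found among actual positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)
```

Read this way, the report above says that class 0 has higher recall (0.95) than precision (0.92), while class 1 shows the opposite pattern, so the model leans slightly towards predicting the negative class.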



In [40]:
# Saving the model file 
bert_model.save('/content/drive/MyDrive/NLP_FS_Assignments_2022/Final_assignment/hugging_face_BERT_v2')



Motivation:

Previously I used the BERT models from TensorFlow Hub, but I could not achieve an accuracy above 90%.

I chose bert-base-uncased because, for sentiment prediction on movie reviews, letter case should matter little. Since this is a downstream task, fine-tuning this pre-trained transformer should give better results. The challenge is that the model might overfit with even one more epoch.

The model definitely outperforms the previously built models, and it maintains a good predictive balance across both sentiment classes.

# End of the Notebook