We use a transformer-based model: Bidirectional Encoder Representations from Transformers (BERT). BERT is a language model that uses deep learning to generate contextual representations of the words in a sentence. It was developed by researchers at Google and has proven effective across a wide range of natural language processing tasks, such as sentiment analysis, text summarization, and named entity recognition.

One of BERT's key features is that it is bidirectional: when it builds the representation of a word, it takes into account the context on both the left and the right of that word. This contrasts with unidirectional language models, which condition only on the context to one side of the word (typically the left). The bidirectional approach lets BERT capture a more nuanced and accurate representation of word meaning, which makes it well suited to many natural language processing tasks.

BERT is built on the Transformer architecture, a type of neural network that is particularly well suited to sequential data such as text. Through self-attention, the Transformer processes all tokens of a sequence in parallel (up to 512 tokens for the standard BERT checkpoints) and relates every token to every other token, so BERT can capture relationships between words at different positions in the sentence.
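To make the difference between bidirectional and one-sided context concrete, here is a minimal NumPy sketch of attention weights under two masks. It is only an illustration under simplifying assumptions (uniform toy scores, a hypothetical helper `attention_weights`), not BERT's actual implementation.

```python
import numpy as np

def attention_weights(scores, mask):
    """Softmax over each row of `scores`; positions where mask is False get zero weight."""
    masked = np.where(mask, scores, -1e9)                 # block disallowed positions
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # toy scores: every token pair equally relevant

# Bidirectional (BERT-style): every token may attend to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)
# Left-to-right: token i may only attend to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

bi = attention_weights(scores, bidirectional_mask)
causal = attention_weights(scores, causal_mask)

print(bi[0])      # the first token spreads its attention over all four positions
print(causal[0])  # the first token can only attend to itself
```

With the bidirectional mask, even the first token receives information from the words that follow it, which is the property described above.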

On top of BERT's pooled output we add a hidden layer with 1024 neurons and ReLU activation, followed by an output layer with 5 neurons (one per sentiment class) and softmax activation.
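As a quick sanity check on the size of this classification head, the parameter counts of the two dense layers can be worked out by hand. The sketch below assumes the pooled BERT output has 768 dimensions (the hidden size of the base BERT checkpoints); `dense_params` is a hypothetical helper introduced only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input/unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

bert_hidden_size = 768  # assumed pooled-output size of a base BERT model

hidden_layer = dense_params(bert_hidden_size, 1024)
output_layer = dense_params(1024, 5)

print(hidden_layer)  # 787456
print(output_layer)  # 5125
```

The first number agrees with the 787456 parameters that `model.summary()` reports for the 1024-unit dense layer further down.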


In [46]:
import os
import zipfile

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from transformers import TFBertModel, BertConfig, BertTokenizerFast, TFAutoModel

# List the files available in the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Extract the training and test archives into the working directory
with zipfile.ZipFile('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip', 'r') as zip_ref:
    zip_ref.extractall("./sentiment-analysis-on-movie-reviews/")
with zipfile.ZipFile('/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip', 'r') as zip_ref:
    zip_ref.extractall("./sentiment-analysis-on-movie-reviews/")


/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv
/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip
/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip


In [49]:
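# Load the training data and keep only the text and label columns
data = pd.read_table("/kaggle/working/sentiment-analysis-on-movie-reviews/train.tsv", sep='\t')
data = data[['Phrase', 'Sentiment']].copy()
# Word counts of the first 10 phrases only; used later to choose max_length
dff = [len(i.split(" ")) for i in data.Phrase[:10]]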

In [51]:
X_train, X_val, y_train, y_val = train_test_split(data.index.values, 
                                                  data.Sentiment.values, 
                                                  test_size=0.15, 
                                                  random_state=42)

data['data_type'] = ['not_set']*data.shape[0]

data.loc[X_train, 'data_type'] = 'train'
data.loc[X_val, 'data_type'] = 'val'


In [52]:
model_name = 'bert-base-cased'

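# Maximum sequence length in tokens: the longest sampled phrase (in words)
# plus a small margin for special tokens such as [CLS] and [SEP]
max_length = max(dff) + 3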

# Load transformers config 
config = BertConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

In [53]:
# Build the model inputs
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(max_length,), name='attention_mask', dtype='int32')
inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}

# Load the pretrained encoder, reusing the model name and config defined above
bert = TFAutoModel.from_pretrained(model_name, config=config)
embeddings = bert.bert(inputs)[1]  # access pooled activations with [1]

# Classification head: 1024-unit ReLU layer followed by a 5-way softmax
x = Dense(1024, activation='relu')(embeddings)
y = Dense(5, activation='softmax', name='outputs')(x)
model = Model(inputs=inputs, outputs=y)

model.summary()

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
attention_mask (InputLayer)     [(None, 40)]         0                                            
__________________________________________________________________________________________________
input_ids (InputLayer)          [(None, 40)]         0                                            
__________________________________________________________________________________________________
bert (TFBertMainLayer)          TFBaseModelOutputWit 108310272   attention_mask[0][0]             
                                                                 input_ids[0][0]                  
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1024)         787456      bert[0][1]                 

In [54]:
y_senti = to_categorical(data[data.data_type=='train'].Sentiment)

# Tokenize the input 
x = tokenizer(
    text=data[data.data_type=='train'].Phrase.to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # always pad to max_length so shapes match the model inputs
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

train=tf.data.Dataset.from_tensor_slices((x['input_ids'], x['attention_mask'], y_senti))
def map_func(input_ids, masks, labels):
    # convert three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

train = train.map(map_func)
batch_size = 32
train = train.shuffle(100).batch(batch_size, drop_remainder=True)
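The tokenizer call above truncates long phrases, pads short ones, and returns an attention mask that marks which positions hold real tokens. The following pure-Python sketch mimics that behaviour on a toy input. The helper `pad_and_mask` and the token IDs are hypothetical and only illustrate the idea; a real tokenizer also keeps the closing special token when it truncates, which this simplification does not.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token IDs and build the matching attention mask."""
    ids = token_ids[:max_length]          # naive truncation (simplified)
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

# Illustrative IDs only, not the output of a real tokenizer
short_phrase = [101, 1188, 1110, 1363, 102]
ids, mask = pad_and_mask(short_phrase, max_length=8)

print(ids)   # [101, 1188, 1110, 1363, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask tells the model to ignore the padded positions when it computes attention.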

In [56]:
y_senti = to_categorical(data[data.data_type=='val'].Sentiment)

# Tokenize the input 
x = tokenizer(
    text=data[data.data_type=='val'].Phrase.to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # always pad to max_length so shapes match the model inputs
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

val = tf.data.Dataset.from_tensor_slices((x['input_ids'], x['attention_mask'], y_senti))
val = val.map(map_func)
# Validation data needs no shuffling, and keeping the last partial batch
# ensures every validation example is scored
val = val.batch(batch_size)

In [57]:
# `learning_rate` replaces the deprecated `lr` argument
optimizer = Adam(learning_rate=1e-5, decay=1e-6)
loss = CategoricalCrossentropy()
acc = CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])
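To show what the loss and metric configured above measure, here is a small NumPy sketch of one-hot encoding, softmax, categorical cross-entropy, and categorical accuracy. The helper functions and the toy logits are assumptions made for illustration; they approximate, but are not, the Keras implementations.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Integer labels -> one-hot rows, similar in spirit to to_categorical."""
    return np.eye(num_classes)[labels]

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example loss: negative log-probability assigned to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -(y_true * np.log(y_pred)).sum(axis=-1)

def categorical_accuracy(y_true, y_pred):
    """Fraction of examples whose most probable class is the true class."""
    return float((y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)).mean())

labels = np.array([2, 4])                      # toy sentiment labels in 0..4
logits = np.array([[0.0, 0.0, 0.0, 0.0, 0.0],  # model has no preference
                   [0.0, 0.0, 0.0, 0.0, 3.0]]) # model favours class 4

y_true = one_hot(labels, num_classes=5)
y_pred = softmax(logits)
losses = categorical_crossentropy(y_true, y_pred)

print(losses)
print(categorical_accuracy(y_true, y_pred))
```

For the first example the prediction is uniform over five classes, so its loss equals ln 5; the second example is confidently correct and receives a much smaller loss.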

In [58]:
# Fit the model
history = model.fit(
    train,
    validation_data=val,
    epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [59]:
model.save('./sentiment-analysis-on-movie-reviews/sentiment_model')

In [73]:
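# Note: this 80/20 split differs from the 85/15 split used for training,
# so some of the rows held out here were part of the training set and the
# scores below are likely optimistic.
X_train, X_val, y_train, y_val = train_test_split(data.index.values, 
                                                  data.Sentiment.values, 
                                                  test_size=0.20, 
                                                  random_state=42)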

In [74]:
test = data.loc[X_val]

In [75]:
x = tokenizer(
    text=test.Phrase.to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # always pad to max_length so shapes match the model inputs
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

In [76]:
items=tf.data.Dataset.from_tensor_slices((x['input_ids'],x['attention_mask']))

In [77]:
items

<TensorSliceDataset shapes: ((40,), (40,)), types: (tf.int32, tf.int32)>

In [78]:
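def map_inputs(input_ids, masks):
    # convert the two-item tuple into the input dictionary the model expects;
    # a separate name avoids overwriting the three-argument map_func used for training
    return {'input_ids': input_ids, 'attention_mask': masks}

items = items.map(map_inputs)
items = items.batch(32)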

In [80]:
predictions=model.predict(items).argmax(axis=-1)

In [81]:
from sklearn.metrics import classification_report

report = classification_report(y_val,predictions)
print(report)

              precision    recall  f1-score   support

           0       0.58      0.53      0.55      1416
           1       0.62      0.68      0.65      5527
           2       0.81      0.81      0.81     15639
           3       0.67      0.59      0.63      6707
           4       0.55      0.68      0.61      1923

    accuracy                           0.72     31212
   macro avg       0.65      0.66      0.65     31212
weighted avg       0.72      0.72      0.72     31212
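The two summary rows of the report can be reproduced from the per-class rows. The sketch below copies the rounded per-class scores and supports from the report above; `macro_avg` and `weighted_avg` are hypothetical helpers. Because the inputs are already rounded, the results can differ slightly from what scikit-learn computes internally.

```python
# Per-class values copied from the classification report (classes 0..4)
precision = [0.58, 0.62, 0.81, 0.67, 0.55]
recall = [0.53, 0.68, 0.81, 0.59, 0.68]
f1 = [0.55, 0.65, 0.81, 0.63, 0.61]
support = [1416, 5527, 15639, 6707, 1923]

def macro_avg(scores):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(scores) / len(scores)

def weighted_avg(scores, weights):
    """Mean weighted by class support: frequent classes dominate."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

for name, scores in [("precision", precision), ("recall", recall), ("f1-score", f1)]:
    print(f"{name:>9}  macro={macro_avg(scores):.2f}  "
          f"weighted={weighted_avg(scores, support):.2f}")
```

The weighted averages sit above the macro averages because the neutral class (label 2) is both the most frequent and the best predicted, while the macro average gives the weaker extreme classes (0 and 4) equal weight.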

