- **Author:** Aisling Towey
- **Date:** 25th August 2021

# Overview

The dataset used in this analysis contains 50k IMDB movie reviews labelled as either positive or negative. It can be downloaded at https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews.
    
The aim of this analysis is to create a supervised text classification model that predicts whether a movie review is positive or negative, along with a confidence score. The dataset is clean and balanced, so the focus here is on building and training the model rather than on preprocessing.

Transfer learning is used for this sentiment model: a bert-base-uncased model is loaded from the Hugging Face transformers library and used as a starting point. BERT was pretrained on unlabelled text to learn general features of language and can be fine-tuned for other tasks such as text classification. BERT models have helped achieve state-of-the-art results in recent years, largely because of their bidirectional training, which means they consider context from both the left and the right of each token. BERT also uses the transformer architecture, whose attention mechanism lets the model focus on the most relevant tokens. The Keras library is used to fine-tune the model in this code.

# Import Modules

In [1]:
# !pip install scikit-learn
# !pip install pandas
# !pip install tensorflow
# !pip install transformers
# !pip install arrow
# !pip install ipywidgets
# !jupyter nbextension enable --py widgetsnbextension

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.losses import CategoricalCrossentropy, BinaryCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy, BinaryAccuracy
from tensorflow.keras.models import model_from_json, Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, CSVLogger 
from transformers import BertTokenizerFast, DistilBertTokenizerFast, AutoTokenizer, AutoConfig, TFAutoModel
import tensorflow as tf
import arrow
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample
from tensorflow.keras.layers import Input, Dropout, Dense, BatchNormalization
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from pprint import pprint as pp

runtime_id = arrow.utcnow().isoformat()[0:10]
logname = 'training_' + arrow.utcnow().isoformat()[0:16]
tf.test.is_gpu_available()

True

# Data Preprocessing

First get a general overview of the data.

In [3]:
data = pd.read_csv('imdb_data.csv')
print(data.info())
print(f'\nClass split: \n{data["sentiment"].value_counts()}')
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
None

Class split: 
positive    25000
negative    25000
Name: sentiment, dtype: int64


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


As there are a lot of rows in this dataset, we are going to take a sample of the data to speed up training. We will then split the sample into training, validation and test sets.

In [4]:
# shuffle the data before splitting into train, val and test sets in case it was ordered in some way
data = data.sample(frac=1).reset_index(drop=True)
data = data.sample(8000)
# set up training, val and test sets
train_df, test_df = train_test_split(data, test_size = 0.2, stratify = data.sentiment, random_state = 12)
train_df, val_df = train_test_split(train_df, test_size = 0.1, stratify = train_df.sentiment, random_state =12)

# train_df.to_csv("train_df.csv", index = False)
# val_df.to_csv("val_df.csv", index = False)
# test_df.to_csv("test_df.csv", index = False)

print(f"Number of rows in training data: ", len(train_df))
print(f"Number of rows in testing data: ", len(test_df))
print(f"Number of rows in validation data: ", len(val_df))

Number of rows in training data:  5760
Number of rows in testing data:  1600
Number of rows in validation data:  640
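
The row counts printed above follow directly from the two nested splits. A minimal sketch of the arithmetic, using only the sample size and the `test_size` values from the cell above (the variable names are illustrative and not part of the notebook):

```python
# Sizes produced by the two nested splits of the 8,000-row sample.
sample_size = 8000

# First split: 20% of the sample is held out as the test set.
n_test = sample_size * 20 // 100
n_train_val = sample_size - n_test

# Second split: 10% of the remaining rows are held out for validation.
n_val = n_train_val * 10 // 100
n_train = n_train_val - n_val

print(n_train, n_val, n_test)  # 5760 640 1600
```

In other words, the effective split of the sample is 72% training, 8% validation and 20% test.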


In [5]:
train_df['sentiment'].value_counts()

positive    2902
negative    2858
Name: sentiment, dtype: int64

Now we can load the tokenizer and tokenize the text in the training and validation sets.

In [6]:
MODEL_NAME = 'bert-base-uncased'
MAX_LENGTH = 120 # train_df['review'].str.split().str.len().mean() - to check average length
config = AutoConfig.from_pretrained(MODEL_NAME)
config.output_hidden_states = True
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = MODEL_NAME, config = config)
transformer_model = TFAutoModel.from_pretrained(MODEL_NAME, config = config)
transformer_model = transformer_model.layers[0]
        
encoder = OneHotEncoder()
train_df['label'] = pd.Categorical(train_df['sentiment'])
val_df['label'] = pd.Categorical(val_df['sentiment'])

# one hot encode the labels (fit on the training set only, then reuse the fitted encoder for validation)
train_df['one_hot'] = encoder.fit_transform(np.array(train_df['label']).reshape(-1, 1)).toarray().tolist()
val_df['one_hot'] = encoder.transform(np.array(val_df['label']).reshape(-1, 1)).toarray().tolist()
        
# Prepare labels for the model
y_train = tf.convert_to_tensor(train_df['one_hot'].tolist())
y_val = tf.convert_to_tensor(val_df['one_hot'].tolist())

# Tokenize the training set reviews
x_train = tokenizer(
    text=train_df['review'].to_list(),
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length',  # always pad to MAX_LENGTH so the shape matches the model input
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

# Tokenize the validation set reviews
x_val = tokenizer(
    text=val_df['review'].to_list(),
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length',  # always pad to MAX_LENGTH so the shape matches the model input
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

topic_list = list(encoder.categories_[0])
topic_dict = {v:k for v,k in enumerate(topic_list)}

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
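
The comment next to `MAX_LENGTH` above hints at checking review lengths before picking a value. A rough sketch of such a check is shown here; `length_summary` and the toy reviews are hypothetical, and whitespace word counts are only a proxy, since WordPiece tokenization usually produces more tokens than words and adds the special `[CLS]` and `[SEP]` tokens.

```python
from statistics import mean

def length_summary(texts, max_length):
    """Summarise text lengths in whitespace-separated words (a rough proxy for tokens)."""
    word_counts = [len(text.split()) for text in texts]
    share_over = sum(count > max_length for count in word_counts) / len(word_counts)
    return {
        "mean_words": mean(word_counts),
        "max_words": max(word_counts),
        "share_over_max_length": share_over,  # fraction of texts that would be truncated
    }

# Toy reviews standing in for train_df['review'].to_list()
toy_reviews = ["great film", "a slow and boring movie", "loved it"]
summary = length_summary(toy_reviews, max_length=4)
print(summary)
```

If a large share of reviews exceeds the chosen length, a higher `MAX_LENGTH` keeps more of each review at the cost of slower training.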


# Build Model

Now we can prepare the training architecture: the BERT model followed by two dropout layers and two dense layers. The final dense layer has two units, one for each class we are trying to predict.

In [7]:
# model
NUM_EPOCHS = 5
BATCH_SIZE = 32
TRAIN_BERT_LAYER = True
KERNEL_INITIALIZER = "random_normal"
NUMBER_OF_CLASSES = 2 # exclude other from this count
LEARNING_RATE = 5e-5
OPTIMIZER = Adam(learning_rate=LEARNING_RATE)
LOSS = BinaryCrossentropy(from_logits = True)
METRIC = BinaryAccuracy('accuracy')
         
# callbacks
SAVE_BEST_MODEL = ModelCheckpoint('best_model.h5', save_best_only=True, monitor='val_loss', mode='min', save_weights_only=False)
STOP_EARLY = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
CSV_LOGGER = CSVLogger('keras_multilabel_log_{}.log'.format(runtime_id), append=False)

# predictions
PROBABILITY_THRESHOLD = 0.5

In [8]:
# build the model input
input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), name='input_ids', dtype='int32')
attention_mask = tf.keras.layers.Input(shape=(MAX_LENGTH,), name='attention_mask', dtype='int32') 
inputs = [input_ids, attention_mask]

# load the Transformers BERT model as a layer in a Keras model
bert_model = transformer_model(inputs)[1]
dropout = Dropout(0.1)
pooled_output = dropout(bert_model, training=False)

# build the model output
pooled_output = Dense(units=100)(pooled_output)
pooled_output = tf.keras.layers.Dropout(0.2)(pooled_output)
model_output = Dense(units=NUMBER_OF_CLASSES, kernel_initializer=KERNEL_INITIALIZER)(pooled_output)

# combine it all in a model object
model = Model(inputs=inputs, outputs=model_output)

# we could train only the layers after the bert layer, but training all layers seems to work better
for layer in model.layers[:3]:
    layer.trainable = True

print(model.summary())

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 120)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 120)]        0                                            
__________________________________________________________________________________________________
bert (TFBertMainLayer)          ((None, 120, 768), ( 109482240   input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
dropout_37 (Dropout)            (None, 768)          0           bert[0][1]            
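
The summary above is cut off before the dense layers, but their parameter counts can be worked out by hand. The sketch below assumes the layer sizes from the build cell (a 768-dimensional pooled output, `Dense(100)`, then `Dense(2)`); `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    # One weight per input/unit pair plus one bias per unit.
    return n_inputs * n_units + n_units

bert_params = 109_482_240            # taken from the model summary above
hidden_params = dense_params(768, 100)
output_params = dense_params(100, 2)
total_params = bert_params + hidden_params + output_params

print(hidden_params, output_params, total_params)  # 76900 202 109559342
```

The classification head adds only about 77k parameters, so almost all of the trainable weights sit in the BERT layer.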

# Train Model

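Time to train the model! The callbacks save the model at the end of an epoch whenever the validation loss improves on the best value so far, stop training early if the validation loss stops improving, and log the metrics of each epoch to a CSV file.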
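
To make the callback behaviour concrete, here is a simplified pure-Python sketch of how `save_best_only` checkpointing and early stopping with a patience value interact. It is an approximation of the Keras logic with `min_delta=0`, and the validation losses are hypothetical, not taken from this run.

```python
def simulate_callbacks(val_losses, patience):
    """Return (epochs where a checkpoint is saved, epoch at which training ends)."""
    best = float("inf")
    wait = 0
    saved_epochs = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
            saved_epochs.append(epoch)      # save_best_only: overwrite the checkpoint
        else:
            wait += 1
            if wait >= patience:
                return saved_epochs, epoch  # early stopping triggers
    return saved_epochs, len(val_losses)

losses = [0.45, 0.38, 0.41, 0.40, 0.43]     # hypothetical validation losses
print(simulate_callbacks(losses, patience=5))  # ([1, 2], 5)
print(simulate_callbacks(losses, patience=2))  # ([1, 2], 4)
```

Under this logic, a patience of 5 can never trigger within 5 epochs, because the first epoch always counts as an improvement. With the settings used here, a smaller patience or more epochs would be needed for early stopping to have any effect.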

In [9]:
# Compile the model
model.compile(
    optimizer = OPTIMIZER,
    loss = LOSS, 
    metrics = METRIC)

# Fit the model
model.fit(
    x=[x_train['input_ids'], x_train['attention_mask']],
    y=y_train,
    validation_data=([x_val['input_ids'], x_val['attention_mask']], y_val),
    callbacks=[SAVE_BEST_MODEL, STOP_EARLY, CSV_LOGGER],
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS)

# # Save the final model
# model.save('training_outputs/models/final_model.h5', include_optimizer=False) 
# model.load_weights('training_outputs/models/best_model.h5')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f1c08478470>

# Make Predictions on Test Data

Now we can use the model to get some performance metrics on the unseen test set.

In a multilabel problem, where one query can fall under several topics, a sigmoid activation is used so that each topic gets its own prediction probability between 0 and 1, rather than all topic probabilities summing to 1. If you want the prediction probabilities to sum to 1, use a softmax activation instead. We go with softmax here because each review has been labelled as either positive or negative, although in theory a review could be both positive and negative.
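
The difference between the two activations is easiest to see on a single pair of logits. The sketch below uses only the standard library, and the logit values are hypothetical:

```python
import math

def sigmoid(logits):
    # Each logit is squashed independently into (0, 1).
    return [1 / (1 + math.exp(-z)) for z in logits]

def softmax(logits):
    # Logits are normalised jointly so the outputs sum to 1.
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.0, 3.0]  # hypothetical [negative, positive] logits
sigmoid_probs = sigmoid(logits)
softmax_probs = softmax(logits)
print(sigmoid_probs, sum(sigmoid_probs))
print(softmax_probs, sum(softmax_probs))
```

Both activations rank the classes in the same order, so the argmax prediction is unchanged; only the interpretation of the scores differs.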

In [12]:
results_list = []
model_name =  'bert-base-uncased'
config = AutoConfig.from_pretrained(model_name)
config.output_hidden_states = True
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)        
 
test_x = tokenizer(
    text=test_df['review'].to_list(),
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    truncation=True,
    padding='max_length',  # always pad to MAX_LENGTH so the shape matches the model input
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)
 
y_pred = model.predict([test_x['input_ids'], test_x['attention_mask']])
softmax_pred = tf.nn.softmax(y_pred)
max_prob = softmax_pred.numpy().max(axis=1)
test_df['max_prob'] = max_prob
softmax_pred = np.array(softmax_pred).tolist()
test_df['softmax_pred'] = softmax_pred

argmax = y_pred.argmax(axis=1)
test_df['argmax'] = argmax
test_df['topic_prediction']= test_df['argmax'].map(topic_dict)
        
targets = test_df['sentiment'].tolist()
predictions = test_df['topic_prediction'].tolist()

# output of results for csv
results_list.append({
    "test_accuracy": accuracy_score(targets, predictions),
    "precision_macro": round(precision_score(targets, predictions, average='macro'), 3),
    "recall_macro": round(recall_score(targets, predictions, average='macro'), 3),
    "f1_score_macro": round(f1_score(targets, predictions, average='macro'), 3),
    "classification_report": classification_report(targets, predictions, digits=3),
    "confusion_matrix": confusion_matrix(targets, predictions),
    "max_length": MAX_LENGTH,
    "num_epochs": NUM_EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LEARNING_RATE
})

# save results to csv
results_test_df = pd.DataFrame(results_list)
results_test_df.to_csv("multilabel_results_{}.csv".format(runtime_id),
                       index=False,
                       columns=["test_accuracy", "precision_macro", "recall_macro",
                                "f1_score_macro", "classification_report", "confusion_matrix",
                                "max_length", "num_epochs", "batch_size", "learning_rate"])
result_dict = results_list[0]
pp(f'Test set accuracy: {result_dict["test_accuracy"]}')
pp(result_dict.get('confusion_matrix'))
pp(result_dict.get('classification_report'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

'Test set accuracy: 0.8475'
array([[680, 114],
       [130, 676]])
('              precision    recall  f1-score   support\n'
 '\n'
 '    negative      0.840     0.856     0.848       794\n'
 '    positive      0.856     0.839     0.847       806\n'
 '\n'
 '    accuracy                          0.848      1600\n'
 '   macro avg      0.848     0.848     0.847      1600\n'
 'weighted avg      0.848     0.848     0.847      1600\n')
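
As a sanity check, the figures in the classification report can be recomputed from the confusion matrix alone. The sketch below assumes the scikit-learn convention of rows as true classes and columns as predicted classes, in the order negative, positive:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[680, 114],
      [130, 676]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

f1_scores = []
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    precision = true_positives / sum(cm[r][i] for r in range(len(labels)))  # column sum
    recall = true_positives / sum(cm[i])                                    # row sum
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.3f}")
```

The row sums (794 and 806) match the support column of the report, which confirms the orientation of the matrix.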


The results above, with a test set accuracy of 85% and an F1 score of around 0.85, are pretty good considering we only trained on a sample of the dataset and have additional data available to improve training. Other parameters, such as the number of epochs, learning rate and batch size, can also be changed to make improvements. Below we can look at examples from the test set that the model predicted incorrectly.

In [13]:
pd.set_option('display.max_colwidth', None)  # None shows the full text; -1 is deprecated in recent pandas versions
test_df['prediction_correct'] = np.where(test_df['topic_prediction']==test_df['sentiment'], "correct", "incorrect")
test_df.loc[test_df['sentiment'] != test_df['topic_prediction']].sort_values(['sentiment'])[0:4]

  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,review,sentiment,max_prob,softmax_pred,argmax,topic_prediction,prediction_correct
27401,"The Blob starts with one of the most bizarre theme songs ever, sung by an uncredited Burt Bacharach of all people! You really have to hear it to believe it, The Blob may be worth watching just for this song alone & my user comment summary is just a little taste of the classy lyrics... After this unnerving opening credits sequence The Blob introduces us, the viewer that is, to Steve Andrews (Steve McQueen as Steven McQueen) & his girlfriend Jane Martin (Aneta Corsaut) who are parked on their own somewhere & witness what looks like a meteorite falling to Earth in nearby woods. An old man (Olin Howland as Olin Howlin) who lives in a cabin also sees it & goes to investigate, he finds a crater & a strange football sized rock which splits open when he unwisely pokes it with a stick. Laying in the centre of the meteorite is a strange jelly like substance which sticks to the stick, if you know what I mean! It then slides up the stick & attachés itself to the old man's hand. Meanwhile Steve & Jane are quietly driving along minding their own business when the old man runs out in front of Steve's car, Steve being a decent kinda guy decides to take the old man to Dr. T. Hallan (Alden 'Stephen' Chase as Steven Chase) at the local surgery. Dr. Hallan says he doesn't know what the substance on the old man's hand is but it's getting bigger & asks Steve to go back where he found him & see if he can find out what happened. Steve agrees but doesn't come up with anything & upon returning to Dr. Hallan's surgery he witnesses the blob devouring him. The town's police, Lieutenant Dave (Earl Rowe) & the teenage hating Sergeant Jim Bert (John Benson) unsurprisingly don't believe a word of it & end up suspecting Steve & his mates Al (Anthony Franke), Tony (Robert Fields) & someone called 'Mooch' Miller (James Bonnet) of playing an elaborate practical joke on the police department. 
However as the blob continues to eat it's way through the town Steve sets about finding proof of it's existence & convincing the police about the threat it posses not just to their town but the entire world!<br /><br />Directed Irvin S. Yeaworth Jr. & an uncredited Russell S. Doughton Jr. I was throughly disappointed by this, the original 1958 version of The Blob. The script by Kay Linaker as Kate Phillips & Theodore Simonson is an absolute bore & extremely dull not making the most of it's strongest aspects. The Blob focuses on the tiresome dramatics & conflicts between the teenagers & police, in fact the majority of The Blob is spent on Steve trying to convince the police of the blob's existence. For most of the film the blob itself almost seems inconsequential & somewhat forgotten. It only has two or three scenes for the fist hour & a bit until the less than exciting climax when the adults & teenagers have to work together to defeat the blob & have a new found appreciation of each other afterwards, yuck! Why couldn't the blob just eat the lot of 'em? No explanation is given for what the blob is or it's origins other than it came from space, how long did it take them to come up with that? The dialogue is clunky & silly as well, as are people's actions & decision making, I love the part when a nurse named Kate (Lee Paton as Lee Payton, did anyone use their real name in this thing?) is confronted by the blob, she throws some acid over it & calmly proclaims ""Doctor, nothing will stop it!"", how does she know 'nothing' will stop it exactly? There's no blood or violence so don't worry about that, the special effects on the blob itself aren't too bad considering but it barely has any screen time & moves very slowly, a bit like the film in general actually. The acting is terrible, McQueen is supposed to be a teenager when in reality he was 28 years old & it shows, he looks old enough to be his own dad! 
Same thing goes for most of the other 'teenage' cast members & everyone generally speaking are wooden & unconvincing in their roles. Technically The Blob is very basic, dark static photography, dull direction & forgettable production values. The Blob is one of those films that probably sounds good on paper & is well known as being a 'classic' but is in actual fact a huge disappointment when finally seen. This is one case when the remake The Blob (1988) is definitely better than the original. The original Blob is slow & boring & the remake isn't, the original Blob contains no blood or gore & the remake does, the original Blob has incredibly poor acting & casting decisions & the remake doesn't & the original Blob itself gets very little screen time eating only three or four people throughout the entire film & the remake features the blob all the way through & it virtually eats an entire town. The choice is an easy one, the remake every time as it's a better film in every respect. I'll give the film two stars & give that wonderful main theme song one on it's own. Definitely not the classic many seem to make out.",negative,0.97961,"[0.020390067249536514, 0.9796099662780762]",1,positive,incorrect
13566,"This is the first recorded effort to put sound with a movie, and a the oldest that, obviously, is still in existence. This historic piece of film is the opening segment in the ""More Treasures Of The Natural Archives"" DVD.<br /><br />It's only a 15-second clip of a man playing a violin in front of a huge recording cylinder. Next to him are two men dancing. Near the end, another man walks on the stage. William Dickson, the director of this experiment, is the violin player. This ""movie"" had several titles over the years but the sound experiment was not really a success. It took over 30 years from this point to the synchronize sight and sound to the point where something could be issued to the public for entertainment. However, this was a start, no matter how primitive it came off. <br /><br />For more of the technical information and history of this film process, see the other review here by ""Boba Fett1138.""",negative,0.999329,"[0.0006713325274176896, 0.9993287324905396]",1,positive,incorrect
17377,"Some people like to tell you that Deep Space 9 is the best of all the Star Trek shows, because it stresses character development and continuity, and features a more complex background and ongoing plots. In some ways this makes it more satisfying, but in many ways the show fails entirely.<br /><br />The series starts out as a soap opera on a space station, with two entire seasons of generic science fiction stories balanced with banal subplots about the characters. The characters are a good bunch, and most of the actors are decent, but I think the writers tried too hard to make them ""normal"". By ""normal"" they actually mean ""ordinary and tedious"".<br /><br />At the end of Season Two we are introduced to the Dominion, who hang around menacingly for a while before finally going to war with the good guys in Season Five. This is the main ""story arc"" of the show, but it only takes up a fraction of the entire series. We still get lame stand-alone episodes, heroes still get stranded on weird planets for forty-five minutes, and there's an awful lot of low-brow comedy featuring the greedy, goofy Ferengi. A lot of episodes are merely dull, and some are unwatchable.<br /><br />The Dominion, DS9's main villains, are bent on galactic domination for the convenient reason that, well, they just don't like anyone. The entire war is presented with a naive lack of moral complexity and imagination. Impressively pyrotechnic space battles appear with great frequency from Season Three onwards, but these are carried out in ludicrously simplistic ways, such as two huge fleets of super-advanced starships flying right at each other and blasting away. The writers of DS9 (including the talented Ronald D. Moore, later of ""Battlestar Galactica"" fame) spiced up their monotonous show by starting a war, but at heart it is still a pedantic soap.<br /><br />DS9 remains a very frustrating experience. 
The continuous story is too flat and obvious to be really gripping, and the characters never truly develop in interesting ways. ""Babylon 5"" and ""Battlestar Galactica"" both fulfilled the promise made by DS9, and did everything much better. For Star Trek, stick with the original and the Next Generation.",negative,0.999938,"[6.19053389527835e-05, 0.9999381303787231]",1,positive,incorrect
25188,"VAMPYRES <br /><br />Aspect ratio: 1.85:1<br /><br />Sound format: Mono<br /><br />A motorist (Murray Brown) is lured to an isolated country house inhabited by two beautiful young women (Marianne Morris and Anulka) and becomes enmeshed in their free-spirited sexual lifestyle, but his hosts turn out to be vampires with a frenzied lust for human blood...<br /><br />Taking its cue from the lesbian vampire cycle initiated by maverick director Jean Rollin in France, and consolidated by the success of Hammer's ""Carmilla"" series in the UK, Jose Ramon Larraz' daring shocker VAMPYRES pushed the concept of Adult Horror much further than British censors were prepared to tolerate in 1974, and his film was cut by almost three minutes on its original British release. It isn't difficult to see why! Using its Gothic theme as the pretext for as much nudity, sex and bloodshed as the film's short running time will allow, Larraz (who wrote the screenplay under the pseudonym 'D. Daubeney') uses these commercial elements as mere backdrop to a languid meditation on life, death and the impulses - sexual and otherwise - which affirm the human condition.<br /><br />Shot on location at a picturesque country house during the Autumn of 1973, Harry Waxman's haunting cinematography conjures an atmosphere of grim foreboding, in which the desolate countryside - bleak and beautiful in equal measure - seems to foreshadow a whirlwind of impending horror (Larraz pulled a similar trick earlier the same year with SYMPTOMS, a low-key thriller which erupts into a frenzy of violence during the final reel). However, despite its pretensions, VAMPYRES' wafer-thin plot and rough-hewn production values will divide audiences from the outset, and while the two female protagonists are as charismatic and appealing as could be wished, the male lead (Brown, past his prime at the time of filming) is woefully miscast in a role that should have gone to some beautiful twentysomething stud. 
A must-see item for cult movie fans, an amusing curio for everyone else, VAMPYRES is an acquired taste. Watch out for silent era superstar Bessie Love in a brief cameo at the end of the movie.",negative,0.995948,"[0.0040516238659620285, 0.9959483742713928]",1,positive,incorrect


# Predict on Individual Query (Get Ready for Production)

If we were to put this model into production, we may want to predict the sentiment of each movie review individually as soon as it is submitted on the website. To do this we first need to load the trained model and prepare the tokenizer.

In [14]:
# model = tf.keras.models.load_model('model.h5')

# Name of the BERT model to use
model_name = 'bert-base-uncased'

# Load transformers config and set output_hidden_states to False
config = AutoConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

max_length = 120


In [17]:
sentence = "This was a brilliant movie, I really enjoyed it and would recommend it to anyone."

Now tokenize and pad the sentence we want to predict the sentiment of.

In [18]:
# padding=True only pads to the longest sequence in the batch, which for a single
# sentence is shorter than max_length, therefore we pad separately below
sentence_tokens = tokenizer(
    text=sentence,
    add_special_tokens=True,
    padding=True, 
    max_length=max_length,
    truncation=True,
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

attention_mask_padded = pad_sequences(sentence_tokens['attention_mask'], maxlen=max_length, padding="post")
tokens_padded = pad_sequences(sentence_tokens['input_ids'], maxlen=max_length, padding="post")
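
The effect of post-padding can be sketched in plain Python. `pad_post` below is a hypothetical stand-in that only covers the padding case (not truncation), and the token ids are made up apart from 101 and 102, which are the `[CLS]` and `[SEP]` ids in the bert-base-uncased vocabulary.

```python
def pad_post(sequence, maxlen, value=0):
    # Append `value` until the sequence reaches maxlen (no truncation in this sketch).
    return list(sequence) + [value] * (maxlen - len(sequence))

input_ids = [101, 2023, 2001, 102]     # hypothetical ids for a short sentence
attention_mask = [1] * len(input_ids)  # 1 marks a real token

padded_ids = pad_post(input_ids, maxlen=8)
padded_mask = pad_post(attention_mask, maxlen=8)
print(padded_ids)   # [101, 2023, 2001, 102, 0, 0, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding the attention mask with zeros tells the model to ignore the padded positions.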

Finally, get the model prediction and run it through the softmax function to get the prediction probabilities and the predicted topic.

In [19]:
text_prediction = model.predict([tokens_padded, attention_mask_padded])
probabilities = tf.nn.softmax(text_prediction)
probabilities_array = list(np.array(probabilities)[0])
topic_list = list(topic_dict.values())

return_array = []
for topic, probability in list(zip(topic_list, probabilities_array)):
    return_array.append(
        {"topic": topic, "confidence": probability}
    )
print(return_array)
print(f'Predicted topic: {np.vectorize(topic_dict.get)(text_prediction.argmax(axis=1))}')

[{'topic': 'negative', 'confidence': 5.8824266e-06}, {'topic': 'positive', 'confidence': 0.99999416}]
Predicted topic: ['positive']


# Additional Code

In some text classification tasks you may want to predict a fixed number of classes and treat everything that does not fall into these classes as "other". "Other" samples should be included in the data, but the model is not trained to predict the "other" topic, so its label is removed from the one hot encodings. This way, any text with no topic prediction probability above a predefined threshold (e.g. 0.8) can be classified as "other". The function below can be used to remove the "other" label.

In [20]:
def remove_others_encoding(one_hot_column, other_index):
    """
    Remove the "other" topic index from a one hot encoded label.
    :param one_hot_column: one hot encoded label as a list
    :param other_index: position of the "other" topic in the encoding
    :return: the one hot encoded label without the "other" position
    """
    return one_hot_column[:other_index] + one_hot_column[other_index+1:]

# assumes the labels include an "other" category (not the case for the IMDB data above)
other_index = train_df['label'].cat.categories.to_list().index('other')
train_df['one_hot_correct'] = train_df.apply(lambda x: remove_others_encoding(x['one_hot'], other_index), axis=1)
val_df['one_hot_correct'] = val_df.apply(lambda x: remove_others_encoding(x['one_hot'], other_index), axis=1)
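
At prediction time, the threshold rule described above could look something like the following sketch. `assign_topic`, the topic names and the probabilities are all hypothetical; the 0.8 threshold is the example value mentioned in the text.

```python
def assign_topic(probabilities, topics, threshold=0.8):
    # Pick the most probable topic, falling back to "other" when no topic
    # has a prediction probability above the threshold.
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best_index] > threshold:
        return topics[best_index]
    return "other"

topics = ["billing", "delivery"]           # hypothetical trained topics
print(assign_topic([0.95, 0.30], topics))  # billing
print(assign_topic([0.55, 0.40], topics))  # other
```

The threshold trades coverage for precision: raising it sends more texts to "other" but makes the remaining topic predictions more reliable.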

If the training set is imbalanced, the code below is one way to balance it. It again assumes that there is an "other" class in the data and that it is the largest class; if this is not the case, the code can easily be changed.

In [None]:
topics = list(train_df['label'].drop_duplicates())
topics.remove("other")

# balance the train dataset by oversampling every topic up to the size of "other"
other_df = train_df.loc[train_df['label'] == "other"]
balanced_parts = [other_df]
for topic in topics:
    resampled_data = resample(train_df.loc[train_df['label'] == topic],
                              replace=True,             # sample with replacement
                              n_samples=len(other_df),  # match the size of the "other" class
                              random_state=11)          # reproducible results
    balanced_parts.append(resampled_data)
# DataFrame.append was removed in pandas 2.0, so the parts are combined with pd.concat
train_df = pd.concat(balanced_parts).sample(frac=1)

# Useful Links

https://towardsdatascience.com/multi-label-multi-class-text-classification-with-bert-transformer-and-keras-c6355eccb63a

https://towardsdatascience.com/working-with-hugging-face-transformers-and-tf-2-0-89bf35e3555a