IMDB MOVIE REVIEW SENTIMENT ANALYSIS WITH TENSORFLOW AND BERT

1) Connecting Kaggle IMDB Review Dataset By Using Kaggle API

In [1]:
! pip install -q kaggle

In [2]:
from google.colab import files

In [13]:
files.upload()

{}

In [4]:
! mkdir ~/.kaggle

In [5]:
! cp kaggle.json ~/.kaggle/

In [6]:
! chmod 600 ~/.kaggle/kaggle.json

In [7]:
! kaggle datasets list

ref                                                             title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
meirnizri/covid19-dataset                                       COVID-19 Dataset                                      5MB  2022-11-13 15:47:17           9518        281  1.0              
mattop/alcohol-consumption-per-capita-2016                      Alcohol Consumption Per Capita 2016                   4KB  2022-12-09 00:03:11            985         35  1.0              
michals22/coffee-dataset                                        Coffee dataset                                       24KB  2022-12-15 20:02:12           1095         41  1.0              
thedevastator/jobs-dataset-from-glassdoor                   

In [10]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
100% 25.7M/25.7M [00:01<00:00, 35.5MB/s]
100% 25.7M/25.7M [00:01<00:00, 22.1MB/s]


In [12]:
! unzip imdb-dataset-of-50k-movie-reviews.zip

Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


2) Importing necessary packages

In [15]:
#!pip install bert-tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-tensorflow
  Downloading bert_tensorflow-1.0.4-py2.py3-none-any.whl (64 kB)
[K     |████████████████████████████████| 64 kB 3.1 MB/s 
Installing collected packages: bert-tensorflow
Successfully installed bert-tensorflow-1.0.4


In [3]:
# Regular imports
import numpy as np
import pandas as pd
import tqdm # for progress bar
import math
import random
import re

from sklearn.model_selection import train_test_split

# Tensorflow Import
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import BertTokenizer


# pd.set_option('max_rows', 99999)
# pd.set_option('max_colwidth', 400)


In [4]:
movie_reviews = pd.read_csv("IMDB Dataset.csv")

In [5]:
movie_reviews.head(20)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


To be able to use the text, we have to prepare it accordingly. In the first step, we create a function that removes the line breaks and other HTML leftovers from the text. In this step, we also filter out other text impurities using Regular Expressions.

In [7]:
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
  return TAG_RE.sub('', text)

def preprocess_tags(sen):
  # Removing html tags
  sentence = remove_tags(sen)

  # Remove punctuations and numbers
  sentence = re.sub('[^a-zA-Z]',' ', sentence)

  # Single character removal
  sentence = re.sub(r's+[a-zA-Z]s+',' ', sentence)

  # Removing multiple spaces
  sentence = re.sub(r's+', ' ', sentence)

  return sentence


In [8]:
movie_reviews['review'] = movie_reviews['review'].apply(preprocess_tags)
movie_reviews['sentiment'] = movie_reviews['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

In [12]:
# ds = tf.data.Dataset.from_tensor_slices((
#     dict(movie_reviews['review']),
#     movie_reviews['sentiment'],

# ))

ds = (
    tf.data.Dataset.from_tensor_slices(
        (
            movie_reviews['review'].values,
            tf.cast(movie_reviews['sentiment'].values, tf.int32)
        )
    )
)

In [21]:
ds.__len__()

<tf.Tensor: shape=(), dtype=int64, numpy=50000>

Now we need to apply BERT tokenizer to use pre-trained tokenizer and then prepare data for BERT model.

In [15]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [16]:
max_length = 512
def convert_to_feature(review):
    return tokenizer.encode_plus(review,
                    add_special_tokens=True,
                    max_length=max_length,
                    pad_to_max_length=True,
                    return_attention_mask=True)

Transforming raw data to suitable form for BERT Model

In [17]:
def map_to_dict(input_ids,attention_mask, token_type_ids, label):
    return {
        "input_ids":input_ids,
        "attention_mask":attention_mask,
        "token_type_ids":token_type_ids
    },label

In [18]:
def encode_reviews(ds, limit=-1):
    input_ids_list=[]
    token_type_ids_list=[]
    attention_mask_list=[]
    label_list=[]
    
    if limit >0:
        ds = ds.take(limit)
    
    for review, label in tfds.as_numpy(ds):
        bert_input = convert_to_feature(review.decode())
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append(label)
    
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_to_dict)
    


Spliting dataset as train and test.

In [32]:
#ds_size = print([i for i,_ in enumerate(ds)][-1] + 1)
ds_size = int(ds.__len__())
print(ds_size)

50000


In [33]:
train_size = int(ds_size*0.8)
test_size = int(ds_size*0.2)
print(train_size)
print(test_size)

ds_train = ds.take(train_size)
ds_test = ds.skip(train_size).take(test_size)

40000
10000


Now we need to create out train and test datasets.

In [26]:
batch_size=6
ds_train_encoded = encode_reviews(ds_train).shuffle(10000).batch(batch_size)
ds_test_encoded = encode_reviews(ds_test).shuffle(10000).batch(batch_size)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Now we need to initialize BERT model for sentiment analysis.