<h1><center>Sentiment Analysis Classifier with BERT</center></h1>

This project is based on Manning's book "Transfer Learning for NLP" (chapter 2).
The goals are:

1. Curate a dataset of reviews from the classic IMDb dataset.
2. Create a Pandas dataframe from it.
3. Build a simple bag-of-words model from that content. It is simple because it relies on term frequency (tf) only.
4. Choose a baseline classifier: either Logistic Regression or a Gradient Boosting Machine.
5. Evaluate with accuracy, which is a suitable metric because the dataset is balanced and has two classes.
6. Train a sentiment analysis classifier based on BERT embeddings.
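
To make the term-frequency idea in step 3 concrete, here is a minimal sketch that uses only the standard library. The helper names `build_vocabulary` and `term_frequency_vector` and the toy reviews are assumptions made for illustration; they are not the notebook's own implementation.

```python
from collections import Counter

def build_vocabulary(tokenized_reviews):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for review in tokenized_reviews for token in review})
    return {token: index for index, token in enumerate(vocab)}

def term_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one review."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # skip tokens that are not in the vocabulary
            vector[vocabulary[token]] = count
    return vector

# Hypothetical toy reviews, already tokenized
toy_reviews = [["good", "movie", "good", "cast"], ["bad", "movie"]]
vocabulary = build_vocabulary(toy_reviews)
vectors = [term_frequency_vector(review, vocabulary) for review in toy_reviews]
print(vocabulary)  # {'bad': 0, 'cast': 1, 'good': 2, 'movie': 3}
print(vectors)     # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each review becomes one row of raw counts, with no inverse-document-frequency weighting, which is what "tf only" means here.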

But before starting, let's make sure the correct library versions are installed, namely tensorflow and bert-tensorflow.

In [4]:
import sys
!{sys.executable} -m pip install -r requirements.txt
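
After the install finishes, one possible way to confirm what ended up in the environment is to query the package metadata. This is a sketch that assumes Python 3.8 or later; `installed_version` is a hypothetical helper, and the required version numbers are not given here, so it only reports what it finds.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Package names as mentioned in the text above
for package in ("tensorflow", "bert-tensorflow"):
    print(package, installed_version(package) or "not installed")
```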



Next, we import the required Python libraries and the dataset. To download the dataset I use the bash script get_aclImdb.sh, which downloads the compressed archive and extracts it into ./data/aclImdb. The script needs execute permission (sudo chmod +x get_aclImdb.sh).

In [1]:
import pandas as pd 
import numpy as np 
import pickle as pck 
import os.path
from os import path

# download dataset
!./get_aclImdb.sh

Downloading aclImdb.
--2020-11-25 10:52:25--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2020-11-25 10:52:56 (2.59 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

aclImdb downloaded and extracted into ./data/aclImdb.
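
If running a shell script is not convenient, the same download-and-extract step could be sketched in Python. The URL comes from the log above; the function names, the `data` default, and the assumption that the archive unpacks to a top-level `aclImdb` folder are mine, and the contents of get_aclImdb.sh itself are not shown here.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the download log above
DATASET_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def extract_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return that directory."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return target_dir

def download_imdb(data_dir="data"):
    """Download and extract the dataset unless data_dir/aclImdb already exists."""
    data_dir = Path(data_dir)
    dataset_dir = data_dir / "aclImdb"
    if dataset_dir.is_dir():
        return dataset_dir  # already present, skip the download
    data_dir.mkdir(parents=True, exist_ok=True)
    archive_path = data_dir / "aclImdb_v1.tar.gz"
    urllib.request.urlretrieve(DATASET_URL, archive_path)
    extract_archive(archive_path, data_dir)
    return dataset_dir
```

Calling `download_imdb()` once is enough; later calls return immediately when the folder is already there.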


# Hyperparameters

In [16]:
max_tokens = 256 # maximum number of tokens kept per review
max_chars = 20 # maximum number of characters kept per token

Three helper functions tokenize the text, remove stopwords, and strip punctuation while converting to lowercase.

In [7]:
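def tokenize(text):

    # Guard against missing or non-string input and always return a list
    if not isinstance(text, str) or text == '':
        tokens = []
    else:
        # Split on single spaces and keep at most max_tokens tokens
        tokens = text.split(' ')[:max_tokens]

    return tokens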

In [14]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords = stopwords.words('english')

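def stopword_removal(tokens):

    # The comparison is case-sensitive, so capitalized stopwords such as 'The' are kept.
    # Empty strings are dropped as well, and a list (not a lazy filter object) is returned.
    filtered_tokens = [token for token in tokens if token and token not in stopwords]

    return filtered_tokens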


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/baosiek/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
import re

def clean_tokens(tokens):

    cleaned_tokens = []
    for token in tokens:
        token = token.lower()
        token = re.sub(r'[\W\d]', "", token)[:max_chars]
        cleaned_tokens.append(token)

    return cleaned_tokens
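
To see how the three helpers interact, the sketch below condenses them into one function and runs it on a made-up sentence. `TOY_STOPWORDS` is a hypothetical stand-in for the NLTK English list, and `preprocess` is an illustrative name, not part of the notebook.

```python
import re

MAX_TOKENS = 256  # same values as the hyperparameters cell
MAX_CHARS = 20
# Hypothetical stand-in for the NLTK English stopword list
TOY_STOPWORDS = {"the", "was", "a", "and", "it"}

def preprocess(text):
    """Tokenize, drop stopwords, then lowercase and strip non-letters."""
    tokens = text.split(" ")[:MAX_TOKENS]
    tokens = [t for t in tokens if t and t not in TOY_STOPWORDS]
    return [re.sub(r"[\W\d]", "", t.lower())[:MAX_CHARS] for t in tokens]

sample = "The movie was great, and it won 2 awards!<br />"
print(preprocess(sample))
# ['the', 'movie', 'great', 'won', '', 'awardsbr', '']
```

The output shows three side effects of this ordering: "The" survives because stopwords are matched before lowercasing, tokens made only of digits or punctuation collapse to empty strings, and HTML remnants such as `<br` fuse with the neighboring word.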

In [10]:
def shuffle_data(reviews_np, sentiment_list):

    shuffle_index = np.random.permutation(len(sentiment_list))
    reviews_np = reviews_np[shuffle_index]
    sentiments_np = np.asarray(sentiment_list)[shuffle_index]

    return reviews_np, sentiments_np
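
The key point in `shuffle_data` is that one permutation indexes both arrays, so each review keeps its label. The sketch below shows the same idea with the standard library instead of NumPy; `shuffle_together` and its `seed` parameter are assumptions added for illustration and reproducibility.

```python
import random

def shuffle_together(reviews, sentiments, seed=None):
    """Shuffle two parallel lists with one shared permutation."""
    rng = random.Random(seed)
    order = list(range(len(sentiments)))
    rng.shuffle(order)
    # Apply the same index order to both lists so pairs stay aligned
    return [reviews[i] for i in order], [sentiments[i] for i in order]

# Hypothetical toy data
toy_reviews = ["bad plot", "great cast", "dull", "loved it"]
toy_labels = [0, 1, 0, 1]
shuffled_reviews, shuffled_labels = shuffle_together(toy_reviews, toy_labels, seed=0)
print(list(zip(shuffled_reviews, shuffled_labels)))
```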

In [11]:
def load_dataset(path):

    reviews, sentiments = [], []
    for folder, sentiment in (('neg', 0), ('pos', 1)):
        folder = os.path.join(path, folder)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), 'r', encoding='utf-8') as reader:
                text = reader.read()

            text = tokenize(text)
            text = stopword_removal(text)
            text = clean_tokens(text)

            reviews.append(text)
            sentiments.append(sentiment)

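    # Reviews differ in length, so store them as an array of Python lists;
    # recent NumPy versions reject ragged input unless dtype=object is given
    reviews_np = np.array(reviews, dtype=object)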
    reviews_np, sentiments_np = shuffle_data(reviews_np, sentiments)

    return reviews_np, sentiments_np
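
Because accuracy is the chosen metric, it can be worth confirming that the two classes really are balanced once the labels are loaded. A minimal sketch, with `class_balance` as a hypothetical helper and toy labels standing in for the real ones:

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of examples per label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels; in the notebook they would come from load_dataset
toy_labels = [0, 1, 1, 0, 1, 0]
print(class_balance(toy_labels))  # {0: 0.5, 1: 0.5}
```

With the NumPy label array returned by `load_dataset`, passing `labels.tolist()` keeps the dictionary keys as plain Python integers.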

In [22]:
train_path = os.path.join('data/aclImdb', 'train')
raw_data, raw_label = load_dataset(train_path)

[list(['i', 'wonder', 'audiences', 'day', 'thought', 'first', 'laying', 'eyes', 'walter', 'jack', 'palance', 'blackie', 'certainly', 'looks', 'like', 'one', 'else', 'time', 'skulllike', 'face', 'flattened', 'nose', 'elongated', 'body', 'even', 'remains', 'unsettling', 'presence', 'and', 'could', 'appropriate', 'emergence', 'dingy', 'new', 'orleans', 'slums', 'appear', 'fester', 'like', 'plague', 'blackie', 'loosing', 'city', 'im', 'sorry', 'scenesbr', 'br', 'the', 'movie', 'skillfully', 'assembled', 'morgues', 'black', 'humor', 'widmark', 'douglas', 'interplay', 'untouristy', 'locations', 'battles', 'among', 'officialsall', 'woven', 'tensely', 'realistic', 'thriller', 'menace', 'unlike', 'others', 'time', 'even', 'widmarks', 'domestic', 'scenes', 'put', 'woman', 'bel', 'geddes', 'marquee', 'manage', 'disruptive', 'director', 'kazan', 'certainly', 'shows', 'aptitude', 'helming', 'studio', 'fox', 'product', 'matter', 'may', 'felt', 'commercial', 'aspectbr', 'br', 'widmark', 'solid', 'low