<h1><center>Sentiment Analysis Classifier with BERT</center></h1>

This project is based on Manning's book "Transfer Learning for NLP" (chapter 2).
The goals are:

1. Curate a dataset of reviews from the classic IMDB dataset
2. Create a Pandas dataframe from it
3. Create a simple bag-of-words model from that content; it is simple because it relies on term frequency (tf) only
4. Choose one baseline classifier between Logistic Regression and Gradient Boosting Machine
5. Use accuracy as the metric of choice, since the dataset is balanced and consists of two classes
6. Train a SentimentAnalysis classifier based on BERT embeddings

Before starting, let's make sure the correct library versions are installed, namely tensorflow and bert-tensorflow.

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt
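
To confirm what actually got installed, a quick check like the following can help. It is only a sketch: `installed_version` is a hypothetical helper, and the two package names are the ones mentioned above. The expected version numbers live in requirements.txt and are not repeated here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the text above; compare the output with requirements.txt.
for name in ("tensorflow", "bert-tensorflow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```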



Next we import the required Python libraries and fetch the dataset. To download it I use the bash script get_aclImdb.sh, which downloads the compressed archive and extracts it into ./data/aclImdb. The script needs execute permission (sudo chmod +x get_aclImdb.sh).

In [2]:
import pandas as pd 
import numpy as np 
import pickle as pck 
import os.path
from os import path

# download dataset
!./get_aclImdb.sh

aclImdb already downloaded.
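
For environments without bash, the same download-and-extract step could be sketched in Python. This is not the content of get_aclImdb.sh: the function name `ensure_aclimdb` is hypothetical, and the URL is an assumption (the Stanford-hosted copy of the Large Movie Review Dataset), so adjust it if the script uses a different mirror.

```python
import os
import tarfile
import urllib.request

# Assumption: Stanford-hosted archive; the script may use a different source.
ACLIMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def ensure_aclimdb(data_dir="data", url=ACLIMDB_URL):
    """Download and extract the archive unless data_dir/aclImdb already exists.

    Returns (path_to_dataset, downloaded_flag).
    """
    target = os.path.join(data_dir, "aclImdb")
    if os.path.isdir(target):
        return target, False  # already present, nothing to do

    os.makedirs(data_dir, exist_ok=True)
    archive = os.path.join(data_dir, "aclImdb_v1.tar.gz")
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(data_dir)  # the archive holds a top-level aclImdb/ folder
    return target, True
```

Calling `ensure_aclimdb()` from the notebook directory would leave the data under ./data/aclImdb, the location the rest of the notebook expects.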


# Hyperparameters

In [3]:
max_tokens = 256 # maximum number of tokens per review
max_chars = 20 # maximum size of a token.
n_samples = 1000 # number of training instances
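
To make the first two hyperparameters concrete, here is a minimal sketch of how they could be applied to a single review. The `preprocess` helper is hypothetical, and dropping over-long tokens (rather than truncating them) is my assumption. Stopword removal is left out to keep the example dependency-free.

```python
import re
import string

max_tokens = 256  # same values as in the cell above
max_chars = 20


def preprocess(text, max_tokens=max_tokens, max_chars=max_chars):
    """Lowercase, strip HTML tags and punctuation, split, then apply both limits."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # IMDB reviews contain <br /> tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Assumption: tokens longer than max_chars are dropped, not truncated.
    tokens = [t for t in tokens if len(t) <= max_chars]
    return tokens[:max_tokens]


print(preprocess("A GREAT movie!<br />Loved it."))
# ['a', 'great', 'movie', 'loved', 'it']
```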

A helper to load the raw reviews and their labels from disk. Tokenization, lowercasing and punctuation handling are left to the vectorizer used later.

In [18]:
def load_dataset_tfidf(split_dir):
    # split_dir is a split folder such as data/aclImdb/train, holding 'neg' and 'pos' subfolders.
    # Reviews are loaded in sequence, i.e., first all negatives followed by all positives.
    reviews, sentiments = [], []
    for folder, sentiment in (('neg', 0), ('pos', 1)):
        folder = os.path.join(split_dir, folder)
        for name in os.listdir(folder):
            # Explicit encoding avoids platform-dependent decoding errors.
            with open(os.path.join(folder, name), 'r', encoding='utf-8') as reader:
                text = reader.read()

            reviews.append(text)
            sentiments.append(sentiment)

    return reviews, sentiments

In [57]:
# Load the raw review texts and labels for both splits (no preprocessing is applied here)

train_path = os.path.join('data/aclImdb', 'train')
reviews_train, sentiments_train = load_dataset_tfidf(train_path)
print(f'Number of training reviews: {len(reviews_train)} and labels: {len(sentiments_train)}')

test_path = os.path.join('data/aclImdb', 'test')
reviews_test, sentiments_test = load_dataset_tfidf(test_path)
print(f'Number of testing reviews: {len(reviews_test)} and labels: {len(sentiments_test)}')

Number of training reviews: 25000 and labels: 25000
Number of testing reviews: 25000 and labels: 25000
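
Goal 2 mentions a Pandas dataframe. A minimal sketch of building one from the two lists returned by the loader could look like this. The helper `to_dataframe` and the column names `review` and `sentiment` are hypothetical, and toy data stands in for the real reviews.

```python
import pandas as pd


def to_dataframe(reviews, sentiments):
    """Pair each review with its label (0 = negative, 1 = positive)."""
    return pd.DataFrame({"review": reviews, "sentiment": sentiments})


# Toy data standing in for reviews_train / sentiments_train.
df = to_dataframe(["awful plot", "dull", "great cast", "loved it"], [0, 0, 1, 1])

# Counting labels is a quick way to confirm the classes are balanced.
print(df["sentiment"].value_counts().to_dict())
```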


With 25,000 training reviews, this dataset may be too large to handle in some environments. So let's limit the training set to n_samples reviews, drawn evenly from the negative and positive labels.
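
One way to draw such a balanced subset is sketched below. `balanced_subsample` and its `seed` parameter are hypothetical names, and the helper assumes each class has at least n_samples // 2 reviews (otherwise `random.Random.sample` raises a ValueError).

```python
import random


def balanced_subsample(reviews, sentiments, n_samples, seed=0):
    """Draw n_samples // 2 reviews per class, then shuffle the combined result."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    per_class = n_samples // 2
    picked = []
    for label in (0, 1):
        idx = [i for i, s in enumerate(sentiments) if s == label]
        picked.extend(rng.sample(idx, per_class))
    rng.shuffle(picked)  # avoid all negatives preceding all positives
    return [reviews[i] for i in picked], [sentiments[i] for i in picked]


# Toy data: 10 negative and 10 positive "reviews".
toy_reviews = [f"review {i}" for i in range(20)]
toy_labels = [0] * 10 + [1] * 10
sub_reviews, sub_labels = balanced_subsample(toy_reviews, toy_labels, n_samples=6)
print(len(sub_reviews), sub_labels.count(0), sub_labels.count(1))  # 6 3 3
```

In the notebook this would be called as `balanced_subsample(reviews_train, sentiments_train, n_samples)` before vectorizing.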

# BOW for Logistic Regression

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(reviews_train)
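
Note that goal 3 describes a bag-of-words based on term frequency only, while TfidfVectorizer also reweights each term by its inverse document frequency. To show what the tf-only variant computes, here is a minimal pure-Python sketch. `fit_vocabulary` and `tf_vector` are hypothetical helpers with naive whitespace tokenization; in scikit-learn, CountVectorizer is the class that produces raw counts.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}


def tf_vector(doc, vocab):
    """Raw term counts for one document; out-of-vocabulary tokens are ignored."""
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]


docs = ["good good movie", "bad movie"]
vocab = fit_vocabulary(docs)
print(vocab)                                # {'bad': 0, 'good': 1, 'movie': 2}
print([tf_vector(d, vocab) for d in docs])  # [[0, 2, 1], [1, 0, 1]]
```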

In [63]:
def shuffle_data(data, label):
    # Apply one random permutation to both lists so reviews and labels stay aligned.
    data_tmp = np.array(data)
    label_tmp = np.array(label)
    index_shuffle = np.random.permutation(data_tmp.shape[0])
    data_tmp = data_tmp[index_shuffle]
    label_tmp = label_tmp[index_shuffle]

    return data_tmp.tolist(), label_tmp.tolist()

In [64]:
reviews_train, sentiments_train = shuffle_data(reviews_train, sentiments_train)


In [41]:
train_x = vectorizer.transform(reviews_train)
train_y = np.array(sentiments_train)

# Logistic Regression 

In [65]:
from sklearn.linear_model import LogisticRegression

def fit(train_x, train_y):
    # Let fitting errors propagate instead of silently returning an unfitted model.
    model = LogisticRegression()
    model.fit(train_x, train_y)

    return model

In [66]:
model = fit(train_x, train_y)

In [67]:
test_x = vectorizer.transform(reviews_test)
test_y = np.array(sentiments_test)

In [68]:
predictions = model.predict(test_x)

In [70]:
from sklearn.metrics import accuracy_score

# Use a separate name so the imported accuracy_score function is not overwritten.
accuracy = accuracy_score(test_y, predictions)
print(f'LogisticRegression accuracy is: {accuracy:.4f}')

LogisticRegression accuracy is: 0.8832
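
Goal 5 picks accuracy because the classes are balanced. The small sketch below illustrates why that matters, using made-up label lists rather than the IMDB data: a classifier that always predicts "negative" scores 0.5 on a balanced set, but looks deceptively good on an imbalanced one. The `accuracy` helper is a hypothetical stand-in for sklearn's accuracy_score.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


balanced = [0] * 50 + [1] * 50    # 50% negative, like the IMDB splits
imbalanced = [0] * 95 + [1] * 5   # hypothetical skewed dataset
always_negative = [0] * 100       # trivial classifier

print(accuracy(balanced, always_negative))    # 0.5
print(accuracy(imbalanced, always_negative))  # 0.95
```

On the balanced IMDB test set, the 0.5 chance level is therefore the reference point for the score reported above.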
