In [1]:
import os
import random
import numpy as np

## Load dataset

I will use the [IMDb movie reviews sentiment analysis dataset](https://ai.stanford.edu/%7Eamaas/data/sentiment/) which contains **movie reviews** posted by people on the [IMDb website](https://www.imdb.com/), as well as the corresponding labels ("positive" or "negative") indicating whether the reviewer liked the movie or not. There are 50,000 movie reviews divided into 25,000 reviews for training and 25,000 reviews for testing. The training and test sets are balanced, meaning they contain the same number of positive and negative reviews.

The samples are loaded class by class (all positive reviews first, then all negative ones), so they arrive in a specific order. A simple best practice to ensure the model is not affected by data order is to shuffle the data before doing anything else.

In [10]:
def load_dataset(data_path, seed=123):

    """Loads the IMDb movie reviews sentiment analysis dataset.

    Arguments
    - data_path: string, path to the data directory.
    - seed: int, seed for randomizer.

    Returns
    - A tuple of training and test data.
    - Number of training samples: 25000
    - Number of test samples: 25000
    - Number of categories: 2 (0 - negative, 1 - positive)

    References
    - Download and uncompress archive from:
    http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """

    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(train_path, fname), encoding='utf-8') as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the test data
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname), encoding='utf-8') as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
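
The function above keeps texts and labels aligned by reseeding the global generator before each shuffle. As an alternative, here is a minimal sketch that shuffles a single list of indices and applies it to both lists. The helper name `shuffle_in_unison` and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import random


def shuffle_in_unison(texts, labels, seed=123):
    """Shuffles texts and labels with one shared permutation.

    Hypothetical helper: it uses a local random.Random instance, so the
    global random state is left untouched.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    return [texts[i] for i in indices], [labels[i] for i in indices]


# Toy data, only to show that each text keeps its own label.
texts = ["great film", "awful plot", "loved it", "boring"]
labels = [1, 0, 1, 0]
shuffled_texts, shuffled_labels = shuffle_in_unison(texts, labels)
```

Because both lists are rebuilt from the same index order, a length mismatch or a forgotten reseed cannot silently misalign reviews and labels.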

## Explore dataset

In [11]:
train_data, test_data = load_dataset(r"C:\Users\danie\Desktop\Projects\sentiment-analysis")

In [13]:
print(f"Number of training samples: {len(train_data[0])}")
print(f"Number of test samples: {len(test_data[0])}")

Number of training samples: 25000
Number of test samples: 25000


Let's see what our data looks like and check whether the sentiment label matches the sentiment of the review in a random sample:

In [27]:
train_data[0][5]

'A pointless movie with nothing but gratuitous violence. The only fun I had was playing "spot the location", as much of it was filmed in my home town of Regina, Saskatchewan. I like to support locally produced films but this one was a major disappointment.'

The expected sentiment (negative) matches the sample’s label:

In [28]:
train_data[1][5]

0

## Choose a model

[Google Developers](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) has created the following model selection algorithm and flowchart attempting to significantly simplify the process of selecting a text classification model. For a given dataset, our goal is to find the algorithm that achieves close to maximum accuracy while minimizing computation time required for training.

**Model selection algorithm:**
1. Calculate the *number of samples* to *number of words per sample* ratio.
2. If this ratio is less than 1500, tokenize the text as n-grams and use a simple multi-layer perceptron (MLP) model to classify them (left branch in the flowchart below):
   - Split the samples into word n-grams and convert the n-grams into vectors.
   - Score the importance of the vectors and then select the top 20K using the scores.
   - Build an MLP model.
3. If the ratio is greater than 1500, tokenize the text as sequences and use a sepCNN model to classify them (right branch in the flowchart below):
   - Split the samples into words and select the top 20K words based on their frequency.
   - Convert the samples into word sequence vectors.
   - If the ratio is less than 15K, using a fine-tuned pre-trained embedding with the sepCNN model will likely provide the best results.
4. Measure the model performance with different hyperparameter values to find the best model configuration for the dataset.

**Flowchart:**

In the flowchart below, the yellow boxes indicate data and model preparation processes. Grey and green boxes indicate the choices considered for each process, and green boxes mark the recommended choice.

<div style="width: 700px; overflow: hidden;">
    <img src="https://developers.google.com/static/machine-learning/guides/text-classification/images/TextClassificationFlowchart.png" width="100%" alt="Text classification flowchart">
</div>

**Summary:**

Models can be broadly classified into two categories:

- **n-gram models:** Models that just see text as “bags” (sets) of words. Types of n-gram models include logistic regression, simple multi-layer perceptrons (MLPs, or fully-connected neural networks), gradient boosted trees and support vector machines.

- **Sequence models:** Models that use word ordering information. Types of sequence models include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variations. 


The ratio of *number of samples* to *number of words per sample* correlates with which model performs well. 
- When the value for this ratio is **small (<1500)**, small multi-layer perceptrons that take **n-grams** as input perform better or at least as well as sequence models. MLPs are simple to define and understand, and they take much less compute time than sequence models. 
- When the value for this ratio is **large (>= 1500)**, a **sequence** model is a better option.

In [29]:
def get_num_words_per_sample(sample_texts):
    """Gets the median number of words per sample in the given corpus.

    Arguments
    - sample_texts: list, sample texts.

    Returns
    - float, median number of words per sample.
    """
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

In [33]:
get_num_words_per_sample(train_data[0])

174.0

For our IMDb review dataset, the ratio of *number of samples* to *number of words per sample* is 25,000 / 174 ≈ 144, which is well below 1500, so we should choose an n-gram model.
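
The rule can also be written as a small check. This is a sketch only: the function name `choose_model_family` is hypothetical, the 1500 threshold comes from the selection algorithm above, and the inputs are the sample count and median word count reported earlier in this notebook.

```python
def choose_model_family(num_samples, words_per_sample, threshold=1500):
    """Returns the samples/words-per-sample ratio and the suggested model family.

    Hypothetical helper: 'n-gram' below the threshold, 'sequence' otherwise.
    """
    if words_per_sample <= 0:
        raise ValueError("words_per_sample must be positive")
    ratio = num_samples / words_per_sample
    family = "n-gram" if ratio < threshold else "sequence"
    return ratio, family


# 25,000 training samples, median of 174 words per sample (values from above).
ratio, family = choose_model_family(25000, 174.0)
print(f"ratio = {ratio:.1f} -> {family} model")  # ratio = 143.7 -> n-gram model
```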

## Data pre-processing

Machine learning algorithms take numbers as inputs. This means that we will need to convert the texts into numerical vectors. There are two steps to this process:

- **Tokenization:** Divide the texts into words or smaller sub-texts, which will enable good generalization of the relationship between the texts and the labels. This determines the “vocabulary” of the dataset (the set of unique tokens present in the data).

- **Vectorization:** Define a good numerical measure to characterize these texts.

Let’s see how to perform these two steps for both n-gram vectors and sequence vectors, as well as how to optimize the vector representations using feature selection and normalization techniques.
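
To make the two steps concrete, here is a minimal, dependency-free sketch on two toy sentences. It assumes whitespace tokenization, unigrams plus bigrams, and plain count encoding. All function names are hypothetical, and count encoding is only one simple choice (tf-idf weighting is a common alternative).

```python
from collections import Counter


def tokenize_ngrams(text, max_n=2):
    """Lowercases the text, splits on whitespace, returns word n-grams (n = 1..max_n)."""
    words = text.lower().split()
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def build_vocabulary(texts, max_n=2):
    """Maps every unique n-gram in the corpus to a column index."""
    unique_tokens = sorted({tok for t in texts for tok in tokenize_ngrams(t, max_n)})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


def vectorize(text, vocabulary, max_n=2):
    """Count-encodes one text against the vocabulary, ignoring unseen n-grams."""
    counts = Counter(tokenize_ngrams(text, max_n))
    vector = [0] * len(vocabulary)
    for tok, count in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Toy corpus, for illustration only.
texts = ["a great movie", "a pointless movie"]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(t, vocabulary) for t in texts]
```

Tokenization fixes the vocabulary (here 8 unigrams and bigrams), and vectorization turns each text into a fixed-length list of numbers that a model can consume.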

In [40]:
train_data[1][4]

1

In [41]:
train_data[0][4]

'Although it\'s most certainly politically incorrect to be entertained by a drunk, there\'s such a charm to Dudley Moore\'s portrayal of lovable lush, Arthur Bach one can\'t help but feel for this unique and wonderful character. How can you not be entertained by that infectious laugh and giggle and utter silliness. Although I\'m not really a Liza Minnelli fan, she was really excellent as Linda Marolla and I couldn\'t picture anyone else in that role. Sir John Gielgud was the heart of the film and deserved his Oscar. The rest of the cast also excellent and that great tune "Arthur\'s Theme", wow. Truly this was one of the Best Comedies of the 1980s. Great films get better with each viewing and that is the case with "Arthur."'