In [47]:
import os
import random
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

## Load dataset

I will use the [IMDb movie reviews sentiment analysis dataset](https://ai.stanford.edu/%7Eamaas/data/sentiment/), which contains **movie reviews** posted by people on the [IMDb website](https://www.imdb.com/), as well as the corresponding labels ("positive" or "negative") indicating whether the reviewer liked the movie or not. There are 50,000 movie reviews, divided into 25,000 reviews for training and 25,000 reviews for testing. The training and test sets are balanced, meaning they contain the same number of positive and negative reviews.

The data samples may be in a specific order. A simple best practice to ensure the model is not affected by data order is to always first shuffle the data.

In [10]:
def load_dataset(data_path, seed=123):

    """Loads the IMDb movie reviews sentiment analysis dataset.

    Arguments
    - data_path: string, path to the data directory.
    - seed: int, seed for randomizer.

    Returns
    - A tuple of training and validation data.
    - Number of training samples: 25000
    - Number of test samples: 25000
    - Number of categories: 2 (0 - negative, 1 - positive)

    References
    - Download and uncompress archive from:
    http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """

    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(train_path, fname), encoding='utf-8') as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the test data (used as the validation set)
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname), encoding='utf-8') as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
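
The shuffle in `load_dataset` reseeds the generator before each call so that the texts and the labels receive the same permutation. The sketch below checks that idea on a tiny made-up sample (the four texts and labels are illustrative, not taken from the dataset) and shows an alternative that shuffles a single list of indices instead.

```python
import random

# Toy data standing in for reviews and their labels (hypothetical values).
texts = ["great film", "terrible plot", "loved it", "fell asleep"]
labels = [1, 0, 1, 0]
original_pairs = set(zip(texts, labels))

# Approach used in load_dataset: reseed before each shuffle so that both
# lists (which have the same length) are permuted identically.
seed = 123
shuffled_texts, shuffled_labels = list(texts), list(labels)
random.seed(seed)
random.shuffle(shuffled_texts)
random.seed(seed)
random.shuffle(shuffled_labels)

# Alternative: shuffle one list of indices and apply it to both lists.
rng = random.Random(seed)
order = list(range(len(texts)))
rng.shuffle(order)
perm_texts = [texts[i] for i in order]
perm_labels = [labels[i] for i in order]

# In both cases every (text, label) pair is preserved.
print(set(zip(shuffled_texts, shuffled_labels)) == original_pairs)
print(set(zip(perm_texts, perm_labels)) == original_pairs)
```

The index-based variant makes the pairing explicit, which can be easier to reason about when more than two parallel lists are involved.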

## Explore dataset

In [11]:
train_data, test_data = load_dataset(r"C:\Users\danie\Desktop\Projects\sentiment-analysis")

In [13]:
print(f"Number of training samples: {len(train_data[0])}")
print(f"Number of test samples: {len(test_data[0])}")

Number of training samples: 25000
Number of test samples: 25000


Let's see what our data looks like and check whether the sentiment label matches the sentiment of the review in a random sample:

In [27]:
train_data[0][5]

'A pointless movie with nothing but gratuitous violence. The only fun I had was playing "spot the location", as much of it was filmed in my home town of Regina, Saskatchewan. I like to support locally produced films but this one was a major disappointment.'

The expected sentiment (negative) matches the sample’s label:

In [28]:
train_data[1][5]

0

## Choose a model

[Google Developers](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) have created the following model selection algorithm and flowchart attempting to significantly simplify the process of selecting a text classification model. For a given dataset, our goal is to find the algorithm that achieves close to maximum accuracy while minimizing computation time required for training.

Models can be broadly classified into two categories:

- **N-gram models (Models that just see text as “bags” (sets) of words):** With n-gram vector representation, we discard a lot of information about word order and grammar. This representation is used in conjunction with models that don’t take ordering into account, such as logistic regression, simple multi-layer perceptrons (MLPs, or fully-connected neural networks), gradient boosted trees and support vector machines.

- **Sequence models (Models that use word ordering information):** For some text samples, word order is critical to the text’s meaning. Models such as convolutional neural networks (CNNs), and recurrent neural networks (RNNs) can infer meaning from the order of words in a sample.


The ratio of *number of samples* to *number of words per sample* correlates with which model performs better. 
- When the value for this ratio is **small (<1500)**, small multi-layer perceptrons that take **n-grams** as input perform better or at least as well as sequence models. MLPs are simple to define and understand, and they take much less compute time than sequence models. 
- When the value for this ratio is **large (>= 1500)**, a **sequence** model is a better option.

<br>

**Model selection algorithm:**
1. Calculate the *number of samples* to *number of words per sample* ratio.
2. If this ratio is less than 1500, tokenize the text as n-grams and use a simple multi-layer perceptron (MLP) model to classify them (left branch in the flowchart below):
   - Split the samples into word n-grams and convert the n-grams into vectors.
   - Score the importance of the vectors and then select the top 20K using the scores.
   - Build an MLP model.
3. If the ratio is 1500 or greater, tokenize the text as sequences and use a sepCNN model to classify them (right branch in the flowchart below):
   - Split the samples into words and select the top 20K words based on their frequency.
   - Convert the samples into word sequence vectors.
   - If the ratio is less than 15K, using a fine-tuned pre-trained embedding with the sepCNN model will likely provide the best results.
4. Measure the model performance with different hyperparameter values to find the best model configuration for the dataset.
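
The branching in steps 1 to 3 can be sketched as a small function. This is only an illustration of the decision rule: the function name, the returned dictionary keys, and the example median of 174 words are assumptions made for this sketch (174 is the value implied by the ratio computed later in this notebook, not a number the notebook prints).

```python
def choose_model_family(num_samples, words_per_sample,
                        ngram_threshold=1500, embedding_threshold=15000):
    """Applies the samples / words-per-sample rule of thumb (hypothetical helper)."""
    ratio = num_samples / words_per_sample
    if ratio < ngram_threshold:
        # Left branch: n-gram vectors fed to a simple MLP.
        return {"ratio": ratio, "model": "n-gram MLP",
                "pretrained_embedding": False}
    # Right branch: sequence vectors fed to a sepCNN; a fine-tuned
    # pre-trained embedding is suggested while the ratio stays below 15K.
    return {"ratio": ratio, "model": "sequence sepCNN",
            "pretrained_embedding": ratio < embedding_threshold}


# Example with 25,000 samples and an assumed median of 174 words per sample.
choice = choose_model_family(25000, 174)
print(choice["model"], round(choice["ratio"], 2))
```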

<br>

**Flowchart:**

In the flowchart below, the yellow boxes indicate data and model preparation processes. Grey boxes and green boxes indicate choices they considered for each process. Green boxes indicate their recommended choice for each process.

<div style="width: 700px; overflow: hidden;">
    <img src="https://developers.google.com/static/machine-learning/guides/text-classification/images/TextClassificationFlowchart.png" width="100%" alt="Text classification flowchart">
</div>

Let's compute the *number of samples* to *number of words per sample* ratio:

In [29]:
def get_num_words_per_sample(sample_texts):
    """Gets the median number of words per sample in the given corpus.

    Arguments
    - sample_texts: list, sample texts.

    Returns
    - float, median number of words per sample.
    """
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

In [46]:
25000 / get_num_words_per_sample(train_data[0])

143.67816091954023

In the case of our IMDb review dataset, the ratio of *number of samples* to *number of words per sample* is less than 1500, so we should choose an n-gram model.

## Data pre-processing

Machine learning algorithms take numbers as inputs. This means that we will need to convert the texts into numerical vectors. There are two steps to this process:

- **1. Tokenization:** Divide the texts into words or smaller sub-texts, which will enable good generalization of the relationship between the texts and the labels. This determines the “vocabulary” of the dataset (the set of unique tokens present in the data).
- **2. Vectorization:** Define a good numerical measure to characterize these texts.

Let’s see how to perform these two steps for both **n-gram vectors** and **sequence vectors**, as well as how to optimize the vector representations using feature selection and normalization techniques.

### N-gram vectors

In an n-gram vector, text is represented as a **collection of unique n-grams: groups of n adjacent tokens** (typically, words).

<br>

**1. Tokenization**

First, we have to split (tokenize) the text samples into word unigrams and bigrams. This determines the "vocabulary" of the dataset. For the text sample *'The mouse ran up the clock'*:

- The unique word unigrams (n = 1) are ['the', 'mouse', 'ran', 'up', 'clock']

- The word bigrams (n = 2) are ['the mouse', 'mouse ran', 'ran up', 'up the', 'the clock']
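
As a rough illustration of this tokenization step, the sketch below builds the unique unigrams and bigrams of the sample by lowercasing and splitting on whitespace. The helper names are made up for this example, and real tokenizers (such as the one inside `TfidfVectorizer`) apply more elaborate rules.

```python
def word_ngrams(text, n):
    """Returns all word n-grams of a text, in order of appearance."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unique_in_order(items):
    """Drops duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


sample = "The mouse ran up the clock"
unigrams = unique_in_order(word_ngrams(sample, 1))
bigrams = unique_in_order(word_ngrams(sample, 2))

print(unigrams)  # ['the', 'mouse', 'ran', 'up', 'clock']
print(bigrams)   # ['the mouse', 'mouse ran', 'ran up', 'up the', 'the clock']
```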

<br>

**2. Vectorization**

Once we have split our text samples into n-grams, we need to turn these n-grams into numerical vectors that our machine learning models can process. For the text samples *'The mouse ran up the clock'* and *'The mouse ran down'*:

- The indexes assigned to the unigrams and bigrams would be {'clock': 0, 'down': 1, 'mouse': 2, 'mouse ran': 3, 'ran': 4, 'ran down': 5, 'ran up': 6, 'the': 7, 'the clock': 8, 'the mouse': 9, 'up': 10, 'up the': 11}


Once indexes are assigned to the n-grams, we typically vectorize the text samples using one-hot encoding, count encoding, or **Tf-idf encoding**. This last option is recommended for vectorizing n-grams. For the text sample *'The mouse ran up the clock'*:
- The vectorization using Tf-idf encoding would be [0.33, 0, 0.23, 0.23, 0.23, 0, 0.33, 0.47, 0.33, 0.23, 0.33, 0.33]
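
To see where these numbers come from, here is a minimal pure-Python sketch of the computation. It assumes the weighting that scikit-learn's `TfidfVectorizer` applies by default (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) and a simple whitespace tokenizer; the helper names are made up for this example. The result agrees with the vector above to within 0.01 (the entries listed as 0.23 come out as about 0.2365 here).

```python
import math
from collections import Counter


def ngrams(text, n_min=1, n_max=2):
    """Lowercases, splits on whitespace and returns all uni- and bigrams."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


texts = ["The mouse ran up the clock", "The mouse ran down"]
docs = [Counter(ngrams(t)) for t in texts]

# Alphabetical index assignment, as in the dictionary shown above.
vocabulary = {g: i for i, g in enumerate(sorted(set().union(*docs)))}

# Document frequency: in how many texts each n-gram appears.
n_docs = len(docs)
doc_freq = Counter(g for doc in docs for g in doc)

# Smoothed inverse document frequency.
idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1 for g in vocabulary}


def tfidf_vector(doc):
    """Term count times idf, followed by L2 normalization."""
    weights = [doc[g] * idf[g] for g in vocabulary]  # dict order == index order
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


tfidf = tfidf_vector(docs[0])
print(vocabulary)
print([round(v, 3) for v in tfidf])
```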

<br>

**3. Feature selection**

When we convert all of the texts in a dataset into word uni+bigram tokens, we may end up with tens of thousands of tokens. Not all of these tokens/features contribute to label prediction, so we can drop certain tokens, for instance those that occur extremely rarely across the dataset. We can also measure feature importance (how much each token contributes to label predictions) and only include the most informative tokens. Two commonly used functions to calculate feature importance are **f_classif** and chi2. In addition, the Google guide reports that accuracy peaks at around 20,000 features for many datasets.
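
To make the idea of scoring features concrete, the sketch below computes a one-way ANOVA F-value per feature, which is the statistic that `f_classif` is based on, and keeps the top-scoring features. It is a pure-Python illustration on made-up tf-idf-like values for three hypothetical tokens, not a replacement for `SelectKBest`.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F-value of a single feature against class labels.

    Assumes at least two classes and non-zero within-class variance.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    overall_mean = sum(values) / len(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
                     for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical feature columns: one list of values per token, one value per text.
feature_names = ["great", "movie", "boring"]
columns = [
    [1.0, 0.8, 0.0, 0.1],  # "great": high in positive reviews
    [0.5, 0.4, 0.5, 0.4],  # "movie": similar in both classes
    [0.0, 0.1, 0.9, 0.7],  # "boring": high in negative reviews
]
labels = [1, 1, 0, 0]

scores = [anova_f_score(col, labels) for col in columns]

# Keep the k features with the highest scores, in their original order.
k = 2
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
selected = [feature_names[i] for i in sorted(ranked[:k])]
print(selected)
```

The token that behaves the same in both classes gets a score close to zero and is dropped, while the two sentiment-bearing tokens are kept.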

<br>

The following code:
- Tokenizes text samples into word unigrams + bigrams.
- Vectorizes them using tf-idf encoding.
- Selects only the top 20,000 features from the vector of tokens by discarding tokens that appear in fewer than 2 documents and using f_classif to calculate feature importance.

Vectorization parameters:

In [None]:
# Range (inclusive) of n-gram sizes for tokenizing text
NGRAM_RANGE = (1, 2)

# Whether text should be split into word or character n-grams ('word' or 'char')
TOKEN_MODE = 'word'

# Limit on the number of features. We use the top 20K features
TOP_K = 20000

# Minimum document/corpus frequency below which a token will be discarded
MIN_DOCUMENT_FREQUENCY = 2

In [48]:
def ngram_vectorize(train_texts, train_labels, val_texts):
    """Vectorizes texts as n-gram vectors:
    
    1 text = 1 tf-idf vector the length of vocabulary of unigrams + bigrams.

    Arguments:
    - train_texts: list, training text strings.
    - train_labels: np.ndarray, training labels.
    - val_texts: list, validation text strings.

    Returns:
    - x_train, x_val: vectorized training and validation texts
    """

    # Create keyword arguments to pass to the 'tf-idf' vectorizer.
    kwargs = {
        'ngram_range': NGRAM_RANGE,
        'dtype': np.float32,  # tf-idf weights are floats; integer dtypes trigger a warning
        'strip_accents': 'unicode',
        'decode_error': 'replace',
        'analyzer': TOKEN_MODE,
        'min_df': MIN_DOCUMENT_FREQUENCY,
    }
    vectorizer = TfidfVectorizer(**kwargs)

    # Learn vocabulary from training texts and vectorize training texts.
    x_train = vectorizer.fit_transform(train_texts)

    # Vectorize validation texts.
    x_val = vectorizer.transform(val_texts)

    # Select top 'k' of the vectorized features.
    selector = SelectKBest(f_classif, k=min(TOP_K, x_train.shape[1]))
    selector.fit(x_train, train_labels)
    x_train = selector.transform(x_train).astype('float32')
    x_val = selector.transform(x_val).astype('float32')
    return x_train, x_val

### Sequence vectors

In a sequence vector, text is represented as a **sequence of tokens, preserving order**.

<br>

**1. Tokenization**

Text can be represented as either a sequence of characters, or a sequence of words. Using word-level representation provides better performance than character tokens. Using character tokens makes sense only if texts have lots of typos.

<br>

**2. Vectorization**

Once we have converted our text samples into sequences of words, we need to turn these sequences into numerical vectors. The example below shows the indexes assigned to the unigrams generated for two texts, and then the sequence of token indexes to which the first text is converted.

- Texts: 'The mouse ran up the clock' and 'The mouse ran down'
- Index assigned for every token: {'the': 1, 'mouse': 2, 'ran': 3, 'up': 4,'clock': 5, 'down': 6}.
- Sequence of token indexes: 'The mouse ran up the clock' = [1, 2, 3, 4, 1, 5]

Note that 'the' occurs most frequently, so the index value of 1 is assigned to it. Also some libraries reserve index 0 for unknown tokens, as is the case here.
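
The indexing scheme described above can be sketched in a few lines. This is an illustrative implementation with a whitespace tokenizer and made-up helper names; it ranks tokens by frequency (ties keep first-seen order), starts numbering at 1, and maps unknown tokens to 0.

```python
from collections import Counter

texts = ["The mouse ran up the clock", "The mouse ran down"]
tokenized = [t.lower().split() for t in texts]

# Count every token across the corpus.
counts = Counter(token for tokens in tokenized for token in tokens)

# Most frequent token gets index 1; index 0 is kept for unknown tokens.
word_index = {word: idx
              for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text):
    """Converts a text into its sequence of token indexes."""
    return [word_index.get(token, 0) for token in text.lower().split()]


print(word_index)
print(to_sequence("The mouse ran up the clock"))  # [1, 2, 3, 4, 1, 5]
```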

To vectorize the token sequences we can use one-hot encoding or **word embeddings**. This last option is recommended for vectorizing sequences, since words have meaning(s) associated with them. As a result, we can represent word tokens in a dense vector space (~few hundred real numbers), where the location and distance between words indicate how similar they are semantically.


<div style="width: 1200px; overflow: hidden;">
    <img src="https://developers.google.com/static/machine-learning/guides/text-classification/images/WordEmbeddings.png" width="100%" alt="Word embeddings">
</div>

Sequence models often have such an embedding layer as their first layer. This layer learns to turn word index sequences into word embedding vectors during the training process, such that each word index gets mapped to a dense vector of real values representing that word’s location in semantic space.

<div style="width: 1200px; overflow: hidden;">
    <img src="https://developers.google.com/static/machine-learning/guides/text-classification/images/EmbeddingLayer.png" width="100%" alt="Embedding layer">
</div>
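
Conceptually, an embedding layer is a lookup table with one row per token index. The sketch below mimics that lookup in plain Python with randomly initialized rows; the vocabulary size, the embedding dimension of 4, and the initialization range are assumptions chosen for readability, and in a real model the rows would be learned during training rather than left random.

```python
import random

vocab_size = 7      # indexes 0..6: 0 for unknown tokens, 1..6 for the vocabulary
embedding_dim = 4   # illustrative; real embeddings use tens to hundreds of values

# One row of real values per token index (random here, learned in a real model).
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]


def embed(sequence):
    """Replaces each token index by its row in the embedding matrix."""
    return [embedding_matrix[idx] for idx in sequence]


# 'The mouse ran up the clock' as a sequence of token indexes.
sequence = [1, 2, 3, 4, 1, 5]
embedded = embed(sequence)

print(len(embedded), len(embedded[0]))  # 6 4
print(embedded[0] == embedded[4])       # both positions hold the token 'the'
```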

