# Text Classification

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_files

from sklearn.pipeline import Pipeline

from sklearn.decomposition import TruncatedSVD 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate

from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression

from sklearn.naive_bayes import MultinomialNB

In [None]:
rng = np.random.RandomState(2)

## Movie Reviews

Researchers at Stanford University obtained 50,000 movie reviews from IMDB. They ensured an even number of positive and negative reviews. The positive reviews were ones where the author of the review had given the movie a rating of at least 7 out of 10. Negative reviews were ones where the author of the review had given the movie at most 4 out of 10.

Question. Later in the notebook, we will train binary classifiers that achieve more than 80% accuracy on this dataset. Why does this figure tell us little about how our classifier would perform 'in the wild'?

We are using 25,000 of the reviews.

They do not come as a nice CSV file. Each review is in a separate file. The labels come from the directory structure: all the positive reviews are in a folder `pos`; all the negative ones are in a folder `neg`.

scikit-learn has a function, `load_files`, for reading files from such a structure.
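
To see how the folder names become labels, here is a minimal sketch that builds a tiny directory tree in a temporary folder and loads it. The folder name `toy_reviews`, the file names and the four example texts are made up for illustration; only `load_files` and its `encoding` and `random_state` parameters come from scikit-learn.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Invented example texts, keyed by the folder (class) they will be stored in
samples = {
    "pos": ["A wonderful film.", "Loved every minute."],
    "neg": ["Dull and overlong.", "A waste of time."],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy_reviews"  # hypothetical dataset folder
    for label, texts in samples.items():
        (root / label).mkdir(parents=True)
        for i, text in enumerate(texts):
            (root / label / f"{i}.txt").write_text(text, encoding="utf-8")

    # One sub-folder per class; each file inside is one document
    toy = load_files(str(root), encoding="utf-8", random_state=0)

# Map the integer targets back to folder names
labels = [toy.target_names[t] for t in toy.target]
print(toy.target_names)
print(list(zip(toy.data, labels)))
```

Each sub-folder name becomes a class name, and `target_names` lists them in alphabetical order, which is why `neg` ends up as 0 and `pos` as 1.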

In [None]:
import os
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_dir = "./drive/My Drive/Colab Notebooks/" # You may need to change this, depending on where your notebooks are on Google Drive
else:
    base_dir = "."
dataset_dir = os.path.join(base_dir, "datasets")

In [None]:
reviews = load_files(os.path.join(dataset_dir, "reviews"), encoding="utf-8", random_state=rng)

The result, which we have stored in `reviews`, is like a dictionary. `reviews.data` gives us a list of the reviews; `reviews.target` gives us a NumPy array of the class labels as integers; `reviews.target_names` maps the integers back to the class names. By default, the function shuffles the data.

In [None]:
len(reviews.data), len(reviews.target)

In [None]:
reviews.data[3] # Let's look at one example of a review

In [None]:
reviews.target

In [None]:
reviews.target_names # So 0 is neg and 1 is pos

In [None]:
reviews.target.sum() / len(reviews.target) # Fraction of positive reviews; 0.5 confirms the classes are balanced

Note that `load_files` takes an `encoding` argument. If you don't give one, it reads each file in as bytes instead of Unicode, and then a lot of other things won't work. For modern files, `encoding="utf-8"` will probably work.
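
One concrete example of what goes wrong with bytes: Python's `re` module refuses to apply a string pattern to a bytes object, so text cleaning with ordinary string patterns fails. The sample text below is invented.

```python
import re

# What a document looks like if it is read without an encoding: raw bytes
raw = "See http://example.com for the café scene".encode("utf-8")

try:
    re.sub(r"http\S+", "", raw)  # str pattern applied to bytes
    bytes_ok = True
except TypeError:
    bytes_ok = False

# Decoding first gives a Unicode string, and the same pattern works
text = raw.decode("utf-8")
cleaned = re.sub(r"http\S+", "", text)

print(bytes_ok)
print(cleaned)
```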

In [None]:
pd.Series(reviews.data).duplicated().sum() 

Question. There are 96 reviews that duplicate other reviews. That is too few to make much difference here but, strictly, I should delete them. Why?
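
If we did decide to remove them, one way is to reuse the boolean mask from `duplicated()`. This is a sketch on made-up data; the names `docs`, `labels` and `keep` are placeholders, and in this notebook the same mask would be applied to `reviews.data` and `reviews.target` before splitting.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for reviews.data and reviews.target
docs = ["great film", "awful film", "great film", "quite good"]
labels = np.array([1, 0, 1, 1])

# duplicated() is True for the second and later copies, so negate it
keep = ~pd.Series(docs).duplicated()

unique_docs = [d for d, k in zip(docs, keep) if k]
unique_labels = labels[keep.to_numpy()]

print(unique_docs, unique_labels)
```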

In [None]:
X_train, X_test, y_train, y_test = train_test_split(reviews.data, reviews.target, test_size=0.2, stratify=reviews.target, random_state=rng)

## Tokenizing

The review we displayed earlier contained URLs and HTML tags. We don't want these to be tokens. Before we train any models, let's look at the kind of tokens that we'll be getting.

In [None]:
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(X_train)
vectorizer.get_feature_names_out(), len(vectorizer.get_feature_names_out())

We can augment the sklearn preprocessor to strip certain things out of the text before it is tokenized, e.g. URLs, HTML tags and tokens that start with a digit.
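
Before wrapping substitutions like these in a vectorizer, it helps to try them on one string. The sketch below uses an invented review fragment, and the helper name `clean` is hypothetical; the patterns are one reasonable choice rather than the only one. It also shows why the HTML pattern needs care: a greedy `<.*>` removes everything between the first `<` and the last `>` on a line, whereas `<[^>]+>` removes only the tags themselves.

```python
import re

def clean(s):
    s = re.sub(r"<[^>]+>", " ", s)  # HTML tags become a space
    s = re.sub(r"http\S+", "", s)   # URLs
    s = re.sub(r"\b\d\S*", "", s)   # tokens that start with a digit
    return s

sample = "Great film!<br /><br />See http://example.com/review, released 1999, my 2nd favourite."
cleaned = clean(sample)
print(cleaned.split())

# Greedy versus per-tag matching on a line with two pairs of tags
html = "<b>bold</b> and plain <i>text</i> end"
greedy = re.sub(r"<.*>", "", html)
per_tag = re.sub(r"<[^>]+>", "", html)
print(repr(greedy), repr(per_tag))
```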

In [None]:
import re

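class MovieReviewVectorizer(TfidfVectorizer):
    """TfidfVectorizer that strips HTML tags, URLs and digit-led tokens first."""

    def build_preprocessor(self):
        preprocess = super().build_preprocessor()
        # Clean the raw text, then apply the default preprocessing (e.g. lowercasing)
        return lambda doc: preprocess(self._strip_numerics(self._strip_urls(self._strip_html(doc))))

    def _strip_urls(self, s):
        return re.sub(r"http\S+", "", s)

    def _strip_html(self, s):
        # [^>]+ stops at the first ">", so text between two tags is kept
        # (a greedy <.*> would delete it); tags become a space so words stay apart
        return re.sub(r"<[^>]+>", " ", s)

    def _strip_numerics(self, s):
        # Remove tokens that begin with a digit, e.g. "1999" or "2nd"
        return re.sub(r"\b\d\S*", "", s)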

In [None]:
vectorizer = MovieReviewVectorizer(stop_words="english")
vectorizer.fit(X_train)
vectorizer.get_feature_names_out()

Rather than making our preprocessor better and better, we can just keep a subset of the tokens: the `max_features` tokens with the highest term frequency across the training set.

In [None]:
vectorizer = MovieReviewVectorizer(stop_words="english", max_features=10000)
vectorizer.fit(X_train)
vectorizer.get_feature_names_out()

In effect, this is a form of feature selection - using a filter method, where the scoring function that does the filtering is term frequency.

## Model Selection

In [None]:
ss = ShuffleSplit(n_splits=1, test_size=0.25, random_state=rng)

In [None]:
def check_fit(model, X_train, y_train, cv, metric):
    scores = cross_validate(model, X_train, y_train, cv=cv, scoring=metric, return_train_score=True, n_jobs=-1)
    return scores["train_score"].mean(), scores["test_score"].mean()

In [None]:
logistic = Pipeline([
    ("vectorizer", MovieReviewVectorizer(stop_words="english", max_features=10000)),
    ("predictor", LogisticRegression(penalty=None, random_state=rng))])

In [None]:
train_acc, val_acc = check_fit(model=logistic, 
            X_train=X_train, y_train=y_train, 
            cv=ss, metric="accuracy")

train_acc, val_acc

Question. Are we underfitting or overfitting?

I tried a few variants: retaining stop-words instead of discarding them, discarding a customized set of stop-words, count vectorization instead of TF-IDF, and so on. Their effect on validation accuracy was modest.

In [None]:
logistic_svd = Pipeline([
    ("vectorizer", MovieReviewVectorizer(stop_words="english", max_features=10000)),
    ("svd", TruncatedSVD(300)),
    ("predictor", LogisticRegression(penalty=None, random_state=rng))])

In [None]:
train_acc, val_acc = check_fit(model=logistic_svd, 
            X_train=X_train, y_train=y_train, 
            cv=ss, metric="accuracy")

train_acc, val_acc

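We use truncated Singular Value Decomposition (SVD) in place of PCA for this kind of dataset: the TF-IDF matrix is sparse, and `TruncatedSVD` works on sparse matrices directly, whereas PCA centers the data first, which would make it dense.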
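
A small sketch with invented documents shows `TruncatedSVD` taking a vectorizer's sparse output as it is, with no conversion to a dense array beforehand. It uses 2 components rather than 300 because the toy vocabulary is tiny.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "a gripping plot and fine acting",
    "dull plot and wooden acting",
    "fine film with a gripping story",
    "wooden dialogue and a dull story",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)           # dense array, one row per document

print(sparse.issparse(X), X.shape)
print(sparse.issparse(X_reduced), X_reduced.shape)
```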

## n-grams

What kind of features would we get if we allowed bigrams? This time we won't discard stop-words.

In [None]:
vectorizer = MovieReviewVectorizer(ngram_range=(2,2), max_features=10000)
vectorizer.fit(X_train)
vectorizer.get_feature_names_out()

In the model, we'll allow both unigrams and bigrams.

In [None]:
logistic_bigrams_svd = Pipeline([
    ("vectorizer", MovieReviewVectorizer(ngram_range=(1,2), max_features=10000)),
    ("svd", TruncatedSVD(300)),
    ("predictor", LogisticRegression(penalty=None, random_state=rng))])

In [None]:
train_acc, val_acc = check_fit(model=logistic_bigrams_svd, 
            X_train=X_train, y_train=y_train, 
            cv=ss, metric="accuracy")

train_acc, val_acc

## Naive Bayes

In [None]:
naive_bayes = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english", max_features=1000)),
    ("predictor", MultinomialNB())])

In [None]:
train_acc, val_acc = check_fit(model=naive_bayes, 
            X_train=X_train, y_train=y_train, 
            cv=ss, metric="accuracy")

train_acc, val_acc

SVD does not make sense for Naive Bayes: `MultinomialNB` expects non-negative counts, and SVD components can be negative. But you'll see that I was able to filter more aggressively (`max_features=1000`).
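
The incompatibility can be checked directly. SVD output can contain negative values, and `MultinomialNB` rejects them. In this sketch a hand-made matrix with negative entries stands in for SVD output; the data and variable names are invented.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented count matrix: 4 documents, 3 features
X_counts = np.array([[2, 0, 1], [0, 3, 1], [1, 0, 2], [0, 2, 0]])
y = np.array([1, 0, 1, 0])

# Stand-in for SVD output: same shape, but with negative entries
X_signed = X_counts - 1.0

MultinomialNB().fit(X_counts, y)  # non-negative counts are accepted

try:
    MultinomialNB().fit(X_signed, y)
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```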