## Class 3: Neural Networks


In this notebook, we'll walk through an implementation of multi-class text classification using, first, simple Logistic Regression (which, you'll recall, is simply a single-neuron neural network in the binary case) and then a Feed-Forward Neural Network (FFN).

### Baseline model: Logistic Regression + BOW

Our task is to classify movies by genre based on their titles. We start with a baseline model: Logistic Regression over one of two feature representations:

- Bag of words

- TF-IDF

### Challenger model 1: Logistic Regression + word2vec

Then, we'll show how the pretrained word2vec embeddings introduced last week can be used to improve the accuracy of this simple baseline model.

### Challenger model 2: Neural Network + word2vec

Finally, we'll show how a simple FFN that takes averaged word2vec embeddings as its input can give us our best performance on this task.


## Classification task

Our classification task is, given a movie title, to predict the likely genre of that movie -- *Comedy*, *Documentary*, or *Drama*.


## Dataset
The data for this notebook is contained in `movie_lens_1k_three_genres.csv`, which should be placed in the same directory as the notebook. This is a modified version of the MovieLens dataset.

In [None]:
!pip install gensim

In [None]:
import pandas as pd
import numpy as np
import gensim
import nltk
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline


## Exploring the data



In [None]:
import pandas as pd
balanced_genre_df = pd.read_csv("movie_lens_1k_three_genres.csv")

In [None]:
balanced_genre_df.genres.value_counts().plot(kind="bar", rot=0)

We hold out 10% of the data for testing and train on the remaining 90%.

In [None]:
train_data, test_data = train_test_split(balanced_genre_df, test_size=0.1, random_state=42)

## Model evaluation approach
We will report accuracy and plot a row-normalized confusion matrix for every classifier.
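To make the evaluation concrete, here is a minimal NumPy sketch of how a confusion matrix is counted and then row-normalized. The helper `confusion_counts` and the `toy_*` data are hypothetical names introduced only for illustration; the notebook itself relies on scikit-learn's `confusion_matrix`.

```python
import numpy as np

def confusion_counts(true_labels, predicted_labels, label_order):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(label_order)}
    cm = np.zeros((len(label_order), len(label_order)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[index[t], index[p]] += 1
    return cm

# Hypothetical toy labels, not taken from the dataset
toy_true = ["Drama", "Drama", "Comedy", "Documentary", "Comedy"]
toy_pred = ["Drama", "Comedy", "Comedy", "Documentary", "Drama"]
toy_order = ["Drama", "Comedy", "Documentary"]

toy_cm = confusion_counts(toy_true, toy_pred, toy_order)
# Dividing each row by its total gives, per true genre, the share of each prediction
toy_cm_normalized = toy_cm / toy_cm.sum(axis=1, keepdims=True)
# Accuracy is the share of examples on the diagonal
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()
```

Note that the row and column order is whatever `label_order` says, so the tick labels on a plot must use the same order.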

In [None]:
my_tags = ["Drama", "Comedy", "Documentary"]

In [None]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(my_tags))
    target_names = my_tags
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def evaluate_prediction(predictions, target, title="Confusion matrix"):
    print('accuracy %s' % accuracy_score(target, predictions))
    # Fix the label order so rows/columns match the tick labels drawn from my_tags
    cm = confusion_matrix(target, predictions, labels=my_tags)
    print('confusion matrix\n %s' % cm)
    print('(row=expected, col=predicted)')

    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plot_confusion_matrix(cm_normalized, title + ' Normalized')

In [None]:
def predict(vectorizer, classifier, data):
    data_features = vectorizer.transform(data['title'])
    predictions = classifier.predict(data_features)
    target = data['genres']
    evaluate_prediction(predictions, target)

## Baseline: bag of words, TF-IDF

Let's start with some simple baselines before diving into more advanced methods.

### Bag of words

We remove stop-words and limit our vocabulary to the 3k most frequent words.
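As a rough illustration of what these two steps do, the following pure-Python sketch builds a capped vocabulary and a count vector by hand. The helpers `build_vocabulary` and `bag_of_words`, the whitespace tokenization, and the toy titles are all assumptions made for this example; `CountVectorizer` handles tokenization and stop-words in its own way.

```python
from collections import Counter

def build_vocabulary(documents, stop_words, max_features):
    """Keep the max_features most frequent non-stop-words, indexed alphabetically."""
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()
        if word not in stop_words
    )
    most_common = [word for word, _ in counts.most_common(max_features)]
    return {word: i for i, word in enumerate(sorted(most_common))}

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Hypothetical toy titles and stop-word list
toy_titles = ["The Big Sleep", "The Big Lebowski", "Big Fish", "Sleep Tight"]
toy_stop_words = {"the", "in"}

toy_bow_vocab = build_vocabulary(toy_titles, toy_stop_words, max_features=2)
toy_bow_vector = bag_of_words("The Big Big Sleep", toy_bow_vocab)
```

Words outside the capped vocabulary (here "lebowski", "fish" and "tight") are simply ignored, which is why very short titles can end up with an all-zero feature vector.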

In [None]:
nltk.download('punkt_tab')

In [None]:
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

In [None]:
# training
count_vectorizer = CountVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=3000)
train_data_features = count_vectorizer.fit_transform(train_data['title'])

In [None]:

logreg = linear_model.LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(train_data_features, train_data['genres'])

In [None]:
count_vectorizer.get_feature_names_out()[500:510]

In [None]:
%%time

predict(count_vectorizer, logreg, test_data)

36% isn't great but, as a sanity check, let's look at the most informative features.
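The idea behind this sanity check can be sketched with a toy coefficient matrix. In scikit-learn, the rows of `coef_` follow the order of `classes_`, so it is safest to look a genre up by name. The helper `top_words` and all `toy_*` values below are hypothetical and only illustrate the ranking step.

```python
import numpy as np

# Hypothetical vocabulary, class order, and coefficients (one row per class)
toy_features = np.array(["love", "war", "laugh", "true"])
toy_classes = ["Comedy", "Documentary", "Drama"]
toy_coef = np.array([
    [0.2, -1.0, 2.5, -0.3],   # Comedy
    [-0.5, 0.8, -1.2, 3.0],   # Documentary
    [1.5, 0.9, -0.7, -0.1],   # Drama
])

def top_words(coef, classes, features, genre, k):
    """Return the k features with the largest coefficients for the given genre."""
    row = coef[classes.index(genre)]
    best = np.argsort(row)[::-1][:k]  # indices sorted by descending weight
    return [str(features[i]) for i in best]

toy_top_comedy = top_words(toy_coef, toy_classes, toy_features, "Comedy", 2)
toy_top_documentary = top_words(toy_coef, toy_classes, toy_features, "Documentary", 1)
```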

In [None]:
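def most_influential_words(vectorizer, genre_index=0, num_words=10):
    features = vectorizer.get_feature_names_out()
    # logreg.classes_ is sorted alphabetically, so look up the coefficient row
    # by genre name rather than assuming it matches the order of my_tags
    class_index = list(logreg.classes_).index(my_tags[genre_index])
    max_coef = sorted(enumerate(logreg.coef_[class_index]), key=lambda x: x[1], reverse=True)
    return [features[x[0]] for x in max_coef[:num_words]]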

In [None]:
# words for the Comedy genre
comedy_tag_id = 1
print(my_tags[comedy_tag_id])
most_influential_words(count_vectorizer, comedy_tag_id)

In [None]:
train_data_features[0]

### TF-IDF
Let's modify our count-based features with TF-IDF weights. These scale each word's count in a title by how rare that word is across all titles, so words that appear everywhere contribute less, and (with scikit-learn's default L2 normalization) they also adjust for document length.
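The weighting can be sketched in a few lines of plain Python. This assumes the smoothed formulation that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The helper `tfidf_vectors` and the pre-tokenized toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for pre-tokenized documents."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    vocab = sorted(doc_freq)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Hypothetical toy corpus: "big" occurs in every document, the other words in one each
toy_docs = [["big", "sleep"], ["big", "lebowski"], ["big", "fish"]]
toy_tfidf_vocab, toy_tfidf = tfidf_vectors(toy_docs)
```

In the first toy document, "sleep" ends up with a larger weight than "big" even though both occur once, because "big" appears in every document.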


In [None]:
tf_vect = TfidfVectorizer(
    min_df=2, tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english')
train_data_features = tf_vect.fit_transform(train_data['title'])

logreg = linear_model.LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(train_data_features, train_data['genres'])

In [None]:
tf_vect.get_feature_names_out()[500:510]

In [None]:
predict(tf_vect, logreg, test_data)

We're doing better: 42%

In [None]:
most_influential_words(tf_vect, comedy_tag_id)

# word2vec-based Logistic Regression via Averaging Word Vectors

Aside from their usefulness for lexical similarity tasks, word2vec-based vector representations can be used in place of BOW-based features. This bootstraps the word weights with the distributional information encoded in word2vec.
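The core idea, turning a title into a single vector by averaging the vectors of its known words, can be sketched with a tiny hand-made embedding table. The 3-dimensional `toy_embeddings` and the helper `average_vector` are hypothetical stand-ins for the 300-dimensional pretrained vectors.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for two words
toy_embeddings = {
    "funny": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}

def average_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens and scale the result to unit length."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: fall back to an all-zero document vector
        return np.zeros(dim)
    mean = np.mean(known, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean

# The out-of-vocabulary token is skipped rather than treated as zeros
toy_doc_vector = average_vector(["funny", "movie", "unknownword"], toy_embeddings, dim=3)
toy_empty_vector = average_vector(["unknownword"], toy_embeddings, dim=3)
```

Every document ends up with a dense vector of the same size regardless of its length, which is what lets us feed these features straight into Logistic Regression.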


In [None]:
import gensim.downloader as api
import numpy as np

# Load a pre-trained Word2Vec model
# This might take some time and download a large file (around 1.5 GB for word2vec-google-news-300)
print("Loading pre-trained Word2Vec model...")

model = api.load("word2vec-google-news-300")

Example vocabulary

In [None]:
from itertools import islice
list(islice(model.key_to_index, 12000, 12020))

In [None]:
import logging

def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.key_to_index: # Use key_to_index
            mean.append(wv.vectors[wv.key_to_index[word]]) # Use vectors and key_to_index
            all_words.add(wv.key_to_index[word]) # Use key_to_index

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,) # Use vector_size

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, review) for review in text_list ])

For word2vec we apply a different tokenization scheme. We want to preserve case, as the vocabulary distinguishes between lowercase and uppercase.
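A small sketch of why case matters for lookups: with a case-sensitive vocabulary, lowercasing a token can land on a different entry or miss entirely. The `toy_vocabulary` and the helper `in_vocabulary` are hypothetical and do not reflect the actual contents of the pretrained model.

```python
# Hypothetical toy vocabulary that, like a case-sensitive embedding vocabulary,
# stores capitalized and lowercase forms as distinct entries
toy_vocabulary = {"Paris", "Texas", "paris", "love"}

def in_vocabulary(tokens, vocabulary):
    """Return the tokens that have an entry in the vocabulary."""
    return [token for token in tokens if token in vocabulary]

toy_tokens = ["Paris", "Texas", "Love"]

# Keeping case finds the capitalized entries
toy_kept_cased = in_vocabulary(toy_tokens, toy_vocabulary)
# Lowercasing finds different entries and loses "Texas" altogether
toy_kept_lowered = in_vocabulary([t.lower() for t in toy_tokens], toy_vocabulary)
```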

In [None]:
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

In [None]:
test_tokenized = test_data.apply(lambda r: w2v_tokenize_text(r['title']), axis=1).values
train_tokenized = train_data.apply(lambda r: w2v_tokenize_text(r['title']), axis=1).values

In [None]:
%%time
X_train_word_average = word_averaging_list(model,train_tokenized)
X_test_word_average = word_averaging_list(model,test_tokenized)

Let's see how the logistic regression classifier performs on these word-averaging document features.

In [None]:
logreg = linear_model.LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(X_train_word_average, train_data['genres'])
predicted = logreg.predict(X_test_word_average)

47% -- a modest but noticeable lift, and the best result we have seen so far.

In [None]:
evaluate_prediction(predicted, test_data.genres)

# Feedforward Neural Network + word2vec
Let's try to improve our basic, logistic regression-based multi-class text classification model using a simple PyTorch-based FFN.
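Before turning to PyTorch, it may help to see the two models side by side as plain forward passes. This NumPy sketch assumes a 300-dimensional input, a 128-unit hidden layer and 3 output classes; the random weights and all `toy_*` names are purely illustrative and are not trained.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def logistic_regression_forward(x, W, b):
    # Multi-class logistic regression: one linear map followed by softmax
    return softmax(x @ W + b)

def ffn_forward(x, W1, b1, W2, b2):
    # The FFN inserts a hidden layer with a ReLU non-linearity before the output layer
    hidden = np.maximum(0, x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(4, 300))  # 4 hypothetical document vectors

toy_lr_probs = logistic_regression_forward(
    toy_x, rng.normal(size=(300, 3)) * 0.01, np.zeros(3))
toy_ffn_probs = ffn_forward(
    toy_x,
    rng.normal(size=(300, 128)) * 0.01, np.zeros(128),
    rng.normal(size=(128, 3)) * 0.01, np.zeros(3))
```

The only structural difference is the hidden layer: without the ReLU, the two linear maps would collapse into a single one and the FFN would be no more expressive than logistic regression.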


## Import PyTorch and Necessary Modules
Import `torch`, `torch.nn`, `torch.optim`, and `torch.utils.data` for building and training the neural network.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

print("PyTorch and necessary modules imported.")

In [None]:
genre_to_int = {genre: i for i, genre in enumerate(my_tags)}

train_labels = train_data['genres'].map(genre_to_int)
test_labels = test_data['genres'].map(genre_to_int)

print("Genre labels converted to integers.")
print("Train labels sample:", train_labels.head())
print("Test labels sample:", test_labels.head())

## Creating PyTorch `TensorDataset` and `DataLoader`


In [None]:
X_train_tensor = torch.tensor(X_train_word_average, dtype=torch.float32)
y_train_tensor = torch.tensor(train_labels.values, dtype=torch.long)
X_test_tensor = torch.tensor(X_test_word_average, dtype=torch.float32)
y_test_tensor = torch.tensor(test_labels.values, dtype=torch.long)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print("PyTorch TensorDataset and DataLoader created.")
print(f"Training data tensor shape: {X_train_tensor.shape}")
print(f"Training labels tensor shape: {y_train_tensor.shape}")
print(f"Test data tensor shape: {X_test_tensor.shape}")
print(f"Test labels tensor shape: {y_test_tensor.shape}")

## Defining a PyTorch Neural Network


In [None]:
class GenreClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GenreClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Get input dimension from X_train_word_average
input_dim = X_train_word_average.shape[1]  # Should be 300
hidden_dim = 128  # Example hidden layer size
output_dim = len(my_tags)  # Number of genres, should be 3

model = GenreClassifier(input_dim, hidden_dim, output_dim)

print("Neural Network Model Defined:")
print(model)


## Instantiate Model, Loss Function, and Optimizer

With the `GenreClassifier` model instantiated above, we specify the loss function as `torch.nn.CrossEntropyLoss` and select an optimizer, `torch.optim.Adam`.
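`CrossEntropyLoss` takes raw logits and integer class indices, and combines a log-softmax with the negative log-likelihood, averaged over the batch by default. The following NumPy sketch mirrors that computation under those assumptions; `cross_entropy_from_logits` and the toy inputs are hypothetical names used only here.

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean negative log-probability of the target class, computed from raw logits."""
    # Log-softmax, shifted by the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to each example's true class
    return -log_probs[np.arange(len(targets)), targets].mean()

# Hypothetical batch: a fairly confident correct prediction and a uniform one
toy_logits = np.array([[2.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])
toy_targets = np.array([0, 2])

toy_loss = cross_entropy_from_logits(toy_logits, toy_targets)
```

This is also why the network's `forward` method returns logits without a final softmax: the loss applies it internally.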


In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print("Criterion (loss function) and Optimizer defined.")

## Training the Model


In [None]:
num_epochs = 40

for epoch in range(num_epochs):
    model.train() # Set the model to training mode
    total_loss = 0
    for batch_idx, (data, labels) in enumerate(train_loader):
        optimizer.zero_grad() # Zero the gradients
        outputs = model(data) # Forward pass
        loss = criterion(outputs, labels) # Calculate loss
        loss.backward() # Backward pass
        optimizer.step() # Update weights

        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}")

print("Model training complete.")

## Evaluating the Model


In [None]:
model.eval() # Set the model to evaluation mode
predictions_list = []
true_labels_list = []

with torch.no_grad(): # Disable gradient calculation for evaluation
    for data, labels in test_loader:
        outputs = model(data)
        _, predicted = torch.max(outputs, 1) # The index of the largest logit is the predicted class
        predictions_list.extend(predicted.cpu().numpy()) # Collect predictions
        true_labels_list.extend(labels.cpu().numpy()) # Collect true labels

# Convert integer predictions and true labels back to genre names
int_to_genre = {i: genre for i, genre in enumerate(my_tags)}
predicted_genre_names = [int_to_genre[p] for p in predictions_list]
true_genre_names = [int_to_genre[t] for t in true_labels_list]

# Evaluate the model using the provided function
print("Neural Network Model Evaluation:")
evaluate_prediction(predicted_genre_names, true_genre_names, title="NN Classifier Confusion Matrix")

Nice! 50% accuracy achieved with a very simple architecture and no regularization or hyperparameter tuning.