# Project 4: Model Performance and Comparison

## Part B: Sklearn and Natural Language Processing
In this part, you will apply sklearn and related NLP libraries to perform data analysis on the [IMDB movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/). Before you begin, check that your installed `scikit-learn` version is as specified in `requirements.txt`; otherwise you may not pass the local tests.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.svm import SVC
from sklearn.decomposition import LatentDirichletAllocation

from gensim.models import Word2Vec

import pandas as pd
import numpy as np
import scipy.sparse as sp

We begin by loading a subset of the dataset, which contains 5000 movie reviews and their associated sentiment labels (i.e., whether a review is considered positive or negative).

In [2]:
df_reviews = pd.read_csv("imdb_reviews.csv")

In [3]:
# this cell has been tagged with excluded_from_script
# it will be ignored by the autograder
df_reviews.head()

Unnamed: 0,review,processed_review,sentiment
0,Taran Adarsh a reputed critic praised such a d...,taran adarsh repute critic praise dubba movie ...,negative
1,"Worth the entertainment value of a rental, esp...",worth entertainment value rental especially li...,negative
2,"I liked Antz, but loved ""A Bug's Life"". The an...",like antz love bug life animation put paid def...,positive
3,This reboot is like a processed McDonald's mea...,reboot like process mcdonald meal compare ang ...,negative
4,"The working title was: ""Don't Spank Baby"". <br...",work title spank baby wayne crawford go become...,positive


The `review` column contains the raw review texts from the original dataset. However, it's always a good idea to process and clean text data before performing analysis. Since you already performed this task in Project 3, we provide the processed reviews for you here. The column `processed_review` was constructed by processing and tokenizing the raw reviews with the `preprocess_text` function from Project 3, and then joining the tokens of each review with a single space. From this point on, you only need to focus on the `processed_review` and `sentiment` columns.

Next, let's look at the distribution of class labels:

In [4]:
# this cell has been tagged with excluded_from_script
# it will be ignored by the autograder
display(df_reviews['sentiment'].value_counts())

sentiment,count
negative,2500
positive,2500


We see that there are 2500 positive reviews and 2500 negative reviews. In other words, our dataset is [perfectly balanced](https://i.imgflip.com/303krn.jpg).

### Question 11: Count Vectorizer

The first feature construction task we will perform is building a term-frequency matrix. Implement the function `count_vectorizer` that uses sklearn's `CountVectorizer` API to construct the term-frequency training matrix and testing matrix, along with the feature names (i.e., the list of words corresponding to the columns in the matrices).

One point to keep in mind is that `CountVectorizer` will, by default, do its own preprocessing and tokenization (see the [documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes) for more details). As these steps have already been performed, we need to override sklearn's default behavior by specifying the following parameters:
* `analyzer` and `tokenizer` should be `str.split`.
* `preprocessor` should be a function that simply returns the input. We have built this function, called `dummy_fun`, for you.


**Notes**:
* Recall from the data normalization function in Part A that, in any feature construction or transformation task, we fit only on the training data and then transform both the training and test data. In other words, no fitting should be done on the test data.

In [5]:
def dummy_fun(x):
    return x

def count_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the term-frequency matrices for reviews_train and reviews_test using CountVectorizer.

    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training

    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing

    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : TF matrix for training
            tf_test (scipy.sparse.csr_matrix) : TF matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the TF matrices
    """
    vectorizer = CountVectorizer(
        analyzer = "word",
        tokenizer = str.split,
        preprocessor = dummy_fun,
        token_pattern=None
    )

    # Fit the vocabulary on the training reviews only
    tf_train = vectorizer.fit_transform(reviews_train)
    features = list(vectorizer.get_feature_names_out())

    # Reuse the fitted vocabulary for the test reviews, if any
    tf_test = vectorizer.transform(reviews_test) if reviews_test is not None else None

    return tf_train, tf_test, features

In [6]:
def test_count_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    count_vec_train, count_vec_test, features = count_vectorizer(reviews_train, reviews_test)
    assert count_vec_train.shape == (3750, 27242)
    assert count_vec_test.shape == (1250, 27242)
    assert np.allclose(
        count_vec_train.sum(axis = 1)[:10].ravel().tolist()[0],
        [70, 65, 168, 77, 139, 132, 28, 139, 453, 89]
    )
    assert np.allclose(
        count_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [168, 60, 59, 144, 494, 135, 69, 119, 76, 68]
    )
    assert features[:10] == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert features[-10:] == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")

test_count_vectorizer()

All tests passed!


### Question 12: TF-IDF Vectorizer

Now let's use the TF-IDF feature construction method. Implement the function `tfidf_vectorizer` that uses sklearn's `TfidfVectorizer` API to construct the TF-IDF training and testing matrices, along with the feature names (i.e., the list of words corresponding to the columns in the matrices). Use the same parameter values for `analyzer`, `tokenizer` and `preprocessor` as you did in the previous question.
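
As a rough illustration of what TF-IDF weighting does to raw counts, the sketch below reproduces the weighting by hand in NumPy. It assumes sklearn's default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`); the toy matrix is made up, and this is not a substitute for `TfidfVectorizer`.

```python
import numpy as np

# Toy term-frequency matrix: 3 documents (rows) x 3 terms (columns), made up for illustration
tf = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
])

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
tfidf = tf * idf                           # reweight the raw counts
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(idf)
print(tfidf)
```

The first term occurs in every document, so it gets the smallest IDF weight, while the rarer terms are weighted up. After normalization every row has unit length, which is why the row sums in the TF-IDF test are small compared to the raw counts.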

In [None]:
def tfidf_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the TF-IDF matrices for reviews_train and reviews_test using TfidfVectorizer.

    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training

    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing

    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : TF-IDF matrix for training
            tf_test (scipy.sparse.csr_matrix) : TF-IDF matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the TF-IDF matrices
    """
    vectorizer = TfidfVectorizer(
        analyzer = "word",
        tokenizer = str.split,
        preprocessor = dummy_fun,
        token_pattern=None
    )

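    # Fit the vocabulary and IDF weights on the training reviews only
    tf_train = vectorizer.fit_transform(reviews_train)
    features = list(vectorizer.get_feature_names_out())

    # Reuse the fitted vocabulary and IDF weights for the test reviews, if any
    tf_test = vectorizer.transform(reviews_test) if reviews_test is not None else None

    return tf_train, tf_test, features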

In [None]:
def test_tfidf_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    tfidf_vec_trains, tfidf_vec_test, features = tfidf_vectorizer(reviews_train, reviews_test)
    assert tfidf_vec_trains.shape == (3750, 27242)
    assert tfidf_vec_test.shape == (1250, 27242)
    assert np.allclose(
        tfidf_vec_trains.sum(axis = 1)[:10].ravel().tolist()[0],
        [7.03658925089979, 7.417196035144321, 11.492434722367015, 6.965673648338525, 9.428219597939362, 9.425632229448961, 3.9722806270035345, 9.635230284023372, 11.779155501275017, 7.44670396016231]
    )
    assert np.allclose(
        tfidf_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [7.2233277330801196, 4.869804242110142, 6.249091468966529, 9.689812079503804, 11.89432945296538, 9.115185225757216, 6.798492438570971, 8.57464867777901, 7.954528809138947, 6.81383392701789]
    )
    assert features[:10] == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert features[-10:] == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")

test_tfidf_vectorizer()

All tests passed!


### Question 13: Predicting review sentiment
Let's now see which feature construction method, TF or TF-IDF, is better for predicting review sentiment in our dataset. Our learning algorithm will be a support vector machine with a Gaussian (RBF) kernel, which uses a different hypothesis function that can also handle data that are not linearly separable. You can apply this learning algorithm by creating an instance of sklearn's [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class, with `kernel = "rbf"` and `C = 10`.
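
For intuition about the Gaussian kernel, the following sketch evaluates $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ on two made-up vectors. The helper name `rbf_kernel_value` and the value `gamma = 0.5` are assumptions chosen for illustration; `SVC` sets `gamma` through its own parameter.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two toy feature vectors, made up for illustration
x = [1.0, 0.0, 2.0]
z = [1.0, 1.0, 0.0]

same = rbf_kernel_value(x, x, gamma=0.5)       # identical points give the maximum value
different = rbf_kernel_value(x, z, gamma=0.5)  # the value decays with squared distance

print(same, different)
```

The kernel value acts as a similarity score between 0 and 1, so the classifier can weigh training reviews by how close they are to the review being classified.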

Implement the function `predict_review_sentiment` that takes as input the `processed_review` and `sentiment` columns of our IMDB dataset and performs the following tasks:
1. Convert the `sentiment` column to a vector `y` of 1s and -1s: `positive` corresponds to 1 and `negative` to -1.
1. Perform a [stratified k-fold split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) of the review and sentiment vectors, based on the provided `k`. Also set `shuffle` to `True` and `random_state` to the provided `seed`.
1. For $f = 1, \ldots, k$:
     * Let fold $f$ be the test set, and the remaining $k-1$ folds be the training set.
     * Convert the training and testing reviews to feature matrices `X_train` and `X_test`, using either TF or TF-IDF. Which method to use is based on the function parameter `method`.
     * Train the SVM model on `X_train, y_train` and evaluate its accuracy $a_f$ on `X_test, y_test`.
1. Return $a_1, a_2, \ldots, a_k$.
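
The splitting step above can be sketched on toy labels as follows. The labels and placeholder features are made up; the point is only that a stratified split keeps the class proportions the same in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 6 positive (1) and 6 negative (-1) examples, made up for illustration
y = np.array([1] * 6 + [-1] * 6)
# Placeholder features; only the number of rows matters for splitting
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many examples of each class land in this test fold
    n_pos = int((y[test_idx] == 1).sum())
    n_neg = int((y[test_idx] == -1).sum())
    fold_sizes.append((n_pos, n_neg))

print(fold_sizes)
```

Note that `split` yields integer positions rather than index labels, so positional indexing (for example `.iloc` on a pandas Series) is the safe way to select the rows of each fold.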

**Notes**:
* As a reminder, accuracy is defined as
$$\text{Acc} = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(y^{(i)} = \hat y^{(i)}).$$
You can also use the `score` function from `SVC` to quickly compute accuracy on test data.
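
As a small worked example of the formula, the snippet below computes accuracy directly from made-up label vectors.

```python
import numpy as np

# Toy true and predicted labels, made up for illustration
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# The mean of the elementwise matches is the fraction of correct predictions
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)
```

Three of the five predictions match, so the accuracy is 3/5.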

In [None]:
def predict_review_sentiment(reviews, sentiments, method, k, seed = 0):
    """
    Compute the cross-validated accuracy of SVM with either TF or TF-IDF features
    in predicting review sentiment.

    args:
        reviews (pd.Series[str]) : a Series of all processed movie reviews
        sentiments (pd.Series[str]) : a Series of movie review sentiments,
            containing either "positive" or "negative"
        method (str) : a string which is either "TF" or "TF-IDF",
            specifying which feature construction method to use
        k (int) : the number of folds in stratified k-fold split

    kwargs:
        seed (int) : the random generator seed for kfold split

    return:
        List[float] : a list of k accuracy values from evaluating a trained SVM model
            on each of the k folds, using the remaining folds as training data
    """
    y = np.where(sentiments == "positive", 1, -1)
    skf = StratifiedKFold(n_splits = k, shuffle = True, random_state = seed)

    accuracies = []

    for train_index, test_index in skf.split(reviews, y):
        # split() yields integer positions, so index the Series positionally
        reviews_train, reviews_test = reviews.iloc[train_index], reviews.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]

        if method == "TF":
            X_train, X_test, _ = count_vectorizer(reviews_train, reviews_test)
        elif method == "TF-IDF":
            X_train, X_test, _ = tfidf_vectorizer(reviews_train, reviews_test)
        else:
            raise ValueError("method must be either 'TF' or 'TF-IDF'")

        svm = SVC(kernel = "rbf", C = 10)
        svm.fit(X_train, y_train)

        accuracies.append(svm.score(X_test, y_test))

    return accuracies

In [None]:
def test_predict_review_sentiment():
    # prediction based on TF
    count_vec_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "TF", 10)
    assert count_vec_accs == [0.878, 0.836, 0.854, 0.824, 0.826, 0.824, 0.824, 0.85, 0.844, 0.83]

    # prediction based on TF-IDF
    tf_idf_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "TF-IDF", 10)
    assert tf_idf_accs == [0.88, 0.862, 0.85, 0.868, 0.854, 0.846, 0.864, 0.874, 0.874, 0.846]
    print("All tests passed!")
    print("Cross-validated accuracy of SVM with TF matrices", np.mean(count_vec_accs))
    print("Cross-validated accuracy of SVM with TF-IDF matrices", np.mean(tf_idf_accs))

test_predict_review_sentiment()

We see that using TF-IDF features yields better cross-validated accuracy than using TF features (when the learning algorithm is SVM with RBF kernel and $C = 10$), although the difference in this case is not large.

### Question 14: Topic modeling and word distribution
Let's now try to understand the review texts a bit more. We can treat all the reviews as a corpus and perform Latent Dirichlet Allocation to extract the corpus topics. We can also see which words are most frequent in a given topic. Implement the function `top_words_by_topic` that takes as input the `processed_review` column in our IMDB dataset and performs the following tasks:

1. Build a term-frequency matrix out of this column. Remember to use the same `CountVectorizer` parameters as in Q11.
1. Input this matrix to sklearn's `LatentDirichletAllocation`. The number of topics and random generator seed are provided as function parameters. You should specify `learning_method` as `"online"`.
1. In the resulting word-topic matrix, identify the most frequent `n_top_words` in each topic. These most frequent words should be sorted from lower to higher frequency.
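
The selection in the last step can be sketched with `argsort` on a single made-up topic row. The words and weights below are invented for illustration and do not come from a fitted LDA model.

```python
import numpy as np

# One hypothetical row of a word-topic matrix and its matching vocabulary
words = ["plot", "actor", "scene", "music", "war"]
weights = np.array([0.5, 3.0, 1.5, 0.1, 2.0])

n_top_words = 3
# argsort sorts ascending, so the last n entries are the largest weights,
# already ordered from lower to higher
top_idx = weights.argsort()[-n_top_words:]
top_words = [words[i] for i in top_idx]

print(top_words)
```

Because the slice is taken from the end of an ascending sort, no reversal is needed to get the lower-to-higher ordering that the question asks for.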

In [15]:
def top_words_by_topic(reviews, n_topics = 10, n_top_words = 20, seed = 0):
    """
    Perform topic modeling on the movie review corpus and return the most frequent words in each topic.

    args:
        reviews (pd.Series[str]) : a Series of all processed movie reviews

    kwargs:
        n_topics (int) : the number of topics to model by LDA
        n_top_words (int) : the number of most frequent words to identify in each topic
        seed (int) : the random generator seed for LDA

    return:
        List[List[str]] : a nested list of words, where each of the n_topics inner lists
            contains the n_top_words most frequent words in a given topic
    """

    # Step 1: Use the existing count_vectorizer function from Q11 to get term-frequency matrix and feature names
    tf_matrix, _, feature_names = count_vectorizer(reviews)

    # Step 2: Apply Latent Dirichlet Allocation (LDA)
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=seed,
        learning_method="online"
    )

    lda.fit(tf_matrix)

    # Step 3: Collect the top words of each topic, ordered from lower to higher frequency
    topic_words = []
    for topic in lda.components_:
        top_indices = topic.argsort()[-n_top_words:]  # ascending sort, so the last n are the largest
        top_words = [feature_names[i] for i in top_indices]  # map column indices to words
        topic_words.append(top_words)

    return topic_words


In [16]:
def test_top_words_by_topic():
    corpus = pd.Series([
        "I like to eat broccoli and bananas",
        "I ate a banana and spinach smoothie for breakfast",
        "Chinchillas and kittens are cute",
        "My sister adopted a kitten yesterday",
        "Look at this cute hamster munching on a piece of broccoli"
    ])
    top_words = top_words_by_topic(corpus, n_topics = 2, n_top_words = 5)
    # print(top_words)
    assert top_words == [['Look', 'broccoli', 'and', 'cute', 'a'], ['I', 'eat', 'like', 'to', 'and']]

    top_words = top_words_by_topic(df_reviews["processed_review"], n_topics = 5, n_top_words = 5)
    assert top_words == [
        ['performance', 'play', 'version', 'jack', 'role'],
        ['dancer', 'paris', 'dance', 'cartoon', 'hitchcock'],
        ['make', 'like', 'one', 'film', 'movie'],
        ['film', 'father', 'world', 'american', 'war'],
        ['mad', 'sheriff', 'match', 'carmen', 'arthur']
    ]
    print("All tests passed!")

test_top_words_by_topic()

All tests passed!


### Bonus: Word embedding and word similarity
Finally, let's look at how we can train a word embedding model from our movie review corpus. Unfortunately, gensim's `Word2Vec` does not produce reproducible results across different environments, so we will not grade this question. Instead, we provide the implementation of the function `find_most_similar_words`, which takes as input the `processed_review` column of our IMDB dataset and, for each input word, returns a list of the `n_similar_words` tokens most similar to that word according to the Word2Vec model. The tokens are ordered from lower to higher similarity.

You can see the code below and play around with different settings to better understand Word2Vec.
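
For intuition about what "most similar" means here, gensim's `most_similar` ranks words by the cosine similarity of their vectors. The sketch below applies the same idea to hand-made 2-D vectors; these vectors are hypothetical and are not the output of a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-D "embeddings", made up for illustration
vectors = {
    "good": np.array([1.0, 0.0]),
    "great": np.array([0.9, 0.1]),
    "fine": np.array([0.6, 0.8]),
    "bad": np.array([-1.0, 0.2]),
}

query = "good"
scores = [
    (word, cosine_similarity(vectors[query], vec))
    for word, vec in vectors.items()
    if word != query
]
# Sort from lower to higher similarity, matching the ordering used in this section
ranked = sorted(scores, key=lambda pair: pair[1])

print(ranked)
```

With real Word2Vec vectors, the neighbors reflect which words occur in similar contexts in the corpus, which is why the results can vary between training runs.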

In [None]:
def find_most_similar_words(reviews, input_words, n_similar_words):
    corpus = [review.split() for review in reviews]
    model = Word2Vec(sentences = corpus, vector_size = 100, window = 5, workers = 4, min_count = 1)
    similar_words = []
    for inp_word in input_words:
        # Sort the (word, similarity) pairs from lower to higher similarity
        similar_words.append(sorted(model.wv.most_similar(inp_word, topn = n_similar_words), key = lambda x: x[1]))
    return similar_words

In [None]:
def test_find_most_similar_words():
    input_words = ["see", "good", "bad", "watch", "check"]
    most_similar_words = find_most_similar_words(df_reviews["processed_review"], input_words, 7)
    for i in range(len(input_words)):
        print(f"Words most similar to '{input_words[i]}':")
        print(most_similar_words[i])
        print()

test_find_most_similar_words()

In [None]:
from gensim.models import Word2Vec

def find_most_similar_words(reviews, input_words, n_similar_words):
    """
    Train a Word2Vec model on the review corpus and find the most similar words for each input word.

    args:
        reviews (pd.Series[str]) : a Series of processed movie reviews
        input_words (List[str]) : a list of input words for which similar words are to be found
        n_similar_words (int) : number of top similar words to return for each input word

    return:
        List[List[Tuple[str, float]]] : a nested list where each inner list contains tuples of
                                       (word, similarity score) for the n_similar_words most
                                       similar words to each input word
    """
    # Prepare the corpus by splitting each review into a list of words
    corpus = [review.split() for review in reviews]

    # Train the Word2Vec model
    model = Word2Vec(
        sentences=corpus,
        vector_size=100,  # Dimensionality of word vectors
        window=5,         # Context window size
        workers=4,        # Number of worker threads for training
        min_count=1       # Minimum frequency count of words
    )

    similar_words = []

    for inp_word in input_words:
        if inp_word in model.wv:
            # Retrieve and sort similar words by similarity score (ascending order)
            similar = sorted(
                model.wv.most_similar(inp_word, topn=n_similar_words),
                key=lambda x: x[1]
            )
        else:
            similar = []  # If the word is not in the vocabulary
        similar_words.append(similar)

    return similar_words
