# Sentiment Analysis using Bag of Words and Logistic Regression

In this notebook, we consider *sentiment classification*, a standard task in natural language processing. Based on a review of a movie (or a restaurant, hotel, etc.), we want to predict whether the person liked the movie or not. As an example, we use a data set provided by the Internet Movie Database (IMDb) website www.imdb.com. Each review is labeled as either positive (label 1) or negative (label 0).

## Set-up
First of all, we need to load the libraries that we will need for this task.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# some more general libraries for evaluation purposes:
import matplotlib.pyplot as plt
import datetime

In [None]:
from tensorflow.keras.layers import Input, TextVectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
# Configurations
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

In [None]:
# initialize random number generators to ensure reproducibility:
np.random.seed(123)

## Text Representation: Bag of Words - Toy Example
To illustrate the idea of the bag of words, we will start with a vocabulary size of just 10 and a single text (the ETH self-portrait):

In [None]:
ETHtext = "Freedom and individual responsibility, entrepreneurial spirit and open-mindedness: ETH Zurich stands on a bedrock " +\
    "of true Swiss values. Our university for science and technology dates back to the year 1855, when the founders of modern-day " +\
    "Switzerland created it as a centre of innovation and knowledge. At ETH Zurich, students discover an ideal environment for " +\
    "independent thinking, researchers a climate which inspires top performance. Situated in the heart of Europe, yet forging " +\
    "connections all over the world, ETH Zurich is pioneering effective solutions to the global challenges of today and tomorrow"

Next, we define a count vectorizer, using a vocabulary size of 10 words:

In [None]:
VOCAB_SIZE = 10
toy_countvectorizer = CountVectorizer(max_features=VOCAB_SIZE)

Now we use the `.fit()` method to adapt the count vectorizer to the given dataset, i.e., the self-portrait of ETH.
The `.fit()` method iterates over all the texts it receives as its argument; we therefore need to pass the texts as an iterable of strings, e.g., a list:

In [None]:
toy_countvectorizer.fit([ETHtext])

Now we can get the **feature names**, i.e., the words that will be used to represent the text. These are the most common words in the text corpus:

In [None]:
toy_countvectorizer.get_feature_names_out()

Now we can transform the text to get its bag-of-words representation. As both the text corpus and the vocabulary size can be very large, the result is stored as a sparse matrix; we use `.toarray()` to convert it to a dense, human-readable array:

In [None]:
toy_countvectorizer.transform([ETHtext]).toarray()
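To see the difference between the sparse and the dense form, the following sketch uses a made-up three-document corpus (`demo_docs` is an assumption for illustration). It checks that the result is a SciPy sparse matrix and shows how few entries are actually stored:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

demo_docs = ["good movie", "bad movie", "good good plot"]  # made-up toy corpus
demo_vectorizer = CountVectorizer()
demo_counts = demo_vectorizer.fit_transform(demo_docs)

print(issparse(demo_counts))   # True if the result is a SciPy sparse matrix
print(demo_counts.shape)       # (number of documents, vocabulary size)
print(demo_counts.nnz)         # number of stored non-zero entries
print(demo_counts.toarray())   # dense view; only sensible for small data
```

For a realistic corpus, most entries are zero, so the dense array would need far more memory than the sparse matrix.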

The helper function below prints the word counts in a format that's easier to understand:

In [None]:
def count_words_nice_output(countvectorizer, text):
    for (word, count) in zip(countvectorizer.get_feature_names_out(), countvectorizer.transform([text]).toarray()[0]):
        if count>0:
            print('word "' + word + '" occurs ' + str(count) + ' times.')

In [None]:
count_words_nice_output(toy_countvectorizer, ETHtext)
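To make explicit what the count vectorizer computes, the following sketch rebuilds a bag of words by hand using only the standard library. The function `bag_of_words` and the example sentence are hypothetical; the sketch mimics scikit-learn's default behaviour (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary), but it may break ties between equally frequent words differently:

```python
import re
from collections import Counter

def bag_of_words(text, vocab_size):
    """Return (vocabulary, count vector) for a single text."""
    # lowercase and keep tokens of at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    # keep only the vocab_size most frequent words, sorted alphabetically
    vocab = sorted(word for word, _ in counts.most_common(vocab_size))
    return vocab, [counts[word] for word in vocab]

vocab, vector = bag_of_words("The cat sat on the mat, the cat slept", 2)
print(vocab)
print(vector)
```

The word order of the text is lost; only the counts of the most frequent words remain, which is where the name "bag of words" comes from.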

# Sentiment Classification on IMDb Data
Now we are ready to classify the sentiments in the movie reviews.

## Loading the IMDb Data
The IMDb data set is available in the subfolder `data_imdb` in three files containing the datasets for training, validation and testing. Here, we load the training and test sets:

In [None]:
train_ds = pd.read_csv('./data_imdb/Train.csv')
test_ds = pd.read_csv('./data_imdb/Test.csv')

Let's have a first look at the data:

In [None]:
train_ds.head(5)

Next, we print the first five reviews from the training data set together with their labels:

In [None]:
for idx in range(5):
    print('Input: ', train_ds.iloc[idx, 0])
    print(10*'.')
    print('Target labels: ', train_ds.iloc[idx, 1])
    print(50*'-' + '\n')

Now we fit the count vectorizer to this dataset. Note that we now use a vocabulary size of 1000 words. This will take a few moments!

In [None]:
VOCAB_SIZE = 1000
countvectorizer = CountVectorizer(max_features=VOCAB_SIZE)
countvectorizer.fit(train_ds['text'])

Again, we look at an example of an encoding, for the first text in the training data:

In [None]:
train_ds['text'][0]

In [None]:
count_words_nice_output(countvectorizer, train_ds['text'][0])

We now transform the entire training dataset using the `countvectorizer` to get a bag-of-words representation of the training data:

In [None]:
boW_train = countvectorizer.transform(train_ds['text'])

## Logistic Regression Model
As the output is binary, logistic regression seems a natural choice. We will now use the 1000-dimensional vector representation of the text as input (independent variables, predictors), and the rating (0 or 1) as output (dependent variable, target variable). 

In [None]:
# binary target: the default settings suffice (the `multi_class` argument is deprecated in recent scikit-learn versions)
sentimentPredictor_BoW1000 = LogisticRegression(max_iter=10000)

Now, let's train the model. The solver iterates until the coefficients converge or `max_iter` iterations are reached, so training may take a moment.

In [None]:
sentimentPredictor_BoW1000.fit(boW_train, train_ds['label'])
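To make concrete what a fitted binary logistic regression computes, the following sketch trains a toy model with scikit-learn's default settings on made-up data (`X_demo`, `y_demo` and `demo_model` are assumptions for illustration). It recomputes the predicted probability of class 1 by hand as the logistic (sigmoid) function of a weighted sum of the inputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up toy data: two count features per "document" and a binary label
X_demo = np.array([[3, 0], [2, 1], [0, 2], [1, 3], [4, 1], [0, 4]])
y_demo = np.array([1, 1, 0, 0, 1, 0])

demo_model = LogisticRegression().fit(X_demo, y_demo)

# linear score: weighted sum of the inputs plus intercept
scores = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
# the logistic function maps the score to a probability between 0 and 1
probs_by_hand = 1.0 / (1.0 + np.exp(-scores))

probs_sklearn = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(probs_by_hand, probs_sklearn))
```

A positive weight therefore increases the predicted probability of a positive rating each time the corresponding word occurs, while a negative weight decreases it.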

## Evaluation
Now let's evaluate the model. We first check the performance on the training data.

In [None]:
train_label_pred = sentimentPredictor_BoW1000.predict(boW_train)

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions
accuracy_train = accuracy_score(train_ds['label'], train_label_pred)
print(f"accuracy on training set = {accuracy_train}")
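Accuracy alone does not show which kind of errors the model makes. As a small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are assumptions for illustration), the imported `confusion_matrix` function breaks the predictions down by true and predicted class:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = [1, 0, 1, 1]  # made-up true labels
y_pred_demo = [1, 0, 0, 1]  # made-up predictions

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
print(acc_demo)  # fraction of correct predictions

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)   # rows: true labels 0/1, columns: predicted labels 0/1
```

The off-diagonal entries count the errors: here, one positive review was wrongly predicted as negative.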

**EXERCISE**: Evaluate the model on the test data `test_ds`. Remember to first transform that dataset with the already fitted `countvectorizer` (do not fit it again).

## Interpretation
We want to try to interpret what the model has learned. To do so, we look at the weights that have been inferred.

`sentimentPredictor_BoW1000.coef_` contains the 1000 weights, one for each word in the vocabulary. We sort the weights and look up the indices with the largest values; these belong to the words that push the prediction most strongly towards a positive rating:

In [None]:
sentimentPredictor_BoW1000_weights = sentimentPredictor_BoW1000.coef_.squeeze()
sentimentPredictor_BoW1000_weights

In [None]:
sentimentPredictor_BoW1000_sortOrder = np.argsort(sentimentPredictor_BoW1000_weights, axis=0)
BoW1000_vocab = countvectorizer.get_feature_names_out()
BoW1000_vocab[sentimentPredictor_BoW1000_sortOrder[-5:]]

We can also look at the weights of these words:

In [None]:
sentimentPredictor_BoW1000_weights[sentimentPredictor_BoW1000_sortOrder[-5:]]

That seems plausible! 
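The size of a weight can also be read as an effect on the odds: in a standard binary logistic regression, each additional occurrence of a word multiplies the odds of a positive rating by the exponential of its weight. The following sketch illustrates this with a made-up vocabulary and made-up weights (`demo_vocab` and `demo_weights` are not taken from the fitted model):

```python
import numpy as np

demo_vocab = np.array(["good", "bad", "the", "great"])  # made-up vocabulary
demo_weights = np.array([0.5, -1.2, 0.0, 2.0])          # made-up weights

order = np.argsort(demo_weights)   # indices sorted by ascending weight
top_idx = order[::-1][:2]          # the two largest weights, largest first

top_positive = demo_vocab[top_idx]
odds_factors = np.exp(demo_weights[top_idx])  # multiplicative effect on the odds

for word, factor in zip(top_positive, odds_factors):
    print(f'each "{word}" multiplies the odds of a positive rating by {factor:.2f}')
```

A weight of 0 corresponds to a factor of 1, i.e., no effect on the odds.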

**EXERCISE:** What are the most negative words?

Additionally, we might also look at the most *neutral* words, i.e., those that have (taken alone) the least impact on the sentiment. These are the words whose weights are closest to 0, so we sort by the absolute value of the weights:

In [None]:
sentimentPredictor_BoW1000_sortOrder_abs = np.argsort(abs(sentimentPredictor_BoW1000_weights), axis=0)
BoW1000_vocab[sentimentPredictor_BoW1000_sortOrder_abs[:5]]

In [None]:
sentimentPredictor_BoW1000_weights[sentimentPredictor_BoW1000_sortOrder_abs[:5]]

The word *because* has the least influence on the rating, which seems plausible as well.