# COGS 108 - Assignment 4: Natural Language Processing

This assignment covers working with text data and NLP.

This assignment is out of 8 points, worth 8% of your grade.

**PLEASE DO NOT CHANGE THE NAME OF THIS FILE.**

**PLEASE DO NOT COPY & PASTE OR DELETE CELLS INCLUDED IN THE ASSIGNMENT.**

# Important

- This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.
    - This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!
    - In particular many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values. 
        - It is up to you to check the values, and make sure they seem reasonable.
- As a first-line check when things seem to go wrong, restart the kernel and re-run the code.
    - For example, note that some cells can only be run once, because they re-write a variable (for example, your dataframe), and change it in a way that means a second execution will fail. 
    - Also, running some cells out of order might change the dataframe in ways that may cause an error, which can be fixed by re-running.

# Background & Work Flow

- In this homework assignment, we will be analyzing text data. A common approach to analyzing text data is to use methods that allow us to convert text data into some kind of numerical representation - since we can then use all of our mathematical tools on such data. In this assignment, we will explore 2 feature engineering methods that convert raw text data into numerical vectors:
    - **Bag of Words (BoW)**
        - BoW encodes an input sentence as the frequency of each word in the sentence. 
        - In this approach, all words contribute equally to the feature vectors.
    - **Term Frequency - Inverse Document Frequency (TF-IDF)**
        - TF-IDF is a measure of how important each term is to a specific document, as compared to an overall corpus. 
        - TF-IDF encodes each word as its frequency in the document of interest, divided by a measure of how common the word is across all documents (the corpus).
        - Using this approach, each word contributes differently to the feature vectors.
        - The assumption behind using TF-IDF is that words that appear commonly everywhere are not that informative about what is specifically interesting about a document of interest, so it is tuned to representing a document in terms of the words it uses that are different from other documents. 

- To compare those 2 methods, we will first apply them on the same Movie Review dataset to analyze sentiment (how positive or negative a text is). In order to make the comparison fair, an **SVM (support vector machine)** classifier will be used to classify positive reviews and negative reviews.

- SVM is a simple yet powerful and interpretable linear model. To use it as a classifier, we need to have at least 2 splits of the data: training data and test data. The training data is used to tune the weight parameters in the SVM to learn an optimal way to classify the training data. We can then test this trained SVM classifier on the test data, to see how well it works on data that the classifier has not seen before. 
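
To make the difference between the two encodings concrete, here is a minimal sketch that computes both by hand on a tiny made-up corpus (the three sentences are hypothetical, not taken from the assignment data). It uses the textbook IDF formula `log(N / df)`; scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not from the assignment data)
docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Bag of Words: raw count of each vocabulary word in each document
bow = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# Document frequency: in how many documents does each word appear?
n_docs = len(tokenized)
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Textbook IDF: log(N / df). Rarer words get a larger weight.
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}

# TF-IDF: each count is weighted by how rare the word is across the corpus
tfidf = [[count * idf[word] for count, word in zip(row, vocab)] for row in bow]

print(vocab)
for row_bow, row_tfidf in zip(bow, tfidf):
    print(row_bow, [round(value, 2) for value in row_tfidf])
```

In the BoW rows every count is taken at face value, while in the TF-IDF rows a word such as "great", which appears in two of the three documents, is weighted less per occurrence than a word such as "cast", which appears in only one.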

In [None]:
# Imports - these are all the imports needed for the assignment
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import nltk package 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

For this assignment we will be using `nltk`: the Natural Language Toolkit.

To do so, we will need to download some text data.

Natural language processing (NLP) often requires corpus data (lists of words and/or example text). We will download that data now, if you don't already have it.

In [None]:
# Download the NLTK English tokenizer and the stopwords of all languages
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Part 1: Sentiment Analysis on Movie Review Data (4.75 points)

In part 1 we will apply sentiment analysis to Movie Review (MR) data.

- The MR data contains more than 10,000 reviews collected from the IMDB website, and each of the reviews is annotated as either positive or negative. The numbers of positive and negative reviews are roughly the same. For more information about the dataset, you can visit http://www.cs.cornell.edu/people/pabo/movie-review-data/

- For this homework assignment, we've already shuffled the data, and truncated the data to contain only 5000 reviews.

In this part of the assignment we will:
- Transform the raw text data into vectors with the BoW encoding method
- Split the data into training and test sets
- Write a function to train an SVM classifier on the training set
- Test this classifier on the test set and report the results

### 1a) Import data

Import the file with the url `https://raw.githubusercontent.com/COGS108/A4_Data/refs/heads/main/data/rt-polarity.tsv` into a DataFrame called `movie_df`. This file is large and may take some time to import.

Note that this file is a tab-separated raw text file, in which data is separated by tabs. You can load this file with `pd.read_csv`, but we have to tell the function to use tabs instead of commas as the column separator. Look at the docs for `pd.read_csv` to figure out how to use tab as the column separator.

The file doesn't have a set of column names in the first row; it just starts with data. Therefore you will have to set the argument `header` to `None` to prevent pandas from treating the first row of data as column names. After loading the data you will have to set the column names properly. Call them 'index', 'label', and 'review', in that order.
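
As an illustration of what these two arguments do, here is a small sketch on an in-memory string rather than the real file. The two rows of text and the name `toy_df` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a headerless, tab-separated file
raw_text = "0\tpos\ta charming film\n1\tneg\tdull and overlong\n"

# sep='\t' splits on tabs; header=None keeps the first row as data
toy_df = pd.read_csv(io.StringIO(raw_text), sep='\t', header=None)

# Without a header row, pandas names the columns 0, 1, 2, so we rename them
toy_df.columns = ['index', 'label', 'review']

print(toy_df)
```

If `header=None` were left out, the first review would be consumed as column names and the DataFrame would be one row short.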

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(movie_df, pd.DataFrame)

In [None]:
# Check the data
movie_df.head()

### 1b) Create a function that converts string labels to numerical labels

Function name: `label_converter`

The function should do the following:
- take two parameters `label` and `direction`
- if `direction` is 'tonumber', 
    - and if the input `label` is "pos" return `1.0`
    - and if the input `label` is "neg" return `0.0`
    - otherwise, return the input `label` as is
- if `direction` is 'tolabel'
    - and if the input `label` is `1.0` return "pos"
    - and if the input `label` is `0.0` return "neg"
    - otherwise, return the `label` as is
        

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert label_converter

In [None]:
assert callable(label_converter)

### 1c) Numerical Labels

Convert all labels in `movie_df["label"]` to **numerical labels**, using the `label_converter` function. Be sure to specify the appropriate argument to the `direction` parameter.

Save them as a new column named "Y" in `movie_df`. 
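
One way to apply a two-parameter function to every element of a column is `Series.apply`, which can forward extra arguments. The sketch below uses a made-up helper, `scale_or_keep`, and a toy DataFrame, so it only shows the calling pattern and not the assignment's answer.

```python
import pandas as pd

def scale_or_keep(value, mode):
    """Hypothetical helper: double the value when mode is 'double', else return it unchanged."""
    if mode == 'double':
        return value * 2
    return value

toy = pd.DataFrame({'score': [1, 2, 3]})

# args=(...) passes extra positional arguments after each element
toy['doubled'] = toy['score'].apply(scale_or_keep, args=('double',))

# A lambda is an equivalent way to fix the second argument
toy['same'] = toy['score'].apply(lambda v: scale_or_keep(v, 'keep'))

print(toy)
```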

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert sorted(set(movie_df['Y'])) == [0., 1.]

In [None]:
# Check the movie_df data
movie_df.dtypes

### 1d) Defining the train & test sets

Now we'll use `sklearn`'s `train_test_split()` function to define our train and test sets. Store the input data (predictors) as training data `movie_train_X` and testing data `movie_test_X`. Similarly, store the labels (outcomes) as training data `movie_train_Y` and testing data `movie_test_Y`.

In addition to providing the predictors (`movie_df['review']`) and outcomes (`movie_df['Y']`) to the function, we will use the following arguments for this task:
- `test_size`: 0.2
- `random_state`: 108


To prepare the input data for later vectorization, convert `movie_train_X` and `movie_test_X` into lists. 
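
The sketch below shows the general shape of a `train_test_split` call on a made-up 10-row DataFrame. The names `toy`, `train_X`, and so on are illustrative, and the `random_state` here is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 short texts with alternating labels
toy = pd.DataFrame({
    'text': [f"review {i}" for i in range(10)],
    'y': [0.0, 1.0] * 5,
})

# Returns four objects, in this order: train X, test X, train y, test y
train_X, test_X, train_y, test_y = train_test_split(
    toy['text'], toy['y'], test_size=0.2, random_state=0
)

# The splits come back as pandas Series; convert the inputs to plain lists
train_X = train_X.tolist()
test_X = test_X.tolist()

print(len(train_X), len(test_X))  # 8 2
```

Passing the same `random_state` again reproduces the same split, which is why the assignment fixes it.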

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(movie_train_X) == movie_train_Y.shape[0]
assert len(movie_test_X) == movie_test_Y.shape[0]

assert type(movie_train_X) == list
assert type(movie_test_X) == list

assert len(movie_train_X) == 4000
assert len(movie_test_Y) == 1000

### 1e) Convert text data into vector 

We will now create a `CountVectorizer` object to transform the text data into vectors with numerical values. 
A `CountVectorizer` converts text into numbers by:

1. Reading Text: It takes sentences or a bunch of words as input.
2. Finding Unique Words: It looks at all the words and makes a list of unique words.
3. Counting Words: For each sentence, it counts how many times each unique word appears.

The output of a `CountVectorizer` is a matrix with each row representing a sentence and each column representing a unique word. The value represents how many times each unique word appears in each sentence.

We will initialize a `CountVectorizer` object, and name it as `vectorizer`.

We need to pass 4 arguments to initialize a CountVectorizer:
  1. `analyzer`: `'word'` 
          Specify to analyze data from word-level.
  2. `max_features`: `2100`
          Set a max number of unique words.
  3. `tokenizer`: `word_tokenize`
          Set to tokenize the text data using `word_tokenize` from NLTK.
  4. `stop_words`: `stopwords.words('english')`
          Set to remove all stopwords in English. We do this since they generally don't provide useful discriminative information.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert vectorizer.analyzer == 'word'
assert vectorizer.max_features == 2100
assert vectorizer.tokenizer == word_tokenize
assert vectorizer.stop_words == stopwords.words('english')
assert hasattr(vectorizer, "fit_transform")

### 1f) Vectorize training reviews

After we create a `CountVectorizer` object, we need to fit the `CountVectorizer` to build a consistent vocabulary by:
1. Looking at all the text you provide and identifying every unique word. This vocabulary is a list of all the unique words that the CountVectorizer will use as features.
2. Assigning each unique word an index (position) in the vocabulary, which will later be used to construct the word count matrix.

Fit the `vectorizer` we created above using `movie_train_X` and transform the training data inputs `movie_train_X` into vectors. Save the transformed training input as `movie_train_counts` and make sure to convert the result into a numpy array.

HINT: use the sklearn docs to see how to fit / transform a `CountVectorizer` object. Make sure you understand the difference between fit and transform. The docs may also help you convert the result to a numpy array.

NOTE: this operation may throw a warning about stopwords. This is ok.
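
To illustrate the difference between fitting and transforming, here is a sketch on made-up texts with a default `CountVectorizer`. The variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["good movie", "bad movie"]
new_texts = ["good plot"]   # 'plot' was never seen during fitting

toy_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary AND returns the (sparse) count matrix
train_counts = toy_vectorizer.fit_transform(train_texts).toarray()

# transform reuses the learned vocabulary; words outside it are ignored
new_counts = toy_vectorizer.transform(new_texts).toarray()

print(sorted(toy_vectorizer.vocabulary_))  # ['bad', 'good', 'movie']
print(new_counts)                          # [[0 1 0]]
```

Because `transform` does not change the vocabulary, the columns of `new_counts` line up with the columns of `train_counts`, which is what a classifier trained on the first matrix needs.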

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

vocabulary_train = vectorizer.vocabulary_

In [None]:
assert type(movie_train_counts) == np.ndarray

### 1g) Vectorize testing reviews

Now let's turn the testing data inputs `movie_test_X` into vectors using the `vectorizer` we created above. 

Think about the machine learning examples we've covered in class. 

Ask yourself:
- what dataset should the transformer be fit on, training or test?
- after fitting the transformer, how can you transform a different dataset without refitting it?  

Store the transformed testing data inputs as `movie_test_counts` and convert it into a numpy array.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
vocabulary_test = vectorizer.vocabulary_

In [None]:
assert type(movie_test_counts) == np.ndarray

In [None]:
def train_SVM(X, y, kernel='linear'):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert callable(train_SVM)

### 1j) Train SVM

Train an SVM classifier with the default linear kernel on the samples `movie_train_counts` and the labels `movie_train_Y`

You need to call the function `train_SVM` you just created. Name the returned object as `movie_classifier`.

Note: training your model may take many seconds / up to a few minutes to run.
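
For reference, this is roughly what fitting and using a linear-kernel `SVC` looks like on a tiny, made-up, linearly separable dataset. The names `X_toy`, `y_toy`, and `toy_clf` are hypothetical, and this toy problem trains instantly, unlike the real one.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: 4 samples with 2 features each
X_toy = np.array([[0, 0], [0, 1], [3, 3], [4, 3]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

# Create the classifier, then fit it to the samples and labels
toy_clf = SVC(kernel='linear')
toy_clf.fit(X_toy, y_toy)

# Predict labels for two new points, one near each cluster
toy_pred = toy_clf.predict(np.array([[0, 0.5], [4, 4]]))
print(toy_pred)
```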

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
movie_classifier

In [None]:
assert isinstance(movie_classifier, SVC)
assert hasattr(movie_classifier, "predict")

### 1k) Predict outcome

Predict labels for both training samples and test samples. You will need to use `movie_classifier.predict(...)`

Name the predicted labels for the **training samples** as `movie_predicted_train_Y`.
Name the predicted labels for the **testing samples** as `movie_predicted_test_Y`.

Note: Your code here will also take a minute to run.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Now we will use the function `classification_report` to print out the performance of the classifier on the training set:

In [None]:
# Your classifier should be able to reach above 90% accuracy 
# on the training set
print(classification_report(movie_train_Y,movie_predicted_train_Y))

And finally, we check the performance of the trained classifier on the test set:

In [None]:
# Your classifier should be able to reach around 69% accuracy on the test set.
print(classification_report(movie_test_Y, movie_predicted_test_Y))

In [None]:
assert movie_predicted_train_Y.shape == (4000,)
assert movie_predicted_test_Y.shape == (1000,)

precision, recall, _, _ = precision_recall_fscore_support(movie_train_Y,movie_predicted_train_Y)
assert np.isclose(precision[0], 0.91, 0.02)
assert np.isclose(precision[1], 0.92, 0.02)
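
If the asserts above are unclear, the following sketch shows what `precision_recall_fscore_support` returns on a made-up set of six labels: one array per metric, with one entry per class. The third positional argument of `np.isclose` is the relative tolerance (`rtol`).

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

precision, recall, fscore, support = precision_recall_fscore_support(y_true, y_pred)

# Class 0: both predicted 0s are correct, but one true 0 was missed
print(precision[0], recall[0])
# Class 1: three of the four predicted 1s are correct, and no true 1 was missed
print(precision[1], recall[1])

# Passes when precision[1] is within 2% (relative) of 0.75
assert np.isclose(precision[1], 0.75, 0.02)
```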

# Part 2: TF-IDF (1.5 points)

In this part, we will explore TF-IDF on sentiment analysis.

TF-IDF is used as an alternate way to encode text data, as compared to the Counts BoW approach used in Part 1. 

At this point you should probably ask yourself if you understand the difference between Counts and TF-IDF. If you aren't clear, this is a good time to talk to whoever is leading your discussion section :)

To get this done we will:
- Transform the raw text data into vectors using TF-IDF
- Train an SVM classifier on the training set and report the performance of this classifier on the test set

### 2a) Text Data to Vectors

We will create a `TfidfVectorizer` object to transform the text data into vectors with TF-IDF.

To do so, we will initialize a `TfidfVectorizer` object, and name it as `tf_idf`.

We need to pass 4 arguments to `TfidfVectorizer` to initialize `tf_idf`:
  1. `sublinear_tf`: `True`
           Set to apply sublinear TF scaling, i.e. replace each term frequency `tf` with `1 + log(tf)`.
  2. `analyzer`: `'word'`
           Set to analyze the data at the word-level
  3. `max_features`: `2100`
           Set the max number of unique words
  4. `tokenizer`: `word_tokenize`
           Set to tokenize the text data using `word_tokenize` from NLTK

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert tf_idf.analyzer == 'word'
assert tf_idf.max_features == 2100
assert tf_idf.tokenizer == word_tokenize
assert tf_idf.stop_words == None

### 2b) 
Again, using `train_test_split`, split the `movie_df['review']` and `movie_df['Y']` into a training set and a test set. 

Name these variables as:
- `movie_train_X_tf_idf` and `movie_test_X_tf_idf` for the input
- `movie_train_Y_tf_idf` and `movie_test_Y_tf_idf` for the labels

We will use the same 80/20 split as in part 1 and same arguments for the parameters `test_size` (0.2) and `random_state` (108). Again, convert `movie_train_X_tf_idf` and `movie_test_X_tf_idf` into lists.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(movie_train_X_tf_idf) == 4000
assert len(movie_test_X_tf_idf) == 1000
assert movie_train_Y_tf_idf.shape == (4000,)
assert movie_test_Y_tf_idf.shape == (1000,)

### 2c) Transform Reviews 

Fit the `tf_idf` vectorizer we created above using the appropriate input data, then transform `movie_train_X_tf_idf` and `movie_test_X_tf_idf` into vectors. Save the transformed training and test input data into `movie_train_vector_tf_idf` and `movie_test_vector_tf_idf`, respectively. Finally, convert `movie_train_vector_tf_idf` and `movie_test_vector_tf_idf` into numpy arrays.

In [None]:

# YOUR CODE HERE
raise NotImplementedError()
vocabulary_tf_idf_train = tf_idf.vocabulary_

In [None]:
type(tf_idf)

In [None]:
assert isinstance(movie_train_vector_tf_idf, np.ndarray)
assert isinstance(movie_test_vector_tf_idf, np.ndarray)

assert "skills" in set(tf_idf.stop_words_)
assert "risky" in set(tf_idf.stop_words_)
assert "adopts" in set(tf_idf.stop_words_)

### 2d) Training

Train an SVM classifier on the training samples and labels.

You need to call the function `train_SVM` you created in part 1. Name the returned object as `movie_tf_idf_classifier`.

Note: training your model may take many seconds, up to a few minutes, to run.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(movie_tf_idf_classifier, SVC)
assert hasattr(movie_tf_idf_classifier, "predict")

### 2e) Prediction

Predict the labels for both the training and test samples (the 'X' data).

Name the predicted labels on **training samples** as `movie_train_Y_tf_idf_pred`. Name the predicted labels on **testing samples** as `movie_test_Y_tf_idf_pred`.

Note: this may take a few seconds to run.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Again, we use `classification_report` to check the performance on the training set.

In [None]:
# Your classifier should be able to reach above 86% accuracy.
print(classification_report(movie_train_Y_tf_idf, movie_train_Y_tf_idf_pred))

Again, check performance on the test set:

In [None]:
# Your classifier should be able to reach around 73% accuracy.
print(classification_report(movie_test_Y_tf_idf, movie_test_Y_tf_idf_pred))

In [None]:
precision, recall, _, _ = precision_recall_fscore_support(movie_train_Y_tf_idf, movie_train_Y_tf_idf_pred)
assert np.isclose(precision[0], 0.86, 0.02)
assert np.isclose(precision[1], 0.89, 0.02)

### Written Answer Question

How does the performance of the TF-IDF classifier compare to the classifier used in part 1?  And why do you think you see what you see?

YOUR ANSWER HERE

# Part 3: Sentiment Analysis on Customer Review with TF-IDF (2 points)

In this part, we will use TF-IDF to analyze the sentiment of some Customer Review (CR) data.

The CR data contains 3771 reviews, all collected from the Amazon website. The reviews are annotated by humans as either positive or negative. In this dataset, the 2 classes are not balanced, as there are twice as many positive reviews as negative reviews.

For more information on this dataset, you can visit https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In this part, we have already split the data into a training set and a test set, in which the training set has labels for the reviews, but the test set doesn't. 

The goal is to train an SVM classifier on the training set, and then predict pos/neg for each review in the test set.

To do so, we will:
- Use the TF-IDF feature engineering method to encode the raw text data into vectors
- Train an SVM classifier on the training set
- Predict labels for the reviews in the test set

The performance of your trained classifier on the test set will be checked by a hidden test.

### 3a) Loading the data

The customer review task has 2 files:
- `https://raw.githubusercontent.com/COGS108/A4_Data/refs/heads/main/data/custrev_train.tsv` contains training data with labels
- `https://raw.githubusercontent.com/COGS108/A4_Data/refs/heads/main/data/custrev_test.tsv` contains test data without labels which need to be predicted 

Import the training data into a DataFrame called `CR_train_df`. Set the column names as `index`, `label`, `review`.

Import the test data into a DataFrame called `CR_test_df`. Set the column names as `index`, `review`.

Note that both files will need to be imported with the `sep` and `header` arguments (as in 1a).

In [None]:
CR_train_file = 'https://raw.githubusercontent.com/COGS108/A4_Data/refs/heads/main/data/custrev_train.tsv'
CR_test_file = 'https://raw.githubusercontent.com/COGS108/A4_Data/refs/heads/main/data/custrev_test.tsv'

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(CR_train_df, pd.DataFrame)
assert isinstance(CR_test_df, pd.DataFrame)

### 3b) Concatenation
Concatenate the 2 DataFrames from the last step into a single DataFrame, and name it `CR_df`. 
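
As a sketch of what concatenation does when one DataFrame lacks a column, here is `pd.concat` on two made-up frames. The names `toy_train`, `toy_test`, and `toy_all` are hypothetical, and `ignore_index=True` is an optional choice made here to renumber the rows.

```python
import pandas as pd

# Hypothetical frames: the second one has no 'label' column
toy_train = pd.DataFrame({'index': [0, 1],
                          'label': ['pos', 'neg'],
                          'review': ['works well', 'broke quickly']})
toy_test = pd.DataFrame({'index': [0], 'review': ['decent value']})

# Stack the rows; ignore_index=True renumbers them 0..n-1
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows that came from toy_test have no label, so pandas fills in NaN
has_label = ~toy_all['label'].isnull()

print(toy_all)
print(has_label.tolist())  # [True, True, False]
```

The missing labels are what later allow the labeled (training) rows and unlabeled (test) rows to be separated again with `isnull()`.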

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(CR_df) == 3771

### 3c) Cleaning

Convert all labels in `CR_df["label"]` to numerical labels using the `label_converter` function we defined above. Save these numerical labels as a new column named `Y` in `CR_df`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(CR_df['Y'], pd.Series)

### 3d)  Use `tf_idf`

Transform reviews `CR_df["review"]` into vectors using the `tf_idf` vectorizer we created in part 2 and convert the result into a numpy array. Save the transformed data into a variable called `CR_X_tf_idf`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(CR_X_tf_idf, np.ndarray)

Here we will collect all training samples & numerical labels from `CR_X_tf_idf`. The code provided below will extract all samples with labels from the dataframe:


In [None]:
# code provided to collect labels
CR_train_X = CR_X_tf_idf[~CR_df['Y'].isnull()]
CR_train_Y = CR_df['Y'][~CR_df['Y'].isnull()]

# Note: if these asserts fail, something went wrong
#  Go back and check your code (in part 3) above this cell
assert CR_train_X.shape == (3016, 2100)
assert CR_train_Y.shape == (3016, )

### 3e) SVM 

Train an SVM classifier on the samples `CR_train_X` and the labels `CR_train_Y`:
- You need to call the function `train_SVM` you created above.
- Name the returned object as `CR_clf`.

Note: training your model may take many seconds / up to a few minutes to run.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(CR_clf, SVC)

### 3f) Predict: training data

Predict labels on the training set `CR_train_X` using `CR_clf`, and name the returned variable as `CR_train_Y_pred`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Check the classifier accuracy on the train data
# Note that your classifier should be able to reach above 80% accuracy.
print(classification_report(CR_train_Y, CR_train_Y_pred))

In [None]:
precision, recall, _, _ = precision_recall_fscore_support(CR_train_Y, CR_train_Y_pred)
assert np.isclose(precision[0], 0.84, 0.02)
assert np.isclose(precision[1], 0.86, 0.02)

In [None]:
# Collect all test samples from CR_X_tf_idf
CR_test_X = CR_X_tf_idf[CR_df['Y'].isnull()]

### 3g)  Predict: test set
Predict the labels on the test set `CR_test_X`. Create a pandas DataFrame called `CR_test_label_pred` and store the numeric predictions in a column called `'label'`
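
A minimal sketch of wrapping an array of predictions in a single-column DataFrame follows. The array `toy_pred` is made up and stands in for the output of a classifier's `predict` call.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions, standing in for the output of predict()
toy_pred = np.array([1.0, 0.0, 1.0])

# A dict maps the column name to its values
toy_pred_df = pd.DataFrame({'label': toy_pred})

# value_counts gives a quick look at how many of each class were predicted
print(toy_pred_df['label'].value_counts())
```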

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(CR_test_X, np.ndarray)
assert isinstance(CR_test_label_pred, pd.DataFrame)
assert CR_test_label_pred.columns == 'label'

In [None]:
CR_test_label_pred['label']

### 3h) Convert labels

Using the `label_converter` function, convert the predicted numerical labels `CR_test_label_pred['label']` back to string labels ("pos" and "neg").

Create a column called `label` in `CR_test_df` to store the converted labels.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
CR_test_df['label']

In [None]:
assert isinstance(CR_test_df['label'], pd.Series)
assert set(CR_test_df['label']) == {'neg', 'pos'}

The hidden assignment tests for the cell above will check that your model predicts the right number of pos/neg reviews in the test data provided.

We now have a model that can predict positive or negative sentiment! 

Briefly, in your own words, give a quick example of when and why it might be useful to computationally analyze the sentiment of text data and, in contrast, when it might not work so well. [This whole answer can/should be a couple of sentences.]

After you answer this question, you are done! 

YOUR ANSWER HERE

# Complete! 

Good work! Have a look back over your answers, and also make sure to `Restart & Run All` from the kernel menu to double check that everything is working properly. While you can typically use the 'Validate' button above, which runs your notebook from top to bottom and checks to ensure all `assert` statements pass silently, ***this may fail on this assignment as the code takes too long to run. Use Restart & Run All instead***. When you are ready, submit on datahub!

Note that ***the final validation is for your reassurance and is not a required step***. You can submit without validating. You can also submit without passing all asserts (for partial credit on the assignment). We grade whatever is submitted on datahub. We will grade your most recent submission.