# Assignment 7: Dimensionality Reduction

## Follow These Steps Before Submitting
Once you are finished, complete the following steps.

1.  Restart your kernel by clicking **'Runtime' > 'Restart session and run all'**.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.

# Dataset

In this assignment, you will work with a text dataset. The Yelp reviews dataset consists of reviews from Yelp, extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset_challenge. The Yelp reviews polarity dataset is a subset of the Yelp reviews dataset, constructed by considering stars 1 and 2 negative, and 3 and 4 positive.

In [None]:
# imports
import os
import numpy as np
import pandas as pd
import polars as pl
from scipy.sparse import csr_matrix
import sklearn.feature_extraction.text as sktext
from sklearn.decomposition import PCA, SparsePCA, TruncatedSVD
import re
from sklearn.manifold import TSNE

import umap

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
!gdown https://drive.google.com/uc?id=1A0-Q7SbdoA3r7aawraRSwMgKujBYlFLv

# Part 1: Data Preprocessing

## Question 1.1: Load data

Read the **`yelp.csv`** file as a **`polars.DataFrame`** and show the first 5 rows of the dataframe and its descriptive statistics.

In [36]:
# Load the dataset
df = pl.read_csv("yelp.csv")

# Display the first 5 rows
print(df.head(5))

# Show descriptive statistics
print(df.describe())

shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ Sentiment ┆ Review                          │
│ ---       ┆ ---                             │
│ i64       ┆ str                             │
╞═══════════╪═════════════════════════════════╡
│ 0         ┆ Maintenance here is ridiculous… │
│ 1         ┆ I really enjoy smaller more in… │
│ 1         ┆ Looking at their menu, I was a… │
│ 1         ┆ Best sandwiches in Las Vegas! … │
│ 0         ┆ Was upset because they didnt h… │
└───────────┴─────────────────────────────────┘
shape: (9, 3)
┌────────────┬───────────┬─────────────────────────────────┐
│ statistic  ┆ Sentiment ┆ Review                          │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ f64       ┆ str                             │
╞════════════╪═══════════╪═════════════════════════════════╡
│ count      ┆ 1500.0    ┆ 1500                            │
│ null_count ┆ 0.0       ┆ 0                               │
│ mean       ┆ 0.

## Question 1.2: Convert categorical variable

Since we are not predicting the categorical variable in this assignment, let's convert **`Sentiment`** to string:
- Replace **1** with **`positive`**.
- Replace **0** with **`negative`**.

Display the first 5 rows of the resulting dataframe.


In [42]:
# Convert Sentiment column
df = df.with_columns(
  df["Sentiment"].cast(pl.Utf8) # Ensure it's a string type first
  .replace("1", "positive")
  .replace("0", "negative")
)

# Display the first 5 rows
print(df.head(5))

shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ Sentiment ┆ Review                          │
│ ---       ┆ ---                             │
│ str       ┆ str                             │
╞═══════════╪═════════════════════════════════╡
│ negative  ┆ Maintenance here is ridiculous… │
│ positive  ┆ I really enjoy smaller more in… │
│ positive  ┆ Looking at their menu, I was a… │
│ positive  ┆ Best sandwiches in Las Vegas! … │
│ negative  ┆ Was upset because they didnt h… │
└───────────┴─────────────────────────────────┘


## Question 1.3: Transform text

Apply **`Term Frequency - Inverse Document Frequency`** transformation using [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):
- Eliminate accents and other characters
- Eliminate stopwords
- Eliminate words that appear in fewer than 5% of texts and words that appear in more than 95% of texts
- Apply sublinear tf scaling

Extract and save the word list. Report the number of words that are kept.

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract text data
texts = df["Review"].to_list()

# Initialize the TF-IDF Vectorizer with specified constraints
vectorizer = TfidfVectorizer(
  strip_accents="unicode", # Normalize accents
  stop_words="english",    # Remove stopwords
  min_df=0.05,             # Ignore words in <5% of documents
  max_df=0.95,             # Ignore words in >95% of documents
  sublinear_tf=True        # Apply sublinear TF scaling
)

# Fit the vectorizer to the text data
tfidf_matrix = vectorizer.fit_transform(texts)

# Get the feature names (words)
word_list = vectorizer.get_feature_names_out()

# Report the number of words kept
num_words_kept = len(word_list)
print(f"Number of words kept: {num_words_kept}")

# Save the word list
with open("word_list.txt", "w") as f:
  for word in word_list:
    f.write(word + "\n")

Number of words kept: 157


## Question 1.4: Explore words

Based on TF-IDF scores, show the 10 most often repeated words and the 10 least often repeated words.

Hint: You might need to use [`np.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html). Pay attention to sorting order.
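
As a quick illustration of the sorting order mentioned in the hint, the sketch below uses a hypothetical score array (the values are made up). `np.argsort` returns indices in ascending order of value, so the largest entries sit at the end and need to be reversed.

```python
import numpy as np

# Hypothetical scores, one per word
scores = np.array([0.2, 0.9, 0.1, 0.5])

order = np.argsort(scores)      # indices sorted by ascending score
smallest_two = order[:2]        # indices of the two smallest scores
largest_two = order[::-1][:2]   # indices of the two largest scores, largest first

print(smallest_two)  # [2 0]
print(largest_two)   # [1 3]
```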

In [46]:
# Compute the mean TF-IDF score for each word across all reviews
word_tfidf_scores = np.array(tfidf_matrix.mean(axis=0)).flatten()

# Get the indices of the highest and lowest TF-IDF scores
top_10_indices = np.argsort(word_tfidf_scores)[-10:][::-1] # 10 largest values (descending)
bottom_10_indices = np.argsort(word_tfidf_scores)[:10] # 10 smallest values (ascending)

# Retrieve the corresponding words
top_10_words = [word_list[i] for i in top_10_indices]
bottom_10_words = [word_list[i] for i in bottom_10_indices]

# Display results
print("The top 10 most often repeated words:")
for word in top_10_words:
  print(word)

print("\nThe top 10 least often repeated words:")
for word in bottom_10_words:
  print(word)

The top 10 most often repeated words:
food
place
good
great
service
like
just
time
really
don

The top 10 least often repeated words:
having
tell
half
decided
town
30
finally
kind
review
reviews


# Part 2: Dimensionality Reduction

## Question 2.1: PCA

(1) Apply **normal PCA**. Set the number of components to 100. Report the percentage variance explained by the 100 PCs.

(2) Show the words that have positive weight in the **third PC** (index 2).
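
The mechanics of the two steps can be sketched on a small random matrix. Everything here is a hypothetical stand-in: the 50 x 8 matrix, the `word0`...`word7` vocabulary, and the choice of 3 components are assumptions made so the example runs quickly. The sparse matrix is converted to a dense array before fitting, because `PCA` centers the data and older scikit-learn versions reject sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import PCA

# Hypothetical stand-in for a TF-IDF matrix: 50 documents x 8 terms, mostly zeros
rng = np.random.default_rng(0)
dense = rng.random((50, 8)) * (rng.random((50, 8)) < 0.4)
toy_tfidf = csr_matrix(dense)
toy_words = np.array([f"word{i}" for i in range(8)])

# Fit PCA on the dense form of the matrix
pca = PCA(n_components=3)
pca.fit(toy_tfidf.toarray())

# Total percentage of variance captured by the retained components
explained_pct = 100 * pca.explained_variance_ratio_.sum()
print(f"Variance explained by 3 PCs: {explained_pct:.2f}%")

# Words with a positive weight (loading) in the third component (index 2)
third_pc = pca.components_[2]
positive_words = toy_words[third_pc > 0]
print(positive_words)
```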

In [None]:
# (1) YOUR CODE HERE


In [None]:
# (2) YOUR CODE HERE


## Question 2.2: LSA

(1) Apply **LSA** using [`TruncatedSVD`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html). Set:
- number of components to 100
- number of iterations to 10
- random state to 2025.

Report the percentage variance explained by the 100 PCs.

(2) Show the five words that are most strongly related to the **fifth PC** (index 4). What would you name this principal component?
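
A minimal sketch of the same workflow on toy data follows. The 50 x 8 matrix, the vocabulary, and the use of 5 components are hypothetical, and ranking words by the absolute value of their weight is one reasonable reading of "most strongly related", stated here as an assumption. Unlike `PCA`, `TruncatedSVD` accepts the sparse matrix directly and does not center it.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a TF-IDF matrix: 50 documents x 8 terms
rng = np.random.default_rng(0)
dense = rng.random((50, 8)) * (rng.random((50, 8)) < 0.4)
toy_tfidf = csr_matrix(dense)
toy_words = np.array([f"word{i}" for i in range(8)])

# LSA: truncated SVD applied directly to the sparse matrix
svd = TruncatedSVD(n_components=5, n_iter=10, random_state=2025)
svd.fit(toy_tfidf)

explained_pct = 100 * svd.explained_variance_ratio_.sum()
print(f"Variance explained by 5 components: {explained_pct:.2f}%")

# Assumption: "most related" means largest absolute weight in the component
fifth_component = svd.components_[4]
top5_idx = np.argsort(np.abs(fifth_component))[::-1][:5]
top5_words = toy_words[top5_idx]
print(top5_words)
```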

In [None]:
# (1) YOUR CODE HERE


In [None]:
# (2) YOUR CODE HERE


**Written answer:**

## Question 2.3: PCA vs LSA

Compare PCA and LSA. Comment on your findings.

**Written answer:**

## Question 2.4: t-SNE

Apply **t-SNE**. Set:
- number of components to 2
- random initialization
- perplexity of 2 and of 10 (one run for each)
- tightness of natural clusters to 30
- auto learning rate
- maximum number of iterations to 1000
- maximum number of iterations without progress before we abort to 100
- use cosine metric
- gradient threshold to 0.0000001
- random state to 2025

Create a plot showing the 2D projection of the data using t-SNE for each perplexity, in separate plots. Remember to add axis labels and a title.

Written answer: Compare the two projections. Which projection do you think separates the classes better? Why?

In [None]:
# YOUR CODE HERE - Perplexity 2


In [None]:
# YOUR CODE HERE - Perplexity 10


**Written Answer:**

## Question 2.5: UMAP

(1) Apply **UMAP**. Set:
- number of components to 2
- use 10 nearest neighbors
- use cosine metric
- number of training epochs to 1000
- effective minimum distance between embedded points to 0.1
- effective scale of embedded points to 1
- avoid excessive memory use
- do not use a random seed to allow parallel processing.

(2) Create a plot showing the 2D projection of the data using UMAP. Remember to add axis labels and a title.

In [None]:
# (1) YOUR CODE HERE


In [None]:
# (2) YOUR CODE HERE


## Question 2.6: t-SNE vs UMAP

Compare t-SNE (perplexity 10) and UMAP. Comment on your findings.

**Written answer:**