Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now, you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [1]:
NAME = "Aymane Hachcham"

---

# Latent Semantic Indexing

First we will implement latent semantic indexing (LSI)

In [3]:
### Load the input data - do not modify
import json, gzip, numpy as np
raw = json.load(gzip.open("/data/simpsonswiki.json.gz", "rt", encoding="utf-8"))
titles, texts, classes = [x["title"] for x in raw], [x["text"] for x in raw], [x["c"] for x in raw]

In [4]:
### Vectorize the text - do not modify
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vect = TfidfVectorizer(stop_words="english", sublinear_tf=True, smooth_idf=False, min_df=5)
vect.fit(texts)
vect.idf_ -= 1
idf = vect.idf_
tfidf = vect.transform(texts)
vocabulary = vect.get_feature_names_out()

In [6]:
import pandas as pd
pd.DataFrame.sparse.from_spmatrix(tfidf, columns=vocabulary)

Unnamed: 0,00,000,01,02,04,05,06,07,08,10,...,zoo,zooms,zorina,zorro,zsa,zuckerberg,zuylen,zzyzwicz,zörker,üter
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.087577,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10123,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Implement LSI

Implement Latent Semantic Indexing. Do **not** use regular SVD, but instead use truncated SVD from sklearn. (Do not attempt to implement Truncated SVD yourself, use the library here.) Return weights how well the factors explain the topics.

In [7]:
# Implement LSI here
def lsi(tfidf, k):
    """Latent Semantic Indexing. Return the factors, document assignment, and factor weights"""
    from sklearn.decomposition import TruncatedSVD

    # We use TruncatedSVD to return the U, Sigma and V_t matrices
    # The assignment: => document-topic matrix shape (num_docs, k)
    # The factors: => topics-words matrix shape (k, num_words)
    # The weights: => The weights for each topic, shape (1, k)
    lsi_object = TruncatedSVD(n_components=k, n_iter=100, random_state=42)

    assignment = lsi_object.fit_transform(tfidf)
    factors = lsi_object.components_
    weights = lsi_object.singular_values_

    return factors, assignment, weights

In [8]:
### Automatic tests. You do not need to understand or modify this code.
_tmp = lsi(tfidf, 2)
assert len(_tmp) == 3, "Incomplete result"
assert _tmp[0].shape == (2, tfidf.shape[1]), "Factor shape is not correct."
assert _tmp[1].shape == (tfidf.shape[0], 2), "Assignment shape is not correct."
del _tmp

## Explore your result

Explore the result: write a function to determine the most important words for each factor, and the most relevant documents.

In [9]:
def most_important(vocabulary, factor, k=10):
    """Most important words for each factor"""
    indices_max_values = np.argpartition(factor, -k)[-k:]
    list_vocabs = [vocabulary[i] for i in indices_max_values]
    return list_vocabs

def most_relevant(assignment, k=5):
    """Most relevant documents for each factor (return document indexes)"""
    # YOUR CODE HERE
    indices_max_values = np.argpartition(assignment, -k)[-k:]
    return indices_max_values

def explain(vocabulary, titles=None, classes=None, factors=None, assignment=None, weights=None):
    """Print an explanation for each factor.
       If weights is None, use the relative share of the assignment weights.
       Print the ARI when assigning each document to its maximum only."""
    from sklearn.metrics import adjusted_rand_score
    for i, f in enumerate(factors):
        print('For the Factor: {}, these are the following results'.format(i))
        important_vocabs = most_important(vocabulary, f)
        print('The most relevant words in this topic are: ')
        print('-------------------------------------------------------')
        print('\n')
        print(important_vocabs)
        important_docs = most_relevant(assignment)
        print('-------------------------------------------------------')
        print('\n')
        print('The most relevant documents belonging to this topic are: ')
        print([titles[i] for fact in important_docs for i in fact])
        print('\n')
        print('Their respective classes are ')
        print([classes[i] for fact in important_docs for i in fact])
        if weights is not None:
            factor_weight = weights[i]
            print('-------------------------------------------------------')
            print('\n')
            print('The Weight factor for this topic is {}'.format(factor_weight))
        print('#################################################################')

In [10]:
### Automatic tests. You do not need to understand or modify this code.
_tmp = lsi(tfidf, 2)
assert len(most_important(vocabulary, _tmp[0][0], 42)) == 42, "Wrong number of most important words"
for x in most_important(vocabulary, _tmp[0][0]): assert isinstance(x, str), "Most important words are not words"
assert len(most_relevant(_tmp[1][:,0], 42)) == 42, "Wrong number of relevant results."
from unittest.mock import patch
with patch('__main__.most_important') as mock_1, patch('__main__.most_relevant') as mock_2, patch('__main__.print') as mock_3:
    explain(vocabulary, titles, classes, *_tmp)
    assert mock_1.called, "You did not use most_important"
    assert mock_2.called, "You did not use most_central"
    assert mock_3.called, "You did not print"

In [11]:
# Explore your result. These should mostly be meaningful topics!
lsi_factors, lsi_assignment, lsi_weights = lsi(tfidf, 6)
explain(vocabulary, titles, classes, lsi_factors, lsi_assignment, lsi_weights)

For the Factor: 0, these are the following results
The most relevant words in this topic are: 
-------------------------------------------------------


['maggie', 'song', 'gag', 'family', 'marge', 'lisa', 'homer', 'bart', 'simpson', 'couch']
-------------------------------------------------------


The most relevant documents belonging to this topic are: 
['The Pacifier', 'Bart Jumps', 'Watching Television', 'Babysitting Maggie', 'Good Night', 'Burp Contest', 'Watching Television', 'Bart Jumps', 'Good Night', 'Babysitting Maggie', 'The Pacifier', 'Burp Contest', 'Babysitting Maggie', 'Bart Jumps', 'Watching Television', 'Good Night', 'The Pacifier', 'Burp Contest', 'Watching Television', 'Burp Contest', 'Bart Jumps', 'Babysitting Maggie', 'The Pacifier', 'Good Night', 'Watching Television', 'Burp Contest', 'Bart Jumps', 'Babysitting Maggie', 'The Pacifier', 'Good Night']


Their respective classes are 
['Episodes', 'Episodes', 'Episodes', 'Episodes', 'Episodes', 'Episodes', 'Episodes'

In [None]:
### This cell contains additional tests. You do not need to modify this cell.