Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not currently work with Chrome; you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [1]:
NAME = "AVISHA ANILKUMAR BHIRYANI"

---

# Probabilistic Latent Semantic Indexing

Now we will implement probabilistic latent semantic indexing (pLSI).

In [2]:
### Load the input data - do not modify
import json, gzip, numpy as np
raw = json.load(gzip.open("/data/simpsonswiki.json.gz", "rt", encoding="utf-8"))
titles, texts, classes = [x["title"] for x in raw], [x["text"] for x in raw], [x["c"] for x in raw]

In [3]:
### This cell reduces the data set size for the autograder tests - do not modify

In [4]:
### Vectorize the text - do not modify
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words="english", min_df=5)
counts = cvect.fit_transform(texts)
vocabulary = cvect.get_feature_names_out()

## Explore your result

Explore the result: write a function to determine the most important words for each factor, and the most relevant documents.

**COPY your code from the first file here** (this is a rare case where copying is okay)
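
As a small illustration of the ranking step, the following sketch picks the indexes of the largest entries of a weight vector with `np.argsort`. The toy vocabulary, the toy weights, and the helper name `top_k_indexes` are made up for illustration and are not part of the assignment or the data set.

```python
import numpy as np

# Hypothetical toy data, not taken from the Simpsons wiki
toy_vocabulary = np.array(["homer", "bart", "donut", "school", "beer"])
toy_factor = np.array([0.9, 0.1, 0.7, 0.05, 0.4])

def top_k_indexes(weights, k):
    """Return the indexes of the k largest weights, largest first."""
    # Negating the weights turns the ascending argsort into a descending one;
    # the stable sort keeps the original order among equal weights.
    return np.argsort(-np.asarray(weights), kind="stable")[:k]

top = top_k_indexes(toy_factor, 3)
top_words = [str(toy_vocabulary[i]) for i in top]
print(top_words)  # ['homer', 'donut', 'beer']
```

The same idea applies to documents: rank one column of the assignment matrix instead of one row of the factor matrix.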

In [5]:
def most_important(vocabulary, factor, k=10):
    """Most important words for each factor"""
    # YOUR CODE HERE
    # Indexes of the k largest weights in this factor, largest first
    # (stable sort keeps vocabulary order among equal weights)
    top = np.argsort(-np.asarray(factor), kind="stable")[:k]
    return [vocabulary[i] for i in top]

def most_relevant(assignment, k=5):
    """Most relevant documents for each factor (return document indexes)"""
    # YOUR CODE HERE
    # Indexes of the k documents with the largest weight, largest first
    top = np.argsort(-np.asarray(assignment), kind="stable")[:k]
    return [int(i) for i in top]

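def explain(vocabulary, titles, classes, factors, assignment, weights=None):
    """Print an explanation for each factor.
       If weights is None, use the relative share of the assignment weights.
       Print the ARI when assigning each document to its maximum only."""
    from sklearn.metrics import adjusted_rand_score
    # YOUR CODE HERE
    assignment = np.asarray(assignment)
    if weights is None:
        # Default: each factor's share of the total assignment weight
        weights = assignment.sum(axis=0) / assignment.sum()
    for i in range(factors.shape[0]):
        important_words = most_important(vocabulary, factors[i])
        # Column i holds the weight of factor i in every document
        important_docs = most_relevant(assignment[:, i])
        print("Factor %d (weight %.3f)" % (i, weights[i]))
        print("  words:", ", ".join(important_words))
        print("  documents:", ", ".join(titles[d] for d in important_docs))
    # Hard clustering: assign each document to its strongest factor
    print("ARI:", adjusted_rand_score(classes, assignment.argmax(axis=1)))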
        

## Implement probabilistic Latent Semantic Indexing

Implement pLSI using the non-negative matrix factorization function of sklearn. Make sure to choose appropriate parameters to use KL divergence -- it is not sufficient to use defaults!
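
A minimal sketch of what such a parameter choice could look like on a toy count matrix, assuming scikit-learn's `NMF` with the multiplicative-update solver (`solver="mu"`), which is the solver that accepts `beta_loss="kullback-leibler"`. The toy matrix and the `init`, `max_iter`, and `random_state` values are illustrative assumptions, not required settings. The last step rescales each factor to sum to one, which is one common way to read the rows as word distributions per topic; the assignment itself does not ask for it.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy document-term counts: 4 documents, 5 terms
toy_counts = np.array([
    [3, 2, 0, 0, 1],
    [4, 1, 0, 1, 0],
    [0, 0, 5, 3, 1],
    [0, 1, 4, 2, 0],
], dtype=float)

# KL divergence needs the multiplicative-update solver; the default
# coordinate-descent solver only handles the Frobenius loss.
toy_model = NMF(n_components=2, solver="mu", beta_loss="kullback-leibler",
                init="random", max_iter=500, random_state=0)
toy_assignment = toy_model.fit_transform(toy_counts)  # documents x topics
toy_factors = toy_model.components_                   # topics x terms

# Optional: rescale each topic so its term weights sum to 1
word_given_topic = toy_factors / toy_factors.sum(axis=1, keepdims=True)
print(toy_assignment.shape, toy_factors.shape)
```

With the default `beta_loss="frobenius"`, the same call would minimize squared error instead, which is why the defaults are not sufficient here.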

In [8]:
# Implement pLSI here using NMF
def plsi(counts, k):
    """Probabilistic Latent Semantic Indexing. Return the factors and document assignment"""
    from sklearn.decomposition import NMF
    # YOUR CODE HERE
    # KL divergence requires the multiplicative-update ("mu") solver
    model = NMF(n_components=k, solver="mu", beta_loss="kullback-leibler")
    # fit_transform learns the factors and the document assignment in one pass
    assignment = model.fit_transform(counts)
    factors = model.components_
    return factors, assignment

In [9]:
### Automatic tests. You do not need to understand or modify this code.
from unittest.mock import patch
with patch('gensim.models.lsimodel.LsiModel') as mock_2:
    _tmp = plsi(counts, 2)
    assert len(_tmp) == 2, "Incomplete result"
    assert _tmp[0].shape == (2, counts.shape[1]), "Factor shape is not correct."
    assert _tmp[1].shape == (counts.shape[0], 2), "Assignment shape is not correct."
    assert not mock_2.called, "You were supposed to use sklearn here, not gensim."
del _tmp



In [10]:
### This cell contains additional tests. You do not need to modify this cell.

In [11]:
# Explore your result. These must be meaningful topics!
plsi_factors, plsi_assignment = plsi(counts, 6)
explain(vocabulary, titles, classes, plsi_factors, plsi_assignment)
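
One way to check whether the topics line up with the known classes is to compare a hard assignment against them. The sketch below uses made-up toy values (not results from this data set): each document is assigned to the factor with its largest weight, and the resulting labels are scored with `adjusted_rand_score`.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy values: 4 documents, 2 factors, 2 known classes
toy_classes = ["a", "a", "b", "b"]
toy_assignment = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.6],
])

# Hard clustering: each document goes to its strongest factor
toy_labels = toy_assignment.argmax(axis=1)

# ARI is 1.0 when the clustering matches the classes up to renaming
toy_ari = adjusted_rand_score(toy_classes, toy_labels)
print(toy_labels, toy_ari)
```

An ARI near 0 means the agreement is about what random labeling would give, so low values on the real data suggest the factors do not separate the classes.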



In [13]:
### This cell contains additional tests. You do not need to modify this cell.