Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now; you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [1]:
NAME = "Aymane Hachcham"

---

# Latent Dirichlet Allocation

Now we will use latent Dirichlet allocation (LDA).

**Important notice:** the "Validate" function might time out after 30 or 60 seconds.
We intend to run the actual autograding later with a higher tolerance.

In [2]:
### Load the input data - do not modify
import json, gzip, numpy as np
raw = json.load(gzip.open("/data/simpsonswiki.json.gz", "rt", encoding="utf-8"))
titles, texts, classes = [x["title"] for x in raw], [x["text"] for x in raw], [x["c"] for x in raw]

In [3]:
### This cell reduces the data set size for the autograder tests - do not modify

In [4]:
### Vectorize the text - do not modify
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words="english", min_df=5)
counts = cvect.fit_transform(texts)
vocabulary = cvect.get_feature_names_out()

## Explore your result

Explore the result: write a function to determine the most important words for each factor, and the most relevant documents.

**COPY your code from the first file here** (one of the rare cases where copying is okay)

In [5]:
def most_important(vocabulary, factor, k=10):
    """Most important words for each factor"""
    # YOUR CODE HERE
    # Indices of the k largest weights, largest first
    # (argpartition alone would return them in arbitrary order)
    top = np.argsort(factor)[::-1][:k]
    return [vocabulary[i] for i in top]

def most_relevant(assignment, k=5):
    """Most relevant documents for each factor (return document indexes)"""
    # YOUR CODE HERE
    # `assignment` holds the weights of all documents for ONE factor (1-D)
    return np.argsort(assignment)[::-1][:k]

def explain(vocabulary, titles, classes, factors, assignment, weights=None):
    """Print an explanation for each factor.
       If weights is None, use the relative share of the assignment weights.
       Print the ARI when assigning each document to its maximum only."""
    from sklearn.metrics import adjusted_rand_score
    # YOUR CODE HERE
    assignment = np.asarray(assignment)
    if weights is None:
        # Relative share of the total assignment weight held by each factor
        weights = assignment.sum(axis=0) / assignment.sum()
    for i, f in enumerate(factors):
        print('Factor {} (weight {:.3f})'.format(i, weights[i]))
        print('-------------------------------------------------------')
        print('The most important words in this topic are:')
        print(most_important(vocabulary, f))
        # Rank the documents by their weight for this factor only
        docs = most_relevant(assignment[:, i])
        print('The most relevant documents belonging to this topic are:')
        print([titles[d] for d in docs])
        print('Their respective classes are:')
        print([classes[d] for d in docs])
        print('#################################################################')
    # Hard clustering: assign each document to its strongest factor
    labels = assignment.argmax(axis=1)
    print('ARI (maximum assignment): {:.4f}'.format(adjusted_rand_score(classes, labels)))
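As a small illustration of the ranking logic behind these helpers, here is a minimal sketch on made-up numbers. The names `toy_vocabulary`, `toy_factors`, `toy_assignment`, and `top_k` are hypothetical and not part of the assignment. It ranks words per factor, ranks documents per factor using one column of the assignment matrix, and derives the hard labels and relative factor shares.

```python
import numpy as np

# Hypothetical toy data: 2 factors over a 5-word vocabulary, 4 documents
toy_vocabulary = np.array(["bart", "homer", "lisa", "marge", "maggie"])
toy_factors = np.array([[0.10, 0.40, 0.05, 0.30, 0.15],
                        [0.35, 0.05, 0.45, 0.05, 0.10]])
toy_assignment = np.array([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.6, 0.4],
                           [0.3, 0.7]])

def top_k(values, k):
    """Indices of the k largest values, largest first."""
    return np.argsort(values)[::-1][:k]

# Top 3 words of each factor (one row of the factor matrix per factor)
top_words = [toy_vocabulary[top_k(f, 3)].tolist() for f in toy_factors]

# Top 2 documents of each factor (one COLUMN of the assignment per factor)
top_docs = [top_k(toy_assignment[:, j], 2).tolist()
            for j in range(toy_assignment.shape[1])]

# Hard labels (strongest factor per document) and relative factor shares
hard_labels = toy_assignment.argmax(axis=1)
shares = toy_assignment.sum(axis=0) / toy_assignment.sum()

print(top_words)
print(top_docs)
print(hard_labels, shares)
```

Note that `np.argpartition` only guarantees that the top `k` entries end up in the last `k` positions, not that they are sorted, which is why the sketch uses `np.argsort`.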

## LDA with Gensim

The `gensim` package contains more powerful implementations of LDA.

To use these, you will need to convert the scipy data structures using `Scipy2Corpus`.

For LDA, use an asymmetric topic prior. Use `chunksize=128, passes=2`.
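To get a feeling for what an asymmetric topic prior looks like, the following sketch reproduces the shape that the gensim documentation describes for `alpha="asymmetric"`: a fixed, normalized prior proportional to `1 / (topic_index + sqrt(num_topics))`. The helper name `asymmetric_prior` is hypothetical; gensim computes this internally.

```python
import numpy as np

def asymmetric_prior(num_topics):
    """Normalized prior proportional to 1 / (topic_index + sqrt(num_topics))."""
    prior = 1.0 / (np.arange(num_topics) + np.sqrt(num_topics))
    return prior / prior.sum()

alpha = asymmetric_prior(6)
print(alpha)        # decreasing weights: earlier topics get more prior mass
print(alpha.sum())  # normalized to sum to one
```

Under this prior, the first topics are expected a priori to cover more of the corpus than the later ones, unlike a symmetric prior that treats all topics equally.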

In [6]:
### Enable logging in Gensim - no need to modify
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

In [7]:
### Convert the corpus and vocabulary as needed for gensim here!
from gensim.matutils import Scipy2Corpus

# Stream the rows of the sparse count matrix as gensim bag-of-words documents
corpus = Scipy2Corpus(counts)
# Map each column index of the count matrix to its vocabulary term
id2word = dict(enumerate(vocabulary))
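Gensim represents each document as a list of `(token_id, count)` pairs. The sketch below builds this format by hand from a small dense count matrix, using only NumPy, to show what the conversion produces conceptually. The names `toy_counts`, `toy_terms`, `toy_corpus`, and `toy_id2word` are made up for illustration.

```python
import numpy as np

# Hypothetical document-term counts: 2 documents, 3 terms
toy_counts = np.array([[2, 0, 1],
                       [0, 3, 0]])
toy_terms = ["bart", "homer", "lisa"]

# One list of (token_id, count) pairs per document, skipping zero counts
toy_corpus = [[(int(j), float(row[j])) for j in np.flatnonzero(row)]
              for row in toy_counts]

# Mapping from token id (column index) to the term itself
toy_id2word = dict(enumerate(toy_terms))

print(toy_corpus)
print(toy_id2word)
```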

In [8]:
### Automatic tests. You do not need to understand or modify this code.
assert isinstance(id2word, dict), "Not a dictionary"
assert len(id2word) == len(vocabulary)

In [None]:
### Use Gensim LDA here with an asymmetric prior!
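def gensim_lda(counts, id2word, k):
    """Latent Dirichlet Allocation. Return the factors and document assignment"""
    from gensim.models.ldamodel import LdaModel
    from gensim.matutils import corpus2dense, Scipy2Corpus # for return
    # YOUR CODE HERE
    # gensim expects a bag-of-words corpus, not a scipy sparse matrix
    corpus = Scipy2Corpus(counts)
    lda_model = LdaModel(corpus, num_topics=k, id2word=id2word,
                         alpha="asymmetric", chunksize=128, passes=2)

    # Convert the topic-word estimates to probabilities (each topic sums to 1)
    topics_terms = lda_model.state.get_lambda().astype(np.float64)
    factors = topics_terms / topics_terms.sum(axis=1, keepdims=True)

    # Document-topic distribution as a dense (documents x topics) matrix
    doc_topics = lda_model.get_document_topics(corpus, minimum_probability=0)
    assignment = corpus2dense(doc_topics, num_terms=k, num_docs=counts.shape[0]).T

    return factors, assignment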
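The normalization step turns unnormalized topic-word weights into one probability distribution per topic. A minimal sketch with made-up numbers (`toy_lambda` is hypothetical) shows the row-wise division and the kind of check the automatic tests apply, namely that the whole matrix sums to the number of topics.

```python
import numpy as np

# Hypothetical unnormalized topic-word weights: 2 topics, 3 words
toy_lambda = np.array([[2.0, 1.0, 1.0],
                       [0.5, 0.5, 4.0]])

# Divide each row by its own sum so every topic becomes a distribution
toy_factors = toy_lambda / toy_lambda.sum(axis=1, keepdims=True)

print(toy_factors)
# Each of the 2 rows sums to 1, so the full matrix sums to the topic count
print(abs(toy_factors.sum() - 2) < 1e-6)
```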

In [None]:
glda_factors, glda_assignment = gensim_lda(counts, id2word, 6)

In [None]:
# Explore your result. These must be meaningful topics!
explain(vocabulary, titles, classes, glda_factors, glda_assignment)

In [None]:
### Automatic tests. You do not need to understand or modify this code.
assert glda_factors.shape == (6, counts.shape[1]), "Factor shape is not correct."
assert glda_assignment.shape == (counts.shape[0], 6), "Assignment shape is not correct."
assert abs(glda_factors.sum()-6)<1e-6, "Topic word matrix are not probabilities."
# assert abs(glda_assignment.sum()-counts.shape[0])<1e-6, "Document topic matrix are not probabilities."

In [None]:
### This cell contains additional tests. You do not need to modify this cell.