In [1]:
import numpy
import matplotlib.pyplot as plt 
from matplotlib import cm
import pandas
import mglearn

import os
import scipy

import sklearn
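import sklearn.ensemble              # import separately, otherwise the submodule won't be imported
import sklearn.neural_network        # import separately, otherwise the submodule won't be imported
import sklearn.datasets              # needed for load_files
import sklearn.decomposition         # needed for LatentDirichletAllocation
import sklearn.feature_extraction.text  # needed for CountVectorizer
from sklearn.cluster import KMeans
import sklearn.feature_selection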

import graphviz
import mpl_toolkits.mplot3d as plt3dd

import time



import nltk     # language processing packages for lemmatization
import spacy 

import re

# Data crunching

## Topic modeling and document clustering

One particular technique that is often applied to text data is topic modeling, an umbrella term for the task of assigning each document to one or multiple topics, usually without supervision. A good example is news data, which might be categorized into topics like “politics,” “sports,” “finance,” and so on. If each document is assigned a single topic, this is the task of clustering the documents, as discussed in Chapter 3. If each document can have more than one topic, the task relates to the decomposition methods from Chapter 3. Each of the components we learn then corresponds to one topic, and the coefficients of the components in the representation of a document tell us how strongly related that document is to a particular topic. Often, when people talk about topic modeling, they refer to one particular decomposition method called Latent Dirichlet Allocation (often LDA for short).

### Latent Dirichlet Allocation

Intuitively, the LDA model tries to find groups of words (the topics) that appear together frequently. LDA also assumes that each document can be understood as a “mixture” of a subset of the topics. It is important to understand that for the machine learning model a “topic” might not be what we would normally call a topic in everyday speech; it rather resembles the components extracted by PCA or NMF (which we discussed in Chapter 3), which might or might not have a semantic meaning. Even if there is a semantic meaning for an LDA “topic”, it might not be something we’d usually call a topic.

Going back to the example of news articles, we might have a collection of articles about sports, politics, and finance, written by two specific authors. In a politics article, we might expect to see words like “governor,” “vote,” “party,” etc., while in a sports article we might expect words like “team,” “score,” and “season.” Words in each of these groups will likely appear together, while it’s less likely that, for example, “team” and “governor” will appear together. However, these are not the only groups of words we might expect to appear together. The two reporters might prefer different phrases or different choices of words. Maybe one of them likes to use the word “demarcate” and the other likes the word “polarize.” Other “topics” would then be “words often used by reporter A” and “words often used by reporter B,” though these are not topics in the usual sense of the word.
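To make the “mixture of topics” idea concrete, here is a minimal sketch on a made-up corpus. The four toy documents, the choice of two topics, and the names `docs`, `toy_lda`, and `doc_topics` are assumptions for illustration only. In recent scikit-learn versions, each row returned by `fit_transform` is normalized, so it can be read as the document's weights over the topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two "politics" and two "sports" documents
docs = [
    "governor vote party vote election governor",
    "party election vote governor party",
    "team score season team game score",
    "season game team score season",
]

# Bag-of-words counts, one row per document
counts = CountVectorizer().fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)   # shape (4, 2)

# Each row is a mixture over the two topics
print(doc_topics.round(2))
print(doc_topics.sum(axis=1))
```

Which column ends up corresponding to which group of words is arbitrary, and on a corpus this small the separation may not be clean.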

Let’s apply LDA to our movie review dataset to see how it works in practice. For unsupervised text document models, it is often good to remove very common words, as they might otherwise dominate the analysis. We’ll remove words that appear in more than 15 percent of the documents, and we’ll limit the bag-of-words model to the 10,000 words that are most common after removing those very frequent words:

The movie review data can be downloaded from [this link](https://ai.stanford.edu/%7Eamaas/data/sentiment/).

In [7]:
path = r"./Raw Data/aclImdb/"
# load_files uses the sub-folder names ("pos", "neg") as class labels
reviews_train = sklearn.datasets.load_files(path + "train", categories=["pos", "neg"])
reviews_test = sklearn.datasets.load_files(path + "test", categories=["pos", "neg"])

text_train, y_train = reviews_train.data, reviews_train.target
text_test, y_test = reviews_test.data, reviews_test.target

# The reviews are bytes objects; replace HTML line breaks with spaces
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

In [8]:
vect = sklearn.feature_extraction.text.CountVectorizer(max_features=10000, max_df=0.15);
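
The effect of `max_df` and `max_features` can be sketched in plain Python. This is not how `CountVectorizer` is implemented; the helper `build_vocabulary`, the toy documents, and the alphabetical tie-breaking are assumptions made for illustration. The sketch only mirrors the documented behaviour: words whose document frequency is strictly above `max_df` are dropped, and the remaining words are ranked by their total count in the corpus.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, max_df=0.15, max_features=10000):
    """Sketch of document-frequency filtering followed by a top-N cut."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()    # number of documents each word appears in
    term_freq = Counter()   # total occurrences of each word in the corpus
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Drop words that occur in more than a fraction max_df of the documents
    kept = [w for w in term_freq if doc_freq[w] / n_docs <= max_df]

    # Of the remaining words, keep the max_features most frequent ones
    # (ties broken alphabetically, which is an assumption of this sketch)
    kept.sort(key=lambda w: (-term_freq[w], w))
    return sorted(kept[:max_features])

toy_docs = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["the", "acting", "was", "great"],
    ["the", "movie", "dragged"],
]
vocab = build_vocabulary(toy_docs, max_df=0.5, max_features=3)
print(vocab)   # ['acting', 'great', 'movie']
```

Here “the” and “was” appear in more than half of the toy documents and are removed before the top three remaining words are selected.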

In [10]:
vect.fit(text_train);

In [16]:
X_train = vect.transform(text_train);
X_test = vect.transform(text_test);

In [17]:
lda = sklearn.decomposition.LatentDirichletAllocation(n_components=10, n_jobs=4, learning_method="batch", max_iter=25, random_state=0);

In [19]:
# We build the model and transform the data in one step
# Computing transform takes some time,
# and we can save time by doing both at once

document_topics = lda.fit_transform(X_train)

In [45]:
a_max = document_topics.argmax(axis=1);
bc = numpy.bincount(a_max)
print("Number of docs with primary topic #:\n")
for i, _count in enumerate(bc.ravel()):
    print(f"Topic #{i+1} : {_count}\n")

Number of docs with primary topic #:

Topic #1 : 4343

Topic #2 : 2209

Topic #3 : 1934

Topic #4 : 1329

Topic #5 : 5120

Topic #6 : 4064

Topic #7 : 1346

Topic #8 : 1654

Topic #9 : 830

Topic #10 : 2171
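
The counting step above combines `argmax` and `numpy.bincount`. A small sketch on a made-up document-topic matrix (the values and the `toy_` names are hypothetical) shows what each call returns:

```python
import numpy as np

# Hypothetical document-topic weights: 5 documents, 3 topics (rows sum to 1)
toy_document_topics = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
])

# Index of the largest weight in each row: the document's primary topic
primary_topic = toy_document_topics.argmax(axis=1)

# Count how many documents have each topic as their primary topic
counts = np.bincount(primary_topic, minlength=toy_document_topics.shape[1])

print(primary_topic)   # [0 2 0 1 0]
print(counts)          # [3 1 1]
```

Passing `minlength` is a defensive choice: without it, `bincount` returns a shorter array if the highest-numbered topics are never the primary topic of any document.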



Like the decomposition methods we saw in Chapter 3, LatentDirichletAllocation
has a components_ attribute that stores how important each word is for each topic.
The shape of components_ is (n_topics, n_words):

In [48]:
lda.components_.shape

(10, 10000)
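
The rows of `components_` are not probabilities. The scikit-learn documentation describes them as pseudocounts that become a word distribution per topic once each row is normalized. A minimal sketch, assuming a small random count matrix (`toy_counts`) in place of the real reviews:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Hypothetical word counts: 20 documents, 6 words
toy_counts = rng.randint(0, 5, size=(20, 6))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(toy_counts)
print(toy_lda.components_.shape)   # (3, 6)

# Normalize each row so that it sums to 1: one word distribution per topic
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word.sum(axis=1))
```

The ordering of words within a topic is the same before and after normalization, so ranking the top words per topic works on either form.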

In [58]:
# For each topic (a row in components_), sort the features in ascending order
# Reverse each row with [:, ::-1] to make the sorting descending
sorting_mask = lda.components_.argsort(axis=1)[:, ::-1]
feature_names = numpy.array(vect.get_feature_names_out())
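
The sorting trick used in this cell can be checked on a tiny example. The weights and word list below are made up for illustration:

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics, 4 words
toy_components = np.array([
    [0.1, 2.0, 0.5, 1.2],
    [3.0, 0.2, 0.9, 0.4],
])
toy_feature_names = np.array(["plot", "funny", "war", "comedy"])

# argsort sorts ascending; reversing the columns gives descending order
toy_sorting = toy_components.argsort(axis=1)[:, ::-1]
print(toy_sorting)
# [[1 3 2 0]
#  [0 2 3 1]]

# Indexing the names with a row of indices lists that topic's top words
for topic_idx, order in enumerate(toy_sorting):
    print(topic_idx, toy_feature_names[order[:2]])
```

Each row of `toy_sorting` holds word indices ordered from most to least important for that topic, which is the form the `sorting` argument is given in the next cell.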

In [60]:
mglearn.tools.print_topics(topics=range(10), feature_names=feature_names,
                           sorting=sorting_mask, topics_per_chunk=5, n_words=10)

topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
between       war           funny         show          didn          
family        world         comedy        series        saw           
young         us            guy           episode       thought       
real          american      laugh         tv            am            
us            our           jokes         episodes      thing         
director      documentary   fun           shows         got           
work          history       humor         season        10            
both          years         re            new           want          
beautiful     new           hilarious     years         going         
each          human         doesn         television    watched       


topic 5       topic 6       topic 7       topic 8       topic 9       
--------      --------      --------      --------      --------      
acti