# Latent Dirichlet Allocation

##### Topic modeling is the task of assigning each document to one or multiple topics.

###### The LDA model tries to find groups of words (the topics) that frequently appear together. LDA also requires that each document can be understood as a “mixture” of a subset of the topics.
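
As a rough illustration of the “mixture” idea, the sketch below samples one toy document following LDA's generative story. The vocabulary, the two topic-word distributions, and the Dirichlet parameter `alpha` are made-up values for illustration only; none of them come from the IMDb data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and two hand-written topics (each row sums to 1)
vocab = np.array(["war", "soldier", "battle", "funny", "joke", "laugh"])
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "war" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "comedy" topic
])

# Each document gets its own mixture over topics, drawn from a Dirichlet prior
alpha = [0.5, 0.5]
doc_topic = rng.dirichlet(alpha)

# For each word: pick a topic from the mixture, then a word from that topic
n_words = 8
topic_ids = rng.choice(len(topic_word), size=n_words, p=doc_topic)
words = [str(rng.choice(vocab, p=topic_word[k])) for k in topic_ids]

print("topic mixture:", doc_topic.round(2))
print("document:", " ".join(words))
```

Fitting LDA goes in the opposite direction: given only the documents, it estimates the topic-word distributions and the per-document mixtures.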

We will use a dataset of movie reviews from the IMDb website collected by Andrew Maas.

The dataset is provided as text files in two separate folders,
one for the training data and one for the test data. Each of these in turn has two subfolders,
one called pos and one called neg. The pos folder contains all the positive reviews, each as a separate text file, and similarly for the neg folder.

scikit-learn provides a helper function, load_files, for loading files stored
in such a folder structure. We apply load_files first to the training data:

In [1]:
import numpy as np

from sklearn.datasets import load_files

reviews_train = load_files("aclImdb/train/")
# load_files returns a Bunch containing the training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target

print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))

type of text_train: <class 'list'>
length of text_train: 25000
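
To make the expected folder layout concrete, here is a small standard-library sketch that builds a miniature `pos`/`neg` tree in a temporary directory and reads it back in the way `load_files` is documented to work: one class per subfolder, with integer labels following the alphabetical order of the folder names. The helper `read_labeled_folders` and the two sample reviews are hypothetical. The sketch imitates the behavior and is not scikit-learn's implementation.

```python
import tempfile
from pathlib import Path


def read_labeled_folders(root):
    """Return (texts, labels, class_names) from a root/<class>/<file> layout."""
    class_names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    texts, labels = [], []
    for label, name in enumerate(class_names):
        for path in sorted((Path(root) / name).iterdir()):
            # Raw bytes, as load_files returns when no encoding is given
            texts.append(path.read_bytes())
            labels.append(label)
    return texts, labels, class_names


with tempfile.TemporaryDirectory() as root:
    # Made-up reviews, one file per review
    for name, review in [("pos", "Loved it."), ("neg", "Dull and slow.")]:
        folder = Path(root) / name
        folder.mkdir()
        (folder / "0.txt").write_text(review, encoding="utf-8")
    texts, labels, class_names = read_labeled_folders(root)

print(class_names, labels, texts)
```

Unlike this sketch, `load_files` shuffles the samples by default, so the order of the returned texts does not follow the folder order.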


Let’s apply LDA to our movie review dataset. For unsupervised text document models, it is good to remove very common words, as they might otherwise dominate the analysis. We’ll remove words that appear in
more than 15 percent of the documents, and we’ll limit the bag-of-words model to the
10,000 words that are most common among those that remain:

### Model 1 (10 topics)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=10000, max_df=.15)
X = vect.fit_transform(text_train)
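
The effect of `max_df` and `max_features` can be illustrated on a toy corpus in plain Python. The sketch assumes the documented behavior of `CountVectorizer`: terms whose document frequency is strictly above `max_df` are dropped, and `max_features` then keeps the terms with the highest total count. The corpus and thresholds are invented, tokenization is simplified to whitespace splitting, and the alphabetical tie-breaking is a choice made here for reproducibility.

```python
from collections import Counter

docs = [
    "the movie was great",
    "the plot was thin",
    "the acting was great",
    "the ending surprised me",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
# Total count across the corpus, used to rank the surviving words
term_count = Counter(word for tokens in tokenized for word in tokens)

max_df = 0.5        # drop words found in more than 50% of the documents
max_features = 3    # then keep only the 3 most frequent remaining words

kept = [w for w in doc_freq if doc_freq[w] / len(docs) <= max_df]
vocabulary = sorted(kept, key=lambda w: (-term_count[w], w))[:max_features]

print(vocabulary)
```

Here "the" and "was" are removed because they occur in more than half of the documents, while "great" sits exactly at the threshold and is kept.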

We will learn a topic model with 10 topics. We’ll use the
"batch" learning method, which is somewhat slower than "online" (the default in older
scikit-learn releases; since version 0.20 the default is "batch") but
usually provides better results, and increase "max_iter" from its default of 10 to 25,
which can also lead to better models:

In [9]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
                                max_iter=25, random_state=0)
# We build the model and transform the data in one step
# Computing transform takes some time,
# and we can save time by doing both at once
document_topics = lda.fit_transform(X)

##### LatentDirichletAllocation has a components_ attribute that stores how important each word is for each topic. The size of components_ is (n_topics, n_words):

In [10]:
lda.components_.shape

(10, 10000)
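
According to the scikit-learn documentation, the entries of `components_` can be viewed as pseudocounts, and dividing each row by its sum gives a word distribution per topic. The sketch below shows that normalization on a small made-up array standing in for `lda.components_`.

```python
import numpy as np

# Stand-in for lda.components_: 2 topics x 4 words (made-up pseudocounts)
components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 6.0, 3.0],
])

# Divide every row by its sum so that each topic becomes a distribution
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(topic_word_dist)
print(topic_word_dist.sum(axis=1))
```

The ranking of words within a topic is the same before and after normalization, so the raw `components_` values are sufficient for finding the top words.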

To understand better what the different topics mean, we will look at the most important
words for each of the topics.

In [12]:
# For each topic (a row in the components_), sort the features (ascending)
# Invert rows with [:, ::-1] to make sorting descending
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]

# Get the feature names from the vectorizer
# (get_feature_names_out() in scikit-learn >= 1.0; older versions use get_feature_names())
feature_names = np.array(vect.get_feature_names_out())

import mglearn
# Print out the 10 topics:
mglearn.tools.print_topics(topics=range(10), feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)

topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
guy           show          horror        family        action        
gets          series        killer        years         effects       
around        episode       house         saw           original      
girl          tv            murder        book          animation     
car           episodes      wife          old           special       
down          shows         night         children      fight         
sex           season        thriller      kids          game          
woman         television    death         now           fi            
women         new           creepy        again         sci           
girls         funny         dead          young         look          


topic 5       topic 6       topic 7       topic 8       topic 9       
--------      --------      --------      --------      --------      
musi

Topic 1 seems to be about TV series, topic 2 about horror movies and thrillers,
and topic 3 about family, children, and growing up. Topic 4 seems to cover action,
animation, and science fiction with special effects, while topic 0 is harder to interpret.
Topic 6 appears to be about children’s movies and topic 8 seems to capture award-related reviews.
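
The ranking behind such a topic listing can be reproduced without mglearn. The sketch below applies the same `argsort` and reversal to a tiny made-up matrix standing in for `lda.components_`; the four feature names are placeholders chosen for illustration.

```python
import numpy as np

# Stand-ins for lda.components_ and the vectorizer's feature names
components = np.array([
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.1, 3.0],
])
feature_names = np.array(["series", "horror", "killer", "episode"])

# argsort sorts ascending; reversing the columns gives descending order
sorting = np.argsort(components, axis=1)[:, ::-1]

n_words = 2
for topic_idx, order in enumerate(sorting):
    top_words = feature_names[order[:n_words]]
    print("topic {}: {}".format(topic_idx, ", ".join(top_words)))
```

Each row of `sorting` holds column indices, so indexing `feature_names` with its first `n_words` entries yields the highest-weighted words of that topic.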

###### Using only 10 components (topics), each of the topics needs to be very broad, so that they can together cover all the different kinds of reviews in our dataset.

#### Let us build a model with 100 components.

### Model 2 (100 topics)

##### With 100 topics, each topic can specialize more and capture more specific aspects of the data.

In [15]:
# LDA with 100 components/topics; n_jobs=-1 uses all available CPU cores
lda100 = LatentDirichletAllocation(n_components=100, learning_method="batch",
                                   max_iter=25, random_state=0, n_jobs=-1)

# Fit the model and transform the dataset in one step
document_topics100 = lda100.fit_transform(X)

In [19]:
print(document_topics100.shape)

(25000, 100)


document_topics100 contains the 25,000 training samples (reviews), each represented by its weights on the 100 components (topics).
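
Each row of such a document-topic matrix can be read as the topic mixture of one document. As a minimal sketch, assuming rows that are normalized to sum to 1, the dominant topic per document and the number of documents per dominant topic can be computed as follows. The small array is a made-up stand-in for `document_topics100`.

```python
import numpy as np

# Stand-in for document_topics100: 3 documents x 4 topics (made-up values)
doc_topics = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.60, 0.30],
    [0.25, 0.25, 0.25, 0.25],
])

# Topic with the largest weight in each document (ties go to the first topic)
dominant = doc_topics.argmax(axis=1)

# Number of documents whose dominant topic is each topic
counts = np.bincount(dominant, minlength=doc_topics.shape[1])

print(dominant)
print(counts)
```

A count of zero for a topic only means that no document has it as its single largest component; the topic can still carry weight in many documents.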

In [21]:
print(lda100.components_.shape)

(100, 10000)


This shows that 100 components (topics) were learned, and that each topic stores one weight for each of the 10,000 words in the vocabulary.

###### Let us look at only some of the 100 topics:

In [16]:
# Chosen topics
topics = np.array([7, 16, 24, 25, 28, 36, 37, 45, 51, 53, 54, 63, 89, 97])

# For each topic (a row in the components_), sort the features (ascending)
# Invert rows with [:, ::-1] to make sorting descending
sorting = np.argsort(lda100.components_, axis=1)[:, ::-1]
# get_feature_names_out() in scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = np.array(vect.get_feature_names_out())

# Print the chosen topics
mglearn.tools.print_topics(topics=topics, feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=7, n_words=20)

topic 7       topic 16      topic 24      topic 25      topic 28      topic 36      topic 37      
--------      --------      --------      --------      --------      --------      --------      
us            romantic      horror        years         effects       jeff          lady          
our           jack          gore          old           special       anderson      french        
world         romance       zombie        ago           fi            simon         julia         
lives         comedy        scary         early         sci           beach         kelly         
own           danny         blood         later         space         wave          leading       
human         hotel         slasher       saw           monster       surfing       american      
each          perfect       dead          today         science       lose          beautiful     
may           kubrick       zombies       year          alien         magazine      julie         
real      

The topics we extracted this time seem to be more specific, though many are hard to
interpret. Topic 24 seems to be about horror movies, topic 28 about science fiction,
and topic 16 about romantic comedies, while topic 36 picks up surfing and beach vocabulary.
Topic 54 seems to capture bad reviews, while topic 63 mostly seems to be capturing positive
reviews of comedies.

For example, topic 45 seems to be about music. Let’s check which kinds of reviews are assigned to this topic:

In [22]:
# Sort documents by the weight of topic 45 ("music"), highest first
music = np.argsort(document_topics100[:, 45])[::-1]

# Print the ten documents where the topic is most important
for i in music[:10]:
    # Show first two sentences
    print(b".".join(text_train[i].split(b".")[:2]) + b".\n")

b'The script is nice.Though the casting is absolutely non-watchable.\n'
b'Jane Austen would definitely approve of this one!<br /><br />Gwyneth Paltrow does an awesome job capturing the attitude of Emma. She is funny without being excessively silly, yet elegant.\n'
b'I have no idea how a Texan (the director, Douglas McGrath) and the American actress Gwyneth Paltrow ever pulled this off but seeing this again will remind you what all the fuss about Ms. Paltrow was in the first place! I had long since gone off the woman and still feel she is rather dull in her Oscar-winning "Shakespeare In Love" performance but she gets all the beats right here--she is nigh on perfect as Emma Woodhouse.\n'
b"While there aren't any talking animals, big lavish song production numbers, or villians with half white / half black hair ..\n"
b"While there aren't any talking animals, big lavish song production numbers, or villians with half white / half black hair ..\n"
b'The Color Purple is a masterpiece. It displ

As we can see, this topic covers a wide variety of music-centered reviews, from musicals
to biographical movies.
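
The same "rank documents by one topic, then preview them" pattern can be tried on toy data in plain Python. In the sketch below, the three byte-string reviews and their topic weights are invented, and `first_sentences` is a hypothetical helper wrapping the split-and-join expression from the cell above.

```python
# Made-up byte-string reviews and their weights for a single topic
reviews = [
    b"Great songs. Weak plot. Still fun.",
    b"A dull film.",
    b"The soundtrack shines. The band is superb. See it.",
]
topic_weight = [0.40, 0.05, 0.85]


def first_sentences(text, n=2):
    """Return the first n period-delimited sentences of a bytes review."""
    return b".".join(text.split(b".")[:n]) + b"."


# Indices of the reviews, highest topic weight first
ranked = sorted(range(len(reviews)), key=lambda i: topic_weight[i], reverse=True)

for i in ranked[:2]:
    print(first_sentences(reviews[i]))
```

Because the text is split on every period, a review with fewer than two full sentences ends up with a doubled trailing period, and abbreviations such as "Ms." are treated as sentence ends.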