
# <center> <span style='color:Blue'>--- Latent Dirichlet Allocation (LDA) ---</span> </center>

LDA is an unsupervised generative model based on the following assumptions:
- Each document in the corpus is a bag of words (word order is ignored);
- Each document $m$ covers several topics in varying proportions, given by a distribution over topics $\theta_m$;
- Each topic $k$ has an associated distribution over the words of the vocabulary, $\phi_k$;
- We can therefore represent each topic by a probability for each word;
- $Z_n$ denotes the topic assigned to the word $W_n$.

Since we only observe the documents, we have to infer the topics, the distribution of words within each topic, and how often each topic appears in the corpus.
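The generative assumptions listed above can be sketched with NumPy. This is only an illustrative simulation, not the inference procedure: the toy vocabulary, the sizes, and the symmetric Dirichlet prior values `alpha` and `beta` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = np.array(["gun", "law", "drive", "disk", "god", "bible"])  # toy vocabulary (assumed)
n_topics, n_words = 3, 8      # illustrative sizes (assumed)
alpha, beta = 0.5, 0.5        # symmetric Dirichlet priors (assumed values)

# phi[k] is the word distribution of topic k (one row per topic)
phi = rng.dirichlet(np.full(len(vocab), beta), size=n_topics)

# theta is the topic mixture of a single document m
theta = rng.dirichlet(np.full(n_topics, alpha))

# For each word position n: draw a topic Z_n, then draw the word W_n from that topic
z = rng.choice(n_topics, size=n_words, p=theta)
w = np.array([rng.choice(len(vocab), p=phi[k]) for k in z])

print("theta:", np.round(theta, 2))
print("topics:", z)
print("words:", " ".join(vocab[w]))
```

Inference runs in the opposite direction: given only the words, it estimates `theta`, `phi`, and the topic assignments.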

<img src="images/LDA.jpg" width="550">

Inference for this model is carried out with scikit-learn, which implements a version of LDA.
The LDA algorithm is applied to a classic dataset that ships with scikit-learn: the 20 newsgroups dataset, which contains roughly 20,000 newsgroup posts. The call below uses the loader's default settings, which return only the training subset.

# Importing the dataset

In [15]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

# Create the LDA model

In [16]:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 20

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)

In [18]:
# Materialize the sparse matrix as a dense one
data_dense = tf.todense()

# "Sparsicity" here is the percentage of non-zero cells, i.e. the density of the matrix
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsicity:  2.529883330387131 %


In [19]:
lda = LatentDirichletAllocation(
        n_components=n_topics, 
        max_iter=5, 
        learning_method='online', 
        learning_offset=50.,
        random_state=0)

# Fit the model on the document-term matrix
lda.fit(tf)

LatentDirichletAllocation(learning_method='online', learning_offset=50.0,
                          max_iter=5, n_components=20, random_state=0)

# Evaluation

We display the most representative words of the modelled topics.

In [20]:
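def display_topics(model, feature_names, no_top_words):
    # Print the no_top_words highest-weighted words of each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}:".format(topic_idx))
        # argsort is ascending, so the reversed slice yields the largest weights first
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
display_topics(lda, tf_vectorizer.get_feature_names_out(), no_top_words)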
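The slice `argsort()[:-n - 1:-1]` used to pick the top words can be hard to read. The following sketch, on a hypothetical weight vector, shows that it returns the indices of the `n` largest values in descending order, and that it matches the alternative `(-weights).argsort()[:n]` when there are no ties.

```python
import numpy as np

# Hypothetical topic weights over a five-word vocabulary
weights = np.array([0.1, 5.0, 2.0, 7.5, 0.3])
n = 2

# Ascending argsort, then walk backwards from the end for n elements
top_reversed_slice = weights.argsort()[:-n - 1:-1]

# Alternative: sort the negated weights and keep the first n indices
top_negated = (-weights).argsort()[:n]

print(top_reversed_slice)  # [3 1]
print(top_negated)         # [3 1]
```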


Topic 0:
people gun state control right guns crime states law police
Topic 1:
time question book years did like don space answer just
Topic 2:
mr line rules science stephanopoulos title current define int yes
Topic 3:
key chip keys clipper encryption number des algorithm use bit
Topic 4:
edu com cs vs w7 cx mail uk 17 send
Topic 5:
use does window problem way used point different case value
Topic 6:
windows thanks know help db does dos problem like using
Topic 7:
bike water effect road design media dod paper like turn
Topic 8:
don just like think know people good ve going say
Topic 9:
car new price good power used air sale offer ground
Topic 10:
file available program edu ftp information files use image version
Topic 11:
ax max b8f g9v a86 145 pl 1d9 0t 34u
Topic 12:
government law privacy security legal encryption court fbi technology information
Topic 13:
card bit memory output video color data mode monitor 16
Topic 14:
drive scsi disk mac hard apple drives controller software port
...

In [23]:
# Log-likelihood: the higher, the better
print("Log Likelihood: ", lda.score(tf))

# Perplexity: the lower, the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda.perplexity(tf))

# See model parameters
print(lda.get_params())

Log Likelihood:  -3049240.730366368
Perplexity:  254.8741111824851
{'batch_size': 128, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.7, 'learning_method': 'online', 'learning_offset': 50.0, 'max_doc_update_iter': 100, 'max_iter': 5, 'mean_change_tol': 0.001, 'n_components': 20, 'n_jobs': None, 'perp_tol': 0.1, 'random_state': 0, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 0}
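The relation between the two scores can be checked with plain arithmetic. The sketch below assumes the formula from the code comment, perplexity = exp(-log-likelihood per word), holds exactly for the reported values; the word count it derives is inferred from those two numbers, not measured on the corpus.

```python
import math

# Values reported in the output above
log_likelihood = -3049240.730366368
perplexity = 254.8741111824851

# perplexity = exp(-log_likelihood / n_words)  =>  n_words = -log_likelihood / ln(perplexity)
implied_word_count = -log_likelihood / math.log(perplexity)
print("Implied number of word occurrences:", round(implied_word_count))

# Round trip: recompute the perplexity from the per-word log-likelihood
per_word_ll = log_likelihood / implied_word_count
print("Per-word log-likelihood:", round(per_word_ll, 3))
print("Recomputed perplexity:", math.exp(-per_word_ll))
```

Because perplexity is normalized per word, it is easier to compare across corpora of different sizes than the raw log-likelihood.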


In [25]:
from sklearn.model_selection import GridSearchCV
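# Define the parameter grid to search
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Initialize the model
lda = LatentDirichletAllocation()

# Initialize the grid search; with no scoring argument it uses the estimator's own
# score method (approximate log-likelihood), evaluated by cross-validation
model = GridSearchCV(lda, param_grid=search_params)

# Run the grid search
model.fit(tf)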

GridSearchCV(estimator=LatentDirichletAllocation(),
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [10, 15, 20, 25, 30]})
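After fitting, the per-combination scores are available in `cv_results_`. The sketch below runs a much smaller search on a hypothetical random count matrix so that it finishes quickly; the data, the grid, `cv=3`, and `max_iter=5` are assumptions made for illustration and are unrelated to the newsgroup results.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Hypothetical toy document-term matrix: 60 documents, 30 words, random counts
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=1.0, size=(60, 30))

toy_search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=0),
    param_grid={"n_components": [2, 3]},
    cv=3,  # assumed fold count, chosen to keep the example fast
)
toy_search.fit(toy_counts)

# One row per parameter combination, with the mean held-out score and its rank
results = pd.DataFrame(toy_search.cv_results_)
print(results[["param_n_components", "mean_test_score", "rank_test_score"]])
```

Looking at the whole table, rather than only the best parameters, shows how close the competing settings are.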

In [26]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

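# Mean cross-validated log-likelihood of the best model (computed on held-out folds,
# so it is not directly comparable to lda.score(tf) on the full data)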
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(tf))

Best Model's Params:  {'learning_decay': 0.9, 'n_components': 10}
Best Log Likelihood Score:  -634375.5770007303
Model Perplexity:  263.326708365321


# Dominant topic in each document and the weights

In [55]:

# Create Document - Topic Matrix
import pandas as pd
import numpy as np
lda_output = best_lda_model.transform(tf)

# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(len(documents))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight= weight)

# Apply the styles to the first 15 rows (in pandas >= 2.1, Styler.applymap is named Styler.map)
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics



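| | Topic0 | Topic1 | Topic2 | Topic3 | Topic4 | Topic5 | Topic6 | Topic7 | Topic8 | Topic9 | dominant_topic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 0.38 | 0.0 | 0.0 | 0.0 | 5 |
| Doc1 | 0.0 | 0.0 | 0.97 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| Doc2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.77 | 0.21 | 0.0 | 0.0 | 0.0 | 5 |
| Doc3 | 0.14 | 0.0 | 0.0 | 0.12 | 0.0 | 0.43 | 0.0 | 0.3 | 0.0 | 0.0 | 5 |
| Doc4 | 0.01 | 0.01 | 0.39 | 0.01 | 0.13 | 0.44 | 0.01 | 0.01 | 0.01 | 0.01 | 5 |
| Doc5 | 0.01 | 0.01 | 0.43 | 0.01 | 0.01 | 0.5 | 0.01 | 0.01 | 0.01 | 0.01 | 5 |
| Doc6 | 0.0 | 0.17 | 0.0 | 0.29 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.15 | 5 |
| Doc7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.93 | 0.0 | 0.0 | 0.0 | 0.04 | 5 |
| Doc8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.96 | 0.0 | 0.0 | 0.0 | 0.0 | 5 |
| Doc9 | 0.2 | 0.0 | 0.0 | 0.17 | 0.0 | 0.62 | 0.0 | 0.0 | 0.0 | 0.0 | 5 |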


In [36]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

| | Topic Num | Num Documents |
|---|---|---|
| 0 | 5 | 3971 |
| 1 | 1 | 1802 |
| 2 | 2 | 1723 |
| 3 | 0 | 1154 |
| 4 | 3 | 1135 |
| 5 | 4 | 659 |
| 6 | 6 | 470 |
| 7 | 7 | 237 |
| 8 | 9 | 153 |
| 9 | 8 | 10 |
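The two steps above (taking the dominant topic of each document, then counting documents per topic) reduce to an `argmax` followed by a count. A minimal sketch on a hypothetical document-topic matrix, using `np.bincount` as an alternative to `value_counts`:

```python
import numpy as np

# Hypothetical document-topic matrix: 4 documents, 3 topics, each row sums to 1
doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
    [0.10, 0.10, 0.80],
])

# Dominant topic = column index of the largest weight in each row
dominant = np.argmax(doc_topic, axis=1)
print(dominant)  # [0 2 0 2]

# Number of documents per dominant topic (position = topic number)
counts = np.bincount(dominant, minlength=doc_topic.shape[1])
print(counts)  # [2 0 2]
```

Unlike `value_counts`, `np.bincount` with `minlength` also reports topics that are never dominant.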


# Visualize the LDA model with pyLDAvis

In the pyLDAvis inter-topic distance map, a good topic model tends to show fairly large, non-overlapping bubbles, one per topic.

In [39]:
import pyLDAvis
# Note: from pyLDAvis 3.4 on, this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, tf, tf_vectorizer, mds='tsne')
panel



In [41]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

# Assign column and index labels (get_feature_names_out() replaces the removed get_feature_names())
df_topic_keywords.columns = tf_vectorizer.get_feature_names_out()
df_topic_keywords.index = topicnames

# View
df_topic_keywords.head()



| | 00 | 000 | 01 | 02 | 03 | 04 | 0d | 0t | 10 | 100 | ... | written | wrong | wrote | x11 | xt | year | years | yes | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic0 | 0.10001 | 82.871537 | 1.021928 | 0.100008 | 0.100032 | 7.951919 | 0.1 | 0.1 | 187.679456 | 129.387907 | ... | 4.148385 | 15.072315 | 1.932609 | 0.100002 | 0.100002 | 235.187179 | 236.884697 | 1.587526 | 6.578765 | 0.100021 |
| Topic1 | 14.795715 | 0.100008 | 6.079161 | 8.515794 | 0.100018 | 0.100014 | 0.1 | 0.1 | 48.661987 | 10.882132 | ... | 66.738695 | 77.62662 | 39.925199 | 21.446189 | 44.848521 | 107.270577 | 0.100013 | 47.452668 | 0.100004 | 0.100017 |
| Topic2 | 0.100001 | 0.100011 | 0.100002 | 0.100008 | 0.100001 | 0.100003 | 0.1 | 0.1 | 15.587243 | 13.682722 | ... | 103.220258 | 229.423398 | 67.043302 | 0.1 | 0.1 | 362.484304 | 174.058499 | 133.010024 | 5.163812 | 44.089248 |
| Topic3 | 30.858415 | 0.100023 | 0.100027 | 3.23586 | 0.100016 | 0.16601 | 0.100001 | 0.100002 | 79.508682 | 84.215862 | ... | 0.279577 | 12.531578 | 0.100012 | 0.10001 | 40.395244 | 35.636685 | 36.172727 | 44.039826 | 0.100007 | 0.10001 |
| Topic4 | 4.694725 | 0.234618 | 15.33173 | 3.54545 | 8.096337 | 33.087373 | 0.1 | 0.1 | 47.538207 | 40.738668 | ... | 127.015593 | 1.735114 | 15.689276 | 200.753787 | 129.89322 | 0.109397 | 1.747276 | 42.356514 | 13.772621 | 13.87899 |
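The entries of `components_` shown above are not probabilities: scikit-learn documents them as pseudo-counts of how often each word was assigned to each topic. Dividing each row by its sum gives a topic-word distribution comparable to $\phi_k$. A minimal sketch with a hypothetical 2-topic, 4-word matrix (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical topic-word pseudo-counts: 2 topics, 4 words
components = np.array([
    [0.1, 82.9, 1.0, 16.0],
    [14.8, 0.1, 6.1, 9.0],
])

# Divide each row by its sum so that every topic becomes a probability distribution
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(np.round(topic_word_dist, 3))
print(topic_word_dist.sum(axis=1))  # each row sums to 1
```

With the fitted model, the same normalization would be applied to `best_lda_model.components_`.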


# Get the top 15 keywords for each topic

In [48]:
# Show top n keywords for each topic
def show_topics(vectorizer=tf_vectorizer, lda_model=best_lda_model, n_words=20):
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=tf_vectorizer, lda_model=best_lda_model, n_words=15)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords



| | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 | Word 11 | Word 12 | Word 13 | Word 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | space | use | nasa | power | new | high | car | research | used | data | time | good | earth | launch | low |
| Topic 1 | file | windows | use | thanks | program | does | know | problem | using | window | like | output | help | files | need |
| Topic 2 | god | think | jesus | does | people | believe | don | good | say | db | just | time | like | game | bible |
| Topic 3 | drive | card | scsi | disk | hard | mac | dos | new | pc | price | drives | bit | like | use | controller |
| Topic 4 | edu | com | available | mail | ftp | information | list | pub | send | software | version | file | internet | email | anonymous |
| Topic 5 | people | don | just | like | know | think | right | time | good | make | say | ve | way | really | want |
| Topic 6 | mr | president | people | government | armenian | said | new | turkish | jews | armenians | american | states | stephanopoulos | war | press |
| Topic 7 | key | encryption | chip | keys | clipper | law | use | number | play | gun | public | security | 1993 | bit | des |
| Topic 8 | ax | max | g9v | b8f | a86 | pl | 145 | 1d9 | 0t | 1t | 34u | bhj | 75u | giz | 3t |
| Topic 9 | 00 | 10 | 15 | 25 | 11 | 12 | 20 | 17 | 14 | 16 | 13 | 24 | 30 | 18 | 50 |


### Sources
https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html \
https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/