In this lab, we shall learn how to implement topic modeling in Python.

Let's start by loading the 20 newsgroups dataset from scikit-learn. You can use all the data, but for simpler and faster execution the code below selects only the first 100 articles.


In [86]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
articles = dataset.data[:100]
print(len(articles))
print(articles[1])
print()
print("<><><><>><><><>><><><><><>")
print()
print(articles[2])

100
A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

<><><><>><><><>><><><><><>

well folks, my mac plus finally gave up the ghost this weekend after
starting life as a 512k way back in 1985.  sooo, i'm in the market for a
new machine a bit sooner than i intended to be...

i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) somebody can answer:

* does anybody know any dirt on when the next round of powerbook
introductions are expected?  i'd heard the 185c was 

We shall use the familiar CountVectorizer approach to count terms/words and their frequencies. Our custom tokenization function for CountVectorizer is shown below. In this function, we lemmatize each word. To get the correct lemma of a word, we also need to determine its part-of-speech tag. For example, the word saw has different lemmas (root words) as a noun and as a verb (saw and see, respectively), and of course the two have different meanings.

In [87]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Custom function for tokenization
def myTokenizer(text):
    
    lemmatizer = WordNetLemmatizer()
    lemmas=[]
    
    for sent in nltk.sent_tokenize(text):
        # nltk returns tags from the Penn Treebank tagset
        sentTag=nltk.pos_tag(nltk.word_tokenize(sent))
        #print (sentTag)
        for word, tag in sentTag:
            # The problem with the WordNet lemmatizer is that it recognizes only
            # WordNet tags and not Penn Treebank tags. So we
            # first convert Penn Treebank tags to WordNet tags
            wordNetTag=getWordnetPos(tag)
            if wordNetTag is None:
                continue
            else:
                lemmas.append(lemmatizer.lemmatize(word,wordNetTag))
                
    return  lemmas
    
    
# Function to convert
# Penn Treebank tags to WordNet tags
def getWordnetPos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:# We are ignoring everything other than the four tags above.
         # You can add more if you like
        return None      
    
print("done")


done


Let's add some stop words to our recipe.

In [88]:
import nltk
import string
stopWords=nltk.corpus.stopwords.words('english')
stopWords+=["''", "'s", "...", "``","--","*","-"]
stopWords+=list(string.punctuation)
print("done")

done


Time to create a term-document (or rather document-term) matrix using the CountVectorizer class. All the parameters of this class were already discussed in the earlier lab. If you need further help with the parameters, type help(CountVectorizer) in another cell.
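
To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch of roughly what CountVectorizer builds. It assumes whitespace tokenization and a made-up three-document corpus (both are illustrative assumptions, not part of the lab data), and the helper name document_term_matrix is hypothetical. It also mimics the effect of max_df by dropping terms that occur in more than the given fraction of documents.

```python
from collections import Counter

# Toy corpus (made up, for illustration only)
docs = [
    "the rocket reached orbit",
    "the probe left orbit",
    "the car needs insurance",
]

def document_term_matrix(docs, max_df=1.0):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in document d."""
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only terms whose document frequency (as a fraction) is at most max_df
    vocabulary = sorted(t for t, df in doc_freq.items() if df / len(docs) <= max_df)
    matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

vocabulary, matrix = document_term_matrix(docs, max_df=0.70)
print(vocabulary)
for row in matrix:
    print(row)
```

Here "the" appears in every document, so it is removed by max_df=0.70, while "orbit" (two of three documents) is kept.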

In [89]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=10000, max_df=.70,
                       tokenizer=myTokenizer, stop_words=stopWords)
X = vect.fit_transform(articles)
print (X.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vect.get_feature_names_out())

(100, 3905)


Now we shall train the LDA topic modeling algorithm on our data. In the code below, LDA is asked to create only 5 topics (n_components) and to run at most 25 iterations (max_iter) of its EM-style batch variational inference. More details can be found with help(LatentDirichletAllocation).

In [90]:
from sklearn.decomposition import LatentDirichletAllocation

#Initialize LDA
vocabulary=X.shape[1] # number of distinct terms (vocabulary size) in the training data
topics=5 
alpha=(1/topics) #alpha for LDA
beta=(1/vocabulary)# beta for LDA

# Note: in the general LDA model, alpha and beta can be vectors (one value per topic or per word) rather than a
# single value. The LDA implementation in scikit-learn does not accept vectors as input for alpha and beta, so we
# have to assign a single value to each. This means we can't control the skewness of the topic distribution or of
# the word distribution; every topic and every word gets the same prior. The gensim library can help us solve
# this issue (see Exercises)

lda = LatentDirichletAllocation(n_components=topics, learning_method="batch",
max_iter=25, random_state=0, topic_word_prior=beta ,doc_topic_prior=alpha)


# Train it.
documentTopics = lda.fit_transform(X)

print ("Documents and topics shape: ", documentTopics.shape)
print("Topics and words shape: {}".format(lda.components_.shape))


Documents and topics shape:  (100, 5)
Topics and words shape: (5, 3905)
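
The effect of the symmetric prior alpha can be illustrated with a small sketch that assumes only the Python standard library. It draws topic mixtures from a symmetric Dirichlet distribution (as normalized gamma draws) for a small and a large alpha and reports how much weight the dominant topic receives on average. The helper names sample_dirichlet and mean_largest_share, the sample size and the seed are illustrative choices, not part of scikit-learn.

```python
import random

def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) distribution."""
    # A Dirichlet sample is a vector of independent Gamma(alpha, 1) draws, normalized
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, n_topics=5, n_samples=2000, seed=0):
    """Average weight of the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    samples = [sample_dirichlet(alpha, n_topics, rng) for _ in range(n_samples)]
    return sum(max(s) for s in samples) / n_samples

sparse = mean_largest_share(alpha=0.2)    # same value as 1/topics with 5 topics
smooth = mean_largest_share(alpha=10.0)
print("alpha=0.2  -> mean dominant-topic share:", round(sparse, 2))
print("alpha=10.0 -> mean dominant-topic share:", round(smooth, 2))
```

A small alpha favours documents dominated by one topic, while a large alpha spreads the probability mass more evenly across topics.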


Let's print the five topics and the top ten words in each topic. The last line of the code (topic.argsort()[:-11:-1]) can be difficult to understand. argsort() gives the indexes that would sort the data (word weights in the topic) in ascending order. The slice [:-11:-1] then walks backwards from the end of that array and stops before position -11, so it picks the indexes of the top 10 words in descending order. To understand this code, play with the following cell.

In [91]:
# Code to understand the following reverse sorting. 
a=[1,2,3,4,5,6,7,8,9,10,11,12,13,14]
# Try putting different negative and positive numbers and see what happens
a[:-4:-1]


[14, 13, 12]
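
The same slice applied to an argsort result can be sketched in pure Python. The helper argsort below is a stand-in for NumPy's argsort (an assumption made for illustration; the lab code calls the NumPy method on each topic row), and the words and weights are made up.

```python
def argsort(values):
    """Return the indexes that would sort values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Made-up weights of five words in one topic
words = ["space", "car", "orbit", "phone", "launch"]
weights = [0.30, 0.05, 0.25, 0.10, 0.40]

order = argsort(weights)         # ascending: [1, 3, 2, 0, 4]
top3 = order[:-4:-1]             # walk backwards from the end, stop before position -4
print(top3)                      # [4, 0, 2]
print([words[i] for i in top3])  # ['launch', 'space', 'orbit']
```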

In [92]:
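# Get the name of each word
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names=vect.get_feature_names_out()
topWords=-11 # slice stop: the top 10 words are printed, position -11 is excluded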
# Go through the topic-word matrix
for topicIdx, topic in enumerate(lda.components_):
    print ("Topic ",  topicIdx)
    #Get top n words
    print (",".join([feature_names[i]   for i in topic.argsort()[:topWords:-1]]))
    

Topic  0
use,n't,anyone,people,know,available,starter,phone,please,system
Topic  1
n't,revolver,'ve,use,auto,board,semi,say,get,file
Topic  2
n't,year,get,use,car,go,problem,insurance,reserve,'m
Topic  3
armenian,-*-,russian,people,army,genocide,ottoman,muslim,use,option
Topic  4
launch,probe,n't,mission,titan,earth,get,space,use,orbit


There is some noise in our tokens, but apart from that some of the topics are quite distinct and cover different subjects. Let us also look at the distribution over the five topics for the first two documents.

In [93]:
print ("Topic 1 \t Topic 2  Topic 3\t Topic 4  Topic 5")
print(documentTopics[0])
print()
print(documentTopics[1])


Topic 1 	 Topic 2  Topic 3	 Topic 4  Topic 5
[0.98066011 0.00480248 0.00490835 0.00479358 0.00483548]

[0.9847768  0.00379841 0.0038237  0.00379802 0.00380306]
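
Each row of the document-topic matrix is a probability distribution over topics, so a document's dominant topic is simply the position of the largest value. The sketch below reuses the two rows printed above, copied by hand into plain lists so that it runs without the fitted model.

```python
# The two rows printed above, copied into plain Python lists
document_topics = [
    [0.98066011, 0.00480248, 0.00490835, 0.00479358, 0.00483548],
    [0.9847768, 0.00379841, 0.0038237, 0.00379802, 0.00380306],
]

dominant_topics = []
for doc_idx, row in enumerate(document_topics):
    # Index of the topic with the highest probability for this document
    dominant = max(range(len(row)), key=lambda t: row[t])
    dominant_topics.append(dominant)
    print("Document", doc_idx, "-> Topic", dominant + 1,
          "(probability %.3f, row sum %.3f)" % (row[dominant], sum(row)))
```

Both documents are assigned almost entirely to the first topic, which matches the output above.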


# LSA

LSA (latent semantic analysis) topic modeling in scikit-learn follows the same fit/transform pattern as LDA but uses the TruncatedSVD class.

In [94]:
from sklearn.decomposition import  TruncatedSVD
lsa = TruncatedSVD(n_components=5)
lsaDocTopic = lsa.fit_transform(X)
print("Document topic shape", lsaDocTopic.shape)
print ("Topics and word shape", lsa.components_.shape)

Document topic shape (100, 5)
Topics and word shape (5, 3905)
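
To give some intuition for what a truncated SVD computes, the following sketch finds the first right singular vector of a tiny document-term matrix by power iteration. It is an illustrative stand-in under stated assumptions (a made-up 3x4 count matrix, a fixed number of iterations, a hypothetical helper first_component), not the solver scikit-learn uses internally. The resulting vector plays the role of one row of lsa.components_: one weight per term.

```python
import math

# Made-up document-term counts: rows are documents, columns are terms
terms = ["launch", "orbit", "car", "insurance"]
X = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]

def first_component(X, iterations=100):
    """Approximate the top right singular vector of X by power iteration on X^T X."""
    n_docs, n_terms = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms   # unit-length starting vector
    for _ in range(iterations):
        # u = X v: one score per document
        u = [sum(X[d][t] * v[t] for t in range(n_terms)) for d in range(n_docs)]
        # w = X^T u: one weight per term
        w = [sum(X[d][t] * u[d] for d in range(n_docs)) for t in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The singular value is the length of X v
    u = [sum(X[d][t] * v[t] for t in range(n_terms)) for d in range(n_docs)]
    sigma = math.sqrt(sum(x * x for x in u))
    return v, sigma

component, sigma = first_component(X)
for term, weight in zip(terms, component):
    print("%-10s %.3f" % (term, weight))
print("singular value: %.3f" % sigma)
```

The first component loads on "launch" and "orbit", the two terms that account for most of the counts. Unlike LDA topics, SVD components can contain negative weights, so they are not probability distributions over words.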


In [95]:
for topic_idx, topic in enumerate(lsa.components_):
    print ("Topic %d:" % (topic_idx))
    print (",".join([feature_names[i]   for i in topic.argsort()[:-10-1:-1]]))

Topic 0:
armenian,russian,people,army,genocide,ottoman,turkish,turk,muslim,war
Topic 1:
probe,launch,mission,titan,earth,space,orbiter,year,orbit,atmosphere
Topic 2:
Topic 3:
-*-,**,mattress,-*,suresh,come,well,contact,pick,box
Topic 4:
option,power,ssf,use,capability,module,flight,redesign,station,team


Topics generated by LSA are similar to the LDA topics but not exactly the same.  Here is a good blog post on topic modeling covering different libraries and visualization: https://nlpforhackers.io/topic-modeling/

Attempt any one of the exercises. They have different difficulty levels.

# Exercise 7.1
Modify the code to get rid of noise from the tokens. For example, there are lots of characters like *,/,-,=,\,_. Feel free to remove any other noise that you deem appropriate.
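
One possible starting point is sketched below, under the assumption that a useful token consists of two or more ASCII letters and nothing else. The helper name isCleanToken and the sample tokens are hypothetical; such a check could be applied to each lemma inside myTokenizer before it is appended, and you may well prefer a looser or stricter rule.

```python
import re

# A token is kept only if it consists of two or more ASCII letters
CLEAN_TOKEN = re.compile(r"[a-z]{2,}")

def isCleanToken(token):
    """Return True for purely alphabetic tokens of length >= 2 (case-insensitive)."""
    return CLEAN_TOKEN.fullmatch(token.lower()) is not None

tokens = ["armenian", "-*-", "n't", "**", "orbit", "x", "'ve", "Launch"]
kept = [t for t in tokens if isCleanToken(t)]
print(kept)
```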





# Exercise 7.2
Download some documents (minimum 10 documents) from Project Gutenberg: https://www.gutenberg.org/. Apply both LDA and LSA to the documents to find the different topics discussed in them.



## First part - Apply LDA

In [96]:
# We first downloaded project gutenberg dataset locally at Gutenberg/txt/*.txt
# Gutenberg dataset obtained as single zip file from 
# https://drive.google.com/uc?id=0B2Mzhc7popBga2RkcWZNcjlRTGM&export=download

# The sample / reduced dataset includes works from 5 authors
# (trying to match the 5 topics we will model using LDA)
# Abraham Lincoln
# Ambrose Bierce
# Edgar Allan Poe
# Lewis Carroll
# Walt Whitman
from sklearn.datasets import load_files
books = load_files("./Gutenberg_small")

print(len(books.data),"books loaded...")
#print("Contents of...",books.filenames[0])
#print(books.data[0])

gutenberg_books = books.data


63 books loaded...


In [97]:
# First we create the term-document matrix, using the same tokenizer / lemmatizer as before
from sklearn.feature_extraction.text import CountVectorizer

gut_vect = CountVectorizer(max_features=10000, max_df=.70,
                       tokenizer=myTokenizer, stop_words=stopWords)
gut_matrix = gut_vect.fit_transform(gutenberg_books)

print (gut_matrix.shape)
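# Use gut_vect (fitted on the Gutenberg books), not vect from the newsgroups section.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(gut_vect.get_feature_names_out())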

(63, 10000)


In [98]:
# Now we do LDA

from sklearn.decomposition import LatentDirichletAllocation

#Initialize LDA
vocabulary=gut_matrix.shape[1] # number of distinct terms (vocabulary size) in the training data
topics=5 
alpha=(1/topics) #alpha for LDA
beta=(1/vocabulary)# beta for LDA

# Note: in the general LDA model, alpha and beta can be vectors (one value per topic or per word) rather than a
# single value. The LDA implementation in scikit-learn does not accept vectors as input for alpha and beta, so we
# have to assign a single value to each. This means we can't control the skewness of the topic distribution or of
# the word distribution; every topic and every word gets the same prior. The gensim library can help us solve
# this issue (see Exercises)

lda = LatentDirichletAllocation(n_components=topics, learning_method="batch",
max_iter=25, random_state=0, topic_word_prior=beta ,doc_topic_prior=alpha)


# Train it.
documentTopics = lda.fit_transform(gut_matrix)

print ("Documents and topics shape: ", documentTopics.shape)
print("Topics and words shape: {}".format(lda.components_.shape))

Documents and topics shape:  (63, 5)
Topics and words shape: (5, 10000)


In [99]:
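# These are the top words for each topic

# Get the name of each word
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names=gut_vect.get_feature_names_out()
topWords=-11 # slice stop: the top 10 words are printed, position -11 is excluded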
# Go through the topic-word matrix
for topicIdx, topic in enumerate(lda.components_):
    print ("Topic ",  topicIdx)
    #Get top n words
    print (",".join([feature_names[i]   for i in topic.argsort()[:topWords:-1]]))

Topic  0
soul,'d,woman,song,poem,thee,poet,sun,thy,america
Topic  1
lincoln,slavery,a.,united,washington,government,constitution,slave,union,douglas
Topic  2
woman,mr.,dog,gentleman,n.,political,system,american,officer,principle
Topic  3
x,·,-|,-·,h,proposition,i.e,exist,¶,class
Topic  4
alice,thy,'m,thou,'ve,bruno,soul,mr.,queen,tone


## Second part - Apply LSA

In [100]:
from sklearn.decomposition import  TruncatedSVD
lsa = TruncatedSVD(n_components=5)
lsaDocTopic = lsa.fit_transform(gut_matrix)
print("Document topic shape", lsaDocTopic.shape)
print ("Topics and word shape", lsa.components_.shape)

Document topic shape (63, 5)
Topics and word shape (5, 10000)


In [101]:
for topic_idx, topic in enumerate(lsa.components_):
    print ("Topic %d:" % (topic_idx))
    print (",".join([feature_names[i]   for i in topic.argsort()[:-10-1:-1]]))

Topic 0:
lincoln,a.,washington,united,government,slavery,union,president,mr.,telegram
Topic 1:
·,-·,x,-|,h,proposition,¶,exist,i.e,univ
Topic 2:
soul,'d,woman,poem,thy,poet,thee,thou,song,america
Topic 3:
a.,washington,telegram,lincoln,major-general,mansion,executive,c.,department,army
Topic 4:
'd,america,c.,literature,hospital,york,to-day,american,future,democracy


# Exercise 7.3
Use LDA or LSA on the movie review database from the earlier labs (or any other text database with known classes) to extract 50 to 100 topics. Once the topics are extracted, use them as features for your Naive Bayes classifier: each document's topic vector will be the input to the classifier. Report the accuracy, precision and recall.



# Exercise 7.4
In the scikit-learn code above, we couldn't pass alpha and beta as vectors with different starting values for each topic or word. This is possible with the gensim library in Python (https://radimrehurek.com/gensim/models/ldamodel.html), where beta is called eta. Implement LDA with gensim on the text dataset used in this lab (above), with alpha and beta as vectors rather than single values. Also implement Exercise 7.1. Note that gensim is easier to use than scikit-learn.