In [0]:
! pip install -q kaggle
from google.colab import files
files.upload()
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json


Saving kaggle.json to kaggle.json


In [0]:
! wget -O /content/data.zip "https://www.dropbox.com/s/7mwk6671aa7cckg/simplewiki-20170201-pages-articles-multistream.zip?dl=0"

--2020-05-15 08:30:20--  https://www.dropbox.com/s/7mwk6671aa7cckg/simplewiki-20170201-pages-articles-multistream.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.1, 2620:100:6018:1::a27d:301
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/7mwk6671aa7cckg/simplewiki-20170201-pages-articles-multistream.zip [following]
--2020-05-15 08:30:20--  https://www.dropbox.com/s/raw/7mwk6671aa7cckg/simplewiki-20170201-pages-articles-multistream.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc33743e5fdc331583f814df56c3.dl.dropboxusercontent.com/cd/0/inline/A3wXmJ6C3ZU8ulrQlytgP-pSt2WETokgQXtmGmGZxVR4vnV4Af8R4BMwjW1RZz2RmR15b0ScQj2BoN7UbCXdgD8zOSFjqoYVF_-X4XnePxJo0OsnDdREN8nihvIuZq_bDeQ/file# [following]
--2020-05-15 08:30:21--  https://uc33743e5fdc331583f814df56c3.dl.dropboxusercontent.com/cd/0/in

In [0]:
! unzip '/content/data.zip'

Archive:  /content/data.zip
  inflating: simplewiki-20170201-pages-articles-multistream-index.txt  
  inflating: simplewiki-20170201-pages-articles-multistream.xml  


https://appliedmachinelearning.blog/2017/08/28/topic-modelling-part-1-creating-article-corpus-from-simple-wikipedia-dump/

A huge number of text articles are generated every day by publishing houses, blogs, media outlets and other sources. This leads to one of the major tasks in natural language processing: effectively managing, searching and categorizing articles according to their subjects or themes. Typically, these text mining tasks include text clustering, document similarity and text categorization. To support them, we need a way to extract the theme of an article. In text analytics, this is known as “Topic Modelling”. Also, given a topic, our software should be able to find the articles that are similar to it. This is known as “Document Similarity”.

Deriving such meaningful information from text documents is the main objective of this blog-post series. I will cover the whole application of topic modelling in 3 blog-posts. The purpose of the series is to build the system from scratch and give readers an insight into its implementation. This particular post focuses on creating a corpus of Simple Wikipedia articles from the dumped Simple Wikipedia XML file. Once the text data (articles) has been retrieved, machine learning techniques can use it for model training in order to discover topics from the text corpus.

Retrieving the text data from the Simple Wikipedia dump involves two main steps:

1. XML parsing of the wiki dump
2. Cleaning of the articles’ text

Simple Wikipedia is an edition of the online encyclopedia Wikipedia, primarily written in Basic English. Its articles are usually shorter than their English Wikipedia counterparts and present only the basic information. It contains over 127,000 content pages for people to search, explore or even edit. We downloaded the free backup XML file in which all the articles are dumped, and then randomly selected a sample of 60,000 Simple Wikipedia articles for building the application. You can download the same backup XML file (used in this blog) from here, or from the index of the Simple Wikipedia website.

## 1. XML Parsing of the Wiki Dump

In the XML file, the article text sits in the text tag, which is one of the children of the revision tag (revision itself being a child of the page tag). We will use the ElementTree XML API to parse the XML file and extract the text portion of each article. The Python code further below traverses down the tree to get the content of the text tag. The contents of each article are extracted from the text tag of the corresponding page, one page per iteration, and written to separate text files.
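To make the page, revision and text nesting concrete, here is a minimal sketch that runs on a tiny hand-written XML string instead of the real dump. The sample document, the `NS` mapping name and the `extract_texts` helper are assumptions made for illustration; only the namespace URI is taken from the parsing code used in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump with the same page -> revision -> text nesting.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>April</title>
    <revision>
      <text>April is the fourth month of the year.</text>
    </revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text /></revision>
  </page>
</mediawiki>"""

# Prefix-to-URI mapping so the long namespace is written only once.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

def extract_texts(xml_string):
    """Return the non-empty <text> contents, one entry per <page>."""
    root = ET.fromstring(xml_string)
    texts = []
    for page in root.findall("mw:page", NS):
        text_node = page.find("mw:revision/mw:text", NS)
        # Pages without a revision or with an empty text tag are skipped.
        if text_node is not None and text_node.text:
            texts.append(text_node.text)
    return texts

articles = extract_texts(SAMPLE)
print(articles)
```

The path expression `mw:revision/mw:text` does the same walk as nested loops over the children of page and revision, just in one call.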

## 2. Cleaning of Article Text

Data pre-processing (a.k.a. data cleaning) is one of the most significant steps in text analytics. The purpose is to remove any unwanted words or characters which are written for human readability, but won’t contribute to topic modelling in any way.

There are mainly two steps that need to be done at the word level:

a) Removal of stop words: Stop words like “and”, “if”, “the”, etc. are very common in all English sentences and are not very meaningful in deciding the theme of the article, so these words are removed from the articles.

b) Lemmatization: It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes” and “included” would all be represented as “include”. Unlike stemming (another buzzword in text mining), which chops off suffixes by rule, lemmatization maps each word to a dictionary form and can take its part of speech into account.
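The two word-level steps can be sketched with nothing but the standard library. The stop list and the lemma table below are tiny hand-written stand-ins (assumptions for illustration only); a real pipeline would use a full stop-word list and a dictionary-based lemmatizer, as done later in this series.

```python
# Hypothetical, hand-written stop list and lemma table, only for illustration.
STOP_WORDS = {"and", "if", "the", "a", "is", "in", "of"}
LEMMAS = {"includes": "include", "included": "include", "including": "include"}

def normalize(sentence):
    """Lower-case, drop stop words, then map inflected forms to one lemma."""
    tokens = sentence.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]

result = normalize("The price includes tax and the fee included delivery")
print(result)
```

Both “includes” and “included” end up as the single token “include”, so they count as the same word when topics are learned.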

In [0]:
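import xml.etree.ElementTree as ET
import codecs
import os
import re

def is_ascii(s):
    # True when every character is plain ASCII (used to keep English-only articles).
    return all(ord(c) < 128 for c in s)

tree = ET.parse('/content/simplewiki-20170201-pages-articles-multistream.xml')
root = tree.getroot()
dir_path = '/content/articles-corpus/'
# Create the output directory up front; writing the article files fails otherwise.
os.makedirs(dir_path, exist_ok=True)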

for i,page in enumerate(root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')):
    for p in page:    
        if p.tag == "{http://www.mediawiki.org/xml/export-0.10/}revision":
            for x in p:
                if x.tag == "{http://www.mediawiki.org/xml/export-0.10/}text":                    
                    article_txt = x.text
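                    if article_txt is not None:
                        # Keep only the lead section, i.e. the text before the first "==" heading.
                        # split() leaves the text intact when no heading exists, whereas
                        # slicing with find() would drop the last character (find() returns -1).
                        article_txt = article_txt.split("==", 1)[0]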
                        article_txt = re.sub(r"{{.*}}","",article_txt)
                        article_txt = re.sub(r"\[\[File:.*\]\]","",article_txt)
                        article_txt = re.sub(r"\[\[Image:.*\]\]","",article_txt)
                        article_txt = re.sub(r"\n: \'\'.*","",article_txt)
                        article_txt = re.sub(r"\n!.*","",article_txt)
                        article_txt = re.sub(r"^:\'\'.*","",article_txt)
                        article_txt = re.sub(r"&nbsp","",article_txt)
                        article_txt = re.sub(r"http\S+","",article_txt)
                        article_txt = re.sub(r"\d+","",article_txt)   
                        article_txt = re.sub(r"\(.*\)","",article_txt)
                        article_txt = re.sub(r"Category:.*","",article_txt)
                        article_txt = re.sub(r"\| .*","",article_txt)
                        article_txt = re.sub(r"\n\|.*","",article_txt)
                        article_txt = re.sub(r"\n \|.*","",article_txt)
                        article_txt = re.sub(r".* \|\n","",article_txt)
                        article_txt = re.sub(r".*\|\n","",article_txt)
                        article_txt = re.sub(r"{{Infobox.*","",article_txt)
                        article_txt = re.sub(r"{{infobox.*","",article_txt)
                        article_txt = re.sub(r"{{taxobox.*","",article_txt)
                        article_txt = re.sub(r"{{Taxobox.*","",article_txt)
                        article_txt = re.sub(r"{{ Infobox.*","",article_txt)
                        article_txt = re.sub(r"{{ infobox.*","",article_txt)
                        article_txt = re.sub(r"{{ taxobox.*","",article_txt)
                        article_txt = re.sub(r"{{ Taxobox.*","",article_txt)
                        article_txt = re.sub(r"\* .*","",article_txt)
                        article_txt = re.sub(r"<.*>","",article_txt)
                        article_txt = re.sub(r"\n","",article_txt)  
                        article_txt = re.sub(r"\!|\"|\#|\$|\%|\&|\'|\(|\)|\*|\+|\,|\-|\.|\/|\:|\;|\<|\=|\>|\?|\@|\[|\\|\]|\^|\_|\`|\{|\||\}|\~"," ",article_txt)
                        article_txt = re.sub(r" +"," ",article_txt)
                        article_txt = article_txt.replace(u'\xa0', u' ')
                       
                        if article_txt and len(article_txt) > 150 and is_ascii(article_txt):
                            outfile = dir_path + str(i+1) +"_article.txt"
                            # The with-block closes the file even if the write fails.
                            with codecs.open(outfile, "w", "utf-8") as f:
                                f.write(article_txt)

The text filters in the above code snippet are applied to the text extracted from the text tag (Figure 1). Finally, we keep only those articles that are longer than 150 characters and that contain only ASCII characters (English characters only).
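To see what such filters do to a single article, here is a small sketch that applies a subset of them to a made-up wikitext string. The `strip_wiki_markup` helper and the sample text are assumptions for illustration, and the punctuation step uses a compact character class in place of the long alternation in the cell above.

```python
import re

def strip_wiki_markup(text):
    """Apply a few of the cleaning filters to one article's wikitext."""
    text = text.split("==", 1)[0]                  # keep only the lead section
    text = re.sub(r"\{\{.*\}\}", "", text)         # single-line templates
    text = re.sub(r"\[\[File:.*\]\]", "", text)    # embedded files
    text = re.sub(r"http\S+", "", text)            # bare URLs
    text = re.sub(r"\d+", "", text)                # digits
    text = re.sub(r"<.*>", "", text)               # HTML-like tags
    text = re.sub(r"[^\w\s]|_", " ", text)         # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace

# Hypothetical article source, written by hand for this example.
sample = (
    "{{Infobox month}}\n"
    "[[File:April.jpg|thumb]]\n"
    "'''April''' is the 4th month of the year.<ref>http://example.org</ref>\n"
    "== Events ==\n"
    "Many things happen."
)
cleaned = strip_wiki_markup(sample)
print(cleaned)
```

Note the leftover “th” where “4th” used to be: removing digits is crude, and fragments like this are one reason very short tokens are dropped later in the pipeline.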

This completes the first step towards topic modelling, i.e. creating the corpus of articles from Simple Wikipedia. If you have followed the post up to here, you will be able to create a corpus of around 70,000 articles in the directory “articles-corpus” used in the Python program. In the next blog-post, I will write about discovering the hidden topics from this corpus. So stay tuned till then!

## Topic Modelling (Part 2): Discovering Topics from Articles with Latent Dirichlet Allocation

This blog-post is the second in the series covering “Topic Modelling” from Simple Wikipedia articles. Before reading this post, I would suggest reading our first article here. In that first step towards topic modelling, we created a corpus of around 70,000 articles from Simple Wikipedia in the directory “articles-corpus”.

Look at the featured image of this blog-post: these are some of the topics (word distributions) which are the outcome of the experiment undertaken in this post. Let’s get started with discovering topics from the corpus of wiki articles. We will be using an unsupervised machine learning technique, Latent Dirichlet Allocation (LDA), to automatically group words that occur together into a mixture, thus forming a topic or theme. For such a huge corpus of articles, we have no information about the categories to which the articles belong. This makes it an unsupervised problem, where we do not know the labels/classes/categories of the data and aim to find the groups or clusters within the population. Having said that, here are the steps we have to perform in order to discover the topics hidden in the 60,000 articles serving as training data:


1. Pre-processing and training corpus creation
2. Building dictionary
3. Feature extraction
4. LDA model training


## 1. Preprocessing & Training Data Preparation

As discussed in Part-1, we need to remove the stop words from the articles because they do not contribute to the theme of the article’s content. Similarly, stemming or lemmatization is an effective way to treat the various inflected forms of a word as a single word, as they essentially mean the same. I would encourage you to go through the previous post (Part-1) if the above sentences do not make sense to you.

In [13]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [0]:
import os
import random
import codecs
import pickle
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
 
# Function to remove stop words from sentences & lemmatize words.
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    normalized = " ".join(lemma.lemmatize(word,'v') for word in stop_free.split())
    x = normalized.split()
    y = [s for s in x if len(s) > 2]
    return y
 
# Remember this folder contains 72,000 articles extracted in Part-1 (previous post)
corpus_path = "/content/articles-corpus/"
article_paths = [os.path.join(corpus_path,p) for p in os.listdir(corpus_path)]
 
# Read contents of all the articles in a list "doc_complete"
doc_complete = []
for path in article_paths:
    with codecs.open(path,'r','utf-8') as fp:
        doc_complete.append(fp.read())
 
# Randomly sample 70000 articles from the corpus created from wiki_parser.py
# (random.sample raises ValueError if the corpus holds fewer than 70000 articles)
docs_all = random.sample(doc_complete, 70000)
# Save the sampled articles; the with-block makes sure the pickle is flushed to disk.
with open("docs_wiki.pkl",'wb') as docs:
    pickle.dump(docs_all,docs)
 
# Use 60000 articles for training.
docs_train = docs_all[:60000]
 
# Cleaning all the 60,000 simplewiki articles
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
doc_clean = [clean(doc) for doc in docs_train]

In the above code, we read all the articles into a list and create the training data by choosing 60,000 articles from 70,000 randomly sampled articles of that list. The remaining 10,000 articles are kept for testing (document clustering/categorization) in Part-3. Further, the articles are cleaned by removing stop words, passing each word of the corpus through “WordNetLemmatizer” (as a verb) and dropping tokens shorter than 3 characters. As a result, we get cleaned articles on which we can build the dictionary and train the LDA model for topic modelling.

In [38]:
doc_clean[0]

['francis',
 'newton',
 'frank',
 'gifford',
 'american',
 'football',
 'player',
 'television',
 'sportscaster',
 'know',
 'broadcast',
 'monday',
 'night',
 'football',
 'gifford',
 'play',
 'new',
 'york',
 'giants',
 'bear',
 'santa',
 'monica',
 'california',
 'august',
 'gifford',
 'die',
 'heart',
 'attack',
 'home',
 'greenwich',
 'connecticut',
 'age',
 'suffer',
 'chronic',
 'traumatic',
 'encephalopathy']

## 2.  Building word dictionary

In this step, we need to build the vocabulary of the corpus, in which every unique word of the article corpus is given an ID and its frequency count is also stored. The following Python code creates the dictionary from the 60,000 randomly sampled, cleaned articles. Note that we use the gensim library for building the dictionary. In gensim, the words are referred to as “tokens” and the index of each word in the dictionary is called its “id”.

In [0]:
from gensim import corpora
# Creating term dictionary of corpus, where each unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)
 
# Filter out terms which occur in fewer than 4 articles or in more than 40% of the articles
dictionary.filter_extremes(no_below=4, no_above=0.4)
 
# List of few words which are removed from dictionary as they are content neutral
stoplist = set('also use make people know many call include part find become like mean often different \
               usually take wikt come give well get since type list say change see refer actually iii \
               aisne kinds pas ask would way something need things want every str'.split())
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
dictionary.filter_tokens(stop_ids)

Two additional steps are performed after creating the dictionary:

1. All the tokens that occur in fewer than 4 articles or in more than 40% of the articles are removed from the dictionary, as these words do not contribute to the various themes or topics.
2. After printing the most frequent words of the dictionary, we found that a few mostly content-neutral words were also present. Such words may lead to a “word distribution” (topic) that is neutral and does not capture any theme or content, so we made a list of them and filtered them out.

Once you have built the dictionary, you can print it to see the number of unique tokens and a sample of the vocabulary:

In [40]:
print(dictionary)

Dictionary(26530 unique tokens: ['age', 'american', 'attack', 'august', 'bear']...)


Each word is also given a unique id in the vocabulary (dictionary).

## 3.  Feature Extraction (Bag of Words)

Histograms of words are the features used for text representation. In general, we first build the vocabulary of the article corpus and then generate a word count vector for each article, which holds the frequencies of all the vocabulary words in that particular article. Most entries will be zero, as a single article won’t contain all the words in the vocabulary. For example, suppose the vocabulary has 500 words. Each word count vector then contains the frequencies of these 500 words in a particular wiki article. Suppose the text of an article was “Get the work done, work done”. A fixed-length encoding will be generated as [0,0,0,0,0,…,0,0,2,0,0,0,…,0,0,1,0,0,…,0,0,1,0,0,…,2,0,0,0,0,0]. Here, the non-zero word counts sit at the 296th, 359th, 415th and 495th indices of the 500-length word count vector and the rest are zero.

In [0]:
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

The above Python code uses gensim to convert all the 60,000 articles into a document term matrix (word count vector for each document).
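The word count vector from the “Get the work done, work done” example can be reproduced in a few lines of plain Python. The six-word vocabulary and the helper names below are assumptions for illustration. The sketch builds both the fixed-length (dense) vector described above and the sparse list of (id, count) pairs, which is the form gensim's doc2bow returns (only the non-zero entries).

```python
from collections import Counter

# Hypothetical 6-word vocabulary; real ids come from the gensim dictionary.
vocabulary = ["done", "get", "job", "the", "time", "work"]
token2id = {token: idx for idx, token in enumerate(vocabulary)}

def dense_counts(tokens):
    """Fixed-length vector with one count per vocabulary word."""
    vector = [0] * len(token2id)
    for token in tokens:
        if token in token2id:
            vector[token2id[token]] += 1
    return vector

def sparse_counts(tokens):
    """Only the non-zero entries, as sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

tokens = "get the work done work done".split()
dense = dense_counts(tokens)
sparse = sparse_counts(tokens)
print(dense)
print(sparse)
```

With a vocabulary of tens of thousands of tokens, almost every entry of the dense vector is zero, which is why the sparse form is the practical choice.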

## 4. LDA Model Training

We have finally arrived at the training phase of topic modelling. Latent Dirichlet Allocation is an unsupervised probabilistic model which is used to discover latent themes in documents. Let’s briefly try to understand how the LDA technique works.

The LDA technique makes the following two assumptions:
1. Articles/documents are produced from a mixture of topics. Each article belongs to each of the topics to a certain degree (each article is made up of some topic distribution).
2. Each topic is a generative model which generates words of the vocabulary with certain probabilities. Words that frequently occur together will have a higher probability (each topic is made up of some word distribution).

So, can you guess the input to this algorithm?

The input is the “document-term matrix”, which keeps the histogram of words (word counts) present in each wiki article. The dimensions of the matrix are (M, N), i.e. number of documents × number of words in the vocabulary. “Documents” and “articles” are used interchangeably here. We also provide K as an input, which is the number of topics to be discovered.

What is the output of the Latent Dirichlet Allocation algorithm?

The output of the LDA algorithm is 2 smaller matrices: a document-topic matrix and a topic-word matrix. The document-topic matrix has dimensions (M, K), where M is the number of articles and K is the number of topics. The topic-word matrix has dimensions (K, N), where N is the number of words in the vocabulary.

Each row of the document-topic matrix is the probability distribution of the topics present in that article. Similarly, each row of the topic-word matrix is the probability distribution of the words generated from that topic. Both matrices are initialized randomly, and the distributions are then improved in an iterative process. After repeating the update step a large number of times, you’ll eventually reach an approximately steady state where these distributions seem logically correct.
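How the two matrices fit together can be shown with toy numbers. Under the LDA assumptions, the probability of a word in a document is the sum over topics of (topic share in the document) × (word probability in the topic), which is the product of the (M, K) and (K, N) matrices. All values and the `word_distribution` helper below are made up for illustration; they are not output of the trained model.

```python
# Hypothetical toy sizes: M = 2 documents, K = 2 topics, N = 3 vocabulary words.
doc_topic = [
    [0.9, 0.1],   # document 0 is mostly topic 0
    [0.2, 0.8],   # document 1 is mostly topic 1
]
topic_word = [
    [0.7, 0.2, 0.1],   # topic 0 favours word 0
    [0.1, 0.3, 0.6],   # topic 1 favours word 2
]

def word_distribution(doc_topic, topic_word):
    """Multiply the (M, K) and (K, N) matrices: per-document word probabilities."""
    n_words = len(topic_word[0])
    return [
        [sum(p_topic * topic_word[k][n] for k, p_topic in enumerate(row))
         for n in range(n_words)]
        for row in doc_topic
    ]

doc_word = word_distribution(doc_topic, topic_word)
for row in doc_word:
    print([round(p, 2) for p in row])
```

Each resulting row is again a probability distribution (it sums to 1), and document 0 puts most of its mass on word 0 because it is dominated by topic 0.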

The following Python code runs the LDA algorithm using the gensim implementation. Once the training is completed, the model is dumped using the pickle library for future use and all the 50 topics (learned by the model) are printed.

In [42]:
from gensim.models.ldamodel import LdaModel as Lda
# Creating the object for LDA model using gensim library & Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=50, id2word = dictionary, passes=50, iterations=500)
 
# Dump the LDA model using pickle for future use
with open('lda_model_sym_wiki.pkl','wb') as ldafile:
    pickle.dump(ldamodel,ldafile)
 
# Print all the 50 topics
for i,topic in enumerate(ldamodel.print_topics(num_topics=50, num_words=10)):
   words = topic[1].split("+")
   print(words,"\n")

['0.265*"france" ', ' 0.118*"commune" ', ' 0.084*"region" ', ' 0.073*"department" ', ' 0.064*"regions" ', ' 0.059*"departments" ', ' 0.034*"north" ', ' 0.027*"calvados" ', ' 0.026*"calais" ', ' 0.017*"northwest"'] 

['0.053*"minister" ', ' 0.040*"prime" ', ' 0.036*"court" ', ' 0.034*"roman" ', ' 0.028*"carolina" ', ' 0.027*"italian" ', ' 0.021*"rome" ', ' 0.020*"movement" ', ' 0.018*"north" ', ' 0.017*"wild"'] 

['0.045*"build" ', ' 0.030*"air" ', ' 0.029*"force" ', ' 0.020*"fire" ', ' 0.019*"airport" ', ' 0.018*"attack" ', ' 0.017*"ship" ', ' 0.015*"fly" ', ' 0.015*"aircraft" ', ' 0.014*"military"'] 

['0.053*"color" ', ' 0.040*"red" ', ' 0.037*"image" ', ' 0.033*"blue" ', ' 0.021*"world" ', ' 0.020*"jpg" ', ' 0.019*"file" ', ' 0.018*"colour" ', ' 0.017*"heritage" ', ' 0.016*"map"'] 

['0.192*"movie" ', ' 0.039*"play" ', ' 0.034*"star" ', ' 0.034*"direct" ', ' 0.031*"release" ', ' 0.029*"director" ', ' 0.021*"review" ', ' 0.020*"american" ', ' 0.020*"comedy" ', ' 0.016*"drama"'] 

['0

Each row above represents a topic and its word distribution.

## Final Thoughts

I hope the blog-post was easy to follow up to here. In this post, I have tried to explain the pipeline of the topic discovery process, from preparing the training data to training the LDA model. I have also tried to briefly explain the Latent Dirichlet Allocation algorithm to give an idea of what goes into and what comes out of the LDA model. I would encourage readers to implement this series of blog-posts (see Part-1) and match their outputs with the results shown here (though the topics discovered can differ from run to run).

There are several factors that you can experiment with in order to get even better word distributions forming the topics:

1. Getting more articles: You can try increasing the number of articles by lowering the minimum article length from 150 to 100 characters in Part-1. Also see if you can avoid discarding articles which contain only a few non-ASCII characters. More training data may lead to better topic-word distributions.

2. Preprocessing: By analyzing the word distributions of the topics generated, you may find

  a) pairs that are always juxtaposed (entities), e.g. “Los Angeles” (topic 20), “New York” (topic 27). These pairs should be combined like Los_Angeles or New_York.

  b) words that are not properly lemmatized, like (germany, german), (chinese, china), (america, americans) etc. Lemmatization of nouns may help. Remember, we only lemmatized verbs with lemma.lemmatize(word,'v').

3. Dictionary: The vocabulary of the corpus can be improved by removing the content-neutral words. Iteratively running the whole topic discovery process and analyzing the word distributions (topics) can help in finding content-neutral words in the dictionary. Some examples are “ing” (topic 10) and “per” (topic 43).

4. Parameters of LDA: There are two parameters of LDA to look at, alpha and beta (exposed as the alpha and eta arguments of gensim's LdaModel). Understanding the mathematics behind the LDA model may help in tuning these parameters. I would encourage readers to do so.
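For the juxtaposed pairs mentioned under Preprocessing, a minimal sketch of the merging step could look like the following. The `merge_pairs` helper and the hand-picked pair set are assumptions for illustration; in practice the pairs would be detected from co-occurrence counts rather than listed by hand.

```python
def merge_pairs(tokens, pairs):
    """Join adjacent tokens that form a known pair, e.g. new + york -> new_york."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2   # both tokens were consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical hand-picked pairs, for illustration only.
PAIRS = {("los", "angeles"), ("new", "york")}
merged = merge_pairs("gifford play new york giants".split(), PAIRS)
print(merged)
```

Running this on the cleaned token lists before building the dictionary would let “new_york” get its own id instead of being split across “new” and “york”.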

## Topic Modelling (Part 3): Document Clustering, Exploration & Theme Extraction from SimpleWiki Articles

In this final blog-post of the “Topic Modelling” series, we will see the following uses of the knowledge acquired (the topics discovered) after LDA training:

1. Document Clustering: Clustering similar wiki articles into 50 clusters.
2. Document Exploration: Given some word, search for related articles.
3. Theme Extraction: Find out the theme of an article.

As the first step, we will write a function to clean the test articles, similar to the one we used before training on the corpus. This pre-processing is always required before feeding articles to any of the above applications.

In [0]:
def rem_ascii(s):
    # Drop every non-ASCII character from the string.
    return "".join([ch for ch in s if ord(ch) < 128 ])

In the pre-processing step, we are basically removing the non-ASCII characters, punctuation marks and stop words. Other than that, we are also lemmatizing the words of the input articles.

## 1. Document Clustering

Document clustering is an unsupervised approach that groups articles according to the topics discovered in the training phase. It takes a corpus of unlabeled articles as input and categorizes them into groups according to the best matched word distributions (topics) generated from training. The steps performed for document clustering are:

1. Clean all the articles in the input corpus.
2. Convert each text article into bag-of-words features using the same dictionary as the trained model.
3. Extract the best matched topic for each article using the trained LDA model. The gensim implementation provides the get_document_topics() function for this.
4. Write the article to the directory belonging to the extracted topic if the minimum probability criterion is satisfied; otherwise push it to the “unknown” directory.
5. If the extracted topic (word distribution) is ambiguous, choose the 2nd best matched topic (as some of the 50 discovered topics are content neutral).
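The decision made in the last three steps can be sketched on its own, without a trained model. The `pick_cluster` helper and its `neutral_topics` argument are hypothetical names introduced for this example; the input mimics the list of (topic id, probability) pairs that the model returns for one article.

```python
from operator import itemgetter

def pick_cluster(doc_topics, neutral_topics=frozenset(), min_prob=0.20):
    """Return the cluster name for one article.

    doc_topics: list of (topic_id, probability) pairs for the article.
    neutral_topics: ids of topics judged content neutral (assumed to be known).
    """
    # Keep topics above the threshold, best match first.
    ranked = sorted(
        (pair for pair in doc_topics if pair[1] >= min_prob),
        key=itemgetter(1),
        reverse=True,
    )
    # Fall through to the next best topic when the best one is content neutral.
    for topic_id, _prob in ranked:
        if topic_id not in neutral_topics:
            return str(topic_id)
    return "unknown"

print(pick_cluster([(3, 0.5), (7, 0.3)]))
print(pick_cluster([(3, 0.5), (7, 0.3)], neutral_topics={3}))
print(pick_cluster([(3, 0.1)]))
```

The returned string can be used directly as the name of the sub-directory the article is written to.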


In [0]:
def clean_doc(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    normalized = " ".join(lemma.lemmatize(word,'v') for word in stop_free.split())
    x = normalized.split()
    y = [s for s in x if len(s) > 2]
    return y

In [0]:
def cluster_similar_documents(corpus, dirname):
    clean_docs = [clean_doc(doc) for doc in corpus]
    test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
    doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
    for k,topics in enumerate(doc_topics):
        if topics:
            # Best matched topic first; its id names the cluster directory.
            topics.sort(key = itemgetter(1), reverse=True)
            dir_name = dirname + "/" + str(topics[0][0])
            content = corpus[k] + "\n\n" + str(topics[0][1])
        else:
            # No topic reached the minimum probability.
            dir_name = dirname + "/unknown"
            content = corpus[k]
        os.makedirs(dir_name, exist_ok=True)
        # Write from the corpus argument (not the global docs_test) and close every file.
        with open(os.path.join(dir_name, str(k) + ".txt"), "w") as fp:
            fp.write(content)

The above Python function follows these steps to perform document clustering on a given article corpus. It also takes a parameter dirname, under which it creates one sub-directory per matched topic (up to 50) containing the clustered articles, plus an “unknown” sub-directory.

## 2. Document Exploration

Document exploration is another application that can be built on top of the trained LDA model. Here, given a word or theme, we extract the documents related to it. It is mainly a two-step process:

1. Get the best matched topic cluster (highest probability) for the given word.
2. Get the “top” most probable related articles from the topic cluster matched in step 1.

get_term_topics() is the function used for getting the best matched topic cluster for a given theme/word.

In [0]:
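def get_related_documents(term, top, corpus):
    # Skip terms that are missing from the model's dictionary.
    if term not in ldamodel.id2word.token2id:
        print("Term not in dictionary:", term)
        return
    term_topics = ldamodel.get_term_topics(term, minimum_probability=0.000001)
    if not term_topics:
        print("No topic found for term:", term)
        return
    # Best matched topic for the term (highest probability).
    best_topic = max(term_topics, key=itemgetter(1))[0]

    clean_docs = [clean_doc(doc) for doc in corpus]
    test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
    doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
    related_docid = []
    for k,topics in enumerate(doc_topics):
        if topics:
            topics.sort(key = itemgetter(1), reverse=True)
            if topics[0][0] == best_topic:
                related_docid.append((k,topics[0][1]))
 
    # Print the "top" articles with the highest probability for that topic.
    related_docid.sort(key = itemgetter(1), reverse=True)
    for doc_id, prob in related_docid[:top]:
        print(prob,"\n\n",corpus[doc_id])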

The above Python function implements a document exploration system: given a word/theme/topic as input, it prints the “top” most related articles from the Simple Wikipedia test corpus. The test corpus is also given as input to the function.
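The ranking part of this two-step process can be tried without the model. In the sketch below, `doc_best_topics` stands in for the best (topic id, probability) pair of each test article and `term_topic` for the topic matched to the search word; both the helper name and the sample values are made up for illustration.

```python
def related_documents(term_topic, doc_best_topics, top):
    """Return the `top` (doc_id, probability) pairs whose best topic is term_topic."""
    matches = [
        (doc_id, prob)
        for doc_id, (topic, prob) in enumerate(doc_best_topics)
        if topic == term_topic
    ]
    # Most probable articles first.
    matches.sort(key=lambda match: match[1], reverse=True)
    return matches[:top]

# Hypothetical best topics for four test articles.
doc_best = [(2, 0.6), (5, 0.9), (2, 0.8), (2, 0.3)]
hits = related_documents(2, doc_best, top=2)
print(hits)
```

Article 1 has the highest probability overall but belongs to a different topic, so it is not returned.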

## 3. Theme Extraction

We know that 50 word distributions were discovered in Part-2 of this blog-post series. I have manually given a theme name to each of them. You may give different theme names depending on how you read the word distributions. If each discovered word distribution accurately belongs to a particular theme, then theme extraction from articles is another useful application that can be implemented. You can view this file to see the mapping of the manually given themes to the word distributions discovered.

In [0]:
def get_theme(doc):
    # below topics variable is manually defined by looking at 50 topics created by LDA model
    topics = "Electrical_systems_or_Education unknown music unknown Software International_event Literature War_or_Church Lingual_or_Research Biology Waterbody Wikipedia_or_Icehockey unknown unknown html_tags sports TV_shows Terms_and_Services music US_states Timeline Chemistry Germany Location_area Film_awards Games US_school unknown Railways Biography Directions_Australlia France India_Pakistan Canada_politcs_or_WWE Politics unknown British_Royal_Family American_Movies unknown Colors_or_Birds Fauna Chinese_Military unknown unknown unknown unknown unknown html_tags US_Govt Music_band".split()
    theme = ""
    cleandoc = clean_doc(doc)
    doc_bow = ldamodel.id2word.doc2bow(cleandoc)
    doc_topics = ldamodel.get_document_topics(doc_bow, minimum_probability=0.20)
    if doc_topics:
        doc_topics.sort(key = itemgetter(1), reverse=True)
        theme = topics[doc_topics[0][0]]
        if theme == "unknown" and len(doc_topics) > 1:
            # Fall back to the second-best topic only when one exists.
            theme = topics[doc_topics[1][0]]
    else:
        theme = "unknown"
    return theme

The above Python function extracts the theme of the article given as an argument. Having written 3 different functions, we will now see how to call them from the main program. The following Python snippet can be executed to perform these applications:

In [49]:
import pickle
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from operator import itemgetter
import os
 
# initialize WordNetLemmatizer and get the list of english stop words
stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
 
# Load trained LDA model (described in Part-2 of blog-post series)
with open("/content/lda_model_sym_wiki.pkl", 'rb') as lda_fp:
    ldamodel = pickle.load(lda_fp)
 
# Load the articles corpus and keep the last 10,000 articles for test purposes
with open("/content/docs_wiki.pkl", 'rb') as docs_fp:
    docs_all = pickle.load(docs_fp)
docs_test = docs_all[60000:]
 
# Get 'top' related documents given a word(term)
get_related_documents("music",5,docs_test)
# performs document clustering given a set of documents
cluster_similar_documents(docs_test,'/content/cluster')
 
# Extract the theme of article
article = "Mohandas Karamchand Gandhi[14] was born on 2 October 1869[1] to a Hindu Modh Baniya family[15] in Porbandar (also known as Sudamapuri), a coastal town on the Kathiawar Peninsula and then part of the small princely state of Porbandar in the Kathiawar Agency of the Indian Empire. His father, Karamchand Uttamchand Gandhi (1822–1885), served as the diwan (chief minister) of Porbandar state.[16] Although he only had an elementary education and had previously been a clerk in the state administration, Karamchand proved a capable chief minister.[17] During his tenure, Karamchand married four times. His first two wives died young, after each had given birth to a daughter, and his third marriage was childless. In 1857, Karamchand sought his third wife's permission to remarry; that year, he married Putlibai (1844–1891), who also came from Junagadh,[18] and was from a Pranami Vaishnava family.[19][20][21][22] Karamchand and Putlibai had three children over the ensuing decade, a son, Laxmidas (c. 1860 – March 1914), a daughter, Raliatbehn (1862–1960) and another son, Karsandas (c. 1866–1913)"
print("For the given article :", "\n")
print("Theme -> ",get_theme(article))

0.6751872 

  Emotional Rescue is the tenth Worldwide studio album and seventeenth American studio album by The Rolling Stones The album was released on June It was the group s seventh studio album with label Rolling Stones Records 
0.64980745 

  album ratings Bathory is the first studio album by the Sweden Swedish extreme metal band Bathory Bathory It was released in October through Tyfon Grammofon Many people say that this album is the very first black metal album 
0.64852554 

  Their Satanic Majesties Request is the eighth American studio album and first Worldwide studio album by The Rolling Stones The album was released on December 
0.64146054 

  The World Needs a Hero is the ninth studio album by American thrash metal band Megadeth The World Needs a Hero was released on May through Sanctuary Records Group Sanctuary Records It is Megadeth s last album to feature original bassist David Ellefson until he returned in It is also the only Megadeth album to feature Al Pitrelli on lead