# LDA Topic Modelling in Python: Example 1

This notebook works through a use case of Gensim’s LDA model, based on the "LDA Topic Modelling in Python" [Medium article](https://medium.com/swlh/lda-topic-modelling-in-python-7e9d08a64f33) posted by George Pipis.

## Loading the data and required libraries

In [3]:
import pandas as pd
import gensim
from sklearn.feature_extraction.text import CountVectorizer
documents = pd.read_csv('news-data.csv')
documents.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [6]:
for i in range(5):
    print(documents.headline_text[i])

aba decides against community broadcasting licence
act fire witnesses must be aware of defamation
a g calls for infrastructure protection summit
air nz staff in aust strike for pay rise
air nz strike to affect australian travellers


## Pre-Processing

### CountVectoriser

In [7]:
vect = CountVectorizer(min_df=20, # remove tokens that don't appear in at least 20 documents
                       max_df=0.2, # remove tokens that appear in more than 20% of the documents
                       stop_words='english', # remove stop words
                       token_pattern='(?u)\\b\\w\\w\\w+\\b') # keep only tokens of three or more word characters

X = vect.fit_transform(documents.headline_text) # fit and transform the data
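
The effect of the `token_pattern`, `min_df` and `max_df` settings can be illustrated without scikit-learn. The block below is only a sketch of the filtering idea in plain Python, not `CountVectorizer`'s actual implementation: stop-word removal is left out, and `toy_docs`, `MIN_DF` and `MAX_DF` are hypothetical values chosen to suit a four-document example.

```python
import re
from collections import Counter

# Same pattern as in the cell above, written as a raw string.
TOKEN_PATTERN = r"(?u)\b\w\w\w+\b"

# The pattern keeps tokens of three or more word characters, so "a" and "g" are dropped.
headline = "a g calls for infrastructure protection summit"
tokens = re.findall(TOKEN_PATTERN, headline.lower())
print(tokens)  # ['calls', 'for', 'infrastructure', 'protection', 'summit']

# Hypothetical toy corpus and thresholds (the notebook uses min_df=20, max_df=0.2).
toy_docs = [
    "air nz staff strike",
    "air nz strike travellers",
    "air fire witnesses",
    "air fire summit",
]
MIN_DF = 2     # absolute count: a term must appear in at least 2 documents
MAX_DF = 0.75  # proportion: a term may appear in at most 75% of documents

# Document frequency: the number of documents in which each term occurs.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(re.findall(TOKEN_PATTERN, doc.lower())))

kept = sorted(
    term
    for term, df in doc_freq.items()
    if df >= MIN_DF and df / len(toy_docs) <= MAX_DF
)
print(kept)  # ['fire', 'strike']
```

Here "air" is removed because it occurs in every document, and the terms that occur only once fall below the minimum document frequency.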

### Converting our sparse matrix to a Gensim corpus

In [8]:
# Wrap the sparse document-term matrix as a gensim corpus (rows are documents)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

In [9]:
# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
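
To make the role of `id_map` concrete, here is a minimal sketch in plain Python. It assumes a hypothetical three-word vocabulary in the same term-to-index shape as `vect.vocabulary_`, and one toy document in the bag-of-words form that a Gensim corpus yields: a list of `(term_id, count)` pairs.

```python
# Hypothetical toy vocabulary: term -> column index, as in CountVectorizer.vocabulary_
toy_vocabulary = {"calls": 0, "government": 1, "summit": 2}

# Invert it to index -> term, which is the shape expected for id2word.
toy_id_map = {index: term for term, index in toy_vocabulary.items()}

# One toy document in bag-of-words form: (term_id, count) pairs.
toy_doc = [(0, 1), (2, 1)]

# Decode the term ids back into words.
decoded = [(toy_id_map[term_id], count) for term_id, count in toy_doc]
print(decoded)  # [('calls', 1), ('summit', 1)]
```

Without this reverse mapping, the topics would be reported in terms of integer ids rather than words.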

## LDA Modelling

In [10]:
# Use the gensim.models.LdaMulticore constructor (the parallel LDA implementation)
# to estimate LDA model parameters on the corpus, and save to the variable `ldamodel`

ldamodel = gensim.models.LdaMulticore(corpus=corpus, 
                                      id2word=id_map, 
                                      passes=2,
                                      random_state=5, 
                                      num_topics=10, 
                                      workers=2)

## Exploring the topics

In [11]:
for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.014*"west" + 0.013*"high" + 0.012*"market" + 0.011*"qld" + 0.009*"rise" + 0.009*"tax" + 0.008*"nsw" + 0.008*"share" + 0.008*"premier" + 0.007*"business"


Topic: 1 
Words: 0.015*"open" + 0.013*"live" + 0.012*"time" + 0.012*"report" + 0.011*"island" + 0.010*"australia" + 0.009*"commission" + 0.009*"says" + 0.009*"inquiry" + 0.009*"western"


Topic: 2 
Words: 0.026*"government" + 0.019*"nsw" + 0.016*"says" + 0.014*"health" + 0.013*"election" + 0.012*"indigenous" + 0.011*"calls" + 0.011*"council" + 0.011*"labor" + 0.009*"help"


Topic: 3 
Words: 0.014*"afl" + 0.012*"win" + 0.011*"2015" + 0.010*"change" + 0.009*"energy" + 0.009*"australia" + 0.009*"media" + 0.009*"week" + 0.008*"final" + 0.008*"talks"


Topic: 4 
Words: 0.024*"south" + 0.023*"adelaide" + 0.015*"rural" + 0.014*"national" + 0.013*"perth" + 0.011*"city" + 0.010*"community" + 0.009*"hobart" + 0.009*"students" + 0.009*"aboriginal"


Topic: 5 
Words: 0.031*"queensland" + 0.024*"coast" + 0.021*"north" + 0.016*"

✏️ We can see the key words of each topic. For example, Topic 6 contains words such as “court”, “police” and “murder”, and Topic 7 contains words such as “donald” and “trump”.

## Topic Distribution

### Getting probabilities for belonging to each topic 

In [42]:
my_document = documents.headline_text[2]
print(my_document)

def topic_distribution(string_input):
    string_input = [string_input]
    X = vect.transform(string_input)
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    return output

for i in topic_distribution(my_document):
    print(i)

a g calls for infrastructure protection summit
(0, 0.21994263)
(1, 0.020007217)
(2, 0.41983947)
(3, 0.22017446)
(4, 0.020006038)
(5, 0.020006038)
(6, 0.020006038)
(7, 0.020006038)
(8, 0.020006038)
(9, 0.020006038)


As we can see, this document most likely belongs to topic 2, with a
probability of about 42%. Let’s recall topic 2:

In [43]:
ldamodel.print_topics(-1)[2]

(2,
 '0.026*"government" + 0.019*"nsw" + 0.016*"says" + 0.014*"health" + 0.013*"election" + 0.012*"indigenous" + 0.011*"calls" + 0.011*"council" + 0.011*"labor" + 0.009*"help"')

### Getting a single topic prediction

In [45]:
def topic_prediction(my_document):
    string_input = [my_document]
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    topics = sorted(output,key=lambda x:x[1],reverse=True)
    return topics[0][0]

topic_prediction(my_document)

2
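
The selection step inside `topic_prediction` can be checked in isolation. The sketch below copies the distribution printed earlier for the sample headline and picks the highest-probability topic with `max`, which is an alternative to sorting the whole list; the variable names are illustrative only.

```python
# Topic distribution copied from the output printed earlier for the sample headline.
distribution = [
    (0, 0.21994263), (1, 0.020007217), (2, 0.41983947), (3, 0.22017446),
    (4, 0.020006038), (5, 0.020006038), (6, 0.020006038),
    (7, 0.020006038), (8, 0.020006038), (9, 0.020006038),
]

# Pick the (topic_id, probability) pair with the largest probability.
best_topic, best_prob = max(distribution, key=lambda pair: pair[1])

# The probabilities over the ten topics should sum to roughly 1.
total = sum(prob for _, prob in distribution)

print(best_topic, round(best_prob, 2))  # 2 0.42
print(round(total, 3))                  # 1.0
```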

## Conclusion from the authors

Additional suggestions for improving the model include TF-IDF weighting and lemmatisation or stemming of the tokens.