# Topic Modeling using Latent Dirichlet Allocation

## What is Topic Modeling?

Topic modeling is the process of identifying patterns in text data that correspond to topics.

## What if the text contains multiple topics?

If the text contains multiple topics, this technique can be used to identify and separate those themes within the input text. We do this to uncover the hidden thematic structure in a given set of documents.


## Is this Supervised or Unsupervised Learning?

Topic modeling algorithms do not need any labeled data. They are a form of unsupervised learning: the algorithm identifies the patterns on its own.

## Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic modeling technique whose underlying intuition is that a given piece of text is a combination of multiple topics.
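
One way to read the intuition behind LDA is as a generative story: each document first draws its own mixture over topics, and each word is then drawn from one of those topics. The sketch below illustrates that story with the standard library only. The names `sample_dirichlet`, `generate_document`, and `toy_topics`, along with every probability in it, are assumptions made up for illustration; this is not how gensim fits a model, only a picture of the process the model assumes.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, alpha, num_words, rng):
    """Generate one toy document under the LDA generative story."""
    # Each document gets its own mixture over topics
    theta = sample_dirichlet(alpha, rng)
    topic_names = list(topic_word_probs)
    words = []
    for _ in range(num_words):
        # Pick a topic for this word position according to the mixture
        topic = rng.choices(topic_names, weights=theta)[0]
        vocab = list(topic_word_probs[topic])
        probs = list(topic_word_probs[topic].values())
        # Pick a word from the chosen topic's word distribution
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics; words and probabilities are invented for illustration
toy_topics = {
    'math': {'set': 0.4, 'axiom': 0.3, 'numer': 0.3},
    'history': {'empir': 0.5, 'war': 0.3, 'europ': 0.2},
}

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], num_words=8, rng=rng)
print('Topic mixture:', theta)
print('Generated words:', words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.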

In [8]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

In [9]:
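# Load input data: one sentence per line
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline; slicing with [:-1] would
            # drop a real character when the last line has no newline
            data.append(line.rstrip('\n'))
    return data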

In [18]:
# Processor function for tokenizing, removing stop words and stemming
def process(input_text):
    # Create a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Create a snowball stemmer
    stemmer = SnowballStemmer('english')
    
    # Get the stop words as a set for fast membership tests
    stop_words = set(stopwords.words('english'))

    # Tokenize the input string
    tokens = tokenizer.tokenize(input_text.lower())

    # Remove the stop words
    tokens = [x for x in tokens if x not in stop_words]

    # Perform stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]
    
    return tokens_stemmed   
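
To make the first two steps of the processor concrete, here is a rough standard-library sketch of tokenizing with the `\w+` pattern and filtering stop words. It assumes a tiny, hypothetical stop word set (`TOY_STOP_WORDS`) in place of NLTK's much longer English list, and it skips stemming entirely, so its output is not identical to what `process` returns.

```python
import re

# Hypothetical, deliberately tiny stop word set; NLTK's English list is much longer
TOY_STOP_WORDS = {'the', 'very', 'and', 'it', 'was', 'in', 'for', 'a'}

def tokenize_and_filter(text, stop_words=TOY_STOP_WORDS):
    # r'\w+' matches runs of letters, digits, and underscores,
    # so punctuation and whitespace are dropped
    tokens = re.findall(r'\w+', text.lower())
    # Keep only the tokens that are not stop words
    return [t for t in tokens if t not in stop_words]

sentence = ('The Roman empire expanded very rapidly and it was the biggest '
            'empire in the world for a long time.')
filtered = tokenize_and_filter(sentence)
print(filtered)
# ['roman', 'empire', 'expanded', 'rapidly', 'biggest', 'empire', 'world', 'long', 'time']
```

Stemming would then reduce words such as `empire` and `expanded` to stems like `empir` and `expand`, which is the form the LDA model sees.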

In [31]:
# Load input data
data = load_data('data.txt')
for i, line in enumerate(data):
    print('{}. {}\n'.format(i, line))

0. The Roman empire expanded very rapidly and it was the biggest empire in the world for a long time.

1. An algebraic structure is a set with one or more finitary operations defined on it that satisfies a list of axioms.

2. Renaissance started as a cultural movement in Italy in the Late Medieval period and later spread to the rest of Europe.

3. The line of demarcation between prehistoric and historical times is crossed when people cease to live only in the present.

4. Mathematicians seek out patterns and use them to formulate new conjectures.  

5. A notational symbol that represents a number is called a numeral in mathematics. 

6. The process of extracting the underlying essence of a mathematical concept is called abstraction.

7. Historically, people have frequently waged wars against each other in order to expand their empires.

8. Ancient history indicates that various outside influences have helped formulate the culture and traditions of Eastern Europe.

9. Mappings between s

In [20]:
# Tokenize, filter, and stem each sentence
tokens = [process(x) for x in data]
print(tokens)

[['roman', 'empir', 'expand', 'rapid', 'biggest', 'empir', 'world', 'long', 'time'], ['algebra', 'structur', 'set', 'one', 'finitari', 'oper', 'defin', 'satisfi', 'list', 'axiom'], ['renaiss', 'start', 'cultur', 'movement', 'itali', 'late', 'mediev', 'period', 'later', 'spread', 'rest', 'europ'], ['line', 'demarc', 'prehistor', 'histor', 'time', 'cross', 'peopl', 'ceas', 'live', 'present'], ['mathematician', 'seek', 'pattern', 'use', 'formul', 'new', 'conjectur'], ['notat', 'symbol', 'repres', 'number', 'call', 'numer', 'mathemat'], ['process', 'extract', 'under', 'essenc', 'mathemat', 'concept', 'call', 'abstract'], ['histor', 'peopl', 'frequent', 'wage', 'war', 'order', 'expand', 'empir'], ['ancient', 'histori', 'indic', 'various', 'outsid', 'influenc', 'help', 'formul', 'cultur', 'tradit', 'eastern', 'europ'], ['map', 'set', 'preserv', 'structur', 'special', 'interest', 'mani', 'field', 'mathemat']]


In [22]:
# Create a dictionary based on the sentence tokens
dict_tokens = corpora.Dictionary(tokens)
print(dict_tokens)

Dictionary(78 unique tokens: ['roman', 'empir', 'expand', 'rapid', 'biggest']...)


In [23]:
# Create a document-term matrix: one bag-of-words vector of
# (token_id, count) pairs per sentence
doc_term_mat = [dict_tokens.doc2bow(doc_tokens) for doc_tokens in tokens]
print(doc_term_mat)

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)], [(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(7, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1)], [(46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)], [(50, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1)], [(1, 1), (2, 1), (33, 1), (35, 1), (59, 1), (60, 1), (61, 1), (62, 1)], [(20, 1), (29, 1), (43, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1)], [(9, 1), (10, 1), (52, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1)]]
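
Each inner list above is a sparse bag-of-words vector: a pair `(token_id, count)` for every distinct token in the sentence. The sketch below shows the idea in plain Python. The helpers `build_vocabulary` and `to_bag_of_words` are hypothetical names, and assigning ids in order of first appearance is an assumption made for illustration; gensim's `Dictionary` may assign ids differently.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id (first-appearance order, an assumption)."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            # setdefault only assigns a new id the first time a token is seen
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

# Two of the stemmed sentences from the output above
docs = [
    ['roman', 'empir', 'expand', 'rapid', 'biggest', 'empir', 'world', 'long', 'time'],
    ['histor', 'peopl', 'frequent', 'wage', 'war', 'order', 'expand', 'empir'],
]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(bows[0])
# [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
```

The pair `(1, 2)` records that `empir` occurs twice in the first sentence. Word order is discarded, which is the simplifying assumption LDA relies on.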


In [38]:
# Define the number of topics for the LDA model
num_topics = 2

In [39]:
# Generate the LDA model
ldamodel = models.ldamodel.LdaModel(doc_term_mat, num_topics=num_topics, id2word=dict_tokens, passes=25)

In [41]:
num_words = 10
print('\nTop {} contributing words to each topic:'.format(num_words))
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic:', item[0])

    # Print the contributing words along with their relative contributions
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        weight, word = text.split('*')
        print('{} => {}%'.format(word, round(float(weight) * 100, 2)))


Top 10 contributing words to each topic:

Topic: 0
"mathemat" => 5.5%
"call" => 3.9%
"structur" => 2.4%
"set" => 2.4%
"map" => 2.4%
"field" => 2.4%
"interest" => 2.4%
"preserv" => 2.4%
"special" => 2.4%
"mani" => 2.4%

Topic: 1
"empir" => 3.3%
"cultur" => 2.3%
"europ" => 2.3%
"time" => 2.3%
"peopl" => 2.3%
"histor" => 2.3%
"expand" => 2.3%
"formul" => 2.3%
"movement" => 1.4%
"itali" => 1.4%
