# LDA Topic modeling

LDA stands for Latent Dirichlet allocation, It is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories

In [81]:
import numpy as np
import json
import glob

In [82]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

In [83]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [84]:
from itertools import chain

## Preparing the data

In [85]:
def load_json_data(file):
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

def write_json_data(file, data):
    with open(file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)

In [86]:
stop_words = stopwords.words('english')

In [87]:
# stop_words

In [88]:
with open('./SampleData/sample_text.txt') as f:
    data = f.readlines()

In [89]:
sentences = nltk.sent_tokenize(data[0])

In [90]:
sentences[0]

'artificial intelligence (AI), the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.'

In [91]:
def lemmatization(texts, stop_words ,allowed_postags=['NN', 'NNS', 'NNP', 'NNPS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS']):
    #Default Allowed postags are nouns, adjectives, verbs, adverb
    texts_out = []
    lemmatizer = WordNetLemmatizer()
    for text in texts:
        new_text = []
        words = nltk.word_tokenize(text)
        tagged_words = nltk.pos_tag(words)
        for tags in tagged_words:
            if((tags[0] not in stop_words) and (tags[1] in allowed_postags)):
                new_text.append(lemmatizer.lemmatize(tags[0]))
        final_text = " ".join(new_text)
        texts_out.append(final_text)
    
    return texts_out

In [92]:
def group_sentences(sentences, group_len = 3):
    new_sentences = []
    for idx in range(0, len(sentences), group_len):
        new_sent = ''
        i = idx
        while i<len(sentences) and i<idx+3:
            new_sent += sentences[i]
            i += 1
        new_sentences.append(new_sent)
    return new_sentences

In [93]:
bigger_sentences = group_sentences(sentences)

In [94]:
bigger_sentences[0]

'artificial intelligence (AI), the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience.Since the development of the digital computer in the 1940s, it has been demonstrated that computers can be programmed to carry out very complex tasksâ€”as, for example, discovering proofs for mathematical theorems or playing chessâ€”with great proficiency.'

In [95]:
lemmatized_text = lemmatization(sentences, stop_words)

In [96]:
lemmatized_text[0]

'artificial intelligence AI ability digital computer computer-controlled robot perform task commonly associated intelligent being'

In [97]:
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True) #Deacc is used to remove accents
        final.append(new)
    return final

In [98]:
data_words = gen_words(lemmatized_text)
print(data_words[0])

['artificial', 'intelligence', 'ai', 'ability', 'digital', 'computer', 'computer', 'controlled', 'robot', 'perform', 'task', 'commonly', 'associated', 'intelligent', 'being']


In [99]:
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

In [100]:
print(corpus[0])
print(id2word[[0][:1][0]]) #Return first word --> 0 index has word ability

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]
ability


## Improving the data by NLP More NLP methods

Bigrams and Trigrams

In [101]:
bigram_phrases = gensim.models.Phrases(data_words, min_count=3, threshold=25)
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=25)

bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)

# min count is number of times the word must occur to be considered as a bigram
# Threshold is represent a score threshold for forming the phrases , Higher number means fewer phrases

In [102]:
def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram[bigram[doc]] for doc in texts]

In [103]:
data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)

In [104]:
print(data_bigrams_trigrams[0])

['artificial_intelligence', 'ai', 'ability', 'digital', 'computer', 'computer', 'controlled', 'robot', 'perform', 'task', 'commonly', 'associated', 'intelligent', 'being']


TF-IDF removal

In [105]:
from gensim.models import TfidfModel

In [106]:
id2word = corpora.Dictionary(data_bigrams_trigrams)
texts = data_bigrams_trigrams
corpus = [id2word.doc2bow(text) for text in texts]  # bow - bag of words
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]


In [107]:
tfidf = TfidfModel(corpus=corpus, id2word=id2word)
low_value = 0.03
words = []
words_missing_in_tfid = []
for i in range(0, len(corpus)):
    bow = corpus[i]
    low_value_words = []
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    drops = low_value_words + words_missing_in_tfid
    for item in drops:
        words.append(id2word[item])
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids] # The words with tf-idf score 0 will be missing

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]     
    corpus[i] = new_bow

## Modeling LDA Model

In [108]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10,
                                           random_state=42,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto')

In [109]:
lda_model.print_topics()

[(0,
  '0.044*"model" + 0.030*"machine_learning" + 0.030*"data" + 0.016*"predict" + 0.016*"common" + 0.016*"future" + 0.016*"inductive" + 0.016*"collected" + 0.016*"anomalous" + 0.016*"behavioura"'),
 (1,
  '0.027*"truth" + 0.018*"conclusion" + 0.018*"case" + 0.018*"premise" + 0.018*"deductive" + 0.018*"inference" + 0.018*"computer" + 0.018*"artificial_intelligence" + 0.018*"identify" + 0.010*"difference"'),
 (2,
  '0.048*"burrow" + 0.048*"food" + 0.025*"deposit" + 0.025*"return" + 0.025*"coast" + 0.025*"inside" + 0.025*"clear" + 0.025*"wasp" + 0.025*"first" + 0.025*"check"'),
 (3,
  '0.029*"intelligence" + 0.029*"characterize" + 0.029*"trait" + 0.029*"generally" + 0.029*"combination" + 0.029*"ability" + 0.029*"many" + 0.029*"diverse" + 0.029*"human" + 0.029*"psychologists"'),
 (4,
  '0.046*"ai" + 0.046*"problem" + 0.024*"solving" + 0.024*"following" + 0.024*"reasoning" + 0.024*"research" + 0.024*"learning" + 0.024*"intelligence" + 0.024*"component" + 0.024*"using"'),
 (5,
  '0.029*"ar

## Vizualizing the data

In [110]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [111]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word, mds='mmds', R=15)
vis

  default_term_info = default_term_info.sort_values(


## Find which sentence belong to which cluster

In [112]:
lda_corpus = lda_model[corpus]

In [113]:
cluster_index_list = [doc for doc in lda_corpus]
len(cluster_index_list)

26

In [114]:
# scores = list(chain(*[[score for topic_id,score in topic] \ for topic in [doc for doc in lda_corpus]]))
# threshold = sum(scores)/len(scores)
# print(threshold)

In [115]:
cluster_idx = lda_model.show_topics()
topics = {}
present_topics = []
for term in cluster_idx:
    terms = term[1].split('+')
    idx = 0
    element = terms[idx].split('*')[1]
    element = element.strip()[1:-1]
    while(element in present_topics):
        idx += 1
        element = terms[idx].split('*')[1]
        element = element.strip()[1:-1]
    present_topics.append(element)
    topics[term[0]] = element
    
topics

{0: 'model',
 1: 'truth',
 2: 'burrow',
 3: 'intelligence',
 4: 'ai',
 5: 'artificial_intelligence',
 6: 'regardless',
 7: 'human',
 8: 'computer',
 9: 'ichneumoneus'}

In [116]:
sentences[0]

'artificial intelligence (AI), the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.'

In [117]:
clustered_sentences = {}
for idx in range(0, len(cluster_index_list)):
    indexes = cluster_index_list[idx]
    if(len(indexes) == 1):
        clustered_sentences[sentences[idx]] = topics[indexes[0][0]]
    else:
        max_prob = 0
        topic = ''
        for index in indexes:
            prob = index[1]
            if(prob > max_prob):
                max_prob = prob
                topic = topics[index[0]]
        clustered_sentences[sentences[idx]] = topic

In [118]:
clustered_sentences

{'artificial intelligence (AI), the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.': 'artificial_intelligence',
 'The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience.': 'human',
 'Since the development of the digital computer in the 1940s, it has been demonstrated that computers can be programmed to carry out very complex tasksâ€”as, for example, discovering proofs for mathematical theorems or playing chessâ€”with great proficiency.': 'computer',
 'Still, despite continuing advances in computer processing speed and memory capacity, there are as yet no programs that can match human flexibility over wider domains or in tasks requiring much everyday knowledge.': 'artificial_intelligence',
 'On the other hand, some programs have attained the perfo

In [119]:
grouped_sentences = {k: '' for k in range(0, 10)}
for idx in range(0, len(cluster_index_list)):
    indexes = cluster_index_list[idx]
    if(len(indexes) == 1):
        grouped_sentences[indexes[0][0]] += sentences[idx]
    else:
        max_prob = 0
        best_index = 0
        for index in indexes:
            prob = index[1]
            if(prob > max_prob):
                max_prob = prob
                best_index = index[0]
        grouped_sentences[best_index] += sentences[idx]

print(grouped_sentences)

{0: 'Intelligenceâ€”conspicuously absent in the case of Sphexâ€”must include the ability to adapt to new circumstances.Inductive reasoning is common in science, where data are collected and tentative models are developed to describe and predict future behaviourâ€”until the appearance of anomalous data forces the model to be revised.Thatâ€™s why we invented automated machine learning, which allows users of all skill levels to easily and rapidly build and deploy machine learning models.', 1: 'On the other hand, some programs have attained the performance levels of human experts and professionals in performing certain specific tasks, so that artificial intelligence in this limited sense is found in applications as diverse as medical diagnosis, computer search engines, and voice or handwriting recognition.The most significant difference between these forms of reasoning is that in the deductive case the truth of the premises guarantees the truth of the conclusion, whereas in the inductive c

In [120]:
for i in grouped_sentences:
    print(grouped_sentences[i])
    print('\n')

Intelligenceâ€”conspicuously absent in the case of Sphexâ€”must include the ability to adapt to new circumstances.Inductive reasoning is common in science, where data are collected and tentative models are developed to describe and predict future behaviourâ€”until the appearance of anomalous data forces the model to be revised.Thatâ€™s why we invented automated machine learning, which allows users of all skill levels to easily and rapidly build and deploy machine learning models.


On the other hand, some programs have attained the performance levels of human experts and professionals in performing certain specific tasks, so that artificial intelligence in this limited sense is found in applications as diverse as medical diagnosis, computer search engines, and voice or handwriting recognition.The most significant difference between these forms of reasoning is that in the deductive case the truth of the premises guarantees the truth of the conclusion, whereas in the inductive case the t

# Testing the py file (Converted function file)

In [121]:
import ldamodule

In [122]:
data[0]

'artificial intelligence (AI), the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience. Since the development of the digital computer in the 1940s, it has been demonstrated that computers can be programmed to carry out very complex tasksâ€”as, for example, discovering proofs for mathematical theorems or playing chessâ€”with great proficiency. Still, despite continuing advances in computer processing speed and memory capacity, there are as yet no programs that can match human flexibility over wider domains or in tasks requiring much everyday knowledge. On the other hand, some programs have attained the performance levels of human experts and professionals in performing certain specific tasks, so th

In [123]:
grouped_text = ldamodule.create_topics(data[0], 3, 10)

In [124]:
grouped_text

[{0: 'He is not in the cafÃ©; therefore he is in the museum, and of the latter, Previous accidents of this sort were caused by instrument failure; therefore this accident was caused by instrument failure.The most significant difference between these forms of reasoning is that in the deductive case the truth of the premises guarantees the truth of the conclusion, whereas in the inductive case the truth of the premise lends support to the conclusion without giving absolute assurance.Inductive reasoning is common in science, where data are collected and tentative models are developed to describe and predict future behaviourâ€”until the appearance of anomalous data forces the model to be revised.',
  1: 'Deductive reasoning is common in mathematics and logic, where elaborate structures of irrefutable theorems are built up from a small set of basic axioms and rules.There has been considerable success in programming computers to draw inferences, especially deductive inferences.However, true 