## Topic Modeling: LDA

This notebook walks through training an LDA (Latent Dirichlet Allocation) topic model.
 
#### Input:

ws2_1_article_clean.csv: 

This dataset contains all the clean articles, obtained from the `ws2_1_data_preparation` notebook.

#### Output:

covid_topic_<strong>XX</strong>.html:

The code saves the interactive LDA plot as an HTML file, where <strong>XX</strong> is the optimal number of topics obtained from analyzing the LDA results.

ws_2_article_topic_<strong>XX</strong>.csv:

The code saves the LDA results as features in this file, where <strong>XX</strong> is the optimal number of topics. It is structured in 10 columns: an article ID, the article (original text), the number of words in the article, a cleaned version of the text, the number of words in the cleaned text, the publication date, the dominant topic in the article, the weight of that topic, a set of keywords related to the topic, and a topic label.
 

#### Topic modeling process includes:

- Include all the dependencies
- Import clean data
- Train the LDA model and compute the coherence metric
- Visualize the topics
- LDA as feature
- Map manual labels to topics


### Include all the dependencies

In [None]:
import pandas as pd
import numpy as np

!pip install gensim
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt

!pip install pyLDAvis
# note: pyLDAvis 3.3 and later renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

pd.options.display.max_colwidth = 200
import warnings
warnings.filterwarnings('ignore')

### Configuration parameters

In [None]:
# The path to the output folder where all the outputs will be saved
output_path = "/project_data/data_asset"

### Import Clean Data

In [None]:
# import the clean articles
articles = pd.read_csv(f"{output_path}/ws2_1_article_clean.csv")
articles.head()

### Train LDA Model

Here we train the LDA model and compute a coherence metric for a range of topic counts. Topic coherence measures the degree of semantic similarity between the highest-scoring words in a topic; higher values generally indicate topics that are easier to interpret.
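
To make the idea of "semantic similarity between high-scoring words" concrete, here is a minimal sketch of a co-occurrence based coherence score: the average normalized pointwise mutual information (NPMI) over all pairs of top words, counted at document level. This is a simplified stand-in and not gensim's `c_v` measure, which is more elaborate. The function name `npmi_coherence` and the toy documents are assumptions made for illustration only.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs, eps=1e-12):
    """Average NPMI over all pairs of top words, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # fraction of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the two words never appear together
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / (-math.log(p12) + eps))
    return sum(scores) / len(scores)

# hypothetical toy corpus with two clearly separated themes
toy_docs = [
    ["vaccine", "trial", "dose"],
    ["vaccine", "dose", "trial"],
    ["market", "stock", "economy"],
    ["market", "economy", "stock"],
]

coherent = npmi_coherence(["vaccine", "trial", "dose"], toy_docs)
mixed = npmi_coherence(["vaccine", "market", "dose"], toy_docs)
print(coherent, mixed)
```

In this toy setting, the word list drawn from a single theme scores close to 1, while the list that mixes both themes scores below zero.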

First, we create the term dictionary of our corpus, in which every unique term is assigned an index. Then we filter out the least and most frequent words and use the dictionary to convert the list of documents (the corpus) into a document-term matrix. We train LDA for each candidate number of topics and pick the number with the highest topic coherence. Finally, we train the LDA model with that optimal number of topics.
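
The dictionary filtering and bag-of-words conversion described above can be sketched in plain Python. This only mirrors the idea behind gensim's `filter_extremes` and `doc2bow`; it is not the library's implementation, and the helper names `build_vocab` and `to_bow` as well as the toy documents are hypothetical.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.9):
    """Assign an integer id to each token kept after document-frequency filtering."""
    # document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token for token in doc if token in vocab)
    return sorted((vocab[token], n) for token, n in counts.items())

# hypothetical toy corpus
toy_docs = [
    ["virus", "vaccine", "trial"],
    ["virus", "lockdown", "economy"],
    ["virus", "vaccine", "vaccine", "economy"],
]

toy_vocab = build_vocab(toy_docs, no_below=2, no_above=0.9)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_bow)
```

Here "virus" is dropped because it appears in every document, and "trial" and "lockdown" are dropped because each appears in only one document.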

In [None]:
words = [text.split() for text in articles['article_clean']]

In [None]:
# create the term dictionary of the corpus
dictionary = corpora.Dictionary(words)

# filter the least and most frequent words: drop terms that appear in fewer than
# no_below documents or in more than no_above (a fraction) of all documents
dictionary.filter_extremes(no_below=10, no_above=0.9)
dictionary.compactify()

# convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above
doc_term_matrix = [dictionary.doc2bow(doc) for doc in words]

In [None]:
# train LDA, computing the coherence score for a range of topics
coherence_scores = []

for num_topics in range(2, 14, 2):
    
    print(f"Number of topics: {num_topics}")
    
    # create the object for LDA model using gensim library
    Lda = gensim.models.ldamulticore.LdaMulticore

    # run and train LDA model on the document term matrix.
    ldamodel = Lda(doc_term_matrix, 
                   num_topics=num_topics, 
                   id2word = dictionary, 
                   passes=20, 
                   chunksize = 2000, 
                   random_state=42,
                   workers=6)
    
    # compute the coherence score
    coherence_model = CoherenceModel(model=ldamodel, 
                                     texts=words, 
                                     dictionary=dictionary, 
                                     coherence='c_v')

    coherence_lda = coherence_model.get_coherence()
    
    coherence_scores.append((num_topics, coherence_lda))

coherence_scores = [*zip(*coherence_scores)]

In [None]:
# plot the coherence score against the number of topics
plt.plot(coherence_scores[0], coherence_scores[1], marker='o')
plt.xlabel('Number of topics')
plt.ylabel('Coherence score (c_v)')
plt.title('Coherence Score for Topics')
plt.show()

In [None]:
# set the number of topics where the coherence score is the highest (read from the plot above)
num_topics = 6

# run and train LDA model on the document term matrix.
Lda = gensim.models.ldamulticore.LdaMulticore

ldamodel = Lda(doc_term_matrix, 
               num_topics=num_topics, 
               id2word=dictionary, 
               passes=20, 
               chunksize=10000, 
               random_state=42,
               workers=6)

In [None]:
# view the topics with their most important words and their proportions
ldamodel.print_topics(num_topics=num_topics, num_words=10)

### Visualization

To read the LDA plot:

- Click a circle in the left panel to select a topic.
- The bar chart in the right panel displays the 30 most relevant terms for the selected topic.
- The red bars represent the frequency of a term in the selected topic (proportional to p(term | topic)).
- The blue bars represent a term's frequency across the entire corpus (proportional to p(term)).
- Small values of λ (near 0) highlight potentially rare but exclusive terms for the selected topic.
- Large values of λ (near 1) highlight frequent, but not necessarily exclusive, terms for the selected topic.
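
The effect of λ can be illustrated with the relevance measure defined by the LDAvis authors (Sievert and Shirley, 2014): relevance = λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The sketch below is only an illustration with made-up probabilities; the terms and the helper names `relevance` and `rank` are hypothetical.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))

# hypothetical probabilities: term -> (p(term | topic), p(term))
terms = {
    "virus": (0.10, 0.08),        # frequent in the topic and in the whole corpus
    "ventilator": (0.02, 0.002),  # rare overall, but concentrated in the topic
}

def rank(lam):
    """Order the terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

rank_lam_1 = rank(1.0)  # ranks purely by frequency within the topic
rank_lam_0 = rank(0.0)  # ranks purely by exclusivity to the topic
print(rank_lam_1, rank_lam_0)
```

With λ = 1 the frequent term comes first; with λ = 0 the rare but exclusive term takes the top spot.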

In [None]:
# visualize the interactive LDA plot
lda_display = pyLDAvis.gensim.prepare(ldamodel, 
                                      doc_term_matrix, 
                                      dictionary, 
                                      sort_topics=False)
pyLDAvis.display(lda_display)

In [None]:
# save the plot in html format
pyLDAvis.save_html(lda_display, f"{output_path}/covid_topic_{num_topics}.html")

### LDA as feature

Here we get the dominant topic and its proportion per document, and concatenate them with the main dataset.
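
As a minimal sketch of the "dominant topic" idea, independent of gensim: each document yields a list of `(topic_id, probability)` pairs, and the dominant topic is the pair with the largest probability. The rows below are made up, and the helper name `dominant_topics` is hypothetical. Emitting a placeholder for an empty row is one way to keep the result aligned with the original documents when the two are concatenated later.

```python
def dominant_topics(doc_topic_rows):
    """Return one (topic_id, weight) pair per document, or (None, None) for an empty row."""
    result = []
    for row in doc_topic_rows:
        if not row:
            result.append((None, None))  # placeholder keeps row alignment
            continue
        topic_id, weight = max(row, key=lambda pair: pair[1])
        result.append((topic_id, round(weight, 4)))
    return result

# hypothetical per-document topic distributions as (topic_id, probability) pairs
rows = [
    [(0, 0.12), (3, 0.81), (5, 0.07)],
    [],
    [(1, 0.55), (2, 0.45)],
]

dominant = dominant_topics(rows)
print(dominant)
```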

In [None]:
# user inputs
corpus = doc_term_matrix
texts = articles
df = articles

In [None]:
# function to get the dominant topic, its percentage of contribution, and its keywords for each document
def format_topics_sentences(ldamodel, corpus):

    results = []
    
    # get main topic in each document
    for row in ldamodel[corpus]:
        
        if len(row) == 0:
            # keep a placeholder row so the output stays aligned with the input documents
            results.append((None, None, None))
            continue
            
        row = sorted(row, key=lambda elem: elem[1], reverse=True)
        
        # get the dominant topic, percentage of contribution and keywords for each document
        topic_num, prop_topic = row[0]        
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        results.append((topic_num, round(prop_topic, 4), [topic_keywords]))
    
    df = pd.DataFrame.from_records(results, columns=['dominant_topic', 'weight', 'keywords'])
    
    return df

In [None]:
df_topics = format_topics_sentences(ldamodel, corpus)
df_topics.head()

In [None]:
# concatenate with the main dataset
articles = pd.concat([articles, df_topics.reindex(articles.index)], axis=1)

### Map to topic labels

Here we map the topic labels to the `dominant_topic` column obtained above. The topic labels are defined by analysing the LDA interactive plot.
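
Because the labels are assigned by hand, it can be worth checking that every dominant topic actually has one; with a left join, topics missing from the mapping simply end up without a label. The following is a small plain-Python sketch of that check. The label names and topic ids are placeholders, not labels from this analysis.

```python
# hypothetical labels; topic 2 is deliberately left unlabeled
topic_labels = {0: "label_1", 1: "label_2"}

# hypothetical dominant topic ids, one per document
dominant_topic_ids = [1, 0, 2, 1]

# look up each label; topics without a label map to None
mapped = [topic_labels.get(topic_id) for topic_id in dominant_topic_ids]

# list the topic ids that still need a manual label
unlabeled = sorted({t for t in dominant_topic_ids if t not in topic_labels})
print(mapped, unlabeled)
```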

In [None]:
# Define the topic labels for all the topics identified.
 
topics_dict = [[0, 'label_1'],
               [1, 'label_2'], 
               [2, 'label_3'], 
               [3, 'label_4'],
               [4, 'label_5'],
               [5, 'label_6']]

labels = pd.DataFrame(topics_dict, columns =['topic_num', 'topic_label'])

# merge with the main dataset
articles = pd.merge(articles, labels, how='left', left_on = 'dominant_topic', right_on='topic_num')
articles.drop("topic_num", axis=1, inplace=True)
articles.head()

In [None]:
# save data in data assets
articles.to_csv(f"{output_path}/ws_2_article_topic_{num_topics}.csv", index=False)

#### Authors
    
* **Mehrnoosh Vahdat** is a Data Scientist with the Data Science & AI Elite team, where she specializes in Data Science, Analytics platforms, and Machine Learning solutions.
* **Vincent Nelis** is a Senior Data Scientist with the Data Science & AI Elite team, where he specializes in Data Science, Analytics platforms, and Machine Learning solutions.
* **Anthony Ayanwale** is a Data Scientist with the CPAT team, where he specializes in Data Science, Analytics platforms, and Machine Learning solutions.

Copyright © IBM Corp. 2020. Licensed under the Apache License, Version 2.0. Released as licensed Sample Materials.