<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/probabilistic_topic_models/LDA_Topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Notebook serves as an introduction to **Probabilistic Topic Models**. 

Textual data is loaded from a Google Sheet, and topics are then derived from it with Latent Dirichlet Allocation (LDA).

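First, we install `gspread` and import the libraries used throughout the notebook. The Google Account credentials needed to access the corpus hosted on Google Drive are requested later, when the data is loaded.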

In [3]:
!pip install --upgrade -q gspread

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Loading Data

[CORDIS](https://cordis.europa.eu/projects/en) is the primary source of results from EU-funded projects since 1990. 

We use a [sample stored in a spreadsheet](https://goo.gl/dG8eVF) with 100 such projects. Save a copy in the `Google Colab/` folder of your Google Drive.

A DataFrame is created from the Google Sheet that holds the training texts:


In [5]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# name of the spreadsheet with texts (e.g.'texts_nlp')
corpus = 'texts_nlp'

worksheet = gc.open(corpus).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
            
# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows[1:], columns=["ID","TITLE","DESCRIPTION"])
data_table.DataTable(dataset_df, include_index=False, num_rows_per_page=5)


| | ID | TITLE | DESCRIPTION |
|---|---|---|---|
| 0 | EU100000 | Visual object population codes relating human... | representation brain-activity datum acquire re... |
| 1 | EU100001 | New Opportunities for Research Funding Agency ... | norface action research funding agency norface... |
| 2 | EU100002 | USA and Europe Cooperation in Mini UAVs | aerial systems have area research year world r... |
| 3 | EU100003 | Sustainable Infrastructure for Resilient Urban... | fellowship identify space infrastructure influ... |
| 4 | EU100004 | Modelling star formation in the local universe | goal proposal revolutionize understanding star... |
| ... | ... | ... | ... |
| 95 | EU100097 | The Environmental Observation Web and its Serv... | community generate amount observation scale co... |
| 96 | EU100098 | Future INternet for Smart ENergY | energy sector have enter period change continu... |
| 97 | EU100099 | Smart Food and Agribusiness: Future Internet f... | smartagrifood project address food agribusines... |
| 98 | EU100100 | Instant Mobility for Passengers and Goods | mobility project have create concept transport... |


### Data Cleaning

Now, let's build the bag-of-words (BoW) representation with [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

It is a [scikit-learn](https://scikit-learn.org/stable/index.html) class that turns a collection of strings into a matrix of token counts.
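As a rough illustration of what a bag-of-words is, the sketch below rebuilds the counting step in plain Python for two made-up sentences. It only approximates `CountVectorizer` (lowercasing plus the same `[a-zA-Z0-9]{3,}` token pattern used in the cell that follows); the `toy_docs` list and the helper name `tokenize` are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, not taken from the CORDIS sample.
toy_docs = [
    "Energy networks for smart energy grids",
    "Smart transport of goods in a city",
]

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens of 3+ characters."""
    return re.findall(r"[a-zA-Z0-9]{3,}", text.lower())

# One Counter per document: term -> number of occurrences.
counts = [Counter(tokenize(doc)) for doc in toy_docs]

# The vocabulary is the sorted set of all terms seen in the corpus.
toy_vocabulary = sorted(set().union(*counts))

# Dense document-term matrix: one row per document, one column per term.
toy_bow = [[c[term] for term in toy_vocabulary] for c in counts]

print(toy_vocabulary)
for doc, row in zip(toy_docs, toy_bow):
    print(row, "<-", doc)
```

Tokens shorter than three characters (`of`, `in`, `a`) are dropped by the pattern, while `for` survives because no stop words are filtered in this sketch.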


In [None]:
# from sklearn.feature_extraction import text 
# my_additional_stop_words = ['xxx','yyy']
# my_stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# list of texts
documents = dataset_df['DESCRIPTION'].tolist()

# bag-of-words
tf_vectorizer = CountVectorizer(
    stop_words=[],
    min_df=1,
    max_df=1.0,
    lowercase=True,
    max_features=50000,
    token_pattern='[a-zA-Z0-9]{3,}',  
    analyzer = 'word'
)
bag_of_words = tf_vectorizer.fit_transform(documents)
# terms indexed by column (get_feature_names_out requires scikit-learn >= 1.0)
dictionary = tf_vectorizer.get_feature_names_out()
vocabulary = tf_vectorizer.vocabulary_

print("Vocabulary Size: ", len(dictionary))

Sorted list of terms by frequency:

In [None]:
s = bag_of_words.toarray().sum(axis=0)
st = sorted(range(len(s)), key=lambda k: s[k], reverse=True)
for i,x in enumerate(st[:20]):
  print(dictionary[x],s[x])

### Topic Model

Now it's time to build an LDA-based model by setting values for:
- number of topics
- alpha: the document-topic prior (`doc_topic_prior`)
- beta: the topic-word prior (`topic_word_prior`)
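To get a feel for what the two priors control, the sketch below draws topic mixtures from symmetric Dirichlet distributions with NumPy. It is independent of scikit-learn; the concentration values and the names `rng`, `sparse_mix` and `smooth_mix` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 3

# Small concentration: each draw puts most of its mass on one topic.
sparse_mix = rng.dirichlet([0.1] * n_topics, size=1000)

# Large concentration: draws stay close to the uniform mixture (1/3 each).
smooth_mix = rng.dirichlet([100.0] * n_topics, size=1000)

# Average weight of the dominant topic in each setting.
print("concentration 0.1   mean max weight:", sparse_mix.max(axis=1).mean())
print("concentration 100.0 mean max weight:", smooth_mix.max(axis=1).mean())
```

A large value such as 100.0 pushes every document towards an even mix of topics, whereas values below 1 favour documents dominated by a few topics. The same reasoning applies to the topic-word prior and the words within a topic.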

In [None]:

topics = 3 

alpha = 100.0

beta = 100.0

# Run LDA
lda = LatentDirichletAllocation(
    n_components=topics, 
    doc_topic_prior=alpha, 
    topic_word_prior=beta, 
    max_iter=25, 
    learning_method='online', 
    evaluate_every=1,
    n_jobs = -1,
    random_state=0,
    verbose=1)
lda.fit(bag_of_words)


Explore topics:

In [None]:
no_top_words = 10
no_top_documents = 5

# Topic distribution of every document (one row per document)
doc_topics = lda.transform(bag_of_words)
# Unnormalized topic-word weights (one row per topic)
topic_word_weights = lda.components_

print("LDA Topics")
for topic_idx, topic in enumerate(topic_word_weights):
    print("-"*30)
    print(" Topic ", topic_idx, " :")
    # Highest-weighted words of the topic, in descending order
    print("[", " | ".join([dictionary[i]
                    for i in topic.argsort()[:-no_top_words - 1:-1]]), "]")
    # Documents with the largest share of this topic
    top_doc_indices = np.argsort(doc_topics[:, topic_idx])[::-1][0:no_top_documents]
    for doc_index in top_doc_indices:
        row_index = doc_index + 1  # rows[0] is the header row
        print("[", doc_index, "] (", rows[row_index][0], ") \'", rows[row_index][1], "\'", ["{0:.5f}".format(weight) for weight in doc_topics[doc_index]])
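The slice applied to `argsort()` in the cell above is compact but easy to misread. The sketch below replays it on a tiny array; the weights and the `toy_terms` list are made up for illustration.

```python
import numpy as np

# Hypothetical weights of one topic over a six-word vocabulary.
topic = np.array([0.1, 2.5, 0.7, 3.2, 0.05, 1.4])
toy_terms = ["city", "energy", "food", "grid", "star", "web"]
no_top_words = 3

# argsort() returns indices in ascending order of weight; the slice
# [:-no_top_words - 1:-1] walks backwards from the end and stops after
# no_top_words items, giving the indices of the largest weights first.
top_idx = topic.argsort()[:-no_top_words - 1:-1]

print([toy_terms[i] for i in top_idx])
```

An equivalent and arguably more readable form is `np.argsort(-topic)[:no_top_words]`, which is the pattern used later in `show_topics`.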
        

### Doc-Topic Matrix


In [None]:
from IPython.display import display, HTML
import pandas as pd
#pd.set_option('display.max_columns', None)

topicnames = ["topic"+ str(x) for x in range(0, lda.n_components)]

df = pd.DataFrame(doc_topics, 
                  columns=topicnames, 
                  index=dataset_df['TITLE'].tolist())

data_table.DataTable(df, num_rows_per_page=10)

### Topic-Word Matrix

In [None]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis])

# Assign Column and Index
df_topic_keywords.columns = dictionary
df_topic_keywords.index = topicnames

# View
data_table.DataTable(df_topic_keywords, num_rows_per_page=10)
#df_topic_keywords.head()
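The division in the cell above turns the unnormalized weights in `lda.components_` into one probability distribution over words per topic. A minimal NumPy sketch of that row normalization, using a made-up 2 x 4 matrix named `components`:

```python
import numpy as np

# Hypothetical unnormalized topic-word weights: 2 topics x 4 words.
components = np.array([
    [4.0, 1.0, 3.0, 2.0],
    [0.5, 0.5, 8.0, 1.0],
])

# Row sums as a column vector, so that broadcasting divides row by row.
row_sums = components.sum(axis=1)[:, np.newaxis]
topic_word_probs = components / row_sums

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # each row sums to 1
```

Without `[:, np.newaxis]` the row sums would be broadcast along the columns instead, which either fails on a shape mismatch or silently divides by the wrong totals when the matrix happens to be square.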

### Top 10 Words per Topic

In [None]:
def show_topics(vectorizer=tf_vectorizer, lda_model=lda, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=tf_vectorizer, lda_model=lda, n_words=10)

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
data_table.DataTable(df_topic_keywords, num_rows_per_page=10)

### Topic Inference

In [None]:
text = "we develop a project to process models in a large-scale" #@param {type:"string"}

print("Topic Distribution: ", lda.transform(tf_vectorizer.transform([text])))


### Diagnose model performance

A good model has a high log-likelihood and a low perplexity, where perplexity = exp(-1 * log-likelihood per word).
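The relation between the two quantities can be checked with a few lines of plain Python. The numbers below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not results from this corpus.

```python
import math

# Hypothetical values, not measured on the CORDIS sample.
log_likelihood = -69000.0   # total log-likelihood of a corpus
n_words = 10000             # total number of word occurrences in that corpus

# Perplexity is the exponential of the negative per-word log-likelihood.
per_word_ll = log_likelihood / n_words
perplexity = math.exp(-per_word_ll)

print("Per-word log-likelihood:", per_word_ll)
print("Perplexity:", perplexity)
```

Because the exponential is monotonic, a higher log-likelihood on the same data always corresponds to a lower perplexity, so the two criteria rank models identically.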

In [None]:
# Log-likelihood (approximate): the higher, the better
print("Log Likelihood: ", lda.score(bag_of_words))

# Perplexity: the lower, the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda.perplexity(bag_of_words))

# See model parameters
print(lda)

Determine the best LDA model with a grid search over the number of topics and both priors:

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the search grid
search_params = {'n_components': [5, 10, 15], 'doc_topic_prior': [.1, .3, .5], 'topic_word_prior': [.01, .03, .05]}

# Init a fresh model for the search (separate name, so the fitted `lda` is kept)
base_lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50., random_state=0)

# Init the grid search; candidates are ranked by the model's own score() (approximate log-likelihood)
model = GridSearchCV(base_lda, param_grid=search_params)

# Run the grid search
model.fit(bag_of_words)
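To estimate how much work the search involves, the grid can be enumerated by hand. The sketch below uses only the standard library; the 5-fold figure assumes the `GridSearchCV` default when `cv` is not set.

```python
from itertools import product

# Same grid as in the cell above.
search_params = {
    'n_components': [5, 10, 15],
    'doc_topic_prior': [.1, .3, .5],
    'topic_word_prior': [.01, .03, .05],
}

# Every combination of one value per parameter.
names = list(search_params)
combinations = [dict(zip(names, values))
                for values in product(*search_params.values())]

# Assumption: 5 folds, the GridSearchCV default when `cv` is not set.
n_folds = 5
print("Combinations:", len(combinations))
print("Model fits during the search:", len(combinations) * n_folds)
print("First combination:", combinations[0])
```

With the default `refit=True`, one more fit of the winning configuration on the whole corpus follows the search.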

Read the best configuration:

In [None]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Mean cross-validated log-likelihood score of the best configuration
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(bag_of_words))

### Topic-based Document Similarity

In [None]:
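# Topic-based similarity: 1 - Jensen-Shannon distance between topic distributions.
# scipy's jensenshannon returns the distance (square root of the divergence).
from scipy.spatial import distance

for i1, d1 in enumerate(doc_topics[0:10]):
    for i2, d2 in enumerate(doc_topics[0:10]):
        print(rows[i1+1][1], "-", rows[i2+1][1], ":", 1 - distance.jensenshannon(d1, d2))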
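For readers who want to see what the distance measures, here is a minimal NumPy sketch of the Jensen-Shannon distance. The function name `js_distance` and the three toy distributions are assumptions for this example. It uses base-2 logarithms so that the distance lies between 0 and 1, whereas SciPy's `jensenshannon` uses natural logarithms unless `base` is given.

```python
import numpy as np

def js_distance(p, q, base=2.0):
    """Jensen-Shannon distance: square root of the JS divergence."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture of the two distributions

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute 0.
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(divergence, 0.0)))

# Hypothetical topic distributions for three documents.
doc_a = [0.8, 0.1, 0.1]
doc_b = [0.7, 0.2, 0.1]
doc_c = [0.0, 0.0, 1.0]

print("a vs a:", 1 - js_distance(doc_a, doc_a))
print("a vs b:", 1 - js_distance(doc_a, doc_b))
print("a vs c:", 1 - js_distance(doc_a, doc_c))
```

Documents with similar topic mixtures get a similarity close to 1, and documents whose topics do not overlap at all get 0 under base 2.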

You can tune the `CountVectorizer` parameters to clean the input text, or use an already processed corpus.

Save a copy of this [file](https://goo.gl/RF4cWB) in the `Google Colab/` folder of your Google Drive.