<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/probabilistic_topic_models/LDA_Topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Notebook serves as an introduction to **Probabilistic Topic Models**. 

Texts are loaded from a Google Sheet, and topics are then inferred from them with Latent Dirichlet Allocation (LDA).

First we need to obtain credentials from our Google Account to access the corpus hosted on Google Drive.

In [1]:
!pip install --upgrade -q gspread

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np
import warnings
warnings.filterwarnings('ignore')


### Loading Data

[CORDIS](https://cordis.europa.eu/projects/en) is the primary source of results from EU-funded projects since 1990. 

We use a [sample stored in a spreadsheet](https://goo.gl/dG8eVF) with 100 such projects. Save a copy in the `Google Colab/` folder of your Google Drive.

A DataFrame is created from the Google Sheet:


In [15]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# name of the spreadsheet with texts (e.g.'texts_nlp')
corpus = 'texts'

worksheet = gc.open(corpus).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
            
# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows[1:], columns=["ID","TITLE","DESCRIPTION"])
data_table.DataTable(dataset_df, include_index=False, num_rows_per_page=5)


,ID,TITLE,DESCRIPTION
0,EU100000,Visual object population codes relating human...,Two major challenges facing systems neuroscien...
1,EU100001,New Opportunities for Research Funding Agency ...,NORFACE is a co-ordinated common action of fif...
2,EU100002,USA and Europe Cooperation in Mini UAVs,Unmanned Aerial Systems have been an active ar...
3,EU100003,Sustainable Infrastructure for Resilient Urban...,This fellowship aim is to identify how the use...
4,EU100004,Modelling star formation in the local universe,The goal of this proposal is to revolutionize ...
...,...,...,...
95,EU100097,The Environmental Observation Web and its Serv...,Large European communities generate significan...
96,EU100098,Future INternet for Smart ENergY,The energy sector has entered a period of majo...
97,EU100099,Smart Food and Agribusiness: Future Internet f...,The SmartAgriFood project addresses the food a...
98,EU100100,Instant Mobility for Passengers and Goods,The Instant Mobility project has created a con...


### Data Cleaning

Now, let's create the bag-of-words (BoW) representation using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

It is a [scikit-learn](https://scikit-learn.org/stable/index.html) class that builds bag-of-words count vectors from strings.
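To make the bag-of-words idea concrete, here is a minimal sketch that builds the same kind of count matrix by hand, using only the standard library. It reuses the token pattern passed to `CountVectorizer` in this notebook; the toy documents and all names (`bag_of_words_counts`, `toy_docs`, `toy_matrix`) are hypothetical and only illustrate the idea, not scikit-learn's internals.

```python
import re
from collections import Counter

# Same token pattern as the one passed to CountVectorizer in this notebook:
# runs of at least three letters or digits.
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9]{3,}")


def bag_of_words_counts(text):
    """Lowercase the text, tokenize it, and count each token."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


# Hypothetical toy documents, not taken from the CORDIS sample.
toy_docs = [
    "The system models a quantum system",
    "We study the model of a city",
]
toy_counts = [bag_of_words_counts(doc) for doc in toy_docs]

# The vocabulary is the sorted set of all tokens; each document becomes
# a row of counts aligned with that vocabulary.
toy_vocabulary = sorted(set().union(*toy_counts))
toy_matrix = [[counts[term] for term in toy_vocabulary] for counts in toy_counts]

print(toy_vocabulary)
for row in toy_matrix:
    print(row)
```

Tokens shorter than three characters (such as "a", "we" and "of") are dropped by the pattern, and word order is discarded: only the counts per document remain.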


In [3]:
# from sklearn.feature_extraction import text 
# my_additional_stop_words = ['xxx','yyy']
# my_stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# list of texts
documents = dataset_df['DESCRIPTION'].tolist()

# bag-of-words
tf_vectorizer = CountVectorizer(
    stop_words=[],
    min_df=1,
    max_df=1.0,
    lowercase=True,
    max_features=50000,
    token_pattern='[a-zA-Z0-9]{3,}',  
    analyzer = 'word'
)
bag_of_words = tf_vectorizer.fit_transform(documents)
dictionary = tf_vectorizer.get_feature_names_out()
vocabulary = tf_vectorizer.vocabulary_

print("Vocabulary Size: ", len(dictionary))

Vocabulary Size:  2424


Sorted list of terms by frequency:

In [4]:
s = bag_of_words.toarray().sum(axis=0)
st = sorted(range(len(s)), key=lambda k: s[k], reverse=True)
for i,x in enumerate(st[:20]):
  print(dictionary[x],s[x])

research 147
system 131
project 126
have 119
develop 87
study 82
datum 72
model 70
technology 70
approach 63
design 63
provide 54
application 53
need 51
process 51
base 50
challenge 48
propose 47
development 46
support 46


### Topic Model

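Now it's time to build an LDA-based model by setting values for:
- number of topics (`n_components`)
- alpha, the prior on the per-document topic distribution (`doc_topic_prior`)
- beta, the prior on the per-topic word distribution (`topic_word_prior`)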
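Alpha and beta are concentration parameters of symmetric Dirichlet priors. As a rough illustration of what they control, the sketch below draws samples from a symmetric Dirichlet with the standard library and compares how much probability mass the largest component receives. The helper names and the values 0.1 and 10.0 are assumptions chosen for illustration; this is not how scikit-learn fits the model.

```python
import random


def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(concentration, size=5, trials=2000, seed=0):
    """Average, over many samples, of the largest component of each sample."""
    rng = random.Random(seed)
    largest = [
        max(sample_symmetric_dirichlet(concentration, size, rng))
        for _ in range(trials)
    ]
    return sum(largest) / trials


# Illustrative values only: a small and a large concentration parameter.
sparse_share = mean_largest_share(0.1)
flat_share = mean_largest_share(10.0)

print("mean largest share, concentration 0.1: ", round(sparse_share, 3))
print("mean largest share, concentration 10.0:", round(flat_share, 3))
```

A small concentration value pushes most of the mass onto a few components (documents dominated by few topics, or topics dominated by few words), whereas a large value yields flatter, more even distributions.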

In [19]:

topics = 2 

alpha = 1.0

beta = 1.0

# Run LDA
lda = LatentDirichletAllocation(
    n_components=topics, 
    doc_topic_prior=alpha, 
    topic_word_prior=beta, 
    max_iter=25, 
    learning_method='online', 
    evaluate_every=1,
    n_jobs = -1,
    random_state=0,
    verbose=1)
lda.fit(bag_of_words)


iteration: 1 of max_iter: 25, perplexity: 1964.4737
iteration: 2 of max_iter: 25, perplexity: 1684.5141
iteration: 3 of max_iter: 25, perplexity: 1588.0070
iteration: 4 of max_iter: 25, perplexity: 1543.6931
iteration: 5 of max_iter: 25, perplexity: 1519.6826
iteration: 6 of max_iter: 25, perplexity: 1505.2810
iteration: 7 of max_iter: 25, perplexity: 1496.0332
iteration: 8 of max_iter: 25, perplexity: 1489.7904
iteration: 9 of max_iter: 25, perplexity: 1485.4075
iteration: 10 of max_iter: 25, perplexity: 1482.2285
iteration: 11 of max_iter: 25, perplexity: 1479.8572
iteration: 12 of max_iter: 25, perplexity: 1478.0433
iteration: 13 of max_iter: 25, perplexity: 1476.6232
iteration: 14 of max_iter: 25, perplexity: 1475.4868
iteration: 15 of max_iter: 25, perplexity: 1474.5580
iteration: 16 of max_iter: 25, perplexity: 1473.7836
iteration: 17 of max_iter: 25, perplexity: 1473.1251
iteration: 18 of max_iter: 25, perplexity: 1472.5549
iteration: 19 of max_iter: 25, perplexity: 1472.0524
it

Explore topics:

In [21]:
no_top_words = 10
no_top_documents = 2

doc_topics = lda.transform(bag_of_words)
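# Use a separate name so the `topics` count defined earlier is not overwritten.
topic_word_weights = lda.components_

print("LDA Topics")
for topic_idx, topic in enumerate(topic_word_weights):
    print("-"*30)
    print(" Topic ",(topic_idx)," :")
    # Top words: indices of the largest weights, in descending order.
    print("["," | ".join([dictionary[i]
                    for i in topic.argsort()[:-no_top_words - 1:-1]]),"]")
    # Documents with the highest share of this topic.
    top_doc_indices = np.argsort( doc_topics[:,topic_idx] )[::-1][0:no_top_documents]
    for doc_index in top_doc_indices:
        # rows[0] is the header row, so document i is stored in rows[i + 1].
        row_index = doc_index +1
        print("[",doc_index,"] (",rows[row_index][0],") \'",rows[row_index][1],"\'", [ "{0:.5f}".format(weight) for weight in doc_topics[doc_index]])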
        

LDA Topics
------------------------------
 Topic  0  :
[ system | circuit | research | develop | application | quantum | design | technique | information | have ]
[ 46 ] ( EU100046 ) ' Ensemble based advanced quantum light matter interfaces ' ['0.98294', '0.01706']
[ 64 ] ( EU100066 ) ' Fundamental Physics at the Low Background Frontier ' ['0.97616', '0.02384']
------------------------------
 Topic  1  :
[ research | project | have | system | datum | develop | study | model | technology | approach ]
[ 37 ] ( EU100037 ) ' Models for Optimising Dynamic Urban Mobility ' ['0.00759', '0.99241']
[ 36 ] ( EU100036 ) ' A network for supporting the coordination of Supercomputing research between Europe and Latin America ' ['0.00962', '0.99038']


### Doc-Topic Matrix


In [22]:
from IPython.display import display, HTML
import pandas as pd
#pd.set_option('display.max_columns', None)

topicnames = ["topic"+ str(x) for x in range(0, lda.n_components)]
norm_doc_topics = []
for i in doc_topics:
  norm_doc_topics.append([ "{0:.3f}".format(weight) for weight in i])

df = pd.DataFrame(norm_doc_topics, 
                  columns=topicnames, 
                  index=dataset_df['TITLE'].tolist())

data_table.DataTable(df, num_rows_per_page=10)

,topic0,topic1
Visual object population codes relating human brains to nonhuman and computational models with representational similarity analysis,0.024,0.976
New Opportunities for Research Funding Agency Co-operation in Europe II,0.015,0.985
USA and Europe Cooperation in Mini UAVs,0.017,0.983
Sustainable Infrastructure for Resilient Urban Environments,0.016,0.984
Modelling star formation in the local universe,0.030,0.970
...,...,...
The Environmental Observation Web and its Service Applications within the Future Internet,0.522,0.478
Future INternet for Smart ENergY,0.011,0.989
Smart Food and Agribusiness: Future Internet for Safe and Healthy Food from Farm to Fork,0.014,0.986
Instant Mobility for Passengers and Goods,0.023,0.977


### Topic-Word Matrix

In [23]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis])

# Assign Column and Index
df_topic_keywords.columns = dictionary
df_topic_keywords.index = topicnames

# View
data_table.DataTable(df_topic_keywords, num_rows_per_page=10)
#df_topic_keywords.head()



,1mbit,2011,2nd,ability,ablation,absorb,absorption,abstraction,academia,acare,...,workshops,workstation,world,write,www,year,yes,yield,zebrafish,zone
topic0,0.000242,0.000227,0.000225,0.000837,0.000427,0.000245,0.000481,0.000436,0.000431,0.000224,...,0.000225,0.00023,0.000576,0.00043,0.000226,0.000436,0.000229,0.000469,0.001058,0.000236
topic1,0.000168,0.000174,0.000174,0.000529,9.2e-05,0.000167,0.000158,0.000175,0.000264,0.000262,...,0.000174,0.000259,0.000981,9.1e-05,0.000174,0.00147,0.000173,0.000161,9.3e-05,0.000429


### Top 10 Words per Topic

In [24]:
def show_topics(vectorizer=tf_vectorizer, lda_model=lda, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=tf_vectorizer, lda_model=lda, n_words=10)

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
data_table.DataTable(df_topic_keywords, num_rows_per_page=10)

,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9
Topic 0,system,circuit,research,develop,application,quantum,design,technique,information,have
Topic 1,research,project,have,system,datum,develop,study,model,technology,approach


### Topic Inference

In [10]:
text = "we develop a project to process models in a large-scale" #@param {type:"string"}

print("Topic Distribution: ", lda.transform(tf_vectorizer.transform([text])))


Topic Distribution:  [[0.18179209 0.18654582 0.63166209]]


### Diagnose model performance

A good model has a higher log-likelihood and a lower perplexity, where perplexity = exp(-1 * log-likelihood per word).
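The relation between the two measures can be sketched in a few lines. The function below applies the formula perplexity = exp(-log-likelihood / number of words) to hypothetical numbers; `perplexity_from_log_likelihood` and the toy values are illustrative names and figures, not results from this notebook.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    return math.exp(-log_likelihood / n_tokens)


# Hypothetical corpus of 1000 tokens in which every token is assigned
# probability 1/100, so the total log-likelihood is 1000 * log(1/100).
toy_tokens = 1000
toy_log_likelihood = toy_tokens * math.log(1 / 100)

toy_perplexity = perplexity_from_log_likelihood(toy_log_likelihood, toy_tokens)
print("Toy perplexity:", round(toy_perplexity, 6))

# A higher (less negative) log-likelihood on the same corpus gives a lower perplexity.
better_perplexity = perplexity_from_log_likelihood(toy_log_likelihood / 2, toy_tokens)
print("Perplexity with half the negative log-likelihood:", round(better_perplexity, 6))
```

In this notebook the token count would correspond to `bag_of_words.sum()`. Note that the value returned by `lda.score` is an approximate log-likelihood, so the perplexity derived from it is an approximation as well.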

In [11]:
# Log Likelihood: higher is better
print("Log Likelihood: ", lda.score(bag_of_words))

# Perplexity: lower is better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda.perplexity(bag_of_words))

# See model parameters
print(lda)

Log Likelihood:  -81997.84520747705
Perplexity:  1456.101534014788
LatentDirichletAllocation(doc_topic_prior=1.0, evaluate_every=1,
                          learning_method='online', max_iter=25, n_components=3,
                          n_jobs=-1, random_state=0, topic_word_prior=1.0,
                          verbose=1)


Determine the best LDA model:

In [12]:
from sklearn.model_selection import GridSearchCV

# Define Search Param
search_params = {'n_components': [5, 10, 15], 'doc_topic_prior': [.1, .3, .5], 'topic_word_prior': [.01, .03, .05]}

# Init the Model (a separate name keeps the fitted `lda` from above intact)
base_lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50., random_state=0)

# Init Grid Search Class
model = GridSearchCV(base_lda, param_grid=search_params)

# Do the Grid Search
model.fit(bag_of_words)
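To get a feeling for the cost of this search, the following sketch enumerates the parameter combinations of the grid above with the standard library. The fold count of 5 is an assumption based on the default `cv` of recent scikit-learn versions, and the names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as in the cell above.
search_params = {
    'n_components': [5, 10, 15],
    'doc_topic_prior': [.1, .3, .5],
    'topic_word_prior': [.01, .03, .05],
}

# Every combination of one value per parameter is one candidate model.
names = list(search_params)
candidates = [dict(zip(names, values)) for values in product(*search_params.values())]

# Assumption: 5-fold cross-validation (the GridSearchCV default in recent versions).
n_folds = 5
n_fits = len(candidates) * n_folds

print("Candidates:", len(candidates))
print("Model fits during the search:", n_fits)
print("First candidate:", candidates[0])
```

Each candidate is scored with the estimator's own `score` method on the held-out folds, so adding values to the grid multiplies the number of fits.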

Read the best configuration:

In [13]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Mean cross-validated score (approximate log-likelihood) of the best configuration
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(bag_of_words))

Best Model's Params:  {'doc_topic_prior': 0.1, 'n_components': 5, 'topic_word_prior': 0.05}
Best Log Likelihood Score:  -36129.05740067852
Model Perplexity:  8484.752532998951


### Topic-based Document Similarity

In [14]:
# Similarity = 1 - Jensen-Shannon distance (the square root of the Jensen-Shannon divergence)
from scipy.spatial import distance

for i1,d1 in enumerate(doc_topics[0:10]):
   for i2,d2 in enumerate(doc_topics[0:10]):
      print(rows[i1+1][1],"-", rows[i2+1][1],":", 1-distance.jensenshannon(d1, d2))

Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis : 1.0
Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - New Opportunities for Research Funding Agency Co-operation in Europe II : 0.72163322260343
Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - USA and Europe Cooperation in Mini UAVs : 0.7250288135188105
Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - Sustainable Infrastructure for Resilient Urban Environments : 0.7264859724678631
Visual object population codes  relating human brains to nonhuman and computational mode
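The similarity printed above is one minus the Jensen-Shannon distance between two topic distributions. `scipy.spatial.distance.jensenshannon` uses the natural logarithm by default, so the distance is at most sqrt(ln 2) ≈ 0.83 and the similarity never reaches 0. The sketch below is a hand-written version with base-2 logarithms, which bounds the similarity between 0 and 1; the function names and toy distributions are hypothetical.

```python
import math


def jensen_shannon_distance(p, q, base=2.0):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log(a / b, base) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))


def topic_similarity(p, q):
    """Similarity in [0, 1]: 1 for identical distributions, 0 for disjoint ones."""
    return 1.0 - jensen_shannon_distance(p, q, base=2.0)


# Hypothetical document-topic distributions.
doc_a = [0.9, 0.1]
doc_b = [0.8, 0.2]
doc_c = [0.0, 1.0]

print("a vs a:", topic_similarity(doc_a, doc_a))
print("a vs b:", topic_similarity(doc_a, doc_b))
print("a vs c:", topic_similarity(doc_a, doc_c))
```

With SciPy, the same scaling should be obtainable by passing `base=2` to `distance.jensenshannon`.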

You can tune the `CountVectorizer` settings (for example, stop words and the token pattern) to clean the input text, or use an already processed corpus.

Save a copy of this [file](https://goo.gl/RF4cWB) in the `Google Colab/` folder of your Google Drive.