<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/probabilistic_topic_models/LDA_Topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Notebook serves as an introduction to **Probabilistic Topic Models**. 

Text is loaded from a Google Sheet, and topics are then extracted from it with Latent Dirichlet Allocation (LDA).

First we install `gspread`, which we will use with our Google Account credentials to access the corpus hosted on Google Drive, and import the scikit-learn classes we need.

In [6]:
!pip install --upgrade -q gspread

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Loading Data

[CORDIS](https://cordis.europa.eu/projects/en) is the primary source of results from EU-funded projects since 1990. 

We use a [sample stored in a spreadsheet](https://goo.gl/dG8eVF) with 100 such projects. Save a copy in the `Google Colab/` folder of your Google Drive.

A DataFrame is created from that Google Sheet:


In [7]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# name of the spreadsheet with texts (e.g.'texts_nlp')
corpus = 'texts_nlp'

worksheet = gc.open(corpus).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
            
# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows[1:], columns=["ID","TITLE","DESCRIPTION"])
data_table.DataTable(dataset_df, include_index=False, num_rows_per_page=5)


Unnamed: 0,ID,TITLE,DESCRIPTION
0,EU100000,Visual object population codes relating human...,representation brain-activity datum acquire re...
1,EU100001,New Opportunities for Research Funding Agency ...,norface action research funding agency norface...
2,EU100002,USA and Europe Cooperation in Mini UAVs,aerial systems have area research year world r...
3,EU100003,Sustainable Infrastructure for Resilient Urban...,fellowship identify space infrastructure influ...
4,EU100004,Modelling star formation in the local universe,goal proposal revolutionize understanding star...
...,...,...,...
95,EU100097,The Environmental Observation Web and its Serv...,community generate amount observation scale co...
96,EU100098,Future INternet for Smart ENergY,energy sector have enter period change continu...
97,EU100099,Smart Food and Agribusiness: Future Internet f...,smartagrifood project address food agribusines...
98,EU100100,Instant Mobility for Passengers and Goods,mobility project have create concept transport...


### Data Cleaning

Now, let's build the bag-of-words (BoW) representation of each text with [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

It is a [scikit-learn](https://scikit-learn.org/stable/index.html) class that turns strings into bag-of-words count vectors.
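
As a rough illustration of what a bag-of-words is, the following pure-Python sketch counts tokens per document using the same lowercasing and token pattern as the cell below. The two toy sentences and the helper name `bag_of_words_sketch` are made up for this example; `CountVectorizer` itself returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Same pattern as the notebook's CountVectorizer: alphanumeric runs of 3+ characters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9]{3,}")

def bag_of_words_sketch(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Toy documents (hypothetical, not taken from the CORDIS sample).
toy_docs = ["Topic models model text", "A text about quantum models"]
vocab, matrix = bag_of_words_sketch(toy_docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary term, so word order is discarded and only counts remain. Note that the one-letter token "A" is dropped by the pattern.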


In [9]:
# from sklearn.feature_extraction import text 
# my_additional_stop_words = ['xxx','yyy']
# my_stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# list of texts
documents = dataset_df['DESCRIPTION'].tolist()

# bag-of-words
tf_vectorizer = CountVectorizer(
    stop_words=[],
    min_df=1,
    max_df=1.0,
    lowercase=True,
    max_features=50000,
    token_pattern='[a-zA-Z0-9]{3,}',  
    analyzer = 'word'
)
bag_of_words = tf_vectorizer.fit_transform(documents)
dictionary = tf_vectorizer.get_feature_names_out()
vocabulary = tf_vectorizer.vocabulary_

print("Vocabulary Size: ", len(dictionary))

Vocabulary Size:  2424


The 20 most frequent terms in the corpus:

In [10]:
# total count of each term over all documents
s = bag_of_words.toarray().sum(axis=0)
# term indices sorted by decreasing frequency
st = sorted(range(len(s)), key=lambda k: s[k], reverse=True)
for x in st[:20]:
  print(dictionary[x],s[x])

research 147
system 131
project 126
have 119
develop 87
study 82
datum 72
model 70
technology 70
approach 63
design 63
provide 54
application 53
need 51
process 51
base 50
challenge 48
propose 47
development 46
support 46


### Topic Model

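Now it's time to build an LDA-based model by setting values for:
- the number of topics (`n_components`)
- alpha, the prior on the document-topic distributions (`doc_topic_prior`)
- beta, the prior on the topic-word distributions (`topic_word_prior`)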
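
Both priors are concentration parameters of symmetric Dirichlet distributions. To get a feel for what the concentration does, the sketch below draws samples from a symmetric Dirichlet using only the standard library (normalized Gamma draws). The helper names, the seed, and the two alpha values are illustrative choices, not part of the notebook.

```python
import random

def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw one sample from Dirichlet(alpha, ..., alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the run is repeatable
k = 3                   # number of topics, as in the notebook

def mean_largest_share(alpha, n=2000):
    """Average weight of the dominant topic over n sampled distributions."""
    return sum(max(sample_symmetric_dirichlet(alpha, k, rng)) for _ in range(n)) / n

sparse = mean_largest_share(0.1)   # small alpha: mass tends to concentrate on one topic
smooth = mean_largest_share(10.0)  # large alpha: mass tends to spread across topics
print(f"alpha=0.1  -> mean largest share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest share {smooth:.2f}")
```

Smaller values therefore favour documents dominated by few topics (alpha) and topics dominated by few words (beta), while larger values favour more even mixtures.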

In [13]:

topics = 3 

alpha = 1.0

beta = 1.0

# Run LDA
lda = LatentDirichletAllocation(
    n_components=topics, 
    doc_topic_prior=alpha, 
    topic_word_prior=beta, 
    max_iter=25, 
    learning_method='online', 
    evaluate_every=1,
    n_jobs = -1,
    random_state=0,
    verbose=1)
lda.fit(bag_of_words)


iteration: 1 of max_iter: 25, perplexity: 2173.0615
iteration: 2 of max_iter: 25, perplexity: 1760.6362
iteration: 3 of max_iter: 25, perplexity: 1624.9904
iteration: 4 of max_iter: 25, perplexity: 1564.1700
iteration: 5 of max_iter: 25, perplexity: 1531.3050
iteration: 6 of max_iter: 25, perplexity: 1511.4544
iteration: 7 of max_iter: 25, perplexity: 1498.5120
iteration: 8 of max_iter: 25, perplexity: 1489.5498
iteration: 9 of max_iter: 25, perplexity: 1483.0316
iteration: 10 of max_iter: 25, perplexity: 1478.1454
iteration: 11 of max_iter: 25, perplexity: 1474.3959
iteration: 12 of max_iter: 25, perplexity: 1471.4298
iteration: 13 of max_iter: 25, perplexity: 1468.9948
iteration: 14 of max_iter: 25, perplexity: 1466.9152
iteration: 15 of max_iter: 25, perplexity: 1465.1127
iteration: 16 of max_iter: 25, perplexity: 1463.5747
iteration: 17 of max_iter: 25, perplexity: 1462.2627
iteration: 18 of max_iter: 25, perplexity: 1461.1277
iteration: 19 of max_iter: 25, perplexity: 1460.1305
...

Explore the topics: for each one, print its top words and the documents in which it has the highest weight.

In [14]:
no_top_words = 10
no_top_documents = 5

doc_topics = lda.transform(bag_of_words)
topics = lda.components_

print("LDA Topics")
for topic_idx, topic in enumerate(topics):
    print("-"*30)
    print(" Topic ",(topic_idx)," :")
    print("["," | ".join([dictionary[i]
                    for i in topic.argsort()[:-no_top_words - 1:-1]]),"]")
    top_doc_indices = np.argsort( doc_topics[:,topic_idx] )[::-1][0:no_top_documents]
    for doc_index in top_doc_indices:
        row_index = doc_index +1
        print("[",doc_index,"] (",rows[row_index][0],") \'",rows[row_index][1],"\'", [ "{0:.5f}".format(weight) for weight in doc_topics[doc_index]])
        

LDA Topics
------------------------------
 Topic  0  :
[ quantum | matter | circuit | spin | background | chip | antenna | transfer | measurement | memory ]
[ 46 ] ( EU100046 ) ' Ensemble based advanced quantum light matter interfaces ' ['0.94759', '0.01495', '0.03746']
[ 64 ] ( EU100066 ) ' Fundamental Physics at the Low Background Frontier ' ['0.83504', '0.01413', '0.15083']
[ 5 ] ( EU100005 ) ' Coherent spin manipulation in hybrid nanostructures ' ['0.81805', '0.01677', '0.16518']
[ 69 ] ( EU100071 ) ' Monolithic Integrated Antennas ' ['0.81488', '0.01519', '0.16994']
[ 73 ] ( EU100075 ) ' Neural Circuits Underlying Visually Guided Behaviour ' ['0.71292', '0.11286', '0.17422']
------------------------------
 Topic  1  :
[ cell | have | study | disease | model | mechanism | project | response | brain | mirna ]
[ 57 ] ( EU100059 ) ' Mechanisms of Inflammation Resolution: Role of miRNAs ' ['0.00952', '0.97731', '0.01317']
[ 29 ] ( EU100029 ) ' METABOLIC CONTROL OF IMMATURE STROMAL CELL

### Doc-Topic Matrix


In [15]:
from IPython.display import display, HTML
import pandas as pd
#pd.set_option('display.max_columns', None)

topicnames = ["topic"+ str(x) for x in range(0, lda.n_components)]

df = pd.DataFrame(doc_topics, 
                  columns=topicnames, 
                  index=dataset_df['TITLE'].tolist())

data_table.DataTable(df, num_rows_per_page=10)

Unnamed: 0,topic0,topic1,topic2
Visual object population codes relating human brains to nonhuman and computational models with representational similarity analysis,0.018314,0.255246,0.726440
New Opportunities for Research Funding Agency Co-operation in Europe II,0.011337,0.012227,0.976436
USA and Europe Cooperation in Mini UAVs,0.013460,0.013440,0.973100
Sustainable Infrastructure for Resilient Urban Environments,0.013975,0.014040,0.971985
Modelling star formation in the local universe,0.019120,0.017476,0.963404
...,...,...,...
The Environmental Observation Web and its Service Applications within the Future Internet,0.012157,0.012279,0.975565
Future INternet for Smart ENergY,0.008933,0.011304,0.979764
Smart Food and Agribusiness: Future Internet for Safe and Healthy Food from Farm to Fork,0.010896,0.010979,0.978125
Instant Mobility for Passengers and Goods,0.016250,0.016362,0.967387


### Topic-Word Matrix

In [16]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis])

# Assign Column and Index
df_topic_keywords.columns = dictionary
df_topic_keywords.index = topicnames

# View
data_table.DataTable(df_topic_keywords, num_rows_per_page=10)
#df_topic_keywords.head()



Unnamed: 0,1mbit,2011,2nd,ability,ablation,absorb,absorption,abstraction,academia,acare,...,workshops,workstation,world,write,www,year,yes,yield,zebrafish,zone
topic0,0.000339,0.000335,0.000337,0.000658,0.000621,0.000368,0.000365,0.000336,0.00034,0.000336,...,0.000334,0.000339,0.000353,0.000339,0.000338,0.000348,0.000337,0.00034,0.001549,0.000357
topic1,0.000407,0.000222,0.000221,0.000755,0.00023,0.000234,0.000245,0.000223,0.000226,0.000226,...,0.000221,0.000224,0.000606,0.000223,0.000223,0.000757,0.000221,0.000224,0.000237,0.000455
topic2,0.000104,0.000184,0.000185,0.000499,9.9e-05,0.00017,0.000259,0.000276,0.000364,0.000275,...,0.000185,0.000275,0.001018,0.000183,0.000184,0.00141,0.000184,0.000274,0.000102,0.000351


### Top 10 Words per Topic

In [18]:
def show_topics(vectorizer=tf_vectorizer, lda_model=lda, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=tf_vectorizer, lda_model=lda, n_words=10)

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
data_table.DataTable(df_topic_keywords, num_rows_per_page=10)

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9
Topic 0,quantum,matter,circuit,spin,background,chip,antenna,transfer,measurement,memory
Topic 1,cell,have,study,disease,model,mechanism,project,response,brain,mirna
Topic 2,research,system,project,have,develop,datum,technology,design,study,application


### Topic Inference

In [19]:
text = "we develop a project to process models in a large-scale" #@param {type:"string"}

print("Topic Distribution: ", lda.transform(tf_vectorizer.transform([text])))


Topic Distribution:  [[0.18179209 0.18654582 0.63166209]]


### Diagnose model performance

When comparing models on the same data, the one with the higher log-likelihood and the lower perplexity is considered better, where perplexity = exp(-1 × log-likelihood per word).
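
The relation between the two numbers can be checked by hand. The sketch below plugs in the log-likelihood and perplexity printed in the output of the next cell and solves for the number of word tokens; that count is inferred from the two printed values, not read from the corpus.

```python
import math

# Values printed by the notebook for the 3-topic model.
log_likelihood = -81997.84520747705
perplexity = 1456.101534014788

# perplexity = exp(-log_likelihood / n_tokens)  =>  n_tokens = -log_likelihood / ln(perplexity)
n_tokens = -log_likelihood / math.log(perplexity)
print(f"implied token count: {n_tokens:.1f}")

# Going the other way: recompute perplexity from the rounded token count.
recomputed = math.exp(-log_likelihood / round(n_tokens))
print(f"recomputed perplexity: {recomputed:.1f}")
```

The implied count comes out at roughly 11,258 and is very close to a whole number, as expected if the normalizer is the total number of word occurrences in the bag-of-words matrix.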

In [20]:
# Log Likelihood: the higher the better
print("Log Likelihood: ", lda.score(bag_of_words))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda.perplexity(bag_of_words))

# See model parameters
print(lda)

Log Likelihood:  -81997.84520747705
Perplexity:  1456.101534014788
LatentDirichletAllocation(doc_topic_prior=1.0, evaluate_every=1,
                          learning_method='online', max_iter=25, n_components=3,
                          n_jobs=-1, random_state=0, topic_word_prior=1.0,
                          verbose=1)


Search for the best LDA configuration with a grid search over the number of topics and both priors:

In [21]:
from sklearn.model_selection import GridSearchCV

# Define Search Param
search_params = {'n_components': [5, 10, 15], 'doc_topic_prior': [.1, .3, .5], 'topic_word_prior': [.01, .03, .05]}

# Init the Model (a separate name keeps the fitted `lda` from above intact)
lda_base = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)

# Init Grid Search Class
model = GridSearchCV(lda_base, param_grid=search_params)

# Do the Grid Search
model.fit(bag_of_words)
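
Since no `scoring` is passed, `GridSearchCV` ranks candidates by the estimator's own `score` method, which for `LatentDirichletAllocation` is the approximate log-likelihood, evaluated on each held-out fold. The size of the search can be worked out with a short standard-library sketch; the 5-fold figure assumes scikit-learn's default `cv`, and the extra fit assumes the default `refit=True`.

```python
from itertools import product

# Same search space as the notebook cell.
search_params = {
    'n_components': [5, 10, 15],
    'doc_topic_prior': [.1, .3, .5],
    'topic_word_prior': [.01, .03, .05],
}

# Every combination of the three lists is one candidate configuration.
candidates = [dict(zip(search_params, values))
              for values in product(*search_params.values())]

cv_folds = 5  # assumption: GridSearchCV's default when `cv` is not given
total_fits = len(candidates) * cv_folds + 1  # +1 for the final refit on all data

print(len(candidates), "candidates,", total_fits, "model fits")
```

Because each score is computed on a held-out fold rather than on the whole corpus, it is not directly comparable with the log-likelihood printed earlier for the full bag-of-words matrix.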

Read the best configuration:

In [22]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Mean cross-validated log-likelihood (computed on the held-out folds)
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(bag_of_words))

Best Model's Params:  {'doc_topic_prior': 0.1, 'n_components': 5, 'topic_word_prior': 0.05}
Best Log Likelihood Score:  -36129.05740067852
Model Perplexity:  8484.752532998951


### Topic-based Document Similarity

In [23]:
# Topic-based similarity: 1 - Jensen-Shannon distance
# (scipy returns the distance, i.e. the square root of the JS divergence)
from scipy.spatial import distance

for i1,d1 in enumerate(doc_topics[0:10]):
    for i2,d2 in enumerate(doc_topics[0:10]):
        print(rows[i1+1][1],"-", rows[i2+1][1],":", 1-distance.jensenshannon(d1, d2))

Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis : 1.0
Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - New Opportunities for Research Funding Agency Co-operation in Europe II : 0.72163322260343
Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - USA and Europe Cooperation in Mini UAVs : 0.7250288135188105
Visual object population codes  relating human brains to nonhuman and computational models with representational similarity analysis - Sustainable Infrastructure for Resilient Urban Environments : 0.7264859724678631
Visual object population codes  relating human brains to nonhuman and computational mode
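
`scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, that is, the square root of the divergence, and uses the natural logarithm unless `base` is given. Below is a minimal pure-Python sketch of that computation, applied to the first two rows of the doc-topic table above (values copied from the table, so slightly rounded). The function name `js_distance` is introduced here for illustration only.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the divergence), natural log."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding

# First two rows of the doc-topic table above.
d0 = [0.018314, 0.255246, 0.726440]
d1 = [0.011337, 0.012227, 0.976436]

similarity = 1 - js_distance(d0, d1)
print(f"{similarity:.4f}")
```

With the natural logarithm the distance is bounded by sqrt(ln 2) ≈ 0.83, so `1 - distance` never reaches 0 even for completely different distributions. Passing `base=2` to the SciPy function rescales the distance to the range [0, 1].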

You can tune the `CountVectorizer` settings to clean the input text, or use an already preprocessed corpus.

Save a copy of this [file](https://goo.gl/RF4cWB) in the `Google Colab/` folder of your Google Drive.