# Topic Modeling on Research Papers

We will do an interesting exercise here: build topic models on past research papers
from the very popular NIPS conference (now known as the NeurIPS conference). The
late professor Sam Roweis compiled an excellent collection of NIPS conference papers
from Volumes 1 to 12, which you can find at https://cs.nyu.edu/~roweis/data.html.
An interesting fact is that he obtained this by massaging the OCR’d data from NIPS
1-12, which was the pre-electronic submission era. Yann LeCun made the data
available. There is a more recent dataset, covering up to NIPS 17, available at
http://ai.stanford.edu/~gal/data.html. However, that dataset is in the form of a MAT file, so
you might need to do some additional preprocessing before working on it in Python.


# The Main Objective

Considering our discussion so far, our main objective is pretty simple. Given a whole
bunch of conference research papers, can we identify some key themes or topics from
these papers by leveraging unsupervised learning? We do not have the liberty of labeled
categories telling us what the major themes of every research paper are. Besides that, we
are dealing with text data extracted using OCR (optical character recognition). Hence,
you can expect misspelled words, words with characters missing, and so on, which
makes our problem even more challenging.

# Download Data and Dependencies

In [None]:
!wget https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz
!tar -xzf nips12raw_str602.tgz

--2023-12-21 17:57:27--  https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz
Resolving cs.nyu.edu (cs.nyu.edu)... 216.165.22.203
Connecting to cs.nyu.edu (cs.nyu.edu)|216.165.22.203|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12851423 (12M) [application/x-gzip]
Saving to: ‘nips12raw_str602.tgz’


2023-12-21 17:57:27 (24.9 MB/s) - ‘nips12raw_str602.tgz’ saved [12851423/12851423]



In [None]:
!pip install --upgrade -q gensim==3.6
!sed -i 's/collections import Mapping, defaultdict/collections.abc import Mapping;from collections import defaultdict;/' /usr/local/lib/python3.10/dist-packages/gensim/corpora/dictionary.py

  Preparing metadata (setup.py) ... done
  Building wheel for gensim (setup.py) ... done


In [None]:
!pip install tqdm
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import os
import numpy as np
import pandas as pd

DATA_PATH = 'nipstxt/'
print(os.listdir(DATA_PATH))

['nips05', 'nips08', 'MATLAB_NOTES', 'nips06', 'RAW_DATA_NOTES', 'nips09', 'idx', 'nips11', 'nips12', 'nips10', 'README_yann', 'nips01', 'nips02', 'nips03', 'nips07', 'nips00', 'nips04', 'orig']


# Load NIPS Research Papers Data

In [None]:
folders = ["nips{0:02}".format(i) for i in range(0,13)]
# Read all texts into a list.
papers = []
for folder in folders:
    file_names = os.listdir(DATA_PATH + folder)
    for file_name in file_names:
        # Read-only mode is enough here; the files are never written to.
        with open(DATA_PATH + folder + '/' + file_name, encoding='utf-8', errors='ignore', mode='r') as f:
            data = f.read()
        papers.append(data)
len(papers)

1740

In [None]:
print(papers[3][:1000])

72 
ANALYSIS AND COMPARISON OF DIFFERENT LEARNING 
ALGORITHMS FOR PATTERN ASSOCIATION PROBLEMS 
J. Bernasconi 
Brown Boveri Research Center 
CH-5405 Baden, Switzerland 
ABSTRACT 
We investigate the behavior of different learning algorithms 
for networks of neuron-like units. As test cases we use simple pat- 
tern association problems, such as the XOR-problem and symmetry de- 
tection problems. The algorithms considered are either versions of 
the Boltzmann machine learning rule or based on the backpropagation 
of errors. We also propose and analyze a generalized delta rule for 
linear threshold units. We find that the performance of a given 
learning algorithm depends strongly on the type of units used. In 
particular, we observe that networks with 1 units quite generally 
exhibit a significantly better learning behavior than the correspon- 
ding 0,1 versions. We also demonstrate that an adaption of the 
weight-structure to the symmetries of the problem can lead to a 
drastic increase 

# Basic Text Pre-processing

We perform some basic text wrangling or preprocessing before diving into topic
modeling. We keep things simple here.

In [None]:
%%time

import tqdm
import nltk

stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+') # can also use nltk.word_tokenize to get word tokens for each paper
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def normalize_corpus(papers):
    norm_papers = []
    for paper in tqdm.tqdm(papers):
        paper = paper.lower()
        paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
        paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
        paper_tokens = [token for token in paper_tokens if len(token) > 1] # removing any single character words \ numbers \ symbols
        paper_tokens = [token for token in paper_tokens if token not in stop_words]
        paper_tokens = list(filter(None, paper_tokens))
        if paper_tokens:
            norm_papers.append(paper_tokens)

    return norm_papers

norm_papers = normalize_corpus(papers)
print(len(norm_papers))

100%|██████████| 1740/1740 [00:36<00:00, 47.35it/s]

1740
CPU times: user 35.8 s, sys: 417 ms, total: 36.2 s
Wall time: 36.8 s





In [None]:
print(norm_papers[0][:50])

['neural', 'network', 'template', 'matching', 'application', 'real', 'time', 'classification', 'action', 'potential', 'real', 'neuron', 'yiu', 'fai', 'wong', 'jashojiban', 'banik', 'james', 'bower', 'division', 'engineering', 'applied', 'science', 'division', 'biology', 'california', 'institute', 'technology', 'pasadena', 'ca', 'abstract', 'much', 'experimental', 'study', 'real', 'neural', 'network', 'relies', 'proper', 'classification', 'extracellulary', 'sampled', 'neural', 'signal', 'action', 'potential', 'recorded', 'brain', 'ex', 'perimental']
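The tokens 'ex' and 'perimental' at the end of this output come from a word that was hyphenated across a line break in the OCR'd text. One possible refinement, which is not part of the original pipeline, is to rejoin such words before tokenizing. The following is a minimal sketch; the helper name `rejoin_hyphenated_words` is hypothetical, and the regular expression is an assumption about how the line breaks look in the raw files.

```python
import re

# Hypothetical helper (not in the original notebook): joins a word that was
# split by a hyphen at the end of a line, e.g. "pat- \ntern" -> "pattern".
_HYPHEN_BREAK = re.compile(r'(\w)-[ \t]*\n[ \t]*(\w)')

def rejoin_hyphenated_words(text):
    """Remove a hyphen plus line break that sits between two word characters."""
    return _HYPHEN_BREAK.sub(r'\1\2', text)

sample = "As test cases we use simple pat- \ntern association problems"
print(rejoin_hyphenated_words(sample))
```

Note that this also merges genuine compounds that happen to break at a line end (for example, "rule-" followed by "based" becomes "rulebased"), so it trades one kind of noise for another and should be checked on a sample of papers first.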


# Build a Bi-gram Phrase Model

Before feature engineering and vectorization, we want to extract some useful bi-gram
based phrases from our research papers and remove some unnecessary terms. We
leverage the very useful gensim.models.Phrases class for this. It automatically
detects common phrases (multi-word expressions or word n-grams) from a stream of
sentences.

This implementation draws inspiration
from the famous paper by Mikolov, et al., “Distributed Representations of Words and
Phrases and their Compositionality,” which you can check out at
https://arxiv.org/abs/1310.4546. We start by extracting and generating words and
bi-grams as phrases for each tokenized research paper.

We set the `min_count` parameter to 20, so the model ignores all words and bi-grams whose total
collected count across the corpus is lower than 20. Here each tokenized paper is passed to
the model as one "sentence". We also use a `threshold` of 20, so a phrase of word a
followed by word b is accepted only if the score of the phrase is greater than 20. How the
score is computed depends on the `scoring` parameter.
Typically the default scorer is used and it’s pretty straightforward to understand.
You can check out further details in the documentation at https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.original_scorer and in the
previously mentioned research paper.
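To make the role of `min_count` and `threshold` concrete, here is a minimal pure-Python sketch of the default scoring formula as described in the gensim documentation linked above. The function name `bigram_score` and all counts are hypothetical and chosen for illustration; they are not taken from the NIPS corpus.

```python
def bigram_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Score a candidate bi-gram 'a b' following the default scorer formula:
    (count(a, b) - min_count) / count(a) / count(b) * vocabulary_size."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

THRESHOLD = 20  # same value as the threshold used in this section

# Hypothetical counts: the pair co-occurs often relative to its parts.
frequent_pair = bigram_score(worda_count=100, wordb_count=200,
                             bigram_count=70, len_vocab=10000, min_count=20)

# Hypothetical counts: the pair barely clears min_count.
rare_pair = bigram_score(worda_count=100, wordb_count=200,
                         bigram_count=25, len_vocab=10000, min_count=20)

print(frequent_pair, frequent_pair > THRESHOLD)
print(rare_pair, rare_pair > THRESHOLD)
```

A pair is merged into a single token such as `neural_network` only when its score exceeds the threshold, which is why a higher threshold yields fewer phrases.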

In [None]:
import gensim
gensim.__version__ # version 3.6.0-3.8.3 needed to run MALLET LDA models

'3.6.0'

In [None]:
import gensim

bigram = gensim.models.Phrases(norm_papers, min_count=20, threshold=20, delimiter=b'_') # higher threshold fewer phrases.
bigram_model = gensim.models.phrases.Phraser(bigram)

print(bigram_model[norm_papers[0]][:50]) # very similar to using ngram_range=(1,2) in count vectorizer

['neural_network', 'template_matching', 'application', 'real_time', 'classification', 'action_potential', 'real', 'neuron', 'yiu', 'fai', 'wong', 'jashojiban', 'banik', 'james', 'bower', 'division', 'engineering', 'applied', 'science', 'division_biology', 'california_institute', 'technology_pasadena', 'ca_abstract', 'much', 'experimental', 'study', 'real', 'neural_network', 'relies', 'proper', 'classification', 'extracellulary', 'sampled', 'neural', 'signal', 'action_potential', 'recorded', 'brain', 'ex', 'perimental', 'animal', 'neurophysiology', 'laboratory', 'classification', 'task', 'simplified', 'limiting', 'investigation', 'single', 'electrically']


In [None]:
norm_corpus_bigrams = [bigram_model[doc] for doc in norm_papers]

# Create a dictionary representation of the documents.
dictionary = gensim.corpora.Dictionary(norm_corpus_bigrams)
print('Sample word to number mappings:', list(dictionary.items())[:15])
print('Total Vocabulary Size:', len(dictionary))

Sample word to number mappings: [(0, '11o'), (1, '2a'), (2, '2c'), (3, '2d'), (4, '2nl'), (5, '3a'), (6, '3b'), (7, '3c'), (8, '4a'), (9, '4b'), (10, '4c'), (11, '4d'), (12, '4n'), (13, '4n2'), (14, '4rnn')]
Total Vocabulary Size: 78892


Looks like we have a lot of unique phrases in our corpus of research papers,
based on the preceding output. Several of these terms are not very useful since they are
specific to a paper or even a paragraph in a research paper. Hence, it is time to prune
our vocabulary and start removing terms. Leveraging document frequency is a great way
to achieve this.

In [None]:
# Filter out terms that occur in fewer than 20 documents or in more than 60% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.6) # similar to min_df and max_df in count vectorizer
print('Total Vocabulary Size:', len(dictionary))

Total Vocabulary Size: 7756


We removed all terms that occur in fewer than 20 documents and all
terms that occur in more than 60% of all the documents. We are interested in finding
distinct themes and topics rather than terms that recur across most papers. Hence, this
suits our scenario well.
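The document-frequency filter can be illustrated without gensim. The sketch below is an approximation of what `filter_extremes` does, not its actual implementation; the function name `prune_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def prune_vocabulary(docs, no_below=2, no_above=0.6):
    """Keep terms whose document frequency lies within [no_below, no_above * num_docs]."""
    num_docs = len(docs)
    # Count each term once per document (document frequency, not term frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * num_docs
    return {term for term, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ['neural_network', 'training', 'model'],
    ['neural_network', 'spike', 'model'],
    ['kernel', 'training', 'model'],
    ['kernel', 'bound', 'model'],
    ['neuron', 'spike', 'model'],
]

kept = prune_vocabulary(toy_docs, no_below=2, no_above=0.6)
print(sorted(kept))
```

In this toy corpus, 'model' is dropped because it appears in every document, while 'bound' and 'neuron' are dropped because each appears in only one.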

# Transforming corpus into bag of words vectors

We can now perform feature engineering by leveraging a simple Bag of Words
model.

In [None]:
# Transforming corpus into bag of words vectors
bow_corpus = [dictionary.doc2bow(text) for text in norm_corpus_bigrams]
print(bow_corpus[1][:50])

[(11, 1), (12, 1), (19, 1), (28, 3), (38, 1), (39, 3), (40, 1), (48, 4), (49, 4), (52, 1), (58, 1), (66, 1), (70, 1), (77, 4), (83, 3), (84, 3), (85, 1), (93, 10), (94, 1), (98, 5), (112, 1), (116, 3), (118, 1), (121, 3), (122, 5), (124, 1), (127, 1), (129, 1), (130, 2), (131, 1), (134, 1), (140, 2), (142, 1), (145, 2), (150, 13), (165, 1), (170, 4), (172, 2), (175, 6), (176, 2), (182, 1), (183, 2), (195, 1), (196, 1), (198, 3), (200, 1), (206, 4), (208, 2), (217, 1), (229, 3)]


In [None]:
print([(dictionary[idx] , freq) for idx, freq in bow_corpus[1][:50]])

[('ability', 1), ('able', 1), ('accuracy', 1), ('actual', 3), ('allowing', 1), ('although', 3), ('american_institute', 1), ('application', 4), ('applied', 4), ('around', 1), ('au', 1), ('becomes', 1), ('better', 1), ('buffer', 4), ('calculate', 3), ('calculated', 3), ('calculating', 1), ('change', 10), ('channel', 1), ('chosen', 5), ('co', 1), ('combined', 3), ('common', 1), ('compared', 3), ('comparison', 5), ('complex', 1), ('component', 1), ('computation', 1), ('computer', 2), ('computing', 1), ('conclusion', 1), ('consider', 2), ('considered', 1), ('constant', 2), ('context', 13), ('could', 1), ('current', 4), ('deal', 2), ('define', 6), ('degree', 2), ('denote', 1), ('dependent', 2), ('determine', 1), ('determined', 1), ('developed', 3), ('development', 1), ('difference', 4), ('difficulty', 2), ('discrimination', 1), ('duration', 3)]


In [None]:
print('Total number of papers:', len(bow_corpus))

Total number of papers: 1740
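As a rough illustration of the Bag of Words representation shown in this section, the following sketch maps a tokenized document to sorted `(term_id, count)` pairs and drops tokens that are not in the vocabulary. It is a simplified stand-in for `doc2bow`, not the gensim implementation; `doc_to_bow` and the tiny `token2id` mapping are hypothetical.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Convert tokens to sorted (term_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; real ids come from the fitted dictionary.
token2id = {'ability': 0, 'change': 1, 'context': 2}

bow = doc_to_bow(['context', 'change', 'context', 'unseen', 'ability'], token2id)
print(bow)
```

The token 'unseen' disappears from the vector, which is also what happens to any term removed during vocabulary pruning.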


# Topic Models with Latent Dirichlet Allocation (LDA)

The Latent Dirichlet Allocation (LDA) technique is a generative probabilistic model in
which each document is assumed to have a combination of topics, similar to a probabilistic
Latent Semantic Indexing model. In this case, the latent topics have a Dirichlet
prior over them. The math behind this technique is pretty involved, so we will only
summarize it, since going into specific details is out of the current scope.

![](https://i.imgur.com/l23JAvE.png)

Simplifying the LDA model process:

![](https://i.imgur.com/0BXCaUi.png)

![](https://i.imgur.com/ioiUAxX.png)
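The generative story behind LDA can be sketched in a few lines: draw a topic mixture for the document from a Dirichlet prior, then for each word position pick a topic from that mixture and a word from that topic. This is a toy illustration with a hypothetical two-topic vocabulary and made-up probabilities. It shows only the forward (generative) direction, not the inference that gensim performs to recover topics from text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, num_words, rng):
    """Generate one document: topic mixture first, then one topic and word per position."""
    theta = sample_dirichlet(alpha, rng)          # per-document topic proportions
    topic_ids = list(range(len(theta)))
    words = []
    for _ in range(num_words):
        topic = rng.choices(topic_ids, weights=theta, k=1)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic], k=1)[0]
        words.append(word)
    return theta, words

# Hypothetical vocabulary and topic-word distributions (each row sums to 1).
vocab = ['neuron', 'spike', 'kernel', 'bound']
topic_word_probs = [
    [0.50, 0.40, 0.05, 0.05],   # a "neuroscience"-like topic
    [0.05, 0.05, 0.50, 0.40],   # a "learning theory"-like topic
]

rng = random.Random(42)
theta, words = generate_document(topic_word_probs, vocab,
                                 alpha=[0.5, 0.5], num_words=12, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Training an LDA model amounts to running this story in reverse: given only the words, estimate the topic-word distributions and each document's topic proportions.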

In [None]:
%%time
TOTAL_TOPICS = 10
lda_model = gensim.models.LdaModel(corpus=bow_corpus,
                                   id2word=dictionary, chunksize=1740,
                                   alpha='auto', eta='auto', random_state=42,
                                   iterations=500, num_topics=TOTAL_TOPICS,
                                   passes=20, eval_every=None)

CPU times: user 4min 8s, sys: 3.28 s, total: 4min 11s
Wall time: 4min 11s


In [None]:
for topic_id, topic in lda_model.print_topics(num_topics=10, num_words=20):
    print('Topic #'+str(topic_id+1)+':')
    print(topic)
    print()

Topic #1:
0.015*"state" + 0.008*"dynamic" + 0.008*"vector" + 0.007*"matrix" + 0.007*"equation" + 0.005*"control" + 0.005*"solution" + 0.005*"linear" + 0.004*"trajectory" + 0.004*"nonlinear" + 0.004*"step" + 0.004*"signal" + 0.004*"gradient" + 0.004*"sequence" + 0.003*"convergence" + 0.003*"eq" + 0.003*"noise" + 0.003*"component" + 0.003*"attractor" + 0.003*"source"

Topic #2:
0.020*"training" + 0.014*"unit" + 0.009*"hidden_unit" + 0.007*"net" + 0.007*"prediction" + 0.006*"task" + 0.006*"trained" + 0.005*"training_set" + 0.005*"architecture" + 0.004*"pattern" + 0.004*"expert" + 0.004*"layer" + 0.004*"test" + 0.004*"noise" + 0.003*"target" + 0.003*"back_propagation" + 0.003*"vector" + 0.003*"generalization" + 0.003*"rate" + 0.003*"table"

Topic #3:
0.009*"rule" + 0.008*"pattern" + 0.008*"unit" + 0.007*"representation" + 0.006*"structure" + 0.006*"feature" + 0.006*"image" + 0.005*"vector" + 0.004*"cluster" + 0.004*"distance" + 0.004*"constraint" + 0.003*"transformation" + 0.003*"object" +

In [None]:
topics_coherences = lda_model.top_topics(bow_corpus, topn=20)
avg_coherence_score = np.mean([item[1] for item in topics_coherences])
print('Avg. Coherence Score:', avg_coherence_score)

Avg. Coherence Score: -1.0628844210789536


Topic coherence is a complex topic in its own right, and it can be used to measure the
quality of topic models to some extent. Typically, a set of statements is said to be
coherent if they support each other. Topic models are unsupervised learning based
models that are trained on unstructured text data, making it difficult to measure the
quality of outputs.
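One way to build intuition for coherence is to compute a UMass-style score by hand: for each pair of top words in a topic, compare how often they appear in the same document with how often the higher-ranked word appears at all. The sketch below follows the commonly cited formulation with add-one smoothing and averages over word pairs. It is an illustration under those assumptions and will not reproduce gensim's numbers exactly; `umass_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence: mean of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    top_words must be ordered from most to least probable, and every word
    is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    pair_scores = []
    for l, m in combinations(range(len(top_words)), 2):
        w_l, w_m = top_words[l], top_words[m]   # w_l is ranked higher than w_m
        pair_scores.append(math.log((doc_count(w_m, w_l) + 1) / doc_count(w_l)))
    return sum(pair_scores) / len(pair_scores)

toy_docs = [
    ['neuron', 'spike', 'cell'],
    ['neuron', 'spike', 'synapse'],
    ['kernel', 'bound', 'theorem'],
    ['kernel', 'theorem', 'margin'],
]

coherent_topic = umass_coherence(['neuron', 'spike'], toy_docs)
mixed_topic = umass_coherence(['neuron', 'kernel'], toy_docs)
print(coherent_topic, mixed_topic)
```

Words that tend to co-occur give a higher score than words that never share a document, which matches the idea that a coherent topic is one whose top terms support each other.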

Refer to Text Analytics with Python 2nd Edition for more detail on this.

In [None]:
topics_with_wts = [item[0] for item in topics_coherences]
print('LDA Topics with Weights')
print('='*50)
for idx, topic in enumerate(topics_with_wts):
    print('Topic #'+str(idx+1)+':')
    print([(term, round(wt, 3)) for wt, term in topic])
    print()

LDA Topics with Weights
Topic #1:
[('training', 0.02), ('unit', 0.014), ('hidden_unit', 0.009), ('net', 0.007), ('prediction', 0.007), ('task', 0.006), ('trained', 0.006), ('training_set', 0.005), ('architecture', 0.005), ('pattern', 0.004), ('expert', 0.004), ('layer', 0.004), ('test', 0.004), ('noise', 0.004), ('target', 0.003), ('back_propagation', 0.003), ('vector', 0.003), ('generalization', 0.003), ('rate', 0.003), ('table', 0.003)]

Topic #2:
[('class', 0.016), ('classifier', 0.016), ('feature', 0.014), ('classification', 0.014), ('training', 0.008), ('pattern', 0.008), ('tree', 0.007), ('node', 0.007), ('probability', 0.007), ('sample', 0.005), ('vector', 0.003), ('test', 0.003), ('layer', 0.003), ('mlp', 0.003), ('distribution', 0.003), ('experiment', 0.003), ('cluster', 0.003), ('level', 0.003), ('structure', 0.003), ('application', 0.003)]

Topic #3:
[('neuron', 0.025), ('cell', 0.01), ('signal', 0.008), ('response', 0.007), ('circuit', 0.007), ('spike', 0.007), ('current', 

In [None]:
print('LDA Topics without Weights')
print('='*50)
for idx, topic in enumerate(topics_with_wts):
    print('Topic #'+str(idx+1)+':')
    print([term for wt, term in topic])
    print()

LDA Topics without Weights
Topic #1:
['training', 'unit', 'hidden_unit', 'net', 'prediction', 'task', 'trained', 'training_set', 'architecture', 'pattern', 'expert', 'layer', 'test', 'noise', 'target', 'back_propagation', 'vector', 'generalization', 'rate', 'table']

Topic #2:
['class', 'classifier', 'feature', 'classification', 'training', 'pattern', 'tree', 'node', 'probability', 'sample', 'vector', 'test', 'layer', 'mlp', 'distribution', 'experiment', 'cluster', 'level', 'structure', 'application']

Topic #3:
['neuron', 'cell', 'signal', 'response', 'circuit', 'spike', 'current', 'synaptic', 'activity', 'pattern', 'neural', 'stimulus', 'voltage', 'frequency', 'firing', 'noise', 'synapsis', 'channel', 'threshold', 'effect']

Topic #4:
['distribution', 'approximation', 'probability', 'class', 'let', 'bound', 'linear', 'vector', 'variable', 'training', 'size', 'estimate', 'kernel', 'sample', 'theory', 'theorem', 'prior', 'consider', 'bayesian', 'log']

Topic #5:
['image', 'object', 'vi

## Evaluating topic model quality

We can use perplexity and coherence scores as measures to evaluate the topic
model. Typically, the lower the perplexity, the better the model. For coherence, higher is
better for both measures: the Cv score typically lies between 0 and 1, while the UMass
score is typically negative, so a value closer to zero indicates a more coherent model.

In [None]:
cv_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=bow_corpus,
                                                      texts=norm_corpus_bigrams,
                                                      dictionary=dictionary,
                                                      coherence='c_v')
avg_coherence_cv = cv_coherence_model_lda.get_coherence()

umass_coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, corpus=bow_corpus,
                                                         texts=norm_corpus_bigrams,
                                                         dictionary=dictionary,
                                                         coherence='u_mass')
avg_coherence_umass = umass_coherence_model_lda.get_coherence()

perplexity = lda_model.log_perplexity(bow_corpus)

print('Avg. Coherence Score (Cv):', avg_coherence_cv)
print('Avg. Coherence Score (UMass):', avg_coherence_umass)
print('Model Perplexity:', perplexity)

Avg. Coherence Score (Cv): 0.4588543432552908
Avg. Coherence Score (UMass): -1.0628844210789534
Model Perplexity: -7.7943656886351915
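Note that the value printed as "Model Perplexity" is what `log_perplexity` returns, which the gensim documentation describes as a per-word likelihood bound rather than the perplexity itself. The same documentation gives the relationship perplexity = 2^(-bound). A minimal conversion, using the value from the output above, looks like this:

```python
# Per-word likelihood bound returned by lda_model.log_perplexity(bow_corpus),
# copied from the output above.
per_word_bound = -7.7943656886351915

# gensim documents the relationship as perplexity = 2^(-bound).
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

Because the bound is negated in the exponent, a bound closer to zero corresponds to a lower perplexity, so the two views agree on which model is better.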


# LDA Models with MALLET

The MALLET framework is a Java-based package for statistical natural language
processing, document classification, clustering, topic modeling, information extraction,
and other machine learning applications to text. MALLET stands for MAchine Learning
for LanguagE Toolkit. It was developed by Andrew McCallum along with several people
at the University of Massachusetts Amherst. The MALLET topic modeling toolkit
contains efficient, sampling-based implementations of Latent Dirichlet Allocation,
Pachinko Allocation, and Hierarchical LDA. To use MALLET’s capabilities, we need to
download the framework.

In [None]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip -q mallet-2.0.8.zip

--2023-12-21 18:04:49--  http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Resolving mallet.cs.umass.edu (mallet.cs.umass.edu)... 128.119.246.70
Connecting to mallet.cs.umass.edu (mallet.cs.umass.edu)|128.119.246.70|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://mallet.cs.umass.edu/dist/mallet-2.0.8.zip [following]
--2023-12-21 18:04:49--  https://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Connecting to mallet.cs.umass.edu (mallet.cs.umass.edu)|128.119.246.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16184794 (15M) [application/zip]
Saving to: ‘mallet-2.0.8.zip’


2023-12-21 18:04:50 (35.3 MB/s) - ‘mallet-2.0.8.zip’ saved [16184794/16184794]



In [None]:
MALLET_PATH = 'mallet-2.0.8/bin/mallet'
lda_mallet = gensim.models.wrappers.LdaMallet(mallet_path=MALLET_PATH, corpus=bow_corpus,
                                              num_topics=TOTAL_TOPICS, id2word=dictionary,
                                              iterations=500, workers=4)

In [None]:
cv_coherence_model_lda_mallet = gensim.models.CoherenceModel(model=lda_mallet, corpus=bow_corpus,
                                                             texts=norm_corpus_bigrams,
                                                             dictionary=dictionary,
                                                             coherence='c_v')
avg_coherence_cv = cv_coherence_model_lda_mallet.get_coherence()

umass_coherence_model_lda_mallet = gensim.models.CoherenceModel(model=lda_mallet, corpus=bow_corpus,
                                                                texts=norm_corpus_bigrams,
                                                                dictionary=dictionary,
                                                                coherence='u_mass')
avg_coherence_umass = umass_coherence_model_lda_mallet.get_coherence()

# from STDOUT: <500> LL/token: -8.52683
perplexity = -8.52683
print('Avg. Coherence Score (Cv):', avg_coherence_cv)
print('Avg. Coherence Score (UMass):', avg_coherence_umass)
print('Model Perplexity:', perplexity)

Avg. Coherence Score (Cv): 0.5165427938470628
Avg. Coherence Score (UMass): -1.0997840821653622
Model Perplexity: -8.52683


![](https://i.imgur.com/yAYrq59.png)

# LDA Tuning: Finding the optimal number of topics

Finding the optimal number of topics in a topic model is tough, given that it is like a
model hyperparameter that you always have to set before training the model. We can
use an iterative approach and build several models with differing numbers of topics and
select the one that has the highest coherence score.
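The selection step itself is simple once the scores are available. The sketch below picks the smallest number of topics whose coherence is within a tolerance of the best score, which favors simpler models when several candidates are nearly tied. The helper `select_topic_count`, the tolerance idea, and the scores are all hypothetical and are not results from this notebook.

```python
def select_topic_count(topic_counts, scores, tolerance=0.0):
    """Return the smallest topic count whose score is within `tolerance` of the best."""
    best = max(scores)
    candidates = [k for k, s in zip(topic_counts, scores) if s >= best - tolerance]
    return min(candidates)

# Hypothetical coherence scores for illustration only.
topic_counts = [5, 10, 15, 20, 25]
scores = [0.47, 0.51, 0.535, 0.53, 0.52]

strict_choice = select_topic_count(topic_counts, scores)
lenient_choice = select_topic_count(topic_counts, scores, tolerance=0.03)
print(strict_choice, lenient_choice)
```

With a tolerance of zero this reduces to taking the highest-scoring model. A small positive tolerance is one way to encode the judgment that a slightly lower score is acceptable in exchange for fewer, more interpretable topics.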

In [None]:
def topic_model_coherence_generator(corpus, texts, dictionary,
                                    start_topic_count=2, end_topic_count=10, step=1,
                                    cpus=1):

    models = []
    coherence_scores = []
    for topic_nums in tqdm.tqdm(range(start_topic_count, end_topic_count+1, step)):
        mallet_lda_model = gensim.models.wrappers.LdaMallet(mallet_path=MALLET_PATH, corpus=corpus,
                                                            num_topics=topic_nums, id2word=dictionary,
                                                            iterations=500, workers=cpus)
        cv_coherence_model_mallet_lda = gensim.models.CoherenceModel(model=mallet_lda_model, corpus=corpus,
                                                                     texts=texts, dictionary=dictionary,
                                                                     coherence='c_v')
        coherence_score = cv_coherence_model_mallet_lda.get_coherence()
        coherence_scores.append(coherence_score)
        models.append(mallet_lda_model)

    return models, coherence_scores

In [None]:
lda_models, coherence_scores = topic_model_coherence_generator(corpus=bow_corpus, texts=norm_corpus_bigrams,
                                                               dictionary=dictionary, start_topic_count=2,
                                                               end_topic_count=30, step=1, cpus=4)

 69%|██████▉   | 20/29 [1:09:18<31:11, 207.92s/it]


KeyboardInterrupt: ignored

In [None]:
coherence_df = pd.DataFrame({'Number of Topics': range(2, 31, 1),
                             'Coherence Score': np.round(coherence_scores, 4)})
coherence_df.sort_values(by=['Coherence Score'], ascending=False).head(10)

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

x_ax = range(2, 31, 1)
y_ax = coherence_scores
plt.figure(figsize=(12, 6))
plt.plot(x_ax, y_ax, c='r')
plt.axhline(y=0.535, c='k', linestyle='--', linewidth=2)
plt.rcParams['figure.facecolor'] = 'white'
xl = plt.xlabel('Number of Topics')
yl = plt.ylabel('Coherence Score')

We choose the optimal number of topics as 15, based on our intuition. We can now retrieve the best model.

In [None]:
best_model_idx = coherence_df[coherence_df['Number of Topics'] == 15].index[0]
best_lda_model = lda_models[best_model_idx]
best_lda_model.num_topics

In [None]:
best_lda_model = gensim.models.wrappers.LdaMallet(mallet_path=MALLET_PATH, corpus=bow_corpus,
                                              num_topics=15, id2word=dictionary,
                                              iterations=500, workers=4)

In [None]:
best_lda_model.num_topics

In [None]:
topics = [[(term, round(wt, 3))
               for term, wt in best_lda_model.show_topic(n, topn=20)]
                   for n in range(0, best_lda_model.num_topics)]

for idx, topic in enumerate(topics):
    print('Topic #'+str(idx+1)+':')
    print([term for term, wt in topic])
    print()

# Viewing LDA Model topics

In [None]:
topics_df = pd.DataFrame([[term for term, wt in topic]
                              for topic in topics],
                         columns = ['Term'+str(i) for i in range(1, 21)],
                         index=['Topic '+str(t) for t in range(1, best_lda_model.num_topics+1)]).T
topics_df

In [None]:
pd.set_option('display.max_colwidth', None) # None means no truncation; -1 is no longer accepted by recent pandas versions
topics_df = pd.DataFrame([', '.join([term for term, wt in topic])
                              for topic in topics],
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, best_lda_model.num_topics+1)]
                         )
topics_df

# Interpreting Topic Model Results

An interesting point to remember is, given a corpus of documents (in the form of
features, e.g., Bag of Words) and a trained topic model, you can predict the distribution of
topics in each document (research paper in this case).

We can now get the most dominant topic per research paper with some intelligent
sorting and indexing.
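The "sorting and indexing" step boils down to picking, for each document, the `(topic_id, probability)` pair with the largest probability. Here is a minimal sketch on hypothetical topic distributions; `dominant_topic` is an illustrative helper name, and the numbers are made up.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda record: record[1])

# Hypothetical per-document topic distributions, in the same
# (topic_id, probability) format that a trained model returns.
doc_topics = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.55), (1, 0.05), (2, 0.40)],
]

dominant = [dominant_topic(dist) for dist in doc_topics]
print(dominant)
```

Using `max` avoids sorting the full distribution when only the top entry is needed, and it gives the same result as sorting in descending order and taking the first element.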

In [None]:
print(bow_corpus[0][:30])

In [None]:
print([(dictionary[idx] , freq) for idx, freq in bow_corpus[0][:30]])

In [None]:
tm_results = best_lda_model[bow_corpus]

In [None]:
tm_results[0]

In [None]:
corpus_topics = [sorted(topics, key=lambda record: -record[1])[0]
                     for topics in tm_results]
corpus_topics[:5]

In [None]:
corpus_topic_df = pd.DataFrame()
corpus_topic_df['Document'] = range(0, len(papers))
corpus_topic_df['Dominant Topic'] = [item[0]+1 for item in corpus_topics]
corpus_topic_df['Contribution %'] = [round(item[1]*100, 2) for item in corpus_topics]
corpus_topic_df['Topic Desc'] = [topics_df.iloc[t[0]]['Terms per Topic'] for t in corpus_topics]
corpus_topic_df['Paper'] = [paper[:500] for paper in papers]

# Dominant Topics in Specific Research Papers

Another interesting perspective is to select specific papers, view the most dominant topic
in each of those papers, and see if that makes sense.

In [None]:
corpus_topic_df.groupby('Dominant Topic').apply(lambda topic_set: (topic_set.sort_values(by=['Contribution %'],
                                                                                         ascending=False)
                                                                             .iloc[0]))

# Inference on existing papers

In [None]:
sample_paper_patterns = ['Feudal Reinforcement Learning \nPeter', 'Illumination-Invariant Face Recognition with a', 'Improved Hidden Markov Model Speech Recognition']
sample_paper_idxs = [idx for pattern in sample_paper_patterns
                            for idx, content in enumerate(papers)
                                if pattern in content]
sample_paper_idxs

In [None]:
pd.set_option('display.max_colwidth', 200)
(corpus_topic_df[corpus_topic_df['Document']
                 .isin(sample_paper_idxs)])

# Topic Inference on New Papers (Data)

In [None]:
new_paper = """
Unsupervised Translation of Programming Languages
Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample
A transcompiler, also known as source-to-source translator, is a system that converts source code
from a high-level programming language (such as C++ or Python) to another.
Transcompilers are primarily used for interoperability, and to port codebases
written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one.
They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree.
Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions,
and require manual modifications in order to work properly.
The overall translation process is timeconsuming and requires expertise in both the source and target languages,
making code-translation projects expensive.
Although neural models significantly outperform their rule-based counterparts in the context of natural language translation,
their applications to transcompilation have been limited due to the scarcity of parallel data in this domain.
In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully
unsupervised neural transcompiler. We train our model on source code from open source GitHub projects,
and show that it can translate functions between C++, Java, and Python with high accuracy.
Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages,
and can easily be generalized to other programming languages. We also build and release a test set composed of 852
parallel functions, along with unit tests to check the correctness of translations.
We show that our model outperforms rule-based commercial baselines by a significant margin.
"""

new_paper

## Pre-process Text

In [None]:
preprocessed_papers = normalize_corpus([new_paper])
print(preprocessed_papers[0][:30])

## Generate Influential Bi-grams if any

In [None]:
bigrams_corpus = [bigram_model[doc] for doc in preprocessed_papers]
print(bigrams_corpus[0][90:100])

## Generate BOW Vectors from Training Vectorizer

In [None]:
bow_corpus = [dictionary.doc2bow(text) for text in bigrams_corpus]
print(bow_corpus[0][:30])

## Use trained topic model to predict topics

In [None]:
print([(dictionary[idx] , freq) for idx, freq in bow_corpus[0][:30]])

## Show most relevant topic

In [None]:
predicted_topics = best_lda_model[bow_corpus][0]
predicted_topics

In [None]:
top_topic = max(predicted_topics, key=lambda x: x[1])
top_topic

In [None]:
top_topic_idx = top_topic[0]
top_topic_idx

In [None]:
topics_df.iloc[[top_topic_idx]]

In [None]:
print(new_paper)