# Abstracts LDA

* Load the abstracts of the review articles identified in the 05_benchmark_articles notebook
* Create a corpus with the abstracts
* Tokenize
* Apply single LDA
* Apply ensemble LDA

## Conclusion
Running LDA on the abstracts only reveals that these are systematic literature review papers. It gives no information about what the reviews are about.

## Load the abstracts 
Load the review articles on machine learning identified in the 05_benchmark_articles notebook. Only the abstracts are kept.

In [4]:
%%time

import pandas as pd

# load the metadata extracted in notebook 00_load_metadata
arxiv_ml_reviews = pd.read_csv('data/arxiv_ml_reviews.csv.zip')

CPU times: user 5.59 ms, sys: 190 µs, total: 5.78 ms
Wall time: 8.24 ms


In [32]:
print(f"There are {arxiv_ml_reviews.shape[0]} review articles on machine learning in the dataset")

There are 110 review articles on machine learning in the dataset


## Create a corpus with the abstracts

In [5]:
documents = arxiv_ml_reviews['abstract']

### Tokenize

In [28]:
from gensim.parsing.preprocessing import preprocess_string

# tokenize with gensim's default filters: lowercase, strip punctuation
# and numbers, remove stopwords and short tokens, then stem
texts = [
    preprocess_string(document)
    for document in documents
]

In [29]:
from collections import defaultdict

# remove words that appear only once in the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
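To make the effect of the frequency filter concrete, here is a minimal sketch on a hypothetical toy corpus (the tokens below are made up for illustration and are not taken from the abstracts). It uses `collections.Counter` instead of `defaultdict`, which should behave the same way here.

```python
from collections import Counter

# Hypothetical, already-stemmed toy documents (illustration only)
toy_texts = [
    ["review", "gestur", "model"],
    ["review", "model", "data"],
    ["survei", "data"],
]

# Count how often each token occurs across the whole corpus
frequency = Counter(token for text in toy_texts for token in text)

# Keep only tokens that occur more than once in the corpus
filtered = [
    [token for token in text if frequency[token] > 1]
    for text in toy_texts
]
print(filtered)
```

Note that the count is over the whole corpus, not per document, so a token used twice in a single abstract survives the filter.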

In [33]:
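from gensim import corpora

# map each token to an integer id, then encode each document as a bag of words
dictionary = corpora.Dictionary(texts)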
corpus = [dictionary.doc2bow(text) for text in texts]
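The bag-of-words encoding used above can be sketched in plain Python. This is only an illustration of the idea, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign (first-seen order) may differ from the ones `corpora.Dictionary` assigns.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(text, vocab):
    """Encode one document as a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())

# Hypothetical toy documents (illustration only)
toy_texts = [["review", "model", "review"], ["model", "data"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, vocab) for text in toy_texts]
print(vocab)
print(toy_corpus)
```

Word order is discarded in this representation, which is all LDA needs: it only models how often each term occurs in each document.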

## Apply single LDA

In [49]:
from gensim import models

# fit a single LDA model with two topics
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
corpus_lda = lda_model[corpus]

In [50]:
for doc, as_text in zip(corpus_lda, documents):
    print(doc, as_text)

[(0, 0.011894052), (1, 0.98810595)]   Lately, there has been an increasing interest in hand gesture analysis
systems. Recent works have employed pattern recognition techniques and have
focused on the development of systems with more natural user interfaces. These
systems may use gestures to control interfaces or recognize sign language
gestures, which can provide systems with multimodal interaction; or consist in
multimodal tools to help psycholinguists to understand new aspects of discourse
analysis and to automate laborious tasks. Gestures are characterized by several
aspects, mainly by movements and sequence of postures. Since data referring to
movements or sequences carry temporal information, this paper presents a
literature review about temporal aspects of hand gesture analysis, focusing on
applications related to natural conversation and psycholinguistic analysis,
using Systematic Literature Review methodology. In our results, we organized
works according to type of analysis, me

In [56]:
lda_model.print_topics()

[(0,
  '0.018*"research" + 0.017*"review" + 0.013*"model" + 0.012*"paper" + 0.012*"systemat" + 0.011*"learn" + 0.010*"literatur" + 0.010*"base" + 0.008*"studi" + 0.008*"machin"'),
 (1,
  '0.016*"review" + 0.015*"literatur" + 0.014*"research" + 0.014*"data" + 0.012*"systemat" + 0.012*"model" + 0.011*"studi" + 0.010*"learn" + 0.010*"process" + 0.008*"system"')]
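The two topics printed above look very similar. One rough way to quantify this is to compare their top-10 term sets. The sketch below parses the two topic strings copied from the output above; the helper `top_terms` is a hypothetical name, and Jaccard overlap is just one possible similarity measure.

```python
import re

# Topic strings copied from the print_topics() output above
topic_0 = ('0.018*"research" + 0.017*"review" + 0.013*"model" + 0.012*"paper" + '
           '0.012*"systemat" + 0.011*"learn" + 0.010*"literatur" + 0.010*"base" + '
           '0.008*"studi" + 0.008*"machin"')
topic_1 = ('0.016*"review" + 0.015*"literatur" + 0.014*"research" + 0.014*"data" + '
           '0.012*"systemat" + 0.012*"model" + 0.011*"studi" + 0.010*"learn" + '
           '0.010*"process" + 0.008*"system"')

def top_terms(topic_string):
    """Extract the quoted terms from a gensim topic string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

terms_0 = top_terms(topic_0)
terms_1 = top_terms(topic_1)

shared = terms_0 & terms_1
jaccard = len(shared) / len(terms_0 | terms_1)

print(sorted(shared))
print(f"{len(shared)} of 10 top terms are shared, Jaccard overlap {jaccard:.2f}")
```

Seven of the ten top terms appear in both topics, and the shared terms describe the review genre rather than any subject area.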

## Apply ensemble LDA

In [34]:
from gensim.models import LdaModel
topic_model_class = LdaModel

Set the ensemble parameters.

In [87]:
ensemble_workers = 4
num_models = ensemble_workers * 2
distance_workers = 4
num_topics = 8
passes = 2

In [88]:
from gensim.models import EnsembleLda

ensemble = EnsembleLda(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=passes,
    num_models=num_models,
    topic_model_class=topic_model_class,
    ensemble_workers=ensemble_workers,
    distance_workers=distance_workers
)

* Compute all the topics. In this setup: num_models * num_topics = 8 * 8 = 64 topics
* Compute the stable topics
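As a quick sanity check, the number of topics the ensemble collects can be worked out from the parameters set earlier. This sketch assumes each of the `num_models` models contributes `num_topics` topics, which agrees with the `len(ensemble.ttda)` value printed in the next cell.

```python
# Parameters copied from the cell above
ensemble_workers = 4
num_models = ensemble_workers * 2
num_topics = 8

# Each model contributes num_topics topics to the ensemble
expected_ttda_rows = num_models * num_topics
print(expected_ttda_rows)
```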

In [89]:
print(len(ensemble.ttda))
print(len(ensemble.get_topics()))

64
1


In [90]:
ensemble.print_topics()

[(0,
  '0.022*"research" + 0.019*"review" + 0.015*"literatur" + 0.014*"data" + 0.013*"systemat" + 0.012*"paper" + 0.011*"studi" + 0.011*"model" + 0.009*"learn" + 0.008*"system"')]

## Tuning

In [91]:
import numpy as np

shape = ensemble.asymmetric_distance_matrix.shape
shape

(64, 64)

In [92]:
without_diagonal = ensemble.asymmetric_distance_matrix[~np.eye(shape[0], dtype=bool)].reshape(shape[0], -1)
print(without_diagonal.min(), without_diagonal.mean(), without_diagonal.max())

0.06404970780489483 0.2501196971164614 0.5316963761897096
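The indexing idiom used above to drop the diagonal is compact, so here is a minimal sketch on a hypothetical 3x3 matrix (the values are made up). Removing the diagonal presumably matters because each topic's distance to itself would otherwise dominate the minimum.

```python
import numpy as np

# Hypothetical asymmetric distance matrix with zeros on the diagonal
toy = np.array([
    [0.0, 0.2, 0.5],
    [0.1, 0.0, 0.4],
    [0.3, 0.6, 0.0],
])
n = toy.shape[0]

# ~np.eye(n, dtype=bool) is True everywhere except on the diagonal.
# Boolean indexing flattens the selection in row-major order, so each
# row keeps its n - 1 off-diagonal entries after reshaping.
off_diagonal = toy[~np.eye(n, dtype=bool)].reshape(n, -1)

print(off_diagonal)
print(off_diagonal.min(), off_diagonal.mean(), off_diagonal.max())
```

The resulting minimum, mean and maximum give a rough range in which to choose `eps` for reclustering.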


In [93]:
ensemble.recluster(eps=0.1, min_samples=2, min_cores=2)
print(len(ensemble.get_topics()))

1


In [94]:
ensemble.print_topics()

[(0,
  '0.022*"research" + 0.019*"review" + 0.015*"literatur" + 0.013*"systemat" + 0.012*"data" + 0.012*"paper" + 0.011*"studi" + 0.010*"model" + 0.009*"learn" + 0.009*"base"')]
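Only one `eps` value is tried above. A small helper can repeat the reclustering for several values and record how many stable topics each one yields. This is a sketch: `scan_eps` is a hypothetical helper, it relies only on the `recluster` and `get_topics` calls already used in this notebook, and the candidate values in the usage comment are arbitrary picks within the distance range printed earlier.

```python
def scan_eps(ensemble, eps_values, min_samples=2, min_cores=2):
    """Recluster the ensemble for each eps and count the stable topics.

    The ensemble is left in the state produced by the last eps value.
    """
    results = {}
    for eps in eps_values:
        ensemble.recluster(eps=eps, min_samples=min_samples, min_cores=min_cores)
        results[eps] = len(ensemble.get_topics())
    return results

# Usage with the ensemble trained above (candidate values are arbitrary):
# scan_eps(ensemble, [0.07, 0.1, 0.15, 0.25])
```

If every value in the scan still yields a single stable topic, that would support the conclusion that the abstracts mostly share review-genre vocabulary.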