# Modeling topics in document collections

Here we explore some methods for inferring the topics contained in a document collection. A topic is represented as a probability distribution over the words in the corpus vocabulary, so different topics give different weights to the same words. Documents, in turn, can be described as mixtures of topics.

## Choosing a corpus

The Zika pandemic of 2015-16 generated an epidemic of scientific publications. Because Zika was an almost unknown disease when the pandemic started, this boom in publications was necessary to "fill in the blanks" in our knowledge about the Zika virus, the disease it causes, and other related topics.

Can we find out which were the most researched topics about Zika?

In the following exercise, we will analyze a corpus of publication abstracts retrieved from the PubMed database. We will use the gensim package for topic modeling and, to make things easier, the corpus and the dictionary are available in our software repository.

In [1]:
from gensim import corpora, models, similarities
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from string import punctuation
from pprint import pprint



### Exercise

In [2]:
dicionario = corpora.Dictionary.load('Dicionario_zika.dict')
corpus = corpora.MmCorpus('corpus_zika')

In [3]:
print(dicionario)
print(corpus)
498*5886

Dictionary(7901 unique tokens: ['azido', 'prevalent', 'manuscripts', ').', 'subcutaneous']...)
MmCorpus(872 documents, 7901 features, 40533 non-zero entries)


2931228

As we can see, the corpus consists of 872 documents, with 7901 unique tokens in its vocabulary. In gensim, a document is a list of 2-tuples `(word id, frequency)`, and only the words that actually occur in the document are listed.

In [5]:
for doc in corpus:
    print(doc)
    break

[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 5.0), (5, 1.0), (6, 2.0), (7, 1.0), (8, 1.0), (9, 2.0), (10, 2.0), (11, 1.0), (12, 1.0), (13, 1.0), (14, 3.0), (15, 2.0), (16, 3.0), (17, 2.0), (18, 1.0), (19, 1.0), (20, 2.0), (21, 1.0), (22, 1.0), (23, 1.0), (24, 1.0), (25, 1.0), (26, 1.0), (27, 1.0), (28, 1.0), (29, 1.0), (30, 2.0), (31, 1.0), (32, 1.0), (33, 9.0), (34, 1.0), (35, 2.0), (36, 1.0), (37, 1.0), (38, 1.0), (39, 1.0), (40, 1.0), (41, 1.0), (42, 1.0), (43, 4.0), (44, 1.0), (45, 2.0), (46, 1.0), (47, 3.0), (48, 2.0), (49, 1.0), (50, 1.0), (51, 1.0), (52, 1.0), (53, 1.0), (54, 1.0), (55, 1.0), (56, 1.0), (57, 1.0), (58, 1.0), (59, 2.0), (60, 1.0), (61, 1.0), (62, 1.0), (63, 5.0), (64, 1.0), (65, 1.0), (66, 1.0), (67, 1.0), (68, 1.0), (69, 2.0), (70, 1.0), (71, 1.0), (72, 2.0), (73, 1.0), (74, 1.0), (75, 1.0), (76, 2.0), (77, 1.0), (78, 3.0), (79, 3.0), (80, 1.0), (81, 3.0), (82, 2.0), (83, 2.0), (84, 1.0), (85, 1.0), (86, 1.0), (87, 2.0), (88, 1.0), (89, 2.0), (90, 1.0), (91, 1.0
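
To make the `(word id, frequency)` format concrete, here is a minimal sketch in plain Python that builds the same kind of representation for two toy sentences. The helper `to_bow`, the mapping `token2id` and the toy texts are hypothetical; they only imitate what gensim's `Dictionary.doc2bow` produces, and the real dictionary assigns its ids in its own way.

```python
from collections import Counter

# Toy, already tokenized documents (hypothetical, not taken from the Zika corpus)
docs = [
    ["zika", "virus", "infection", "virus"],
    ["aedes", "mosquito", "zika"],
]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for tokens in docs:
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))

def to_bow(tokens):
    """Return a sorted list of (word id, frequency) 2-tuples for one document."""
    counts = Counter(token2id[tok] for tok in tokens)
    return sorted(counts.items())

bow_corpus = [to_bow(tokens) for tokens in docs]
print(bow_corpus)
```

Words that do not occur in a document simply do not appear in its list, which is why this representation stays compact even with a vocabulary of thousands of tokens.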

## Latent Semantic Indexing - LSI
The first method we will use to model the topics in the corpus is LSI, which stands for Latent Semantic Indexing. Instead of the raw word frequencies, let's use the TF-IDF value of each word in each document as a measure of how important the word is to that document.

In [6]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
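
To see what a TF-IDF weight looks like, the sketch below computes it by hand for a toy bag-of-words corpus. It assumes the weighting `tf * log2(N / df)` followed by scaling each document vector to unit length, which is our understanding of gensim's default settings; the toy corpus and the helper `tfidf` are hypothetical.

```python
import math

# Toy bag-of-words corpus: each document is a list of (word id, frequency)
bow_corpus = [
    [(0, 1), (1, 2), (2, 1)],
    [(0, 1), (3, 1), (4, 1)],
    [(1, 1), (3, 2)],
]

n_docs = len(bow_corpus)

# Document frequency: in how many documents each word id occurs
doc_freq = {}
for doc in bow_corpus:
    for word_id, _ in doc:
        doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

def tfidf(doc):
    """Weight each term by tf * log2(N / df), then scale the vector to unit length."""
    weights = [(w, tf * math.log2(n_docs / doc_freq[w])) for w, tf in doc]
    # Words that occur in every document get weight 0 and are dropped
    weights = [(w, x) for w, x in weights if x > 0]
    norm = math.sqrt(sum(x * x for _, x in weights))
    return [(w, x / norm) for w, x in weights]

tfidf_corpus = [tfidf(doc) for doc in bow_corpus]
for doc in tfidf_corpus:
    print([(w, round(x, 3)) for w, x in doc])
```

In the first toy document, word 2 occurs only once but appears in no other document, so it ends up with a larger weight than word 0, which is shared with another document.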

In LSI we have to specify, a priori, the number of topics we believe exist in the corpus.

In [8]:
lsi = models.LsiModel(corpus_tfidf, id2word=dicionario, num_topics=30)
corpus_lsi = lsi[corpus_tfidf]

After estimating the LSI model, we can inspect the extracted topics. Here we look at the first 10 topics, showing only the 4 most important words in each.

In [33]:
lsi.show_topics(10,4)

[(0, '0.344*"ZIKV" + 0.258*"virus" + 0.198*"Zika" + 0.166*"infection"'),
 (1, '0.611*"ZIKV" + -0.210*"women" + -0.205*"Zika" + -0.200*"virus"'),
 (2, '0.196*"ZIKV" + -0.187*"fever" + -0.174*"YF" + 0.170*"microcephaly"'),
 (3, '-0.202*"ZIKV" + 0.135*"Aedes" + 0.134*"spread" + -0.123*"brain"'),
 (4, '-0.234*"Ae" + -0.171*"women" + 0.168*"cases" + 0.167*"patients"'),
 (5, '-0.213*"microcephaly" + 0.201*"ZIKV" + -0.197*"YF" + -0.144*"brain"'),
 (6,
  '-0.294*"Ae" + -0.146*"albopictus" + -0.144*"aegypti" + -0.137*"infants"'),
 (7, '0.246*"ZIKV" + -0.171*"genome" + -0.158*"virus" + 0.158*"YF"'),
 (8, '0.163*"isolated" + -0.161*"patients" + 0.156*"YF" + 0.154*"strains"'),
 (9, '-0.172*"fever" + 0.163*"Health" + -0.138*"abnormalities" + -0.123*"CT"')]
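
The negative weights in the output above are expected: LSI is usually described as a truncated singular value decomposition (SVD) of the term-document matrix, and singular vectors can have entries of either sign. The sketch below illustrates the idea with NumPy on a small, invented matrix. It is only a conceptual illustration under that assumption; gensim computes the decomposition with its own scalable algorithm.

```python
import numpy as np

# Toy term-document matrix (rows: 5 terms, columns: 4 documents).
# The values are hypothetical weights, e.g. TF-IDF.
X = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
    [0.5, 0.4, 0.5, 0.6],
])

num_topics = 2  # chosen a priori, like num_topics in LsiModel

# Full SVD, then keep only the leading num_topics left singular vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
topics = U[:, :num_topics]        # one column per topic, one row per term

# Project every document onto the topic vectors: one row per document
doc_vectors = (topics.T @ X).T

print(np.round(topics, 3))
print(np.round(doc_vectors, 3))
```

Each row of `doc_vectors` plays the role of one document in topic space, with one coordinate per topic.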

As we said before, within this paradigm a document can be seen as a mixture, or linear combination, of topics. The object **corpus_lsi** that we generated above holds this representation for every document in the corpus.

In [12]:
for doc in corpus_lsi:
    print(doc)
    break

[(0, 0.2832771940757487), (1, 0.034778412621503649), (2, 0.047663271800901949), (3, -0.3137186239666353), (4, -0.056148534274674929), (5, 0.05323903374508384), (6, -0.067096554390004878), (7, 0.11428521434543858), (8, -0.029369647498402856), (9, -0.014759474293403185), (10, 0.051569908114626209), (11, 0.044254355341419319), (12, -0.05691898773473978), (13, -0.068825112058023843), (14, -0.02195590821510561), (15, 0.0086818775684834159), (16, -0.04779029324525709), (17, 0.070210663187842887), (18, 0.081583964932671157), (19, 0.038797052128154964), (20, 0.073282907466170458), (21, -0.055100211956329297), (22, -0.028602713996770793), (23, -0.12490527006562716), (24, -0.037722543376292328), (25, -0.048239216200329033), (26, -0.013040690616555268), (27, -0.0086285804910014377), (28, -0.012187627915945793), (29, -0.010538037465571987)]


So each document can be seen as a vector in topic space, and we can use this fact to calculate the cosine similarity between documents. To do that, we first build a similarity index over all the documents.

In [15]:
index = similarities.MatrixSimilarity(corpus_lsi)

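Then let's print the ten documents most similar to document $0$. The variable `doc` still holds the first document from the loop above, so the list starts with document $0$ itself.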

In [17]:
sims = index[doc]
#pprint(list(enumerate(sims)))
pprint(sorted(list(enumerate(sims)), key=lambda x:x[1], reverse=True)[:10])

[(0, 1.0),
 (449, 0.68444359),
 (399, 0.66428459),
 (2, 0.61082482),
 (548, 0.61057496),
 (12, 0.60451663),
 (489, 0.57837856),
 (16, 0.57104939),
 (465, 0.56657535),
 (704, 0.56532925)]
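
For reference, the cosine similarity behind this ranking can be written in a few lines of plain Python. This is a minimal sketch on invented 3-topic vectors; the function `cosine` and the data are hypothetical and are not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical documents in a 3-topic space
doc_vectors = [
    [0.28, 0.03, -0.31],
    [0.30, 0.05, -0.25],
    [-0.10, 0.40, 0.20],
]

query = doc_vectors[0]
sims = [cosine(query, d) for d in doc_vectors]

# Rank documents from most to least similar to the query
ranking = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

A document compared with itself gets similarity 1, which is why the query document always comes first in this kind of ranking.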


## Latent Dirichlet Allocation - LDA
LDA is a technique similar to LSI, but it models documents as probability distributions over topics and topics as probability distributions over words. Thus the weights are never negative and sum to 1. To learn more about LDA, read this article: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
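
The two kinds of distributions combine in a simple way: the probability of a word in a document is the topic-weighted average of its probability in each topic. The sketch below shows this for a toy case; the vocabulary, the two topics and the document mixture are all invented for illustration.

```python
# Two hypothetical topics, each a probability distribution over a 4-word vocabulary
vocab = ["virus", "infection", "mosquito", "aegypti"]
topics = [
    [0.5, 0.4, 0.05, 0.05],   # a "clinical" topic
    [0.1, 0.0, 0.5, 0.4],     # a "vector" topic
]

# A document as a probability distribution over the two topics
doc_topics = [0.75, 0.25]

# Word probabilities for the document: p(w | d) = sum_k p(k | d) * p(w | k)
doc_words = [
    sum(theta * topic[i] for theta, topic in zip(doc_topics, topics))
    for i in range(len(vocab))
]

for word, p in zip(vocab, doc_words):
    print(f"{word}: {p:.4f}")
```

Because every ingredient is a probability distribution, the resulting word probabilities are non-negative and also sum to 1.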

In [19]:
lda_model = models.ldamodel.LdaModel(corpus, id2word=dicionario, num_topics=30, passes=10)

In [32]:
lda_model.show_topics(10,4)

[(22, '0.014*The + 0.014*virus + 0.013*Zika + 0.011*infection'),
 (4, '0.016*ZIKV + 0.016*virus + 0.011*Zika + 0.010*),'),
 (5, '0.039*virus + 0.023*Zika + 0.012*The + 0.009*infection'),
 (20, '0.038*Zika + 0.037*virus + 0.025*women + 0.023*transmission'),
 (19, '0.027*Zika + 0.026*virus + 0.011*2 + 0.009*1'),
 (15, '0.025*virus + 0.021*Zika + 0.012*transmission + 0.009*ZIKV'),
 (2, '0.025*ZIKV + 0.021*virus + 0.010*Ae + 0.010*),'),
 (7, '0.017*virus + 0.014*ZIKV + 0.011*Zika + 0.011*fever'),
 (12, '0.023*x80 + 0.012*insecticide + 0.012*x89Â + 0.012*±'),
 (18, '0.015*ZIKV + 0.014*Zika + 0.012*infection + 0.009*human')]

In [21]:
corpus_lda = lda_model[corpus]

As said before, documents are probability distributions over the set of 30 topics specified.

In [30]:
doc_lda = corpus_lda[3]
doc_lda

[(0, 0.095488416923020167),
 (4, 0.86107272694775772),
 (22, 0.039027091423336144)]

In [31]:
sum(t[1] for t in doc_lda)

0.995588235294114
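
The sum falls slightly short of 1. A likely reason is that gensim does not report topics whose probability is below a threshold (the `minimum_probability` parameter of `LdaModel`, 0.01 by default as far as we know), so only the dominant topics appear in the list. The sketch below imitates that filtering in plain Python; the full distribution in it is invented for illustration and is not the model's actual output.

```python
# Hypothetical full distribution of one document over 30 topics:
# three dominant topics plus 27 topics sharing the small remaining mass
dominant = {0: 0.0955, 4: 0.8611, 22: 0.0390}
remaining = 1.0 - sum(dominant.values())
full = {k: dominant.get(k, remaining / 27) for k in range(30)}

minimum_probability = 0.01  # threshold below which a topic is not reported

# Keep only the topics at or above the threshold, as a sparse list of 2-tuples
reported = [(k, p) for k, p in sorted(full.items()) if p >= minimum_probability]

print(reported)
print(sum(full.values()))               # the full distribution sums to 1
print(sum(p for _, p in reported))      # the reported part sums to slightly less
```

With a fitted model, a call such as `lda_model.get_document_topics(corpus[3], minimum_probability=0.0)` should return the weights of all 30 topics, although the exact handling of a zero threshold may differ between gensim versions.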