# Topic Modelling & NER

## A) Modelling of Data in Topics

### Prepare Project

   1. Load libraries
   2. Load dataset

To run this Jupyter notebook, the folder "nipstxt", which contains all the files needed, must exist in the same folder as the notebook.

In [2]:
# Import all the libraries needed.
import re
import os
import nltk
import pandas as pd
import numpy as np
import scipy
import codecs
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import pprint as pprint
from gensim.models import Phrases
from gensim import models, corpora
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline



In [3]:
# Folder containing all NIPS papers.
directory = 'nipstxt/'

# Folders containing individual NIPS papers.
years = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
sub_dirs = ['nips' + yr for yr in years]

# Read all texts into a list.
papers = []
for element in sub_dirs:
    paper_files = os.listdir(directory + element)
    for item in paper_files:
        # Note: ignoring characters that cause encoding errors.
        with codecs.open(directory + element + '/' + item, encoding='utf-8', errors='ignore') as paper_file:
            text = paper_file.read()
        papers.append(text)

In [4]:
print len(papers)  # Sanity check: the number of loaded documents should be 1740.

1740


## Natural Language Processing (NLP)

### Applying basic text processing methods:

- Clean the text of URLs and other noise
- Tokenisation
- Lemmatization
- Stop word removal
- Vectorising text via term frequencies (TF)
- Vectorising text via term frequency-inverse document frequency (TF-IDF)

In [5]:
# Create some useful variables like stopwords, a lemmatizer and the vectorizers.
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

In [6]:
# Create some regular expressions for pattern recognition, to exclude or replace some words in the text.
pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'  # escape the dot so only a literal "www." matches
nonPunct = re.compile('.*[A-Za-z].*')
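
To see what these patterns do, here is a minimal sketch that applies them to a made-up sentence. The sample string and the `strip_noise` helper are hypothetical and only serve as illustration; the `www` pattern is written with an escaped dot so that it matches a literal `www.` only.

```python
import re

# Patterns for handles, URLs and bare www addresses.
pat1 = r'@[A-Za-z0-9_]+'                 # @handles
pat2 = r'https?://[^ ]+'                 # http/https URLs
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'                  # addresses starting with a literal "www."
nonPunct = re.compile('.*[A-Za-z].*')    # token contains at least one letter


def strip_noise(text):
    """Hypothetical helper: remove handles and URLs, leave the rest untouched."""
    stripped = re.sub(combined_pat, '', text)
    return re.sub(www_pat, '', stripped)


sample = 'see https://example.org/paper.pdf or www.example.org by @author for details'
cleaned = strip_noise(sample)
print(cleaned.split())

# nonPunct keeps tokens with at least one letter and drops pure punctuation or digits.
kept = [tok for tok in ['model', '1n4', '...', '42'] if nonPunct.match(tok)]
print(kept)
```

Tokens such as `1n4` survive the `nonPunct` filter because they contain a letter, which is why postal-code fragments still show up in the cleaned output further down.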

In [7]:
# Function that cleans the text according to the patterns, tokenizes it, lowercases all letters,
# removes punctuation and stop words, lemmatizes the words, keeps those with at least 3 letters
# and returns the cleaned lemmas.

def text_cleaning(text):

    stripped = re.sub(combined_pat, '', text)
    stripped = re.sub(www_pat, '', stripped)
    tokens = nltk.word_tokenize(stripped)
    lower_case = [word.lower() for word in tokens]
    raw_words = [tok for tok in lower_case if nonPunct.match(tok)]
    filtered_result = list(filter(lambda l: l not in stop_words, raw_words))
    lemmas = [wordnet_lemmatizer.lemmatize(t) for t in filtered_result]
    lemmas = [lemma for lemma in lemmas if len(lemma) > 2]
    return lemmas

In [8]:
tokenized_data = []
for text in papers:
    tokenized_data.append(text_cleaning(text))

In [9]:
tokenized_data[:3]

[[u'5o5',
  u'connecting',
  u'past',
  u'bruce',
  u'macdonald',
  u'assistant',
  u'professor',
  u'knowledge',
  u'science',
  u'laboratory',
  u'computer',
  u'science',
  u'department',
  u'university',
  u'calgary',
  u'university',
  u'drive',
  u'calgary',
  u'alberta',
  u't2n',
  u'1n4',
  u'abstract',
  u'l\x7fecently',
  u'renewed',
  u'interest',
  u'neural-like',
  u'processing',
  u'system',
  u'evidenced',
  u'ex-',
  u'ample',
  u'two',
  u'volume',
  u'parallel',
  u'distributed',
  u'processing',
  u'edited',
  u'p\x7fumelhart',
  u'mcclelland',
  u'discussed',
  u'parallel',
  u'distributed',
  u'system',
  u'connectionist',
  u'model',
  u'neural',
  u'net',
  u'value',
  u'passing',
  u'system',
  u'multiple',
  u'context',
  u'system',
  u'dissatisfaction',
  u'symbolic',
  u'manipulation',
  u'paradigm',
  u'artificial',
  u'intelligence',
  u'seems',
  u'partly',
  u'responsible',
  u'attention',
  u'encouraged',
  u'promise',
  u'massively',
  u'parallel',
  u

In [10]:
len(tokenized_data[0])

2538

As the list of words above shows, words that were split across a line break with the character '-' (for example 'ex-' and 'ample') still remain as two separate tokens after cleaning. So, in the next code block, I join the split words and save the new tokens in a new list of lists, the fixed_tokens list.

In [11]:
fixed_tokens = []

for item in tokenized_data:
    i = 0
    splitted_words = []
    while (i < (len(item)-1)):
        if item[i].endswith('-'):
            splitted_words.append("".join([item[i][:-1], item[i+1]]))
            i = i + 2
        else:
            splitted_words.append(item[i])
            i = i + 1
    fixed_tokens.append(splitted_words)
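
One detail of the loop above: it stops at `len(item) - 1`, so the final token of a document is skipped whenever it is not the second half of a hyphenated pair. A minimal variant that keeps that token, assuming the same joining rule, could look like the sketch below; `join_hyphenated` is a hypothetical helper name, not part of the notebook, and using it would change the token counts reported here by at most one per document.

```python
def join_hyphenated(tokens):
    """Join every token ending in '-' with the token that follows it."""
    joined = []
    i = 0
    while i < len(tokens):
        if tokens[i].endswith('-') and i + 1 < len(tokens):
            # Drop the trailing hyphen and glue on the next token.
            joined.append(tokens[i][:-1] + tokens[i + 1])
            i += 2
        else:
            # Ordinary token (or a dangling hyphenated token at the very end).
            joined.append(tokens[i])
            i += 1
    return joined


example = join_hyphenated(['evidenced', 'ex-', 'ample', 'two', 'volume'])
print(example)
```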


In [12]:
fixed_tokens[:3]

[[u'5o5',
  u'connecting',
  u'past',
  u'bruce',
  u'macdonald',
  u'assistant',
  u'professor',
  u'knowledge',
  u'science',
  u'laboratory',
  u'computer',
  u'science',
  u'department',
  u'university',
  u'calgary',
  u'university',
  u'drive',
  u'calgary',
  u'alberta',
  u't2n',
  u'1n4',
  u'abstract',
  u'l\x7fecently',
  u'renewed',
  u'interest',
  u'neural-like',
  u'processing',
  u'system',
  u'evidenced',
  u'example',
  u'two',
  u'volume',
  u'parallel',
  u'distributed',
  u'processing',
  u'edited',
  u'p\x7fumelhart',
  u'mcclelland',
  u'discussed',
  u'parallel',
  u'distributed',
  u'system',
  u'connectionist',
  u'model',
  u'neural',
  u'net',
  u'value',
  u'passing',
  u'system',
  u'multiple',
  u'context',
  u'system',
  u'dissatisfaction',
  u'symbolic',
  u'manipulation',
  u'paradigm',
  u'artificial',
  u'intelligence',
  u'seems',
  u'partly',
  u'responsible',
  u'attention',
  u'encouraged',
  u'promise',
  u'massively',
  u'parallel',
  u'system'

In [13]:
len(fixed_tokens[0])

2523

I chose to also find bigrams in the documents. By using bigrams I can get phrases like "log_likelihood" in the output. Without bigrams we would only get "log" and "likelihood".

So, I find bigrams and then add them to the original data, because I would like to keep the words "log" and "likelihood" as well as the bigram "log_likelihood".

In [14]:
# Add the bigrams detected by Phrases (min_count=30) to each document in fixed_tokens.
bigrams = Phrases(fixed_tokens, min_count=30)
for i in range(len(fixed_tokens)):
    for token in bigrams[fixed_tokens[i]]:
        if '_' in token:
            # Token is a bigram, add to document.
            fixed_tokens[i].append(token)
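
As a rough illustration of the idea, the sketch below counts adjacent token pairs and appends the frequent ones to each document while keeping the original unigrams. It is a frequency-only simplification with a hypothetical `min_count` of 2 and a hypothetical helper name; gensim's `Phrases` additionally scores each pair against the counts of its individual words, which this sketch does not attempt.

```python
from collections import Counter


def append_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times in the corpus."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))

    extended = []
    for doc in docs:
        bigrams = ['_'.join(pair) for pair in zip(doc, doc[1:])
                   if pair_counts[pair] >= min_count]
        # Keep the original unigrams and add the bigrams at the end.
        extended.append(doc + bigrams)
    return extended


toy_docs = [
    ['log', 'likelihood', 'bound'],
    ['maximize', 'log', 'likelihood'],
]
extended_docs = append_frequent_bigrams(toy_docs)
print(extended_docs)
```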



In [15]:
# Join the tokens of a paper to be able to vectorize the words
joined_tokens = [" ".join(item) for item in fixed_tokens]
count_vectorized_data = count_vectorizer.fit_transform(joined_tokens)
tfidf_vectorized_data = tfidf_vectorizer.fit_transform(joined_tokens)
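
To make the TF-IDF weighting concrete, here is a small plain-Python sketch. It follows what I understand to be scikit-learn's default settings (raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the helper name and the toy documents are hypothetical, and the sketch assumes every document is non-empty.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    # Smoothed inverse document frequency.
    idf = {tok: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
           for tok, df in doc_freq.items()}

    rows = []
    for doc in docs:
        weights = {tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows


toy_docs = [['neural', 'net', 'neural'], ['net', 'kernel']]
rows = tfidf_rows(toy_docs)
print(rows[1])
```

In the second toy document, `kernel` receives a higher weight than `net` because `net` also occurs in the other document.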



## Topic Modelling with LDA

### Experimenting with the Latent Dirichlet Allocation (LDA)

LDA requires data in the form of integer counts. So I use the Dictionary class of the gensim package, which maps each unique term to an integer id and keeps track of how many documents the term appears in. The per-document term counts are produced afterwards with doc2bow.

I decided to remove rare words and common words based on their document frequency. Below I remove words that appear in fewer than 20 documents or in more than 80% of the documents.

Very frequent words (those that appear in more than 80% of the documents) are unlikely to help explain the topic of a document. At the other extreme, words that are extremely rare cannot lead to any reliable association with a specific topic.

In [16]:
# Create a dictionary representation of the documents.
dictionary = corpora.Dictionary(fixed_tokens)

# Filter out words that occur in fewer than 20 documents, or in more than 80% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.8)
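
The effect of the two thresholds can be sketched in plain Python. This is an illustration of the filtering rule on toy data, not gensim's implementation; `filter_by_doc_frequency` is a hypothetical helper, and gensim's `filter_extremes` also caps the vocabulary size through its `keep_n` argument, which is ignored here.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above):
    """Return the tokens found in at least no_below documents
    and in at most a fraction no_above of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ['network', 'learning', 'spike'],
    ['network', 'learning', 'kernel'],
    ['network', 'kernel', 'bound'],
    ['network', 'learning', 'bound'],
]
# 'network' is in every document (too common); 'spike' is in only one (too rare).
kept_tokens = sorted(filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.8))
print(kept_tokens)
```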

Gensim creates a unique id for each word in the vocabulary. The corpus produced below represents each document as a list of (word_id, word_frequency) pairs.

In [17]:
# Create the corpus
corpus = [dictionary.doc2bow(doc) for doc in fixed_tokens]
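
Conceptually, `doc2bow` looks up the id of every token in a document and counts how often each id occurs. A minimal plain-Python sketch of that idea follows; the helper names are hypothetical, and the ids assigned here need not match the ones gensim would assign.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in docs:
        for tok in doc:
            token_to_id.setdefault(tok, len(token_to_id))
    return token_to_id


def to_bow(doc, token_to_id):
    """Return sorted (token_id, count) pairs, ignoring tokens that have no id."""
    counts = Counter(token_to_id[tok] for tok in doc if tok in token_to_id)
    return sorted(counts.items())


toy_docs = [['neural', 'net', 'neural'], ['net', 'weight']]
token_to_id = build_token_ids(toy_docs)
bow_first = to_bow(toy_docs[0], token_to_id)
bow_unseen = to_bow(['weight', 'weight', 'unseen'], token_to_id)
print(bow_first)
print(bow_unseen)
```

Tokens that were filtered out of the vocabulary simply disappear from the bag-of-words representation, as `unseen` does in the second call.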

In [19]:
# Check what an item of corpus looks like
print corpus[0]

[(0, 1), (1, 1), (2, 8), (3, 4), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 31), (14, 1), (15, 1), (16, 11), (17, 1), (18, 3), (19, 3), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2), (27, 1), (28, 4), (29, 1), (30, 1), (31, 5), (32, 2), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 3), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 4), (45, 2), (46, 3), (47, 3), (48, 1), (49, 1), (50, 1), (51, 3), (52, 2), (53, 1), (54, 1), (55, 1), (56, 2), (57, 5), (58, 18), (59, 6), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 3), (66, 5), (67, 2), (68, 1), (69, 2), (70, 3), (71, 4), (72, 1), (73, 1), (74, 1), (75, 1), (76, 4), (77, 2), (78, 1), (79, 2), (80, 4), (81, 2), (82, 1), (83, 5), (84, 5), (85, 1), (86, 2), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 5), (93, 1), (94, 2), (95, 4), (96, 24), (97, 1), (98, 2), (99, 1), (100, 1), (101, 1), (102, 1), (103, 2), (104, 2), (105, 1), (106, 8), (107, 2), (108, 2), (109, 1), (110,

In [20]:
# In a more readable form:
for cp in corpus[2]:
    print (dictionary[cp[0]], cp[1])

(u'1st', 2)
(u'able', 2)
(u'adjust', 1)
(u'adjusted', 3)
(u'algorithm', 3)
(u'along', 2)
(u'american', 1)
(u'american_institute', 1)
(u'analysis', 3)
(u'another', 1)
(u'appeared', 1)
(u'area', 6)
(u'assigned', 1)
(u'available', 3)
(u'back', 22)
(u'behavior', 1)
(u'brain', 2)
(u'change', 1)
(u'combined', 5)
(u'complex', 13)
(u'computer', 1)
(u'conclusion', 1)
(u'conference', 5)
(u'connected', 1)
(u'connection', 5)
(u'context', 1)
(u'could', 1)
(u'cybernetics', 1)
(u'demonstrated', 3)
(u'department', 1)
(u'depends', 1)
(u'descent', 1)
(u'described', 3)
(u'desired', 5)
(u'detail', 1)
(u'determine', 3)
(u'diagram', 1)
(u'difference', 2)
(u'difficult', 3)
(u'discrete', 1)
(u'discus', 1)
(u'distribution', 1)
(u'enough', 1)
(u'equation', 1)
(u'error', 14)
(u'estimate', 2)
(u'fed', 1)
(u'feedback', 1)
(u'finite', 1)
(u'fixed', 20)
(u'fixing', 1)
(u'form', 24)
(u'formed', 16)
(u'forming', 3)
(u'four', 1)
(u'framework', 2)
(u'gain', 1)
(u'general', 2)
(u'generated', 2)
(u'hidden', 15)
(u'hidden_

In Latent Dirichlet Allocation, the order of the words in a document is not important, so we use the bag-of-words representation. A document is a distribution over topics. Each topic, in turn, is a distribution over the words of the vocabulary.
The generative process for a document is to choose a distribution over topics and then, for each word position, draw a topic and choose a word from that topic.
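
The generative story can be sketched with the standard library alone. This is a toy illustration under made-up topics and a made-up `alpha`, not how gensim fits the model; the helper names are hypothetical, and `random.choices` requires Python 3.6 or later.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Draw topic proportions once, then a topic and a word for every position."""
    theta = sample_dirichlet(alpha, rng)  # the document's distribution over topics
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # draw a topic
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # draw a word from it
    return theta, words


# Two made-up topics, each a distribution over a tiny vocabulary.
toy_topics = [
    {'neuron': 0.5, 'spike': 0.3, 'cell': 0.2},
    {'kernel': 0.6, 'margin': 0.4},
]
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document topic proportions.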

First I create a model with 10 topics and 'auto' for the 'alpha' hyper-parameter.

In [21]:
lda_model_num1 = models.ldamodel.LdaModel(corpus=corpus, 
                                          id2word=dictionary, 
                                          num_topics=10,
                                          alpha='auto')
lda_model_num1.print_topics()

[(0,
  u'0.010*"learning" + 0.008*"algorithm" + 0.007*"training" + 0.006*"data" + 0.005*"state" + 0.005*"output" + 0.004*"pattern" + 0.004*"point" + 0.004*"error" + 0.004*"feature"'),
 (1,
  u'0.007*"output" + 0.006*"data" + 0.005*"neuron" + 0.005*"learning" + 0.005*"weight" + 0.005*"algorithm" + 0.004*"state" + 0.004*"training" + 0.004*"error" + 0.004*"point"'),
 (2,
  u'0.009*"unit" + 0.009*"learning" + 0.006*"state" + 0.005*"neuron" + 0.004*"output" + 0.004*"algorithm" + 0.004*"point" + 0.004*"hidden" + 0.004*"weight" + 0.004*"parameter"'),
 (3,
  u'0.008*"learning" + 0.008*"unit" + 0.006*"algorithm" + 0.005*"error" + 0.005*"cell" + 0.005*"neuron" + 0.004*"training" + 0.004*"output" + 0.004*"image" + 0.004*"data"'),
 (4,
  u'0.009*"learning" + 0.007*"unit" + 0.007*"training" + 0.006*"algorithm" + 0.006*"vector" + 0.005*"data" + 0.005*"output" + 0.004*"layer" + 0.004*"error" + 0.004*"neuron"'),
 (5,
  u'0.009*"learning" + 0.006*"data" + 0.006*"unit" + 0.005*"algorithm" + 0.005*"train

Topic 0 is represented as: '0.010*"learning" + 0.008*"algorithm" + 0.007*"training" + 0.006*"data" + 0.005*"state" + 0.005*"output" + 0.004*"pattern" + 0.004*"point" + 0.004*"error" + 0.004*"feature"', which means that the top 10 keywords that contribute to this topic are 'learning', 'algorithm', 'training', etc., and the weight of 'learning' in topic 0 is 0.010.
The weights reflect how important a keyword is to that topic.

I used the model perplexity and the topic coherence to measure how good a given topic model is.

Topic coherence is a measure used to evaluate topic models. Each generated topic consists of words, and the topic coherence is applied to the top N words of the topic. Intuitively, the coherence score is an average of pairwise word-similarity scores of the words in the topic.
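
To illustrate the "average of pairwise similarities" idea, the sketch below scores a topic by the mean normalised pointwise mutual information (NPMI) of its word pairs, using document-level co-occurrence. This is a simplified stand-in, not the `c_v` measure used in this notebook, which as far as I understand combines a sliding window, NPMI and cosine similarity. The helper names and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, doc_sets):
    """Normalised PMI of two words, based on the documents they occur in."""
    n = float(len(doc_sets))
    p_a = sum(word_a in d for d in doc_sets) / n
    p_b = sum(word_b in d for d in doc_sets) / n
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 0.0   # both words are in every document; treated as uninformative here
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def average_pairwise_coherence(top_words, docs):
    """Mean NPMI over all pairs of the given top words."""
    doc_sets = [set(d) for d in docs]
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, doc_sets) for a, b in pairs) / len(pairs)


toy_docs = [
    ['neuron', 'spike', 'cell'],
    ['neuron', 'spike'],
    ['kernel', 'margin'],
    ['kernel', 'margin', 'bound'],
]
coherent = average_pairwise_coherence(['neuron', 'spike', 'cell'], toy_docs)
mixed = average_pairwise_coherence(['neuron', 'kernel', 'spike'], toy_docs)
print(coherent, mixed)
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes words from unrelated documents.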

In [34]:
# Compute Model Perplexity
# log_perplexity returns the per-word likelihood bound; perplexity is 2**(-bound),
# so a bound closer to zero means a lower (better) perplexity.
print('\nPerplexity: ', lda_model_num1.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_num1, texts=fixed_tokens, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

('\nPerplexity: ', -7.911982987710751)
('\nCoherence Score: ', 0.3041973902501089)


Then, I change the number of topics to 20.

In [35]:
lda_model_num2 = models.ldamodel.LdaModel(corpus=corpus, 
                                          id2word=dictionary, 
                                          num_topics=20,
                                          alpha='auto')
lda_model_num2.print_topics()

[(0,
  u'0.006*"training" + 0.006*"learning" + 0.005*"output" + 0.005*"unit" + 0.005*"weight" + 0.005*"neuron" + 0.004*"pattern" + 0.004*"neural_network" + 0.004*"data" + 0.003*"algorithm"'),
 (1,
  u'0.008*"algorithm" + 0.007*"data" + 0.007*"training" + 0.006*"learning" + 0.005*"output" + 0.005*"state" + 0.005*"error" + 0.004*"pattern" + 0.004*"neural_network" + 0.003*"weight"'),
 (2,
  u'0.008*"data" + 0.007*"unit" + 0.007*"learning" + 0.007*"error" + 0.007*"algorithm" + 0.006*"weight" + 0.005*"training" + 0.005*"image" + 0.004*"hidden" + 0.004*"vector"'),
 (3,
  u'0.009*"learning" + 0.007*"output" + 0.007*"algorithm" + 0.006*"data" + 0.004*"cell" + 0.004*"unit" + 0.004*"neuron" + 0.004*"method" + 0.004*"state" + 0.004*"layer"'),
 (4,
  u'0.007*"learning" + 0.006*"algorithm" + 0.006*"training" + 0.005*"output" + 0.005*"data" + 0.004*"error" + 0.004*"neural_network" + 0.004*"weight" + 0.004*"image" + 0.004*"parameter"'),
 (5,
  u'0.006*"neuron" + 0.006*"training" + 0.005*"learning" + 

In [36]:
# Compute Perplexity
print('\nPerplexity: ', lda_model_num2.log_perplexity(corpus))  

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_num2, texts=fixed_tokens, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

('\nPerplexity: ', -8.006047090955388)
('\nCoherence Score: ', 0.30384286973900754)


Then, I set the number of topics to 30.

In [37]:
lda_model_num3 = models.ldamodel.LdaModel(corpus=corpus, 
                                          id2word=dictionary, 
                                          num_topics=30, 
                                          alpha='auto')
lda_model_num3.print_topics()

[(22,
  u'0.007*"learning" + 0.006*"data" + 0.005*"error" + 0.005*"unit" + 0.005*"state" + 0.005*"training" + 0.004*"probability" + 0.004*"spike" + 0.004*"parameter" + 0.004*"algorithm"'),
 (15,
  u'0.020*"unit" + 0.007*"weight" + 0.007*"output" + 0.006*"hidden" + 0.005*"neuron" + 0.005*"training" + 0.005*"learning" + 0.004*"neural_network" + 0.004*"data" + 0.004*"error"'),
 (3,
  u'0.007*"weight" + 0.007*"learning" + 0.007*"algorithm" + 0.006*"output" + 0.006*"data" + 0.005*"training" + 0.005*"neural_network" + 0.004*"neuron" + 0.004*"vector" + 0.004*"state"'),
 (5,
  u'0.006*"data" + 0.006*"learning" + 0.005*"algorithm" + 0.005*"output" + 0.005*"unit" + 0.004*"pattern" + 0.004*"training" + 0.004*"image" + 0.004*"distribution" + 0.004*"signal"'),
 (13,
  u'0.007*"learning" + 0.007*"training" + 0.005*"output" + 0.004*"field" + 0.004*"data" + 0.004*"error" + 0.004*"neural_network" + 0.004*"method" + 0.004*"image" + 0.003*"weight"'),
 (17,
  u'0.007*"data" + 0.006*"learning" + 0.006*"fea

In [38]:
# Compute Perplexity
print('Perplexity: ', lda_model_num3.log_perplexity(corpus))  

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_num3, texts=fixed_tokens, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

('Perplexity: ', -8.096684996900262)
('Coherence Score: ', 0.3018599180361591)


So, as the experiments show, the highest coherence score is obtained with 10 topics, although the differences are small (about 0.3042, 0.3038 and 0.3019 for 10, 20 and 30 topics respectively).

The coherence score is generally not very high. This makes sense, as all the papers in the corpus come from the NIPS (Neural Information Processing Systems) conference, which means that they share more or less the same or similar words.

Next, I changed the hyper-parameter 'alpha' from 'auto' to 'asymmetric', keeping the number of topics that had the highest coherence.

In [39]:
lda_model_num4 = models.ldamodel.LdaModel(corpus=corpus, 
                                          id2word=dictionary, 
                                          num_topics=10, 
                                          alpha='asymmetric')
lda_model_num4.print_topics()

[(0,
  u'0.008*"learning" + 0.007*"algorithm" + 0.005*"unit" + 0.005*"state" + 0.005*"output" + 0.004*"cell" + 0.004*"pattern" + 0.004*"neuron" + 0.004*"weight" + 0.004*"feature"'),
 (1,
  u'0.006*"algorithm" + 0.006*"output" + 0.005*"data" + 0.005*"learning" + 0.005*"neuron" + 0.004*"weight" + 0.004*"parameter" + 0.004*"unit" + 0.004*"training" + 0.004*"point"'),
 (2,
  u'0.007*"learning" + 0.006*"data" + 0.006*"unit" + 0.005*"error" + 0.005*"algorithm" + 0.005*"training" + 0.004*"weight" + 0.004*"image" + 0.004*"output" + 0.004*"state"'),
 (3,
  u'0.006*"learning" + 0.005*"training" + 0.005*"data" + 0.005*"parameter" + 0.005*"unit" + 0.004*"method" + 0.004*"algorithm" + 0.004*"error" + 0.004*"state" + 0.003*"probability"'),
 (4,
  u'0.010*"learning" + 0.006*"training" + 0.005*"data" + 0.005*"algorithm" + 0.005*"unit" + 0.004*"method" + 0.004*"weight" + 0.004*"image" + 0.004*"error" + 0.004*"output"'),
 (5,
  u'0.009*"learning" + 0.008*"training" + 0.007*"data" + 0.005*"algorithm" + 0

In [40]:
# Compute Perplexity
print('Perplexity: ', lda_model_num4.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_num4, texts=fixed_tokens, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

('Perplexity: ', -7.9161653051717105)
('Coherence Score: ', 0.3006473971431597)


Finally, I set the 'alpha' hyper-parameter to its default value, which is 'symmetric'.

In [42]:
lda_model_num5 = models.ldamodel.LdaModel(corpus=corpus, 
                                          id2word=dictionary, 
                                          num_topics=10, 
                                          alpha='symmetric')
lda_model_num5.print_topics()

[(0,
  u'0.010*"learning" + 0.007*"training" + 0.007*"weight" + 0.006*"unit" + 0.005*"algorithm" + 0.005*"data" + 0.005*"output" + 0.005*"state" + 0.004*"probability" + 0.004*"neuron"'),
 (1,
  u'0.009*"learning" + 0.008*"algorithm" + 0.007*"output" + 0.007*"data" + 0.006*"training" + 0.005*"neural_network" + 0.004*"unit" + 0.004*"probability" + 0.004*"error" + 0.004*"distribution"'),
 (2,
  u'0.007*"algorithm" + 0.007*"data" + 0.006*"learning" + 0.006*"output" + 0.006*"weight" + 0.005*"error" + 0.005*"unit" + 0.004*"point" + 0.004*"state" + 0.004*"neuron"'),
 (3,
  u'0.012*"learning" + 0.006*"unit" + 0.006*"data" + 0.006*"state" + 0.005*"algorithm" + 0.004*"output" + 0.004*"noise" + 0.004*"parameter" + 0.004*"image" + 0.004*"vector"'),
 (4,
  u'0.008*"training" + 0.007*"learning" + 0.007*"pattern" + 0.007*"weight" + 0.006*"data" + 0.005*"image" + 0.005*"vector" + 0.005*"unit" + 0.004*"error" + 0.004*"neuron"'),
 (5,
  u'0.008*"algorithm" + 0.007*"error" + 0.006*"learning" + 0.005*"sta

In [43]:
# Compute Perplexity
print('Perplexity: ', lda_model_num5.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_num5, texts=fixed_tokens, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

('Perplexity: ', -7.916872045147157)
('Coherence Score: ', 0.30148515562937767)


So, with the help of the pyLDAvis package, I display the topics of the LDA model with the highest coherence score, which was the one with 10 topics and 'auto' for the 'alpha' parameter. The visualization is saved in the file named "LDA_visualization.html".

In [22]:
vis = pyLDAvis.gensim.prepare(lda_model_num1, corpus, dictionary)
pyLDAvis.save_html(vis, "LDA_visualization.html")

To find the topic that each paper belongs to, we apply the LDA model to the corpus and, for each paper, choose the topic with the maximum probability. The output shows that the first paper most likely belongs to topic number 2, the second paper to topic number 3, etc.

In [29]:
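lda_results = lda_model_num1[corpus]

# For each paper, keep its most probable (topic, probability) pair together with the paper index.
best = []
for i, res in enumerate(lda_results):
    bs = max(res, key=lambda item: item[1])
    best.append((bs, i))
best[:10]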

[((2, 0.4296106), 0),
 ((3, 0.39737296), 1),
 ((1, 0.80723816), 2),
 ((5, 0.6038754), 3),
 ((8, 0.43657088), 4),
 ((2, 0.30511442), 5),
 ((8, 0.798671), 6),
 ((3, 0.35681605), 7),
 ((4, 0.34375), 8),
 ((8, 0.3247165), 9)]

Then, to find which paper best represents each topic, we take the best probabilities from before and, for each topic, find the maximum probability value over all papers assigned to it. This way we find the paper that has the highest probability for each topic number.

In [31]:
# For each topic, print the paper whose dominant topic has the highest probability.
for item in range(10):
    fmax = [res for res in best if res[0][0] == item]
    if not fmax:
        print 'None'
    else:
        print max(fmax, key=lambda x: x[0][1])

((0, 0.96704495), 732)
((1, 0.9985982), 1408)
((2, 0.998661), 748)
((3, 0.9895683), 933)
((4, 0.9934871), 744)
((5, 0.9781461), 1035)
((6, 0.9711033), 358)
((7, 0.9409545), 1333)
((8, 0.9961287), 694)
((9, 0.93707544), 1475)
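
The same selection can also be done in a single pass over `best` with a dictionary keyed by topic. The sketch below is one possible way to do it; `best_paper_per_topic` is a hypothetical helper, and the toy list reuses the first rows of `best` shown earlier.

```python
def best_paper_per_topic(best):
    """Map each topic id to the ((topic, probability), paper_index) entry
    with the highest probability among the papers assigned to that topic."""
    top = {}
    for entry in best:
        (topic, prob), _paper = entry
        if topic not in top or prob > top[topic][0][1]:
            top[topic] = entry
    return top


# First rows of the `best` list shown earlier.
toy_best = [((2, 0.4296106), 0), ((3, 0.39737296), 1), ((1, 0.80723816), 2),
            ((5, 0.6038754), 3), ((8, 0.43657088), 4), ((2, 0.30511442), 5),
            ((8, 0.798671), 6)]
top_entries = best_paper_per_topic(toy_best)
for topic, entry in sorted(top_entries.items()):
    print(entry)
```

Topics to which no paper is assigned are simply absent from the dictionary, which corresponds to the 'None' case in the loop above.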
