### Categorizing NIPS papers using LDA topic modeling

The LDA code is adapted from Jordan Barber's blog post [Latent Dirichlet Allocation (LDA) with Python](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html).

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib2
import numpy as np

from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models
import gensim
from stop_words import get_stop_words

In [2]:
url= 'https://nips.cc/Conferences/2015/AcceptedPapers'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")  # name the parser explicitly

Each paper title sits in a `span` with `class="larger-font"`, which we found by inspecting the page's HTML source, so we can collect the titles by finding every such span.

In [3]:
d = soup.find_all("span", {"class": "larger-font"})

Then we can pull out the text content of each of those spans.

In [4]:
titles = [ti.contents[0] for ti in d]
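
To make the selection step concrete without hitting the network, here is a minimal sketch that applies the same idea to a small inline HTML string. It uses the standard library's `html.parser` rather than BeautifulSoup, and it is written for Python 3, unlike the cells in this notebook. The class `SpanTextCollector`, the sample markup, and the titles in it are all made up for illustration. It also gathers all the text inside each matching span, which only approximates `ti.contents[0]`.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose class list includes target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching span
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            # A span nested inside a matching span: track it so the
            # closing tags are counted correctly.
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.titles[-1] += data


# Made-up markup that mimics the structure described above.
sample_html = """
<div><span class="larger-font">A Made-Up Paper Title</span><b>Some Author</b></div>
<div><span class="larger-font">Another Made-Up Title</span><b>Another Author</b></div>
<span class="other">not a title</span>
"""

collector = SpanTextCollector("larger-font")
collector.feed(sample_html)
sample_titles = collector.titles
print(sample_titles)
```

Only the two `larger-font` spans contribute text. The author names and the span with a different class are ignored.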

In [5]:
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')

def tokenize(text):
    # Lowercase, split into word tokens, and drop English stop words.
    raw = text.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in en_stop]
    return stopped_tokens

In [6]:
texts = [tokenize(i) for i in titles]

Look at the first 5:

In [7]:
texts[:5]

[[u'double',
  u'nothing',
  u'multiplicative',
  u'incentive',
  u'mechanisms',
  u'crowdsourcing'],
 [u'learning', u'symmetric', u'label', u'noise', u'importance', u'unhinged'],
 [u'algorithmic', u'stability', u'uniform', u'generalization'],
 [u'adaptive',
  u'low',
  u'complexity',
  u'sequential',
  u'inference',
  u'dirichlet',
  u'process',
  u'mixture',
  u'models'],
 [u'covariance',
  u'controlled',
  u'adaptive',
  u'langevin',
  u'thermostat',
  u'large',
  u'scale',
  u'bayesian',
  u'sampling']]
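
To see what the tokenizing step does to a single title, here is a rough standard-library sketch. It assumes that `RegexpTokenizer(r'\w+')` behaves like `re.findall` with the same pattern, and it swaps in a tiny hypothetical stop-word set (`SAMPLE_STOP_WORDS`) for `get_stop_words('en')`. The title is made up.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses get_stop_words('en').
SAMPLE_STOP_WORDS = {"a", "an", "the", "of", "for", "in", "with", "and", "or"}


def tokenize_sketch(text, stop_words=SAMPLE_STOP_WORDS):
    # Lowercase, keep runs of word characters, then drop stop words.
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


sample_tokens = tokenize_sketch("A Made-Up Method for Fast Inference in Topic Models")
print(sample_tokens)
# ['made', 'up', 'method', 'fast', 'inference', 'topic', 'models']
```

Note that the `\w+` pattern splits hyphenated words, so "Made-Up" becomes two tokens.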

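Next we'll build a dictionary that maps each unique token in the tokenized titles to an integer id, and use it to convert every title into a bag-of-words vector of `(token id, count)` pairs. Together these vectors form the document-term matrix.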

In [8]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
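
The dictionary and bag-of-words steps can be sketched in plain Python to show what the resulting data looks like. This is an illustration of the idea, not gensim's implementation: the names `token2id` and `doc2bow_sketch` and the sample documents are made up, and gensim may assign ids in a different order.

```python
from collections import Counter

# Made-up tokenized documents.
sample_texts = [
    ["learning", "deep", "models"],
    ["deep", "deep", "inference"],
]

# Give each unique token an integer id, in order of first appearance.
# (gensim's own id assignment may differ; only the idea matters here.)
token2id = {}
for doc in sample_texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def doc2bow_sketch(doc):
    # Count how often each token id occurs in one document.
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


sample_corpus = [doc2bow_sketch(doc) for doc in sample_texts]
print(token2id)
# {'learning': 0, 'deep': 1, 'models': 2, 'inference': 3}
print(sample_corpus)
# [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
```

Each document keeps only the ids of the tokens it contains, so the representation stays sparse even when the vocabulary is large.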

#### Fit the LDA model

In [9]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20)

In [10]:
topics = ldamodel.show_topics(formatted=False)
topics

[[(0.011732983793008859, u'learning'),
  (0.010482399776412965, u'analysis'),
  (0.010342719435481698, u'algorithms'),
  (0.0096651234244324539, u'via'),
  (0.0094983256782661969, u'online'),
  (0.009251637426688105, u'fast'),
  (0.008841814391757373, u'stochastic'),
  (0.0085808337414099382, u'neural'),
  (0.0084995785011522922, u'high'),
  (0.0072877433628405815, u'deep')],
 [(0.032258508362300202, u'learning'),
  (0.019083355907547502, u'inference'),
  (0.014040426820267148, u'optimization'),
  (0.0094000730496172375, u'variational'),
  (0.0091987922635342265, u'linear'),
  (0.0086692262732991007, u'convex'),
  (0.0081830125809068139, u'stochastic'),
  (0.0071314117398587236, u'methods'),
  (0.0071232912690603247, u'descent'),
  (0.0070634589734472758, u'via')],
 [(0.029560009462343204, u'models'),
  (0.021667680663985708, u'networks'),
  (0.018752001998676174, u'learning'),
  (0.010854141614568602, u'using'),
  (0.010815113356399364, u'neural'),
  (0.010013338409413276, u'optimizat

In [11]:
pd.DataFrame({topic:[i[1] for i in t] for topic,t in enumerate(topics)})

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | learning | learning | models |
| 1 | analysis | inference | networks |
| 2 | algorithms | optimization | learning |
| 3 | via | variational | using |
| 4 | online | linear | neural |
| 5 | fast | convex | optimization |
| 6 | stochastic | stochastic | recurrent |
| 7 | neural | methods | time |
| 8 | high | descent | sample |
| 9 | deep | via | efficient |
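
The reshaping behind this table can be sketched without pandas. The sketch assumes `topics` has the shape shown in the output above: one list per topic, each holding `(probability, word)` pairs. The values in `sample_topics` are made up. Other gensim versions may return a different structure (for example, topic ids paired with `(word, probability)` lists), in which case the indexing would need adjusting.

```python
# Made-up stand-in for `topics`: one list of (probability, word) pairs per topic.
sample_topics = [
    [(0.5, "learning"), (0.3, "analysis")],
    [(0.6, "inference"), (0.2, "convex")],
]

# One column per topic id, keeping only the word from each pair.
columns = {
    topic_id: [word for _prob, word in pairs]
    for topic_id, pairs in enumerate(sample_topics)
}
print(columns)
# {0: ['learning', 'analysis'], 1: ['inference', 'convex']}
```

Passing a dictionary like `columns` to `pd.DataFrame` gives one column per topic, with words ordered from most to least probable.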
