### Categorizing NIPS papers using LDA topic modeling

The LDA code is adapted from Jordan Barber's blog post [Latent Dirichlet Allocation (LDA) with Python](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html).

In [125]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib2
import numpy as np

from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models
import gensim
from stop_words import get_stop_words

In [126]:
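url = 'https://nips.cc/Conferences/2015/AcceptedPapers'
page = urllib2.urlopen(url)
# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(page.read(), 'html.parser')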

Each paper title sits in a span with `class=larger-font`, which we found by inspecting the HTML source of the page. We'll collect every such span.

In [127]:
d = soup.find_all("span", {"class": "larger-font"})

Then we can pull out the text content of each of those spans.

In [128]:
titles = [ti.contents[0] for ti in d]
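
To make the span-extraction step concrete without hitting the network, here is a minimal sketch that runs on a made-up HTML fragment. It swaps BeautifulSoup for the standard library's `html.parser`, so it is not the code used in this notebook; `SAMPLE_HTML`, `TitleParser`, and the titles are all hypothetical, and the real page's markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical fragment imitating the structure described above;
# the titles are invented and the real page's markup may differ.
SAMPLE_HTML = """
<div><span class="larger-font">A Made-Up Paper Title</span>
<span class="authors">A. Author</span></div>
<div><span class="larger-font">Another Invented Title</span></div>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <span> whose class list includes larger-font."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "larger-font" in classes:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        # Assumes title spans contain no nested spans.
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
sample_titles = parser.titles
print(sample_titles)
```

The span with `class="authors"` is skipped, so only the two title strings are collected.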

In [130]:
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')

def tokenize(text):
    raw = text.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in en_stop]
    return stopped_tokens

In [131]:
texts = [tokenize(i) for i in titles]

Look at the first 5:

In [145]:
texts[:5]

[[u'double',
  u'nothing',
  u'multiplicative',
  u'incentive',
  u'mechanisms',
  u'crowdsourcing'],
 [u'learning', u'symmetric', u'label', u'noise', u'importance', u'unhinged'],
 [u'algorithmic', u'stability', u'uniform', u'generalization'],
 [u'adaptive',
  u'low',
  u'complexity',
  u'sequential',
  u'inference',
  u'dirichlet',
  u'process',
  u'mixture',
  u'models'],
 [u'covariance',
  u'controlled',
  u'adaptive',
  u'langevin',
  u'thermostat',
  u'large',
  u'scale',
  u'bayesian',
  u'sampling']]
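
The output above shows two effects of the tokenizer: stop words disappear, and hyphenated terms split into separate tokens (`large`, `scale`). The sketch below reproduces both with the standard library only. It assumes a tiny hypothetical stop-word list (`STOP_WORDS`) in place of `get_stop_words('en')`, uses `re.findall` with the same `\w+` pattern in place of `RegexpTokenizer`, and runs on an invented title.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses the much longer list returned by get_stop_words('en').
STOP_WORDS = {"a", "and", "for", "of", "or", "the", "with"}


def tokenize_sketch(text):
    """Lowercase, split on runs of word characters, then drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Invented title, not taken from the accepted-papers list.
example = tokenize_sketch("Large-Scale Sampling: A Study of the Langevin Thermostat")
print(example)
```

Punctuation such as the hyphen and colon never matches `\w+`, which is why it vanishes and why `Large-Scale` yields two tokens.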

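Next we'll build a dictionary that maps each unique token to an integer id, then use it to convert every title into a bag of words: a list of `(token_id, count)` pairs. Together these lists form the document-term matrix.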

In [141]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
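
As a rough illustration of what the dictionary and bag-of-words steps produce, here is a pure-Python sketch on two invented token lists. It is not gensim's implementation: `token2id` and `to_bow` are hypothetical names, and gensim's `Dictionary` may assign different ids to the same tokens.

```python
from collections import Counter

# Invented token lists standing in for tokenized titles.
sample_texts = [
    ["adaptive", "inference", "models"],
    ["adaptive", "bayesian", "sampling", "sampling"],
]

# Assign each unique token an integer id in order of first appearance.
# (gensim's Dictionary may number tokens differently.)
token2id = {}
for text in sample_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_bow(text):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


sample_corpus = [to_bow(text) for text in sample_texts]
print(token2id)
print(sample_corpus)
```

Each document keeps only the ids of tokens it contains, so the matrix is stored sparsely: the repeated `sampling` in the second list becomes a single pair with a count of 2.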

#### Fit the LDA model

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20)

In [149]:
topics = ldamodel.show_topics(formatted=False)
topics

[[(0.021683658123877118, u'learning'),
  (0.017655087068210285, u'neural'),
  (0.016952136252133005, u'networks'),
  (0.0078411199663377361, u'efficient'),
  (0.0078132540497462665, u'convolutional'),
  (0.0076511674247561282, u'via'),
  (0.0074118556495606476, u'prediction'),
  (0.0074077285192164118, u'bayesian'),
  (0.0074000132306674691, u'linear'),
  (0.006436682404842335, u'data')],
 [(0.039029094467679597, u'learning'),
  (0.017242054421422709, u'models'),
  (0.01322435734987213, u'stochastic'),
  (0.0092802367519656016, u'gaussian'),
  (0.009205980998052251, u'sparse'),
  (0.0086590219476347513, u'using'),
  (0.0086061141425646161, u'deep'),
  (0.0083293929756830826, u'via'),
  (0.0079932981561969107, u'optimization'),
  (0.0072048578179862112, u'sample')],
 [(0.018104368297777656, u'inference'),
  (0.01605922533538762, u'optimization'),
  (0.010295478839678457, u'variational'),
  (0.0094634453838710331, u'bounds'),
  (0.0088129537506973499, u'fast'),
  (0.0076534943162640673, 

In [150]:
pd.DataFrame({topic:[i[1] for i in t] for topic,t in enumerate(topics)})

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | learning | learning | inference |
| 1 | neural | models | optimization |
| 2 | networks | stochastic | variational |
| 3 | efficient | gaussian | bounds |
| 4 | convolutional | sparse | fast |
| 5 | via | using | linear |
| 6 | prediction | deep | models |
| 7 | bayesian | via | algorithms |
| 8 | linear | optimization | robust |
| 9 | data | sample | networks |
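
To go from topics to categories, one option is to assign each title to its most probable topic. In gensim, indexing the model with a bag of words (`ldamodel[bow]`) returns a list of `(topic_id, probability)` pairs, omitting topics below a minimum probability. The sketch below assumes that format and uses invented probabilities; `dominant_topic` is a hypothetical helper, not part of gensim.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Invented topic distributions for two titles, in the
# (topic_id, probability) format described above.
example_doc_topics = [
    [(0, 0.82), (1, 0.10), (2, 0.08)],
    [(0, 0.05), (2, 0.95)],  # topic 1 omitted, as a low-probability topic would be
]

assigned = [dominant_topic(dt)[0] for dt in example_doc_topics]
print(assigned)
```

With the fitted model, the same idea would read `[dominant_topic(ldamodel[bow])[0] for bow in corpus]`, giving one topic id per title in the order of `titles`.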
