In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [16]:
filename_list = os.listdir()[0:500]

In [17]:
# Just making sure these file names are correct:
filename_list[0:5]

# Looks good!

['0301116', '0304232', '0303017', '0303225', '0302131']

Now we can read in all the files from the `filename_list`.

In [18]:
corpus = []

for filename in filename_list:
    # errors='ignore' silently drops any bytes that cannot be decoded
    with open(filename, 'r', errors='ignore') as file:
        file_contents = file.read()

    # Add document to corpus
    corpus.append(file_contents)
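
The loop above assumes that every entry returned by `os.listdir()` is a readable text file in the current working directory. A minimal sketch of a slightly more defensive variant follows. The helper name `read_corpus` and its `limit` parameter are illustrative assumptions, not part of the original notebook. It sorts the names so the selection is reproducible (which may pick a different set of 500 files than the unsorted listing) and skips subdirectories.

```python
from pathlib import Path


def read_corpus(directory=".", limit=500):
    """Read up to `limit` regular files from `directory`, in sorted name order.

    Hypothetical helper for illustration; returns (filenames, documents).
    """
    # Keep regular files only, so subdirectories do not break open()/read().
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())[:limit]
    # errors="ignore" drops undecodable bytes, matching the loop above.
    documents = [p.read_text(errors="ignore") for p in paths]
    return [p.name for p in paths], documents
```

It could be called as `filename_list, corpus = read_corpus(".", limit=500)`.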

In [19]:
len(corpus)

500

We have exactly 500 documents to work with. Now we can turn our corpus into a matrix of TF-IDF features using `sklearn`'s `TfidfVectorizer()`.

In [23]:
vectorizer = TfidfVectorizer(decode_error='ignore', max_df=0.9, ngram_range=(2, 2), max_features = 20000)

X = vectorizer.fit_transform(corpus)
X

<500x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 762717 stored elements in Compressed Sparse Row format>
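
For reference, this is the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), as described in the scikit-learn documentation. Here $n$ is the number of documents (500 in this notebook), $\mathrm{tf}(t, d)$ is the raw count of bigram $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
x(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
```

As a worked example, a bigram found in 50 of the 500 documents gets $\mathrm{idf} = \ln(501/51) + 1 \approx 3.28$, while one found in 450 documents gets $\ln(501/451) + 1 \approx 1.11$, so rarer bigrams carry more weight before each row is scaled to unit length.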

Let's take a look at the vocabulary that was learned by the vectorizer.

In [29]:
vectorizer.vocabulary_

{'the paper': 16867,
 'documentstyle 12pt': 5075,
 'newcommand beq': 10718,
 'beq begin': 2453,
 'begin equation': 2403,
 'equation newcommand': 5727,
 'newcommand eeq': 10731,
 'eeq end': 5308,
 'end equation': 5436,
 'begin displaymath': 2400,
 'end displaymath': 5433,
 'beqa begin': 2464,
 'begin eqnarray': 2402,
 'eqnarray newcommand': 5619,
 'eeqa end': 5322,
 'end eqnarray': 5435,
 'partial newcommand': 12394,
 'renewcommand thefootnote': 13720,
 'thefootnote fnsymbol': 17385,
 'fnsymbol footnote': 6394,
 'footnote renewcommand': 6448,
 'setcounter footnote': 14470,
 'renewcommand theequation': 13719,
 'theequation arabic': 17382,
 'arabic section': 1567,
 'section arabic': 14356,
 'arabic equation': 1565,
 'equation def': 5683,
 'def nn': 4581,
 'nn nonumber': 10776,
 'nonumber def': 10864,
 'hbox to': 7701,
 'value equation': 18859,
 'setcounter equation': 14469,
 'equation renewcommand': 5751,
 'equation begin': 5670,
 'begin flushright': 2406,
 '11 end': 71,
 'end flushright'

In [27]:
vectorizer.stop_words_

{'g_2 subset',
 'eqletter where',
 'again within',
 '229 1991',
 'needs accelerations',
 'j_k the',
 '30 32',
 'peebles phys',
 'phys leipzig',
 'those neutral',
 'theta_k theta_',
 '36 based',
 'parameters la_b',
 'these papers',
 'mathbf in',
 'rigorously by',
 'physical terminology',
 'term being',
 'else protect',
 'version section',
 'matrix a_1',
 'requires specification',
 'people it',
 'so obvious',
 'goldwirth 1991rj',
 'curved backgrounds',
 'calling these',
 'cartan atiyah',
 'with solitonic',
 'lev in',
 'hawking predict',
 'bakasi we',
 'exs t_',
 'providing efficient',
 'vk dagger',
 'following hyperbolic',
 'maxim textsc',
 'phys b556',
 'integrals begin',
 'not homogeneous',
 'cite brustein01',
 'behavior using',
 'flux 64',
 'cc ast',
 'regard to',
 'superfields n_',
 'matter clearpage',
 'eqn svarl',
 'all sure',
 'by pp',
 'ox1 3lb',
 'th 0211017',
 'regularized quantum',
 'a_f pi',
 'connected element',
 'with asking',
 'eqnarray luckily',
 'karmanov93 we',
 'indica

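Note that `stop_words_` holds every bigram the vectorizer dropped, not only those above the `max_df=0.9` threshold: it also includes bigrams cut by the `max_features=20000` cap. Most of the entries shown here (for example 'peebles phys' or 'goldwirth 1991rj') are almost certainly rare phrases removed by that cap rather than phrases occurring in more than 90% of papers. A few physics terms were caught in the filter, but the benefit should outweigh this cost. (After all, they fall outside the 20,000 most frequent bigrams, so they are unlikely to drive any topic.)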
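
To see which bigrams actually exceed the `max_df=0.9` threshold, document frequencies can be counted directly. The sketch below is a plain-Python approximation. The function names are hypothetical, and the tokenization only mimics the vectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Same shape as scikit-learn's default token_pattern: 2+ word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def bigrams(text):
    """Lowercase, tokenize, and join adjacent tokens into bigram strings."""
    tokens = TOKEN_RE.findall(text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def high_df_bigrams(documents, max_df=0.9):
    """Return bigrams whose document frequency exceeds `max_df` (a proportion)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(bigrams(doc)))  # count each bigram once per document
    threshold = max_df * len(documents)
    return {term for term, df in doc_freq.items() if df > threshold}
```

Calling `high_df_bigrams(corpus)` would list the phrases that `max_df` alone removes, which can then be compared against `vectorizer.stop_words_`.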

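The first entries of the vocabulary show some additional LaTeX phrases that are not informative to us. (The dictionary appears to keep bigrams in the order they were first encountered, so these entries come from the preamble of the first document rather than being the most frequent phrases overall.) We'll try to remove this low-hanging fruit by refitting the vectorizer and passing these stop-phrases as a list.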

In [32]:
additional_stopwords = list(vectorizer.vocabulary_)[0:50]

In [34]:
vectorizer = TfidfVectorizer(decode_error='ignore', max_df=0.9, ngram_range=(2, 2), 
                             max_features = 20000, stop_words=additional_stopwords)

X = vectorizer.fit_transform(corpus)
X

<500x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 762717 stored elements in Compressed Sparse Row format>
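
Note that this matrix has the same shape and the same 762717 stored elements as before. In scikit-learn, `stop_words` is applied to individual tokens before n-grams are assembled, so two-word strings such as 'begin equation' never match anything and the refit changes nothing. One possible workaround, sketched here under assumptions, is a custom analyzer that builds the bigrams itself and then drops unwanted phrases. `make_bigram_analyzer` is a hypothetical helper, and its tokenization only approximates the vectorizer's defaults.

```python
import re


def make_bigram_analyzer(stop_phrases):
    """Build a callable analyzer that emits bigrams and skips `stop_phrases`."""
    stop_phrases = frozenset(stop_phrases)
    token_re = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

    def analyzer(document):
        tokens = token_re.findall(document.lower())
        pairs = (" ".join(pair) for pair in zip(tokens, tokens[1:]))
        # Filter at the bigram level, which stop_words cannot do.
        return [phrase for phrase in pairs if phrase not in stop_phrases]

    return analyzer


analyzer = make_bigram_analyzer(["begin equation", "end equation"])
print(analyzer("\\begin{equation} E = mc^2 \\end{equation} for the black hole"))
```

The callable can be passed as `TfidfVectorizer(analyzer=analyzer, max_df=0.9, max_features=20000)`. When `analyzer` is a callable, the vectorizer leaves tokenization to it, so `ngram_range` and `lowercase` should no longer have any effect.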

Now we'll do some topic modeling.

In [35]:
# Initialize NMF
nmf_model = NMF(n_components = 50, solver = 'mu')

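# Map column indices back to bigrams so topics are easy to read.
# (scikit-learn >= 1.0 provides get_feature_names_out(); older releases use get_feature_names().)
idx_to_word = np.array(vectorizer.get_feature_names_out())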

In [36]:
nmf_model.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=50, random_state=None, shuffle=False, solver='mu',
  tol=0.0001, verbose=0)

In [37]:
nmf_components = nmf_model.components_

In [38]:
for i, topic in enumerate(nmf_components):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: gauge theory, nucl phys, string theory, jhep bf, phys bf, begin equation, end equation, arxiv hep, citation hep, hep th
Topic 2: frac partial, partial partial, right end, equation the, equation frac, equation begin, equation where, equation label, begin equation, end equation
Topic 3: f_ mu, h_ mu, partial_ mu, nu rho, b_ mu, partial mu, alpha beta, nu mu, a_ mu, mu nu
Topic 4: hole entropy, the horizon, schwarzschild black, the entropy, of black, s_ bh, gr qc, the black, black holes, black hole
Topic 5: ee the, lambda_ alpha, in cite, al beta, a_ rm, right ee, rm cr, ee be, ee where, be label
Topic 6: the plane, hep th, the penrose, bea la, frac mu, penrose limit, light cone, wave background, pp wave, plane wave
Topic 7: closed strings, the boundary, the tachyon, the open, field theory, the closed, boundary state, open string, string field, closed string
Topic 8: right rangle_, mathrm exp, ee sigma, u_ mu, displaystyle frac, cite rf, the proton, eq ref, label eq, ref eq
Topic
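
A note on reading this output: `topic.argsort()[-10:]` returns the indices of the ten largest weights in ascending order, so in each line the highest-weighted bigram comes last. A small sketch of a reusable helper that lists terms from strongest to weakest is given here. `top_terms` is a hypothetical name, and the toy arrays stand in for `nmf_components` and `idx_to_word`.

```python
import numpy as np


def top_terms(components, feature_names, n_top=10):
    """Return, per topic, the `n_top` highest-weighted terms, strongest first."""
    feature_names = np.asarray(feature_names)
    topics = []
    for weights in components:
        order = np.argsort(weights)[::-1][:n_top]  # indices sorted by descending weight
        topics.append(feature_names[order].tolist())
    return topics


# Toy stand-ins for nmf_components and idx_to_word.
toy_components = np.array([[0.1, 0.9, 0.5], [0.7, 0.0, 0.2]])
toy_names = ["mu nu", "black hole", "pp wave"]

for i, terms in enumerate(top_terms(toy_components, toy_names, n_top=2), start=1):
    print("Topic {}: {}".format(i, ", ".join(terms)))
```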

These topics are interesting, but a few of them consist entirely of LaTeX markup, which tells us nothing about the physics. With 50 components, NMF may be splitting the corpus too finely, so it is worth trying fewer components (say, 15) to see whether the topics become more cohesive.

Let's try that next.

In [60]:
nmf_model_15 = NMF(n_components = 15, solver = 'mu')
nmf_model_15.fit(X)
nmf_components_15 = nmf_model_15.components_

In [61]:
for i, topic in enumerate(nmf_components_15):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: nucl phys, the brane, gauge theory, phys bf, matrix model, begin equation, end equation, arxiv hep, citation hep, hep th
Topic 2: mu nu, equation the, equation frac, alpha beta, right end, equation begin, equation where, equation label, begin equation, end equation
Topic 3: g_ mu, f_ mu, nu rho, partial mu, b_ mu, partial_ mu, nu mu, alpha beta, a_ mu, mu nu
Topic 4: arxiv gr, schwarzschild black, of black, the horizon, the entropy, s_ bh, gr qc, the black, black holes, black hole
Topic 5: the space, mathscr newcommand, the lie, hh 2_, there is, bf cp, space of, lie algebra, phase space, hilbert space
Topic 6: the pp, bmn operators, penrose limit, citation hep, frac mu, light cone, wave background, hep th, pp wave, plane wave
Topic 7: the boundary, the open, pi alpha, field theory, the closed, the tachyon, boundary state, open string, string field, closed string
Topic 8: bibitem rf, begin eqnarray, end eqnarray, u_ mu, displaystyle frac, the proton, eq ref, cite rf, label eq, 

Reducing the number of components seemed to help a bit, but we still have multiple topics that are entirely LaTeX formatting specifications, and the physics-related topics still have some LaTeX artifacts.
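
Since LaTeX markup keeps surfacing in the topics, one option worth trying is to strip it from the raw text before vectorizing. The sketch below is a rough regex-based cleaner, not a LaTeX parser. `strip_latex` and its patterns are assumptions chosen for illustration and will miss many constructs (math symbols, braces, and the bodies of user-defined macros are left in place).

```python
import re

COMMENT_RE = re.compile(r"(?<!\\)%.*")                     # comments up to end of line
ENV_NAME_RE = re.compile(r"\\(?:begin|end)\s*\{[^}]*\}")   # \begin{equation}, \end{eqnarray}
COMMAND_RE = re.compile(r"\\[A-Za-z]+\*?")                 # control words such as \frac, \cite


def strip_latex(text):
    """Remove comments, environment markers, and control words; collapse whitespace."""
    text = COMMENT_RE.sub(" ", text)
    text = ENV_NAME_RE.sub(" ", text)
    text = COMMAND_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = r"\begin{equation} S_{BH} = \frac{A}{4} \end{equation} The black hole entropy % note"
print(strip_latex(sample))
```

Applied to this notebook, the cleaned corpus would be `[strip_latex(doc) for doc in corpus]`, passed to `fit_transform` in place of `corpus`.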

Let's turn to LDA and see if it can do any better here.

In [58]:
lda_model = LatentDirichletAllocation(max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda_model.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [59]:
for i, topic in enumerate(lda_model.components_):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: to infty, eq ref, alpha beta, end eqnarray, with respect, begin eqnarray, hep th, begin equation, mu nu, end equation
Topic 2: the orbifold, we can, frac pi, arxiv hep, gauge theory, mu nu, begin equation, citation hep, end equation, hep th
Topic 3: bar bar, ref eq, plain left, the term, size label, decor size, scs dist, hep th, fmfv decor, fmf plain
Topic 4: black hole, r_ r_, over sqrt, in cite, mu nu, arxiv hep, end equation, hep th, begin equation, citation hep
Topic 5: it must, function has, end align, end equation, line in, power spectrum, hep th, def half, ref newcommand, mu nu
Topic 6: cal imath, hep th, theta cal, begin equation, cal ast, theta stackrel, imath theta, cal theta, gamma theta, stackrel circ
Topic 7: here the, vspace 5mm, ricci flat, cr bar, bibitem metr, varphi theta, hep th, cmss kern, and corresponding, z_1 z_1
Topic 8: hermitean metric, wave equation, end equation, obtains begin, _2 hat, dimensional and, the approach, which requires, mu nu, be label
T

Yikes. LDA didn't do much better. These topics are not especially informative either...
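
One possible reason for the weak result: LDA is formulated over word counts, whereas `X` holds TF-IDF weights, and the model above ran for only 5 passes with the default 10 components. A hedged sketch of the more conventional setup follows, using `CountVectorizer` on a toy corpus. The toy documents, `n_components=2`, and `max_iter=20` are placeholder assumptions, not tuned values, and there is no guarantee this yields better topics on the real data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real corpus.
toy_corpus = [
    "black hole entropy and the horizon of a black hole",
    "the black hole horizon and hawking radiation",
    "open string field theory and the closed string",
    "closed string tachyon in string field theory",
]

# Raw bigram counts instead of TF-IDF weights.
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = count_vectorizer.fit_transform(toy_corpus)

lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is one document's topic mixture
```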