In [9]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [27]:
filename_list = os.listdir()[0:300]

In [11]:
# Just making sure these file names are correct:
filename_list[0:5]

# Looks good!

['0301116', '0304232', '0303017', '0303225', '0302131']
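One caveat: `os.listdir()` returns entries in arbitrary order and includes subdirectories, so the first 300 entries can differ between machines. Here is a minimal sketch of a more reproducible selection; the helper name `list_document_files` and the temporary demo directory are made up for illustration and are not part of the original notebook.

```python
import os
import tempfile


def list_document_files(directory=".", limit=300):
    """Return up to `limit` regular files from `directory`, sorted by name."""
    names = sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    return names[:limit]


# Demo on a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0303017", "0301116", "0304232"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder")
    os.mkdir(os.path.join(tmp, "figures"))  # a subdirectory, skipped by the helper
    demo_files = list_document_files(tmp, limit=2)

print(demo_files)
```

Sorting makes the slice deterministic, and the `isfile` check keeps folders such as `.ipynb_checkpoints` out of the list.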

Now we can read in all the files from the `filename_list`.

In [28]:
corpus = []

for filename in filename_list:
    # errors='ignore' skips any bytes that cannot be decoded
    with open(filename, 'r', errors='ignore') as file:
        file_contents = file.read()

    # Add document to corpus
    corpus.append(file_contents)

In [30]:
len(corpus)

300

We have exactly 300 documents to work with. Now we can turn our corpus into a matrix of TF-IDF features using `sklearn`'s `TfidfVectorizer()`.

In [57]:
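# Note: `latex_stopwords` is defined in cell In [55] further down, which was run before this cell (In [57]).
vectorizer = TfidfVectorizer(decode_error='ignore', stop_words=latex_stopwords, max_df=0.6, max_features=10000)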

X = vectorizer.fit_transform(corpus)
X

<300x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 255164 stored elements in Compressed Sparse Row format>
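To make the effect of `max_df` concrete, here is a small sketch on a made-up three-document corpus (the documents are invented for illustration). With `max_df=0.6`, any term that appears in more than 60% of the documents is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): "brane" occurs in every document.
toy_corpus = [
    "brane bulk gravity",
    "brane inflation universe",
    "brane superpotential chiral",
]

# Terms with document frequency above 60% are ignored.
toy_vectorizer = TfidfVectorizer(max_df=0.6)
toy_X = toy_vectorizer.fit_transform(toy_corpus)

print(toy_X.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

Here "brane" is in all three documents, so it is filtered out and only the six remaining terms become columns.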

Let's take a look at the vocabulary that was learned by the vectorizer.

In [58]:
vectorizer.vocabulary_

{'cosmology': 2774,
 'particle': 6959,
 'department': 3121,
 'york': 9905,
 'ny': 6599,
 'date': 2976,
 'today': 9159,
 'maketitle': 5935,
 'thickness': 9102,
 'graviton': 4507,
 'kinetic': 5436,
 'coupled': 2794,
 'bulk': 2095,
 'develops': 3187,
 'solitonic': 8477,
 'vacuum': 9483,
 'expectation': 3865,
 'couplings': 2797,
 'kaluza': 5374,
 'klein': 5449,
 'modes': 6176,
 'localized': 5790,
 'matter': 6021,
 'suppressed': 8858,
 'giving': 4416,
 'rise': 7955,
 'distance': 3308,
 'r_c': 7567,
 '4d': 839,
 '5d': 942,
 'behavior': 1821,
 'viewed': 9587,
 'brane': 2046,
 'regularization': 7751,
 'dvali': 3457,
 'gabadadze': 4280,
 'porrati': 7199,
 'newpage': 6431,
 'worlds': 9778,
 'infinite': 5032,
 'extra': 3911,
 'recently': 7681,
 'provided': 7379,
 'insight': 5070,
 'problems': 7306,
 'high': 4710,
 'relations': 7767,
 'models': 6174,
 'emerge': 3608,
 'fundamental': 4241,
 'strings': 8679,
 'realized': 7660,
 'nature': 6380,
 'valuable': 9488,
 'testing': 9040,
 'ground': 4527,
 ...

In [55]:
latex_stopwords = list(vectorizer.vocabulary_)[0:59]

Let's retrain with the LaTeX stopwords collected from the initial vocabulary.

In [None]:
vectorizer = TfidfVectorizer(decode_error='ignore', stop_words=latex_stopwords, max_df=0.6, max_features = 10000)

X = vectorizer.fit_transform(corpus)
X

As we can see, a lot of the vocabulary learned by the vectorizer is LaTeX formatting from these papers (for example `maketitle` and `newpage`) rather than physics.
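As an alternative to slicing the vocabulary by position, one option is to collect the LaTeX command names directly from the raw text and use those as stopwords. This is only a sketch: the regular expression, the helper name `latex_command_counts`, and the sample documents are assumptions for illustration, not something from the original notebook.

```python
import re
from collections import Counter

# Matches a backslash followed by letters, e.g. \maketitle or \hat.
LATEX_COMMAND = re.compile(r"\\([A-Za-z]+)")


def latex_command_counts(documents):
    """Count how often each LaTeX command name occurs across the documents."""
    counts = Counter()
    for text in documents:
        counts.update(name.lower() for name in LATEX_COMMAND.findall(text))
    return counts


# Invented sample documents.
sample_docs = [
    r"\maketitle \newpage The brane couples to the bulk. \newpage",
    r"\begin{equation} \hat{x} = \vec{y} \end{equation}",
]

counts = latex_command_counts(sample_docs)
latex_stopwords_sketch = sorted(counts)
print(latex_stopwords_sketch)
```

The resulting list could be passed to `stop_words`. Two limitations to keep in mind: command names such as `\lambda` would also be removed, and tokens like `phi_` or `partial_` keep their trailing underscore after tokenization, so they would not match the bare command name.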

In [59]:
# Initialize NMF
nmf_model = NMF(n_components = 50, solver = 'mu')

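# Create variable to make it easy to retrieve topics
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions)
idx_to_word = np.array(vectorizer.get_feature_names())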

In [60]:
nmf_model.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=50, random_state=None, shuffle=False, solver='mu',
  tol=0.0001, verbose=0)

In [61]:
nmf_components = nmf_model.components_

In [62]:
for i, topic in enumerate(nmf_components):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: plane, ads, branes, supergravity, dilaton, background, boundary, wave, arxiv, citation
Topic 2: deformed, e_n, epsilon_, ee, fuzzy, quad, 2k, partial_, phi_, hat
Topic 3: cn, foot, kern, tr, arxiv, superpotential, eqalign, cr, lref, eqn
Topic 4: lambda_4, citation, universe, hat, m_5, cosmological, kappa_5, branes, bulk, brane
Topic 5: perp, gamma_1, tunneling, mid, cr, gamma_, cdot, sigma_, underline, vec
Topic 6: flux, irp, planes, phantom, n_a, hline, brane, orientifold, d6, branes
Topic 7: qquad, branes, horizon, dx, killing, metric, r_0, supergravity, ee, wedge
Topic 8: inflationary, cosmology, cosmic, born, radiation, infeld, density, universe, rho_, inflation
Topic 9: waldmann, sitter, mbw, multiplets, star, textbf, mathbb, emph, newblock, mathcal
Topic 10: gl, ja, w_, superpotential, star, rd, bea, eea, cr, ee
Topic 11: chiral, tr, la_a, su, superpotential, flavor, monopoles, 2n_f, n_c, n_f
Topic 12: cr, quad, epsilon_, topological, eta, omega_, cd, sp, abcd, ab
...

Sadly, these topics are not very informative. Quite a few of them consist almost entirely of symbols and Greek letters, and it is unclear why those would be split across several topics rather than grouped into one. It's possible that 50 components is too many, considering we've only read in 300 papers, and that we need to collapse the 50 NMF components into fewer (maybe, for example, 10).

Let's try that next.

In [63]:
nmf_model_10 = NMF(n_components = 10, solver = 'mu')
nmf_model_10.fit(X)
nmf_components_10 = nmf_model_10.components_
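Picking 10 components here is a guess. One rough way to compare candidate values is to look at the reconstruction error NMF reports after fitting. The sketch below uses a random non-negative matrix as a stand-in for `X` (an assumption, so the numbers mean nothing for the real corpus), and it uses the default solver with `init="nndsvda"` rather than the `'mu'` solver used above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for the TF-IDF matrix: 30 documents x 40 terms.
rng = np.random.RandomState(0)
toy_X = rng.rand(30, 40)

errors = {}
for k in (2, 5, 10):
    model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
    model.fit(toy_X)
    # Frobenius norm of (toy_X - W @ H) for the fitted factorization.
    errors[k] = model.reconstruction_err_
    print(k, round(errors[k], 3))
```

The error generally keeps shrinking as components are added, so it cannot choose the number on its own; it is more useful for spotting where extra components stop helping much.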

In [64]:
for i, topic in enumerate(nmf_components_10):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: background, f_, r_, wave, boundary, citation, array, prime, partial_, tau
Topic 2: ee, xi, sphere, 2k, phi_, quad, partial_, fuzzy, big, hat
Topic 3: w_, foot, kern, tr, arxiv, eqalign, superpotential, cr, lref, eqn
Topic 4: abs, org, kappa_5, href, hole, bulk, black, citation, arxiv, brane
Topic 5: gamma_1, tunneling, mid, cr, sigma_, cdot, gamma_, big, underline, vec
Topic 6: orbifold, su, twisted, supersymmetric, flux, d6, wall, orientifold, brane, branes
Topic 7: eea, f_, al, xi, bea, ij, wedge, big, ab, ee
Topic 8: ben, overline, density, een, tachyon, universe, rho_, varphi, inflation, dot
Topic 9: t_, star, symplectic, bundle, cp, textbf, newblock, emph, mathbb, mathcal
Topic 10: dag, superpotential, tr, prod_, nu_a, partition, ee, n_c, noncommutative, n_f


Reducing the number of components seemed to help a bit, but we still have multiple topics that are entirely LaTeX formatting specifications, and the physics-related topics still have some LaTeX artifacts.
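A small readability note on the printouts above: `argsort()` sorts in ascending order, so `topic.argsort()[-10:]` lists each topic's ten highest-weighted words with the strongest word last. Here is a sketch of a helper that puts the strongest word first; the function name `top_words` and the toy arrays are made up for illustration.

```python
import numpy as np


def top_words(components, vocab, n_top=10):
    """Return the n_top highest-weighted words per topic, strongest first."""
    topics = []
    for weights in components:
        # argsort is ascending, so reverse it before taking the top entries.
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics.append([str(vocab[i]) for i in top_idx])
    return topics


# Toy stand-ins for idx_to_word and a components_ matrix.
toy_vocab = np.array(["brane", "bulk", "hat", "vec", "inflation"])
toy_components = np.array([
    [0.9, 0.5, 0.0, 0.1, 0.2],
    [0.0, 0.1, 0.8, 0.7, 0.0],
])

toy_topics = top_words(toy_components, toy_vocab, n_top=3)
for i, words in enumerate(toy_topics):
    print("Topic {}: {}".format(i + 1, ", ".join(words)))
```

The same helper should work for any fitted model's `components_` array together with `idx_to_word`.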

Let's turn to LDA and see if it can do any better here.

In [65]:
lda_model = LatentDirichletAllocation(max_iter = 5,
                                learning_method = 'online',
                                learning_offset = 50.,
                                random_state = 0)

lda_model.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [66]:
for i, topic in enumerate(lda_model.components_):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-20:]]])))

Topic 1: bais, acla, l_s, kp, mathbf, dlt, l_, lmd, tau, defa, ui, brc, tht_2, der_2, acl, tht_1, eql, vph, brkt, x_2
Topic 2: 566, 0212096, 643, psi_s, cohen, 35pt, http, 223, 575, eqn, m_r, big, dot, zf, character, arxiv, footnotesize, d1, wick, hat
Topic 3: 608, hline, ze_d, imr, ee, families, fam, curve, disconnected, 0111, qr, rho_, bbs, l_3, 0truecm, cartan, treated, subset, suffices, brane
Topic 4: bone, monopoles, gaugino, ben, eqn, h_o, quantization, eta, lorentz, chiral, xi, dot, monopole, ee, eventually, susy, ij, breaking, overline, alice
Topic 5: mathcal, estimates, rg, wn, similarity, setup, line, schwarzschild, big, hi, cr, citation, gam, te, eqno, hline, qquad, vec, brane, hat
Topic 6: r_, varphi, qquad, su, quad, dot, eqn, cr, f_, tau, arxiv, mathcal, big, branes, partial_, vec, citation, brane, ee, hat
Topic 7: drastically, partitiondef, eac, 516, dot, located, qc, rindler, dec, motivations, hh, lj, amns, rb, dipolevaccond, numerical, phi_, sss, ze, vec
Topic 8: citat

Yikes. LDA didn't do much better. These topics are not especially informative either...