In [61]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Importing the arXiv Data

We're going to use a collection of arXiv physics papers for this task.

In [62]:
# We're already in the directory with the papers, so we can use os.listdir() to get the file names
filename_list = os.listdir()[0:500]

In [63]:
# Check that these file names are correct:
filename_list[0:5]

['0301116', '0304232', '0303017', '0303225', '0302131']
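One thing to keep in mind: `os.listdir()` returns entries in arbitrary order and includes everything in the directory, so the `[0:500]` slice picks an arbitrary 500 entries. If a reproducible selection matters, a sorted, files-only listing is one option. This is a minimal sketch; the helper name `list_paper_files` and the demo file names are made up for illustration.

```python
import os
import tempfile

def list_paper_files(directory="."):
    """Return the names of regular, non-hidden files in `directory`, sorted by name."""
    return sorted(
        entry.name for entry in os.scandir(directory)
        if entry.is_file() and not entry.name.startswith(".")
    )

# Demonstrate on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["0303017", "0301116", ".hidden"]:
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "subdir"))
    demo_files = list_paper_files(tmp)

print(demo_files)
```

In the notebook this would be used as `filename_list = list_paper_files()[0:500]`.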

Now we can read in all the files from the `filename_list`.

In [64]:
corpus = []

for filename in filename_list:
    # errors='ignore' skips bytes that can't be decoded instead of raising a UnicodeDecodeError
    with open(filename, 'r', errors='ignore') as file:
        # Add document to corpus
        corpus.append(file.read())

In [65]:
# Removing LaTeX and other formatting artifacts that will cause issues with NMF and LDA

from nltk.stem.wordnet import WordNetLemmatizer
import re
import gensim.parsing.preprocessing as genpre

lmtzr = WordNetLemmatizer()

def prep_text(text):
    # Removes LaTeX commands, inline math, parenthetical citations, URLs, punctuation,
    # bracketed text, digits, and one- or two-letter words
    myreg = r'\\[\w]+[\{| ]|\$[^\$]+\$|\(.+\, *\d{2,4}\w*\)|\S*\/\/\S*|[\\.,\/#!$%\^&\*;:{}=_`\'\"~()><\|]|\[.+\]|\d+|\b\w{1,2}\b'
    # Split hyphenated words by turning hyphens into spaces
    parsed_data = text.replace('-', ' ')
    parsed_data = re.sub(myreg, '', parsed_data)
    # Lowercase, drop gensim's stop words, then lemmatize what is left
    parsed_data = [lmtzr.lemmatize(w) for w in parsed_data.lower().split() if w not in genpre.STOPWORDS]
    return parsed_data

In [66]:
corpus = [prep_text(document) for document in corpus]

`prep_text` doesn't remove *everything*, but it leaves far fewer artifacts than the raw text has. We can also try to filter out some very common LaTeX phrases by passing them as stop words when we refit the `TfidfVectorizer`, and by setting `max_df` to exclude phrases that occur in more than 90% of documents.

See this [excellent blog post](https://medium.com/@omar.abdelbadie1/processing-text-for-topic-modeling-c355e907ab23) on why `prep_text` works to remove LaTeX artifacts. All credit goes to author Omar Abdelbadie for this method.
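To get a feel for what the pattern does, we can run just the regex step on a couple of lines. This is a sketch only: the sample strings are made up rather than taken from the corpus, and the hypothetical helper `strip_markup` skips the lemmatizing and stop-word steps so it needs nothing beyond `re`.

```python
import re

# Same pattern as in prep_text, copied verbatim
myreg = r'\\[\w]+[\{| ]|\$[^\$]+\$|\(.+\, *\d{2,4}\w*\)|\S*\/\/\S*|[\\.,\/#!$%\^&\*;:{}=_`\'\"~()><\|]|\[.+\]|\d+|\b\w{1,2}\b'

def strip_markup(text):
    """Regex step of prep_text only: no lemmatizing, no stop-word removal."""
    return re.sub(myreg, '', text.replace('-', ' ')).lower().split()

# Made-up sample lines, not taken from the corpus
latex_line = r"\section{Black holes} The mass $M$ grows."
print(strip_markup(latex_line))

bracket_line = "see [1] for black holes and [2] for strings"
print(strip_markup(bracket_line))
```

The second line is a useful caution: `\[.+\]` is greedy, so it removes everything from the first `[` to the last `]` on a line, and "black holes" disappears along with the two reference markers. The parenthetical-citation part of the pattern is greedy in the same way, so some real text is likely lost during cleaning.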

Note that `prep_text` turns every entry in `corpus` into a list of strings rather than one big string. That is a problem when we build the feature matrix, because `TfidfVectorizer` expects one string per document, not a list of lists. We'll use `join` (a string method) to turn each entry back into a single string.

In [67]:
corpus = [' '.join(tokens) for tokens in corpus]

In [68]:
len(corpus)

500

# Creating a Feature Matrix

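We have exactly 500 documents to work with. Now we can turn our corpus into a matrix of term frequency-inverse document frequency (TF-IDF) features using `sklearn`'s `TfidfVectorizer()`. With `ngram_range=(2, 2)` the features are two-word phrases (bigrams) rather than single words, and `max_features=20000` keeps only the 20,000 most frequent of them.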
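As a reminder of what these features are, here is the weighting `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), as we understand the scikit-learn documentation. The symbols are introduced here for illustration: $\mathrm{tf}(t, d)$ is the number of times phrase $t$ occurs in document $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $n$ is the number of documents (500 here).

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\left(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\right)
$$

Each document's row is then scaled to unit Euclidean length, so a phrase scores highly when it is frequent in one paper but rare across the collection.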

In [69]:
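# Bigrams only; drop phrases found in more than 90% of documents; keep the 20,000 most frequent.
# decode_error only applies to bytes input, so it has no effect on our already-decoded strings.
vectorizer = TfidfVectorizer(decode_error='ignore', max_df=0.9, ngram_range=(2, 2), max_features=20000)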

X = vectorizer.fit_transform(corpus)
X

<500x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 203169 stored elements in Compressed Sparse Row format>

Let's take a look at the vocabulary that was learned by the vectorizer.

In [70]:
# Cast the vocab dict to a list so we can print just a subset of the dict

first20_vocab = {k: vectorizer.vocabulary_[k] for k in list(vectorizer.vocabulary_)[:20]}
first20_vocab

{'latex file': 9941,
 'beqequation eeqequation': 1152,
 'beqaeqnarray eeqaeqnarray': 1150,
 'footnote thefootnotefootnote': 7334,
 'document draft': 4858,
 'particle physic': 13017,
 'physic department': 13265,
 'department physic': 4202,
 'new york': 11861,
 'university new': 19151,
 'today maketitle': 18727,
 'maketitle abstract': 10736,
 'abstract consider': 45,
 'finite thickness': 7109,
 'kinetic term': 9729,
 'bulk scalar': 1843,
 'vacuum expectation': 19264,
 'expectation value': 6366,
 'value coupling': 19316,
 'kaluza klein': 9651}

And the phrases the vectorizer dropped, which it stores in `stop_words_`:

In [71]:
# Similar to above, use itertools to avoid printing the entire (massive) set to screen
import itertools

for i, val in enumerate(itertools.islice(vectorizer.stop_words_, 20)):
    print(i, val)

0 word exactly
1 non lymext
2 entropy like
3 baye phys
4 main difficulty
5 covariant quantized
6 covariant admits
7 give displayed
8 dltkphi unlike
9 remember non
10 specifying exact
11 order green
12 hole predicts
13 lqcmartin bojowaldisotropic
14 start description
15 let separate
16 bigvevbigllanglebigrrangle bigcomm
17 seiberg gauged
18 chosen phase
19 superfield read


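A number of LaTeX and nonsense terms, as well as some physics terms, ended up in this set. Note that `stop_words_` holds every phrase the vectorizer ignored, whether it appeared in more than 90% of documents (`max_df`) or fell outside the 20,000 most frequent bigrams (`max_features`). Most of the entries shown here are very likely rare bigrams of the second kind, not phrases common to 90% of papers. Losing some physics phrases this way is an acceptable cost for a smaller, cleaner vocabulary.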

The top of the vocabulary showed some additional phrases that are not informative to us. We'll try to remove this low-hanging fruit by refitting the vectorizer and passing these phrases as a list of stop words.

(Slicing the dictionary this way is a shortcut. On Python 3.7 and later, the keys of `vocabulary_` come out in the order the phrases were first encountered, not in order of frequency, so the first 15 keys are mostly preamble boilerplate from the first paper read, plus a few real physics phrases. A hand-picked list would be more robust.)
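If we wanted to choose stop-phrases by how widespread they are rather than by dictionary position, one option is to rank phrases by document frequency. This is a minimal sketch on a made-up toy corpus; `toy_corpus`, `toy_vectorizer`, and `ranked` are names introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned corpus (made-up documents)
toy_corpus = [
    "latex file black hole",
    "latex file string theory",
    "latex file black hole entropy",
]

toy_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
toy_X = toy_vectorizer.fit_transform(toy_corpus)

# Column j of toy_X belongs to the phrase whose vocabulary_ value is j
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# Document frequency: number of documents in which each phrase has a stored weight
doc_freq = toy_X.getnnz(axis=0)

# Most widespread phrases first
ranked = sorted(zip(terms, doc_freq.tolist()), key=lambda pair: -pair[1])
print(ranked[:3])
```

On the real matrix, the same ranking applied to `X` and `vectorizer` would surface boilerplate such as citation fragments, which we could then review by hand.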

In [72]:
additional_stopwords = list(vectorizer.vocabulary_)[0:15]

In [73]:
vectorizer = TfidfVectorizer(decode_error='ignore', max_df=0.9, ngram_range=(2, 2), 
                             max_features = 20000, stop_words=additional_stopwords)

X = vectorizer.fit_transform(corpus)
X

<500x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 203169 stored elements in Compressed Sparse Row format>
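The refit matrix reports the same 203169 stored elements as the first one, which suggests the stop-phrases changed nothing. As far as we can tell, that is because `TfidfVectorizer` checks `stop_words` against individual tokens before it builds n-grams, so a two-word phrase never matches. Here is a small sketch on made-up documents (the helper `bigram_vocab` is hypothetical):

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["latex file black hole", "latex file string theory"]

def bigram_vocab(stop_words):
    """Fit a bigram TF-IDF vectorizer and return its vocabulary, sorted alphabetically."""
    with warnings.catch_warnings():
        # scikit-learn may warn that multi-word stop words don't match its tokenization
        warnings.simplefilter("ignore")
        vec = TfidfVectorizer(ngram_range=(2, 2), stop_words=stop_words)
        vec.fit(toy_corpus)
    return sorted(vec.vocabulary_)

phrase_stop = bigram_vocab(["latex file"])     # bigram passed as a single stop "word"
token_stop = bigram_vocab(["latex", "file"])   # its individual tokens instead

print(phrase_stop)  # 'latex file' is still in the vocabulary
print(token_stop)   # only bigrams built from the remaining tokens
```

Splitting the phrases into single tokens does filter them, but it also removes those words everywhere and lets their neighbors form new bigrams, so the token list should be chosen with care.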

Now we'll do some topic modeling.

# Topic Modeling

In [74]:
# Initialize NMF
nmf_model = NMF(n_components = 10, solver = 'mu')

# Create variable to make it easy to retrieve topics
# (on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead)
idx_to_word = np.array(vectorizer.get_feature_names())

In [75]:
nmf_model.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=10, random_state=None, shuffle=False, solver='mu',
  tol=0.0001, verbose=0)

In [76]:
nmf_components = nmf_model.components_

In [77]:
for i, topic in enumerate(nmf_components):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: def def, rev citation, jhep citation, arxivhep citation, gauge theory, lett citation, nucl phys, phys citation, hep citation, citation hep
Topic 2: defrelax defem, lie group, lie algebra, moody algebra, kac moody, defrelax defrelax, defssbchar defssbchar, em em, defem defem, def def
Topic 3: einstein equation, momentum tensor, energy momentum, scale factor, cosmic string, extra dimension, energy density, cosmological constant, phys rev, scalar field
Topic 4: sitter space, reissner nordstr, quasinormal mode, near horizon, quasinormal frequency, hole entropy, phys rev, quantum gravity, schwarzschild black, black hole
Topic 5: universal solution, non bps, rolling tachyon, string theory, tachyon condensation, field theory, boundary state, open string, string field, closed string
Topic 6: bmn operator, zero mode, maximally supersymmetric, gamma gamma, type iib, cone gauge, penrose limit, light cone, wave background, plane wave
Topic 7: riemann surface, witten curve, effective super

Exciting! A few of these topics are made up of LaTeX markup and citation boilerplate (like 1, 2, 8, and 10), but the others clearly correspond to particular areas of physics.
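When reading these lists, note that `argsort()[-10:]` returns the ten heaviest phrases in ascending order, so the strongest phrase for each topic is printed last. A toy illustration, where the arrays are hypothetical stand-ins for `idx_to_word` and one row of `nmf_model.components_`:

```python
import numpy as np

# Hypothetical stand-ins for idx_to_word and a single topic's weights
toy_words = np.array(["black hole", "string theory", "phys rev", "gauge theory"])
toy_topic = np.array([0.9, 0.1, 0.4, 0.7])

ascending = toy_words[toy_topic.argsort()[-3:]]         # weakest of the top three first
descending = toy_words[toy_topic.argsort()[::-1][:3]]   # strongest first

print(list(ascending))
print(list(descending))
```

Reversing the order with `[::-1]` is purely cosmetic, but it puts the most characteristic phrase at the front of each topic.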

Let's try out some LDA as well.

In [78]:
lda_model = LatentDirichletAllocation(max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda_model.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [79]:
for i, topic in enumerate(lda_model.components_):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))

Topic 1: bethe ansatz, field theory, left right, trivial galois, math phys, form factor, lambda lambda, right left, galois current, om om
Topic 2: schwarzschild black, left right, theequationsectionequationlyter eqnarray, alice string, brick wall, equation lyter, phys rev, eqnarray lyter, def def, black hole
Topic 3: hep citation, black hole, gauge theory, left right, equation equation, nucl phys, phys rev, field theory, citation hep, def def
Topic 4: displaymath displaymath, decorsize label, reference thebibliography, phin phin, eigenvalue problem, hilbert space, function invariant, inner product, psin phin, wh wh
Topic 5: scattering amplitude, cosmological horizon, internal space, black hole, left right, eq eeq, phys rev, sitter space, equation equation, photon photon
Topic 6: spacelike infinity, gauge theory, quantum grav, little group, gauge transformation, citation hep, van proeyen, theta theta, physic publishing, preprint hep
Topic 7: alpha alpha, random force, wilson line, gauge

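Sadly, the topics generated by LDA are not very interesting or distinct. With these settings, NMF gives the more useful topics for this dataset. Two caveats keep this from being a like-for-like comparison: LDA models word counts, so it is usually fit on `CountVectorizer` output rather than on TF-IDF weights, and `max_iter=5` gives it only a few passes over the data.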
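For a fairer test of LDA, one could fit it on raw bigram counts instead. The following is a minimal sketch under stated assumptions: the toy corpus is made up, and `count_vectorizer`, `toy_lda`, and the parameter values are illustrative choices, not settings tuned for the arXiv papers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up documents standing in for the cleaned corpus
toy_corpus = [
    "black hole entropy black hole horizon",
    "schwarzschild black hole horizon entropy",
    "open string field theory tachyon condensation",
    "closed string field theory boundary state",
]

# LDA is a model of counts, so use CountVectorizer rather than TfidfVectorizer
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = count_vectorizer.fit_transform(toy_corpus)

toy_lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row is a distribution over topics
```

On the real corpus we would keep `max_df` and `max_features` as before and raise `max_iter` until the topics stop changing between runs.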