NLP analysis of all math articles from arXiv.
The xml.gz files live in `~/.metha` and have been loaded into Postgres.

344111 articles.

Identifiers issued through March 2007 use a different format from the current one. Open question: has this already been reconciled in the pulled data?
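
For reference, a minimal sketch of telling the two arXiv identifier schemes apart with regular expressions. The helper name `id_scheme` and both patterns are assumptions made for illustration, not code from this project. The sketch also assumes the harvested identifiers may carry an `oai:arXiv.org:` prefix, which it strips.

```python
import re

# Old scheme (through March 2007): archive[.SUBJECT]/YYMMNNN, e.g. math.GT/0309136
OLD_ID = re.compile(r'^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$')
# New scheme (from April 2007): YYMM.NNNN, later extended to YYMM.NNNNN
NEW_ID = re.compile(r'^\d{4}\.\d{4,5}(v\d+)?$')

def id_scheme(identifier):
    """Return 'old', 'new', or 'unknown' (hypothetical helper)."""
    prefix = 'oai:arXiv.org:'
    if identifier.startswith(prefix):
        identifier = identifier[len(prefix):]
    if OLD_ID.match(identifier):
        return 'old'
    if NEW_ID.match(identifier):
        return 'new'
    return 'unknown'

for example in ['math.GT/0309136', 'oai:arXiv.org:0704.0001', '1501.00001v2']:
    print(example, id_scheme(example))
```

Running something like this over the `identifier` column would show whether both schemes are present in the pulled data.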

- Vectorize each document with tf-idf
- Visualize with SVD or t-SNE
- Cluster with k-means or agglomerative clustering
- Summarize the text within each cluster
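
The first three steps of this plan can be sketched end to end on a toy corpus. The four documents below are invented stand-ins for abstracts, and the parameter choices (`n_components=2`, `n_clusters=2`, `n_init=10`) are assumptions for illustration only. The summarization step is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy stand-ins for abstracts (not data from the arXiv dump)
docs = [
    "compact lie groups and their representations",
    "representations of semisimple lie algebras",
    "stochastic differential equations driven by brownian motion",
    "brownian motion and stochastic integrals",
]

# Step 1: tf-idf gives a sparse (n_docs, n_terms) matrix
X = TfidfVectorizer().fit_transform(docs)

# Step 2: truncated SVD works directly on sparse input; 2 components for plotting
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Step 3: k-means on the tf-idf vectors, one label per document
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(X.shape, coords.shape, labels)
```

`TruncatedSVD` is used here rather than PCA because it accepts the sparse tf-idf matrix without densifying it, which matters at the scale of the full corpus.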

## Clustering
- Pulling data from the db is painful. Load the data into a pd.DataFrame for prototyping.
- Try 2-grams and 3-grams for vectorization.
- Use t-SNE (or SVD with 2 or 3 components) to visualize.
- Choosing the right K is hard. Try K = (# of categories) * M, where 20 < M < 50, i.e. M subjects per category.
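
To get a feel for the K values this heuristic implies, here is a small arithmetic sketch. The helper name `k_range` is hypothetical. The category counts 33 (math only) and 163 (all categories in the database) are the ones reported in this notebook.

```python
def k_range(n_categories, m_low=20, m_high=50):
    """Smallest and largest K = n_categories * M for integer M with m_low < M < m_high."""
    return n_categories * (m_low + 1), n_categories * (m_high - 1)

print(k_range(33))    # math categories only -> (693, 1617)
print(k_range(163))   # all categories       -> (3423, 7987)
```

Each k-means iteration costs roughly in proportion to K, so values in the thousands would be far more expensive than the K = 150 runs timed in this notebook.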

In [50]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib
%matplotlib inline

In [2]:
import pickle
with open('../metha-all-math.pkl','rb') as f:
    df = pickle.load(f)

In [24]:
# work with a smaller set
(df['submitted'] >= '2017-01-01').sum() # => 27278
(df['submitted'] >= '2016-01-01').sum() # => 63639
(df['submitted'] >= '2015-01-01').sum() # => 98358
(df['submitted'] >= '2014-01-01').sum() # => 130453

27278

In [26]:
df.shape, df.columns, df_small.shape

((344111, 6),
 Index(['abstract', 'authors', 'categories', 'identifier', 'submitted',
        'title'],
       dtype='object'),
 (27278, 6))

In [27]:
len(df['categories'].unique()), len(df_small['categories'].unique())

(19335, 3547)

143

In [25]:
df_17 = df[df['submitted'] >= '2017-01-01']
abstracts_17 = df_17['abstract']
tfidf_vect = TfidfVectorizer()
abs_vect_17 = tfidf_vect.fit_transform(abstracts_17)

In [3]:
df_16 = df[df['submitted'] >= '2016-01-01']
abstracts_16 = df_16['abstract']
tfidf_vect = TfidfVectorizer()
abs_vect_16 = tfidf_vect.fit_transform(abstracts_16)

In [48]:
tfidf_vect = TfidfVectorizer()
abstracts = df['abstract']
abs_vect = tfidf_vect.fit_transform(abstracts)

In [4]:
abs_vect_16.shape
# 157615 words in corpus; 69408 for df_16; 46187 for df_17

(63639, 69408)

## Are clusters compatible with categories?
- There are 163 categories in total: 33 math and 130 non-math.

### kmeans runtime
- (dataset, `k_per_category`, `n_categories`, `n_init`, `n_jobs`) -> time
- `submitted >= 2017-01-01`, 1, 143, 1, _  ->  3m50s
- full, 1, 150, 1, _ -> 1h14m
- `df_16`, 1, 150, 1, _ ->  7m

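*Note*: setting `n_jobs` greater than 1 makes no difference when `n_init` is 1. (The `n_jobs` parameter was removed from `KMeans` in scikit-learn 1.0, so drop it on current versions.)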

In [5]:
k_per_category = 1 # see if categories correspond to clusters
n_categories = 150 # 163 for full df; 143 for df_17
%time kmeans = KMeans(n_clusters=k_per_category * n_categories,\
                      n_init=1,\
                     n_jobs=1).fit(abs_vect_16)

CPU times: user 7min 1s, sys: 3.57 s, total: 7min 4s
Wall time: 7min 3s


In [6]:
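import os
import twilio.rest
# Credentials and phone numbers come from environment variables
# (the variable names here are placeholders) instead of being hard-coded.
tw_cli = twilio.rest.Client(os.environ['TWILIO_ACCOUNT_SID'],
                            os.environ['TWILIO_AUTH_TOKEN'])
tw_cli.messages.create(to=os.environ['TWILIO_TO_NUMBER'],
                       from_=os.environ['TWILIO_FROM_NUMBER'],
                       body='kmeans finished running')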

In [13]:
abs_clusters = kmeans.predict(abs_vect_16)

In [47]:
abs_clusters[:5]

array([ 92,  84, 114,  51,  50], dtype=int32)

In [93]:
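# df_16 is a slice of df; assign() returns a new frame and avoids SettingWithCopyWarning
df_16 = df_16.assign(cluster=abs_clusters)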

In [104]:
# flatten the per-article category lists into a set: 147 unique categories
categories_16 = {c for cats in df_16['categories'] for c in cats}
# one counter per cluster: cluster id -> {category: count}
cluster_dict = {i: {k: 0 for k in categories_16} for i in range(150)}

In [105]:
# select columns by name instead of relying on their position in the row
for categories, cluster in zip(df_16['categories'], df_16['cluster']):
    d = cluster_dict[int(cluster)]
    for c in categories:
        d[c] += 1

In [106]:
max_categories = [ max(cluster_dict[i], key=cluster_dict[i].get)
                  for i in range(150) ]
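
The same tally can be sketched with `collections.Counter`, which also makes it easy to see what share of a cluster its dominant category covers. The rows below are invented toy data, and the names `rows`, `counts`, and `sizes` are illustrative, not part of the notebook.

```python
from collections import Counter, defaultdict

# Toy data: (categories of an article, cluster id) pairs, invented for illustration
rows = [
    (['math.DS', 'math-ph'], 0),
    (['math.DS'], 0),
    (['cs.SY', 'math.OC'], 0),
    (['math.AG'], 1),
    (['math.AG', 'math.NT'], 1),
]

counts = defaultdict(Counter)   # cluster id -> Counter of category labels
sizes = Counter()               # cluster id -> number of articles
for categories, cluster in rows:
    counts[cluster].update(categories)
    sizes[cluster] += 1

for cluster in sorted(counts):
    top, n = counts[cluster].most_common(1)[0]
    share = n / sizes[cluster]  # fraction of the cluster's articles tagged with `top`
    print(cluster, top, round(share, 2))
```

A low share for the dominant category would suggest that the cluster cuts across categories, which is one rough way to approach the question in this section's heading.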

In [110]:
for i, c in enumerate(max_categories):
    if c == 'math.DS':
        print(i)
# DS clusters are 35 and 126

35
126


In [111]:
cluster_dict[35]

{'astro-ph.CO': 1,
 'astro-ph.EP': 3,
 'astro-ph.GA': 0,
 'astro-ph.HE': 0,
 'astro-ph.IM': 1,
 'astro-ph.SR': 0,
 'cond-mat.dis-nn': 3,
 'cond-mat.mes-hall': 3,
 'cond-mat.mtrl-sci': 1,
 'cond-mat.other': 1,
 'cond-mat.quant-gas': 1,
 'cond-mat.soft': 2,
 'cond-mat.stat-mech': 25,
 'cond-mat.str-el': 3,
 'cond-mat.supr-con': 0,
 'cs.AI': 5,
 'cs.AR': 0,
 'cs.CC': 0,
 'cs.CE': 1,
 'cs.CG': 1,
 'cs.CL': 0,
 'cs.CR': 0,
 'cs.CV': 0,
 'cs.CY': 0,
 'cs.DB': 0,
 'cs.DC': 2,
 'cs.DL': 0,
 'cs.DM': 2,
 'cs.DS': 1,
 'cs.ET': 2,
 'cs.FL': 2,
 'cs.GR': 0,
 'cs.GT': 1,
 'cs.HC': 0,
 'cs.IR': 0,
 'cs.IT': 39,
 'cs.LG': 5,
 'cs.LO': 1,
 'cs.MA': 2,
 'cs.MM': 0,
 'cs.MS': 1,
 'cs.NA': 3,
 'cs.NE': 1,
 'cs.NI': 2,
 'cs.OH': 0,
 'cs.OS': 0,
 'cs.PF': 0,
 'cs.PL': 0,
 'cs.RO': 8,
 'cs.SC': 2,
 'cs.SD': 0,
 'cs.SE': 0,
 'cs.SI': 0,
 'cs.SY': 80,
 'eess.SP': 0,
 'gr-qc': 5,
 'hep-lat': 1,
 'hep-ph': 3,
 'hep-th': 14,
 'math-ph': 103,
 'math.AC': 3,
 'math.AG': 9,
 'math.AP': 65,
 'math.AT': 2,
 'math.CA'

In [117]:
df_16[df_16['cluster'] == 35]['title']

3350      Planar S-systems: Global stability and the cen...
3366      Interacting and noninteracting integrable systems
3481                         A Core Theory of Delay Systems
3557             Compositionality of the Runge-Kutta Method
3582      A homogenization theorem for Langevin systems ...
3610      Differential-functional dynamical systems, the...
3927      Where and When Orbits of Chaotic Systems Prefe...
3999                  How do nonholonomic integrators work?
5557      A Characterization of Integral ISS for Switche...
5665      Output Average Consensus Over Heterogeneous Mu...
5668      Systems of four coupled one sided Sylvester-ty...
5715      The center problem for the Lotka reactions wit...
5967      Robustness Analysis of Systems' Safety through...
5987      Robust Stability of Optimization-based State E...
6108      Dynamics and evolution of planets in mean-moti...
6128                                 Approximation Dynamics
6130      Lax orthogonal factorisations 

## Interacting with Postgres

In [27]:
import sys
# make `app` and `models` importable from the parent directory
if '..' not in sys.path:
    sys.path.append('..')
from app import db
from models import Article, Author, Category, article_author, article_category

In [4]:
query = db.session().query(Article)

In [30]:
query.filter(
    (Article.submitted >= '2016-01-01') & (Article.submitted <= '2016-12-31')
).count()

36361

In [33]:
query.filter(Article.submitted >= '2017-01-01').count()

27278

In [34]:
# Example query with a join; columns of an association Table are accessed through `.c`
db.session.query(Article.id, Category.name) \
.filter(Article.id == 1) \
.filter(Article.id == article_category.c.article_id) \
.filter(Category.id == article_category.c.category_id) \
.all()

[(1, 'cs.SY'), (1, 'cs.AI'), (1, 'math.OC')]

In [37]:
# number of distinct category combinations = 13627
def get_categories(i):
    return [c.name for c in db.session.query(Article.id, Category.name) \
    .filter(Article.id == i) \
    .filter(Article.id == article_category.c.article_id) \
    .filter(Category.id == article_category.c.category_id) \
    .all()]

In [44]:
cs = [get_categories(i+1) for i in range(query.count())]
# one query per article (assumes ids run 1..N), so this takes a while!
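
A single join over the association table would avoid issuing one query per article. The sketch below illustrates the idea with an in-memory SQLite database. The table and column names are assumptions modelled on the `article_category` usage above, not the project's actual Postgres schema, and the rows are toy data.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_category (article_id INTEGER, category_id INTEGER);
    INSERT INTO category VALUES (1, 'cs.SY'), (2, 'cs.AI'), (3, 'math.OC'), (4, 'math.DS');
    INSERT INTO article_category VALUES (1, 1), (1, 2), (1, 3), (2, 4);
""")

# One join returns every (article, category name) pair at once
pairs = conn.execute("""
    SELECT ac.article_id, c.name
    FROM article_category AS ac
    JOIN category AS c ON c.id = ac.category_id
    ORDER BY ac.article_id, c.id
""").fetchall()

# Group the pairs in Python: article id -> list of category names
categories_by_article = defaultdict(list)
for article_id, name in pairs:
    categories_by_article[article_id].append(name)

print(dict(categories_by_article))
```

The grouping then happens in one pass in Python, and it does not depend on article ids being contiguous.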

In [50]:
len(set( [tuple(x) for x in cs] )) # list is not hashable but tuple is

13627
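
One thing to keep in mind with the tuple conversion: tuples preserve order, so the same categories listed in a different order count as distinct combinations. A minimal sketch with invented data (`cs_toy` is a hypothetical stand-in for `cs`):

```python
cs_toy = [['math.AG', 'math.NT'], ['math.NT', 'math.AG'], ['math.DS']]

as_tuples = {tuple(x) for x in cs_toy}    # order-sensitive
as_sets = {frozenset(x) for x in cs_toy}  # order-insensitive

print(len(as_tuples), len(as_sets))  # 3 2
```

Whether order should matter is a judgment call: if the first category is treated as the primary one, the order-sensitive count is the meaningful one.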

In [53]:
# 163 categories
db.session.query(Category).count()

163

In [57]:
# 33 math categories
db.session.query(Category.name).filter(Category.name.like('math%')).count()

33

In [58]:
# nearly all articles have a math category: 344110 of 344111
db.session.query(article_category.c.article_id)\
.filter(Category.id == article_category.c.category_id)\
.filter(Category.name.like('math%'))\
.distinct().count()

344110