# arXiv heatmap
## Summary
### Goal
Analyze the connections between arXiv categories.

**Ultimate goal.** Create a forecasting algorithm for arXiv categories.

### Final output
A sequence of graphs, indexed by date.
- The nodes of each graph are the arXiv categories: the size of a node is the number of papers in the category, and its color represents the trend.
- The edges are the cross-listings: the thickness of an edge is the number of papers shared by the two categories, and its color represents the trend.

### Data representation
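A correlation-like matrix: the off-diagonal entries count the papers shared by two categories, the diagonal entries count the papers that are not shared.  A separate array stores the total number of papers per category (the matrix entries do not add up to these totals, because cross-listed papers are counted more than once).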
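
As a small illustration of this representation, the sketch below builds the matrix and the totals array for three made-up papers.  The helper name `build_matrix` and the toy category lists are assumptions for the example, not part of the project code.

```python
from itertools import combinations

def build_matrix(papers, categories):
    """Return (matrix, totals) for a list of per-paper category lists."""
    position = {cat: i for i, cat in enumerate(categories)}
    n = len(categories)
    matrix = [[0] * n for _ in range(n)]
    totals = [0] * n
    for paper in papers:
        for cat in paper:
            totals[position[cat]] += 1  # every listing counts toward the category total
        if len(paper) == 1:
            i = position[paper[0]]
            matrix[i][i] += 1  # diagonal: paper not shared with another category
        else:
            for a, b in combinations(paper, 2):
                i, j = position[a], position[b]
                matrix[i][j] += 1  # off-diagonal: shared paper, kept symmetric
                matrix[j][i] += 1
    return matrix, totals

# Hypothetical input: three papers over three categories.
categories = ['cs.CG', 'hep-ph', 'math.CO']
papers = [['hep-ph'], ['math.CO', 'cs.CG'], ['math.CO']]
matrix, totals = build_matrix(papers, categories)
```

Here the three papers produce four category listings, which is why the totals are kept in their own array.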

## Implementation
### Dataset
The arXiv dataset is downloadable from [Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv/data).  The uncompressed `json` file is 4.62 GB; keeping only the `id`, `update_date`, and `categories` fields reduces the data to a manageable 24.4 MB `parquet` file.

The `categories` entry is a single string that lists the categories separated by whitespace; it is probably worth converting it into a list.  A list of all categories can be found in `data/arxiv-categories.json`.
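
The filtering step could also be done without loading the whole file into memory.  The sketch below assumes the snapshot is in JSON Lines format (one record per line, as the `lines=True` read in the playground suggests); the function name `strip_records` is made up for illustration, and the two sample records are abbreviated stand-ins for real entries.

```python
import io
import json

KEEP = ('id', 'update_date', 'categories')

def strip_records(lines):
    """Yield one reduced dict per JSON line, with categories split into a list."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        reduced = {key: record[key] for key in KEEP}
        reduced['categories'] = reduced['categories'].split()
        yield reduced

# Stand-in for the real file: two hand-written records with an extra field.
sample = io.StringIO(
    '{"id": "0704.0001", "update_date": "2008-11-26", "categories": "hep-ph", "abstract": "..."}\n'
    '{"id": "0704.0002", "update_date": "2008-12-13", "categories": "math.CO cs.CG", "abstract": "..."}\n'
)
records = list(strip_records(sample))
```

With the real dataset, an open file handle would be passed in place of the `StringIO` stand-in.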

#### Storage
Use `parquet` for storage for the moment.  Might use Delta Lake for day-to-day operation (especially looking forward at the forecasting part).

### Data processing
Read through the stripped table, creating a list of correlation matrices and an array of totals.

### Data analysis
I still have to figure this out.  It might become clearer after the data science boot camp.

### Data visualization
Also this needs to be figured out.  The `igraph` library might be useful, together with `matplotlib`, for the moment.  In a future implementation, might improve visualization, but need to learn more.  Also need to learn more for eventual website production (LATER).

# Playground

## Cleaning data

In [3]:
import pandas as pd

In [None]:
df = pd.read_json("arxiv-metadata-oai-snapshot.json", lines=True)

In [None]:
df.sample(1)

In [None]:
noabstract_df = df.drop(columns=['abstract'])

In [None]:
noabstract_df.to_parquet('data/arxiv-metadata-noabstract.parquet')

In [None]:
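# Copy the selection so that later column assignments do not trigger SettingWithCopyWarning.
strip_df = noabstract_df[['id', 'update_date', 'categories']].copy()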

In [None]:
strip_df.loc[1]

In [None]:
strip_df.to_json('data/arxiv-metadata-id-date-categories.json')

In [None]:
strip_df.to_parquet('data/arxiv-metadata-id-date-categories.parquet')

## Creating the correlation tables, indexed by date

First we convert the `update_date` column to `datetime`.

In [None]:
strip_df['update_date'] = pd.to_datetime(strip_df['update_date'])

We also need to collect the labels for the arXiv categories.

In [4]:
categories_db = pd.read_json('../data/arxiv-categories.json')
categories_db

| | tag | name |
|---|---|---|
| 0 | cs.AI | Artificial Intelligence |
| 1 | cs.AR | Hardware Architecture |
| 2 | cs.CC | Computational Complexity |
| 3 | cs.CE | Computational Engineering, Finance, and Science |
| 4 | cs.CG | Computational Geometry |
| ... | ... | ... |
| 150 | stat.CO | Computation |
| 151 | stat.ME | Methodology |
| 152 | stat.ML | Machine Learning |
| 153 | stat.OT | Other Statistics |


Now we transform the column of categories (`str`) into a column of lists containing the categories.

In [None]:
strip_df['categories'] = strip_df['categories'].str.split()

In [None]:
strip_df

In [None]:
strip_df.head()

In [None]:
strip_df.to_parquet('data/arxiv-metadata-id-date-categories.parquet')

In [None]:
strip_df['categories']

We import the stripped data as `arxiv_metadata`.

In [6]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-id-date-categories.parquet')

In [145]:
arxiv_metadata

| | id | update_date | categories |
|---|---|---|---|
| 0 | 0704.0001 | 2008-11-26 | [hep-ph] |
| 1 | 0704.0002 | 2008-12-13 | [math.CO, cs.CG] |
| 2 | 0704.0003 | 2008-01-13 | [physics.gen-ph] |
| 3 | 0704.0004 | 2007-05-23 | [math.CO] |
| 4 | 0704.0005 | 2013-10-15 | [math.CA, math.FA] |
| ... | ... | ... | ... |
| 2710801 | supr-con/9608008 | 2009-10-30 | [supr-con, cond-mat.supr-con] |
| 2710802 | supr-con/9609001 | 2016-11-18 | [supr-con, cond-mat.supr-con] |
| 2710803 | supr-con/9609002 | 2009-10-30 | [supr-con, cond-mat.supr-con] |
| 2710804 | supr-con/9609003 | 2009-10-30 | [supr-con, cond-mat.supr-con] |


Finally, we traverse `arxiv_metadata` and produce a new dataframe (possibly indexed by `update_date`) with entries given by the "correlation" matrices.  We also create another dataframe containing the total publications.

We need to do something like this:
```
create new dataframe
for each row in strip_df:
    if there is no row with that date in new dataframe:
        create new row
    add to the date row according to the categories present in the row
```

**Note.** This is probably not the most efficient way of traversing the dataframe.  I need to understand `.groupby()` better.  I also need to understand how to index by `datetime`.
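
As a possible starting point for both questions, the sketch below counts papers per date and category with `.explode()` and `.groupby()` on a toy table, then selects rows through the resulting datetime index.  The three rows are made up for illustration; only standard pandas calls are used, but whether this scales well to the full table is untested here.

```python
import pandas as pd

# Toy stand-in for the stripped table (hypothetical rows, not real arXiv data).
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'update_date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02']),
    'categories': [['math.CO'], ['math.CO', 'cs.CG'], ['hep-ph']],
})

# One row per (paper, category), then count listings per date and category.
totals = (
    toy.explode('categories')
       .groupby(['update_date', 'categories'])
       .size()
       .unstack(fill_value=0)
)

# The rows are indexed by a DatetimeIndex, so a partial date string selects a period.
january = totals.loc['2020-01']
```

The `totals` table has one row per date and one column per category, which matches the separate array of totals described in the summary.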

We create a list of all possible edges in the arXiv graph, including the self-pairs that correspond to the diagonal entries.

In [143]:
from itertools import combinations_with_replacement
import json

with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat['tag'] for cat in arxiv_categories_descriptions]
arxiv_categories_combinations = list(combinations_with_replacement(arxiv_categories, 2))

In [144]:
graph_edges = {tuple(sorted(index)): 0 for index in arxiv_categories_combinations}

The arXiv categories have changed over the years.  Below we find all categories that appear in the data but are not in the current list.

In [146]:
# Every tag that appears in the data but not in the current category list.
missing_categories = set(arxiv_metadata['categories'].explode()) - set(arxiv_categories)

In [147]:
missing_categories

{'acc-phys',
 'adap-org',
 'alg-geom',
 'ao-sci',
 'astro-ph',
 'atom-ph',
 'bayes-an',
 'chao-dyn',
 'chem-ph',
 'cmp-lg',
 'comp-gas',
 'cond-mat',
 'dg-ga',
 'funct-an',
 'mtrl-th',
 'patt-sol',
 'plasm-ph',
 'q-alg',
 'q-bio',
 'solv-int',
 'supr-con'}
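
Since the edge dictionary is keyed by pairs of current categories only, any paper carrying one of these legacy tags has no key to update.  One option, sketched here with toy stand-ins for `arxiv_categories` and `missing_categories`, is to add the legacy tags to the label set before building the dictionary; mapping each legacy tag to its modern successor would be an alternative, but that mapping is not part of the data at hand.

```python
from itertools import combinations_with_replacement

# Toy stand-ins for `arxiv_categories` and `missing_categories`.
current = ['cs.CG', 'math.CO']
legacy = {'alg-geom', 'supr-con'}

# Sorting the combined labels makes every generated pair already ordered,
# so no per-key sorting is needed afterwards.
all_categories = sorted(set(current) | legacy)
edges = {pair: 0 for pair in combinations_with_replacement(all_categories, 2)}
```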

We create a new generator `arxiv_snapshot`: each `yield` is a `dict` containing the date and the cross-listings.

We group the entries of `arxiv_metadata` by date and iterate through the groups to populate `arxiv_snapshot`.

In [137]:
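from collections import Counter
from itertools import combinations

def arxiv_snapshot():
    """Yield one dict per update date: the date plus that date's edge counts."""
    for date, group in arxiv_metadata.groupby('update_date'):
        # A fresh Counter per date accepts any category pair, including legacy
        # tags such as 'alg-geom' that are absent from graph_edges and raised
        # KeyError: ('alg-geom', 'alg-geom') in the first attempt.  It also
        # keeps each date's counts separate instead of accumulating them.
        edges = Counter()
        for categories in group['categories']:
            if len(categories) == 1:
                # Diagonal entry: paper listed in a single category.
                edges[(categories[0], categories[0])] += 1
            else:
                # Off-diagonal entries: one count per pair of cross-listed categories.
                for edge in combinations(sorted(categories), 2):
                    edges[edge] += 1
        yield {'date': date, **edges}

In [138]:
# Pairs that do not occur on a given date come out as NaN, so fill them with 0.
df = pd.DataFrame(arxiv_snapshot()).set_index('date').fillna(0)
df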