# arXiv heatmap
## Summary
### Goal
Analyze the connections between arXiv categories.

**Ultimate goal.** Create a forecasting algorithm for arXiv categories.

### Final output
A sequence of graphs, indexed by date.
- The nodes are the arXiv categories: the size of a node is the number of papers in the category, and its color represents the trend.
- The edges are the cross-listings: the thickness of an edge is the number of papers shared by the two categories, and its color represents the trend.

### Data representation
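A correlation-like matrix: the off-diagonal entries count the papers shared by two categories, and the diagonal entries count the non-shared papers (those listed in a single category).  A separate array stores the total number of papers per category: the matrix entries do not sum to these totals, because cross-listed papers are counted more than once.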
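
To make the layout concrete, here is a minimal sketch of how such a matrix and the totals array could be built for one date.  The function name `cross_listing_matrix` and the four toy papers are assumptions made up for illustration, not part of the project code.

```python
from itertools import combinations

import numpy as np


def cross_listing_matrix(papers, categories):
    """Build the correlation-like matrix and the totals array.

    papers: list of category lists, one per paper (toy input format).
    categories: ordered list of category tags; fixes the row/column order.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    matrix = np.zeros((len(categories), len(categories)), dtype=int)
    totals = np.zeros(len(categories), dtype=int)
    for cats in papers:
        ids = sorted({index[c] for c in cats})
        for i in ids:
            totals[i] += 1              # every listing counts toward the total
        if len(ids) == 1:
            matrix[ids[0], ids[0]] += 1  # non-shared paper: diagonal entry
        for i, j in combinations(ids, 2):
            matrix[i, j] += 1            # shared paper: symmetric off-diagonal
            matrix[j, i] += 1
    return matrix, totals


categories = ['cs.CG', 'hep-th', 'math.CO']
papers = [
    ['math.CO'],
    ['math.CO', 'cs.CG'],
    ['hep-th'],
    ['math.CO', 'cs.CG', 'hep-th'],
]
matrix, totals = cross_listing_matrix(papers, categories)
print(matrix)
print(totals)
```

In this toy example the `math.CO` row sums to 4 while its total is 3, because the paper listed in three categories contributes to two off-diagonal entries of that row.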

## Implementation
### Dataset
The arXiv dataset is downloadable from [Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv/data).  The file is 4.62 GB uncompressed (`json`).  Keeping only `id`, `update_date`, and `categories` reduces the data to a manageable 24.4 MB `parquet` file.
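
Since the raw file is large, one option is to read it in chunks and keep only the three columns as we go.  The sketch below is an assumption about how this could be done, not what the playground cells do: the helper name `strip_snapshot` and the chunk size are made up.  It also forces `id` to be read as a string, since an identifier like `0704.0001` could otherwise be parsed as a number in a chunk that contains only new-style identifiers.

```python
import pandas as pd

KEEP = ['id', 'update_date', 'categories']


def strip_snapshot(path, chunksize=100_000):
    """Read a JSON-lines file chunk by chunk, keeping only the KEEP columns."""
    reader = pd.read_json(
        path,
        lines=True,            # one JSON object per line
        chunksize=chunksize,   # returns an iterator of DataFrames
        dtype={'id': str},     # keep identifiers such as '0704.0001' as text
    )
    chunks = [chunk[KEEP] for chunk in reader]
    return pd.concat(chunks, ignore_index=True)
```

A call such as `strip_snapshot("arxiv-metadata-oai-snapshot.json")` would then return the stripped table, which can be written out with `.to_parquet(...)` as in the playground.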

The `categories` entry is a single string, with the categories separated by white space.  A list of all categories can be found in `data/arxiv-categories.json` (might want to convert it into a Python list).

#### Storage
Use `parquet` for storage for the moment.  Might use Delta Lake for day-to-day operation (especially looking forward at the forecasting part).

### Data processing
Read through the stripped table, creating a list of correlation matrices and an array of totals.

### Data analysis
I still have to figure this out.  It might become clearer after the data science boot camp.

### Data visualization
This also needs to be figured out.  For the moment, the `igraph` library might be useful, together with `matplotlib`.  A future implementation might improve the visualization, but I need to learn more first.  I also need to learn more for the eventual website production (LATER).

# Playground

## Cleaning data

In [1]:
import pandas as pd

In [None]:
df = pd.read_json("arxiv-metadata-oai-snapshot.json", lines=True)

In [None]:
df.sample(1)

In [None]:
noabstract_df = df.drop(columns=['abstract'])

In [None]:
noabstract_df.to_parquet('data/arxiv-metadata-noabstract.parquet')

In [None]:
# Keep only the columns we need; .copy() avoids SettingWithCopyWarning on later assignments
strip_df = noabstract_df[['id', 'update_date', 'categories']].copy()

In [None]:
strip_df.loc[1]

In [None]:
strip_df.to_json('data/arxiv-metadata-id-date-categories.json')

In [None]:
strip_df.to_parquet('data/arxiv-metadata-id-date-categories.parquet')

## Creating the correlation tables, indexed by date

First we convert the `update_date` column to `datetime`.

In [None]:
strip_df['update_date'] = pd.to_datetime(strip_df['update_date'])

We also need to collect the labels for the arXiv categories.

In [None]:
categories_db = pd.read_json('data/arxiv-categories.json')
categories_db

Now we transform the column of categories (`str`) into a column of lists containing the categories.

In [None]:
strip_df['categories'] = strip_df['categories'].str.split()

In [None]:
strip_df

In [None]:
strip_df.head()

In [None]:
strip_df.to_parquet('data/arxiv-metadata-id-date-categories.parquet')

In [None]:
strip_df['categories']

We import the stripped data as `arxiv_metadata`.

In [None]:
arxiv_metadata = pd.read_parquet('data/arxiv-metadata-id-date-categories.parquet')

Finally, we traverse `arxiv_metadata` and produce a new dataframe (possibly indexed by `update_date`) with entries given by the "correlation" matrices.  We also create another dataframe containing the total publications.

We need to do something like this
```
create new dataframe
for each row in strip_df:
    if there is no row with that date in new dataframe:
        create new row
    add to the date row according to the categories present in the row
```

**Note.** This is probably not the most efficient way of traversing the dataframe.  I need to understand `.groupby()` better.  I also need to understand how to index by `datetime`.
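
As a possible alternative to the row-by-row loop, the totals per date can be obtained with `.explode()` and `.groupby()`, and the result is then indexed by `datetime`.  This is only a sketch on three toy rows; the names `demo` and `totals` are made up here.

```python
import pandas as pd

# Toy stand-in for the stripped table (categories already split into lists).
demo = pd.DataFrame({
    'id': ['0704.0004', '0704.0050', '0704.0002'],
    'update_date': pd.to_datetime(['2007-05-23', '2007-05-23', '2008-12-13']),
    'categories': [['math.CO'], ['cs.NE', 'cs.AI'], ['math.CO', 'cs.CG']],
})

totals = (
    demo.explode('categories')                  # one row per (paper, category)
        .groupby(['update_date', 'categories'])
        .size()                                 # papers per date and category
        .unstack(fill_value=0)                  # dates as index, categories as columns
)

# The index is a DatetimeIndex, so a row can be looked up by timestamp.
row = totals.loc[pd.Timestamp('2007-05-23')]
print(row)
```

Each row of `totals` plays the role of the "array of totals" for that date.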

We create a list of all edges in the arXiv graph.

In [None]:
from itertools import combinations
import json

with open('data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat['tag'] for cat in arxiv_categories_descriptions]
arxiv_categories_combinations = list(combinations(arxiv_categories, 2))

In [51]:
graph_edges = {index: 0 for index in arxiv_categories_combinations}
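
One detail to watch when filling `graph_edges`: `combinations` produces each pair in the order of the master list, so the pairs coming from a paper have to be put in the same order before they are used as keys.  A minimal sketch, assuming that tags missing from the master list should simply be skipped (the function name `count_cross_listings` and the toy data are made up):

```python
from itertools import combinations


def count_cross_listings(papers, categories):
    """Count shared papers for every pair of categories.

    Keys follow the order of `categories`, like combinations(categories, 2).
    """
    position = {cat: i for i, cat in enumerate(categories)}
    edges = {pair: 0 for pair in combinations(categories, 2)}
    for cats in papers:
        # Assumption: ignore tags not in the master list, then sort by list position.
        known = sorted((c for c in set(cats) if c in position), key=position.get)
        for pair in combinations(known, 2):
            edges[pair] += 1
    return edges


toy_categories = ['cs.AI', 'cs.NE', 'math.CO']
toy_papers = [['cs.NE', 'cs.AI'], ['math.CO'], ['math.CO', 'cs.AI']]
toy_edges = count_cross_listings(toy_papers, toy_categories)
print(toy_edges)
```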

We want a generator `arxiv_snapshot`, where each `yield` is a `dict` containing the date and the cross-listings.

We group the entries of `arxiv_metadata` by date and iterate through the groups.  For now, the function below only prints the groups found in the first 100 rows, to check that the grouping works.

In [None]:
import copy

def arxiv_snapshot():
    for date, group in arxiv_metadata.head(100).groupby('update_date'):
        print(date)
        print(group)
        print()

arxiv_snapshot()

2007-05-23 00:00:00
           id update_date                                         categories
3   0704.0004  2007-05-23                                          [math.CO]
9   0704.0010  2007-05-23                                          [math.CO]
11  0704.0012  2007-05-23                                          [math.NT]
17  0704.0018  2007-05-23                                           [hep-th]
33  0704.0034  2007-05-23                     [q-bio.PE, q-bio.CB, quant-ph]
36  0704.0037  2007-05-23                  [physics.optics, physics.comp-ph]
48  0704.0049  2007-05-23                                          [math.CO]
49  0704.0050  2007-05-23                                     [cs.NE, cs.AI]
51  0704.0052  2007-05-23                                           [hep-th]
60  0704.0061  2007-05-23                                          [math.FA]
65  0704.0066  2007-05-23                                           [hep-th]
70  0704.0071  2007-05-23                               
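
The function above only prints each group.  A generator version could look like the following sketch, where each `yield` is a `dict` with the date and the cross-listing counts.  The name `snapshots`, the dictionary keys, and the toy dataframe are assumptions for illustration; pairs are sorted alphabetically here rather than by the master list.

```python
from itertools import combinations

import pandas as pd


def snapshots(metadata):
    """Yield one dict per update_date with the cross-listing counts."""
    for date, group in metadata.groupby('update_date'):
        cross = {}
        for cats in group['categories']:
            # Every unordered pair of categories on a paper is one cross-listing.
            for pair in combinations(sorted(set(cats)), 2):
                cross[pair] = cross.get(pair, 0) + 1
        yield {'date': date, 'cross_listings': cross}


# Toy stand-in for arxiv_metadata.
toy_metadata = pd.DataFrame({
    'id': ['0704.0004', '0704.0050', '0704.0002'],
    'update_date': pd.to_datetime(['2007-05-23', '2007-05-23', '2008-12-13']),
    'categories': [['math.CO'], ['cs.NE', 'cs.AI'], ['math.CO', 'cs.CG']],
})

toy_snapshots = list(snapshots(toy_metadata))
for snap in toy_snapshots:
    print(snap['date'], snap['cross_listings'])
```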

In [None]:
for index, row in arxiv_metadata.head().iterrows():
    print(row.update_date)
    for category in row.categories:
        print(category)
    print()

2008-11-26 00:00:00
hep-ph

2008-12-13 00:00:00
math.CO
cs.CG

2008-01-13 00:00:00
physics.gen-ph

2007-05-23 00:00:00
math.CO

2013-10-15 00:00:00
math.CA
math.FA

