# arXiv heatmap

### Data crunching
##### Starting point
- the cleaned arXiv metadata: `data/arxiv-metadata-cleaned.parquet`
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
- a dataframe, indexed by `date`, of cross-listings
- a dataframe, indexed by `date`, of total listings per category

## The code

In [1]:
import pandas as pd
import json

We start by importing the cleaned metadata from `data/arxiv-metadata-cleaned.parquet` to `arxiv_metadata`.

In [2]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-cleaned.parquet')

We also create a list `graph_edges_keys` of all unordered pairs of `arxiv_categories`, repetitions allowed (the list includes the three extra categories `q-bio`, `cond-mat` and `astro-ph`). These pairs are the keys of a dictionary whose values count the day's papers in that cross-listing; a repeated pair `(c, c)` is the diagonal entry and counts the papers listed in category `c`. We also instantiate a multi-index `graph_edges_index` that indexes the edges.

In [3]:
from itertools import combinations_with_replacement as cwr

with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = sorted([cat['tag'] for cat in arxiv_categories_descriptions] + ['q-bio', 'cond-mat', 'astro-ph'])

arxiv_categories_combinations = cwr(arxiv_categories, 2)

# use sorted to make sure the tuples are in a consistent ordering
graph_edges_keys = [tuple(sorted(index)) for index in arxiv_categories_combinations]
graph_edges_index = pd.MultiIndex.from_tuples(graph_edges_keys)
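
To see what the keys look like, here is a minimal sketch on three categories. The names `toy_categories`, `toy_keys` and `toy_index` are illustrative only and are not used elsewhere in the notebook.

```python
from itertools import combinations_with_replacement as cwr

import pandas as pd

# Toy subset of categories (illustrative, not the full arXiv list)
toy_categories = sorted(['math.CO', 'cs.LG', 'stat.ML'])

# All unordered pairs, repetitions allowed: n * (n + 1) / 2 keys for n categories
toy_keys = [tuple(sorted(pair)) for pair in cwr(toy_categories, 2)]
toy_index = pd.MultiIndex.from_tuples(toy_keys)

print(toy_keys)
# [('cs.LG', 'cs.LG'), ('cs.LG', 'math.CO'), ('cs.LG', 'stat.ML'),
#  ('math.CO', 'math.CO'), ('math.CO', 'stat.ML'), ('stat.ML', 'stat.ML')]
```

With three categories this gives six keys: three diagonal pairs and three genuine cross-listings.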

### Dataframe of cross-listings
The goal of this section is to produce a new dataframe, indexed by `date`, whose rows are the cross-listings. We also want another dataframe containing the total daily publications in each category.

We begin by defining a function `take_snapshot` that takes the series of category lists for one day and returns a dictionary mapping each edge to its count for that day.

In [4]:
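def take_snapshot(group: pd.Series) -> dict:
    """Count, for one day's listings, how many papers fall on each edge."""
    # start every edge at zero so that all days share the same keys
    graph_edges = dict.fromkeys(graph_edges_keys, 0)
    for entry in group:
        # each pair of categories of a paper, including the pairs (c, c)
        for edge in cwr(entry, 2):
            graph_edges[tuple(sorted(edge))] += 1
    return graph_edges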
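
As a quick check of the counting logic, the sketch below runs the same loop on a toy day of three papers. The function `toy_take_snapshot` and the toy data are hypothetical stand-ins, defined here only so that the example is self-contained.

```python
from itertools import combinations_with_replacement as cwr

import pandas as pd

toy_categories = ['cs.LG', 'math.CO', 'stat.ML']  # already sorted
toy_keys = list(cwr(toy_categories, 2))

def toy_take_snapshot(group: pd.Series) -> dict:
    """Same logic as take_snapshot, but on the toy keys."""
    edges = dict.fromkeys(toy_keys, 0)
    for entry in group:
        for edge in cwr(entry, 2):
            edges[tuple(sorted(edge))] += 1
    return edges

# One toy day: three papers with their category lists
day = pd.Series([['cs.LG', 'stat.ML'], ['cs.LG'], ['math.CO', 'cs.LG']])
snapshot = toy_take_snapshot(day)

print(snapshot[('cs.LG', 'cs.LG')])    # 3
print(snapshot[('cs.LG', 'stat.ML')])  # 1
print(snapshot[('math.CO', 'stat.ML')])  # 0
```

Note that in this toy run the diagonal entry `('cs.LG', 'cs.LG')` counts all three papers listed in `cs.LG`, not only the single-category one, because `cwr` pairs every category of a paper with itself.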

Now we create a dataframe `arxiv_snapshots` containing the daily snapshots of arXiv cross-listings. It is obtained by grouping `arxiv_metadata` by `date` and aggregating each group with `take_snapshot`.

At this stage `arxiv_snapshots` has a single column `categories`, whose entries are dicts representing the graph for that day.

In [5]:
arxiv_snapshots = arxiv_metadata.drop(columns=['id']).groupby('date').agg({'categories': take_snapshot})
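
The group-and-aggregate pattern can be tried on a tiny frame first. In this sketch `toy_metadata` and `count_papers` are hypothetical: the aggregator is deliberately simpler than `take_snapshot`, but like it, it returns one dict per day.

```python
import pandas as pd

# Hypothetical stand-in for arxiv_metadata
toy_metadata = pd.DataFrame({
    'id': ['a1', 'a2', 'a3'],
    'date': ['2025-04-08', '2025-04-08', '2025-04-09'],
    'categories': [['cs.LG', 'stat.ML'], ['cs.LG'], ['math.CO']],
})

def count_papers(group: pd.Series) -> dict:
    """Hypothetical aggregator: one dict per day, like take_snapshot."""
    return {
        'papers': len(group),
        'listings': sum(len(entry) for entry in group),
    }

toy_snapshots = (
    toy_metadata.drop(columns=['id'])
    .groupby('date')
    .agg({'categories': count_papers})
)
print(toy_snapshots)
```

Each row of `toy_snapshots` holds a single dict in the `categories` column, which is the shape that the next step expands into separate columns.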

Here is what `arxiv_snapshots` looks like.

In [6]:
arxiv_snapshots

date,categories
1986-04-25,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
1988-11-11,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
1989-04-15,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
1989-10-26,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
1989-11-09,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
...,...
2025-04-06,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
2025-04-07,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
2025-04-08,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."
2025-04-09,"{('astro-ph', 'astro-ph'): 0, ('astro-ph', 'as..."


Then we expand the `categories` column into one column per edge.

In [7]:
arxiv_snapshots = pd.DataFrame(
    arxiv_snapshots['categories'].tolist(), 
    columns=graph_edges_index, index=arxiv_snapshots.index)
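
The same expansion can be checked on two toy rows. All names below (`toy_keys`, `toy_index`, `dict_column`, `toy_wide`) are illustrative; the sketch only assumes that every dict uses the edge tuples as keys, as `take_snapshot` guarantees.

```python
from itertools import combinations_with_replacement as cwr

import pandas as pd

toy_keys = list(cwr(['cs.LG', 'stat.ML'], 2))
toy_index = pd.MultiIndex.from_tuples(toy_keys)

# A column of per-day dicts, keyed by edge tuples (toy data)
dict_column = pd.Series(
    [
        {('cs.LG', 'cs.LG'): 2, ('cs.LG', 'stat.ML'): 1, ('stat.ML', 'stat.ML'): 1},
        {('cs.LG', 'cs.LG'): 0, ('cs.LG', 'stat.ML'): 0, ('stat.ML', 'stat.ML'): 4},
    ],
    index=pd.Index(['2025-04-08', '2025-04-09'], name='date'),
    name='categories',
)

# One column per edge, with the two categories as the two column levels
toy_wide = pd.DataFrame(
    dict_column.tolist(), columns=toy_index, index=dict_column.index
)
print(toy_wide)
```

A single edge is then available as `toy_wide[('cs.LG', 'stat.ML')]`, and `toy_wide['cs.LG']` selects every edge whose first category is `cs.LG`.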

Here is what `arxiv_snapshots` looks like now.

In [8]:
arxiv_snapshots

,astro-ph,astro-ph,astro-ph,astro-ph,astro-ph,astro-ph,astro-ph,astro-ph,astro-ph,astro-ph,...,stat.ME,stat.ME,stat.ME,stat.ME,stat.ML,stat.ML,stat.ML,stat.OT,stat.OT,stat.TH
,astro-ph,astro-ph.CO,astro-ph.EP,astro-ph.GA,astro-ph.HE,astro-ph.IM,astro-ph.SR,cond-mat,cond-mat.dis-nn,cond-mat.mes-hall,...,stat.ME,stat.ML,stat.OT,stat.TH,stat.ML,stat.OT,stat.TH,stat.OT,stat.TH,stat.TH
date,,,,,,,,,,,,,,,,,,,,,
1986-04-25,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1988-11-11,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1989-04-15,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1989-10-26,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1989-11-09,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-04-06,0,0,0,0,0,0,0,0,0,0,...,13,2,0,1,12,0,1,0,0,3
2025-04-07,0,0,0,0,0,0,0,0,0,0,...,19,1,0,2,15,0,3,1,0,9
2025-04-08,0,0,0,0,0,0,0,0,0,0,...,10,1,0,0,12,0,3,0,0,6
2025-04-09,0,0,0,0,0,0,0,0,0,0,...,14,0,1,1,5,0,0,2,0,4


Now we save the snapshots to `data/arxiv-snapshots.parquet`.

In [9]:
arxiv_snapshots.to_parquet('../data/arxiv-snapshots.parquet')

### Dataframe of totals
The goal now is to produce another dataframe, indexed by `date`, containing the daily totals for each category.

We begin by defining a function `take_totals` that takes the series of category lists for one day and returns a dictionary containing the totals per category.

In [10]:
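from collections import Counter
from collections.abc import Iterable
from itertools import chain

def take_totals(group: Iterable[Iterable[str]]) -> Counter:
    """Take an iterable of category lists and count how often each category occurs."""
    # flatten the lists of categories, then count each category
    return Counter(chain.from_iterable(group))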
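
For illustration, this is what the flatten-and-count step gives on a toy day of three papers (the data is made up):

```python
from collections import Counter
from itertools import chain

# Toy day: the category lists of three papers
day = [['cs.LG', 'stat.ML'], ['cs.LG'], ['math.CO', 'cs.LG']]

# chain.from_iterable flattens the lists; Counter tallies each category
totals = Counter(chain.from_iterable(day))

print(totals['cs.LG'])    # 3
print(totals['stat.ML'])  # 1
print(totals['hep-th'])   # 0, a Counter returns 0 for missing keys
```

Categories with no listing that day are simply absent from the result, which is why `NaN` values appear once the dictionaries are expanded into columns.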

We now create a dataframe `arxiv_totals`, with dict entries in `categories` representing the daily totals, by grouping `arxiv_metadata` by `date` and aggregating with `take_totals`.

In [11]:
arxiv_totals = arxiv_metadata.drop(columns=['id']).groupby('date').agg({'categories': take_totals})

Here is what `arxiv_totals` looks like.

In [12]:
arxiv_totals

date,categories
1986-04-25,"{'hep-th': 1, 'physics.pop-ph': 1}"
1988-11-11,{'hep-th': 1}
1989-04-15,{'math.LO': 1}
1989-10-26,"{'math.FA': 3, 'math.MG': 3}"
1989-11-09,"{'math.FA': 1, 'math.MG': 1}"
...,...
2025-04-06,"{'physics.bio-ph': 1, 'astro-ph.HE': 11, 'phys..."
2025-04-07,"{'astro-ph.CO': 10, 'cs.LG': 127, 'q-bio.QM': ..."
2025-04-08,"{'cs.AI': 102, 'cs.CL': 56, 'astro-ph.CO': 17,..."
2025-04-09,"{'cs.AI': 73, 'quant-ph': 55, 'physics.chem-ph..."


Next, we reset the index of `arxiv_totals` so that `date` becomes a column.

In [13]:
arxiv_totals.reset_index(inplace=True)

Then we pop the `categories` column, expand its dictionaries into one column per category, and join the result back onto `arxiv_totals`.

In [14]:
arxiv_totals = arxiv_totals.join(pd.DataFrame(arxiv_totals.pop('categories').tolist()))

If a category has no posting on a given day, the expansion leaves `NaN` in that cell; we replace these with 0.

In [15]:
arxiv_totals.fillna(value=0, inplace=True)

Finally, we set the index of `arxiv_totals` back to `date`.

In [16]:
arxiv_totals.set_index('date', inplace=True)
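
The four steps above (reset the index, expand the dictionaries, fill the gaps, restore the index) can be traced on a toy frame. Everything named `toy_*` is hypothetical, and the final cast to integers is an optional extra that the notebook itself does not perform.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for arxiv_totals right after the groupby step
toy_totals = pd.DataFrame(
    {'categories': [Counter({'hep-th': 1}), Counter({'math.FA': 3, 'hep-th': 2})]},
    index=pd.Index(['1988-11-11', '1989-10-26'], name='date'),
)

toy_totals = toy_totals.reset_index()

# One column per category; categories missing on a day become NaN
expanded = pd.DataFrame(toy_totals.pop('categories').tolist())

toy_totals = toy_totals.join(expanded).fillna(0).set_index('date')

# Optional (an assumption, not in the notebook): columns that held NaN
# are floating-point after fillna, so cast back to integer counts
toy_totals_int = toy_totals.astype(int)
print(toy_totals_int)
```

The join works because `reset_index` gives `toy_totals` a default integer index, which matches the index of the freshly built `expanded` frame row by row.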

Now `arxiv_totals` looks like this.

In [17]:
arxiv_totals

date,hep-th,physics.pop-ph,math.LO,math.FA,math.MG,cs.CC,math.CO,math.PR,math.DS,cs.GR,...,econ.EM,stat.CO,stat.OT,q-fin.EC,eess.SY,econ.GN,eess.AS,eess.IV,eess.SP,q-fin.MF
1986-04-25,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1988-11-11,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1989-04-15,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1989-10-26,0.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1989-11-09,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-04-06,5.0,0.0,4.0,3.0,2.0,2.0,11.0,4.0,4.0,4.0,...,1.0,2.0,0.0,1.0,16.0,1.0,3.0,5.0,8.0,0.0
2025-04-07,23.0,1.0,5.0,16.0,6.0,8.0,18.0,18.0,11.0,10.0,...,2.0,2.0,1.0,3.0,29.0,3.0,5.0,10.0,25.0,0.0
2025-04-08,34.0,0.0,1.0,6.0,3.0,3.0,18.0,17.0,12.0,6.0,...,2.0,1.0,0.0,3.0,35.0,3.0,6.0,14.0,13.0,2.0
2025-04-09,21.0,1.0,4.0,11.0,2.0,4.0,18.0,15.0,14.0,4.0,...,1.0,2.0,2.0,2.0,23.0,2.0,4.0,7.0,11.0,3.0


We save the totals to `data/arxiv-totals.parquet`.

In [18]:
arxiv_totals.to_parquet('../data/arxiv-totals.parquet')