# arXiv heatmap

### Data crunching
##### Starting point
- the cleaned arXiv metadata: `data/arxiv-metadata-cleaned.parquet`
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
A `pandas` dataframe:
- indexed by `update_date`
- whose columns are the pairwise combinations of categories

## The code

In [1]:
import pandas as pd
import json

We start by importing the cleaned metadata from `data/arxiv-metadata-cleaned.parquet` to `arxiv_metadata`.

In [2]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-cleaned.parquet')

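We also recreate the graph edges: every unordered pair of categories, self-pairs included. The extra categories `q-bio`, `cond-mat` and `astro-ph` are added by hand on top of the tags read from the JSON file.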

In [3]:
from itertools import combinations_with_replacement as cwr

with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = sorted([cat['tag'] for cat in arxiv_categories_descriptions] + ['q-bio', 'cond-mat', 'astro-ph'])

arxiv_categories_combinations = cwr(arxiv_categories, 2)

# use sorted to make sure the tuples are in a consistent ordering
graph_edges_keys = [tuple(sorted(index)) for index in arxiv_categories_combinations]
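As a quick illustration of what the edge keys look like, here is a minimal sketch on a hypothetical list of three tags (`toy_categories` is an assumption for illustration, not the real category list). With `n` categories, `combinations_with_replacement` yields `n * (n + 1) / 2` pairs, including the self-pairs.

```python
from itertools import combinations_with_replacement as cwr

# Hypothetical three-tag subset, already sorted like arxiv_categories.
toy_categories = sorted(["stat.ML", "cs.LG", "math.PR"])

# Same construction as graph_edges_keys, on the toy list.
toy_edge_keys = [tuple(sorted(pair)) for pair in cwr(toy_categories, 2)]

n = len(toy_categories)
print(len(toy_edge_keys), n * (n + 1) // 2)
print(toy_edge_keys[0], toy_edge_keys[-1])
```

Because the input list is sorted, `cwr` already emits each pair in sorted order; the extra `sorted` call only guards against an unsorted input.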

### Dataframe of cross-listings
The goal of this section is to produce a new dataframe, indexed by `update_date`, whose rows hold the daily cross-listing counts. We also want another dataframe containing the total daily publications in each category.

We begin by defining a function `take_snapshot` that takes the `categories` column of the listings for one day (one list of categories per paper) and returns a dictionary mapping each graph edge to its number of cross-listings.

In [4]:
def take_snapshot(group: pd.Series) -> dict:
    graph_edges = dict.fromkeys(graph_edges_keys, 0)
    for entry in group:
        for edge in cwr(entry, 2):
            graph_edges[tuple(sorted(edge))] += 1
    return graph_edges
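To make the counting concrete, the following sketch applies the same logic to a hypothetical day with two papers. The names `toy_categories`, `toy_edge_keys` and `toy_snapshot` are assumptions for illustration; the function body mirrors `take_snapshot`.

```python
from itertools import combinations_with_replacement as cwr

# Hypothetical three-tag universe standing in for arxiv_categories.
toy_categories = ["cs.LG", "math.PR", "stat.ML"]
toy_edge_keys = list(cwr(toy_categories, 2))

def toy_snapshot(group):
    # Start every edge at zero, then count each pair found in each paper.
    edges = dict.fromkeys(toy_edge_keys, 0)
    for entry in group:
        for edge in cwr(entry, 2):
            edges[tuple(sorted(edge))] += 1
    return edges

# One day with two papers: one cross-listed, one in a single category.
one_day = [["cs.LG", "stat.ML"], ["cs.LG"]]
snapshot = toy_snapshot(one_day)
print(snapshot[("cs.LG", "cs.LG")], snapshot[("cs.LG", "stat.ML")])
```

Note that the self-pairs such as `("cs.LG", "cs.LG")` count every paper listed in that category, so the diagonal of the graph already carries per-category counts.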

Now we create a dataframe `arxiv_snapshots` containing the daily snapshots of arXiv cross-listings. The new dataframe is obtained by grouping `arxiv_metadata` by `update_date` and aggregating each group via the `take_snapshot` function.

We start by creating `arxiv_snapshots` with a single `categories` column whose entries are dicts, each representing the cross-listing graph for one day.

In [5]:
arxiv_snapshots = arxiv_metadata.drop(columns=['id']).groupby('update_date').agg({'categories': take_snapshot})

Next we reset the index of `arxiv_snapshots` so that `update_date` becomes a column.

In [6]:
arxiv_snapshots.reset_index(inplace=True)

Then we pop the `categories` column, expand each dict into one column per edge, and join the result back to `arxiv_snapshots`.

In [7]:
arxiv_snapshots = arxiv_snapshots.join(pd.DataFrame(arxiv_snapshots.pop('categories').tolist()))

Finally, we re-index `arxiv_snapshots` to `update_date`.

In [8]:
arxiv_snapshots.set_index('update_date', inplace=True)
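The reset, pop, join and re-index steps can be followed on a small hand-built frame. This is only a sketch: the dates and counts are made up, and the dict keys are plain strings such as `"a|b"` for readability, whereas the notebook uses tuples of category tags.

```python
import pandas as pd

# Hypothetical frame in the state reached after reset_index():
# one row per date, one dict of edge counts per row.
toy = pd.DataFrame({
    "update_date": ["2020-01-01", "2020-01-02"],
    "categories": [{"a|a": 1, "a|b": 2}, {"a|a": 0, "a|b": 3}],
})

# pop() removes the dict column in place; tolist() gives a list of dicts,
# which the DataFrame constructor expands into one column per key.
expanded = toy.join(pd.DataFrame(toy.pop("categories").tolist()))

# Restore update_date as the index.
expanded = expanded.set_index("update_date")
print(expanded)
```

The join aligns on the default integer index, which is why the index is reset before the dict column is expanded and only set back to `update_date` afterwards.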

Now we save the snapshots to `data/arxiv-snapshots.parquet`.

In [9]:
arxiv_snapshots.to_parquet('../data/arxiv-snapshots.parquet')



**Note:** the first date in `arxiv_snapshots` seems to contain all entries before May 23, 2007. We should drop it when we do the analysis.
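One possible way to drop that first row is sketched below on a made-up frame. It assumes the index holds datetimes and that the row to discard is the earliest date; the column name `count` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical snapshots: the first date lumps together all older entries.
toy_snapshots = pd.DataFrame(
    {"count": [1000, 3, 5]},
    index=pd.to_datetime(["2007-05-23", "2007-05-24", "2007-05-25"]),
)

# Sort by date so the earliest row comes first, then skip it.
trimmed = toy_snapshots.sort_index().iloc[1:]
print(trimmed)
```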

### Dataframe of totals
The goal now is to produce another dataframe, indexed by `update_date`, containing the daily totals for each category.

We begin by defining a function `take_totals` that takes the `categories` column of the listings for one day and returns a dictionary mapping each category to its number of papers.

In [11]:
def take_totals(group: pd.Series) -> dict:
    # one counter per category, initialised to zero
    daily_totals = dict.fromkeys(arxiv_categories, 0)
    for entry in group:
        # each entry is the list of categories of one paper
        for category in entry:
            daily_totals[category] += 1
    return daily_totals
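As an illustration of the intended result, the sketch below counts papers per category on a made-up two-day frame and assembles the daily totals into a dataframe indexed by `update_date`. The names `toy_categories`, `count_categories` and `toy_totals` are hypothetical, and the explicit loop over groups is just one simple way to build the frame, not necessarily how the notebook proceeds.

```python
import pandas as pd

# Hypothetical three-tag universe standing in for arxiv_categories.
toy_categories = ["cs.LG", "math.PR", "stat.ML"]

def count_categories(group):
    # One counter per category; each entry is one paper's category list.
    totals = dict.fromkeys(toy_categories, 0)
    for entry in group:
        for category in entry:
            totals[category] += 1
    return totals

toy = pd.DataFrame({
    "update_date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "categories": [["cs.LG", "stat.ML"], ["cs.LG"], ["math.PR"]],
})

# Build one dict of totals per date, then turn the dicts into rows.
rows = {
    date: count_categories(group)
    for date, group in toy.groupby("update_date")["categories"]
}
toy_totals = pd.DataFrame.from_dict(rows, orient="index")
toy_totals.index.name = "update_date"
print(toy_totals)
```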