# arXiv heatmap
### Data cleaning

##### Starting point
- the arXiv metadata stripped of all columns except `id` (`string`), `update_date` (`datetime`), and `categories` (`list`): `data/arxiv-metadata-id-date-categories.parquet`
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
A `pandas` dataframe:
- indexed by `update_date`
- with one column for each pair of categories (including each category paired with itself)

## The code

In [1]:
import pandas as pd
import json
import copy
from itertools import combinations_with_replacement as cwr

### Cleaning

First we import the list of current arXiv category tags and store it in the list `arxiv_categories`.

In [2]:
with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat['tag'] for cat in arxiv_categories_descriptions]

Now we import the stripped data as `arxiv_metadata`.

In [18]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-id-date-categories.parquet')

The arXiv categories changed over the years: we find all categories that are not the current ones and store them in the set `missing_categories`.

In [4]:
missing_categories = set()

for index, row in arxiv_metadata.iterrows():
    for category in row['categories']:
        if category not in arxiv_categories:
            missing_categories.add(category)

print(missing_categories)

{'alg-geom', 'dg-ga', 'chem-ph', 'plasm-ph', 'ao-sci', 'mtrl-th', 'funct-an', 'comp-gas', 'q-alg', 'cond-mat', 'acc-phys', 'astro-ph', 'atom-ph', 'supr-con', 'chao-dyn', 'bayes-an', 'cmp-lg', 'q-bio', 'patt-sol', 'adap-org', 'solv-int'}
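The row-by-row loop above can be slow on the full metadata. As a rough sketch of an alternative, the same set can be obtained with `explode` and a set difference. The names `toy_metadata`, `toy_categories`, and `toy_missing` and the data they hold are made up for illustration; they stand in for `arxiv_metadata`, `arxiv_categories`, and `missing_categories`.

```python
import pandas as pd

# Hypothetical toy data standing in for `arxiv_metadata` and `arxiv_categories`
toy_metadata = pd.DataFrame({
    'id': ['0001', '0002', '0003'],
    'categories': [['math.AG', 'alg-geom'], ['cs.CL'], ['cmp-lg', 'math.AG']],
})
toy_categories = ['math.AG', 'cs.CL']

# explode() yields one row per (paper, category) pair; the set difference
# keeps only the tags that are absent from the current category list
toy_missing = set(toy_metadata['categories'].explode()) - set(toy_categories)

print(sorted(toy_missing))  # ['alg-geom', 'cmp-lg']
```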


We need to decide what to do with each of the missing categories.  The most reasonable choice seems to be to find the closest matching current category and replace each missing category with it.

| Old        |  New                | To add? |
| ---------- | ------------------- | :-----: |
| `mtrl-th`  | `cond-mat.mtrl-sci` |         |
| `q-bio`    | ---                 | X       |
| `acc-phys` | `physics.acc-ph`    |         |
| `dg-ga`    | `math.DG`           |         |
| `cond-mat` | ---                 | X       |
| `chem-ph`  | `physics.chem-ph`   |         |
| `astro-ph` | ---                 | X       |
| `comp-gas` | `nlin.CG`           |         |
| `funct-an` | `math.FA`           |         |
| `patt-sol` | `nlin.PS`           |         |
| `solv-int` | `nlin.SI`           |         |
| `alg-geom` | `math.AG`           |         |
| `adap-org` | `nlin.AO`           |         |
| `supr-con` | `cond-mat.supr-con` |         |
| `plasm-ph` | `physics.plasm-ph`  |         |
| `chao-dyn` | `nlin.CD`           |         |
| `bayes-an` | `physics.data-an`   |         |
| `q-alg`    | `math.QA`           |         |
| `ao-sci`   | `physics.ao-ph`     |         |
| `atom-ph`  | `physics.atom-ph`   |         |
| `cmp-lg`   | `cs.CL`             |         |

Over the years, the categories `q-bio`, `cond-mat`, and `astro-ph` have been split into subcategories.  Hence, some older preprints are classified under what are now meta-categories.  We add these three categories and use them only for preprints that predate the split.

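We then create a list `graph_edges_keys` of all unordered pairs (with repetition) of elements of `arxiv_categories_extra`.  These tuples will be the keys of a dictionary whose values count the daily entries in each cross-listing.  A tuple with a repeated category, such as `('math.AG', 'math.AG')`, counts the papers listed in that category; with `take_snapshot` as defined further down, this includes cross-listed papers, not only papers listed in a single category.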

In [3]:
arxiv_categories_extra = arxiv_categories + ['q-bio', 'cond-mat', 'astro-ph']
arxiv_categories_combinations = cwr(arxiv_categories_extra, 2)

# use sorted to make sure the tuples are in a consistent ordering
graph_edges_keys = [tuple(sorted(index)) for index in arxiv_categories_combinations]

Now `arxiv_categories_extra` contains the current categories plus the three legacy meta-categories, and `graph_edges_keys` contains every pair built from them, including the three new nodes.
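To see what the keys look like, here is a small sketch on three tags only. The names `toy_tags` and `toy_keys` are illustrative and not part of the notebook.

```python
from itertools import combinations_with_replacement as cwr

# Three tags only, to keep the output readable (illustrative subset)
toy_tags = ['math.AG', 'cs.CL', 'astro-ph']

# Sorting each pair gives every unordered pair a single canonical form
toy_keys = [tuple(sorted(pair)) for pair in cwr(toy_tags, 2)]

for key in toy_keys:
    print(key)
```

For `n` tags this produces `n * (n + 1) / 2` keys: one per unordered pair, plus one diagonal key per tag.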

The goal now is to go through `arxiv_metadata` again and replace the missing categories with the new ones.  We start by creating a dictionary `cat_dictionary` to map old categories to new categories, and a function `cat_translator` to translate a list of categories to the new ones using the dictionary (removing duplicates).

In [20]:
cat_dictionary = {
    'alg-geom': 'math.AG',
    'dg-ga': 'math.DG',
    'chem-ph': 'physics.chem-ph',
    'plasm-ph': 'physics.plasm-ph',
    'ao-sci': 'physics.ao-ph',
    'mtrl-th': 'cond-mat.mtrl-sci',
    'funct-an': 'math.FA',
    'comp-gas': 'nlin.CG',
    'q-alg': 'math.QA',
    'acc-phys': 'physics.acc-ph',
    'atom-ph': 'physics.atom-ph',
    'supr-con': 'cond-mat.supr-con',
    'chao-dyn': 'nlin.CD',
    'bayes-an': 'physics.data-an',
    'cmp-lg': 'cs.CL',
    'patt-sol': 'nlin.PS',
    'adap-org': 'nlin.AO',
    'solv-int': 'nlin.SI'
}

def cat_translator(categories: list) -> list:
    # map legacy tags to current ones, drop duplicates, and sort for a stable order
    return sorted({cat_dictionary.get(cat, cat) for cat in categories})
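As a quick check of the translation logic, here is a self-contained sketch with a two-entry dictionary. The names `toy_dictionary` and `toy_translator` are hypothetical stand-ins for `cat_dictionary` and `cat_translator`.

```python
# Hypothetical two-entry mapping standing in for `cat_dictionary`
toy_dictionary = {'alg-geom': 'math.AG', 'cmp-lg': 'cs.CL'}

def toy_translator(categories):
    # replace legacy tags, leave current tags untouched, drop duplicates
    return sorted({toy_dictionary.get(cat, cat) for cat in categories})

# 'alg-geom' maps to 'math.AG', which is already present, so only one copy survives
print(toy_translator(['alg-geom', 'math.AG', 'hep-th']))  # ['hep-th', 'math.AG']
```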

Then we traverse the `categories` column in `arxiv_metadata` and use the function `cat_translator` to update the categories.

In [21]:
arxiv_metadata['categories'] = arxiv_metadata['categories'].apply(cat_translator)

We save the new cleaned file to `data/arxiv-metadata-cleaned.parquet`.

In [23]:
arxiv_metadata.to_parquet('../data/arxiv-metadata-cleaned.parquet')

### Data crunching
The goal now is to produce a new dataframe, indexed by `update_date`, whose columns are the cross-listings.  We also want another dataframe containing the total daily publications in each category.

We start by importing the cleaned metadata from `data/arxiv-metadata-cleaned.parquet` to `arxiv_metadata`.

In [4]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-cleaned.parquet')

Next we define a function `take_snapshot` that takes the series of category lists for one day and returns a dictionary containing the cross-listing counts.

In [32]:
def take_snapshot(group: pd.Series) -> dict:
    graph_edges = dict.fromkeys(graph_edges_keys, 0)
    for entry in group:
        for edge in cwr(entry, 2):
            graph_edges[tuple(sorted(edge))] += 1
    return graph_edges
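The counting can be traced by hand on a tiny example. The sketch below repeats the same logic on two tags and two papers; `toy_tags`, `toy_edge_keys`, `toy_snapshot`, and `toy_day` are illustrative names, not part of the notebook.

```python
from itertools import combinations_with_replacement as cwr

# Two tags only, so the full key list has three entries (illustrative)
toy_tags = ['cs.CL', 'math.AG']
toy_edge_keys = [tuple(sorted(pair)) for pair in cwr(toy_tags, 2)]

def toy_snapshot(group):
    # start every edge at zero so that all days share the same keys
    edges = dict.fromkeys(toy_edge_keys, 0)
    for entry in group:
        # each paper contributes once to every pair of its categories,
        # including the diagonal pair (cat, cat) for each category
        for edge in cwr(entry, 2):
            edges[tuple(sorted(edge))] += 1
    return edges

# One paper in cs.CL only, one paper cross-listed in cs.CL and math.AG
toy_day = [['cs.CL'], ['cs.CL', 'math.AG']]
toy_counts = toy_snapshot(toy_day)
print(toy_counts)
# {('cs.CL', 'cs.CL'): 2, ('cs.CL', 'math.AG'): 1, ('math.AG', 'math.AG'): 1}
```

Note that the diagonal entry `('cs.CL', 'cs.CL')` is 2: it counts both papers listed in `cs.CL`, including the cross-listed one.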

Now we create a dataframe `arxiv_snapshots` containing the daily snapshots of arXiv cross-listings.  The new dataframe is obtained by grouping `arxiv_metadata` by `update_date` and aggregating each group via the `take_snapshot` function.

We start by creating a dataframe `arxiv_snapshots` whose `categories` column holds one dictionary per day, representing the graph for that day.

In [52]:
arxiv_snapshots = arxiv_metadata.drop(columns=['id']).groupby('update_date').agg({'categories': take_snapshot})

Next we reset the index of `arxiv_snapshots` so that `update_date` becomes a column.

In [53]:
arxiv_snapshots.reset_index(inplace=True)

Then we pop the `categories` column, expand its dictionaries into one column per key, and join the result to `arxiv_snapshots`.

In [56]:
arxiv_snapshots = arxiv_snapshots.join(pd.DataFrame(arxiv_snapshots.pop('categories').tolist()))

Finally, we re-index `arxiv_snapshots` to `update_date`.

In [58]:
arxiv_snapshots.set_index('update_date', inplace=True)

Now we save the snapshots to `data/arxiv-snapshots.parquet`.

In [60]:
arxiv_snapshots.to_parquet('../data/arxiv-snapshots.parquet')


**Note:** the first date in `arxiv_snapshots` seems to contain all entries before May 23, 2007.  We should drop it when we do the analysis.
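One possible way to drop that first row, sketched on made-up data: sort by date and skip the first entry. The frame `toy_snapshots`, its `count` column, and the dates are hypothetical; the real `arxiv_snapshots` has one column per category pair.

```python
import pandas as pd

# Hypothetical frame: the first row plays the role of the aggregated first date
toy_snapshots = pd.DataFrame(
    {'count': [100, 3, 5]},
    index=pd.to_datetime(['2007-05-23', '2007-05-24', '2007-05-25']),
)
toy_snapshots.index.name = 'update_date'

# sort by date to be safe, then skip the first (aggregated) row
toy_trimmed = toy_snapshots.sort_index().iloc[1:]
print(toy_trimmed)
```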