# arXiv heatmap
### Data cleaning

##### Starting point
- the arXiv metadata stripped of all columns except for `id` (`string`), `update_date` (`datetime`), and `categories` (`list`): `data/arxiv-metadata-id-date-categories.parquet`
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
A `pandas` dataframe:
- indexed by `update_date`
- with one column per unordered pair of categories, including each category paired with itself

## The code

In [2]:
import pandas as pd
import json
import copy
from itertools import combinations_with_replacement as cwr

### Cleaning

First we import the list of current arXiv category tags and store it in the list `arxiv_categories`.
We also create a dictionary `graph_edges` whose keys are the pairs (with repetition) of elements of `arxiv_categories` and whose values will count the daily entries in that cross-listing. A pair with a repeated tag, such as `('math.AG', 'math.AG')`, stands for the papers listed in only one category.

In [4]:
with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat['tag'] for cat in arxiv_categories_descriptions]
arxiv_categories_combinations = list(cwr(arxiv_categories, 2))

# use sorted to make sure the tuples are in a consistent ordering
graph_edges = {tuple(sorted(index)): 0 for index in arxiv_categories_combinations}
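
To see what the keys of `graph_edges` look like, here is a minimal sketch on a toy list of three tags (`toy_categories` and `toy_edges` are names made up for this illustration, not part of the notebook). With `n` categories there are `n * (n + 1) / 2` keys.

```python
from itertools import combinations_with_replacement as cwr

# Toy category list (a small subset of real tags, for illustration only)
toy_categories = ['math.AG', 'cs.CL', 'hep-th']

# Sort each pair so that a given pair of tags always yields the same key
toy_edges = {tuple(sorted(pair)): 0 for pair in cwr(toy_categories, 2)}

for edge in toy_edges:
    print(edge)

# 3 categories give 3 * 4 / 2 = 6 keys
print(len(toy_edges))
```

Because every key is sorted, a lookup has to sort the pair first: `('cs.CL', 'math.AG')` is a key, while `('math.AG', 'cs.CL')` is not.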

Now we import the stripped data as `arxiv_metadata`.

In [5]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-id-date-categories.parquet')

The arXiv categories changed over the years: we find all categories that are not the current ones and store them in the set `missing_categories`.

In [7]:
missing_categories = set()

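# a set gives constant-time membership tests
current_categories = set(arxiv_categories)

# iterate over the column directly, which is much faster than `iterrows`
for categories in arxiv_metadata['categories']:
    missing_categories.update(set(categories) - current_categories)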

print(missing_categories)

{'mtrl-th', 'q-bio', 'acc-phys', 'dg-ga', 'cond-mat', 'chem-ph', 'astro-ph', 'comp-gas', 'funct-an', 'patt-sol', 'solv-int', 'alg-geom', 'adap-org', 'supr-con', 'plasm-ph', 'chao-dyn', 'bayes-an', 'q-alg', 'ao-sci', 'atom-ph', 'cmp-lg'}


We need to decide what to do with each of the missing categories. The most reasonable choice, to me, seems to be to find the closest matching current category and replace each missing category with it.

| Old        |  New    |
| ---------- | --------|
| `mtrl-th`  ||
| `q-bio`    ||
| `acc-phys` ||
| `dg-ga`    ||
| `cond-mat` ||
| `chem-ph`  ||
| `astro-ph` ||
| `comp-gas` ||
| `funct-an` ||
| `patt-sol` ||
| `solv-int` ||
| `alg-geom` ||
| `adap-org` ||
| `supr-con` ||
| `plasm-ph` ||
| `chao-dyn` ||
| `bayes-an` ||
| `q-alg`    ||
| `ao-sci`   ||
| `atom-ph`  ||
| `cmp-lg`   ||

Now we go through `arxiv_metadata` again and replace the missing categories with the new ones.

In [None]:
# TODO
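
One possible way to do the replacement, once the table above is filled in, is sketched below. The dictionary `old_to_new` is only a partial, illustrative mapping (these five pairs are my assumption, not a decision recorded in this notebook), and `replace_old_categories` and `toy_metadata` are hypothetical names used for the example.

```python
import pandas as pd

# Illustrative, partial mapping from retired tags to current ones.
# ASSUMPTION: these pairs are placeholders for the example; the real
# mapping is whatever ends up in the Old/New table.
old_to_new = {
    'alg-geom': 'math.AG',
    'dg-ga': 'math.DG',
    'funct-an': 'math.FA',
    'q-alg': 'math.QA',
    'cmp-lg': 'cs.CL',
}


def replace_old_categories(categories, mapping):
    """Map retired tags to current ones, dropping duplicates the mapping creates."""
    replaced = [mapping.get(category, category) for category in categories]
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(replaced))


# Toy stand-in for `arxiv_metadata`
toy_metadata = pd.DataFrame({
    'id': ['a', 'b'],
    'categories': [['alg-geom', 'math.AG'], ['hep-th', 'q-alg']],
})

toy_metadata['categories'] = toy_metadata['categories'].apply(
    lambda categories: replace_old_categories(categories, old_to_new)
)
print(toy_metadata['categories'].tolist())
```

Dropping duplicates is a design choice: a paper tagged with both an old tag and its replacement would otherwise list the same category twice.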

### Data crunching
The goal now is to produce a new dataframe, indexed by `update_date`, whose columns are the cross-listings. We also want another dataframe containing the total daily publications in each category.

In [10]:
# TODO
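
A rough sketch of how both dataframes could be built, shown on toy data. It assumes the categories are already cleaned and contain no duplicates within a paper; `toy_metadata`, `paper_edges`, `cross_listings` and `daily_totals` are hypothetical names, and the two-level column layout is just one possible choice.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the cleaned `arxiv_metadata`
toy_metadata = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'update_date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02']),
    'categories': [
        ['math.AG'],
        ['hep-th', 'math.AG'],
        ['cs.CL', 'hep-th', 'math.AG'],
    ],
})


def paper_edges(categories):
    """Return the sorted category pairs a single paper contributes to."""
    tags = sorted(set(categories))
    if len(tags) == 1:
        # a paper in a single category counts on the repeated pair
        return [(tags[0], tags[0])]
    return list(combinations(tags, 2))


# Long format: one row per (date, pair) occurrence
records = [
    {'update_date': date, 'cat_a': cat_a, 'cat_b': cat_b}
    for date, categories in zip(toy_metadata['update_date'], toy_metadata['categories'])
    for cat_a, cat_b in paper_edges(categories)
]
edges_long = pd.DataFrame(records)

# Wide format: one row per date, one column per pair of categories
cross_listings = (
    edges_long.groupby(['update_date', 'cat_a', 'cat_b'])
    .size()
    .unstack(['cat_a', 'cat_b'], fill_value=0)
)

# Total daily publications in each category
daily_totals = (
    toy_metadata.explode('categories')
    .groupby(['update_date', 'categories'])
    .size()
    .unstack(fill_value=0)
)

print(cross_listings)
print(daily_totals)
```

Only the pairs that actually occur show up as columns here. To get a column for every key of `graph_edges`, the result would still need to be reindexed against those keys.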