# arXiv heatmap
### Data cleaning

##### Starting point
- the arXiv metadata stripped of all columns except `id` (`string`), `update_date` (`datetime`), and `categories` (`list`): `data/arxiv-metadata-id-date-categories.parquet`
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
A `pandas` dataframe:
- indexed by `update_date`
- with one column for each unordered pair of categories (pairs with a repeated category included)

## The code

In [1]:
import pandas as pd
import json
import copy
from itertools import combinations_with_replacement as cwr

### Cleaning

First we import the list of current arXiv category tags and store it in the list `arxiv_categories`.

In [2]:
with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat['tag'] for cat in arxiv_categories_descriptions]

Now we import the stripped data as `arxiv_metadata`.

In [3]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-id-date-categories.parquet')

The arXiv categories changed over the years: we find all categories that are not the current ones and store them in the set `missing_categories`.

In [4]:
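# a set makes each membership test O(1) instead of a scan of the whole list
current_categories = set(arxiv_categories)
missing_categories = set()

# iterate over the column directly, which is much faster than `iterrows`
for categories in arxiv_metadata['categories']:
    for category in categories:
        if category not in current_categories:
            missing_categories.add(category)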

print(missing_categories)

{'alg-geom', 'dg-ga', 'chem-ph', 'plasm-ph', 'ao-sci', 'mtrl-th', 'funct-an', 'comp-gas', 'q-alg', 'cond-mat', 'acc-phys', 'astro-ph', 'atom-ph', 'supr-con', 'chao-dyn', 'bayes-an', 'cmp-lg', 'q-bio', 'patt-sol', 'adap-org', 'solv-int'}


We need to decide what to do with each of the missing categories.  The most reasonable choice, it seems to me, is to replace each missing category with the closest matching current one.

| Old        |  New                | To add? |
| ---------- | ------------------- | :-----: |
| `mtrl-th`  | `cond-mat.mtrl-sci` |         |
| `q-bio`    | `q-bio`             | X       |
| `acc-phys` | `physics.acc-ph`    |         |
| `dg-ga`    | `math.DG`           |         |
| `cond-mat` | `cond-mat`          | X       |
| `chem-ph`  | `physics.chem-ph`   |         |
| `astro-ph` | `astro-ph`          | X       |
| `comp-gas` | `nlin.CG`           |         |
| `funct-an` | `math.FA`           |         |
| `patt-sol` | `nlin.PS`           |         |
| `solv-int` | `nlin.SI`           |         |
| `alg-geom` | `math.AG`           |         |
| `adap-org` | `nlin.AO`           |         |
| `supr-con` | `cond-mat.supr-con` |         |
| `plasm-ph` | `physics.plasm-ph`  |         |
| `chao-dyn` | `nlin.CD`           |         |
| `bayes-an` | `physics.data-an`   |         |
| `q-alg`    | `math.QA`           |         |
| `ao-sci`   | `physics.ao-ph`     |         |
| `atom-ph`  | `physics.atom-ph`   |         |
| `cmp-lg`   | `cs.CL`             |         |

The categories `q-bio`, `cond-mat`, and `astro-ph` have been split into subcategories over the years.  Hence, some older preprints are classified under what are now meta-categories.  We add these three categories to the list, and we use them only for preprints dating from before the split.

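We then create a dictionary `graph_edges` whose keys are the pairs with repetition (stored as sorted tuples) of elements of `arxiv_categories_extra`, and whose values will hold the daily number of preprints in the corresponding cross-listing.  A pair with a repeated tag, such as `('math.AG', 'math.AG')`, stands for the papers listed in that category only.  All counts start at zero.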

In [5]:
arxiv_categories_extra = arxiv_categories + ['q-bio', 'cond-mat', 'astro-ph']
arxiv_categories_combinations = cwr(arxiv_categories_extra, 2)

# use sorted to make sure the tuples are in a consistent ordering
graph_edges = {tuple(sorted(index)): 0 for index in arxiv_categories_combinations}

Now `arxiv_categories_extra` contains the current categories plus the three legacy meta-categories, and `graph_edges` has a key for every pair of nodes, the three new ones included.
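To make the structure of `graph_edges` concrete, here is a minimal sketch of the same construction on a toy list of three tags.  The names `toy_categories` and `toy_edges` are hypothetical and only serve the illustration.

```python
from itertools import combinations_with_replacement as cwr

# Hypothetical three-tag stand-in for `arxiv_categories_extra`.
toy_categories = ['math.AG', 'cs.CL', 'astro-ph']

# Same construction as above: one zero-initialised counter per unordered pair.
toy_edges = {tuple(sorted(pair)): 0 for pair in cwr(toy_categories, 2)}

# n tags give n * (n + 1) / 2 keys: n repeated pairs plus n * (n - 1) / 2 mixed ones.
n = len(toy_categories)
print(len(toy_edges) == n * (n + 1) // 2)

# Sorting the pair before the lookup means the order of the tags on a paper does not matter.
toy_edges[tuple(sorted(('math.AG', 'cs.CL')))] += 1
```

Because every key is sorted, a lookup must sort the pair as well: the key is `('cs.CL', 'math.AG')`, and `('math.AG', 'cs.CL')` is not in the dictionary.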

We finally go through `arxiv_metadata` again and replace the missing categories with the new ones.

In [None]:
# TODO
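One possible way to fill in this step is sketched below; it is not necessarily how the notebook will end up doing it.  The dictionary `category_replacements` is copied from the table above (the three meta-categories are left out because they are kept as they are), while the helper `modernise_categories` and the toy frame `toy_metadata` are hypothetical names.  Dropping duplicate tags after the replacement is an assumption on my part.

```python
import pandas as pd

# Old tag -> closest current tag, copied from the table above.
category_replacements = {
    'mtrl-th': 'cond-mat.mtrl-sci',
    'acc-phys': 'physics.acc-ph',
    'dg-ga': 'math.DG',
    'chem-ph': 'physics.chem-ph',
    'comp-gas': 'nlin.CG',
    'funct-an': 'math.FA',
    'patt-sol': 'nlin.PS',
    'solv-int': 'nlin.SI',
    'alg-geom': 'math.AG',
    'adap-org': 'nlin.AO',
    'supr-con': 'cond-mat.supr-con',
    'plasm-ph': 'physics.plasm-ph',
    'chao-dyn': 'nlin.CD',
    'bayes-an': 'physics.data-an',
    'q-alg': 'math.QA',
    'ao-sci': 'physics.ao-ph',
    'atom-ph': 'physics.atom-ph',
    'cmp-lg': 'cs.CL',
}


def modernise_categories(categories):
    """Replace legacy tags, then drop duplicates while keeping the order."""
    updated = [category_replacements.get(cat, cat) for cat in categories]
    # Assumption: a paper tagged with both an old tag and its replacement counts once.
    return list(dict.fromkeys(updated))


# Hypothetical toy frame standing in for `arxiv_metadata`.
toy_metadata = pd.DataFrame({
    'id': ['a', 'b'],
    'categories': [['alg-geom', 'math.AG'], ['cond-mat', 'supr-con']],
})
toy_metadata['categories'] = toy_metadata['categories'].apply(modernise_categories)
print(toy_metadata)
```

On the real data the same `apply` call would be run on `arxiv_metadata['categories']`.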

### Data crunching
The goal now is to produce a new dataframe, indexed by `update_date`, whose rows count the cross-listings of each day.  We also want another dataframe containing the total daily publications in each category.

In [10]:
# TODO
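A rough sketch of how the two dataframes could be built, on toy data, is given below.  It assumes the categories have already been cleaned, that a paper with several tags contributes one count to every pair of its tags, and that a single-category paper is recorded as a repeated pair, as described earlier.  The names `toy_metadata`, `category_pairs`, `cross_listings`, and `daily_totals` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical cleaned data standing in for `arxiv_metadata`.
toy_metadata = pd.DataFrame({
    'update_date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02']),
    'categories': [['math.AG', 'math.QA'], ['math.AG'], ['cs.CL', 'math.AG']],
})


def category_pairs(categories):
    """Sorted pairs of distinct tags, or the repeated tag for a single-category paper."""
    tags = sorted(set(categories))
    if len(tags) == 1:
        return [(tags[0], tags[0])]
    return list(combinations(tags, 2))


# Long format: one record per (date, pair) occurrence.
records = [
    (date, first, second)
    for date, categories in zip(toy_metadata['update_date'], toy_metadata['categories'])
    for first, second in category_pairs(categories)
]
long_format = pd.DataFrame(records, columns=['update_date', 'first', 'second'])

# Rows: dates.  Columns: category pairs (a two-level column index).
cross_listings = (
    long_format.groupby(['update_date', 'first', 'second'])
    .size()
    .unstack(['first', 'second'], fill_value=0)
)

# Rows: dates.  Columns: single categories.
daily_totals = (
    toy_metadata.explode('categories')
    .groupby(['update_date', 'categories'])
    .size()
    .unstack(fill_value=0)
)

print(cross_listings)
print(daily_totals)
```

Going through a long-format table first keeps the counting in `groupby`, so no Python-level dictionary of counters has to be updated row by row.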