# arXiv heatmap
### Data cleaning

##### Starting point
- the arXiv metadata stripped of all columns except for `id` (`string`), `update_date` (`datetime`), and `categories` (`list`): `data/arxiv-metadata-id-date-categories.parquet`
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
A cleaned metadata file `data/arxiv-metadata-cleaned.parquet` in which the category inconsistencies have been resolved.

## The code

In [1]:
import pandas as pd

### Cleaning

First we import the list of current arXiv category tags and store it in the list `arxiv_categories`.

In [2]:
import json

with open('../data/arxiv-categories.json', 'r') as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat['tag'] for cat in arxiv_categories_descriptions]

Now we import the stripped data as `arxiv_metadata`.

In [3]:
arxiv_metadata = pd.read_parquet('../data/arxiv-metadata-id-date-categories.parquet')

The arXiv categories have changed over the years. We collect every category that appears in the data but is not in the current list, and store these in the set `missing_categories`.

In [4]:
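# Use a set for fast membership tests.
current_categories = set(arxiv_categories)
missing_categories = set()

# Iterate over the column directly; this avoids the overhead of iterrows().
for categories in arxiv_metadata['categories']:
    missing_categories.update(cat for cat in categories if cat not in current_categories)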

print(missing_categories)

{'plasm-ph', 'dg-ga', 'cmp-lg', 'bayes-an', 'patt-sol', 'chem-ph', 'comp-gas', 'chao-dyn', 'astro-ph', 'q-alg', 'atom-ph', 'cond-mat', 'alg-geom', 'supr-con', 'funct-an', 'q-bio', 'adap-org', 'acc-phys', 'mtrl-th', 'ao-sci', 'solv-int'}
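
Before deciding how to handle the missing categories, it can help to know how often each one occurs. The following is only a sketch on toy data: `toy` and `current` are hypothetical stand-ins for `arxiv_metadata` and the list of current tags, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the real objects in this notebook.
current = {'math.AG', 'hep-th'}
toy = pd.DataFrame({
    'categories': [['alg-geom', 'hep-th'], ['math.AG'], ['alg-geom'], ['q-alg']]
})

# One row per (preprint, category) pair.
exploded = toy['categories'].explode()

# Keep only the tags that are not current, then count occurrences.
missing_counts = exploded[~exploded.isin(current)].value_counts()
print(missing_counts)
```

On the real data the same lines should work with `arxiv_metadata` and `arxiv_categories` substituted in, assuming every entry of `categories` is a non-empty list.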


We need to decide what to do with each of the missing categories.  The most reasonable choice, in my view, is to find the closest matching current category and replace each missing category with it.

| Old        |  New                | To add? |
| ---------- | ------------------- | :-----: |
| `mtrl-th`  | `cond-mat.mtrl-sci` |         |
| `q-bio`    | ---                 | X       |
| `acc-phys` | `physics.acc-ph`    |         |
| `dg-ga`    | `math.DG`           |         |
| `cond-mat` | ---                 | X       |
| `chem-ph`  | `physics.chem-ph`   |         |
| `astro-ph` | ---                 | X       |
| `comp-gas` | `nlin.CG`           |         |
| `funct-an` | `math.FA`           |         |
| `patt-sol` | `nlin.PS`           |         |
| `solv-int` | `nlin.SI`           |         |
| `alg-geom` | `math.AG`           |         |
| `adap-org` | `nlin.AO`           |         |
| `supr-con` | `cond-mat.supr-con` |         |
| `plasm-ph` | `physics.plasm-ph`  |         |
| `chao-dyn` | `nlin.CD`           |         |
| `bayes-an` | `physics.data-an`   |         |
| `q-alg`    | `math.QA`           |         |
| `ao-sci`   | `physics.ao-ph`     |         |
| `atom-ph`  | `physics.atom-ph`   |         |
| `cmp-lg`   | `cs.CL`             |         |

The categories `q-bio`, `cond-mat`, and `astro-ph` have been split into subcategories over the years.  Hence, some preprints are classified into what are now meta-categories.  We add these three categories and use them only for preprints that predate the split.
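
The notebook does not show how the three meta-categories are added. One possible reading is that they are appended to the list of current tags. A minimal sketch under that assumption, where `meta_categories` and `extended_categories` are hypothetical names and the list of current tags is abbreviated:

```python
# Abbreviated, hypothetical stand-in for the list loaded from the JSON file.
arxiv_categories = ['math.AG', 'cs.CL', 'nlin.CD']

# The three meta-categories marked "To add?" in the table above.
meta_categories = ['q-bio', 'cond-mat', 'astro-ph']

# Append each meta-category only if it is not already listed.
extended_categories = arxiv_categories + [
    cat for cat in meta_categories if cat not in arxiv_categories
]
print(extended_categories)
```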

The goal now is to go through `arxiv_metadata` again and replace the missing categories with the new ones.  We start by creating a dictionary `cat_dictionary` to map old categories to new categories, and a function `cat_translator` to translate a list of categories to the new ones using the dictionary (removing duplicates).

In [5]:
cat_dictionary = {
    'alg-geom': 'math.AG',
    'dg-ga': 'math.DG',
    'chem-ph': 'physics.chem-ph',
    'plasm-ph': 'physics.plasm-ph',
    'ao-sci': 'physics.ao-ph',
    'mtrl-th': 'cond-mat.mtrl-sci',
    'funct-an': 'math.FA',
    'comp-gas': 'nlin.CG',
    'q-alg': 'math.QA',
    'acc-phys': 'physics.acc-ph',
    'atom-ph': 'physics.atom-ph',
    'supr-con': 'cond-mat.supr-con',
    'chao-dyn': 'nlin.CD',
    'bayes-an': 'physics.data-an',
    'cmp-lg': 'cs.CL',
    'patt-sol': 'nlin.PS',
    'adap-org': 'nlin.AO',
    'solv-int': 'nlin.SI'
}

def cat_translator(categories: list) -> list:
    # Map old tags to new ones, drop duplicates, and return a sorted list.
    return sorted({cat_dictionary.get(cat, cat) for cat in categories})
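
To illustrate what the translator does, here is a small self-contained sketch. It uses an abbreviated copy of the dictionary and an equivalent definition of the function; the input list is made up for the example.

```python
# Abbreviated copy of the mapping, for illustration only.
cat_dictionary = {'alg-geom': 'math.AG', 'q-alg': 'math.QA'}

def cat_translator(categories):
    # Translate old tags, leave unknown tags untouched, drop duplicates, sort.
    return sorted({cat_dictionary.get(cat, cat) for cat in categories})

# 'alg-geom' and 'math.AG' collapse into one entry; 'astro-ph' is kept as is.
example = cat_translator(['alg-geom', 'math.AG', 'hep-th', 'astro-ph'])
print(example)  # ['astro-ph', 'hep-th', 'math.AG']
```

Note that tags absent from the dictionary, including the three meta-categories, pass through unchanged.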

Then we apply the function `cat_translator` to every entry of the `categories` column in `arxiv_metadata` to update the categories.

In [6]:
arxiv_metadata['categories'] = arxiv_metadata['categories'].apply(cat_translator)
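
After the translation, the only non-current tags left should be the three meta-categories. A quick check along these lines could confirm that. This is a sketch on toy data: `df`, `current`, and `meta` are hypothetical stand-ins for the translated `arxiv_metadata`, the current tags, and the meta-categories.

```python
import pandas as pd

# Hypothetical stand-ins for the real objects in this notebook.
current = {'math.AG', 'hep-th'}
meta = {'q-bio', 'cond-mat', 'astro-ph'}
df = pd.DataFrame({
    'categories': [['math.AG'], ['astro-ph', 'hep-th'], ['cond-mat']]
})

# Tags still present in the data that are not in the current list.
remaining = set(df['categories'].explode()) - current

# Anything left over besides the meta-categories would signal a gap in the mapping.
unexpected = remaining - meta
print(remaining, unexpected)
```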

We save the new cleaned file to `data/arxiv-metadata-cleaned.parquet`.

In [7]:
arxiv_metadata.to_parquet('../data/arxiv-metadata-cleaned.parquet')