# arXiv heatmap
### Data cleaning

##### Starting point
- the arXiv metadata from Kaggle (not in the repo because the file is too big) [should be refactored to use `kagglehub` instead]
- the list of all current categories: `data/arxiv-categories.json`

##### End goal
A cleaned metadata file `data/arxiv-metadata-cleaned.parquet` whose inconsistencies have been resolved.

## The code

In [1]:
import pandas as pd

### Pre-cleaning
We begin by removing unused data to make the dataset lighter.

First we drop only the abstracts and save a copy of the dataset without them to `data/arxiv-metadata-noabstract.parquet`.

In [None]:
arxiv_dataset = pd.read_json("../data/arxiv-metadata-oai-snapshot.json", lines=True)
arxiv_dataset.drop(columns=["abstract"], inplace=True)
arxiv_dataset.to_parquet("../../data/arxiv-metadata-noabstract.parquet")

In [None]:
arxiv_dataset = pd.read_parquet("../../data/arxiv-metadata-noabstract.parquet")

Then we extract the `id`, `versions`, and `categories` columns and save the new dataset to `data/arxiv-metadata-id-versions-categories.parquet`.

In [None]:
arxiv_stripped = arxiv_dataset[["id", "versions", "categories"]]
arxiv_stripped.to_parquet("../../data/arxiv-metadata-id-versions-categories.parquet")

### Cleaning

In [None]:
arxiv_stripped = pd.read_parquet(
    "../../data/arxiv-metadata-id-versions-categories.parquet"
)

#### Date of v1 extraction
We use the date of v1 as the reference point, for two reasons:
- `update_date` is not a reliable source: all papers were updated in May 2007, so it does not record earlier dates
- v1 is easy to extract from each entry in `versions`: `versions[0]['created']`

We begin by defining a function `date_extractor` that takes the list of versions and returns a `pd.Timestamp` containing the first date appearing in the list (i.e. the date of v1), shifted according to the [arXiv announcement schedule](https://info.arxiv.org/help/availability.html#announcement-schedule):
- if v1 is before 18:00:00 UTC, then the announcement date (`date`) is the next business day,
- if v1 is at or after 18:00:00 UTC, then the announcement date is two business days later.

Note that this does not take into account DST shifts in the announcement schedule.  We ignore this for the moment because the effect is probably small, but it does introduce a bias.

In [None]:
def date_extractor(versions):
    # v1 is the first entry in the list of versions.
    timestamp = pd.Timestamp(versions[0]["created"])
    # Keep only the calendar day (drops the time of day and any timezone).
    submission_day = pd.Timestamp(timestamp.year, timestamp.month, timestamp.day)
    # Before the 18:00 UTC cutoff: one business day; otherwise: two.
    business_days = 1 if timestamp.hour < 18 else 2
    return submission_day + pd.offsets.BusinessDay(business_days)
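To see the rule at work on a few illustrative timestamps, here is a minimal standard-library sketch that mirrors `date_extractor` without pandas. It assumes that `created` is an RFC 2822-style string such as `Mon, 2 Apr 2007 19:18:42 GMT`. The helper names `add_business_days` and `announcement_date` are hypothetical and not part of the notebook. Like `pd.offsets.BusinessDay`, the sketch skips weekends only and ignores holidays and DST.

```python
from datetime import date, timedelta
from email.utils import parsedate_to_datetime


def add_business_days(day: date, n: int) -> date:
    """Advance `day` by `n` weekdays, skipping Saturdays and Sundays."""
    while n > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0, ..., Friday=4
            n -= 1
    return day


def announcement_date(created: str) -> date:
    """Apply the fixed 18:00 UTC cutoff to a v1 `created` string."""
    timestamp = parsedate_to_datetime(created)
    business_days = 1 if timestamp.hour < 18 else 2
    return add_business_days(timestamp.date(), business_days)


# Monday evening (after the cutoff): announced two business days later.
print(announcement_date("Mon, 2 Apr 2007 19:18:42 GMT"))  # 2007-04-04
# Friday morning (before the cutoff): the weekend is skipped.
print(announcement_date("Fri, 30 Mar 2007 10:00:00 GMT"))  # 2007-04-02
# Friday evening (after the cutoff): two business days after Friday.
print(announcement_date("Fri, 30 Mar 2007 19:00:00 GMT"))  # 2007-04-03
```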

Next, we create a new `date` column by applying `date_extractor` to the `versions` column.

In [4]:
arxiv_stripped["date"] = arxiv_stripped["versions"].apply(date_extractor)

Finally, we drop the `versions` column and save the new `arxiv_metadata` dataset to `data/arxiv-metadata-id-date-categories.parquet`.

In [5]:
arxiv_metadata = arxiv_stripped.drop(columns=["versions"])
arxiv_metadata.to_parquet("../../data/arxiv-metadata-id-date-categories.parquet")

This is how our dataset looks now.

In [6]:
arxiv_metadata

|         | id               | categories                 | date       |
| ------- | ---------------- | -------------------------- | ---------- |
| 0       | 0704.0001        | hep-ph                     | 2007-04-04 |
| 1       | 0704.0002        | math.CO cs.CG              | 2007-04-02 |
| 2       | 0704.0003        | physics.gen-ph             | 2007-04-03 |
| 3       | 0704.0004        | math.CO                    | 2007-04-02 |
| 4       | 0704.0005        | math.CA math.FA            | 2007-04-04 |
| ...     | ...              | ...                        | ...        |
| 2710801 | supr-con/9608008 | supr-con cond-mat.supr-con | 1996-08-27 |
| 2710802 | supr-con/9609001 | supr-con cond-mat.supr-con | 1996-09-02 |
| 2710803 | supr-con/9609002 | supr-con cond-mat.supr-con | 1996-09-04 |
| 2710804 | supr-con/9609003 | supr-con cond-mat.supr-con | 1996-09-19 |


#### Categories cleaning

First we import the list of current arXiv category tags and store it in the list `arxiv_categories`.

In [8]:
import json

with open("../../data/arxiv-categories.json", "r") as f:
    arxiv_categories_descriptions = json.load(f)

arxiv_categories = [cat["tag"] for cat in arxiv_categories_descriptions]

Now we import the stripped data as `arxiv_metadata`.

In [9]:
arxiv_metadata = pd.read_parquet("../../data/arxiv-metadata-id-date-categories.parquet")

Categories are stored as a single string listing all categories separated by whitespace; we split it into a list of words (each word is one category).

In [10]:
arxiv_metadata["categories"] = arxiv_metadata["categories"].apply(lambda s: s.split())

The arXiv categories have changed over the years: we find all categories that are no longer current and store them in the set `missing_categories`.

In [11]:
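# Use a set for fast membership tests.
current_categories = set(arxiv_categories)

# Collect every category that appears in the data but is not a current tag.
missing_categories = {
    category
    for categories in arxiv_metadata["categories"]
    for category in categories
    if category not in current_categories
}

print(missing_categories)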

{'alg-geom', 'funct-an', 'plasm-ph', 'comp-gas', 'cmp-lg', 'dg-ga', 'astro-ph', 'patt-sol', 'bayes-an', 'mtrl-th', 'q-alg', 'ao-sci', 'q-bio', 'chao-dyn', 'solv-int', 'supr-con', 'atom-ph', 'cond-mat', 'adap-org', 'acc-phys', 'chem-ph'}
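The same split-then-compare steps can be tried on a toy example. The sketch below uses three category strings taken from the dataset preview and a hypothetical, deliberately incomplete set `current_tags` standing in for `arxiv_categories`; it only illustrates the logic and says nothing about the real data.

```python
# Hypothetical stand-in for the list of current tags (deliberately incomplete).
current_tags = {"hep-ph", "math.CO", "cs.CG", "cond-mat.supr-con"}

# Sample `categories` strings in the whitespace-separated format of the dataset.
raw_categories = ["hep-ph", "math.CO cs.CG", "supr-con cond-mat.supr-con"]

# Split each string into a list of tags.
split_categories = [s.split() for s in raw_categories]

# Collect every tag that is not in the current list.
missing = {
    tag for tags in split_categories for tag in tags if tag not in current_tags
}

print(split_categories[1])  # ['math.CO', 'cs.CG']
print(missing)  # {'supr-con'}
```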


We need to decide what to do with each of the missing categories.  The most reasonable choice seems to be to find the closest matching current category and replace each missing category with it.

| Old        |  New                | To add? |
| ---------- | ------------------- | :-----: |
| `mtrl-th`  | `cond-mat.mtrl-sci` |         |
| `q-bio`    | ---                 | X       |
| `acc-phys` | `physics.acc-ph`    |         |
| `dg-ga`    | `math.DG`           |         |
| `cond-mat` | ---                 | X       |
| `chem-ph`  | `physics.chem-ph`   |         |
| `astro-ph` | ---                 | X       |
| `comp-gas` | `nlin.CG`           |         |
| `funct-an` | `math.FA`           |         |
| `patt-sol` | `nlin.PS`           |         |
| `solv-int` | `nlin.SI`           |         |
| `alg-geom` | `math.AG`           |         |
| `adap-org` | `nlin.AO`           |         |
| `supr-con` | `cond-mat.supr-con` |         |
| `plasm-ph` | `physics.plasm-ph`  |         |
| `chao-dyn` | `nlin.CD`           |         |
| `bayes-an` | `physics.data-an`   |         |
| `q-alg`    | `math.QA`           |         |
| `ao-sci`   | `physics.ao-ph`     |         |
| `atom-ph`  | `physics.atom-ph`   |         |
| `cmp-lg`   | `cs.CL`             |         |

The categories `q-bio`, `cond-mat`, and `astro-ph` have been split into subcategories over the years.  Hence, some preprints are classified under what are now meta-categories.  We add these three categories and use them only for preprints that predate the split.

The goal now is to go through `arxiv_metadata` again and replace the missing categories with the new ones.  We start by creating a dictionary `cat_dictionary` to map old categories to new categories, and a function `translate` to translate a list of categories to the new ones using a given dictionary (removing duplicates).

In [12]:
cat_dictionary = {
    "alg-geom": "math.AG",
    "dg-ga": "math.DG",
    "chem-ph": "physics.chem-ph",
    "plasm-ph": "physics.plasm-ph",
    "ao-sci": "physics.ao-ph",
    "mtrl-th": "cond-mat.mtrl-sci",
    "funct-an": "math.FA",
    "comp-gas": "nlin.CG",
    "q-alg": "math.QA",
    "acc-phys": "physics.acc-ph",
    "atom-ph": "physics.atom-ph",
    "supr-con": "cond-mat.supr-con",
    "chao-dyn": "nlin.CD",
    "bayes-an": "physics.data-an",
    "cmp-lg": "cs.CL",
    "patt-sol": "nlin.PS",
    "adap-org": "nlin.AO",
    "solv-int": "nlin.SI",
}


def translate(categories: list, dictionary: dict) -> list:
    # Map each old tag to its replacement (tags without an entry are kept),
    # then remove duplicates and sort.
    return sorted({dictionary.get(cat, cat) for cat in categories})
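As a quick illustration of what `translate` does, the sketch below applies an equivalent copy of it to single lists of tags, using a two-entry excerpt of `cat_dictionary` (the name `excerpt` is hypothetical). It shows that an old tag and its modern equivalent collapse into one entry, and that tags without a dictionary entry pass through unchanged.

```python
def translate(categories: list, dictionary: dict) -> list:
    # Equivalent to the notebook's function, repeated so the example runs alone.
    return sorted({dictionary.get(cat, cat) for cat in categories})


# Excerpt of `cat_dictionary`, enough for this example.
excerpt = {"supr-con": "cond-mat.supr-con", "funct-an": "math.FA"}

# The old tag and its replacement collapse into a single entry.
print(translate(["supr-con", "cond-mat.supr-con"], excerpt))
# ['cond-mat.supr-con']

# Tags without a dictionary entry are kept, and the result is sorted.
print(translate(["math.CO", "funct-an", "cs.CG"], excerpt))
# ['cs.CG', 'math.CO', 'math.FA']
```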

Then we traverse the `categories` column in `arxiv_metadata` and use `translate` with the dictionary `cat_dictionary` to update the categories.

In [13]:
arxiv_metadata["categories"] = arxiv_metadata["categories"].apply(
    lambda x: translate(x, cat_dictionary)
)
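One way to confirm that the translation is complete, sketched here on toy data rather than on the real dataset, is to check that every remaining tag is either current or one of the three meta-categories kept earlier. The names `META_CATEGORIES`, `unknown_tags`, `current_tags`, and the sample rows are hypothetical and not part of the notebook.

```python
# The three meta-categories that are kept for older preprints.
META_CATEGORIES = {"q-bio", "cond-mat", "astro-ph"}


def unknown_tags(rows, current_tags):
    """Return every tag that is neither current nor a kept meta-category."""
    allowed = set(current_tags) | META_CATEGORIES
    return {tag for tags in rows for tag in tags if tag not in allowed}


# Hypothetical, deliberately incomplete stand-in for `arxiv_categories`.
current_tags = ["hep-ph", "math.FA", "cond-mat.supr-con"]

# Rows that are already fully translated: nothing is reported.
translated_rows = [["hep-ph"], ["astro-ph"], ["cond-mat", "cond-mat.supr-con"]]
print(unknown_tags(translated_rows, current_tags))  # set()

# An untranslated old tag would be reported.
print(unknown_tags([["funct-an", "math.FA"]], current_tags))  # {'funct-an'}
```

On the real data, the corresponding call would be `unknown_tags(arxiv_metadata["categories"], arxiv_categories)`, which should return an empty set if the mapping table is complete.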

Here is what the cleaned dataset looks like.

In [14]:
arxiv_metadata

|         | id               | categories          | date       |
| ------- | ---------------- | ------------------- | ---------- |
| 0       | 0704.0001        | [hep-ph]            | 2007-04-04 |
| 1       | 0704.0002        | [cs.CG, math.CO]    | 2007-04-02 |
| 2       | 0704.0003        | [physics.gen-ph]    | 2007-04-03 |
| 3       | 0704.0004        | [math.CO]           | 2007-04-02 |
| 4       | 0704.0005        | [math.CA, math.FA]  | 2007-04-04 |
| ...     | ...              | ...                 | ...        |
| 2710801 | supr-con/9608008 | [cond-mat.supr-con] | 1996-08-27 |
| 2710802 | supr-con/9609001 | [cond-mat.supr-con] | 1996-09-02 |
| 2710803 | supr-con/9609002 | [cond-mat.supr-con] | 1996-09-04 |
| 2710804 | supr-con/9609003 | [cond-mat.supr-con] | 1996-09-19 |


We save the new cleaned file to `data/arxiv-metadata-cleaned.parquet`.

In [15]:
arxiv_metadata.to_parquet("../../data/arxiv-metadata-cleaned.parquet")