<img src="https://github.com/mrocklin/arxiv-matplotlib/raw/main/results.png?raw=true"
     align="right"
     width="50%"/>

# How Popular is Matplotlib?

Anecdotally, the Matplotlib maintainers were told:

*"About 15% of arXiv papers use Matplotlib"*

arXiv is the preeminent repository for scholarly preprint articles.  It stores millions of articles used across science.  It's also publicly accessible, so we can just scrape the entire thing given enough compute power.

## Watermark

Starting in the early 2010s, Matplotlib began including the bytes `b"Matplotlib"` in every PNG and PDF that it produces.  These bytes persist in PDFs that contain Matplotlib plots, including the PDFs stored on arXiv.  As a result, it's pretty simple to check whether a PDF contains a Matplotlib image.  All we have to do is scan through every PDF and look for these bytes; no parsing required.
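
As a minimal sketch of that check, the function below does a case-insensitive substring search on raw bytes. The name `contains_matplotlib` and the two byte strings are made up for illustration; they stand in for real PDF contents and are not taken from arXiv.

```python
def contains_matplotlib(data: bytes) -> bool:
    """Return True if the raw bytes mention Matplotlib (case-insensitive)."""
    return b"matplotlib" in data.lower()


# Made-up byte strings standing in for real PDF contents.
with_plot = b"%PDF-1.4 ... Matplotlib ... %%EOF"
without_plot = b"%PDF-1.4 ... plain text only ... %%EOF"

print(contains_matplotlib(with_plot))
print(contains_matplotlib(without_plot))
```

Lowercasing the whole payload costs one extra copy of the data, but it keeps the check a single `in` test with no PDF parsing.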

## Data

The data is stored in a requester pays bucket at s3://arxiv (more information at https://arxiv.org/help/bulk_data_s3 ) and also on GCS hosted by Kaggle (more information at https://www.kaggle.com/datasets/Cornell-University/arxiv).  

The data is about 1TB in size.  We're going to use Dask for this.

This is a good example of writing plain vanilla Python code to solve a problem, running into issues of scale, and then using Dask to easily jump over those problems.

### Get all filenames

Our data is stored in a requester pays S3 bucket in the `us-east-1` region.  Each file is a tar file which contains a directory of papers.

In [None]:
import s3fs
s3 = s3fs.S3FileSystem(requester_pays=True)

directories = s3.ls("s3://arxiv/pdf")
directories[:10]

In [None]:
len(directories)

There are lots of these

In [None]:
s3.du("s3://arxiv/pdf") / 1e12

## Process one file with plain Python

Mostly we have to muck about with tar files.  This wasn't hard.  The `tarfile` library is in the standard library.  It's not beautiful, but it's also not hard to use.
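
The tar-handling part can be tried in isolation, without S3 access. This is only a sketch: `scan_tar` and `make_demo_tar` are hypothetical helper names, the archive contents are made up, and it uses `tarfile.open` rather than the `tarfile.TarFile` constructor used in this notebook.

```python
import io
import tarfile


def scan_tar(raw: bytes) -> list[tuple[str, bool]]:
    """Return (member name, mentions matplotlib) for each PDF in a tar archive."""
    results = []
    with tarfile.open(fileobj=io.BytesIO(raw)) as tf:
        for member in tf.getmembers():
            if member.isfile() and member.name.endswith(".pdf"):
                data = tf.extractfile(member).read()
                results.append((member.name, b"matplotlib" in data.lower()))
    return results


def make_demo_tar(files: dict[str, bytes]) -> bytes:
    """Build a small uncompressed tar archive in memory (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tf:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Made-up archive: two fake PDFs and one non-PDF file that should be ignored.
demo = make_demo_tar({
    "0001/a.pdf": b"%PDF ... Matplotlib ...",
    "0001/b.pdf": b"%PDF ... no plots here ...",
    "0001/notes.txt": b"matplotlib",
})
print(scan_tar(demo))
```

Only members ending in `.pdf` are scanned, so the text file is skipped even though it contains the watermark bytes.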

In [None]:
import tarfile
import io

def extract(filename: str):
    """ Extract and process one directory of arXiv data
    
    Returns
    -------
    list of (filename: str, contains_matplotlib: bool) tuples,
    one per PDF found in the tar file
    """
    out = []
    with s3.open(filename) as f:
        raw = f.read()  # named `raw` to avoid shadowing the builtin `bytes`
    with io.BytesIO(raw) as bio:
        try:
            with tarfile.TarFile(fileobj=bio) as tf:
                for member in tf.getmembers():
                    if member.isfile() and member.name.endswith(".pdf"):
                        data = tf.extractfile(member).read()
                        out.append((
                            member.name,
                            # Case-insensitive search for the watermark bytes
                            b"matplotlib" in data.lower()
                        ))
        except tarfile.ReadError:
            # Skip archives that can't be read as tar files
            pass
    return out

In [None]:
%%time

# See an example of its use
extract(directories[20])[:20]

## Scale processing to full dataset

Great, we can get a record of each file and whether or not it used Matplotlib.  Each of these takes about a minute to run on my local machine.  Processing all 5000 files would take about 5000 minutes, or a bit over 80 hours.

We can accelerate this in two ways:

1.  **Process closer to the data** by spinning up resources in the same region on the cloud (this also reduces data transfer costs)
2.  **Use hundreds of workers** in parallel

We can do this easily with [Coiled Functions](https://docs.coiled.io/user_guide/usage/functions/index.html).
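
As a back-of-the-envelope check on why parallelism matters here, the sketch below turns the rough numbers quoted above (about 5000 files, about a minute each) into wall-clock estimates. The worker counts are hypothetical, and the model assumes perfect scaling with no startup or data-transfer overhead, so treat the results as optimistic lower bounds.

```python
import math


def serial_hours(n_files: int, minutes_per_file: float) -> float:
    """Total wall-clock hours when files are processed one at a time."""
    return n_files * minutes_per_file / 60


def parallel_minutes(n_files: int, minutes_per_file: float, n_workers: int) -> float:
    """Idealized wall-clock minutes: each worker handles ceil(n_files / n_workers) files."""
    return math.ceil(n_files / n_workers) * minutes_per_file


n_files = 5000          # approximate file count quoted in the text
minutes_per_file = 1.0  # rough local timing quoted in the text

print(f"serial: {serial_hours(n_files, minutes_per_file):.1f} hours")
for n_workers in (10, 100, 500):  # hypothetical worker counts
    minutes = parallel_minutes(n_files, minutes_per_file, n_workers)
    print(f"{n_workers} workers: about {minutes:.0f} minutes")
```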

## Run function on the cloud in parallel

We annotate our `extract` function with the `@coiled.function` decorator to have it run on AWS in the same region where the data is stored.

In [None]:
import coiled

@coiled.function(
    region="us-east-1",  # Local to data.  Faster and cheaper.
    vm_type="m6i.xlarge",
    threads_per_worker=4,
)
def extract(filename: str):
    """ Extract and process one directory of arXiv data
    
    Returns
    -------
    list of (filename: str, contains_matplotlib: bool) tuples,
    one per PDF found in the tar file
    """
    out = []
    with s3.open(filename) as f:
        raw = f.read()  # named `raw` to avoid shadowing the builtin `bytes`
    with io.BytesIO(raw) as bio:
        try:
            with tarfile.TarFile(fileobj=bio) as tf:
                for member in tf.getmembers():
                    if member.isfile() and member.name.endswith(".pdf"):
                        data = tf.extractfile(member).read()
                        out.append((
                            member.name,
                            # Case-insensitive search for the watermark bytes
                            b"matplotlib" in data.lower()
                        ))
        except tarfile.ReadError:
            # Skip archives that can't be read as tar files
            pass
    return out

### Map function across every directory

Let's scale up this work across all of the directories in our dataset.

Hopefully it will also be faster because the cloud VMs are in the same region as the dataset itself.

In [None]:
%%time

results = extract.map(directories)
lists = list(results)

Now that we're done with the large data problem we can turn off Coiled and proceed with pure Pandas. There's no reason to deal with scalable tools if we don't have to.

## Enrich Data

Let's enhance our data a bit.  The filenames of each file include the year and month when they were published.  After extracting this data we'll be able to see a timeseries of Matplotlib adoption.

In [None]:
# Convert to Pandas

import pandas as pd

# One DataFrame per tar file; `rows` avoids shadowing the builtin `list`
dfs = [
    pd.DataFrame(rows, columns=["filename", "has_matplotlib"])
    for rows in lists
]

df = pd.concat(dfs)

df

In [None]:
def date(filename):
    year = int(filename.split("/")[0][:2])
    month = int(filename.split("/")[0][2:4])
    if year > 80:
        year = 1900 + year
    else:
        year = 2000 + year
    
    return pd.Timestamp(year=year, month=month, day=1)

date("0005/astro-ph0001322.pdf")

Yup.  That seems to work.  Let's map this function over our dataset.

In [None]:
df["date"] = df.filename.map(date)
df.head()

## Plot

Now we can just fool around with Pandas and Matplotlib.
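
The aggregation used for the plot relies on the fact that the mean of a boolean column is the fraction of `True` values. Here is a toy illustration with made-up rows (the `toy` frame is hypothetical and unrelated to the real results):

```python
import pandas as pd

# Made-up data: two papers in January (one with Matplotlib),
# four papers in February (one with Matplotlib).
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-01-01", "2015-01-01",
        "2015-02-01", "2015-02-01", "2015-02-01", "2015-02-01",
    ]),
    "has_matplotlib": [True, False, True, False, False, False],
})

# Mean of booleans per month = fraction of papers flagged that month
fraction = toy.groupby("date").has_matplotlib.mean()
print(fraction)
```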

In [None]:
df.groupby("date").has_matplotlib.mean().plot(
    title="Matplotlib Usage in arXiv", 
    ylabel="Fraction of papers"
).get_figure().savefig("results.png")

I made the plot above.  Then Thomas Caswell (a Matplotlib maintainer) came by and, in true form, made something much better 🙂

In [None]:
import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

import pandas as pd

by_month = df.groupby("date").has_matplotlib.mean()

# get figure
fig, ax = plt.subplots(layout="constrained")
# plot the data
ax.plot(by_month, "o", color="k", ms=3)

# over-ride the default auto limits
ax.set_xlim(left=datetime.date(2004, 1, 1))
ax.set_ylim(bottom=0)

# turn on a horizontal grid
ax.grid(axis="y")

# remove the top and right spines
ax.spines.right.set_visible(False)
ax.spines.top.set_visible(False)

# format y-ticks as a percent
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))

# add title and labels
ax.set_xlabel("date")
ax.set_ylabel("% of all papers")
ax.set_title("Matplotlib usage on arXiv");

Yup.  Matplotlib is used pretty commonly on arXiv.  Go team.

## Save results

This data was slightly painful to procure.  Let's save the results locally for future analysis.  That way other researchers can further analyze the results without having to muck about with parallelism or cloud stuff.

In [None]:
df.to_csv("arxiv-matplotlib.csv")

In [None]:
!du -hs arxiv-matplotlib.csv

In [None]:
df.to_parquet("arxiv-matplotlib.parquet", compression="snappy")

In [None]:
!du -hs arxiv-matplotlib.parquet

## Conclusion

### Matplotlib + arXiv

It's incredible to see the steady growth of Matplotlib across arXiv.  It's worth noting that this is *all* papers, even from fields like theoretical mathematics that are unlikely to include computer-generated plots.  Is this Matplotlib growing in popularity?  Or is it Python generally?

For future work, we should break this down by subfield.  The filenames actually contained the name of the field for a while, like "hep-ex" for "high energy physics, experimental", but it looks like arXiv stopped doing this at some point.  My guess is that there is a list mapping filenames to fields somewhere though.  The filenames are all in the Pandas dataframe / parquet dataset, so doing this analysis shouldn't require any scalable computing.
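
As a starting point for that breakdown, here is a rough sketch that pulls the field prefix out of old-style filenames. The pattern assumes names shaped like `0005/astro-ph0001322.pdf` (four-digit directory, field name, seven digits); that layout is inferred from the example used earlier, the `field` helper is hypothetical, and the second filename below is a made-up stand-in for the newer naming scheme, for which the function returns `None`.

```python
import re
from typing import Optional

# Assumed old-style layout: "<yymm>/<field><7 digits>.pdf"
OLD_STYLE = re.compile(r"^\d{4}/([a-z\-]+)\d{7}\.pdf$")


def field(filename: str) -> Optional[str]:
    """Return the field prefix (e.g. "astro-ph"), or None if the name has none."""
    match = OLD_STYLE.match(filename)
    return match.group(1) if match else None


print(field("0005/astro-ph0001322.pdf"))
print(field("1501/1501.00001.pdf"))  # made-up new-style name with no field
```

Applied with `df.filename.map(field)`, this would label only the older papers; the rest would need an external mapping from identifiers to fields.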

### Coiled

Coiled was built to make it easy to answer large questions.  

We started this notebook with some generic Python code. When we wanted to scale up we invoked Coiled, did some work, and then tore things down, all in about ten minutes. The problem of scale or "big data" didn't get in the way of us analyzing data and making a delightful discovery. 

This is exactly why these projects exist.