# Cargo dependency breakdown

## Analogy

So, to give you an analogy:
* MHRA is like a pizza shop where pizzas are made
* there are many chefs working in the MHRA pizza shop and there are many pizza shops doing similar work to MHRA
* a pizza (backend web server) has cheese (http library) and pizza base (database library)
  * pizza base has dough (low-level database library)
    * dough has flour (network library), salt (encryption library) and yeast (database query formatter) 
  * cheese has milk (http library) and salt (same encryption library)
* *but*
  * there are lots of different ways to make pizza base
    * there are lots of different ways to make flour
    * there are lots of different ways to make yeast
  * there are lots of different ways to make cheese
    * there are lots of different ways to make milk
    * there are lots of different ways to make salt
* it costs money to *pre-make different types of* dough, cheese and pizza base so we want to only *pre-make* the ingredients for the pizzas that people want
* everything pre-built has a shelf life of 6 weeks (trust me on this one)
* *so*
* We want to pre-make only the most popular ingredients to maximise the *amount of time saved by* chefs making pizzas
* and then we can have quick pizza in rust land for the masses and no one needs to go hungry again
* ...does that make sense??

In [None]:
import plotly.io as pio
print(pio.renderers.default)

In [None]:
import plotly.express

In [None]:
import pandas

In [None]:
import psycopg2


In [None]:
trees = pandas.read_parquet("../subtrees-clean.parquet")

tree_counts = trees.groupby(
    ['package_name', 'package_version', 'hash'],
).agg(
    tree_size=('deps_count', 'first'),
    example_repo_path=('repo_path', 'first'),
    tree_occurrences=('repo_path', 'count'),
)

version_counts = tree_counts.groupby(
    ['package_name', 'package_version'],
).agg(
    version_occurrences=('tree_occurrences', 'sum'),
)


In [None]:
def call_cached(fn):
    import inspect
    import hashlib
    # Key the cache on the source of the function being cached, so edits to it invalidate the cache.
    digest = hashlib.sha256(inspect.getsource(fn).encode()).hexdigest()
    cache_filename = f"../cache/{fn.__name__}-{digest}.parquet"

    try:
        data = pandas.read_parquet(cache_filename)
        return data
    except FileNotFoundError:
        pass

    data = fn()
    data.to_parquet(cache_filename)
    # TODO:
    # * Delete every file matching f"../cache/{fn.__name__}-*.parquet"
    #   other than `cache_filename`.
    #   (or touch the file we used, if we want an LRU cache with size bigger than 1)
    # * Log cache misses and timings.
    return data

In [None]:
def get_downloads_data():
    conn = psycopg2.connect(
        database="cratesio",
    )
    downloads = pandas.read_sql_query("""
        select
            c.name as package_name, v.num as package_version, d.downloads
        from
            version_downloads as d
        join
            versions as v on v.id = d.version_id
        join
            crates as c on c.id = v.crate_id
        where
            date = '2021-03-29'
        order by
            package_name, package_version
        ;
    """, conn)
    return downloads

downloads = call_cached(get_downloads_data).set_index(['package_name', 'package_version'])

downloads

In [None]:
combined = tree_counts.join(
    version_counts,
    how='inner',
    on=['package_name', 'package_version']
).join(
    downloads,
    how='inner',
    on=['package_name', 'package_version'],
)
combined['estimated_daily_downloads'] = (
    combined['downloads'] * combined['tree_occurrences'] / combined['version_occurrences']
)


In [None]:
def plot_unscaled(df, *, x, y):
    plot_data = df.reset_index().drop_duplicates(subset=[x, y])
    fig = plotly.express.scatter(
        plot_data, x=x, y=y,
        # hover_name='package_name',
    )
    fig.update_traces(hovertemplate=None, hoverinfo='skip')
    return fig

plot_unscaled(combined, x='tree_occurrences', y='tree_size')

In [None]:
def plot_loglog(df, *, x, y, hover_name='package_name'):
    # Reset the index once so the index levels become plottable columns, then drop overlapping points.
    plot_data = df.reset_index().drop_duplicates(subset=[x, y])
    fig = plotly.express.scatter(
        plot_data, x=x, y=y, hover_name=hover_name,
        log_x=True, log_y=True,
    )
    # fig.update_traces(hovertemplate=None, hoverinfo='skip')
    return fig

plot_loglog(combined, x='tree_occurrences', y='tree_size')

In [None]:
plot_loglog(combined, x='estimated_daily_downloads', y='tree_size')

## Analysis

The utility gained by building a package tree is proportional to the number of users (`tree_occurrences`) multiplied by the size of the tree (`tree_size`):

    utility = tree_occurrences * tree_size
    log(utility) = log(tree_occurrences * tree_size)
    log(utility) = log(tree_occurrences) + log(tree_size)

Therefore, lines of constant utility are straight lines with slope -1 in the log-log plot above. We want to build the packages that are furthest to the top-right on this plot, like `frame-benchmarking`, `mio` and `winapi`.
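
As a rough illustration of the ranking this implies, here is a minimal sketch on a made-up table. The rows and the names `toy` and `ranked` are hypothetical stand-ins for `combined`, not real data from the analysis.

```python
import numpy
import pandas

# Hypothetical toy rows standing in for `combined`; the numbers are invented.
toy = pandas.DataFrame({
    "package_name": ["a", "b", "c"],
    "tree_occurrences": [1000, 10, 100],
    "tree_size": [5, 500, 80],
})

# utility = tree_occurrences * tree_size
toy["utility"] = toy["tree_occurrences"] * toy["tree_size"]

# The same quantity in log space: the sum of the two log coordinates.
toy["log_utility"] = (
    numpy.log10(toy["tree_occurrences"]) + numpy.log10(toy["tree_size"])
)

# Highest utility first: these are the trees worth pre-building.
ranked = toy.sort_values("utility", ascending=False)
print(ranked)
```

Packages `a` and `b` sit at opposite corners of the plot but have the same utility, so they lie on the same constant-utility line. Package `c` lies above that line and ranks first.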

In practice, `tree_occurrences` only says how many *GitHub repos* are using particular configurations of the crates. We want to know how many *people* are using them. For this, we need to use the crates.io download data.
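
The `estimated_daily_downloads` column computed earlier does this by splitting each version's downloads across its observed dependency trees, in proportion to how often each tree occurs. Below is a minimal sketch of that apportioning on invented numbers. The crate name, hashes and the `estimated_utility` column are hypothetical, and using downloads in place of `tree_occurrences` in the utility formula is an assumption rather than something the analysis above commits to.

```python
import pandas

# Hypothetical: one crate version that resolved to two distinct dependency trees.
toy = pandas.DataFrame({
    "package_name": ["example-crate", "example-crate"],
    "package_version": ["1.0.0", "1.0.0"],
    "hash": ["tree-a", "tree-b"],
    "tree_size": [10, 12],
    "tree_occurrences": [30, 10],
    "downloads": [2000, 2000],  # same version, so same download count
})

# Total occurrences of the version across all of its trees.
toy["version_occurrences"] = toy.groupby(
    ["package_name", "package_version"]
)["tree_occurrences"].transform("sum")

# Each tree gets a share of the downloads proportional to how often it was seen.
toy["estimated_daily_downloads"] = (
    toy["downloads"] * toy["tree_occurrences"] / toy["version_occurrences"]
)

# Assumed variant of the utility formula, with downloads standing in for users.
toy["estimated_utility"] = toy["estimated_daily_downloads"] * toy["tree_size"]
print(toy)
```

The per-tree estimates always sum back to the version's download count, so no downloads are double-counted when a version has many trees.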

## Caveats

The set of requested features does not seem to appear in the lockfile, only in Cargo.toml. In a lot of cases, changing features will add extra dependencies, but not always. We may need to redo the analysis and create a new `subtrees-clean.parquet` that adds a "features" field to each dependency in the tree. This could be a lot of work for not much payoff, though. It might be best to just keep in mind that the fragmentation will be a bit worse than what you see on these graphs.

I am assuming that build time is proportional to dependency tree size. In reality, it is likely to also scale with crate size (available as `versions.crate_size` in the crates.io PostgreSQL dump) and with how heavily the crate uses proc-macros and generics from its dependencies. Predicting crate build times would be a really interesting project, if anyone has a dataset (maybe crater has one?).