# Exploration of Python packages extended/mixed with Rust

Let's analyse the adoption of Rust in "popular" Python packages to understand how mainstream packages are using it,
and then use GitHub + PyPI data to explore packages using similar approaches.

# Data collection

All the data explored here was collected in [another notebook](download_data.ipynb).
Unfortunately, I didn't add comments or explanations on why/how things were done there, but I'll comment
on it as we explore the data.

# Methodology

## Identifying a rust package
We will consider a Python package to be a Python + Rust package if it has a `Cargo.toml` in its source code.
To inspect a package's source code, data was collected from the PyPI API for the packages
whose source code was uploaded there.
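As a rough illustration of the "is the source code on PyPI?" check, here is a minimal sketch that picks the source distribution out of a PyPI JSON API response. The helper names `pypi_json_url` and `find_sdist_url` are made up for this example and are not necessarily what the download notebook does; the sketch only assumes the `urls` / `packagetype` / `url` fields of the JSON API.

```python
from typing import Optional


def pypi_json_url(pkg_name: str, version: Optional[str] = None) -> str:
    """Build the PyPI JSON API URL for a package (optionally a pinned version)."""
    if version:
        return f"https://pypi.org/pypi/{pkg_name}/{version}/json"
    return f"https://pypi.org/pypi/{pkg_name}/json"


def find_sdist_url(metadata: dict) -> Optional[str]:
    """Return the download URL of the sdist, or None if only wheels were uploaded."""
    for file_info in metadata.get("urls", []):
        if file_info.get("packagetype") == "sdist":
            return file_info.get("url")
    return None
```

A package for which `find_sdist_url` returns `None` would end up in the "PyPI doesn't have the source code" bucket.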

## What was collected from python sources
- path to `Cargo.toml` and `Cargo.lock` files (ignoring case)
- build dependencies (using a development version of [pybuild-deps](https://pypi.org/project/pybuild-deps/))
- content of `Cargo.lock` files
- errors retrieving/parsing the source code (if any)
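The first item in the list above can be sketched as a case-insensitive filter over the file names of a source archive. This is only an illustration (the function name `find_cargo_files` is hypothetical), not the exact code used to build the dataset.

```python
from pathlib import PurePosixPath


def find_cargo_files(member_names):
    """Split archive member names into Cargo.toml and Cargo.lock paths (ignoring case)."""
    manifests, lockfiles = [], []
    for name in member_names:
        # Only the file name matters, not the directory it lives in.
        basename = PurePosixPath(name).name.lower()
        if basename == "cargo.toml":
            manifests.append(name)
        elif basename == "cargo.lock":
            lockfiles.append(name)
    return manifests, lockfiles
```

The lengths of the two lists correspond to what the dataset calls `cargo_toml_qty` and `cargo_lock_qty`.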

# Popular packages

For all intents and purposes, I'm going to treat "popular" as a synonym for "most downloaded". That's not necessarily
true, but we need to start somewhere.

Data was collected from https://pythonwheels.com, which lists the 360 most downloaded packages in the last 30 days.


The dataset from pythonwheels.com only contains the package name and the number of downloads.
Sources were collected for the latest (as of Oct 11 2023) version of the packages listed there.


In [1]:
import json
from pathlib import Path

import pandas as pd

In [2]:
wheels_dataset = json.loads(Path("results/cargo-research-pythonwheel.json").read_text())
df_wheels = pd.DataFrame(wheels_dataset)
df_wheels.head()

Unnamed: 0,pkg_name,pkg_version,build_deps,cargo_toml_qty,cargo_lock_qty,cargo_toml,cargo_lock,cargo_lock_contents,downloads,error,exc_class
0,boto3,1.28.62,[],0.0,0.0,[],[],[],750231089,,
1,urllib3,2.0.6,[hatchling],0.0,0.0,[],[],[],378371521,,
2,botocore,1.31.62,[],0.0,0.0,[],[],[],330557585,,
3,requests,2.31.0,[],0.0,0.0,[],[],[],323394974,,
4,setuptools,68.2.2,[],0.0,0.0,[],[],[],297651249,,


In [3]:
print("number of packages not collected/parsed:", df_wheels[~df_wheels.error.isna()].shape[0])
df_wheels[~df_wheels.error.isna()][["pkg_name", "pkg_version", "downloads", "error"]]

number of packages not collected/parsed: 11


Unnamed: 0,pkg_name,pkg_version,downloads,error
138,azure-common,1.1.28,31844917,file could not be opened successfully:\n- meth...
143,msrest,0.7.1,31550696,file could not be opened successfully:\n- meth...
165,azure-identity,1.14.1,28553071,file could not be opened successfully:\n- meth...
174,tb-nightly,2.15.0a20231010,26890751,PyPI doesn't have the source code for package ...
205,redshift-connector,2.0.914,22448408,PyPI doesn't have the source code for package ...
226,tensorboard,2.14.1,20139353,PyPI doesn't have the source code for package ...
266,tensorflow,2.14.0,16119009,PyPI doesn't have the source code for package ...
285,debugpy,1.8.0,14581674,file could not be opened successfully:\n- meth...
302,tensorflow-estimator,2.14.0,13325343,PyPI doesn't have the source code for package ...
329,tensorboard-data-server,0.7.1,12094937,PyPI doesn't have the source code for package ...


Let's start by looking at the failures (I like to know the caveats of the process) before jumping to the more interesting stuff:
out of the 360 most "popular" packages, only 11 couldn't be downloaded/parsed.

The ones that couldn't be opened hit a bug in `pybuild-deps`: it naively assumes sources on PyPI are always `tar.gz`, while these seem to be `zip`.
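One way to avoid this kind of failure is to detect the archive type before opening it. The snippet below is a minimal sketch using only the standard library; `list_sdist_members` is a hypothetical helper and not how `pybuild-deps` is (or will be) implemented.

```python
import tarfile
import zipfile


def list_sdist_members(path):
    """List the file names inside a source archive, whether it is a tarball or a zip."""
    if tarfile.is_tarfile(path):
        # The default mode "r" transparently handles gzip/bz2/xz compression.
        with tarfile.open(path) as archive:
            return archive.getnames()
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    raise ValueError(f"unsupported archive format: {path}")
```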

The rest of the errors are for packages that don't have sources available on pypi.

# Rusty popular Python packages

In [4]:
df_wheels[(df_wheels.cargo_toml_qty > 0) | (df_wheels.cargo_lock_qty > 0)]\
[["pkg_name", "pkg_version", "downloads", "build_deps", "cargo_toml_qty", "cargo_lock_qty"]]

Unnamed: 0,pkg_name,pkg_version,downloads,build_deps,cargo_toml_qty,cargo_lock_qty
14,cryptography,41.0.4,186136558,"[wheel, cffi, setuptools, setuptools-rust]",4.0,1.0
98,rpds-py,0.10.4,44442080,[maturin],1.0,1.0
113,bcrypt,4.0.1,38478981,"[wheel, setuptools, setuptools-rust]",1.0,1.0
242,pydantic-core,2.10.1,18787799,"[maturin, typing-extensions]",1.0,1.0
317,orjson,3.9.8,12653856,[maturin],37.0,7.0
338,tokenizers,0.14.1,11359598,[maturin],2.0,1.0
355,pre-commit,3.4.0,10899315,[],1.0,0.0


Excluding pre-commit, all popular rusty packages are built with either `setuptools-rust` or `maturin`.

Interestingly, some of those packages have multiple manifest (`Cargo.toml`) files. We'll take a closer look at this later.

In the future I'd like to look at a bigger dataset of PyPI packages to find out what else is used to build
python+rust packages (after a quick Google search I found [rust-cpython](https://github.com/dgrunwald/rust-cpython), but it is deprecated).
For now, let's look at packages that use `setuptools-rust` and/or `maturin`.

# GitHub projects that depend on setuptools-rust and/or maturin

Repos depending on `maturin` or `setuptools-rust` were collected using the `github-dependents-info` CLI, which reports the repo name and number of stars for each dependent.

Repos with at least 10 stars were selected and then looked up on PyPI: the repo name was used as the package name and, if nothing on PyPI had that exact name, README.md/README.txt was parsed looking for a link to PyPI. We could have taken the sources directly from GitHub instead of relying on PyPI only - we mostly went with this approach to reuse the code that gathered data for the wheels dataset. Getting data directly from GitHub is something to try in the future.
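The README fallback can be sketched with a regular expression that looks for a PyPI project link. This is an approximation written for illustration: the pattern and the `pypi_name_from_readme` helper are assumptions, and the real collection code may match links differently.

```python
import re
from typing import Optional

# Matches both the current (pypi.org/project/<name>) and the legacy
# (pypi.python.org/pypi/<name>) project URLs.
PYPI_LINK = re.compile(
    r"https?://pypi\.(?:org/project|python\.org/pypi)/([A-Za-z0-9][A-Za-z0-9._-]*)",
    re.IGNORECASE,
)


def pypi_name_from_readme(readme_text: str) -> Optional[str]:
    """Return the first PyPI project name linked from a README, if any."""
    match = PYPI_LINK.search(readme_text)
    if match is None:
        return None
    # Drop punctuation that belongs to the surrounding sentence, not to the name.
    return match.group(1).rstrip("._-")
```

Taking the first link is a simplification: a README that links to several PyPI projects may point at a dependency rather than the package itself.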

In [5]:
github_dataset = json.loads(Path("results/cargo-research-dependants-unique.json").read_text())
df_github = pd.DataFrame(github_dataset)
df_github.shape

(358, 12)

In [6]:
is_rust_mask = (df_github.cargo_toml_qty > 0) | (df_github.cargo_lock_qty > 0)
df_github[is_rust_mask].shape

(74, 12)

In [7]:
df_github.error.notna().sum()

57

In [8]:
df_github[df_github.error.notna()].error.str.count("PyPI doesn't have the source code for package").sum()

55

In [9]:
df_github.is_on_pypi.sum()

206

In [10]:
df_github[df_github.is_on_pypi & ~is_rust_mask & df_github.error.isna()]

Unnamed: 0,repo,is_on_pypi,pkg_name,pkg_version,build_deps,cargo_toml_qty,cargo_lock_qty,cargo_toml,cargo_lock,cargo_lock_contents,error,exc_class
0,certbot/certbot,True,certbot,2.7.1,[],0.0,0.0,[],[],[],,
3,InstaPy/InstaPy,True,instapy,0.6.16,[],0.0,0.0,[],[],[],,
4,ansible/awx,True,awx,0.1.1,[],0.0,0.0,[],[],[],,
6,edgedb/edgedb,True,edgedb,1.7.0,[setuptools],0.0,0.0,[],[],[],,
7,matrix-org/synapse,True,synapse,2.151.0,"[wheel, setuptools]",0.0,0.0,[],[],[],,
...,...,...,...,...,...,...,...,...,...,...,...,...
320,christoftorres/Elysium,True,elysium,0.1.0,[poetry-core],0.0,0.0,[],[],[],,
333,chyalexcheng/grainLearning,True,grainlearning,2.0.1,[poetry-core],0.0,0.0,[],[],[],,
346,UnitedTraders/nginxauthdaemon,True,nginxauthdaemon,1.1.0,[],0.0,0.0,[],[],[],,
354,thane98/ignis,True,ignis,0.0.10,[],0.0,0.0,[],[],[],,


In total, 358 repos were collected. Rust sources were found in only 74 of those. Why?

First of all, only 206 packages were available on PyPI. Out of those, 55 had no source code on PyPI, 2 were packaged as zip instead of tar (`pybuild-deps` will need an update to support zip sources), and 75 actually don't have any Rust sources. Checking some of the latter, it seems they depend on libraries like cryptography, which in turn depends on setuptools-rust.

In [11]:
df_ = df_wheels.merge(df_github[["repo", "pkg_name"]], how="left", on="pkg_name")
df_ = pd.concat([df_, df_github])
df_ = df_[~df_.pkg_name.duplicated()].copy()
has_rust_src = (df_.cargo_toml_qty > 0) | (df_.cargo_lock_qty > 0)
df = df_[has_rust_src].copy()
df = df.reset_index()
df = df[['pkg_name', 'pkg_version','repo', 'downloads', 'build_deps', 'cargo_toml_qty',
       'cargo_lock_qty', 'cargo_toml', 'cargo_lock', 'cargo_lock_contents',
       ]].copy()
df.shape

(79, 10)

For the following analysis only the "rustified" packages were considered, a total of 79 after merging the wheels and GitHub datasets.

# Build dependencies

In [12]:
set(d for deps in df.build_deps for d in deps)

{'cffi',
 'colorama',
 'maturin',
 'numpy',
 'poetry-core',
 'protoc-wheel-0',
 'setuptools',
 'setuptools-rust',
 'setuptools-scm',
 'setuptools-scm-git-archive',
 'setuptools_rust',
 'setuptools_scm',
 'toml',
 'tomli',
 'tqdm',
 'typing-extensions',
 'wheel'}

In [13]:
depends_on_maturin = df.build_deps.apply(lambda x: "maturin" in x)
depends_on_setuptools_rust = df.build_deps.apply(lambda x: "setuptools-rust" in x or "setuptools_rust" in x)
print("packages depending on maturin: ", depends_on_maturin.sum())
print("packages depending on setuptools-rust: ", depends_on_setuptools_rust.sum())
print("packages depending on both: ", (depends_on_maturin & depends_on_setuptools_rust).sum())
print("packages depending on neither: ", (~depends_on_maturin & ~depends_on_setuptools_rust).sum())

packages depending on maturin:  48
packages depending on setuptools-rust:  25
packages depending on both:  2
packages depending on neither:  8
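The set of build dependencies above contains both `setuptools-rust` and `setuptools_rust` (and likewise `setuptools-scm` / `setuptools_scm`), which is why the check has to test two spellings. An alternative, shown here only as a sketch, is to normalize names the way PEP 503 describes before comparing them; `normalize_name` and `uses_backend` are hypothetical helpers, not part of the notebook.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503 (runs of -, _ and . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def uses_backend(build_deps, backend: str) -> bool:
    """Check whether a list of build dependency names includes the given backend."""
    wanted = normalize_name(backend)
    return any(normalize_name(dep) == wanted for dep in build_deps)
```

With these helpers, something like `df.build_deps.apply(lambda deps: uses_backend(deps, "setuptools-rust"))` should cover both spellings at once, assuming `build_deps` holds bare package names as in the output above.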


In [17]:
df[depends_on_maturin & depends_on_setuptools_rust]

Unnamed: 0,pkg_name,pkg_version,repo,downloads,build_deps,cargo_toml_qty,cargo_lock_qty,cargo_toml,cargo_lock,cargo_lock_contents
50,fast_mail_parser,0.2.5,namecheap/fast_mail_parser,,"[wheel, maturin, setuptools-rust]",1.0,0.0,[fast_mail_parser-0.2.5/Cargo.toml],[],[]
78,peace-performance-python,2.0.0,Pure-Peace/peace-performance-python,,"[wheel, maturin, toml, setuptools-rust, setupt...",1.0,0.0,[peace-performance-python-2.0.0/Cargo.toml],[],[]


In [14]:
df[~depends_on_maturin & ~depends_on_setuptools_rust]

Unnamed: 0,pkg_name,pkg_version,repo,downloads,build_deps,cargo_toml_qty,cargo_lock_qty,cargo_toml,cargo_lock,cargo_lock_contents
6,pre-commit,3.4.0,,10899315.0,[],1.0,0.0,[pre_commit-3.4.0/pre_commit/resources/empty_t...,[],[]
16,autopy,4.0.0,autopilot-rs/autopy,,[],1.0,0.0,[autopy-4.0.0/Cargo.toml],[],[]
18,setuptools-rust,1.7.0,PyO3/setuptools-rust,,"[setuptools_scm, setuptools]",6.0,6.0,[setuptools-rust-1.7.0/examples/hello-world/Ca...,[setuptools-rust-1.7.0/examples/hello-world/Ca...,"[{'version': 3, 'package': [{'name': 'autocfg'..."
26,tantivy-py,0.11.0-rc.7,quickwit-oss/tantivy-py,,[],1.0,0.0,[Cargo.toml],[],[]
54,perde,0.0.2,YushiOMOTE/perde,,[],1.0,0.0,[Cargo.toml],[],[]
70,snips-nlu-parsers,0.4.3,snipsco/snips-nlu-parsers,,[],1.0,0.0,[snips_nlu_parsers-0.4.3/ffi/Cargo.toml],[],[]
71,flaco,0.6.0,milesgranger/flaco,,"[wheel, setuptools]",1.0,1.0,[flaco-0.6.0/Cargo.toml],[flaco-0.6.0/Cargo.lock],"[{'version': 3, 'package': [{'name': 'ahash', ..."
74,snips-nlu-utils,0.9.1,snipsco/snips-nlu-utils,,[],1.0,0.0,[snips_nlu_utils-0.9.1/ffi/Cargo.toml],[],[]


# Number of manifest files (Cargo.toml)

In [15]:
df.cargo_toml_qty.describe()

count     79.000000
mean       5.227848
std       18.012780
min        1.000000
25%        1.000000
50%        1.000000
75%        1.500000
max      146.000000
Name: cargo_toml_qty, dtype: float64

In [16]:
df[df.cargo_toml_qty > 1]

Unnamed: 0,pkg_name,pkg_version,repo,downloads,build_deps,cargo_toml_qty,cargo_lock_qty,cargo_toml,cargo_lock,cargo_lock_contents
0,cryptography,41.0.4,pyca/cryptography,186136558.0,"[wheel, cffi, setuptools, setuptools-rust]",4.0,1.0,"[cryptography-41.0.4/src/rust/Cargo.toml, cryp...",[cryptography-41.0.4/src/rust/Cargo.lock],"[{'version': 3, 'package': [{'name': 'Inflecto..."
4,orjson,3.9.8,ijl/orjson,12653856.0,[maturin],37.0,7.0,"[orjson-3.9.8/Cargo.toml, orjson-3.9.8/include...","[orjson-3.9.8/Cargo.lock, orjson-3.9.8/include...","[{'version': 3, 'package': [{'name': 'ahash', ..."
5,tokenizers,0.14.1,,11359598.0,[maturin],2.0,1.0,"[tokenizers-0.14.1/tokenizers/Cargo.toml, toke...",[tokenizers-0.14.1/bindings/python/Cargo.lock],"[{'version': 3, 'package': [{'name': 'aho-cora..."
7,polars,0.19.8,pola-rs/polars,,[maturin],19.0,1.0,"[polars-0.19.8/crates/polars-io/Cargo.toml, po...",[polars-0.19.8/py-polars/Cargo.lock],"[{'version': 3, 'package': [{'name': 'addr2lin..."
10,trustfall,0.1.6,obi1kenobi/trustfall,,"[maturin, poetry-core]",3.0,0.0,[trustfall-0.1.6/local_dependencies/trustfall_...,[],[]
13,chidori,0.1.26,ThousandBirdsInc/chidori,,[maturin],3.0,2.0,[chidori-0.1.26/local_dependencies/prompt-grap...,[chidori-0.1.26/local_dependencies/prompt-grap...,"[{'version': 3, 'package': [{'name': 'ahash', ..."
14,bagua,0.9.2,BaguaSys/bagua,,"[wheel, colorama, setuptools_scm, tqdm, setupt...",6.0,2.0,"[bagua-0.9.2/rust/bagua-core/Cargo.toml, bagua...","[bagua-0.9.2/rust/bagua-core/Cargo.lock, bagua...","[{'version': 3, 'package': [{'name': 'addr2lin..."
15,pyoxigraph,0.3.19,oxigraph/oxigraph,,[maturin],7.0,1.0,[pyoxigraph-0.3.19/local_dependencies/oxrdf/Ca...,[pyoxigraph-0.3.19/Cargo.lock],"[{'version': 3, 'package': [{'name': 'adler', ..."
17,stencila,2.0.0a15,stencila/stencila,,[maturin],31.0,1.0,"[stencila-2.0.0a15/rust/codec/Cargo.toml, sten...",[stencila-2.0.0a15/Cargo.lock],"[{'version': 3, 'package': [{'name': 'Inflecto..."
18,setuptools-rust,1.7.0,PyO3/setuptools-rust,,"[setuptools_scm, setuptools]",6.0,6.0,[setuptools-rust-1.7.0/examples/hello-world/Ca...,[setuptools-rust-1.7.0/examples/hello-world/Ca...,"[{'version': 3, 'package': [{'name': 'autocfg'..."
