# Exploration of Python packages extended/mixed with Rust

Let's analyse the adoption of Rust in "popular" Python packages to understand how mainstream packages are using it,
and then use GitHub + PyPI data to explore packages using similar approaches.

# Data collection

All the data explored here was collected in [another notebook](download_data.ipynb).
Unfortunately, I didn't add comments or explanations there on why/how things were done, but I'll comment
on it as we explore the data.

# Methodology

## Identifying a Rust package
We will consider a Python + Rust package to be any Python package with a `Cargo.toml` in its source code.
To inspect a package's source code, data was collected from the PyPI API for packages
whose source code was uploaded.
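
As a rough illustration of that lookup (the actual collection code lives in the other notebook and may differ), the PyPI JSON API at `https://pypi.org/pypi/<name>/json` lists the files of the latest release under `urls`, each tagged with a `packagetype`. The helper names `find_sdist_url` and `fetch_metadata` below are hypothetical.

```python
import json
from urllib.request import urlopen


def find_sdist_url(metadata):
    """Return the URL of the first sdist in a PyPI JSON API response, or None."""
    for file_info in metadata.get("urls", []):
        if file_info.get("packagetype") == "sdist":
            return file_info["url"]
    # No sdist uploaded: the package only ships wheels.
    return None


def fetch_metadata(pkg_name):
    """Download metadata for the latest release (needs network access)."""
    with urlopen(f"https://pypi.org/pypi/{pkg_name}/json") as response:
        return json.load(response)
```

A package for which `find_sdist_url` returns `None` has no source distribution on PyPI, so it can't be inspected this way.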

## What was collected from Python sources
- path to `Cargo.toml` and `Cargo.lock` files (ignoring case)
- build dependencies (using a development version of [pybuild-deps](https://pypi.org/project/pybuild-deps/))
- content of `Cargo.lock` files
- errors retrieving/parsing the source code (if any)
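
A minimal sketch of the first item, assuming we already have the list of member paths from an unpacked or opened source archive. The function name `find_cargo_files` is made up for illustration; the two returned lists correspond to what the `cargo_toml` and `cargo_lock` columns hold later on.

```python
from pathlib import PurePosixPath


def find_cargo_files(member_names):
    """Split archive member paths into Cargo.toml and Cargo.lock paths.

    File names are compared case-insensitively; directories are ignored
    because only the final path component is checked.
    """
    cargo_toml, cargo_lock = [], []
    for name in member_names:
        basename = PurePosixPath(name).name.lower()
        if basename == "cargo.toml":
            cargo_toml.append(name)
        elif basename == "cargo.lock":
            cargo_lock.append(name)
    return cargo_toml, cargo_lock
```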

# Popular packages

For all intents and purposes, I'm going to treat "popular" as a synonym for "most downloaded". That's not necessarily
true, but we need to start somewhere.

Data was collected from https://pythonwheels.com, which lists the 360 most downloaded packages in the last 30 days.


The dataset from pythonwheels.com only contains the package name and the number of downloads.
Sources were collected for the latest (as of Oct 11 2023) version of the packages listed there.


In [1]:
import json
from pathlib import Path

import pandas as pd

In [2]:
wheels_dataset = json.loads(Path("results/cargo-research-pythonwheel.json").read_text())
df_wheels = pd.DataFrame(wheels_dataset)
df_wheels.head()

Unnamed: 0,pkg_name,pkg_version,build_deps,cargo_toml_qty,cargo_lock_qty,cargo_toml,cargo_lock,cargo_lock_contents,downloads,error,exc_class
0,boto3,1.28.62,[],0.0,0.0,[],[],[],750231089,,
1,urllib3,2.0.6,[hatchling],0.0,0.0,[],[],[],378371521,,
2,botocore,1.31.62,[],0.0,0.0,[],[],[],330557585,,
3,requests,2.31.0,[],0.0,0.0,[],[],[],323394974,,
4,setuptools,68.2.2,[],0.0,0.0,[],[],[],297651249,,


In [3]:
print("number of packages not collected/parsed:", df_wheels[~df_wheels.error.isna()].shape[0])
df_wheels[~df_wheels.error.isna()][["pkg_name", "pkg_version", "downloads", "error"]]

number of packages not collected/parsed: 11


Unnamed: 0,pkg_name,pkg_version,downloads,error
138,azure-common,1.1.28,31844917,file could not be opened successfully:\n- meth...
143,msrest,0.7.1,31550696,file could not be opened successfully:\n- meth...
165,azure-identity,1.14.1,28553071,file could not be opened successfully:\n- meth...
174,tb-nightly,2.15.0a20231010,26890751,PyPI doesn't have the source code for package ...
205,redshift-connector,2.0.914,22448408,PyPI doesn't have the source code for package ...
226,tensorboard,2.14.1,20139353,PyPI doesn't have the source code for package ...
266,tensorflow,2.14.0,16119009,PyPI doesn't have the source code for package ...
285,debugpy,1.8.0,14581674,file could not be opened successfully:\n- meth...
302,tensorflow-estimator,2.14.0,13325343,PyPI doesn't have the source code for package ...
329,tensorboard-data-server,0.7.1,12094937,PyPI doesn't have the source code for package ...


Let's start by looking at the failures (I like to know the caveats of the process) before jumping to the more interesting stuff:
out of the 360 most "popular" packages, only 11 couldn't be downloaded/parsed.

The ones that couldn't be parsed hit a bug in `pybuild-deps`: it naively assumes sources on PyPI are always `tar.gz`, while these seem to be `zip`.

The rest of the errors are for packages that don't have sources available on PyPI.
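
One way to sidestep the archive-format assumption is to check for a zip file before falling back to tar. This is only a sketch using the standard library, not how `pybuild-deps` works; `list_sdist_members` is a hypothetical helper and expects a path to a file already downloaded to disk.

```python
import tarfile
import zipfile


def list_sdist_members(path):
    """Return the member names of an sdist stored as either zip or tar."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    # The default mode "r" lets tarfile detect gz/bz2/xz compression itself.
    with tarfile.open(path) as archive:
        return archive.getnames()
```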

# Rusty popular Python packages

In [4]:
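# keep only packages that ship at least one Cargo.toml or Cargo.lock
has_cargo_files = (df_wheels.cargo_toml_qty > 0) | (df_wheels.cargo_lock_qty > 0)
df_wheels[has_cargo_files][
    ["pkg_name", "pkg_version", "downloads", "build_deps", "cargo_toml_qty", "cargo_lock_qty"]
]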

Unnamed: 0,pkg_name,pkg_version,downloads,build_deps,cargo_toml_qty,cargo_lock_qty
14,cryptography,41.0.4,186136558,"[wheel, cffi, setuptools, setuptools-rust]",4.0,1.0
98,rpds-py,0.10.4,44442080,[maturin],1.0,1.0
113,bcrypt,4.0.1,38478981,"[wheel, setuptools, setuptools-rust]",1.0,1.0
242,pydantic-core,2.10.1,18787799,"[maturin, typing-extensions]",1.0,1.0
317,orjson,3.9.8,12653856,[maturin],37.0,7.0
338,tokenizers,0.14.1,11359598,[maturin],2.0,1.0
355,pre-commit,3.4.0,10899315,[],1.0,0.0


Excluding pre-commit, all popular rusty packages are built with either `setuptools-rust` or `maturin`.

Interestingly, some of these packages have multiple manifest (`Cargo.toml`) files. I'll take a closer look at this later.

In the future I'd like to look at a bigger dataset of PyPI packages to find out what else is used to build
Python + Rust packages (after a quick Google search I found [rust-cpython](https://github.com/dgrunwald/rust-cpython)).
For now, let's look at packages that use `setuptools-rust` and/or `maturin`.
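
A small sketch of how that selection could be expressed, assuming `build_deps` holds plain package names (as the table above suggests) rather than full requirement strings with version specifiers. The names `RUST_BUILD_BACKENDS` and `uses_rust_build_backend` are made up for illustration.

```python
RUST_BUILD_BACKENDS = {"setuptools-rust", "maturin"}


def uses_rust_build_backend(build_deps):
    """True if any build dependency is setuptools-rust or maturin."""
    # Rows that failed to download/parse may hold NaN instead of a list.
    if not isinstance(build_deps, (list, tuple)):
        return False
    return any(dep in RUST_BUILD_BACKENDS for dep in build_deps)
```

With the DataFrame loaded earlier, this would be applied as `df_wheels[df_wheels.build_deps.apply(uses_rust_build_backend)]`.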