# Basic JUMP data access

Alán F. Muñoz  
2024-03-26

This is a tutorial on how to access JUMP-Cellpainting data. We will use
polars to fetch the data frame lazily, with the help of s3fs and
pyarrow. We prefer lazy loading because the data can be too big to be
handled in memory.

In [1]:
import polars as pl
from pyarrow.dataset import dataset
from s3fs import S3FileSystem

Three datasets are available, one per perturbation type:

1.  crispr: Knock-out genetic perturbations.
2.  orf: Overexpression genetic perturbations.
3.  compound: Chemical perturbations.

The AWS S3 paths of the dataframes are shown below:

In [2]:
prefix = (
    "s3://cellpainting-gallery/cpg0016-jump-integrated/source_all/workspace/profiles"
)
filepaths = dict(
    crispr=f"{prefix}/chandrasekaran_2024_0000000/crispr/wellpos_cellcount_mad_outlier_nan_featselect_harmony.parquet",
    orf=f"{prefix}/chandrasekaran_2024_0000000/orf/wellpos_cellcount_mad_outlier_nan_featselect_harmony.parquet",
    compound=f"{prefix}/arevalo_2023_e834481/compound/mad_int_featselect_harmony.parquet/",
)
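
As a small aside, an `s3://` path consists of a bucket name followed by an object key. The sketch below splits the tutorial's prefix into those two parts using only the standard library. `split_s3_uri` is a hypothetical helper name introduced here for illustration.

```python
from urllib.parse import urlparse

# Same prefix as in the tutorial.
PREFIX = "s3://cellpainting-gallery/cpg0016-jump-integrated/source_all/workspace/profiles"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(PREFIX)
print(bucket)
print(key)
```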

We use an S3FileSystem with anonymous access (anon=True), so no AWS credentials are needed.

In [3]:
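def lazy_load(path: str) -> pl.LazyFrame:
    """Return a lazy view of the parquet data at `path`; no rows are loaded yet."""
    fs = S3FileSystem(anon=True)  # anonymous access, no credentials required
    myds = dataset(path, filesystem=fs)  # pyarrow dataset backed by S3
    df = pl.scan_pyarrow_dataset(myds)  # wrap it as a polars LazyFrame
    return df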

We will lazy-load the dataframes and tabulate, for each one, the number
of rows, columns and metadata columns, along with an estimated size.

In [4]:
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = lazy_load(path)
    # Count rows without materialising the table (newer polars spells this pl.len()).
    n_rows = data.select(pl.count()).collect().item()
    # Columns whose names start with "Metadata" are the metadata columns.
    metadata_cols = data.select(pl.col("^Metadata.*$")).columns
    n_cols = data.width
    n_meta_cols = len(metadata_cols)
    # Rough estimate at about 4.03 bytes per value, converted from bytes to MB.
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)
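
The size estimate in the loop is plain arithmetic: bytes per value times the number of cells, divided by 1e6. The sketch below isolates it in a hypothetical helper, `estimate_size_mb`. The constant 4.03 is taken from the code above; that it reflects roughly four bytes per stored value plus a small overhead is an assumption, not something the source states. The row and column counts are made up.

```python
def estimate_size_mb(n_rows: int, n_cols: int, bytes_per_value: float = 4.03) -> int:
    """Estimate the in-memory size of a table in megabytes (hypothetical helper)."""
    # bytes per value * number of cells gives bytes; dividing by 1e6 gives MB.
    return int(round(bytes_per_value * n_rows * n_cols / 1e6, 0))


# Made-up example: 100,000 rows by 1,000 columns.
size = estimate_size_mb(100_000, 1_000)
print(size)  # 403
```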

Let us now focus on the crispr dataset and use a regex to select the
metadata columns. We then sample five rows and display them. Note that
the collect() method executes the query and loads the result into memory.

In [5]:
data = lazy_load(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

The following line keeps only the non-metadata columns by excluding the
metadata columns, again sampling five rows:

In [6]:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

Finally, we can convert the result to pandas if we want to analyse it
with that tool. Keep in mind that a pandas dataframe lives entirely in
memory, so convert only data that has already been reduced to a
manageable size.

In [7]:
data_only.to_pandas()