# Basic JUMP data access

Alán F. Muñoz  
2024-03-19

This tutorial shows how to access JUMP data. We will use polars to fetch the
data frames lazily, with the help of s3fs and pyarrow. We prefer lazy loading
because the data can be too large to fit in memory.

In [1]:
import polars as pl
from pyarrow.dataset import dataset
from s3fs import S3FileSystem

The available datasets are:

1.  crispr: Knock-out genetic perturbations.
2.  orf: Overexpression genetic perturbations.
3.  compound: Chemical perturbations.

The AWS S3 paths of the data frames are shown below:

In [2]:
prefix = (
    "s3://cellpainting-gallery/cpg0016-jump-integrated/source_all/workspace/profiles"
)
filepaths = dict(
    crispr=f"{prefix}/chandrasekaran_2024_0000000/crispr/wellpos_cellcount_mad_outlier_nan_featselect_harmony.parquet",
    orf=f"{prefix}/chandrasekaran_2024_0000000/orf/wellpos_cellcount_mad_outlier_nan_featselect_harmony.parquet",
    compound=f"{prefix}/arevalo_2023_e834481/compound/mad_int_featselect_harmony.parquet/",
)

We use an anonymous S3FileSystem (`anon=True`), so no AWS credentials are needed.

In [3]:
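def lazy_load(path: str) -> pl.LazyFrame:
    # Anonymous access, so no AWS credentials are required.
    fs = S3FileSystem(anon=True)
    # Describe the Parquet data at `path` as a pyarrow dataset.
    myds = dataset(path, filesystem=fs)
    # Wrap the dataset in a polars LazyFrame; rows are read only on collect().
    df = pl.scan_pyarrow_dataset(myds)
    return df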

We lazily load each dataset to count its columns.

In [4]:
print("Width, or number of columns.")
for name, path in filepaths.items():
    data = lazy_load(path)
    metadata_cols = [col for col in data.columns if col.startswith("Metadata")]
    print(f"{name}: {data.width}, containing {len(metadata_cols)} metadata columns")

Width, or number of columns.
crispr: 1119, containing 8 metadata columns
orf: 882, containing 20 metadata columns
compound: 979, containing 10 metadata columns

Let us now focus on the crispr dataset.

In [5]:
data = lazy_load(filepaths["crispr"])