# Basic JUMP data access

This is a tutorial on how to access profiles from the [JUMP Cell
Painting datasets](https://github.com/jump-cellpainting/datasets). We
will use polars to fetch the data frames lazily, with the help of `s3fs`
and `pyarrow`. We prefer lazy loading because the data can be too big to
be handled in memory.

In [2]:
import polars as pl
from pyarrow.dataset import dataset
from s3fs import S3FileSystem

The available datasets are:

1.  `cpg0016-jump[crispr]`: CRISPR knockout genetic perturbations.
2.  `cpg0016-jump[orf]`: Overexpression genetic perturbations.
3.  `cpg0016-jump[compound]`: Chemical perturbations.

Their exact locations are determined by the transformations that
produce the datasets. The S3 paths of the dataframes are built from the
prefix below:

In [4]:
_PREFIX = (
    "s3://cellpainting-gallery/cpg0016-jump-assembled/source_all/workspace/profiles"
)
_RECIPE = "jump-profiling-recipe_2024_a917fa7"

transforms = (
    (
        "CRISPR",
        "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected",
    ),
    ("ORF", "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony"),
    ("COMPOUND", "profiles_var_mad_int_featselect_harmony"),
)

filepaths = {
    dset: f"{_PREFIX}/{_RECIPE}/{dset}/{transform}/profiles.parquet"
    for dset, transform in transforms
}
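To see what one of these paths looks like, the f-string from the comprehension above can be applied to a single entry. This is a minimal sketch that only uses string formatting and values already defined in this tutorial; the helper name `build_path` is hypothetical and not part of any library.

```python
# Values copied from the cell above.
PREFIX = "s3://cellpainting-gallery/cpg0016-jump-assembled/source_all/workspace/profiles"
RECIPE = "jump-profiling-recipe_2024_a917fa7"


def build_path(dset: str, transform: str) -> str:
    """Hypothetical helper mirroring the f-string used to fill `filepaths`."""
    return f"{PREFIX}/{RECIPE}/{dset}/{transform}/profiles.parquet"


orf_path = build_path(
    "ORF", "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony"
)
print(orf_path)
```

Each path therefore has the form prefix, recipe, dataset name, transformation chain, and finally `profiles.parquet`.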

We use an anonymous `S3FileSystem` (`anon=True`) so that no AWS credentials are needed.

In [5]:
def lazy_load(path: str) -> pl.LazyFrame:
    fs = S3FileSystem(anon=True)
    myds = dataset(path, filesystem=fs)
    df = pl.scan_pyarrow_dataset(myds)
    return df

We will lazy-load the dataframes and print the number of rows and
columns, along with the number of metadata columns and an estimated size.

In [6]:
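# Collect one summary row per dataset.
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = lazy_load(path)
    # collect() executes the row-count query.
    # Note: newer Polars releases prefer pl.len() over pl.count().
    n_rows = data.select(pl.count()).collect().item()
    # Column names are resolved from the schema, without loading rows.
    metadata_cols = data.select(pl.col("^Metadata.*$")).columns
    n_cols = data.width
    n_meta_cols = len(metadata_cols)
    # Rough estimate: about 4.03 bytes per value, converted from bytes to MB.
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)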

pl.DataFrame(info)

dataset,#rows,#cols,#Metadata cols,Size (MB)
str,i64,i64,i64,i64
"""CRISPR""",51185,3677,4,758
"""ORF""",81663,3677,4,1210
"""COMPOUND""",804844,3677,4,11926

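The size column can be reproduced by hand from the row and column counts in the table. The sketch below reuses the 4.03 bytes-per-value factor from the cell above; the function name `estimate_mb` is hypothetical. The features are stored as `f32` (4 bytes each), and reading the extra 0.03 as overhead is an assumption, since the tutorial does not say where the factor comes from.

```python
# Row and column counts as reported in the table above.
shapes = {
    "CRISPR": (51185, 3677),
    "ORF": (81663, 3677),
    "COMPOUND": (804844, 3677),
}

BYTES_PER_VALUE = 4.03  # factor used in the tutorial's estimate


def estimate_mb(n_rows: int, n_cols: int) -> int:
    """Approximate in-memory size in megabytes (1 MB = 1e6 bytes)."""
    return int(round(BYTES_PER_VALUE * n_rows * n_cols / 1e6, 0))


sizes = {name: estimate_mb(*shape) for name, shape in shapes.items()}
print(sizes)
```

The compound dataset comes out at roughly 12 GB, which is why collecting it in full is best avoided on a typical workstation.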

Let us now focus on the `CRISPR` dataset and use a regex to select the
metadata columns. We will then sample five rows and display them.
Note that the `collect()` method executes the query and loads its result
into memory.

In [7]:
data = lazy_load(filepaths["CRISPR"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

Metadata_Source,Metadata_Plate,Metadata_Well,Metadata_JCP2022
str,str,str,str
"""source_13""","""CP-CC9-R1-06""","""M07""","""JCP2022_806374…"
"""source_13""","""CP-CC9-R1-28""","""B03""","""JCP2022_800001…"
"""source_13""","""CP-CC9-R2-23""","""P20""","""JCP2022_802185…"
"""source_13""","""CP-CC9-R3-15""","""J15""","""JCP2022_800322…"
"""source_13""","""CP-CC9-R6-28""","""O23""","""JCP2022_800002…"


The following line excludes the metadata columns and samples five rows of the remaining feature columns:

In [8]:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

Cells_AreaShape_Area,Cells_AreaShape_BoundingBoxArea,Cells_AreaShape_BoundingBoxMaximum_X,Cells_AreaShape_BoundingBoxMaximum_Y,Cells_AreaShape_BoundingBoxMinimum_X,Cells_AreaShape_BoundingBoxMinimum_Y,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_Eccentricity,Cells_AreaShape_EquivalentDiameter,Cells_AreaShape_EulerNumber,Cells_AreaShape_Extent,Cells_AreaShape_FormFactor,Cells_AreaShape_MajorAxisLength,Cells_AreaShape_MaxFeretDiameter,Cells_AreaShape_MaximumRadius,Cells_AreaShape_MeanRadius,Cells_AreaShape_MedianRadius,Cells_AreaShape_MinFeretDiameter,Cells_AreaShape_MinorAxisLength,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,Cells_AreaShape_Solidity,Cells_AreaShape_Zernike_0_0,Cells_AreaShape_Zernike_1_1,Cells_AreaShape_Zernike_2_0,Cells_AreaShape_Zernike_2_2,Cells_AreaShape_Zernike_3_1,Cells_AreaShape_Zernike_3_3,Cells_AreaShape_Zernike_4_0,Cells_AreaShape_Zernike_4_2,Cells_AreaShape_Zernike_4_4,Cells_AreaShape_Zernike_5_1,Cells_AreaShape_Zernike_5_3,Cells_AreaShape_Zernike_5_5,Cells_AreaShape_Zernike_6_0,…,Nuclei_Texture_Variance_DNA_5_03_256,Nuclei_Texture_Variance_ER_10_00_256,Nuclei_Texture_Variance_ER_10_01_256,Nuclei_Texture_Variance_ER_10_02_256,Nuclei_Texture_Variance_ER_10_03_256,Nuclei_Texture_Variance_ER_3_00_256,Nuclei_Texture_Variance_ER_3_01_256,Nuclei_Texture_Variance_ER_3_02_256,Nuclei_Texture_Variance_ER_3_03_256,Nuclei_Texture_Variance_ER_5_00_256,Nuclei_Texture_Variance_ER_5_01_256,Nuclei_Texture_Variance_ER_5_02_256,Nuclei_Texture_Variance_ER_5_03_256,Nuclei_Texture_Variance_Mito_10_00_256,Nuclei_Texture_Variance_Mito_10_01_256,Nuclei_Texture_Variance_Mito_10_02_256,Nuclei_Texture_Variance_Mito_10_03_256,Nuclei_Texture_Variance_Mito_3_00_256,Nuclei_Texture_Variance_Mito_3_01_256,Nuclei_Texture_Variance_Mito_3_02_256,Nuclei_Texture_Variance_Mito_3_03_256,Nuclei_Texture_Variance_Mito_5_00_256,Nuclei_Texture_Variance_Mito_5_01_256,Nuclei_Texture_Variance_Mito_5_02_256,Nuclei_Texture_Variance_Mito_5_03_256,Nuclei_Texture_Variance_RNA_10_00_256,Nuclei_Texture_Variance_RNA_10_01_256,Nuclei_Texture_Variance_RNA_10_02_256,Nuclei_Texture_Variance_RNA_10_03_256,Nuclei_Texture_Variance_RNA_3_00_256,Nuclei_Texture_Variance_RNA_3_01_256,Nuclei_Texture_Variance_RNA_3_02_256,Nuclei_Texture_Variance_RNA_3_03_256,Nuclei_Texture_Variance_RNA_5_00_256,Nuclei_Texture_Variance_RNA_5_01_256,Nuclei_Texture_Variance_RNA_5_02_256,Nuclei_Texture_Variance_RNA_5_03_256
f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,…,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32
3437.205322,6777.026367,655.655762,553.805725,574.797363,475.20105,614.663025,513.953979,2.809995,0.766263,63.48436,0.934583,0.5282,0.393083,90.885178,96.299362,21.411591,7.525086,6.554964,57.310593,52.115498,0.739814,330.359772,0.785753,0.455676,0.056063,0.157158,0.05564,0.033548,0.019762,0.021171,0.03315,0.021444,0.015374,0.016197,0.010043,0.010896,…,0.287671,2.316995,2.255697,2.31177,2.281517,2.137642,2.17198,2.135844,2.171006,2.198508,2.268348,2.196704,2.264076,8.106384,8.647932,8.149483,8.731106,8.056567,7.904062,8.053588,7.903239,7.893166,7.876201,7.89196,7.928691,5.882857,5.725977,5.863044,5.74948,5.372434,5.473141,5.380401,5.467793,5.536143,5.733851,5.548314,5.717349
2723.384521,5158.307617,677.795349,557.942017,606.970093,488.857605,641.975159,522.861877,2.561128,0.741647,56.989681,0.950307,0.548386,,78.137085,83.441475,19.805571,6.985418,6.062629,51.839073,47.675426,-2.123358,286.608917,0.801257,0.485367,0.056452,0.164563,0.055538,0.031914,0.020555,0.020632,0.031065,0.021271,0.014857,0.015759,0.010461,0.01096,…,0.425449,3.520926,3.387909,3.527264,3.404135,3.201259,3.272938,3.202795,3.268549,3.315624,3.447717,3.323368,3.439064,7.47143,8.032343,7.453537,8.07777,7.264492,7.104119,7.227602,7.152196,7.132525,7.170595,7.114265,7.223675,10.33992,9.888246,10.363727,9.988524,9.360723,9.590192,9.381898,9.56544,9.692972,10.118198,9.752176,10.0841
3654.835693,7013.448242,631.593323,550.501038,550.170898,470.806641,590.420654,510.163788,2.775027,0.761809,64.935699,0.922245,0.536226,,90.942703,97.066788,22.177958,7.783074,6.760633,58.599976,53.501411,-0.699903,337.737854,0.793812,0.466231,0.055939,0.160713,0.055947,0.032861,0.019502,0.021562,0.033062,0.021108,0.014772,0.015794,0.009829,0.010827,…,0.233601,1.046697,1.018803,1.049524,1.028847,0.985821,1.001579,0.988126,0.999641,1.007238,1.031999,1.011429,1.031196,7.501048,8.037206,7.473342,7.99159,9.592624,9.640192,9.525067,9.178057,9.411607,9.306776,9.303323,9.292846,3.829718,3.703847,3.848579,3.7504,3.545621,3.615607,3.552575,3.612055,3.651154,3.754131,3.665798,3.757694
3173.76416,6010.796875,671.297913,581.970459,594.554871,508.082428,632.437805,544.703125,2.488786,0.750593,61.232468,0.92419,0.546198,,84.904556,90.029091,21.05418,7.430189,6.480733,55.409756,51.01392,1.157861,304.193604,0.803121,0.480031,0.056833,0.161786,0.056623,0.032846,0.02063,0.020398,0.031956,0.021722,0.014996,0.016372,0.010458,0.011155,…,0.272457,2.945141,2.821762,2.947247,2.814691,2.711728,2.773225,2.715023,2.772156,2.807252,2.891628,2.810266,2.891836,11.273093,12.210885,11.366681,11.915511,12.807759,12.333618,12.660041,12.259472,11.12972,11.053115,11.020443,11.067571,13.501965,12.914554,13.534065,12.93867,12.413904,12.706355,12.442088,12.701751,12.854201,13.250565,12.877565,13.260153
2861.939453,5583.744629,649.689819,555.609497,577.530701,483.638062,612.984192,519.09375,2.492543,0.756743,57.922684,0.927374,0.539043,,81.930603,86.510559,19.579346,6.928279,6.051001,52.5947,48.145298,2.168251,288.56369,0.794415,0.47224,0.056577,0.157565,0.057176,0.033039,0.0208,0.020877,0.032619,0.022094,0.01525,0.016698,0.0104,0.010929,…,0.324927,2.831556,2.778327,2.820665,2.75987,2.649589,2.689806,2.649093,2.689713,2.716524,2.779911,2.712338,2.776352,25.20035,27.100161,24.932827,27.098526,24.926645,24.321119,25.088202,24.544903,24.126326,24.32305,24.369564,24.48403,13.461011,13.215631,13.423254,13.113597,12.539919,12.765607,12.564599,12.7421,12.87962,13.208318,12.885329,13.164612


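Finally, we can convert the result to `pandas` if we want to perform analyses
with that tool. Here `data_only` is the five-row sample that was already
collected, so the conversion is cheap. Keep in mind that converting a full
dataset this way would load the entire dataframe into memory.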

In [7]:
data_only.to_pandas()