# Basic JUMP data access

This tutorial shows how to access profiles from the [JUMP Cell
Painting datasets](https://github.com/jump-cellpainting/datasets). We
will use `polars` to fetch the data frames lazily, with the help of `s3fs`
and `pyarrow`. We prefer lazy loading because the data can be too large
to fit in memory.

In [1]:
import polars as pl
from pyarrow.dataset import dataset
from s3fs import S3FileSystem

The available datasets are:

1.  `cpg0016-jump[crispr]`: CRISPR knockout genetic perturbations.
2.  `cpg0016-jump[orf]`: Overexpression genetic perturbations.
3.  `cpg0016-jump[compound]`: Chemical perturbations.

Their exact location is determined by the transformations that
produce each dataset. The AWS S3 paths of the dataframes are built from
the prefix below:

In [2]:
_PREFIX = (
    "s3://cellpainting-gallery/cpg0016-jump-assembled/source_all/workspace/profiles"
)
_RECIPE = "jump-profiling-recipe_2024_a917fa7"

transforms = (
    (
        "CRISPR",
        "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected",
    ),
    ("ORF", "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony"),
    ("COMPOUND", "profiles_var_mad_int_featselect_harmony"),
)

filepaths = {
    dset: f"{_PREFIX}/{_RECIPE}/{dset}/{transform}/profiles.parquet"
    for dset, transform in transforms
}
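
The path construction above can also be wrapped in a small lookup function. The following is a minimal sketch, not part of the original notebook: `profile_path` is a hypothetical helper, and the constants simply repeat the values defined above. It accepts dataset names in any case, so `"crispr"` and `"CRISPR"` resolve to the same file.

```python
# Hypothetical helper; the constants repeat the values defined in this tutorial.
PREFIX = "s3://cellpainting-gallery/cpg0016-jump-assembled/source_all/workspace/profiles"
RECIPE = "jump-profiling-recipe_2024_a917fa7"
TRANSFORMS = {
    "CRISPR": "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected",
    "ORF": "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony",
    "COMPOUND": "profiles_var_mad_int_featselect_harmony",
}


def profile_path(name: str) -> str:
    """Return the S3 URI of the profiles file for a dataset name (case-insensitive)."""
    key = name.upper()
    if key not in TRANSFORMS:
        raise KeyError(f"Unknown dataset {name!r}; expected one of {sorted(TRANSFORMS)}")
    return f"{PREFIX}/{RECIPE}/{key}/{TRANSFORMS[key]}/profiles.parquet"


print(profile_path("crispr"))
```

Failing early on an unknown name avoids a less readable error later, when the file system tries to open a path that does not exist.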

We use an anonymous `S3FileSystem` (`anon=True`) so that no credentials are needed.

In [3]:
def lazy_load(path: str) -> pl.LazyFrame:
    fs = S3FileSystem(anon=True)
    myds = dataset(path, filesystem=fs)
    df = pl.scan_pyarrow_dataset(myds)
    return df

We will lazy-load each dataframe and tabulate its number of rows and
columns, along with a rough size estimate.

In [4]:
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = lazy_load(path)
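    # pl.len() counts rows; pl.count() is deprecated in recent polars versions
    n_rows = data.select(pl.len()).collect().item()
    # Resolve the schema once (no rows are loaded) instead of reading
    # .columns and .width directly from the LazyFrame
    columns = data.collect_schema().names()
    metadata_cols = [col for col in columns if col.startswith("Metadata")]
    n_cols = len(columns)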
    n_meta_cols = len(metadata_cols)
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)
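
The size column is a back-of-the-envelope figure: rows times columns times roughly 4.03 bytes per value, converted to megabytes. The sketch below isolates that arithmetic in a hypothetical helper, `estimate_size_mb`. The 4.03 factor is taken from the code above; reading it as 4-byte values plus a small overhead is an assumption, not something the notebook states.

```python
def estimate_size_mb(n_rows: int, n_cols: int, bytes_per_value: float = 4.03) -> int:
    """Rough in-memory size in MB, using the same formula as the table above."""
    n_bytes = bytes_per_value * n_rows * n_cols
    return int(round(n_bytes / 1e6, 0))  # B -> MB


# Illustrative shape only, not a real JUMP dataset.
print(estimate_size_mb(1_000_000, 1_000))
```

An estimate like this can help decide whether a dataset is safe to collect in full or should be filtered first.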

Let us now focus on the CRISPR dataset and use a regex to select the
metadata columns. We will then sample rows and display the overview.
Note that `collect()` executes the query and loads its result into
memory; here that is only the sampled rows of the selected columns.

In [5]:
data = lazy_load(filepaths["CRISPR"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

The following line selects every column except the metadata columns:

In [6]:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

Finally, we can convert this to `pandas` if we want to perform analyses
with that tool. Here `data_only` is a small sample that is already in
memory; keep in mind that converting a full dataset this way loads the
entire dataframe into memory.

In [7]:
data_only.to_pandas()