# Retrieve JUMP profiles

This tutorial shows how to access profiles from the [JUMP Cell
Painting datasets](https://github.com/jump-cellpainting/datasets). We
will use `polars` to fetch the data frames lazily, with the help of `s3fs`
and `pyarrow`. We prefer lazy loading because the data can be too big to
fit in memory.

In [1]:
import polars as pl

The available datasets are:

1.  `cpg0016-jump[crispr]`: CRISPR knockout genetic perturbations.
2.  `cpg0016-jump[orf]`: Overexpression genetic perturbations.
3.  `cpg0016-jump[compound]`: Chemical perturbations.

Their exact location depends on the transformations that produce the
datasets. The AWS paths of the dataframes are listed in the index file
below:

In [2]:
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"

We use a version-controlled CSV to release the latest corrected profiles.

In [3]:
profile_index = pl.read_csv(INDEX_FILE)
profile_index.head()

We do not need the ‘etag’ column (used to check file integrity) nor the
‘interpretable’ profiles (i.e., those before major modifications), so we
keep only the three main subsets.

In [4]:
selected_profiles = profile_index.filter(
    pl.col("subset").is_in(("crispr", "orf", "compound"))
).select(pl.exclude("etag"))
filepaths = dict(selected_profiles.iter_rows())
print(filepaths)

{'orf': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet', 'crispr': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet', 'compound': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet'}
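
The `dict(...)` call in the cell above relies on every row being a two-element tuple, which holds here because only two columns remain once `etag` is excluded. A minimal sketch of that step in plain Python follows; the rows and `example.org` URLs are hypothetical placeholders, not real JUMP paths.

```python
# Hypothetical rows, shaped like what `iter_rows()` yields for a
# two-column frame: (subset name, file location).
rows = [
    ("orf", "https://example.org/orf.parquet"),
    ("crispr", "https://example.org/crispr.parquet"),
    ("compound", "https://example.org/compound.parquet"),
]

# dict() accepts any iterable of (key, value) pairs.
filepaths_demo = dict(rows)
print(filepaths_demo["crispr"])
```

If the index ever gained a third column after the exclusion, `dict()` would raise a `ValueError`, so selecting the two columns explicitly may be a safer choice.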

We will lazy-load the dataframes and tabulate the number of rows and
columns, together with a rough size estimate.

In [5]:
# Collect one summary row per dataset without loading the full tables
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)  # Lazy scan: the full table is not loaded
    n_rows = data.select(pl.len()).collect().item()  # Row count only
    schema = data.collect_schema()  # Column names and dtypes
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    # Rough estimate using a factor of 4.03 bytes per value
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)
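
To make the size estimate concrete, here is a small sketch of the same arithmetic as a standalone function. The helper name `estimate_size_mb` and the example dimensions are hypothetical; the factor of 4.03 bytes per value is taken as-is from the cell above, which does not explain how it was derived.

```python
def estimate_size_mb(n_rows: int, n_cols: int, bytes_per_value: float = 4.03) -> int:
    """Rough table size in MB: bytes per value x rows x columns, scaled to MB."""
    return int(round(bytes_per_value * n_rows * n_cols / 1e6, 0))


# Hypothetical table with 100,000 rows and 1,000 columns.
print(estimate_size_mb(100_000, 1_000))  # 403
```

The result is only a rough guide to whether a dataset will fit in memory once collected.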

Let us now focus on the `crispr` dataset and use a regex to select the
metadata columns. We will then sample five rows and display them.
Note that the `collect()` method forces the selected data to be loaded
into memory.

In [6]:
data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

The following line excludes the metadata columns:

In [7]:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

Finally, we can convert this to `pandas` if we want to perform analyses
with that tool. Keep in mind that this loads the entire dataframe into
memory.

In [8]:
data_only.to_pandas()