# Basic JUMP data access

This tutorial shows how to access profiles from the [JUMP Cell
Painting datasets](https://github.com/jump-cellpainting/datasets). We
use polars to load the data frames lazily, with the help of `s3fs`
and `pyarrow`. Lazy loading matters because the profiles can be too
large to fit in memory.

In [1]:
from collections import Counter

import polars as pl
from pyarrow.dataset import dataset
from s3fs import S3FileSystem

# Manifest of the available profile sets, pinned to a specific commit.
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
profile_index = pl.read_csv(INDEX_FILE)
# URL of the Parquet file for the "crispr_interpretable" subset.
url = profile_index.filter(pl.col("subset") == "crispr_interpretable").get_column(
    "url"
)[0]
# Resolve the schema lazily to get the column names without loading the profiles.
cols = tuple(pl.scan_parquet(url).collect_schema())
# Drop the leading object prefix (e.g. Cells, Nuclei) from each feature name and
# deduplicate, so a feature measured on several objects is counted once.
no_objects = list(
    set(tuple(x.split("_")[1:]) for x in cols if not x.startswith("Metadata"))
)
# Count features per measurement group (the first remaining name component).
counters = dict(Counter(x[0] for x in no_objects))
total = sum(counters.values())
# Fraction of features in each group, skipping groups with a single feature.
{k: v / total for k, v in counters.items() if v > 1}
"""
{'Texture': 0.6055900621118012,
 'RadialDistribution': 0.13043478260869565,
 'Granularity': 0.062111801242236024,
 'Neighbors': 0.016304347826086956,
 'Correlation': 0.062111801242236024,
 'AreaShape': 0.04192546583850932,
 'Location': 0.017080745341614908,
 'ObjectSkeleton': 0.003105590062111801,
 'Intensity': 0.05822981366459627}
"""