# Downloading and Processing CPJUMP1 Experimental Metadata

This notebook focuses on downloading and processing the metadata associated with the [CPJUMP1 pilot dataset](https://github.com/jump-cellpainting/2024_Chandrasekaran_NatureMethods_CPJUMP1). The primary goal is to identify and organize plates that contain wells treated with CRISPR perturbations for downstream analysis.

**Key Points:**
- Only metadata is downloaded and processed in this notebook. The full CPJUMP1 dataset is not downloaded here.
- The metadata provides essential information about which plates and wells are relevant for CRISPR-based experiments.
- The processed dataset used in this notebook has already undergone quality control and feature selection. For access to the full processed dataset, refer to the [JUMP-single-cell repository](https://github.com/WayScience/JUMP-single-cell).



In [1]:
import sys
import pprint
import pathlib
import zipfile

import requests
import polars as pl
from tqdm import tqdm

sys.path.append("../../")
from utils import io_utils

Parameters used in this notebook

In [2]:
# setting perturbation type
pert_type = "crispr"

Setting input and output paths

In [3]:
# setting config path
config_path = pathlib.Path("../nb-configs.yaml").resolve(strict=True)

# setting a data directory
data_dir = pathlib.Path("./data").resolve()
data_dir.mkdir(exist_ok=True)

# setting a path to save the experimental metadata
exp_metadata_path = (data_dir / "CPJUMP1-experimental-metadata.csv").resolve()

# setting profile directory
profiles_dir = (data_dir / "sc-profiles").resolve()
profiles_dir.mkdir(exist_ok=True)

# create mitocheck directory
mitocheck_dir = (profiles_dir / "mitocheck").resolve()
mitocheck_dir.mkdir(exist_ok=True)

Loading in the notebook configurations and downloading the experimental metadata

In [4]:
# loading config file and setting experimental metadata URL
nb_configs = io_utils.load_configs(config_path)
CPJUMP1_exp_metadata_url = nb_configs["links"]["CPJUMP1-experimental-metadata-source"]

# read in the tab-separated experimental metadata file; it is filtered below
# to plates that have a CRISPR perturbation
exp_metadata = pl.read_csv(
    CPJUMP1_exp_metadata_url, separator="\t", has_header=True, encoding="utf-8"
)

# filtering the metadata to include only plates whose perturbation type is crispr
exp_metadata = exp_metadata.filter(exp_metadata["Perturbation"].str.contains(pert_type))

# save the experimental metadata as a csv file
exp_metadata.write_csv(exp_metadata_path)

# display
exp_metadata

| Batch | Plate_Map_Name | Assay_Plate_Barcode | Perturbation | Cell_type | Time | Density | Antibiotics | Cell_line | Time_delay | Times_imaged | Anomaly | Number_of_images |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | i64 | i64 | str | str | str | i64 | str | i64 |
| 2020_11_04_CPJUMP1 | JUMP-Target-1_crispr_platemap | BR00116996 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |
| 2020_11_04_CPJUMP1 | JUMP-Target-1_crispr_platemap | BR00116997 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |
| 2020_11_04_CPJUMP1 | JUMP-Target-1_crispr_platemap | BR00116998 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |
| 2020_11_04_CPJUMP1 | JUMP-Target-1_crispr_platemap | BR00116999 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |
| 2020_11_04_CPJUMP1 | JUMP-Target-1_crispr_platemap | BR00117000 | crispr | A549 | 144 | 100 | absent | Cas9 | Day0 | 1 | none | 27640 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2020_11_04_CPJUMP1 | JUMP-Target-1_crispr_platemap | BR00118048 | crispr | U2OS | 96 | 100 | absent | Cas9 | Day0 | 1 | Phalloidin | 27648 |
| 2020_11_04_CPJUMP1_DL | JUMP-Target-1_crispr_platemap | BR00116996 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |
| 2020_11_04_CPJUMP1_DL | JUMP-Target-1_crispr_platemap | BR00116997 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |
| 2020_11_04_CPJUMP1_DL | JUMP-Target-1_crispr_platemap | BR00116998 | crispr | U2OS | 144 | 100 | absent | Cas9 | Day0 | 1 | WGA | 27648 |


Creating a dictionary to group plates by their corresponding experimental batch

This step organizes the plate barcodes from the experimental metadata into groups based on their batch. Grouping plates by batch is useful for batch-wise data processing and downstream analyses.

In [5]:
# creating a dictionary that maps each batch to its associated plates
batch_plates_dict = {}
exp_metadata_batches = exp_metadata["Batch"].unique().to_list()

for batch in exp_metadata_batches:
    # getting the plates in the batch
    plates_in_batch = exp_metadata.filter(exp_metadata["Batch"] == batch)["Assay_Plate_Barcode"].to_list()

    # adding the plates to the dictionary
    batch_plates_dict[batch] = plates_in_batch

# display batches (keys) and the plates (values) within each batch
pprint.pprint(batch_plates_dict)

{'2020_11_04_CPJUMP1': ['BR00116996',
                        'BR00116997',
                        'BR00116998',
                        'BR00116999',
                        'BR00117000',
                        'BR00117001',
                        'BR00117002',
                        'BR00117003',
                        'BR00117004',
                        'BR00117005',
                        'BR00118041',
                        'BR00118042',
                        'BR00118043',
                        'BR00118044',
                        'BR00118045',
                        'BR00118046',
                        'BR00118047',
                        'BR00118048'],
 '2020_11_04_CPJUMP1_DL': ['BR00116996',
                           'BR00116997',
                           'BR00116998',
                           'BR00116999']}
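
The grouping above filters the metadata once per batch. As an illustrative alternative, the same batch-to-plates mapping can be built in a single pass over (batch, barcode) pairs. The sketch below uses only the standard library and a handful of rows copied from the metadata shown earlier. The helper name `group_plates_by_batch` and the variable `example_batch_plates` are hypothetical and are not part of the notebook or its `utils` package.

```python
from collections import defaultdict

# sample (batch, plate barcode) pairs copied from the metadata displayed above
rows = [
    ("2020_11_04_CPJUMP1", "BR00116996"),
    ("2020_11_04_CPJUMP1", "BR00116997"),
    ("2020_11_04_CPJUMP1_DL", "BR00116996"),
    ("2020_11_04_CPJUMP1_DL", "BR00116997"),
]


def group_plates_by_batch(pairs):
    """Group plate barcodes by batch in one pass (hypothetical helper)."""
    grouped = defaultdict(list)
    for batch, plate in pairs:
        # each barcode is appended to the list that belongs to its batch
        grouped[batch].append(plate)
    return dict(grouped)


example_batch_plates = group_plates_by_batch(rows)
```

With the notebook's DataFrame, the pairs could plausibly come from `exp_metadata.select("Batch", "Assay_Plate_Barcode").iter_rows()`. This is an assumption about how the helper would be wired in, not something the notebook does.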


## Downloading MitoCheck Data

In this section, we download the MitoCheck data that was generated in this [study]().

In [7]:
# url source for the MitoCheck data
mitocheck_source = nb_configs["links"]["MitoCheck-profiles-source"]

# define the temporary zip file path
zip_path = mitocheck_dir / "mitocheck_data.zip"

try:
    # download with streaming to handle large files efficiently
    with requests.get(mitocheck_source, stream=True) as response:
        # check if the request was successful
        response.raise_for_status()

        # get file size for progress tracking
        total_size = int(response.headers.get("content-length", 0))

        # use tqdm for progress bar
        with (
            open(zip_path, "wb") as file,
            tqdm(
                desc="Downloading MitoCheck data",
                total=total_size,
                unit="B",
                unit_scale=True,
                unit_divisor=1024,
            ) as pbar,
        ):
            for chunk in response.iter_content(chunk_size=8192):  # 8KB chunks
                if chunk:
                    file.write(chunk)
                    pbar.update(len(chunk))

    # extract the zip file with progress bar
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Get list of files for extraction progress
        file_list = zip_ref.namelist()
        with tqdm(
            desc="Extracting MitoCheck data", total=len(file_list), unit="files"
        ) as pbar:
            for file in file_list:
                zip_ref.extract(file, mitocheck_dir)
                pbar.update(1)

    # removing temporary zip file after extraction
    zip_path.unlink()

# report (rather than re-raise) any issues during download or extraction
except requests.exceptions.RequestException as e:
    print(f"Error downloading file: {e}")
except zipfile.BadZipFile as e:
    print(f"Error extracting zip file: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Downloading: 100%|██████████| 16.9G/16.9G [44:01<00:00, 6.85MB/s]  

Unexpected error: [Errno 2] No such file or directory: '/home/erikserrano/Development/BuSCar/notebooks/0.download-data/data/mitocheck/mitocheck_data.zip'
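
The log above ends with a missing-file error, and the output alone does not show which step raised it. The sketch below illustrates two defensive measures for the extract-and-clean-up step under that uncertainty: creating the destination directory (including parents) before extraction, and deleting the archive with `missing_ok=True`. The helper name `extract_and_cleanup` and the demo paths are hypothetical. The demonstration runs on a tiny throwaway archive rather than the MitoCheck download.

```python
import pathlib
import tempfile
import zipfile


def extract_and_cleanup(zip_path, dest_dir):
    """Extract a zip archive into dest_dir, then delete the archive.

    Hypothetical helper; not part of the notebook's utils package.
    """
    zip_path = pathlib.Path(zip_path)
    dest_dir = pathlib.Path(dest_dir)

    # create the destination (and any missing parents) before extracting
    dest_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(dest_dir)

    # missing_ok avoids a FileNotFoundError if the archive is already gone
    zip_path.unlink(missing_ok=True)
    return members


# self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("profiles/example.txt", "placeholder contents")

    out_dir = tmp_dir / "nested" / "out"
    extracted = extract_and_cleanup(demo_zip, out_dir)
    extracted_exists = (out_dir / "profiles" / "example.txt").exists()
    zip_removed = not demo_zip.exists()
```

In the notebook's setting, the call would presumably look like `extract_and_cleanup(zip_path, mitocheck_dir)`, at the cost of losing the per-file progress bar.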



