# MAP Analysis with MitoCheck Single Cells 


In this notebook, we apply the Mean Average Precision (MAP) metric, as implemented in the [copairs](https://github.com/cytomining/copairs) analysis package, to the MitoCheck single-cell dataset.
Our goal is to assess the effects of genetic perturbations based on their phenotypes.
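
As a rough illustration of what the metric measures, the sketch below computes average precision (AP) for a single ranked list and then averages several AP values into a MAP score. It follows the textbook definition and is not necessarily how copairs computes it internally; the function name `average_precision` and the toy relevance lists are assumptions made for this example.

```python
import numpy as np


def average_precision(relevance):
    """Textbook AP for a ranked list of 0/1 relevance flags (best rank first).

    Illustrative helper only; not part of the copairs API.
    """
    relevance = np.asarray(relevance, dtype=float)
    n_relevant = relevance.sum()
    if n_relevant == 0:
        # no relevant items in the list: define AP as 0
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    # precision at each rank = relevant items seen so far / rank
    precision_at_k = np.cumsum(relevance) / ranks
    # average the precision values taken at the ranks of relevant items
    return float((precision_at_k * relevance).sum() / n_relevant)


# toy example: 1 marks a neighbor sharing the query's phenotype
ranked_hits = [1, 0, 1, 1, 0]
ap = average_precision(ranked_hits)

# MAP is the mean of AP over several queries
mean_ap = float(np.mean([ap, average_precision([0, 1, 0, 0, 1])]))
```

A higher AP means that cells with the same label are ranked closer to the query than cells with other labels.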


Useful links:
- copairs [repo](https://github.com/cytomining/copairs)
- MitoCheck GitHub [repo](https://github.com/WayScience/mitocheck_data)
- MitoCheck Zenodo [record](https://zenodo.org/records/7967386)

In [10]:
import gc
import sys
import pathlib

import copairs
import pandas as pd
import numpy as np

# add the parent directory to the path so the local src package can be imported
sys.path.append("../")
from src import utils

## Loading Downloaded Data

In this section, we load the MitoCheck single-cell datasets: the training data, the positive controls, and the negative controls.
For detailed information about the dataset, please refer to the MitoCheck resources linked above.

After loading the data, we split each dataset into two parts.
The first part contains the metadata of each individual cell, while the second part holds all quantified features as a numpy array.

This format is designed to integrate directly with the copairs `run_pipeline()` function.
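
The metadata/feature split described above can be sketched as follows. This is a minimal sketch, not the actual `utils.split_data` implementation: the helper name `split_meta_and_features`, the `CP__`/`DP__` column prefixes, and the toy column names are all assumptions made for illustration.

```python
import pandas as pd


def split_meta_and_features(df, feature_prefixes=("CP__", "DP__")):
    """Split a single-cell table into (metadata DataFrame, feature array).

    Hypothetical helper: assumes feature columns are identified by prefix.
    """
    prefixes = tuple(feature_prefixes)
    feature_cols = [col for col in df.columns if col.startswith(prefixes)]
    meta_cols = [col for col in df.columns if col not in feature_cols]

    # metadata stays a DataFrame; features become a 2D float array
    meta_df = df[meta_cols].reset_index(drop=True)
    feature_vals = df[feature_cols].to_numpy(dtype=float)
    return meta_df, feature_vals


# toy table with one metadata column and two feature columns
toy = pd.DataFrame(
    {
        "Metadata_Gene": ["geneA", "geneB"],
        "CP__area": [1.0, 2.0],
        "DP__embedding_0": [0.1, 0.2],
    }
)
meta_df, feature_vals = split_meta_and_features(toy)
```

Passing `feature_prefixes=("CP__",)` would keep only one feature type, mirroring the `dataset="CP"` style of selection used in this notebook.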

In [None]:
# parameters
training_singlecell_data = pathlib.Path("../data/raw/training_data.csv.gz").resolve(strict=True)
pos_control_data = pathlib.Path("../data/raw/normalized_data/positive_control_data.csv.gz").resolve(strict=True)
neg_control_data = pathlib.Path("../data/raw/normalized_data/negative_control_data.csv.gz").resolve(strict=True)

In [16]:
# load the data into dataframes (takes ~10 min)
training_sc_data = pd.read_csv(training_singlecell_data).drop("Unnamed: 0", axis=1)
pos_control_sc_data = pd.read_csv(pos_control_data)
neg_control_sc_data = pd.read_csv(neg_control_data)


In [15]:
# separating the training dataframe based on feature type
train_sc_cp_dp_df, cp_df_feat_vals = utils.split_data(training_sc_data, dataset="CP_and_DP")
train_sc_cp_df, cp_feat_vals = utils.split_data(training_sc_data, dataset="CP")
train_sc_dp_df, dp_feat_vals = utils.split_data(training_sc_data, dataset="DP")

# splitting positive control data
pos_control_cp_dp_df, pos_cp_df_control_vals = utils.split_data(pos_control_sc_data, dataset="CP_and_DP")
pos_control_cp_df, pos_cp_control_vals = utils.split_data(pos_control_sc_data, dataset="CP")
pos_control_dp_df, pos_dp_control_vals = utils.split_data(pos_control_sc_data, dataset="DP")

# splitting negative control data
neg_control_cp_dp_df, neg_cp_df_control_vals = utils.split_data(neg_control_sc_data, dataset="CP_and_DP")
neg_control_cp_df, neg_cp_control_vals = utils.split_data(neg_control_sc_data, dataset="CP")
neg_control_dp_df, neg_dp_control_vals = utils.split_data(neg_control_sc_data, dataset="DP")

# delete the original dataframes to free memory; re-run the loading cell before re-running this cell
del training_sc_data
del pos_control_sc_data
del neg_control_sc_data 
gc.collect()

NameError: name 'training_sc_data' is not defined

0