### Melanoma and tumor-infiltrating lymphocyte examples

This notebook prepares Perturb-seq datasets from [Frangieh et al.](https://www.nature.com/articles/s41588-021-00779-1), a study of melanoma resistance to immune checkpoint inhibitors. CRISPR knockouts of ~250 genes were applied to melanoma cells, which were then treated in several ways, including control, interferon gamma, and co-culture with tumor-infiltrating lymphocytes. The idea is to see which mutations in melanoma might protect it from the lymphocytes. The paper included low-dimensional readouts based on melanoma survival and also Perturb-seq readouts, which contained mostly melanoma cells with minimal T cell contamination. Here we tidy the dataset and carry out a simple exploration in scanpy.

This handling of the Frangieh data aims to understand the effect of different preprocessing choices on GRN inference, particularly for benchmarking DCDFG (Lopez et al. 2022). Overall, this notebook generates 4 versions of the dataset. V1 through V3 each add more processing than the previous one. Specifically, V1 is exactly the same as what Lopez et al. did. V2 is based on V1, with additional standard scRNA-seq preprocessing and a regime filter (more details below). V3 is the pseudobulk version of V2. V4 is like V1, but with cell cycle regressed out.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import regex as re
import os
import shutil
import sys
import importlib
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import scipy as sp
from scipy.stats import spearmanr as spearmanr
from scipy.stats import pearsonr
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg')
import itertools as it
import anndata
import gc


import functools
from scipy.stats import f_oneway
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.oneway import anova_oneway
from sklearn.metrics import mutual_info_score
import time
from collections import Counter


# local (importlib and sys are already imported above)
sys.path.append("setup")
import ingestion
importlib.reload(ingestion)

#      visualization settings
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# I prefer to specify the working directory explicitly.
os.chdir("/home/ekernf01/Desktop/jhu/research/projects/perturbation_prediction/cell_type_knowledge_transfer/perturbation_data")

# Universal
geneAnnotationPath = "../accessory_data/gencode.v35.annotation.gtf.gz"       # Downloaded from https://www.gencodegenes.org/human/release_35.html
humanTFPath = "../accessory_data/humanTFs.csv"                               # Downloaded from http://humantfs.ccbr.utoronto.ca/download.php
humanEpiPath = "../accessory_data/epiList.csv"                               # Downloaded from https://epifactors.autosome.org/description 
cellCyclePath= "../accessory_data/regev_lab_cell_cycle_genes.txt"

# Dataset-specific
rawDataPath  = "not_ready/frangieh/RNA_expression.csv.gz"
rawH5ADPath  = "not_ready/frangieh/RNA_expression.h5ad.gz"
cellMetaPath = "not_ready/frangieh/all_sgRNA_assignments.txt"   
cellConditionPath = "not_ready/frangieh/RNA_metadata.csv"
perturbEffectTFOnlyPath = "../accessory_data/frangiehTFOnly.csv"                        # a path to store temp file
perturbEffectFullTranscriptomePath = "../accessory_data/frangiehFullTranscriptome.csv"  # a path to store temp file

finalDataFileFolder = "perturbations/frangieh"
finalDataFilePath   = "perturbations/frangieh/test.h5ad"

dataset_name = "frangieh"

### Frangieh Version 1 (exactly following Lopez et al.)

We first convert the normalized data back to raw counts, and then we remove cells with fewer than 500 detected genes and genes detected in fewer than 500 cells. We also scan through the predicted guide RNA assignments and drop any target gene that was perturbed but not measured. The remaining perturbations are encoded numerically and stored in the field `regimes`.
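
The de-normalization step can be illustrated on toy data. The sketch below is a minimal, dense-matrix version of the idea, assuming the data were stored as `log1p(count / UMI_count * total)`; the toy counts and the scaling target of `1e6` are assumptions made for the example, not values taken from the dataset.

```python
import numpy as np

def recover_counts(log_norm, umi_count, total):
    """Invert log1p(count / umi_count * total) and round to the nearest integer."""
    scaled = np.expm1(log_norm)                      # undo log1p
    counts = scaled * (umi_count[:, None] / total)   # undo per-cell scaling
    return np.rint(counts)

# Toy data: 3 cells x 4 genes of integer counts (hypothetical values).
counts = np.array([[0, 3, 1, 6],
                   [2, 0, 0, 8],
                   [5, 5, 0, 0]], dtype=float)
umi_count = counts.sum(axis=1)
total = 1e6  # assumed scaling target for this toy example
log_norm = np.log1p(counts / umi_count[:, None] * total)

# The scaling target can be guessed from one cell by rounding to a power of 10.
inferred_total = 10 ** round(float(np.log10(np.expm1(log_norm[0]).sum())))

recovered = recover_counts(log_norm, umi_count, inferred_total)
print(inferred_total, recovered)
```

Rounding the inferred total to a power of 10 makes the guess robust to floating-point error, but it only works when the true target really is a power of 10.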

We rank all genes so that the top 1000 highly variable ones can be kept, following the procedure Lopez et al. used in the DCDFG repo (to stay as close to their processing as possible). Please note that the top-N HVG procedure actually retains fewer than 1000 genes (see the code in this section for more details). We also split the dataset into 3 subsets according to the experimental culture condition. Each version 1 dataset ends up with ~60000 instances and ~960 genes.
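
As a rough illustration of the numeric regime encoding and the per-condition split, here is a small sketch on a toy metadata table. The condition labels and target genes are placeholders chosen for the example and may not match the labels in the real metadata.

```python
import numpy as np
import pandas as pd

# Toy cell metadata; condition names and targets are hypothetical.
obs = pd.DataFrame({
    "condition": ["Control", "IFNg", "Control", "Co-culture", "IFNg"],
    "targets":   ["", "B2M", "JAK1", "B2M", ""],
})

# np.unique sorts the distinct target strings and returns, for each cell,
# the position of its target string in that sorted list.
labels, codes = np.unique(obs["targets"].to_numpy(), return_inverse=True)
obs["regimes"] = codes

# One sub-table per culture condition.
subsets = {condition: df for condition, df in obs.groupby("condition")}
print(labels, obs["regimes"].tolist(), sorted(subsets))
```

Because the empty string sorts first, unperturbed cells receive regime 0 in this toy example.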

### Load expression data & set up cell metadata (including `sgRNA`)

In [None]:
if os.path.exists(rawH5ADPath):
    expression_quantified = sc.read_h5ad(rawH5ADPath)
else:
    expression_quantified = sc.read_csv(rawDataPath)
    expression_quantified.write_h5ad(rawH5ADPath, compression="gzip")

In [None]:
expression_quantified.X = sp.sparse.csr_matrix(expression_quantified.X)
expression_quantified = expression_quantified.T

In [None]:
cell_meta  = pd.read_csv(cellMetaPath)
cell_meta2 = pd.read_csv(cellConditionPath, skiprows=[1])
cell_meta  = cell_meta.merge(cell_meta2, left_on="Cell", right_on="NAME")
cell_meta.index = cell_meta["Cell"].tolist()
expression_quantified.obs = cell_meta

In [None]:
guides = functools.reduce(lambda a,b: a.union(b), [set(str(s).split(",")) for s in expression_quantified.obs["sgRNAs"]])
def get_target(guide):
    return re.sub("_[0-9]*$", "", guide) 
def is_control(target):
    return "SITE" in target
human_tfs = pd.read_csv(humanTFPath)
human_tfs = human_tfs.loc[human_tfs["Is TF?"]=="Yes",:]
np.array(list(set([get_target(g) for g in guides]).intersection(human_tfs["HGNC symbol"])))

In [None]:
guide_info = pd.read_csv(cellMetaPath, index_col=0)
guide_info = guide_info.replace(np.nan, '', regex=True)
expression_quantified.obs["sgRNAs"] = guide_info["sgRNAs"].astype(str)

### Adjust back to raw count

In [None]:
expression_quantified.obs["MOI"] = expression_quantified.obs["MOI"].astype(np.int32)
expression_quantified.obs["UMI_count"] = expression_quantified.obs["UMI_count"].astype(np.double)

# de-normalize and round to the nearest integer
def recover_raw_counts(adata, total = None):
    if total is None:
        total = np.expm1(adata.X[0, :]).sum()
        total = 10**round(np.log10(total))
        print(f"Data appear to be scaled to a total of {total}")
    norm_factor = adata.obs["UMI_count"].values / total
    Z = sp.sparse.diags(norm_factor).dot(np.expm1(adata.X))
    fraction_far_off = np.mean(np.abs(Z.data - np.rint(Z.data)) > 0.01)
    assert fraction_far_off < 0.05, f"Reverse transform returned non-integer values in {fraction_far_off} proportion of values."
    Z.data = np.rint(Z.data)
    return anndata.AnnData(X=Z, obs = adata.obs, var = adata.var)

expression_quantified.X = recover_raw_counts(expression_quantified).X

### Filter out genes and cells w/ low count 

In [None]:
sc.pp.filter_cells(expression_quantified, min_genes=500)
sc.pp.filter_genes(expression_quantified, min_cells=500)

### Label each cell with perturbations

In [None]:
def assign_targets(adata):
    # check gene sets and ensure matching with measurements
    err = 0
    ind = []
    obs_genes = {}
    unfound_genes = {}
    targets = []
    for index, row in adata.obs.iterrows():
        current_target = []
        if row["sgRNAs"] != "":
            # get all guides in cells
            sg = row["sgRNAs"].split(",")
            # get gene name by stripping guide specific info
            sg_genes = [guide.rsplit("_", maxsplit=1)[0] for guide in sg]
            for gene in sg_genes:
                if gene in adata.var.index:
                    # gene is found
                    current_target += [gene]
                    if gene not in obs_genes:
                        obs_genes[gene] = 1
                    else:
                        obs_genes[gene] += 1
                else:
                    if gene not in unfound_genes:
                        unfound_genes[gene] = 1
                    else:
                        unfound_genes[gene] += 1
        # end gene list
        targets += [",".join(current_target)]

    return targets

expression_quantified.obs["targets"]  = assign_targets(expression_quantified)
print(expression_quantified.obs["targets"].value_counts(dropna = False))
expression_quantified.obs["targets"].value_counts(dropna = False).value_counts(normalize=True).head(10)

In [None]:
expression_quantified.obs["perturbation"] = expression_quantified.obs['targets']
expression_quantified.obs["is_control"]   = [t in {"nan", ""} for t in expression_quantified.obs['targets']]
expression_quantified.obs["is_control_int"]   = expression_quantified.obs["is_control"].astype(int)
expression_quantified.obs["sgRNA"] = expression_quantified.obs.sgRNAs.apply(
    lambda x: ",".join([n for n in x.split(",") 
                        if not is_control(n) and 
                        n.split("_")[0] in expression_quantified.var.index])
)
expression_quantified.obs.loc[expression_quantified.obs["is_control"], "sgRNA"] = "is_control"
expression_quantified.obs["is_control"].value_counts(dropna = False)

In [None]:
""" 
sgRNAs: raw metadata (target gene + guide info)
sgRNA : sgRNAs with unmeasured target genes and SITEs removed; controls are labeled "is_control"
targets == perturbation: sgRNA without guide info or controls
"""
expression_quantified.obs[["sgRNAs", "sgRNA", "targets", "perturbation"]].head()

In [None]:
perturbed = expression_quantified.obs.targets.apply(lambda x: x.split(","))
perturbed = set([i for j in perturbed for i in j if i])
print(len(perturbed))
present   = perturbed & set(expression_quantified.var_names)
print(len(present  ))
present_bool = np.array([i in present for i in expression_quantified.var_names], dtype=bool)
expression_quantified.var["targeted"] = present_bool

In [None]:
perturbed_not_measured_obs = np.full(expression_quantified.n_obs, False)
for idx, targets in enumerate(expression_quantified.obs.targets):
    if expression_quantified.obs["is_control"].iloc[idx]:
        continue
    if any([target not in present for target in targets.split(",")]):
        perturbed_not_measured_obs[idx] = True
expression_quantified.obs["instance_with_perturbed_not_measured_genes"] = perturbed_not_measured_obs
print(expression_quantified)

In [None]:
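# NOTE: scanpy documents flavor='seurat_v3' as expecting raw counts, whereas the data here are
# log-normalized first; this order is kept to follow the Lopez et al. procedure.
sc.pp.normalize_total(expression_quantified, target_sum=1e5)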
sc.pp.log1p(expression_quantified)
sc.pp.highly_variable_genes(
    expression_quantified, 
    flavor='seurat_v3', 
    n_top_genes=expression_quantified.n_vars,
    span=0.2)

In [None]:
expression_quantified.obs["regimes"] = np.unique(expression_quantified.obs.targets, return_inverse=True)[1]
for condition in set(expression_quantified.obs.condition):
    subset = expression_quantified[expression_quantified.obs.condition == condition].copy()
    # Very low-expressed genes were already filtered, although subsetting may create more.
    # With min_cells=0 this call removes nothing; it only records per-gene cell counts for the subset.
    sc.pp.filter_genes(subset, min_cells=0)
    
    """ Add additional metadata to ensure passing dataset validity check """
    print(f"{condition:>15} expression_quantified ends up with {subset.n_obs:>6} instances and {subset.n_vars:>5} genes.")
    subset.obs["spearmanCorr"]          = 0.0
    subset.uns["perturbations_overlap"] = True
    subset.obs["perturbation_type"]     = "knockout" 
    subset.obs["expression_level_after_perturbation"] = [",".join(["0"]*len(targets.split(","))) for targets in subset.obs["perturbation"] ]
    subset.obs.loc[(subset.obs.is_control | subset.obs.instance_with_perturbed_not_measured_genes), 
                   "expression_level_after_perturbation"] = np.nan
    
    perturbed_genes = set.union(*[set(p.split(",")) for p in subset.obs["perturbation"]])
    perturbed_and_measured_genes = perturbed_genes.intersection(subset.var.index)
    perturbed_but_not_measured_genes = perturbed_genes.difference(subset.var.index)
    subset.uns["perturbed_and_measured_genes"]     = list(perturbed_and_measured_genes)
    subset.uns["perturbed_but_not_measured_genes"] = list(perturbed_but_not_measured_genes)
    subset.raw = recover_raw_counts(subset) # We do this one at a time to save RAM
    os.makedirs(f"perturbations/{dataset_name}_{condition}_v1", exist_ok=True)
    subset.write_h5ad(f"perturbations/{dataset_name}_{condition}_v1/test.h5ad")

### Frangieh Version 4 

This is just like version 1, but with cell cycle regressed out. Sorry the versions are out of order; this is what made sense as the project developed, and versions 2 and 3 follow shortly in this notebook.
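
To make "regressed out" concrete, the sketch below fits an ordinary least-squares model of each gene on two covariates and keeps the residuals. It is a plain NumPy illustration of the concept on simulated data, not scanpy's implementation; the function name `regress_out_covariates` and the simulated scores are hypothetical.

```python
import numpy as np

def regress_out_covariates(X, covariates):
    """Return residuals of each column of X after a linear fit on the covariates."""
    design = np.column_stack([np.ones(X.shape[0]), covariates])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ coef

rng = np.random.default_rng(0)
n_cells = 200
s_score = rng.normal(size=n_cells)     # stand-in for an S-phase score
g2m_score = rng.normal(size=n_cells)   # stand-in for a G2/M-phase score
covariates = np.column_stack([s_score, g2m_score])

# Two simulated genes, each driven by one score plus a little noise.
noise = rng.normal(scale=0.1, size=(n_cells, 2))
X = np.column_stack([2.0 * s_score, -1.0 * g2m_score]) + noise

residuals = regress_out_covariates(X, covariates)
print(np.abs(covariates.T @ residuals).max())  # close to zero
```

After the fit, the residuals are orthogonal to both scores, so whatever variation remains cannot be explained linearly by cell cycle phase.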

In [None]:
expression_quantified.obs["regimes"] = np.unique(expression_quantified.obs.targets, return_inverse=True)[1]
for condition in set(expression_quantified.obs.condition):
    subset = expression_quantified[expression_quantified.obs.condition == condition].copy()
    # Very low-expressed genes were already filtered, although subsetting may create more.
    # With min_cells=0 this call removes nothing; it only records per-gene cell counts for the subset.
    sc.pp.filter_genes(subset, min_cells=0)
    
    subset.raw = recover_raw_counts(subset) # We do this one at a time to save RAM

    # Regress out CC
    cc_genes = pd.read_csv(cellCyclePath, header = None)[0]
    sc.tl.score_genes_cell_cycle(subset, s_genes=cc_genes[:43], g2m_genes=cc_genes[43:])
    sc.pp.regress_out(subset, keys = ['S_score', 'G2M_score'])


    """ Add additional metadata to ensure passing dataset validity check """
    print(f"{condition:>15} expression_quantified ends up with {subset.n_obs:>6} instances and {subset.n_vars:>5} genes.")
    subset.obs["spearmanCorr"]          = 0.0
    subset.uns["perturbations_overlap"] = True
    subset.obs["perturbation_type"]     = "knockout" 
    subset.obs["expression_level_after_perturbation"] = [",".join(["0"]*len(targets.split(","))) for targets in subset.obs["perturbation"] ]
    subset.obs.loc[(subset.obs.is_control | subset.obs.instance_with_perturbed_not_measured_genes), 
                   "expression_level_after_perturbation"] = np.nan
    
    perturbed_genes = set.union(*[set(p.split(",")) for p in subset.obs["perturbation"]])
    perturbed_and_measured_genes = perturbed_genes.intersection(subset.var.index)
    perturbed_but_not_measured_genes = perturbed_genes.difference(subset.var.index)
    subset.uns["perturbed_and_measured_genes"]     = list(perturbed_and_measured_genes)
    subset.uns["perturbed_but_not_measured_genes"] = list(perturbed_but_not_measured_genes)
    os.makedirs(f"perturbations/{dataset_name}_{condition}_v4", exist_ok=True)
    subset.write_h5ad(f"perturbations/{dataset_name}_{condition}_v4/test.h5ad")

### Frangieh Version 2

In addition to all the processing performed for v1, we add more conservative filters, removing doublets and cells with high mitochondrial or ribosomal gene content. (However, mito and ribo genes themselves are NOT removed.)
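
The mitochondrial and ribosomal filters can be sketched without scanpy. The example below computes the two percentages by hand on a toy count matrix and applies the same cutoffs used later in this notebook (`pct_counts_mt < 10`, `pct_counts_ribo < 20`); the gene list and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy count matrix: 3 cells x 5 genes (hypothetical values).
gene_names = pd.Index(["MT-CO1", "RPS3", "RPL5", "GAPDH", "B2M"])
counts = np.array([[ 5,  5, 10, 55, 25],
                   [60,  5,  5, 20, 10],
                   [ 2, 40, 30, 20,  8]], dtype=float)

# Flag genes by name prefix, as done for var['mt'] and var['ribo'].
is_mt = gene_names.str.startswith("MT-")
is_ribo = gene_names.str.startswith("RPS") | gene_names.str.startswith("RPL")

total = counts.sum(axis=1)
qc = pd.DataFrame({
    "total_counts": total,
    "pct_counts_mt": 100 * counts[:, is_mt].sum(axis=1) / total,
    "pct_counts_ribo": 100 * counts[:, is_ribo].sum(axis=1) / total,
})

# Cells are kept or dropped; the flagged genes stay in the matrix.
keep = (qc["pct_counts_mt"] < 10) & (qc["pct_counts_ribo"] < 20)
print(qc.assign(keep=keep))
```

Only the first toy cell passes: the second is dominated by mitochondrial counts and the third by ribosomal counts.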

Additionally, many cells are infected by more than one lentivirus and therefore potentially have multiple perturbed genes. Because cells with high MOI are rare (mean MOI ~= 1.3) and each combination of targets has few replicate cells, version 2 eases the causal inference problem at hand by keeping only cells with exactly 1 predicted perturbed gene, and only perturbations with at least 50 cells (pooled across all guide RNAs).
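
A compact way to express this regime filter on a metadata table is sketched below. The helper `filter_regimes` is hypothetical and uses a tiny threshold so the toy table stays readable; it assumes multi-target cells are marked by a comma in the `perturbation` column, as in this notebook.

```python
import pandas as pd

def filter_regimes(obs, min_cells=50):
    """Keep single-target cells whose (condition, perturbation) group is large enough."""
    single_target = ~obs["perturbation"].str.contains(",", regex=False)
    group_size = (obs.groupby(["condition", "perturbation"])["perturbation"]
                     .transform("size"))
    return obs[single_target & (group_size >= min_cells)]

# Toy metadata (hypothetical): three B2M cells, one JAK1 cell, two double knockouts.
obs = pd.DataFrame({
    "condition": ["Control"] * 6,
    "perturbation": ["B2M", "B2M", "B2M", "JAK1", "B2M,JAK1", "B2M,JAK1"],
})

kept = filter_regimes(obs, min_cells=2)
print(kept)
```

Here the JAK1 cell is dropped for having too few replicates and the double knockouts are dropped for having more than one target.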

The version 2 dataset ends up with ~40k instances and ~15k genes (all ranked; the ranking was performed during version 1 processing). Thus, anyone who applies the top-N HVG procedure as Lopez et al. did will obtain roughly the same set of genes. The match is only approximate because that procedure also removes genes that are 0 across all instances, and removing instances may cause certain genes to become all 0.
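
The reason a top-N selection can return fewer than N genes is easy to see on a toy example. The sketch below assumes a per-gene rank where 0 means most variable; the helper name `top_n_expressed_genes` and the data are hypothetical and are not taken from the DCDFG code.

```python
import numpy as np

def top_n_expressed_genes(X, gene_rank, n_top):
    """Indices of the n_top best-ranked genes that are nonzero in at least one cell."""
    top = np.flatnonzero(gene_rank < n_top)      # best-ranked genes
    expressed = X[:, top].sum(axis=0) > 0        # drop genes that are all zero here
    return top[expressed]

# Toy subset: 3 cells x 5 genes; gene 2 is ranked highly but is all zero in this subset.
X = np.array([[1, 0, 0, 2, 3],
              [0, 1, 0, 0, 1],
              [4, 0, 0, 1, 0]])
gene_rank = np.array([0, 3, 1, 4, 2])

selected = top_n_expressed_genes(X, gene_rank, n_top=3)
print(selected)
```

Asking for the top 3 genes returns only 2 here, because one of the top-ranked genes has no counts left in the subset.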

#### Single-Cell Standard QC filters 

In [None]:
sc.pl.highest_expr_genes(expression_quantified, n_top=30, palette="Blues", width=.3)

In [None]:
expression_quantified.var['mt']   = expression_quantified.var_names.str.startswith(("MT-"))
expression_quantified.var['ribo'] = expression_quantified.var_names.str.startswith(("RPS","RPL"))
expression_quantified.var['mt'].sum(), expression_quantified.var['ribo'].sum()

In [None]:
sc.pp.calculate_qc_metrics(expression_quantified, qc_vars=['ribo', 'mt'], log1p=False, inplace=True)

In [None]:
axs = sc.pl.violin(expression_quantified, ['n_genes_by_counts', 
                                           'total_counts', 
                                           'pct_counts_mt', 
                                           'pct_counts_ribo', 
                                           'pct_counts_in_top_50_genes'], 
                   jitter=0.5, multi_panel=True)

In [None]:
fig, ax = plt.subplots(1,1,figsize=(2,2))
sc.pl.scatter(expression_quantified, x='total_counts', y='n_genes_by_counts', ax=ax)

In [None]:
print("Number of cells: ", expression_quantified.n_obs)

# find the 99th percentile of total counts
thresh = np.percentile(expression_quantified.obs['total_counts'], 99)
print("99th percentile: ", thresh)

In [None]:
# filter on total counts, fraction of counts in the top 50 genes, % mt, and % ribo
expression_quantified = expression_quantified[((expression_quantified.obs['total_counts']    < thresh) &
                                               (expression_quantified.obs["total_counts"]    >= 6000)  & 
                                               (expression_quantified.obs["pct_counts_in_top_50_genes"] <= 30) & 
                                               (expression_quantified.obs['pct_counts_mt']   < 10)     & 
                                               (expression_quantified.obs['pct_counts_ribo'] < 20)), :]
expression_quantified = expression_quantified.copy()
print("Number of cells: ", expression_quantified.n_obs)

In [None]:
""" To verify the outcome of filtering cells """
sc.pp.calculate_qc_metrics(expression_quantified, qc_vars=['ribo', 'mt'], percent_top=None, log1p=False, inplace=True)

In [None]:
axs = sc.pl.violin(expression_quantified, ['n_genes_by_counts', 
                                           'total_counts', 
                                           'pct_counts_mt', 
                                           'pct_counts_ribo', 
                                           'pct_counts_in_top_50_genes'], 
                   jitter=0.4, multi_panel=True)

In [None]:
fig, ax = plt.subplots(1,1,figsize=(2,2))
sc.pl.scatter(expression_quantified, x='total_counts', y='n_genes_by_counts', ax=ax)

#### Regime Filtering (more than 1 perturbed gene or fewer than 50 cells)

In [None]:
grp = expression_quantified.obs.groupby(["condition", "perturbation"])
keep = []
for k,v in grp.indices.items():
    if len(k[1].split(",")) > 1 or len(v) < 50:
        continue
    keep.append(k)
    profile = np.squeeze(np.array(expression_quantified.X[v,:].sum(axis=0)))
print(f"We are keeping {len(keep)} culture condition-target gene combinations (not distinguishing different sgRNAs)")

In [None]:
keeprow = [grp.indices[k] for k in keep]
keeprow = [i for j in keeprow for i in j]
print(f"We are keeping {len(keep)} condition-target combo, " \
      f"totaling {len(keeprow)} cells.")
expression_quantified = expression_quantified[keeprow].copy()
expression_quantified.obs["perturbation"] = expression_quantified.obs["perturbation"].astype(str)
print(expression_quantified)

In [None]:
expression_quantified
for condition in set(expression_quantified.obs.condition):
    subset = expression_quantified[expression_quantified.obs.condition == condition].copy()
    
    subset.obs["regimes"] = np.unique(subset.obs.targets, return_inverse=True)[1]    
    print(f"{condition:>15} v2 ends up with {subset.n_obs:>6} instances and {subset.n_vars:>5} genes.")
    subset.obs["spearmanCorr"]          = 0.0
    subset.uns["perturbations_overlap"] = True
    subset.obs["perturbation_type"]     = "knockout" 
    subset.obs["expression_level_after_perturbation"] = [",".join(["0"]*len(targets.split(","))) for targets in subset.obs["perturbation"] ]
    subset.obs.loc[(subset.obs.is_control | subset.obs.instance_with_perturbed_not_measured_genes), 
                   "expression_level_after_perturbation"] = np.nan

    perturbed_genes = set.union(*[set(p.split(",")) for p in subset.obs["perturbation"]])
    perturbed_and_measured_genes = perturbed_genes.intersection(subset.var.index)
    perturbed_but_not_measured_genes = perturbed_genes.difference(subset.var.index)
    subset.uns["perturbed_and_measured_genes"]     = list(perturbed_and_measured_genes)
    subset.uns["perturbed_but_not_measured_genes"] = list(perturbed_but_not_measured_genes)
    subset.raw = recover_raw_counts(subset) # We do this one at a time to save RAM
    os.makedirs(f"perturbations/{dataset_name}_{condition}_v2", exist_ok=True)
    subset.write_h5ad(f"perturbations/{dataset_name}_{condition}_v2/test.h5ad")

### Frangieh Version 3

On top of V2, cells with the same culture condition and the same guide RNA are summed (raw counts) and then normalized. Cells with the same `regimes` value but different sgRNAs remain separate, so each regime may end up with multiple meta-cells after aggregation.
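
The aggregation itself amounts to summing rows of the count matrix within each group. Below is a minimal dense sketch, with a hypothetical helper `pseudobulk_sum` and toy data; normalization and log transformation would follow as separate steps.

```python
import numpy as np
import pandas as pd

def pseudobulk_sum(counts, obs, keys):
    """Sum raw counts over cells sharing the same values of `keys`."""
    grouped = obs.groupby(keys, sort=True)
    # grouped.indices maps each key tuple to the positional indices of its cells.
    group_obs = pd.DataFrame(list(grouped.indices.keys()), columns=keys)
    summed = np.vstack([counts[idx].sum(axis=0) for idx in grouped.indices.values()])
    return summed, group_obs

# Toy data (hypothetical): 4 cells x 2 genes; two guides target the same gene.
obs = pd.DataFrame({
    "condition":    ["Control", "Control", "Control", "Control"],
    "perturbation": ["B2M", "B2M", "B2M", ""],
    "sgRNA":        ["B2M_1", "B2M_1", "B2M_2", "is_control"],
})
counts = np.array([[1, 0],
                   [2, 1],
                   [4, 4],
                   [3, 3]], dtype=float)

summed, group_obs = pseudobulk_sum(counts, obs, ["condition", "perturbation", "sgRNA"])
print(group_obs.assign(gene0=summed[:, 0], gene1=summed[:, 1]))
```

The two B2M guides produce two separate meta-cells even though they share a target, which is why a single regime can contribute several rows.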

After aggregation, the version 3 dataset ends up with ~600 instances and ~15k genes. The gene ranking from version 1 is kept, so that in v1, v2, or v3, if the top N genes are selected, the resulting list will be the same. 

In [None]:
grp = expression_quantified.obs.groupby(["condition", "perturbation", "sgRNA"])
newObs = pd.DataFrame([[k[0], k[1], k[1], k[2], 
                        k[2] == "is_control", 
                        int(k[2] == "is_control")] 
                       for k,v in grp.indices.items()],
                      columns=["condition", "perturbation", "targets", "sgRNA", "is_control", "is_control_int"])
expression_quantified.X = recover_raw_counts(expression_quantified).X
newX   = np.squeeze(np.array([
    expression_quantified.X[grp.indices[(r[0], r[1], r[3])],:].sum(axis=0).copy() 
    for idx, r in newObs.iterrows()
]))
pseudobulk = sc.AnnData(newX,
                        var=expression_quantified.var.copy(),
                        obs=newObs)
pseudobulk.raw = pseudobulk.copy()
                    

In [None]:
for condition in set(pseudobulk.obs.condition):
    subset = pseudobulk[pseudobulk.obs.condition == condition].copy()
    sc.pp.normalize_total(subset)
    sc.pp.log1p(subset)
    subset.obs["regimes"] = np.unique(subset.obs.targets, return_inverse=True)[1]    
    print(f"{condition:>15} v3 ends up with {subset.n_obs:>6} instances and {subset.n_vars:>5} genes.")
    subset.obs["spearmanCorr"]          = 0.0
    subset.uns["perturbations_overlap"] = True
    subset.obs["perturbation_type"]     = "knockout" 
    subset.obs["expression_level_after_perturbation"] = [",".join(["0"]*len(targets.split(","))) for targets in subset.obs["perturbation"] ]
    subset.obs.loc[(subset.obs.is_control),          # All instances' perturbations should be measured
                   "expression_level_after_perturbation"] = np.nan

    perturbed_genes = set.union(*[set(p.split(",")) for p in subset.obs["perturbation"]])
    perturbed_and_measured_genes = perturbed_genes.intersection(subset.var.index)
    perturbed_but_not_measured_genes = perturbed_genes.difference(subset.var.index)
    subset.uns["perturbed_and_measured_genes"]     = list(perturbed_and_measured_genes)
    subset.uns["perturbed_but_not_measured_genes"] = list(perturbed_but_not_measured_genes)

    os.makedirs(f"perturbations/{dataset_name}_{condition}_v3", exist_ok=True)
    subset.write_h5ad(f"perturbations/{dataset_name}_{condition}_v3/test.h5ad")

### Check Consistency w/ Perturbation

In [None]:
pseudobulks = [
    pseudobulk[pseudobulk.obs.condition == i].copy()
    for i in set(pseudobulk.obs.condition)]

for idx, pb in enumerate(pseudobulks):
    print(f"{pb.obs['condition'].iloc[0]:>15} condition has {pb.shape[0]} aggregated cells")

In [None]:
# If verbose is set to True, display discordant trials and their controls
for idx, pb in enumerate(pseudobulks):
    status, logFC = ingestion.checkConsistency(pb, 
                                               perturbationType="knockdown", 
                                               group=None,
                                               verbose=False) 
    pseudobulks[idx].obs["consistentW/Perturbation"] = status
    pseudobulks[idx].obs["logFC"] = logFC
    print(Counter(status))

### Check Consistency between replications

In [None]:
for idx, pb in enumerate(pseudobulks):
    correlations = ingestion.computeCorrelation(pb, verbose=True)
    pseudobulks[idx].obs["spearmanCorr"] = correlations[0]
    pseudobulks[idx].obs["pearsonCorr"] = correlations[1]

### Compute the Magnitude of Perturbation Effect

In [None]:
# Downloaded from http://humantfs.ccbr.utoronto.ca/download.php
TFList = pd.read_csv(humanTFPath, index_col=0).iloc[:, [1,3]]
TFDict = dict([tuple(i) for i in TFList.to_numpy().tolist() if i[1] == 'Yes'])

# Downloaded from https://epifactors.autosome.org/description
EpiList = pd.read_csv(humanEpiPath, index_col=0).iloc[:, [0,14]]
EpiDict = dict([tuple(i) for i in EpiList.to_numpy().tolist()])

### The plot for the figure

The chunk below produces the figure we use in the manuscript.

In [None]:
# Quantify perturbation effect size on TFs and epigenetic factors only, and on the full transcriptome
for idx, pb in enumerate(pseudobulks):
    print(pb.obs["condition"].iloc[0])
    TFVar = [i for i,p in enumerate(pb.var.index) if p in TFDict or p in EpiDict]
    pseudobulkTFOnly = pb[:, TFVar].copy()
    
    ingestion.quantifyEffect(adata=pseudobulkTFOnly, 
                             fname=perturbEffectTFOnlyPath.split(".csv")[0] + pb.obs["condition"].iloc[0] + ".csv", 
                             group=None, 
                             diffExprFC=False, 
                             withDEG=False,
                             prefix="TFOnly")
    
    ingestion.quantifyEffect(adata=pb, 
                             fname=perturbEffectFullTranscriptomePath.split(".csv")[0] + pb.obs["condition"].iloc[0] + ".csv", 
                             group=None,
                             diffExprFC=False, 
                             withDEG=False,
                             prefix="")

    listOfMetrics = ["MI", "logFCMean", "logFCNorm2", "logFCMedian"]
    for m in listOfMetrics:
        pseudobulks[idx].obs[f"TFOnly{m}"] = pseudobulkTFOnly.obs[f"TFOnly{m}"]

In [None]:
for idx, pb in enumerate(pseudobulks):
    print(pb.obs["condition"].iloc[0])
    metricOfInterest = ["logFCMean", "logFCNorm2", "logFCMedian", 
                        "TFOnlylogFCMean", "TFOnlylogFCNorm2", "TFOnlylogFCMedian"]
    ingestion.checkPerturbationEffectMetricCorrelation(pb, metrics=metricOfInterest)
    ingestion.visualizePerturbationEffect(pb, metrics=metricOfInterest, TFDict=TFDict, EpiDict=EpiDict)

In [None]:
for idx, pb in enumerate(pseudobulks):
    sc.pp.calculate_qc_metrics(pb, log1p=False, inplace=True)

In [None]:
for idx, pb in enumerate(pseudobulks):
    print(pb.obs["condition"].iloc[0])
    ingestion.visualizePerturbationMetadata(pb[pb.obs["spearmanCorr"] != -999], 
                                            x="spearmanCorr", 
                                            y="logFC", 
                                            style="consistentW/Perturbation", 
                                            hue="logFCNorm2", 
                                            markers=['o', '^', 'X'])

### Basic EDA

In [None]:
for idx, pb in enumerate(pseudobulks):
    print(pb.obs["condition"].iloc[0])
    sc.pp.calculate_qc_metrics(pb, log1p=False, inplace=True)
    sc.pp.log1p(pb)
    sc.pp.highly_variable_genes(pb, min_mean=0.2, max_mean=4, min_disp=0.2, n_bins=50)
    with warnings.catch_warnings():
        sc.tl.pca(pb, n_comps=100)
    sc.pp.neighbors(pb)
    sc.tl.umap(pb)
    clusterResolutions = []
    sc.tl.louvain(pb)
    cc_genes = pd.read_csv(cellCyclePath, header = None)[0]
    sc.tl.score_genes_cell_cycle(pb, s_genes=cc_genes[:43], g2m_genes=cc_genes[43:])
    plt.rcParams['figure.figsize'] = [6, 4.5]
    sc.pl.umap(pb, color = [
        "GAPDH",
        "louvain", 
        "is_control_int",
        "n_genes_by_counts",
        "total_counts",
        'S_score',
        'G2M_score', 
        'phase', 
        'pct_counts_in_top_50_genes', 
        "DEG",
    ])
    display(pb.obs)