### FANTOM4 THP-1 data

**DESCRIPTION IN PROGRESS**

This notebook prepares a dataset with 24 individual knockout experiments applied to CD4 T cells ([Freimer et al. 2022](https://www.nature.com/articles/s41588-022-01106-y)). Each knockout was profiled with both ATAC-seq and RNA-seq, but we use only the RNA data. The data have UMIs. Controls are 8 guide RNAs targeting the "safe-harbor" AAVS1 locus, labeled `AAVS1_1` through `AAVS1_8`. The experiment was done separately on blood from 3 different donors.

Here we tidy the dataset and carry out a simple exploration in scanpy. (This is not single-cell data, but scanpy is still useful for data exploration.)

In [1]:
import warnings
warnings.filterwarnings('ignore')
import regex as re
import os
import shutil
import importlib
import sys
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import celloracle as co
from scipy.stats import spearmanr
from IPython.display import display, HTML
# Local helper module, reloaded so that edits to it are picked up
sys.path.append("setup")
import ingestion
importlib.reload(ingestion)

# Visualization settings
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# Specify the working directory explicitly.
os.chdir("/home/ekernf01/Desktop/jhu/research/projects/perturbation_prediction/cell_type_knowledge_transfer/perturbations/")

### Reshape the data

In [5]:
expression_quantified = pd.read_csv("not_ready/fantom4/E-GEAD-547.raw/BCL6_s2_lot2.txt", 
                              delimiter="\t",
                              index_col=0, 
                              header=0, 
                              comment = '!')  
                              
expression_quantified 
# gene_metadata   = expression_quantified.iloc[:,0:5]
# expression_quantified = expression_quantified.iloc[:, 5:].T
# sample_metadata = pd.DataFrame(columns = ["donor", "perturbation"], 
#                                index = expression_quantified.index,
#                                data = [g.split("_", maxsplit=2)[1:3] for g in expression_quantified.index])
# print("\n\ngene_metadata\n")
# display(gene_metadata.head())
# print("\n\nsample_metadata\n")
# display(sample_metadata.head())
# print("\n\n expression_quantified\n")
# display(expression_quantified.head().T.head())

| ProbeID | TargetID | Signal-BCL6-2 | Detection-BCL6-2 |
|---|---|---|---|
| 6960451 | ILMN_10000 | 101.9 | 0.98981 |
| 2850504 | ILMN_100000 | 61.1 | 0.35298 |
| 7560397 | ILMN_100007 | 58.5 | 0.20888 |
| 5360747 | ILMN_100009 | 59.5 | 0.26274 |
| 2600731 | ILMN_10001 | 645.4 | 1.00000 |
| ... | ... | ... | ... |
| 1780441 | ILMN_99987 | 63.4 | 0.45560 |
| 5490379 | ILMN_9999 | 126.9 | 0.99563 |
| 1990209 | ILMN_99990 | 66.4 | 0.61135 |
| 1090736 | ILMN_99995 | 70.8 | 0.78020 |
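The table above stores the sample label inside the column names (`Signal-BCL6-2`, `Detection-BCL6-2`). As a minimal sketch of one way to tidy such a table, the hypothetical helper `tidy_probe_table` below moves the label into its own column. It assumes every file has exactly one `Signal-<sample>` column and a matching `Detection-<sample>` column, which has only been seen for the one file read above.

```python
import pandas as pd

def tidy_probe_table(table: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rename 'Signal-<sample>' and 'Detection-<sample>'
    to 'signal' and 'detection', and record <sample> in a 'sample' column."""
    signal_cols = [c for c in table.columns if c.startswith("Signal-")]
    if len(signal_cols) != 1:
        raise ValueError("expected exactly one 'Signal-<sample>' column")
    sample = signal_cols[0][len("Signal-"):]
    # rename returns a new DataFrame, so the input table is left untouched
    tidy = table.rename(columns={
        f"Signal-{sample}": "signal",
        f"Detection-{sample}": "detection",
    })
    tidy["sample"] = sample
    return tidy[["TargetID", "sample", "signal", "detection"]]

# Toy input copied from the first two rows of the table above
toy = pd.DataFrame(
    {
        "TargetID": ["ILMN_10000", "ILMN_100000"],
        "Signal-BCL6-2": [101.9, 61.1],
        "Detection-BCL6-2": [0.98981, 0.35298],
    },
    index=pd.Index([6960451, 2850504], name="ProbeID"),
)
tidy = tidy_probe_table(toy)
print(tidy)
```

With the sample label in a column, tables from several files could be concatenated and pivoted into a samples-by-probes matrix, provided the other files follow the same layout.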


### Combine into anndata to keep everything together

In [3]:
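# NOTE: gene_metadata and sample_metadata are defined only in the commented-out
# lines of the reshaping cell above, so this cell raises a NameError until
# those lines are restored.
expression_quantified = sc.AnnData(expression_quantified, 
                             var = gene_metadata.copy(),
                             obs = sample_metadata.copy())
# Drop the standalone copies now that they live in the AnnData object (DRY)
del gene_metadata
del sample_metadata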

### Convert Ensembl gene IDs to gene symbols

In [None]:
expression_quantified.var_names = ingestion.convert_ens_to_symbol(expression_quantified.var_names, gtf = "../accessory_data/gencode.v35.annotation.gtf.gz")
display(expression_quantified.var.head())
display(expression_quantified.var_names[0:5])
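The conversion above relies on the local `ingestion.convert_ens_to_symbol` helper, whose implementation is not shown here. As a rough sketch of how such a mapping could be built from GTF lines, and not a description of what the helper actually does, the code below reads `gene_id` and `gene_name` from the attribute column of `gene` records. The function names, the toy GTF lines, and the choices to strip version suffixes and to leave unmapped IDs unchanged are all assumptions.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')

def ensembl_to_symbol_map(gtf_lines):
    """Build {Ensembl gene ID without version: gene symbol} from GTF lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # use only gene-level records
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            mapping[gene_id.group(1).split(".")[0]] = gene_name.group(1)
    return mapping

def convert_ids(ids, mapping):
    """Replace each ID by its symbol; IDs without a match are kept unchanged."""
    return [mapping.get(i.split(".")[0], i) for i in ids]

# Toy GTF lines with made-up identifiers (not real annotation records)
toy_gtf = [
    "##description: toy annotation\n",
    'chr1\tTOY\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG00000000001.1"; gene_name "GENE_A";\n',
    'chr1\tTOY\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG00000000001.1"; gene_name "GENE_A";\n',
    'chr1\tTOY\tgene\t200\t300\t.\t-\t.\tgene_id "ENSG00000000002.3"; gene_name "GENE_B";\n',
]
mapping = ensembl_to_symbol_map(toy_gtf)
converted = convert_ids(["ENSG00000000001", "ENSG00000000002.3", "ENSG_UNKNOWN"], mapping)
print(converted)
```

For a compressed file such as the `gencode.v35.annotation.gtf.gz` path used above, the lines could be read with `gzip.open(path, "rt")`.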

In [None]:
# Document controls with weird names
controls = [f"AAVS1_{i}" for i in range(1,9)]
for c in controls:
    assert c in expression_quantified.obs['perturbation'].unique() 
expression_quantified.obs["is_control"] = expression_quantified.obs['perturbation'].isin(controls).astype(int)
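The control check above stops at the first missing label. A small variation, sketched here on a toy table, collects every missing control label before deciding whether to fail. The perturbation names in `toy_obs` are placeholders chosen for illustration, not the dataset's actual sample annotations.

```python
import pandas as pd

controls = [f"AAVS1_{i}" for i in range(1, 9)]

# Toy sample annotations; the real ones live in expression_quantified.obs
toy_obs = pd.DataFrame(
    {"perturbation": ["AAVS1_1", "IL2RA", "AAVS1_8", "CTLA4"]}
)

# List every expected control label that is absent, instead of
# stopping at the first one as a loop of assertions would
missing_controls = sorted(set(controls) - set(toy_obs["perturbation"]))
print("Missing control labels:", missing_controls)

# Flag control samples with 1 and all other samples with 0
toy_obs["is_control"] = toy_obs["perturbation"].isin(controls).astype(int)
print(toy_obs)
```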

In [None]:
sc.pp.normalize_total(expression_quantified, target_sum=1e4)
sc.pp.log1p(expression_quantified)
sc.pp.highly_variable_genes(expression_quantified, min_mean=0.2, max_mean=4, min_disp=0.2, n_bins=50)
sc.pl.highly_variable_genes(expression_quantified)
with warnings.catch_warnings():
    sc.tl.pca(expression_quantified, n_comps=5)
sc.pp.neighbors(expression_quantified)
sc.tl.umap(expression_quantified)
sc.tl.leiden(expression_quantified)
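To make the first two steps of the cell above concrete, here is a NumPy sketch of what total-count normalization with `target_sum=1e4` followed by `log1p` computes on a samples-by-genes matrix. It is an illustration on toy numbers, not scanpy's implementation, and it assumes a dense matrix of non-negative values with no all-zero rows.

```python
import numpy as np

def normalize_and_log(counts, target_sum=1e4):
    """Scale each row (sample) to sum to target_sum, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("a sample with zero total cannot be rescaled")
    return np.log1p(counts * (target_sum / totals))

# Two toy samples (rows) and three toy genes (columns)
toy_counts = np.array([[10, 30, 60],
                       [1, 1, 2]])
normalized = normalize_and_log(toy_counts)

# Undoing the log shows that every sample now sums to the target
print(np.expm1(normalized).sum(axis=1))
```

After this step, samples with different total counts are on a common scale, so the highly variable gene selection and PCA that follow are not dominated by differences in sequencing depth.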

In [None]:
sc.pl.umap(expression_quantified, color = ["IL2RA", "IL2", "CTLA4", "leiden", "donor", "is_control", "perturbation"])
# Due to the small number of samples, ask CellOracle (CO) to use only one cluster.
# This requires setting some other undocumented parts of the object state. :(
expression_quantified.obs["fake_cluster"]="all_one_cluster"
expression_quantified.obs.fake_cluster = expression_quantified.obs.fake_cluster.astype("category")
expression_quantified.uns["fake_cluster_colors"] = ['#1f77b4']

### Data reduction

With only 64 GB of RAM, I have been unable to make whole-transcriptome predictions with CellOracle, so a data reduction step is necessary: only highly variable genes are included. We also keep every perturbed gene, whether or not it appears highly variable, unless it was not measured in the first place.
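The selection rule can be sketched with plain Python sets. The helper name `choose_genes_to_keep` and the toy gene names are hypothetical; the real inputs come from the AnnData object.

```python
def choose_genes_to_keep(measured, highly_variable, perturbations, controls):
    """Keep highly variable genes plus perturbed genes that were measured.

    Returns three sorted lists: genes to keep, perturbed and measured genes,
    and perturbed genes that were not measured.
    """
    measured = set(measured)
    # Control labels are not genes, so remove them from the perturbed set
    perturbed = set(perturbations) - set(controls)
    perturbed_and_measured = perturbed & measured
    perturbed_but_not_measured = perturbed - measured
    keep = (set(highly_variable) & measured) | perturbed_and_measured
    return (
        sorted(keep),
        sorted(perturbed_and_measured),
        sorted(perturbed_but_not_measured),
    )

# Toy example with placeholder gene names
keep, kept_perturbed, unmeasured = choose_genes_to_keep(
    measured=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    highly_variable=["GENE_A", "GENE_B"],
    perturbations=["AAVS1_1", "GENE_C", "GENE_E"],
    controls=["AAVS1_1"],
)
print(keep)            # kept because highly variable or perturbed
print(kept_perturbed)  # perturbed and measured
print(unmeasured)      # perturbed but absent from the measurements
```

Sorting the results is a design choice made here so that the gene order is reproducible from run to run, which iterating over a set does not guarantee.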

In [None]:
perturbed_genes = set(expression_quantified.obs['perturbation'].unique()).difference(controls)
perturbed_and_measured_genes = perturbed_genes.intersection(expression_quantified.var.index)
perturbed_but_not_measured_genes = perturbed_genes.difference(expression_quantified.var.index)
genes_keep = expression_quantified.var.index[expression_quantified.var['highly_variable']]
genes_keep = set(genes_keep).union(perturbed_and_measured_genes)
expression_quantified_orig = expression_quantified.copy()
print("These genes were perturbed but not measured:")
print(perturbed_but_not_measured_genes)
print("This many genes (highly variable plus perturbed) will be kept and used by CO:")
print(len(genes_keep))

In [None]:
# final form, ready to save
# Copy the subset so that the .uns entries below are set on a real object, not a view
expression_quantified = expression_quantified_orig[:,list(genes_keep)].copy()
expression_quantified.uns["perturbed_and_measured_genes"]     = list(perturbed_and_measured_genes)
expression_quantified.uns["perturbed_but_not_measured_genes"] = list(perturbed_but_not_measured_genes)

In [None]:
os.makedirs("perturbations/freimer", exist_ok = True)
expression_quantified.write_h5ad("perturbations/freimer/test.h5ad")