### PSC overexpression example

This notebook prepares a dataset of hundreds of individual overexpression experiments in pluripotent stem cells ([Nakatake et al. 2020](https://www.sciencedirect.com/science/article/pii/S2211124720306082)). The dataset is meant to be an easy starting point: the time scale (48 hours) is short, the cell state (pluripotency) is well studied, and the perturbations are numerous (714 genes, including 481 TFs). The dataset is also small (~1k samples), so testing and debugging are fast.

The data contain both microarray and RNA-seq measurements, but the authors have already integrated them using a strategy akin to quantile normalization. Missing values are marked -9999; they include genes absent from the microarrays and outlying measurements censored by the dataset's creators. For now, missing values are filled in with the control's expression. There are three types of negative control *samples*, labeled "Control", "Emerald", and "CAG-rtTA35-IH". We mostly use "Control".
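
The fill-in step described above can be sketched on a toy matrix. This is a minimal illustration rather than the notebook's actual code: the array values, the `SENTINEL` name, and the choice of row 0 as the control are all assumptions made for the example.

```python
import numpy as np

SENTINEL = -9999  # value the dataset uses to mark missing entries

# Toy expression matrix: rows are samples, columns are genes.
# Row 0 plays the role of the control sample (an assumption for this sketch).
expr = np.array([
    [5.0, 2.0, 7.0],
    [6.0, SENTINEL, 8.0],
    [SENTINEL, 3.0, SENTINEL],
])

control = expr[0, :]
missing = expr == SENTINEL

# Wherever a value is missing, substitute the control's value for that gene.
# Broadcasting stretches the 1-D control row across all samples.
filled = np.where(missing, control, expr)
```

If the control row itself contained the sentinel, those entries would stay missing, so it is worth checking `(filled == SENTINEL).any()` afterwards.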

Here we tidy the PSC overexpression dataset and carry out a simple exploration in scanpy. (The data are not single-cell, but scanpy is a useful toolkit for any transcriptomics dataset with many samples.)

In [None]:
import warnings
warnings.filterwarnings('ignore')
import re
import os
import shutil
import sys
import importlib
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg')
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
from scipy.stats import spearmanr
from scipy.stats import pearsonr
import itertools as it
import anndata

from scipy.stats import f_oneway
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.oneway import anova_oneway
from sklearn.metrics import mutual_info_score
import time
from collections import Counter


# Local helper module (sys and importlib are already imported above)
sys.path.append("setup")
import ingestion
importlib.reload(ingestion)

# Visualization settings
plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# I prefer to specify the working directory explicitly.
os.chdir("/home/ekernf01/Desktop/jhu/research/projects/perturbation_prediction/cell_type_knowledge_transfer/perturbation_data")

# Universal
geneAnnotationPath = "../accessory_data/gencode.v35.annotation.gtf.gz"       # Downloaded from https://www.gencodegenes.org/human/release_35.html
humanTFPath =  "../accessory_data/humanTFs.csv"                              # Downloaded from http://humantfs.ccbr.utoronto.ca/download.php
humanEpiPath = "../accessory_data/epiList.csv"                               # Downloaded from https://epifactors.autosome.org/description 

# Nakatake Specific
rawDataPath               = "not_ready/ko_esc/CREST_06162021.txt"
nakatakeSupplemental1Path = "not_ready/ko_esc/nakatakeSupplemental1.csv"    # https://ars.els-cdn.com/content/image/1-s2.0-S2211124720306082-mmc2.xlsx                                                
nakatakeSupplemental3Path = "not_ready/ko_esc/nakatakeSupplemental3.csv"    # https://ars.els-cdn.com/content/image/1-s2.0-S2211124720306082-mmc4.xlsx
perturbEffectTFOnlyPath            = "not_ready/ko_esc/nakatakeTFOnly.csv"             # additional output
perturbEffectFullTranscriptomePath = "not_ready/ko_esc/nakatakeFullTranscriptome.csv"  # additional output
finalDataFileFolder = "perturbations/nakatake"
finalDataFilePath   = "perturbations/nakatake/test.h5ad"

### Reshape the data

In [None]:
expression_quantified = pd.read_csv(rawDataPath, 
                                    delimiter="\t",
                                    index_col=0, 
                                    header=0, 
                                    comment = '!') 

In [None]:
gene_metadata   = expression_quantified.iloc[:,-4:]
expression_quantified = expression_quantified.iloc[:, 0:-4].T
# This gene is called TBXT in the variable names, but its OE samples use the name T.
expression_quantified.rename(index={'T':'TBXT', "T.1":"TBXT.1"}, inplace=True)
# Strip the ".1", ".2", ... suffixes that pandas appends to duplicated column names,
# leaving the perturbation label shared by all replicates.
sample_metadata = pd.DataFrame(columns = ["perturbation"], 
                               index = expression_quantified.index,
                               data = [re.sub(r"\.\d+$", "", g) for g in expression_quantified.index])

expression_quantified = sc.AnnData(expression_quantified, 
                                   var = gene_metadata,
                                   obs = sample_metadata)
expression_quantified.raw = expression_quantified.copy()

In [None]:
# Document controls with unusual names:
#   Emerald       : transgene with fluorophore only
#   Control       : median gene expression
#   CAG-rtTA35-IH : hESC cell line
controls = ("Emerald", "Control", "CAG-rtTA35-IH")
for c in controls:
    assert c in sample_metadata['perturbation'].unique() 
expression_quantified.obs["is_control"] = expression_quantified.obs['perturbation'].isin(controls)

### Count and Impute missing entries

Most but not all of the missingness is due to microarrays capturing fewer genes than RNA-seq.

Some is due to Nakatake et al. removing outliers. 
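
The two sources would be expected to leave different footprints: a gene absent from a platform is missing in many samples at once, whereas censored outliers are scattered. A toy sketch of the per-sample and per-gene tallies follows; the matrix contents are hypothetical.

```python
import numpy as np

SENTINEL = -9999

# Hypothetical matrix: 4 samples x 3 genes. Gene 2 is absent in samples 0-2
# (as if those were microarray samples), and gene 0 has one censored value.
expr = np.array([
    [1.0, 2.0, SENTINEL],
    [1.5, 2.5, SENTINEL],
    [SENTINEL, 2.2, SENTINEL],
    [1.2, 2.1, 3.0],
])

missing = expr == SENTINEL
fraction_missing_per_sample = missing.mean(axis=1)  # one value per row
fraction_missing_per_gene = missing.mean(axis=0)    # one value per column
```

Here gene 2 is missing in three of four samples, while gene 0 is missing in only one.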

In [None]:
""" Two controls have identical expression levels except for 
genes that are missing in the microarrays. """
plt.figure(figsize=(3,3))
controlExpr = expression_quantified.X[expression_quantified.obs.perturbation == "Control" ,:]
controlExpr = controlExpr[:, ~(controlExpr[1,:] == -9999)]
plt.scatter(controlExpr[0,:], controlExpr[1,:], s=1)
plt.title("Median of Expr")
plt.show()

plt.figure(figsize=(3,3))
controlExpr = expression_quantified.X[expression_quantified.obs.perturbation == "Emerald" ,:]
controlExpr = controlExpr[:, ~(controlExpr[1,:] == -9999)]
plt.scatter(controlExpr[0,:], controlExpr[1,:], s=1)
plt.title("Emerald")
plt.show()

In [None]:
missing = expression_quantified.X==-9999
expression_quantified.obs["fraction_missing"] = missing.mean(axis=1)
expression_quantified.var["fraction_missing"] = missing.mean(axis=0)
# Fill each missing entry with the value of the same gene in the sample named "Control".
# The index comparison must match exactly one sample for the assignment below to broadcast.
controlIndex = expression_quantified.obs.index=="Control"
for i in range(expression_quantified.n_obs):
    missing_i = missing[i,:]
    expression_quantified.X[i,missing_i] = expression_quantified.X[controlIndex,missing_i]

In [None]:
# Tabulate how many clones (or genes) share each missing fraction.
display(expression_quantified.obs.fraction_missing
        .value_counts()
        .rename_axis("Fraction of genes missing")
        .to_frame("Number of clones"))

display(expression_quantified.var.fraction_missing
        .value_counts()
        .rename_axis("Fraction of clones missing")
        .to_frame("Number of genes"))

In [None]:
""" Sanity Check:
The sum of gene expression before and after normalization """
fig, axes = plt.subplots(1, 2, figsize=(8,3))
axes[0].hist(expression_quantified.X.sum(axis=1), bins=100, log=False, label="before DESeq2 norm")
axes[1].hist(ingestion.deseq2Normalization(expression_quantified.X.T).T.sum(axis=1), bins=100, log=False, label="after DESeq2 norm")
axes[0].legend()
axes[1].legend()
plt.show()

### Normalization on bulk 
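
`ingestion.deseq2Normalization` is a local helper whose source is not shown here. Assuming it follows the median-of-ratios size-factor method associated with DESeq2, the idea can be sketched as below. The function name `median_of_ratios_normalize` and the samples-by-genes orientation are choices made for this illustration. The notebook transposes before and after calling the helper, which suggests the helper itself expects genes by samples.

```python
import numpy as np

def median_of_ratios_normalize(counts):
    """Scale each sample (row) by a median-of-ratios size factor.

    counts: 2-D array, samples x genes, non-negative values.
    """
    counts = np.asarray(counts, dtype=float)
    # Use only genes that are positive in every sample, so logs are finite.
    usable = (counts > 0).all(axis=0)
    log_counts = np.log(counts[:, usable])
    # Per-gene log geometric mean across samples: the pseudo-reference.
    log_reference = log_counts.mean(axis=0)
    # Size factor = median ratio of a sample to the reference.
    size_factors = np.exp(np.median(log_counts - log_reference, axis=1))
    return counts / size_factors[:, None], size_factors

# Sample 1 is sample 0 measured at twice the depth.
toy = np.array([[10.0, 20.0, 30.0],
                [20.0, 40.0, 60.0]])
normalized, size_factors = median_of_ratios_normalize(toy)
```

After scaling, the two toy samples have identical profiles, which is the behavior the before/after histograms in the sanity check are meant to reveal at the level of per-sample totals.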

In [None]:
expression_quantified.X = ingestion.deseq2Normalization(expression_quantified.X.T).T

### Check Gene Expr Consistency, Replication Consistency
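
`ingestion.checkConsistency` is a local helper not shown in this notebook. Given the `perturbationType="overexpression"` argument and the later filter on `logFC > 0`, one plausible reading is that it asks whether the targeted gene is higher in the perturbed sample than in the control. The sketch below illustrates that idea only; `target_log_fc`, the pseudocount, and the toy values are hypothetical and may differ from what the helper does.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized expression: rows are samples, columns are genes.
# Each non-control sample is named after the gene it overexpresses.
expr = pd.DataFrame(
    [[100.0, 50.0], [400.0, 55.0], [90.0, 25.0]],
    index=["Control", "SOX17", "MYOD1"],
    columns=["SOX17", "MYOD1"],
)

def target_log_fc(expr, sample, control="Control", pseudocount=1.0):
    """log2 fold change of the targeted gene in `sample` relative to control."""
    treated = expr.loc[sample, sample] + pseudocount
    baseline = expr.loc[control, sample] + pseudocount
    return float(np.log2(treated / baseline))

log_fc = {s: target_log_fc(expr, s) for s in ["SOX17", "MYOD1"]}
# For overexpression, a consistent sample shows the target going up.
consistent = {s: fc > 0 for s, fc in log_fc.items()}
```

In this toy table the SOX17 sample is consistent with overexpression and the MYOD1 sample is not.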

In [None]:
# If verbose is set to True, display discordant trials and their controls
status, logFC, pval = ingestion.checkConsistency(
   expression_quantified, 
   perturbationType="overexpression", 
   group=None,
   verbose=False,
   do_return_pval = True) 
expression_quantified.obs["consistentW/Perturbation"] = status
expression_quantified.obs["logFC"] = logFC
expression_quantified.obs["pval"] = pval
Counter(status)

In [None]:
TFqPCR = set(pd.read_csv(nakatakeSupplemental3Path)['TF'])
expression_quantified.obs['qPCRExamined'] = expression_quantified.obs.perturbation.isin(TFqPCR)

In [None]:
correlations = ingestion.computeCorrelation(expression_quantified, verbose=True)
expression_quantified.obs["spearmanCorr"] = correlations[0]
expression_quantified.obs[ "pearsonCorr"] = correlations[1]

In [None]:
"""Downloaded from http://humantfs.ccbr.utoronto.ca/download.php """
TFList = pd.read_csv(humanTFPath, index_col=0).iloc[:, [1,3]]
TFDict = dict([tuple(i) for i in TFList.to_numpy().tolist() if i[1] == 'Yes'])

"""Downloaded from https://epifactors.autosome.org/description """
EpiList = pd.read_csv(humanEpiPath, index_col=0).iloc[:, [0,14]]
EpiDict = dict([tuple(i) for i in EpiList.to_numpy().tolist()])

"""Downloaded from https://ars.els-cdn.com/content/image/1-s2.0-S2211124720306082-mmc2.xlsx """
annotation = pd.read_csv(nakatakeSupplemental1Path).iloc[:, [0,1]]
annotation = dict([tuple(i) for i in annotation.to_numpy().tolist()])

In [None]:
""" To look at the effect magnitude of each perturbation on TFs and epigenetic factors only """
TFVar = [i for i,p in enumerate(expression_quantified.var.index) if p in TFDict or p in EpiDict]
expression_quantifiedTFOnly = expression_quantified[:, TFVar].copy()
ingestion.quantifyEffect(adata=expression_quantifiedTFOnly, 
                         fname=perturbEffectTFOnlyPath, 
                         group=None, 
                         diffExprFC=True, 
                         prefix="TFOnly")

In [None]:
""" To look at the effect magnitude of each perturbation on the entire transcriptome """
ingestion.quantifyEffect(adata=expression_quantified, 
                         fname=perturbEffectFullTranscriptomePath, 
                         group=None, 
                         diffExprFC=True, 
                         prefix="")

listOfMetrics = ["DEG", "MI", "logFCMean", "logFCNorm2", "logFCMedian"]
for m in listOfMetrics:
    expression_quantified.obs[f"TFOnly{m}"] = expression_quantifiedTFOnly.obs[f"TFOnly{m}"]

In [None]:
metricOfInterest = ["DEG", "logFCNorm2", "TFOnlyDEG", "TFOnlylogFCNorm2"]
ingestion.checkPerturbationEffectMetricCorrelation(expression_quantified, metrics=metricOfInterest)

In [None]:
ingestion.visualizePerturbationEffect(expression_quantified, metrics=metricOfInterest, TFDict=TFDict, EpiDict=EpiDict)

### The plot for the figure

The chunk below produces the figure used in the manuscript.

In [None]:
temp = expression_quantified.copy()
ingestion.visualizePerturbationMetadata(temp,
                                        x="spearmanCorr", 
                                        y="logFC", 
                                        style="consistentW/Perturbation", 
                                        hue="logFCNorm2", 
                                        markers=['o', '^'], 
                                        xlim=[-0.2, 1])
os.makedirs(finalDataFileFolder, exist_ok=True)  # the output folder may not exist yet
plt.savefig(finalDataFileFolder + "/qc.pdf")

### Basic EDA

In [None]:
sc.pp.log1p(expression_quantified)
sc.pp.highly_variable_genes(expression_quantified, n_bins=50, n_top_genes = expression_quantified.var.shape[0], flavor = "seurat_v3" )
sc.pl.highly_variable_genes(expression_quantified)
with warnings.catch_warnings():
    sc.tl.pca(expression_quantified, n_comps=100)
sc.pp.neighbors(expression_quantified)
sc.tl.umap(expression_quantified)
sc.tl.louvain(expression_quantified)
sc.pl.umap(expression_quantified, color = ["NEUROG1", "SOX17", "POU5F1", "MYOD1", "fraction_missing"])

### Final decision on filtering

Require a positive log fold change that is significant at p<0.1. Samples whose logFC equals the sentinel value -999 are also kept.
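
On a toy metadata table the rule looks like this. The column names mirror the notebook's `obs` columns, but the values are made up. Reading `-999` as a marker for samples that could not be evaluated (such as controls) is an assumption based on the filter in the following cell.

```python
import pandas as pd

obs = pd.DataFrame(
    {
        "perturbation": ["SOX17", "MYOD1", "GATA4", "Control"],
        "logFC": [2.1, -0.3, 0.4, -999.0],
        "pval": [0.01, 0.02, 0.50, float("nan")],
    }
)

NOT_EVALUATED = -999  # assumed sentinel for samples without a testable logFC

# Keep samples that went up significantly, plus those that were not evaluated.
passes_qc = (obs["logFC"] > 0) & (obs["pval"] < 0.1)
not_evaluated = obs["logFC"] == NOT_EVALUATED
kept = obs[passes_qc | not_evaluated]
```

Only the SOX17 and Control rows survive: MYOD1 fails on the sign of its fold change and GATA4 on its p-value.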

In [None]:
expression_quantified.obs["logFC>0"] = expression_quantified.obs["logFC"]>0
expression_quantified.obs["pval<0.1"] = expression_quantified.obs["pval"]<0.1
print(expression_quantified.obs[["logFC>0", "pval<0.1"]].value_counts())
expression_quantified = expression_quantified[
    ( ( expression_quantified.obs.logFC > 0 ) & ( expression_quantified.obs.pval < 0.1 ) ) |
    ( expression_quantified.obs.logFC == -999 ),
    :
    ].copy()

In [None]:
perturbed_genes = set(expression_quantified.obs['perturbation'].unique()).difference(controls)
perturbed_and_measured_genes = perturbed_genes.intersection(expression_quantified.var.index)
perturbed_but_not_measured_genes = perturbed_genes.difference(expression_quantified.var.index)
print("These genes were perturbed but not measured:")
print(perturbed_but_not_measured_genes)
expression_quantified.uns["perturbed_and_measured_genes"] = list(perturbed_and_measured_genes)
expression_quantified.uns["perturbed_but_not_measured_genes"] = list(perturbed_but_not_measured_genes)
expression_quantified = ingestion.describe_perturbation_effect(expression_quantified, "overexpression")
expression_quantified

In [None]:
print(expression_quantified)

In [None]:
os.makedirs(finalDataFileFolder, exist_ok=True)
expression_quantified.write_h5ad(finalDataFilePath)