# Immune Cell Deconvolution

## Where is the data?

In [1]:
input_path = '../Data/Processed_Data/'
output_path = '../Data/Processed_Data/Cell_Deconvolution/'

## Load AML Dataset

In [2]:
import pandas as pd

x = pd.read_pickle(input_path+'x.pkl')
y = pd.read_csv(input_path+'y.csv', index_col=0)

print(f'Dataset (x) contains {x.shape[0]} rows (samples) and {x.shape[1]} columns (CpG probes).')

Dataset (x) contains 1346 rows (samples) and 310545 columns (CpG probes).


## Train-Test Split

To avoid data leakage and keep the validation cohort (test dataset) as independent as possible, we split the data by clinical trial rather than at random. The validation cohort consists of the trials led by St. Jude Children's Research Hospital (AML02 and AML08); the training cohort consists of all other trials.

In [3]:
# Split data into training and test sets by clinical trial
y_train = y[~y['Clinical Trial'].isin(['AML02','AML08'])]
y_test = y[y['Clinical Trial'].isin(['AML02','AML08'])]

# Select the matching samples (rows of x) for each cohort
x_train = x.loc[y_train.index]
x_test = x.loc[y_test.index]

y['Clinical Trial'].value_counts(dropna=False)

AAML1031    520
AAML0531    508
AML02       162
AML05        64
AML08        42
AAML03P1     36
CCG2961      14
Name: Clinical Trial, dtype: int64
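
The trial-based split above can be sanity-checked on a toy table. The sketch below is illustrative only: `split_by_trial` and the `y_toy` frame are hypothetical names, not part of the notebook, and the check simply confirms that no trial appears in both cohorts. Applied to the counts above, the same split leaves 1142 training and 204 test samples.

```python
import pandas as pd


def split_by_trial(y, holdout_trials, column="Clinical Trial"):
    """Return (train, test) frames; test holds every sample from the held-out trials."""
    is_holdout = y[column].isin(holdout_trials)
    return y[~is_holdout], y[is_holdout]


# Toy annotation table (hypothetical sample IDs, trial names taken from the notebook)
y_toy = pd.DataFrame(
    {"Clinical Trial": ["AAML1031", "AML02", "AAML0531", "AML08", "AML05"]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

y_toy_train, y_toy_test = split_by_trial(y_toy, ["AML02", "AML08"])

# A trial-level split should leave no trial shared between the two cohorts
shared_trials = set(y_toy_train["Clinical Trial"]) & set(y_toy_test["Clinical Trial"])
print(f"train={len(y_toy_train)}, test={len(y_toy_test)}, shared trials={shared_trials}")
```

Splitting on the trial label rather than on individual samples keeps the test cohort free of any site or protocol effects that the model could have seen during training.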

## Batch Correction with pyComBat

- __pyComBat__: a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods

- __Website__: [https://epigenelabs.github.io/pyComBat/](https://epigenelabs.github.io/pyComBat/)

- __Paper__: [bioRxiv](https://doi.org/10.1101/2020.03.17.995431)

In [4]:
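from combat.pycombat import pycombat
# pycombat expects features (CpG probes) as rows and samples as columns, hence the transpose
data_corrected = pycombat(x_train.T, y_train['Batch'])
x_train2 = data_corrected.T  # back to samples x probes
# Note: the deconvolution inputs written below are built from x_train, not x_train2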

Found 4 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
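
To give a feel for what a location/scale batch adjustment does, here is a deliberately simplified sketch. It is not pyComBat's algorithm: there is no empirical Bayes shrinkage and no covariate handling, only per-batch centering and rescaling of each probe. The function `center_scale_by_batch` and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def center_scale_by_batch(data, batch):
    """Simplified location/scale adjustment (NOT ComBat's empirical Bayes model).

    data  : DataFrame, probes as rows and samples as columns
    batch : Series of batch labels indexed by sample name
    """
    grand_mean = data.mean(axis=1)          # per-probe mean over all samples
    overall_std = data.std(axis=1, ddof=1)  # per-probe spread over all samples
    adjusted = data.astype(float).copy()
    for label in batch.unique():
        cols = batch.index[batch == label]
        block = data[cols]
        mu = block.mean(axis=1)
        sd = block.std(axis=1, ddof=1).replace(0, 1.0)  # guard against zero spread
        # Standardize within the batch, then map back to the overall scale
        adjusted[cols] = (
            block.sub(mu, axis=0).div(sd, axis=0)
            .mul(overall_std, axis=0).add(grand_mean, axis=0)
        )
    return adjusted


# Toy data: 3 probes x 6 samples, with batch "B" shifted upward by 0.1
rng = np.random.default_rng(0)
data_toy = pd.DataFrame(
    rng.uniform(0.2, 0.8, size=(3, 6)),
    index=["cg01", "cg02", "cg03"],
    columns=["s1", "s2", "s3", "s4", "s5", "s6"],
)
batch_toy = pd.Series(["A", "A", "A", "B", "B", "B"], index=data_toy.columns)
data_toy[["s4", "s5", "s6"]] += 0.1

adjusted_toy = center_scale_by_batch(data_toy, batch_toy)
```

After this adjustment every batch shares the same per-probe mean. Note that such a naive rescaling can push beta values outside the [0, 1] range, which is one reason a dedicated tool is preferable for real data.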


## Load Reference Dataset

- __FlowSorted.Blood.EPIC__: An optimized library for reference-based deconvolution of whole-blood biospecimens, __n=49__

- __GEO__: [GSE110554](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110554)

- __PMID__: [29843789](https://www.ncbi.nlm.nih.gov/pubmed/29843789)

- __Description__: Bisulphite-converted DNA from neutrophils (Neu, n=6), monocytes (Mono, n=6), B-lymphocytes (Bcells, n=6), CD4+ T-cells (CD4T, n=7: six samples and one technical replicate), CD8+ T-cells (CD8T, n=6), natural killer cells (NK, n=6), and 12 artificial DNA mixtures (labeled as MIX in the dataset) were hybridised to the Illumina Infinium HumanMethylationEPIC BeadChip v1.0_B4.

- __External Validation__: [Significant variation in the performance of DNA methylation predictors across data preprocessing and normalization strategies](https://pubmed.ncbi.nlm.nih.gov/36280888/)

- __CSV file__: [Download](https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1448-7/MediaObjects/13059_2018_1448_MOESM4_ESM.csv)

In [5]:
ref = pd.read_csv('https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1448-7/MediaObjects/13059_2018_1448_MOESM4_ESM.csv',
                  index_col=0, skiprows=1)[['CD8T','CD4T','NK','Bcell','Mono','Neu']]

In [6]:
ref

Unnamed: 0,CD8T,CD4T,NK,Bcell,Mono,Neu
cg00133995,0.129519,0.298786,0.108926,0.035983,0.040495,0.104418
cg00219921,0.133676,0.807196,0.915346,0.943263,0.936359,0.929755
cg00225157,0.204290,0.322682,0.404319,0.372836,0.420142,0.424195
cg00297099,0.919939,0.914969,0.913794,0.056593,0.922112,0.910821
cg00298230,0.334686,0.255857,0.076164,0.390327,0.911243,0.634052
...,...,...,...,...,...,...
cg27232680,0.877548,0.879531,0.884759,0.044117,0.927373,0.923118
cg27249387,0.925872,0.918406,0.913822,0.909874,0.075194,0.840672
cg27477043,0.367343,0.141357,0.074931,0.015590,0.014581,0.019599
cg27567284,0.912333,0.885379,0.900232,0.036931,0.935405,0.928055


In [22]:
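# Overrides the Springer reference loaded above with the locally saved methylprep-defined reference
ref = pd.read_pickle(output_path+'methylprep_defined_immune_reference.pkl')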

In [29]:
# remove index name
ref.index.name = None
# remove duplicates from index
ref = ref[~ref.index.duplicated(keep='first')]

In [30]:

# The Springer CSV has also been saved locally under ".../Data/Blood_Reference_PMID29843789" as a backup in case the link is down

# Keep only the CpG probes present in both the reference and our data
merge = ref.join(x.T, how='inner')

# Restrict ref and the train/test mixtures to the shared probes (rows = probes, columns = samples)
ref = ref.loc[merge.index]
mix = x_train.T.loc[merge.index]
mix_test = x_test.T.loc[merge.index]

# Save the reference and mixture matrices as CSV inputs for ARIC
ref.to_csv(output_path+'ReferenceData_ARIC.csv')
mix.to_csv(output_path+'Input_TrainData_ARIC.csv')
mix_test.to_csv(output_path+'Input_TestData_ARIC.csv')
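
The join above is used only to find the probes shared by the reference and the data. A lighter alternative, sketched here on toy frames, is to intersect the two label sets directly, which avoids materializing a joined table with one column per sample. The `ref_toy` and `x_toy` frames are hypothetical stand-ins for `ref` and `x`.

```python
import pandas as pd

# Toy reference: probes as rows, cell types as columns
ref_toy = pd.DataFrame(
    {"CD8T": [0.1, 0.9, 0.5], "NK": [0.2, 0.8, 0.4]},
    index=["cg01", "cg02", "cg03"],
)

# Toy methylation data: samples as rows, probes as columns (same layout as x)
x_toy = pd.DataFrame(
    {"cg02": [0.7, 0.6], "cg03": [0.3, 0.2], "cg04": [0.5, 0.5]},
    index=["s1", "s2"],
)

# Probes present in both tables
shared_probes = ref_toy.index.intersection(x_toy.columns, sort=False)

# Subset first, then transpose, so only the shared probes are copied
ref_shared = ref_toy.loc[shared_probes]
mix_shared = x_toy[shared_probes].T

print(ref_shared.shape, mix_shared.shape)
```

Both outputs end up with identical row labels, which is what a reference-based deconvolution tool needs when it matches reference and mixture probes by name.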

## Immune Cell Deconvolution with ARIC

- __ARIC__: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

- __Website__: [https://xwanglabthu.github.io/ARIC/](https://xwanglabthu.github.io/ARIC/)

- __PMID__: [34472588](https://pubmed.ncbi.nlm.nih.gov/34472588/)

- __External Validation__: [A systematic assessment of cell type deconvolution algorithms for DNA methylation data](https://doi.org/10.1093/bib/bbac449)
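
For intuition about reference-based deconvolution in general, the sketch below estimates cell type proportions with plain non-negative least squares (NNLS) from SciPy. This is a swapped-in baseline, not ARIC's method; ARIC's own algorithm is described in the paper linked above and is not reproduced here. All names and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

# Simulated reference: 50 probes x 3 cell types, beta values in [0, 1]
rng = np.random.default_rng(0)
n_probes, n_cell_types = 50, 3
ref_toy = rng.uniform(0.0, 1.0, size=(n_probes, n_cell_types))

# A noise-free bulk sample built from known proportions
true_props = np.array([0.5, 0.3, 0.2])
mix_toy = ref_toy @ true_props


def deconvolve_nnls(reference, mixture):
    """Estimate proportions by NNLS, then rescale them to sum to 1."""
    weights, _residual_norm = nnls(reference, mixture)
    total = weights.sum()
    return weights / total if total > 0 else weights


est_props = deconvolve_nnls(ref_toy, mix_toy)
print(np.round(est_props, 3))
```

The model assumes that a bulk methylation profile is a proportion-weighted sum of the cell type reference profiles. Real data add noise, missing cell types, and collinear markers, which is where more robust methods differ from this baseline.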

In [33]:
from ARIC import *

# Run cell deconvolution on train data
ARIC(mix_path=output_path+'Input_TrainData_ARIC.csv',
     ref_path=output_path+'ReferenceData_ARIC.csv',
     is_methylation=True,
     save_path=output_path+'Results_TrainData_ARIC.csv')

# Run cell deconvolution on test data
ARIC(mix_path=output_path+'Input_TestData_ARIC.csv',
     ref_path=output_path+'ReferenceData_ARIC.csv',
     is_methylation=True,
     save_path=output_path+'Results_TestData_ARIC.csv')


---------------------------------------------
--------------WELCOME TO ARIC----------------
---------------------------------------------
Data reading finished!
ARIC Engines Start, Please Wait......


100%|██████████| 1142/1142 [03:38<00:00,  5.24it/s]


Deconvo Results Saving!
Finished!
---------------------------------------------
--------------WELCOME TO ARIC----------------
---------------------------------------------
Data reading finished!
ARIC Engines Start, Please Wait......


100%|██████████| 204/204 [00:39<00:00,  5.19it/s]

Deconvo Results Saving!
Finished!





## Watermark

In [35]:
%load_ext watermark
%watermark -v -p numpy,pandas,matplotlib,seaborn,scipy,sklearn,combat,ARIC

Python implementation: CPython
Python version       : 3.10.9
IPython version      : 8.8.0

numpy     : 1.24.1
pandas    : 1.5.2
matplotlib: 3.6.2
seaborn   : 0.12.2
scipy     : 1.10.0
sklearn   : 0.0.post1
combat    : 0.3.3
ARIC      : 1.0.0

