# Immune Cell Deconvolution

## Where Is the Data?

In [1]:
input_path = '../Data/Processed_Data/'
output_path = '../Data/Processed_Data/Blood_Deconvolution_ARIC/'

## Load AML Dataset

In [2]:
import pandas as pd

x = pd.read_pickle(input_path+'x.pkl')
y = pd.read_csv(input_path+'y.csv', index_col=0)

## Train/Test Split

In [3]:
y['Clinical Trial'].value_counts(dropna=False)

AAML1031    520
AAML0531    508
AML02       162
AML05        64
AML08        42
AAML03P1     36
CCG2961      14
Name: Clinical Trial, dtype: int64

In [4]:
y_train = y[~y['Clinical Trial'].isin(['AML02','AML08'])]
y_test = y[y['Clinical Trial'].isin(['AML02','AML08'])]

In [5]:
# Select the rows of x whose sample IDs appear in y_train and y_test
x_train = x.loc[y_train.index]
x_test = x.loc[y_test.index]

In [6]:
x_train.shape, x_test.shape

((1142, 310545), (204, 310545))
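The split above holds out whole trials (AML02 and AML08) rather than random samples. A minimal sketch of the same `isin` pattern on a toy frame follows, with a check that the two sets are disjoint and together cover every sample. The sample IDs `s1` to `s5` and the names `y_toy`, `held_out`, and `is_test` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for y; the sample IDs are hypothetical
y_toy = pd.DataFrame(
    {"Clinical Trial": ["AAML1031", "AML02", "AAML0531", "AML08", "AML05"]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

held_out = ["AML02", "AML08"]
is_test = y_toy["Clinical Trial"].isin(held_out)

y_toy_train = y_toy[~is_test]
y_toy_test = y_toy[is_test]

# The two index sets should not overlap and should cover all samples
assert y_toy_train.index.intersection(y_toy_test.index).empty
assert len(y_toy_train) + len(y_toy_test) == len(y_toy)
```

On the real data, the trial counts printed earlier give 162 + 42 = 204 held-out samples, which agrees with the shape of `x_test`.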

## Batch Correction with pyCombat

- __pyCombat__: a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods

- __Website__: [https://epigenelabs.github.io/pyComBat/](https://epigenelabs.github.io/pyComBat/)

- __Paper__: [bioRxiv](https://doi.org/10.1101/2020.03.17.995431)

In [7]:
from combat.pycombat import pycombat

# pycombat expects features x samples, so transpose before and after the call
data_corrected = pycombat(x_train.T, y_train['Batch'])
x_train2 = data_corrected.T

Found 4 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
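
pyCombat matches batch labels to samples by position, so the batch vector needs to be in the same order as the columns of the matrix passed in. A small sketch of a pre-flight check is shown here on toy data; `check_batch_alignment` is a hypothetical helper, not part of pyCombat, and the probe and sample IDs are invented.

```python
import pandas as pd

def check_batch_alignment(data: pd.DataFrame, batch: pd.Series) -> None:
    """Raise if `batch` does not line up with the columns of `data`.

    Hypothetical helper for illustration; not part of pyCombat.
    """
    if not data.columns.equals(batch.index):
        raise ValueError("batch index does not match data columns")
    if batch.isna().any():
        raise ValueError("batch labels contain missing values")

# Toy features x samples matrix with made-up probe and sample IDs
data_toy = pd.DataFrame(
    [[0.1, 0.2, 0.8], [0.5, 0.4, 0.9]],
    index=["cg_a", "cg_b"],
    columns=["s1", "s2", "s3"],
)
batch_toy = pd.Series(["b1", "b1", "b2"], index=["s1", "s2", "s3"])

check_batch_alignment(data_toy, batch_toy)  # passes silently
```

In this notebook the equivalent call would be `check_batch_alignment(x_train.T, y_train['Batch'])`, which should pass because `x_train` was built with `x.loc[y_train.index]`.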


## Load Reference Dataset

- __FlowSorted.Blood.EPIC__: An optimized library for reference-based deconvolution of whole-blood biospecimens, __n=49__

- __GEO__: [GSE110554](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110554)

- __PMID__: [29843789](https://www.ncbi.nlm.nih.gov/pubmed/29843789)

- __Description__:  Bisulphite converted DNA from neutrophils (Neu, n=6), monocytes (Mono, n=6), B-lymphocytes (Bcells, n=6), CD4+ T-cells (CD4T, n=7, six samples and one technical replicate), CD8+ T-cells (CD8T, n=6), Natural Killer cells (NK, n=6), and 12 DNA artificial mixtures (labeled as MIX in the dataset) were hybridised to the Illumina Infinium HumanMethylationEPIC Beadchip v1.0_B4

- __CSV file__: [Download](https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1448-7/MediaObjects/13059_2018_1448_MOESM4_ESM.csv)

In [16]:
ref = pd.read_csv('https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1448-7/MediaObjects/13059_2018_1448_MOESM4_ESM.csv',
                  index_col=0, skiprows=1)[['CD8T','CD4T','NK','Bcell','Mono','Neu']]

# File has also been downloaded locally under ".../Data/Blood_Reference_PMID29843789" as backup in case Springer link is down

# Note: this uses x_train (uncorrected), not the batch-corrected x_train2
mix = x_train.T
merge = ref.join(mix, how='inner')

# Keep only the CpG probes present in both ref and mix
ref = ref.loc[merge.index]
mix = mix.loc[merge.index]

# save ref and mix to csv
ref.to_csv(output_path+'ref.csv')
mix.to_csv(output_path+'mix.csv')
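
The inner join above restricts both tables to the probes they share, so that the reference and mixture files written out have identical row indices. A minimal sketch of that step on toy data follows; the probe IDs, sample IDs, and the `_toy` names are invented for illustration.

```python
import pandas as pd

# Toy reference: probes x cell types (probe IDs are made up)
ref_toy = pd.DataFrame(
    {"CD8T": [0.9, 0.1, 0.5], "Mono": [0.2, 0.8, 0.5]},
    index=["cg_a", "cg_b", "cg_c"],
)
# Toy mixture: probes x samples; cg_c is missing and cg_d is not in the reference
mix_toy = pd.DataFrame(
    {"s1": [0.6, 0.4, 0.3], "s2": [0.3, 0.7, 0.2]},
    index=["cg_b", "cg_a", "cg_d"],
)

# Inner join on the index keeps only probes present in both frames
shared = ref_toy.join(mix_toy, how="inner").index
ref_toy, mix_toy = ref_toy.loc[shared], mix_toy.loc[shared]

print(f"{len(shared)} shared probes")
assert ref_toy.index.equals(mix_toy.index)
```

Printing the number of shared probes on the real data is a cheap way to notice if an array-platform mismatch has silently dropped most of the reference.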

## Immune Cell Deconvolution with ARIC

- __ARIC__: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

- __Website__: [https://xwanglabthu.github.io/ARIC/](https://xwanglabthu.github.io/ARIC/)

- __PMID__: [34472588](https://pubmed.ncbi.nlm.nih.gov/34472588/)

- __External Validation__: [A systematic assessment of cell type deconvolution algorithms for DNA methylation data](https://doi.org/10.1093/bib/bbac449)

In [29]:
from ARIC import *

ARIC(mix_path=output_path+'mix.csv', ref_path=output_path+'ref.csv',
     is_methylation=True, unknown=False)

---------------------------------------------
--------------WELCOME TO ARIC----------------
---------------------------------------------
Data reading finished!
ARIC Engines Start, Please Wait......


100%|██████████| 1142/1142 [00:26<00:00, 43.52it/s]

Deconvo Results Saving!
Finished!





In [30]:
# Read deconvolution results

deconv = pd.read_csv(output_path+'mix_prop.csv', index_col=0)

In [32]:
deconv

| cell types | 201005010090_R03C01 | 201005010034_R05C01 | 201005010090_R02C01 | 201005010034_R06C01 | 201005010090_R04C01 | 201005010090_R05C01 | 201005010090_R01C01 | 201005010090_R06C01 | 201005010090_R07C01 | 201005010090_R08C01 | ... | 200526210087_R07C01 | 200526210082_R01C01 | 200517480142_R07C01 | 200526210087_R01C01 | 200526210091_R02C01 | 200526210048_R07C01 | 200517480143_R01C01 | 200517480142_R04C01 | 200526210091_R07C01 | 200517480143_R04C01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD8T | 0.040796 | 0.093658 | 0.131161 | 0.141251 | 0.152796 | 0.086805 | 0.027893 | 0.150081 | 0.039892 | 0.019137 | ... | 0.075541 | 0.016774 | 0.036538 | 0.016605 | 0.005556 | 0.104392 | 0.076321 | 0.058009 | 0.014174 | 0.042775 |
| CD4T | 0.132652 | 0.125704 | 0.153804 | 0.144729 | 0.150274 | 0.153148 | 0.10396 | 0.166147 | 0.15588 | 0.167041 | ... | 0.183274 | 0.225158 | 0.147381 | 0.150418 | 0.139803 | 0.181001 | 0.219949 | 0.127744 | 0.134903 | 0.176982 |
| NK | 0.083292 | 0.068569 | 0.07557 | 0.105339 | 0.105224 | 0.11856 | 0.070583 | 0.071844 | 0.033575 | 0.076409 | ... | 0.072611 | 0.111069 | 0.098437 | 0.132589 | 0.108946 | 0.0246 | 0.005034 | 0.054445 | 0.068907 | 0.085902 |
| Bcell | 0.244599 | 0.158814 | 0.096914 | 0.142142 | 0.107323 | 0.146335 | 0.147479 | 0.140536 | 0.104748 | 0.039003 | ... | 0.099132 | 0.172013 | 0.173018 | 0.20755 | 0.093517 | 0.121235 | 0.212112 | 0.039044 | 0.118135 | 0.145913 |
| Mono | 0.374861 | 0.329469 | 0.295299 | 0.27245 | 0.303336 | 0.293916 | 0.252455 | 0.262171 | 0.49396 | 0.466387 | ... | 0.350823 | 0.412535 | 0.302166 | 0.22919 | 0.334315 | 0.318272 | 0.302303 | 0.393403 | 0.354223 | 0.331445 |
| Neu | 0.123799 | 0.223786 | 0.247251 | 0.19409 | 0.181047 | 0.201236 | 0.397631 | 0.209221 | 0.171945 | 0.232023 | ... | 0.218619 | 0.06245 | 0.24246 | 0.263648 | 0.317863 | 0.250499 | 0.184281 | 0.327356 | 0.309659 | 0.216984 |
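
The result has cell types as rows and samples as columns, and the first two samples shown sum to 1 within rounding. A short sanity check along those lines is sketched here on those two columns, followed by a transpose to samples x cell types, which is usually the handier shape for joining to per-sample metadata. The `_toy` names, the `props` and `dominant` variables, and the tolerance are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two sample columns copied from the deconvolution output: cell types x samples
deconv_toy = pd.DataFrame(
    {
        "201005010090_R03C01": [0.040796, 0.132652, 0.083292, 0.244599, 0.374861, 0.123799],
        "201005010034_R05C01": [0.093658, 0.125704, 0.068569, 0.158814, 0.329469, 0.223786],
    },
    index=["CD8T", "CD4T", "NK", "Bcell", "Mono", "Neu"],
)

# Each sample's fractions should sum to roughly 1 (the tolerance is an assumption)
col_sums = deconv_toy.sum(axis=0)
assert np.allclose(col_sums, 1.0, atol=1e-3)

# Transpose to samples x cell types for joining to per-sample metadata
props = deconv_toy.T

# Cell type with the largest estimated fraction in each sample
dominant = props.idxmax(axis=1)
```

On the full result, `deconv.T` would yield one row per sample, indexed by the same array IDs used in `y_train`.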
