# Immune Cell Deconvolution

## Where is the data?

In [1]:
input_path = '../Data/Processed_Data/'
output_path = '../Data/Processed_Data/Cell_Deconvolution/'

## Load AML Dataset

In [2]:
import pandas as pd

x = pd.read_pickle(input_path+'x.pkl')
y = pd.read_csv(input_path+'y.csv', index_col=0)

print(f' Dataset (df) contains {x.shape[1]} rows (CpG probes) and {x.shape[0]} columns (samples).')

 Dataset (df) contains 310545 rows (CpG probes) and 1346 columns (samples).


## Train-Test Split

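Here we split the data into a training (discovery) set and a testing (validation) set by clinical trial: samples from AML02 and AML08 are held out for testing, and samples from all other trials are used for training.

We use ```x_train``` and ```y_train``` to denote the training set, and ```x_test``` and ```y_test``` to denote the testing set.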

In [3]:
# Split train and test by clinical trial
y_train = y[~y['Clinical Trial'].isin(['AML02', 'AML08'])]
y_test = y[y['Clinical Trial'].isin(['AML02', 'AML08'])]

# Select samples in x that are in y_train
x_train = x.loc[y_train.index]
x_test = x.loc[y_test.index]


print(
    f"Discovery dataset (train) contains {x_train.shape[1]} rows (mC sites) and {x_train.shape[0]} columns (samples)")
print(
    f"\n{y_train['Clinical Trial'].value_counts(dropna=False).to_string()}\n")
print(
    f"Validation dataset (test) contains {x_test.shape[1]} rows (mC sites) and {x_test.shape[0]} columns (samples).")
print(f"\n{y_test['Clinical Trial'].value_counts(dropna=False).to_string()}\n")


Discovery dataset (train) contains 310545 rows (mC sites) and 1142 columns (samples)

AAML1031    520
AAML0531    508
AML05        64
AAML03P1     36
CCG2961      14

Validation dataset (test) contains 310545 rows (mC sites) and 204 columns (samples).

AML02    162
AML08     42
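
A quick sanity check on a trial-based split is that the two sets are disjoint and together cover every sample. The sketch below illustrates this on a toy sample sheet; the sample IDs and the `y_demo` names are made up for illustration and are not part of the dataset above.

```python
import pandas as pd

# Toy stand-in for the sample sheet `y`; IDs and rows are hypothetical.
y_demo = pd.DataFrame(
    {"Clinical Trial": ["AAML1031", "AML02", "AAML0531", "AML08", "AML05"]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Same rule as above: AML02 and AML08 form the test set.
test_trials = ["AML02", "AML08"]
is_test = y_demo["Clinical Trial"].isin(test_trials)
y_demo_train, y_demo_test = y_demo[~is_test], y_demo[is_test]

# The two index sets should not overlap ...
overlap = y_demo_train.index.intersection(y_demo_test.index)
# ... and together they should account for every sample.
n_total = len(y_demo_train) + len(y_demo_test)

print(f"overlap: {len(overlap)}, train + test: {n_total}, all: {len(y_demo)}")
```

The same two checks can be run on the real `y_train` and `y_test` before selecting rows of `x`.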



## Batch Correction with pyCombat

- __pyCombat__: a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods

- __GitHub__: [https://epigenelabs.github.io/pyComBat/](https://epigenelabs.github.io/pyComBat/)

- __Implementation Paper__: [bioRxiv](https://doi.org/10.1101/2020.03.17.995431)

- __Original Paper__: [Biostatistics](https://pubmed.ncbi.nlm.nih.gov/16632515/)

In [4]:
from combat.pycombat import pycombat

# Correct batch effects in the training dataset
x_train2 = pycombat(x_train.T, y_train['Batch']).T

print('Successfully corrected batch effects in the training dataset.')


Found 4 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
Successfully corrected batch effects in the training dataset.
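
To build intuition for what a location/scale (L/S) batch adjustment does, here is a deliberately simplified sketch on toy data. It is not pyCombat's implementation: it omits the empirical Bayes shrinkage of the batch parameters and any covariates, and the helper `center_scale_by_batch` and all toy values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: 6 samples (rows) x 4 features (columns), two made-up batches.
batch = pd.Series(["A", "A", "A", "B", "B", "B"], index=[f"s{i}" for i in range(6)])
data = pd.DataFrame(rng.normal(size=(6, 4)), index=batch.index)
data.loc[batch == "B"] += 2.0  # inject an additive batch shift


def center_scale_by_batch(df, batch):
    """Per-batch location/scale adjustment without empirical Bayes shrinkage."""
    grouped = df.groupby(batch)
    batch_mean = grouped.transform("mean")
    batch_sd = grouped.transform("std")        # ddof=1 within each batch
    pooled_sd = (df - batch_mean).std(axis=0)  # spread after removing batch means
    # Standardize within each batch, then restore a common scale and mean.
    return (df - batch_mean) / batch_sd * pooled_sd + df.mean(axis=0)


adjusted = center_scale_by_batch(data, batch)

gap_before = (data[batch == "B"].mean() - data[batch == "A"].mean()).abs().max()
gap_after = (adjusted[batch == "B"].mean() - adjusted[batch == "A"].mean()).abs().max()
print(f"largest batch-mean gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

After the adjustment, each feature has the same mean in every batch. ComBat additionally pools information across features when estimating the batch parameters, which is what makes it usable with small batches.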


## Load Reference Dataset

- __FlowSorted.Blood.EPIC__: An optimized library for reference-based deconvolution of whole-blood biospecimens, __n=49__

- __GEO__: [GSE110554](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110554)

- __PMID__: [29843789](https://www.ncbi.nlm.nih.gov/pubmed/29843789)

- __Description__: Bisulphite-converted DNA from neutrophils (Neu, n=6), monocytes (Mono, n=6), B-lymphocytes (Bcells, n=6), CD4+ T-cells (CD4T, n=7, six samples and one technical replicate), CD8+ T-cells (CD8T, n=6), Natural Killer cells (NK, n=6), and 12 DNA artificial mixtures (labeled as MIX in the dataset) were hybridised to the Illumina Infinium HumanMethylationEPIC Beadchip v1.0_B4.

- __External Validation__: [Significant variation in the performance of DNA methylation predictors across data preprocessing and normalization strategies](https://pubmed.ncbi.nlm.nih.gov/36280888/)

- __CSV file__: [Download](https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1448-7/MediaObjects/13059_2018_1448_MOESM4_ESM.csv)

In [5]:
ref = pd.read_pickle(output_path+'paper_defined_immune_reference.pkl')
# remove index name
ref.index.name = None
# remove duplicates from index
ref = ref[~ref.index.duplicated(keep='first')]

# Harmonize index of reference data with our data
merge = ref.join(x.T, how='inner')

# update ref and mix with merge index
ref = ref.loc[merge.index]
mix = x_train2.T.loc[merge.index]
mix_test = x_test.T.loc[merge.index]

# save ref and mix to csv
ref.to_csv(output_path+'ReferenceData_ARIC.csv')
mix.to_csv(output_path+'Input_TrainData_ARIC.csv')
mix_test.to_csv(output_path+'Input_TestData_ARIC.csv')
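
The harmonization step above keeps only probes present in both the reference and the bulk data, so that both tables share an identical row index. A minimal sketch on toy frames follows; `ref_demo`, `x_demo`, and the probe and sample IDs are hypothetical.

```python
import pandas as pd

# Hypothetical reference: probes x cell types, with one duplicated probe ID.
ref_demo = pd.DataFrame(
    {"Neu": [0.1, 0.9, 0.5, 0.5], "Mono": [0.8, 0.2, 0.4, 0.4]},
    index=["cg01", "cg02", "cg03", "cg03"],
)
# Hypothetical bulk data: samples x probes, oriented like `x` in this notebook.
x_demo = pd.DataFrame(
    [[0.3, 0.6, 0.7], [0.4, 0.5, 0.2]],
    index=["s1", "s2"],
    columns=["cg02", "cg03", "cg99"],
)

# Drop duplicated probe IDs, then keep probes found in both tables.
ref_demo = ref_demo[~ref_demo.index.duplicated(keep="first")]
shared = ref_demo.join(x_demo.T, how="inner").index

ref_aligned = ref_demo.loc[shared]   # probes x cell types
mix_aligned = x_demo.T.loc[shared]   # probes x samples

print(list(shared))
print(ref_aligned.index.equals(mix_aligned.index))
```

Checking that the two indices are equal before writing the CSV files guards against silently misaligned probes in the deconvolution input.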

## Immune Cell Deconvolution with ARIC

- __ARIC__: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

- __Website__: [https://xwanglabthu.github.io/ARIC/](https://xwanglabthu.github.io/ARIC/)

- __PMID__: [34472588](https://pubmed.ncbi.nlm.nih.gov/34472588/)

- __External Validation__: [A systematic assessment of cell type deconvolution algorithms for DNA methylation data](https://doi.org/10.1093/bib/bbac449)

In [9]:
from ARIC import *

# Run cell deconvolution on train data
ARIC(mix_path=output_path+'Input_TrainData_ARIC.csv',
     ref_path=output_path+'ReferenceData_ARIC.csv',
     is_methylation=True,
     unknown=True,
     save_path=output_path+'Results_TrainData_ARIC.csv')

# Run cell deconvolution on test data
ARIC(mix_path=output_path+'Input_TestData_ARIC.csv',
     ref_path=output_path+'ReferenceData_ARIC.csv',
     is_methylation=True,
     unknown=True,
     save_path=output_path+'Results_TestData_ARIC.csv')


---------------------------------------------
--------------WELCOME TO ARIC----------------
---------------------------------------------
Data reading finished!
ARIC Engines Start, Please Wait......


100%|██████████| 1142/1142 [03:26<00:00,  5.53it/s]


Deconvo Results Saving!
Finished!
---------------------------------------------
--------------WELCOME TO ARIC----------------
---------------------------------------------
Data reading finished!
ARIC Engines Start, Please Wait......


100%|██████████| 204/204 [00:37<00:00,  5.47it/s]

Deconvo Results Saving!
Finished!
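
Reference-based deconvolution treats each bulk profile as a mixture of the reference cell-type profiles and solves for the mixing proportions. The sketch below illustrates that idea with plain non-negative least squares from SciPy on made-up numbers. It is a stand-in for intuition only and is not ARIC's algorithm; the reference matrix and proportions are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

# Made-up reference: 5 probes (rows) x 3 cell types (columns).
ref_demo = np.array([
    [0.9, 0.1, 0.2],
    [0.1, 0.8, 0.3],
    [0.2, 0.2, 0.9],
    [0.7, 0.6, 0.1],
    [0.3, 0.9, 0.8],
])

# Simulate one noise-free bulk sample from known proportions.
true_props = np.array([0.5, 0.3, 0.2])
bulk = ref_demo @ true_props

# Solve min ||ref_demo @ p - bulk|| subject to p >= 0, then normalize to sum to 1.
coef, residual = nnls(ref_demo, bulk)
est_props = coef / coef.sum()

print(np.round(est_props, 3))
```

On real data the bulk profile is noisy and may contain cell types missing from the reference, which is the situation the `unknown=True` option in the calls above is meant to address.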





## Watermark

In [7]:
%load_ext watermark

In [8]:
%watermark -v -p numpy,pandas,matplotlib,seaborn,scipy,sklearn,combat,ARIC

Python implementation: CPython
Python version       : 3.10.10
IPython version      : 8.3.0

numpy     : 1.24.2
pandas    : 1.5.3
matplotlib: 3.7.1
seaborn   : 0.12.2
scipy     : 1.10.1
sklearn   : 0.0.post1
combat    : 0.3.3
ARIC      : 1.0.0

