# HIDE Algorithm Tutorial

This tutorial will guide you through the process of using the HIDE algorithm for cell type deconvolution. We will cover data preprocessing, training the model, and applying the trained parameters to an unknown bulk dataset.


## Prerequisites

Make sure all required libraries are installed. You can install them with pip:

```bash
pip install -r requirements.txt
```

## Structuring the `cell_hierarchy` File

The `cell_hierarchy.csv` file is essential for defining the hierarchical relationships between different cell types. This file should contain three columns:
- `celltype_major`: The major cell type categories.
- `celltype_minor`: The minor cell type categories that fall under each major category.
- `celltype_sub`: The sub cell type categories that fall under each minor category.

Here is an example of how the `cell_hierarchy.csv` file might look:

```csv
celltype_major,celltype_minor,celltype_sub
T cell,CD4 T cell,Tfh
T cell,CD4 T cell,Treg
T cell,CD4 T cell,INF responsed T
T cell,CD8 T cell,CXCL13 exhausted CD8 T
```

If a cell type has no subtype, fill `celltype_sub` with the same name used in `celltype_minor`:

```csv
celltype_major,celltype_minor,celltype_sub
B cell,IgA plasma,IgA plasma
B cell,IgG plasma cell,IgG plasma cell
```

If a cell type has neither a minor type nor a subtype, set both `celltype_minor` and `celltype_sub` to the name of the major type:
```csv
celltype_major,celltype_minor,celltype_sub
Proliferation T NK,Proliferation T/NK,Proliferation T/NK
NK cell,NK cell,NK cell
```

Please ensure that the names in `celltype_sub` exactly match those used in the single-cell data frame; otherwise HIDE cannot link the cell types correctly.
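A quick consistency check can catch such naming mismatches before training. The following is a minimal sketch, not part of HIDE: it assumes the hierarchy is available as a pandas data frame and that the column names of the single-cell reference are available as a list. The inline data and the names `hierarchy` and `reference_columns` are purely illustrative.

```python
import io

import pandas as pd

# Illustrative hierarchy; in practice, read cell_hierarchy.csv with pd.read_csv.
hierarchy = pd.read_csv(io.StringIO(
    "celltype_major,celltype_minor,celltype_sub\n"
    "T cell,CD4 T cell,Tfh\n"
    "T cell,CD4 T cell,Treg\n"
    "NK cell,NK cell,NK cell\n"
))

# Hypothetical column names of the single-cell reference (note the typo "Tregs").
reference_columns = ["Tfh", "Tregs", "NK cell"]

hierarchy_names = set(hierarchy["celltype_sub"])
missing_in_reference = sorted(hierarchy_names - set(reference_columns))
missing_in_hierarchy = sorted(set(reference_columns) - hierarchy_names)

print("In hierarchy but not in reference:", missing_in_reference)
print("In reference but not in hierarchy:", missing_in_hierarchy)
```

Both lists should be empty for a consistent setup; any entry points to a name that needs to be harmonized.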

## Loading Disco Example Data

To load the disco example data using the `cell_hierarchy.csv` file, follow these steps:

1. **Define the Path to Data Folder**: Set the path where your data files are located.
2. **Load Metadata**: Use the `disco_read_metadata` function to read the `cell_hierarchy.csv` file and extract the cell type information.
3. **Load Training and Test Data**: Read the training and test datasets using pandas.
4. **Merge Cell Types**: Merge the cell types using the `merge_celltypes` function.
5. **Filter Subtypes**: Filter the subtypes to ensure only relevant subtypes are included.

Here is a step-by-step guide:

```python
# Import necessary libraries
import pandas as pd
from pipelines_dataloader import disco_read_metadata
from pipelines_utils import merge_celltypes, filter_subtypes_by_dataframe_columns
from hDTD import HIDE

# Define the path to the data folder
path_to_data_folder = "./data/"

# Load metadata
meta = disco_read_metadata(path_to_data_folder + 'cell_hierarchy.csv', "celltype_major", 'celltype_minor')
main_celltypes = meta['main_celltypes']
sub_celltypes = meta['sub_celltypes']
meta = disco_read_metadata(path_to_data_folder + 'cell_hierarchy.csv', "celltype_minor", 'celltype_sub')
subset_celltypes = meta['sub_celltypes']
```

While loading the metadata, we acquired dictionaries that link each lower-level cell type to the level above. <br>
For example, `subset_celltypes` lists all subset types of CD8 T cells:

```python
subset_celltypes['CD8 T cell']
```

```
['CXCL13 exhausted CD8 T', 'GZMH CD8 T', 'GZMK CD8 T']
```
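To make the structure of these dictionaries concrete, the sketch below builds a comparable parent-to-children mapping directly with pandas. It is an illustration only and not the implementation of `disco_read_metadata`; the helper `children_by_parent` and the inline data are hypothetical.

```python
import io

import pandas as pd

hierarchy = pd.read_csv(io.StringIO(
    "celltype_major,celltype_minor,celltype_sub\n"
    "T cell,CD4 T cell,Tfh\n"
    "T cell,CD4 T cell,Treg\n"
    "T cell,CD8 T cell,CXCL13 exhausted CD8 T\n"
    "NK cell,NK cell,NK cell\n"
))


def children_by_parent(df, parent_col, child_col):
    """Map each parent name to the list of its unique children, in file order."""
    grouped = df.groupby(parent_col, sort=False)[child_col]
    return {parent: list(children.unique()) for parent, children in grouped}


major_to_minor = children_by_parent(hierarchy, "celltype_major", "celltype_minor")
minor_to_sub = children_by_parent(hierarchy, "celltype_minor", "celltype_sub")

print(major_to_minor["T cell"])
print(minor_to_sub["CD4 T cell"])
```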

The next function merges both dictionaries so that HIDE can use the tree structure defined in the `cell_hierarchy.csv` file:

```python
# Merge cell types
sub_celltypes = merge_celltypes(sub_celltypes, subset_celltypes)
```

Next we need to load the reference matrix `X_train`, the bulk training and validation profiles `Y_train` and `Y_val` and the corresponding cellular composition `C_train` and `C_val`.

```python
# Load training data (path_to_data_folder already ends with a slash)
X_train = pd.read_csv(path_to_data_folder + "X_train.csv", index_col=0)
Y_train = pd.read_csv(path_to_data_folder + "train_data.csv", index_col=0)
C_train = pd.read_csv(path_to_data_folder + "train_distribution.csv", index_col=0)

# Load test data
Y_val = pd.read_csv(path_to_data_folder + "test_data.csv", index_col=0)
C_val = pd.read_csv(path_to_data_folder + "test_distribution.csv", index_col=0)
```

Please ensure that `X`, `Y` and `C` have the correct structure: <br>
`X` should have the subtype names as column names and the genes of interest as index. <br>
`Y` contains the bulk expression of each mixture as columns, with the genes as row indices. <br>
`C` contains the cell counts of each mixture as columns, with the cellular subtypes as row indices.
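These layout requirements can be checked programmatically. The sketch below is a hypothetical helper, not part of HIDE. It assumes that `X` and `Y` are expected to cover the same genes and that `Y` and `C` describe the same mixtures, and it runs on toy data.

```python
import pandas as pd


def check_inputs(X, Y, C):
    """Return a list of layout problems; the list is empty if everything matches."""
    problems = []
    if set(X.index) != set(Y.index):
        problems.append("X and Y do not cover the same genes")
    if set(Y.columns) != set(C.columns):
        problems.append("Y and C do not cover the same mixtures")
    missing = set(X.columns) - set(C.index)
    if missing:
        problems.append(f"cell types in X but not in C: {sorted(missing)}")
    return problems


# Toy data: two genes, two subtypes, three mixtures.
genes = ["GENE_A", "GENE_B"]
mixtures = ["V1", "V2", "V3"]
X = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=genes, columns=["Tfh", "Treg"])
Y = pd.DataFrame([[5.0, 6.0, 7.0], [8.0, 9.0, 1.0]], index=genes, columns=mixtures)
C = pd.DataFrame([[10, 20, 30], [5, 0, 15]], index=["Tfh", "Treg"], columns=mixtures)

print(check_inputs(X, Y, C))
```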

Next, the `sub_celltypes` dictionary is filtered so that cell types missing from the reference matrix are removed from the hierarchy dictionary.

```python
# Filter subtypes dictionary (the loop variable avoids shadowing the built-in `type`)
for main_type in main_celltypes:
    sub_celltypes[main_type] |= filter_subtypes_by_dataframe_columns(sub_celltypes[main_type], X_train)
```

To create the reference profiles of the higher-level cell types, we also need the total count of each cell type in the training set, which the code after the following tip computes:

<div class="alert alert-block alert-info">
<b>Tip:</b> The example composition contains total cell counts. If your training data is normalized, you need another way of counting the cell types.
</div>

```python
# Calculate the total count of each cell type across all training mixtures
counts_per_celltype = C_train.sum(axis=1)
celltype_counts_train = {
    celltype: counts_per_celltype[celltype]
    for celltype in X_train.columns.unique()
}
```
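One plausible way such counts can enter a higher-level reference profile is as weights in an average over the subtype profiles. The sketch below illustrates this idea on toy data. It is an assumption made for illustration and not necessarily how HIDE combines the profiles internally; all names and values are hypothetical.

```python
import pandas as pd

# Toy subtype reference profiles (genes x subtypes) and total training counts.
X_sub = pd.DataFrame(
    {"Tfh": [2.0, 0.0], "Treg": [4.0, 6.0]},
    index=["GENE_A", "GENE_B"],
)
counts = {"Tfh": 30, "Treg": 10}

# Convert counts to weights that sum to one (Tfh: 0.75, Treg: 0.25).
weights = pd.Series(counts, dtype=float)
weights = weights / weights.sum()

# Count-weighted average across subtypes gives one parent-level value per gene.
X_parent = X_sub.mul(weights, axis=1).sum(axis=1)
print(X_parent)
```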

## Running HIDE Algorithm

After preprocessing the necessary data, we are ready to execute the HIDE algorithm. The HIDE algorithm will use the preprocessed data to perform cell type deconvolution.

### Parameters for HIDE:

1. **Training Composition**: The ground truth composition of the training data
2. **Validation Composition**: The ground truth composition of the validation data
3. **Training Bulk**: The bulk data of the training composition
4. **Validation Bulk**: The bulk data of the validation composition
5. **Reference profile**: The reference profile for the cellular subtypes
6. **Hierarchy dictionary**: The dictionary linking each level of celltypes
7. **Counts of training celltypes**: The count of each celltype in the training composition
8. **Number of iterations**: Number of iterations that DTD should perform for each level
9. **Save path**: Path where the results should be saved
10. **Save estimated compositions [optional]**: Flag whether the estimated compositions should be saved
11. **Save Gamma and X [optional]**: Flag whether the reference profiles and models for each level should be saved



<div class="alert alert-block alert-warning">
<b>Warning:</b> To keep this tutorial lightweight, we set the number of iterations to 25, so the resulting correlations will be low. For real usage, set it to around 1000.
</div>

```python
# Run HIDE
res_hide = HIDE(
    C_train,
    C_val,
    Y_train,
    Y_val,
    X_train,
    sub_celltypes,
    celltype_counts_train,
    iterations_dtd=25,
    savePath='./results/',
    saveC=False,
    saveGammaAndX=False,
)
```

##################################
###       HIDE pipeline       ###
##################################
-> Number of used genes: 5000
-> list of all celltypes:
	 0: Arterial EC
	 1: B cell
	 2: Breast basal cell
	 3: Breast cancer specific luminal cell
	 4: Breast cancer specific proliferation luminal cell
	 5: Capillary EC
	 6: CD4 T
	 7: cDC2
	 8: CFD fibroblast
	 9: CXCL1/2/3 fibroblast
	 10: CXCL13 exhausted CD8 T
	 11: Fibroblast
	 12: GZMH CD8 T
	 13: GZMK CD8 T
	 14: IgA plasma
	 15: IgG plasma cell
	 16: INF responsed T
	 17: Luminal progenitor
	 18: Lymphatic EC
	 19: Macrophage
	 20: Mast cell
	 21: Monocyte
	 22: mregDC
	 23: NK cell
	 24: pDC
	 25: Pericyte
	 26: Proliferation macrophage
	 27: Proliferation T/NK
	 28: Smooth muscle cell
	 29: Tfh
	 30: TGM2 luminal cell
	 31: TNBC-specific epithelial cell
	 32: Treg
	 33: Venous EC
-> Started Pipeline at 09:47:18.215430
### HIDE on maintypes ###
-> list of all maintypes:
	 0: Endothelial cell
	 1: Epithelial cell
	 2: Fibro

  0%|          | 0/25 [00:00<?, ?it/s]

-> Average train correlation: 0.5996181020909988
-> Validating HIDE
-> Average val correlation: 0.5992226156106237
### HIDE on Endothelial cell subtypes ###
-> list of Endothelial cell subtypes:
	 0: Arterial EC
	 1: Capillary EC
	 2: Lymphatic EC
	 3: Venous EC
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.24174197110818618
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.18778376952923664

-> No subset types of Arterial EC

-> No subset types of Capillary EC

-> No subset types of Lymphatic EC

-> No subset types of Venous EC

### HIDE on Epithelial cell subtypes ###
-> list of Epithelial cell subtypes:
	 0: Luminal progenitor
	 1: TGM2 luminal cell
	 2: Breast cancer specific luminal cell
	 3: Breast cancer specific proliferation luminal cell
	 4: TNBC-specific epithelial cell
	 5: Breast basal cell
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.5471805215361805
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.5468740946359594

-> No subset types of Luminal progenitor

-> No subset types of TGM2 luminal cell

-> No subset types of Breast cancer specific luminal cell

-> No subset types of Breast cancer specific proliferation luminal cell

-> No subset types of TNBC-specific epithelial cell

-> No subset types of Breast basal cell

### HIDE on Fibroblast subtypes ###
-> list of Fibroblast subtypes:
	 0: other fibroblasts
	 1: CFD fibroblast
	 2: CXCL1/2/3 fibroblast
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.6298622521387818
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.6034543318016931

-> No subset types of other fibroblasts

-> No subset types of CFD fibroblast

-> No subset types of CXCL1/2/3 fibroblast

### HIDE on B cell subtypes ###
-> list of B cell subtypes:
	 0: other B cells
	 1: IgA plasma
	 2: IgG plasma cell
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.5423406973652106
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.5588289245122242

-> No subset types of other B cells

-> No subset types of IgA plasma

-> No subset types of IgG plasma cell

### HIDE on NK cell subtypes ###
-> list of NK cell subtypes:
	 0: NK cell
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.39744119894134045
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.4050494985400631

-> No subset types of NK cell

### HIDE on Proliferation T NK subtypes ###
-> list of Proliferation T NK subtypes:
	 0: Proliferation T/NK
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.3161970429423029
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.36282165320411125

-> No subset types of Proliferation T/NK

### HIDE on T cell subtypes ###
-> list of T cell subtypes:
	 0: CD4 T cell
	 1: CD8 T cell
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.5739453697533887
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.5606220409642149

### HIDE on CD4 T cell subtypes ###
-> list of CD4 T cell subtypes:
	 0: CD4 T
	 1: Tfh
	 2: Treg
	 3: INF responsed T
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.3507005431686171
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.33776725819689296

### HIDE on CD8 T cell subtypes ###
-> list of CD8 T cell subtypes:
	 0: CXCL13 exhausted CD8 T
	 1: GZMH CD8 T
	 2: GZMK CD8 T
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.3205941501665855
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.35362551541554943

### HIDE on Myeloid cell subtypes ###
-> list of Myeloid cell subtypes:
	 0: Dendritic cell
	 1: Granulocyte
	 2: Macrophages
	 3: Monocyte
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.3870296727354586
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.4150367359922475

### HIDE on Dendritic cell subtypes ###
-> list of Dendritic cell subtypes:
	 0: cDC2
	 1: mregDC
	 2: pDC
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.204401162113077
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.2209968707791732

-> No subset types of Granulocyte

### HIDE on Macrophages subtypes ###
-> list of Macrophages subtypes:
	 0: Macrophage
	 1: Proliferation macrophage
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.40809566781671525
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.40077743681635003

-> No subset types of Monocyte

### HIDE on Perivascular cell subtypes ###
-> list of Perivascular cell subtypes:
	 0: Pericyte
	 1: Smooth muscle cell
-> Clearing training bulks
-> Training HIDE


  0%|          | 0/25 [00:00<?, ?it/s]

-> clearing bulks
-> Average train correlation: 0.3351833338728135
-> Validating HIDE
-> clearing bulks
-> Average val correlation: 0.3254796612486195

-> No subset types of Pericyte

-> No subset types of Smooth muscle cell

-> Ended Pipeline at 09:47:47.924438
-> Total duration: 0:00:29.709008
### Correlations ###
--- HIDE Training ---
-> Main correlation: 0.5996181020909988
-> Sub correlation: 0.45564131547075726
-> Subset correlation: 0.3161649537622406
-> Total correlation: 0.4164172243455789

--- HIDE Validation ---
-> Main correlation: 0.5992226156106237
-> Sub correlation: 0.45074811211589566
-> Subset correlation: 0.32304092208370333
-> Total correlation: 0.41526063065713725

##################################


The results folder (`./results/` in this example) now contains scatter plots for each cell type.
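To reproduce a summary score like the ones printed in the log for your own estimates, one option is to average per-cell-type correlations between true and estimated compositions. The sketch below assumes Pearson correlation computed per cell type across mixtures; whether HIDE reports exactly this quantity is an assumption, and the helper `mean_celltype_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def mean_celltype_correlation(C_true, C_est):
    """Average Pearson correlation over cell types (rows), taken across mixtures (columns)."""
    correlations = [
        np.corrcoef(C_true.loc[celltype], C_est.loc[celltype])[0, 1]
        for celltype in C_true.index
    ]
    return float(np.mean(correlations))


# Toy data: two cell types, three mixtures.
C_true = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],
    index=["A", "B"],
    columns=["V1", "V2", "V3"],
)
# An estimate that is a linear transformation of the truth correlates perfectly.
C_est = C_true * 2.0 + 0.05

score = mean_celltype_correlation(C_true, C_est)
print(f"mean correlation: {score:.3f}")
```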

## Applying the estimated parameters to another dataset

In this last section we show how the parameters estimated by HIDE can be used to estimate cellular proportions in another bulk dataset.

To keep this tutorial lightweight, we reuse 25 mixtures of the validation bulk. Real application data must have the same structure: each column represents a single mixture, and the rows contain the gene expression.

```python
Y_appl = Y_val.iloc[:, 0:25]
```
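When the bulk data comes from a different source, its genes may be ordered differently or include additional genes. The following sketch shows one way to align such data to the gene index of a reference profile before applying the model. It uses toy data, and the names `X_ref` and `Y_new` are hypothetical.

```python
import pandas as pd

# Toy reference profile (genes x cell types) and a new bulk dataset (genes x mixtures).
X_ref = pd.DataFrame({"Tfh": [1.0, 2.0, 3.0]}, index=["GENE_A", "GENE_B", "GENE_C"])
Y_new = pd.DataFrame(
    {"S1": [7.0, 5.0, 9.0, 1.0]},
    index=["GENE_C", "GENE_A", "GENE_X", "GENE_B"],
)

# Fail early if the new data lacks genes that the reference requires.
missing_genes = X_ref.index.difference(Y_new.index)
if len(missing_genes) > 0:
    raise ValueError(f"bulk data is missing genes: {list(missing_genes)}")

# Keep only the reference genes, in the reference order.
Y_aligned = Y_new.loc[X_ref.index]
print(Y_aligned)
```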

Now you need to apply the estimated DTD model, the rescaling factor and the linear regression manually to each cell type you want to deconvolve. We start by estimating the major cell type proportions. <br>
In the previous part we used HIDE to estimate all necessary models and parameters, which are stored in the `res_hide` dictionary. <br>
To apply them to another dataset, we need the gamma vector for DTD, the calculated reference profiles and the linear regression parameters. For the main cell types, these can be accessed via
```python
gamma_main = res_hide['major']['model_main'].gamma
X_main = res_hide['major']['X_main']
LinReg_main = res_hide['major']['LinReg']
```

As described in our paper, the cellular composition is first estimated by DTD and then adjusted with the linear regression. Negative estimates are set to zero. Afterwards, each mixture is normalized to sum to one, so that the result gives the fraction of each cell type.

```python
from utils import calculate_estimated_composition
from pipelines_utils import adjustToLinReg

gamma_main = res_hide['major']['model_main'].gamma
X_main = res_hide['major']['X_main']
LinReg_main = res_hide['major']['LinReg']

C_main = calculate_estimated_composition(X_main, Y_appl, gamma_main)
C_main = adjustToLinReg(C_main, LinReg_main)

# Ensure that the prediction is not negative, then normalize each mixture to one
C_main = C_main.clip(lower=0)
C_main = C_main / C_main.sum(axis=0)

display(C_main.iloc[:, 0:5])
```

| | V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|---|
| Endothelial cell | 3.775416e-08 | 0.008413 | 0.005402913 | 0.006916831 | 0.009628441 |
| Epithelial cell | 0.5635308 | 0.628984 | 0.5424821 | 0.3964452 | 0.5454112 |
| Fibroblast | 0.07832261 | 0.056059 | 0.05932667 | 0.07853533 | 0.09172605 |
| B cell | 0.02735336 | 0.056139 | 0.01420662 | 0.05933958 | 0.01512061 |
| NK cell | 2.093987e-08 | 0.010081 | 2.124476e-08 | 0.007946003 | 0.01006802 |
| Proliferation T NK | 0.001665234 | 0.002745 | 1.505727e-08 | 1.584617e-08 | 0.01280559 |
| T cell | 0.184852 | 0.141163 | 0.2372185 | 0.2380544 | 0.1617439 |
| Myeloid cell | 0.1257517 | 0.088888 | 0.1301862 | 0.2080691 | 0.1534961 |
| Perivascular cell | 0.01852425 | 0.007526 | 0.01117694 | 0.004693481 | 2.362669e-08 |
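The clipping and normalization step can be illustrated on a small toy matrix. The values below are made up; only `DataFrame.clip` and the column-wise division mirror the tutorial code.

```python
import pandas as pd

# Toy raw estimates (cell types x mixtures) containing one negative value.
C_raw = pd.DataFrame(
    {"V1": [0.6, -0.1, 0.2], "V2": [0.3, 0.3, 0.0]},
    index=["A", "B", "C"],
)

# Negative estimates are set to zero ...
C_clipped = C_raw.clip(lower=0)

# ... and every mixture (column) is rescaled so that its entries sum to one.
C_norm = C_clipped / C_clipped.sum(axis=0)

print(C_norm)
print(C_norm.sum(axis=0))
```

Note that a mixture whose clipped estimates are all zero would produce `NaN` after the division, so such columns are worth checking for separately.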


Next, we show by way of example how to estimate the T cell subtypes. For this we again need the estimated parameters, which can be found in the results dictionary: <br>
```python
gamma_T = res_hide['minor']['T cell']['model'].gamma
X_T = res_hide['minor']['T cell']['X_sub']
LinReg_T = res_hide['minor']['T cell']['LinReg']
```

For minor and subtype estimation, HIDE provides the function `subtypes_estimate_composition`. It takes the reference profile of the current type (here `X_T`) and of the parent type (here `X_main`), the bulk transcriptomic profile, the name of the parent type that should be split (here `'T cell'`), the estimated parent composition (here `C_main`), the estimated gamma vector and the linear regression. <br>

The function returns a dictionary containing the estimated composition (`'C_est'`) and the reduced bulk profile (`'Y_reduced'`).

```python
from hDTD import subtypes_estimate_composition

gamma_T = res_hide['minor']['T cell']['model'].gamma
X_T = res_hide['minor']['T cell']['X_sub']
LinReg_T = res_hide['minor']['T cell']['LinReg']

result_Tcell = subtypes_estimate_composition(X_T, X_main, Y_appl, 'T cell', C_main, gamma_T, LinReg_T)

C_T = result_Tcell['C_est']
display(C_T.iloc[:, 0:5])
```

```
-> clearing bulks
```

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| CD4 T cell | 0.184852 | 0.1411632 | 0.2372185 | 0.214111 | 0.143722 |
| CD8 T cell | 5.419652e-08 | 4.412962e-08 | 6.643588e-08 | 0.023943 | 0.018022 |


As a last step, we deconvolve all CD4 T cell subtypes, which works almost the same way as before. Instead of using `X_main` as the reference profile of the parent type and `Y_appl` as the bulk profile, we use the reference profile of the T cells (`X_T`) and the reduced bulk of the T cells, which is stored in `result_Tcell['Y_reduced']`:

```python
gamma_CD4 = res_hide['sub']['CD4 T cell']['model'].gamma
X_CD4 = res_hide['sub']['CD4 T cell']['X_sub']
LinReg_CD4 = res_hide['sub']['CD4 T cell']['LinReg']

result_CD4 = subtypes_estimate_composition(X_CD4, X_T, result_Tcell['Y_reduced'], 'CD4 T cell', C_T, gamma_CD4, LinReg_CD4)

C_CD4 = result_CD4['C_est']
display(C_CD4.iloc[:, 0:5])
```

```
-> clearing bulks
```

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| CD4 T | 0.1238749 | 0.08947309 | 0.1622369 | 0.1258131 | 0.08516915 |
| Tfh | 0.02655366 | 0.02008267 | 0.0236128 | 0.03542636 | 0.02950994 |
| Treg | 0.03442341 | 0.03160745 | 0.05136874 | 0.05287168 | 0.02904296 |
| INF responsed T | 5.868077e-10 | 4.46176e-10 | 7.718906e-10 | 7.122566e-10 | 4.694051e-10 |
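In the printed outputs, the CD4 T cell subtype estimates of each mixture add up to the CD4 T cell estimate from the previous level. This suggests that the parent proportion is distributed among its children. The sketch below checks this for the five displayed mixtures, with the numbers copied from the outputs above; the tolerance accounts for the rounding of the displayed values.

```python
import pandas as pd

# "CD4 T cell" row of C_T for the first five mixtures (copied from the output above).
parent_cd4 = pd.Series([0.184852, 0.1411632, 0.2372185, 0.214111, 0.143722])

# CD4 T cell subtype estimates for the same mixtures (copied from the output above).
C_CD4_shown = pd.DataFrame(
    [
        [0.1238749, 0.08947309, 0.1622369, 0.1258131, 0.08516915],
        [0.02655366, 0.02008267, 0.0236128, 0.03542636, 0.02950994],
        [0.03442341, 0.03160745, 0.05136874, 0.05287168, 0.02904296],
        [5.868077e-10, 4.46176e-10, 7.718906e-10, 7.122566e-10, 4.694051e-10],
    ],
    index=["CD4 T", "Tfh", "Treg", "INF responsed T"],
)

# Largest absolute difference between the parent and the sum of its subtypes.
max_gap = (C_CD4_shown.sum(axis=0) - parent_cd4).abs().max()
print(f"largest gap between parent and summed subtypes: {max_gap:.1e}")
```

The same kind of check can be run on your own results to confirm that each level of the hierarchy is consistent with the level above.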
