# Setup

This section sets up the Jupyter Notebook and initializes the experimental settings.

### Give Jupyter Notebook access to relative imports

In [None]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
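
The same idea can be wrapped in a small helper so that re-running the cell never adds duplicate entries. This is a minimal sketch; the name `add_to_sys_path` is hypothetical and not part of GTM-decon.

```python
import os
import sys

def add_to_sys_path(path):
    """Append the absolute form of `path` to sys.path if it is not already there."""
    absolute = os.path.abspath(path)
    if absolute not in sys.path:
        sys.path.append(absolute)
    return absolute

# Make the parent directory importable; calling this again changes nothing.
parent = add_to_sys_path("..")
add_to_sys_path("..")
print(parent in sys.path)
```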

### Create GTMDecon object

For ease of use, we work through the GTMDecon Python wrapper, which is built around the gtm-decon C executable files.

In [None]:
from PythonWrapper.GTM_decon import GTM_decon

Initialize the GTMDecon wrapper object.

Basic constructor arguments:
- **experiment_name** : str [optional]
- **n_topics** : int [optional, default=1]
    - number of topics to set per cell type
- **engine_path** : str
    - path to the GTM-decon C executable

Here we set only the experiment name and engine path; the n_topics parameter defaults to 1.

In [None]:
GTM = GTM_decon(
    experiment_name = "gtm-tutorial",
    engine_path = "../gtm-decon-code/gtm-decon"
)

Printing the wrapper shows the parameters currently set, including the number of topics per cell type and the engine_path (path to the C executable).

The **experiment_name**, **n_topics**, and **engine_path** attributes should be set as we intended, while the remaining attributes are still unfilled. The **genes**, **celltypes**, and **bulk_samples** attributes will be populated once we provide our input reference and bulk data.

In [None]:
print(GTM)

# Example Deconvolution Pipeline

To infer cell-type proportions for a bulk dataset given a single-cell reference matrix, we use the **GTMDecon.pipeline** function. It processes the input data, runs it through the gtm-decon C executables, and outputs the predicted cell-type proportions of our bulk samples.

### Loading DataFrames

In [None]:
import pandas as pd
import anndata as ad

Load our example reference and bulk DataFrames from the example CSV files.

The **reference_DataFrame** should be a pandas DataFrame in which the rows are cells and the columns are genes, with one additional column named *Celltype* containing the cell-type label for each row.

The **bulk_DataFrame** should be a pandas DataFrame, where the rows represent genes, with the genes stored as the index, and the columns represent the bulk batches.
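
To make the two layouts concrete, here is a minimal sketch with toy data. The gene names, cell types, counts, and the `check_input_shapes` helper are all made up for illustration; whether GTM-decon requires the gene sets to overlap in this way is an assumption, not something the wrapper documents here.

```python
import pandas as pd

# Toy reference: rows are cells, columns are genes plus a "Celltype" label column.
toy_reference = pd.DataFrame(
    {
        "GeneA": [5, 0, 2],
        "GeneB": [1, 7, 0],
        "Celltype": ["alpha", "beta", "alpha"],
    }
)

# Toy bulk: rows are genes (stored as the index), columns are bulk samples.
toy_bulk = pd.DataFrame(
    {"S1": [120, 40], "S2": [95, 60]},
    index=["GeneA", "GeneB"],
)

def check_input_shapes(reference, bulk, label_column="Celltype"):
    """Return the genes shared by both frames; raise if the layout looks wrong."""
    if label_column not in reference.columns:
        raise ValueError(f"reference is missing the '{label_column}' column")
    reference_genes = reference.columns.drop(label_column)
    shared = reference_genes.intersection(bulk.index)
    if shared.empty:
        raise ValueError("reference and bulk share no gene names")
    return list(shared)

shared_genes = check_input_shapes(toy_reference, toy_bulk)
print(shared_genes)
```

A check like this catches a transposed bulk matrix early, since its index would then hold sample names rather than genes.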

First, we need to uncompress the tutorial data.

In [None]:
!ls ../data

In [None]:
!tar -xzvf ../data/tutorial_data.tar.gz -C ../data

In [None]:
bulk_DataFrame = pd.read_csv("../data/bulk_data.csv", index_col=0)
reference_DataFrame = pd.read_csv("../data/reference_data.csv")

### Single Leave-One-Out CV fold

Since we have paired single-cell reference and bulk data for this example, we remove one batch from the reference data and deconvolve the bulk data from that same individual. Holding the individual out of the reference prevents data leakage.

Here we will leave out H2.

In [None]:
reference_df = reference_DataFrame[reference_DataFrame['Batch'] != 'H2']

We need to first remove the **Batch** column, as GTM_decon expects DataFrames to only contain the genes and cell-type labels in its columns.

In [None]:
reference_df = reference_df.drop(columns=['Batch'])

In [None]:
reference_df.head()

For the bulk data we do the inverse of the above: we keep only batch H2 so that we can infer the cell-type proportions of this sample.

In [None]:
bulk_df = bulk_DataFrame[['H2']]

In [None]:
bulk_df.head()
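
The two filtering steps above can be gathered into one function, which makes it easy to repeat the fold for every individual. This is a sketch on toy data; `leave_one_out_split` is a hypothetical helper, and it assumes, as in this tutorial, that the reference has a `Batch` column whose values match the bulk column names.

```python
import pandas as pd

def leave_one_out_split(reference, bulk, held_out, batch_column="Batch"):
    """Drop `held_out` from the reference and keep only that sample in the bulk."""
    keep = reference[batch_column] != held_out
    # Remove the batch column so only genes and cell-type labels remain.
    train_reference = reference[keep].drop(columns=[batch_column])
    test_bulk = bulk[[held_out]]
    return train_reference, test_bulk

# Toy data with two individuals, H1 and H2 (values are made up).
toy_reference = pd.DataFrame(
    {
        "GeneA": [5, 0, 2, 3],
        "Celltype": ["alpha", "beta", "alpha", "beta"],
        "Batch": ["H1", "H1", "H2", "H2"],
    }
)
toy_bulk = pd.DataFrame({"H1": [120], "H2": [95]}, index=["GeneA"])

# One fold per individual: train on everyone else, deconvolve the held-out bulk.
for batch in toy_bulk.columns:
    ref_fold, bulk_fold = leave_one_out_split(toy_reference, toy_bulk, batch)
    print(batch, ref_fold.shape, list(bulk_fold.columns))
```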

### Running our Pipeline

GTMDecon.pipeline arguments:
- **bulk_data** : pd.DataFrame
- **reference_data** : pd.DataFrame
- **directory** : str
    - directory where we want to save the model parameters and inferred cell-type proportions
    - we expect the inferred proportions to end up here: **/tutorial_results/gatheredResults.csv**

We make a directory to store the results for this vignette.

In [None]:
!mkdir -p tutorial_results

Here we run our pipeline, which converts the data to GTM-decon format, trains the model, and infers the cell-type proportions.

To suppress print statements, set GTM.verbose = False.

In [None]:
GTM.pipeline(
    bulk_data = bulk_df,
    reference_data = reference_df,
    directory = os.path.join(os.getcwd(), 'tutorial_results'),
)

Upon completion, the predicted proportions should be available in **/tutorial_results/gatheredResults.csv**.

This file contains the inferred cell-type proportions of our provided bulk data given the provided reference data. The sample names are the index and the cell types are the columns of this file.

In [None]:
predicted_props = pd.read_csv(os.path.join(os.getcwd(), "tutorial_results", "gatheredResults.csv"), index_col=0)

In [None]:
predicted_props.head()
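
A quick sanity check on a table of this shape might look like the sketch below. It uses a made-up one-sample table rather than real output, and it assumes that each row holds proportions summing to roughly one, which is worth verifying against the actual file. The helper `summarize_proportions` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up table in the described layout: samples as the index, cell types as columns.
toy_props = pd.DataFrame(
    {"alpha": [0.6], "beta": [0.3], "delta": [0.1]},
    index=["H2"],
)

def summarize_proportions(props, tolerance=1e-6):
    """Report each sample's row sum, whether it is close to 1, and its top cell type."""
    row_sums = props.sum(axis=1)
    return pd.DataFrame(
        {
            "row_sum": row_sums,
            "sums_to_one": np.isclose(row_sums, 1.0, atol=tolerance),
            "top_celltype": props.idxmax(axis=1),
        }
    )

summary = summarize_proportions(toy_props)
print(summary)
```

The same function can be applied to `predicted_props` once the pipeline has written its results.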
