<a id='predict'></a>

# CellPy Predict

## Predict Options

predictOne: prediction of layers with a validation metadata file (predMetadata); each layer is predicted independently
* (runMode = predictOne, predNormExpr, predMetadata, layerObjectPaths, rejectionCutoff)
* Example: `cellpy --runMode predictOne --predNormExpr /cellpy_example/pbmc_10k_normalized.csv --predMetadata /cellpy_example/pbmc_10k_metadata.csv --layerObjectPaths /cellpy_example/cellpy_results_20210720155257/training/Root_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/CD4_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/CD8_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/T-cell_object.pkl --rejectionCutoff 0.5`

predictOne: prediction of layers without a validation metadata file; each layer is predicted independently
* (runMode = predictOne, predNormExpr, layerObjectPaths, rejectionCutoff)
* Example: `cellpy --runMode predictOne --predNormExpr /cellpy_example/pbmc_10k_normalized.csv --layerObjectPaths /cellpy_example/cellpy_results_20210720155257/training/Root_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/CD4_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/CD8_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/T-cell_object.pkl --rejectionCutoff 0.5`

predictAll: prediction of layers without a validation metadata file; each layer influences the next layer
* (runMode = predictAll, predNormExpr, layerObjectPaths, rejectionCutoff)
* Example: `cellpy --runMode predictAll --predNormExpr /cellpy_example/pbmc_10k_normalized.csv --layerObjectPaths /cellpy_example/cellpy_results_20210720155257/training/Root_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/CD4_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/CD8_object.pkl,/cellpy_example/cellpy_results_20210720155257/training/T-cell_object.pkl --rejectionCutoff 0.5`

## predictOne runMode Option (2a, 2b)

predictOne is a CellPy option that uses the CellPy algorithm to predict cell types in an **individual layer** of the user's single-cell RNA query dataset, based on their previously trained dataset.

*Note: CellPy creates and stores information from model training in **Layer** objects, including the name of the layer, its depth in metadata files, the dictionary associated with each layer, and the XGBoost model itself. Prediction and feature ranking require **Layer** objects as input and can therefore only be run after **trainAll** has been completed. Details for the **Layer** class can be found at the end of this tutorial.*

### REQUIRED USER INPUTS:
-  **predNormExpr**: normalized expression matrix file
-  **predMetadata (optional)**: metadata file
-  **layerObjectPaths**: a list of path names to Layer objects
    - **NOTE:** the option `cardiacDevAtlas` can be directly utilized as an argument to layerObjectPaths to predict cell types in your cardiac dataset based on the cardiac developmental atlas used in our paper
-  **rejectionCutoff**: float between 0 and 1 denoting the minimum probability for a prediction to not be rejected

### OUTPUT OF PREDICT:
* creates directory 'prediction' in cellpy_results folder, defines 'Root' as topmost layer
* csv files containing all the predictions and probabilities associated with each label for each cell
* metric files detailing accuracy, precision, recall, confusion matrix (if metadata file has been provided in input; only present for predictOne mode)
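
When a metadata file is supplied, the metrics compare the predicted labels with the labels from that file. The following is a minimal sketch of how accuracy, per-class precision and recall, and a confusion matrix can be derived from two label lists. The helper `classification_metrics` and the example labels are hypothetical and are not taken from CellPy's own code.

```python
from collections import Counter


def classification_metrics(true_labels, predicted_labels):
    """Compute accuracy, per-class precision/recall, and a confusion matrix.

    Hypothetical helper for illustration; not part of CellPy.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    # confusion[(true, predicted)] = number of cells with that label pair
    confusion = Counter(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))

    correct = sum(confusion[(c, c)] for c in classes)
    accuracy = correct / len(true_labels) if true_labels else 0.0

    precision, recall = {}, {}
    for c in classes:
        predicted_as_c = sum(confusion[(t, c)] for t in classes)
        truly_c = sum(confusion[(c, p)] for p in classes)
        # Report 0.0 when a class was never predicted / never present.
        precision[c] = confusion[(c, c)] / predicted_as_c if predicted_as_c else 0.0
        recall[c] = confusion[(c, c)] / truly_c if truly_c else 0.0

    return accuracy, precision, recall, confusion


# Made-up labels for four cells.
truth = ["NK", "NK", "T-cell", "Monocyte"]
predicted = ["NK", "T-cell", "T-cell", "Unclassified"]
accuracy, precision, recall, confusion = classification_metrics(truth, predicted)
```

In this sketch a rejected ("Unclassified") cell simply counts as an incorrect prediction; how CellPy treats rejected cells in its metric files is not specified here.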

### predNormExpr

##### FORMAT: csv file

-  cell 0,0 / A1 is 'gene'
-  first column contains gene names
-  first row contains cell names
-  e.g., the entry in row i, column j (counting from 1) is the expression value of gene (i-1) in cell (j-1)

##### EXAMPLE:

`gene,AAACCCAG........,AAACGAAC........,AAACGAAT........,GAGGGATC........
MIR1302-10,0,2.14693417019908,2.31409562022533,0
OR4F29,0,1.71783906814673,0,0
LINC00115,0,0,0,2.8499342352407
ISG15,2.99811039595896,0,2.41534932603235,0`
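
The layout rules above can be checked before starting a prediction run. The following is a minimal sketch using Python's standard `csv` module; `check_norm_expr` is a hypothetical helper (not part of CellPy), and the sample text reuses values from the example above.

```python
import csv
import io


def check_norm_expr(handle):
    """Validate a predNormExpr-style CSV and return (cell_names, gene_names).

    Hypothetical helper for illustration; not part of CellPy.
    """
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "gene":
        raise ValueError("cell A1 must be 'gene'")
    cell_names = header[1:]

    gene_names = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(
                f"row for {row[0]!r} has {len(row) - 1} values, "
                f"expected {len(cell_names)}"
            )
        # Every entry after the gene name must parse as a number;
        # float() raises ValueError otherwise.
        for value in row[1:]:
            float(value)
        gene_names.append(row[0])
    return cell_names, gene_names


sample = io.StringIO(
    "gene,AAACCCAG........,AAACGAAC........\n"
    "MIR1302-10,0,2.14693417019908\n"
    "ISG15,2.99811039595896,0\n"
)
cells, genes = check_norm_expr(sample)
```

For a real file, the same function could be called with `open(path, newline="")` in place of the in-memory sample.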

### predMetadata (Optional)

##### FORMAT: csv file

-  cell 0,0 / A1 should be NA or empty
-  first column contains cell names, should be in the same order as first row of predNormExpr
-  each column following contains the layered identification for each cell
-  all other cells should be NA or empty

Supplying a prediction metadata file allows the user to evaluate the model on a new dataset whose cell labels are already known.

##### EXAMPLE:

`,Celltype1,Celltype2,Celltype3
AAACCCAG........,Monocyte,NA,NA
AAACGAAC........,NK,NA,NA
AAACGAAT........,Macrophage,NA,NA
GAGGGATC........,T-cell,CD8,Effector Memory CD8`
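
Because the metadata rows are expected to follow the same cell order as the header of the normalized expression matrix, it can be worth comparing the two files before running. The following is a minimal sketch with the standard `csv` module; `metadata_matches_expression` is a hypothetical helper, not a CellPy function, and the in-memory files are shortened versions of the examples in this tutorial.

```python
import csv
import io


def metadata_matches_expression(expr_handle, meta_handle):
    """Return True if metadata row names equal the expression header cell names, in order.

    Hypothetical helper for illustration; not part of CellPy.
    """
    # First row of the expression matrix: 'gene' followed by the cell names.
    expr_cells = next(csv.reader(expr_handle))[1:]

    meta_reader = csv.reader(meta_handle)
    next(meta_reader)  # skip the header row (empty A1, then layer column names)
    meta_cells = [row[0] for row in meta_reader if row]
    return expr_cells == meta_cells


expr = io.StringIO(
    "gene,AAACCCAG........,GAGGGATC........\n"
    "ISG15,2.99811039595896,0\n"
)
meta = io.StringIO(
    ",Celltype1,Celltype2,Celltype3\n"
    "AAACCCAG........,Monocyte,NA,NA\n"
    "GAGGGATC........,T-cell,CD8,Effector Memory CD8\n"
)
in_order = metadata_matches_expression(expr, meta)
```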

### layerObjectPaths

##### FORMAT: a comma-separated list of paths to the trained Layer objects (pickle .pkl files)

* Layer objects are created by trainAll
* not all models have to be provided; prediction can be conducted on individual targeted layers
* **NOTE: do not rename the .pkl Layer objects**

* the cardiac developmental atlas datasets found in our paper can be utilized as a training dataset by simply using the **`cardiacDevAtlas`** option instead of inputting your own Layer objects

##### EXAMPLES:


predicting one layer:
`--layerObjectPaths /scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/Root_object.pkl`

predicting multiple layers (predicted individually):
`--layerObjectPaths /scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/Root_object.pkl,/scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/CD4_object.pkl,/scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/CD8_object.pkl,/scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/T-cell_object.pkl`
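
Since `--layerObjectPaths` takes a single comma-separated argument, it can be convenient to assemble the command programmatically. The sketch below builds an argument list using only the flags shown in this tutorial; `build_predict_command` is a hypothetical helper, the paths are the example paths from above, and the command is only constructed here, not executed.

```python
from pathlib import PurePosixPath


def build_predict_command(run_mode, norm_expr, layer_paths, rejection_cutoff,
                          metadata=None):
    """Assemble a cellpy prediction command as a list of arguments.

    Hypothetical helper for illustration; not part of CellPy.
    """
    if run_mode not in ("predictOne", "predictAll"):
        raise ValueError("run_mode must be 'predictOne' or 'predictAll'")
    if not 0 <= rejection_cutoff <= 1:
        raise ValueError("rejection_cutoff must be between 0 and 1")

    command = ["cellpy", "--runMode", run_mode, "--predNormExpr", str(norm_expr)]
    if metadata is not None:
        command += ["--predMetadata", str(metadata)]
    command += [
        # All Layer object paths are joined into one comma-separated argument.
        "--layerObjectPaths", ",".join(str(p) for p in layer_paths),
        "--rejectionCutoff", str(rejection_cutoff),
    ]
    return command


training_dir = PurePosixPath("/cellpy_example/cellpy_results_20210720155257/training")
layers = [training_dir / f"{name}_object.pkl" for name in ("Root", "T-cell")]
command = build_predict_command(
    "predictOne", "/cellpy_example/pbmc_10k_normalized.csv", layers, 0.5
)
```

The resulting list could then be passed to `subprocess.run(command, check=True)` on a machine where `cellpy` is installed.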

### rejectionCutoff

##### FORMAT: float between 0 and 1

-  a rejection cutoff of 0.5 means a cell will be regarded as "Unclassified" if no class has a predicted probability greater than 50%
* **NOTE:** See the "Post-CellPy Analysis in R" section below for further analysis of how varying the rejection threshold affects results.
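
The rejection rule can be illustrated with a few lines of Python. This is a sketch of the rule as described above, not CellPy's internal code: `apply_rejection_cutoff` and the probability values are hypothetical, and treating a probability exactly equal to the cutoff as rejected is an assumption.

```python
def apply_rejection_cutoff(probabilities, rejection_cutoff):
    """Return the most probable label, or "Unclassified" if it does not exceed the cutoff.

    `probabilities` maps each class label to its predicted probability.
    Hypothetical helper for illustration; not part of CellPy.
    """
    label, probability = max(probabilities.items(), key=lambda item: item[1])
    # Assumption: a probability equal to the cutoff is still rejected.
    return label if probability > rejection_cutoff else "Unclassified"


confident = apply_rejection_cutoff({"NK": 0.7, "T-cell": 0.2, "Monocyte": 0.1}, 0.5)
uncertain = apply_rejection_cutoff({"NK": 0.4, "T-cell": 0.35, "Monocyte": 0.25}, 0.5)
```

With a cutoff of 0.5, the first cell keeps its top label while the second is marked "Unclassified" because no class exceeds 50%.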

## predictAll runMode Option (3a)

predictAll is a CellPy option that uses the CellPy algorithm to predict cell types in **all layers** of the user's single-cell RNA query dataset, based on their previously trained dataset. The cell types predicted in each layer influence the predictions for the next layer.
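
One way to picture how a layer can influence the next is a top-down walk through the hierarchy: a cell is classified at the Root layer, and if a trained layer exists for the predicted label, the cell is passed on to that layer. The sketch below works under that assumption only and is not CellPy's implementation; the layers are plain functions standing in for trained models, and the gene thresholds are made-up toy values.

```python
def predict_path(cell, layers, root="Root"):
    """Walk down a hierarchy of classifiers and return the label assigned at each level.

    `layers` maps a layer name to a function taking a cell and returning a label.
    Hypothetical helper for illustration; not part of CellPy.
    """
    path = []
    visited = set()
    layer_name = root
    while layer_name in layers and layer_name not in visited:
        visited.add(layer_name)
        label = layers[layer_name](cell)
        path.append(label)
        if label == "Unclassified":
            break  # a rejected cell is not passed to deeper layers
        layer_name = label  # the predicted label selects the next layer
    return path


# Toy stand-ins for trained layers (thresholds are invented for the example).
toy_layers = {
    "Root": lambda cell: "T-cell" if cell["CD3E"] > 1.0 else "Monocyte",
    "T-cell": lambda cell: "CD8" if cell["CD8A"] > 1.0 else "CD4",
}

t_cell_path = predict_path({"CD3E": 2.5, "CD8A": 1.8}, toy_layers)
monocyte_path = predict_path({"CD3E": 0.0, "CD8A": 0.0}, toy_layers)
```

Here the first cell is routed from Root to the T-cell layer and ends as CD8, while the second stops at Root because no layer named "Monocyte" was provided.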

*Note: CellPy creates and stores information from model training in **Layer** objects, including the name of the layer, its depth in metadata files, the dictionary associated with each layer, and the XGBoost model itself. Prediction and feature ranking require **Layer** objects as input and can therefore only be run after **trainAll** has been completed. Details for the **Layer** class can be found at the end of this tutorial.*

### REQUIRED USER INPUTS:
-  **predNormExpr**: normalized expression matrix file
-  **layerObjectPaths**: a list of path names to Layer objects
    - **NOTE:** the option `cardiacDevAtlas` can be directly utilized as an argument to layerObjectPaths to predict cell types in your cardiac dataset based on the cardiac developmental atlas used in our paper
-  **rejectionCutoff**: float between 0 and 1 denoting the minimum probability for a prediction to not be rejected

### OUTPUT OF PREDICT:
* creates directory 'prediction' in cellpy_results folder, defines 'Root' as topmost layer
* csv files containing all the predictions and probabilities associated with each label for each cell
* metric files detailing accuracy, precision, recall, confusion matrix (if metadata file has been provided in input; only present for predictOne mode)

### predNormExpr

##### FORMAT: csv file

-  cell 0,0 / A1 is 'gene'
-  first column contains gene names
-  first row contains cell names
-  e.g., the entry in row i, column j (counting from 1) is the expression value of gene (i-1) in cell (j-1)

##### EXAMPLE:

`gene,AAACCCAG........,AAACGAAC........,AAACGAAT........,GAGGGATC........
MIR1302-10,0,2.14693417019908,2.31409562022533,0
OR4F29,0,1.71783906814673,0,0
LINC00115,0,0,0,2.8499342352407
ISG15,2.99811039595896,0,2.41534932603235,0`

### layerObjectPaths

##### FORMAT: a comma-separated list of paths to the trained Layer objects (pickle .pkl files)

* Layer objects are created by trainAll
* not all models have to be provided; prediction can be conducted on individual targeted layers
* **NOTE: do not rename the .pkl Layer objects**

* the cardiac developmental atlas datasets found in our paper can be utilized as a training dataset by simply using the **`cardiacDevAtlas`** option instead of inputting your own Layer objects

##### EXAMPLE:

`--layerObjectPaths /scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/Root_object.pkl,/scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/CD4_object.pkl,/scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/CD8_object.pkl,/scratch/groups/smwu/sidraxu/cellpy_results_20210720155257/training/T-cell_object.pkl`

### rejectionCutoff

##### FORMAT: float between 0 and 1

-  a rejection cutoff of 0.5 means a cell will be regarded as "Unclassified" if no class has a predicted probability greater than 50%
* **NOTE:** See the "Post-CellPy Analysis in R" section below for further analysis of how varying the rejection threshold affects results.

# Back to Table of Contents

[Table of Contents](tableofcontents.ipynb#toc)