# 3. devCellPy Predict

predictOne and predictAll are devCellPy options that allow users to apply a trained devCellPy layered prediction algorithm to new datasets. Here we use the devCellPy algorithm trained on the cardiac atlas PBMC dataset to predict cell types in a new cardiac dataset from Li et al. (2016). devCellPy allows users to predict individual layers or to fully automate prediction across all layers of an annotation hierarchy. Below we provide examples of how to run these distinct options.


## Predict Options

*Note: devCellPy creates and stores information from model training in `Layer` objects, including the name of the layer, its depth in metadata files, the dictionary associated with each layer, and the XGBoost model itself. Prediction and feature ranking require `Layer` objects as input and can therefore only be run after `trainAll` has been completed. Details for the `Layer` class can be found at the end of this tutorial.*

predictOne: prediction of layers with validation metadata (predMetadata); each layer is predicted independently
* (runMode = predictOne, predNormExpr, predMetadata, layerObjectPaths, rejectionCutoff)
* Example: 

predictOne: prediction of layers without validation metadata; each layer is predicted independently
* (runMode = predictOne, predNormExpr, layerObjectPaths, rejectionCutoff)
* Example: 

predictAll: prediction of layers without validation metadata; the prediction at each layer influences the next layer
* (runMode = predictAll, predNormExpr, layerObjectPaths, rejectionCutoff)
* Example: 

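The three cases above differ only in the run mode and in whether a metadata file is passed. The following is a minimal Python sketch of assembling the options for each case. The option names come from this tutorial, but the `--name value` flag spelling and all file names are assumptions made for illustration and should be checked against the installed devCellPy version.

```python
def build_predict_args(run_mode, pred_norm_expr, layer_object_paths,
                       rejection_cutoff, pred_metadata=None):
    """Assemble an argument list for a devCellPy prediction run.

    ASSUMPTION: options are passed as `--name value` pairs. Only the option
    names themselves are taken from this tutorial.
    """
    if run_mode not in ("predictOne", "predictAll"):
        raise ValueError(f"unsupported run mode: {run_mode}")
    if run_mode == "predictAll" and pred_metadata is not None:
        raise ValueError("predMetadata is only listed for predictOne")
    if not 0.0 <= rejection_cutoff <= 1.0:
        raise ValueError("rejectionCutoff must be between 0 and 1")

    args = ["--runMode", run_mode, "--predNormExpr", pred_norm_expr]
    if pred_metadata is not None:
        args += ["--predMetadata", pred_metadata]
    # layerObjectPaths is a comma-separated list of .pkl paths
    args += ["--layerObjectPaths", ",".join(layer_object_paths),
             "--rejectionCutoff", str(rejection_cutoff)]
    return args


# Hypothetical file names, for illustration only.
args = build_predict_args("predictOne", "query_norm_expr.csv",
                          ["Root_object.pkl"], 0.5,
                          pred_metadata="query_metadata.csv")
print(" ".join(args))
```

Dropping `pred_metadata` gives the second case, and switching `run_mode` to `"predictAll"` gives the third.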
## predictOne Option

predictOne is a devCellPy option that allows the user to predict cell types in an **individual layer** of their single-cell RNA query dataset, using a model previously trained with devCellPy.

### REQUIRED USER INPUTS:
-  **predNormExpr**: normalized expression matrix file, csv file OR a scanpy h5ad file
-  **predMetadata (optional)**: metadata file
-  **layerObjectPaths**: a list of path names to Layer objects
-  **rejectionCutoff**: float between 0 and 1 denoting the minimum probability for a prediction to not be rejected

### predNormExpr

##### FORMAT: csv file OR h5ad file

Requirements for a csv file:
* contains normalized expression of genes for each single cell 
* first column: gene names
* first row (column headers): cell barcodes
* row 1 column 1 is 'gene'

##### EXAMPLE:

`gene,AAACCCAG........,AAACGAAC........,AAACGAAT........,GAGGGATC........
MIR1302-10,0,2.14693417019908,2.31409562022533,0
OR4F29,0,1.71783906814673,0,0
LINC00115,0,0,0,2.8499342352407
ISG15,2.99811039595896,0,2.41534932603235,0`

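Before running a prediction, it can help to confirm that a csv expression matrix follows the layout described above. The sketch below uses only the Python standard library. The function name `check_norm_expr` and the toy matrix are hypothetical, and the check is not part of devCellPy.

```python
import csv
import io


def check_norm_expr(handle):
    """Return (genes, barcodes) after checking the csv layout.

    Checks that row 1, column 1 is 'gene', that every row has one value
    per cell barcode, and that all expression values are numeric.
    """
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "gene":
        raise ValueError("row 1, column 1 must be 'gene'")
    barcodes = header[1:]
    genes = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != len(header):
            raise ValueError(f"row for {row[0]!r} has {len(row) - 1} values, "
                             f"expected {len(barcodes)}")
        for value in row[1:]:
            float(value)  # raises ValueError on non-numeric entries
        genes.append(row[0])
    return genes, barcodes


# Toy matrix with made-up barcodes and values.
toy = io.StringIO("gene,cell_A,cell_B\nISG15,2.5,0\nOR4F29,0,1.7\n")
genes, barcodes = check_norm_expr(toy)
print(genes, barcodes)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object.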
### predMetadata (Optional)
The predMetadata option allows users to compare devCellPy predictions with annotations on the query dataset obtained by a different method (e.g. manual annotation).

##### FORMAT: csv file

-  row 1, column 1 should be `NA`, i.e. empty when opened in Excel
-  the first column contains cell names, which should be in the same order as the first row of predNormExpr
-  each following column contains the label for each cell at one layer of the annotation hierarchy
-  a single row therefore contains a cell barcode followed by the cell label for each subtype category
-  all other cells should be `NA`, i.e. empty when opened in Excel

Providing a prediction metadata file allows the user to test the model's predictions against existing labels on a new dataset.

##### EXAMPLE:

`,Celltype1,Timepoint,Celltype2,Celltype3
AAACCTGGTAACGTTC-1_1_1,aSHF Progenitors,E7.75,NA,NA
AAACCTGTCACAATGC-1_1_1,FHF Progenitors,E7.75,NA,NA
AAACGGGTCTGCTGCT-1_1_1,Pharyngeal Mesoderm,E7.75,NA,NA
TGATTTCTCCACGACG-1_2_3,Cardiomyocytes,E9.25,E9.25_Ventricular_CM,RV CM
TGATTTCTCTCCCTGA-1_2_3,Cardiomyocytes,E9.25,E9.25_Ventricular_CM,Septal CM
TTTGTCACATTTCACT-1_6_6,Cardiomyocytes,E16.5,Ventricular CM,NA`

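The ordering requirement between the metadata file and the expression matrix can be checked before prediction. The sketch below is a standalone check using the standard library, not devCellPy code. It assumes the expression matrix in question is the prediction matrix, and the function name and toy data are hypothetical.

```python
import csv
import io


def metadata_matches_expression(expr_handle, meta_handle):
    """True if metadata cell names appear in the same order as the barcodes."""
    barcodes = next(csv.reader(expr_handle))[1:]    # drop the 'gene' cell
    meta_rows = list(csv.reader(meta_handle))[1:]   # drop the header row
    cell_names = [row[0] for row in meta_rows if row]
    return cell_names == barcodes


# Toy inputs with made-up barcodes and labels.
expr = io.StringIO("gene,cell_A,cell_B\nISG15,2.5,0\n")
meta = io.StringIO(",Celltype1,Celltype2\n"
                   "cell_A,Cardiomyocytes,NA\n"
                   "cell_B,Endothelial,NA\n")
same_order = metadata_matches_expression(expr, meta)
print(same_order)
```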
### layerObjectPaths

##### FORMAT: a comma-separated list of paths to the trained Layer objects (pickle .pkl files)

* Layer objects are created by `trainAll`
* not all models have to be provided; prediction can be run on individual targeted layers
* **NOTE: do not rename the .pkl Layer objects**

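As a small illustration of the comma-separated format, the sketch below splits such a string into individual paths and rejects entries that are not `.pkl` files. The helper name and the file names are hypothetical, and this is not how devCellPy itself parses the option.

```python
from pathlib import Path


def parse_layer_object_paths(arg):
    """Split a comma-separated string into a list of .pkl paths."""
    paths = [Path(p.strip()) for p in arg.split(",") if p.strip()]
    if not paths:
        raise ValueError("at least one Layer object path is required")
    bad = [str(p) for p in paths if p.suffix != ".pkl"]
    if bad:
        raise ValueError(f"not .pkl files: {bad}")
    return paths


# Hypothetical file names, for illustration only.
paths = parse_layer_object_paths(
    "models/Root_object.pkl, models/Cardiomyocytes_object.pkl")
print([p.name for p in paths])
```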
##### EXAMPLES:

Predicting one layer:

Predicting multiple layers (each predicted individually):

### rejectionCutoff

##### FORMAT: float between 0 and 1

* a rejection cutoff of 0.5 means a cell will be labeled "Unclassified" if no class has a predicted probability greater than 50%
* **NOTE: See the "Post-devCellPy Analysis in R" section below for further analysis of how varying the rejection threshold affects results.**

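The rejection rule described above can be written in a few lines of Python. This is a sketch of the rule as stated in this tutorial, not devCellPy's own code; the function name and the probabilities are made up.

```python
def apply_rejection_cutoff(probabilities, cutoff):
    """Return the most probable label, or "Unclassified" if it is rejected.

    probabilities: dict mapping each class label to its predicted probability.
    A label is kept only if its probability is strictly greater than `cutoff`,
    following the "greater than 50%" wording for a cutoff of 0.5.
    """
    label, prob = max(probabilities.items(), key=lambda kv: kv[1])
    return label if prob > cutoff else "Unclassified"


# Made-up probabilities for two cells.
confident = {"Cardiomyocytes": 0.62, "Endothelial": 0.30, "Fibroblast": 0.08}
ambiguous = {"Cardiomyocytes": 0.45, "Endothelial": 0.40, "Fibroblast": 0.15}
print(apply_rejection_cutoff(confident, 0.5))
print(apply_rejection_cutoff(ambiguous, 0.5))
```

Raising the cutoff trades coverage for confidence: fewer cells receive a label, but the labels that remain have higher predicted probability.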
### OUTPUT OF PREDICT:

* creates directory "devcellpy_predictOne_(time)"
* within it, there will be a separate folder for each Layer, with 'Root' being the first Layer
* each Layer folder contains the following:
    * csv files containing the prediction and the probability associated with each label for each cell
    * metric files detailing accuracy, precision, recall, and the confusion matrix (only if a metadata file was provided as input; only present for predictOne mode)

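To make the metric files easier to interpret, the sketch below computes accuracy, per-class precision and recall, and confusion counts from a list of true labels and a list of predicted labels, using the standard definitions. It is not devCellPy's implementation; the function name and labels are made up, and counting "Unclassified" cells as ordinary wrong predictions is an assumption.

```python
from collections import Counter


def summarize(true_labels, predicted_labels):
    """Return (accuracy, {class: (precision, recall)}, confusion counts)."""
    # confusion[(true, predicted)] = number of cells with that pair
    confusion = Counter(zip(true_labels, predicted_labels))
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    accuracy = correct / len(true_labels)

    per_class = {}
    for c in sorted(set(true_labels)):
        tp = confusion[(c, c)]
        predicted_c = sum(n for (t, p), n in confusion.items() if p == c)
        actual_c = sum(n for (t, p), n in confusion.items() if t == c)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c
        per_class[c] = (precision, recall)
    return accuracy, per_class, confusion


# Made-up labels for four cells.
truth = ["CM", "CM", "EC", "EC"]
predicted = ["CM", "Unclassified", "EC", "CM"]
accuracy, per_class, confusion = summarize(truth, predicted)
print(accuracy, per_class)
```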

## predictAll Option

predictAll is a devCellPy option that allows the user to predict cell types in **all layers** of their single-cell RNA query dataset, using a model previously trained with devCellPy. The cell type predicted at each layer influences the predictions for the next layer.

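The idea that each layer influences the next can be illustrated with a toy hierarchy. In the sketch below, each layer is a plain function returning label probabilities, the label predicted at one layer selects which layer is applied next, and a rejected prediction stops the walk. Naming child layers after the parent label, stopping on rejection, and all names and probabilities are assumptions made for illustration; this is not devCellPy's implementation.

```python
def predict_all(cell, layers, cutoff, root="Root"):
    """Walk the hierarchy from `root`, stopping at a rejection or a leaf.

    layers: dict mapping a layer name to a function cell -> {label: prob}.
    Returns the list of labels assigned at each depth.
    """
    assigned = []
    current = root
    while current in layers:
        probs = layers[current](cell)
        label, prob = max(probs.items(), key=lambda kv: kv[1])
        if prob <= cutoff:
            assigned.append("Unclassified")
            break
        assigned.append(label)
        current = label  # the predicted label selects the next layer
    return assigned


# Toy stand-ins for trained models, with made-up probabilities.
toy_layers = {
    "Root": lambda cell: {"Cardiomyocytes": 0.9, "Endothelial": 0.1},
    "Cardiomyocytes": lambda cell: {"Ventricular CM": 0.7, "Atrial CM": 0.3},
}
labels = predict_all({"ISG15": 2.9}, toy_layers, cutoff=0.5)
print(labels)
```

With a cutoff of 0.8, the same cell would be labeled at the first layer and "Unclassified" at the second.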
### REQUIRED USER INPUTS:
-  **predNormExpr**: normalized expression matrix file
-  **layerObjectPaths**: a list of path names to Layer objects
-  **rejectionCutoff**: float between 0 and 1 denoting the minimum probability for a prediction to not be rejected

### predNormExpr

##### FORMAT: csv file

* contains normalized expression of genes for each single cell 
* first column: gene names
* first row (column headers): cell barcodes
* row 1 column 1 is 'gene'

##### EXAMPLE:

`gene,AAACCCAG........,AAACGAAC........,AAACGAAT........,GAGGGATC........
MIR1302-10,0,2.14693417019908,2.31409562022533,0
OR4F29,0,1.71783906814673,0,0
LINC00115,0,0,0,2.8499342352407
ISG15,2.99811039595896,0,2.41534932603235,0`

### layerObjectPaths

##### FORMAT: a comma-separated list of paths to the trained Layer objects (pickle .pkl files)

* Layer objects are created by `trainAll`
* not all models have to be provided; prediction can be run on individual targeted layers
* **NOTE: do not rename the .pkl Layer objects**

##### EXAMPLE:

### rejectionCutoff

##### FORMAT: float between 0 and 1

* a rejection cutoff of 0.5 means a cell will be labeled "Unclassified" if no class has a predicted probability greater than 50%
* **NOTE: See the "Post-devCellPy Analysis in R" section below for further analysis of how varying the rejection threshold affects results.**

### OUTPUT OF PREDICT:

* creates directory "devcellpy_predictAll_(time)"
* within it, there will be nested folders for each Layer, with 'Root' being the first Layer
* the "predictAll" directory will also contain a csv file of the predictions assigned to each cell at all layers


# Back to Table of Contents

[Table of Contents](https://github.com/devCellPy-Team/devCellPy/blob/main/Tutorial/0.tableofcontents.ipynb)