# CellPy Tutorial

CellPy is a Python package for hierarchical, multilayered classification of cells from single-cell RNA-sequencing (scRNA-seq) data. It uses the machine learning algorithm Extreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016) to automatically predict cell identities across complex permutations of layers and sublayers of annotation. An example classification hierarchy is illustrated below.

![image.png](attachment:image.png)

Because CellPy's classification scheme is highly customizable, users can supply the annotation hierarchy of their scRNA-seq datasets to guide the automatic classification and prediction of cells according to that hierarchy. CellPy allows users to designate any identity at each layer of classification and is not constrained by cell type. For example, assigning timepoint as one of the annotation layers allows cell identity predictions at that layer to be conditioned on the age of the cells. In addition to hierarchical cell classification, CellPy uses the SHapley Additive exPlanations (SHAP) package (Lundberg et al., 2020), which provides interpretability methods for the model and identifies the positive and negative gene predictors of cell identities across all annotation layers.
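To make the idea of an annotation hierarchy concrete, the sketch below represents one as a nested dictionary and lists every node that has sublayers, together with the labels a classifier at that node would choose among. The cell-type labels, the nested-dictionary representation, and the `classification_layers` helper are all illustrative assumptions; this is not CellPy's input format or its internal logic.

```python
# Hypothetical annotation hierarchy; labels are illustrative only.
hierarchy = {
    "Root": {
        "Cardiomyocyte": {"Atrial": {}, "Ventricular": {}},
        "Endothelial": {},
        "Fibroblast": {},
    }
}


def classification_layers(tree):
    """Return (parent, sorted child labels) for every node that has children."""
    layers = []
    for parent, children in tree.items():
        if children:
            layers.append((parent, sorted(children)))
            # Recurse so that sublayers are listed after their parent layer.
            layers.extend(classification_layers(children))
    return layers


layers = classification_layers(hierarchy)
for parent, labels in layers:
    print(f"{parent}: classify among {labels}")
```

In this toy hierarchy only `Root` and `Cardiomyocyte` have sublayers, so only those two nodes would need a classifier under the one-classifier-per-node assumption.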

Below we provide a comprehensive tutorial on CellPy's usage as well as overall concepts in its design.

Paper: Galdos, Xu et al., 2021

## CellPy Back End

CellPy implements a `Layer` object to maintain information regarding each layer in the classification. Each `Layer` object is encapsulated and independent. `Layer` objects can be exported from and imported into the CellPy module.
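As a rough illustration of what an independent, exportable per-layer record could look like, the sketch below defines a small dataclass and round-trips it through a file. The class name `LayerSketch`, its fields, the file name, and the choice of JSON are assumptions made for this example; CellPy's actual `Layer` class and its export format may differ.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class LayerSketch:
    """Hypothetical stand-in for a per-layer record; not CellPy's Layer class."""
    name: str
    level: int
    labels: list = field(default_factory=list)


def export_layer(layer, path):
    """Write the layer's metadata to a JSON file."""
    with open(path, "w") as fh:
        json.dump(asdict(layer), fh)


def import_layer(path):
    """Rebuild a LayerSketch from a JSON file written by export_layer."""
    with open(path) as fh:
        return LayerSketch(**json.load(fh))


layer = LayerSketch(name="Root", level=0, labels=["Cardiomyocyte", "Endothelial"])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "root_layer.json")  # hypothetical file name
    export_layer(layer, path)
    restored = import_layer(path)

print(restored == layer)
```

Because each record carries everything it needs, one layer can be saved, moved, and reloaded without reference to the others, which is the property the paragraph above describes.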

## Installation Notes

CellPy is packaged as a command-line wrapper that can be installed with pip and run from the Terminal or Command Prompt.

**NOTE:** The Python and XGBoost versions must remain the same across all training, prediction, and feature ranking runs. For example, if Python 3.7 is used to train on a dataset, Python 3.7 must also be used to predict a query dataset with the resulting trained model.
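One way to guard against an accidental version change is to record the environment at training time and compare it before predicting. The sketch below is a minimal example of that check; the helper names `environment_versions` and `mismatches` are hypothetical, and this check is not part of CellPy itself.

```python
import json
import sys
from importlib import metadata


def environment_versions():
    """Collect the Python and XGBoost versions of the current environment."""
    try:
        xgb = metadata.version("xgboost")
    except metadata.PackageNotFoundError:
        xgb = None  # XGBoost is not installed in this environment
    python = f"{sys.version_info.major}.{sys.version_info.minor}"
    return {"python": python, "xgboost": xgb}


def mismatches(recorded, current):
    """Return the names of components whose versions differ."""
    return [name for name in recorded if recorded[name] != current.get(name)]


# At training time: store the versions (here a JSON string stands in for a file).
saved = json.dumps(environment_versions())

# At prediction time: reload the record and compare with the live environment.
recorded = json.loads(saved)
current = environment_versions()
problems = mismatches(recorded, current)
if problems:
    print(f"Version mismatch in: {problems}")
else:
    print("Environment matches the training environment.")
```

Saving the record next to the trained model makes it easy to check that a later prediction or feature ranking run uses the same environment.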

<a id='toc'></a>

# Table of Contents

1) [Pre-CellPy Data Preparation](1.precellpy_tutorial.ipynb#precellpy)

2) [CellPy Train](2.train_tutorial.ipynb#train)

3) [CellPy Predict](3.predict_tutorial.ipynb#predict)

4) [CellPy Feature Ranking](4.featureranking_tutorial.ipynb#featureranking)

5) [Post-CellPy R Analysis](5.postcellpy_tutorial.ipynb#postcellpy)

6) [Cardiac Developmental Atlas Option](6.cardiacdevatlas_tutorial.ipynb#cardiac)

7) [CellPy Run Options Summary and Examples](7.summary.ipynb#summary)

8) [CellPy Code](8.code.ipynb#code)

9) [References](9.references.ipynb#references)