# Cell type identification from scRNA-seq data using ACTINN

## Why is RNA important?

To understand what scRNA-seq (single-cell RNA sequencing) data represents, let's first explore the role of RNA (ribonucleic acid) inside cells. All cells in our body, regardless of whether they make up tissues in the heart, brain, or skin, contain the same genetic code (DNA). Yet, despite having identical DNA, cells look different, perform distinct functions, and form specialized tissues and organs.

How is this possible?

The key lies in differential [gene expression](https://en.wikipedia.org/wiki/Gene_expression), the process by which genetic instructions are used to synthesize gene products like proteins. DNA acts as the blueprint, but it is RNA, specifically messenger RNA (mRNA), that translates these instructions into action. Different types of cells transcribe distinct subsets of genes into mRNAs, leading to the synthesis of unique sets of proteins. These distinct protein profiles allow cells to specialize and perform various functions within the body [1,2].

Messenger RNA (mRNA) acts as a critical intermediary, carrying genetic instructions from DNA to ribosomes, the cellular machinery responsible for protein synthesis.

Proteins like enzymes drive metabolic pathways, while structural proteins such as actin and myosin give muscles their ability to contract and move. Cytoskeletal proteins help maintain cell shape and integrity.

Therefore, the unique set of mRNA molecules present in a cell at any given moment (its **transcriptome**) reveals which genes are "turned on." This transcriptome dictates which proteins the cell is making, which in turn defines the cell's type and function. Measuring this transcriptome is the fundamental goal of single-cell RNA sequencing (scRNA-seq).

A high-level overview of single-cell RNA sequencing (scRNA-seq) is illustrated in the image below.

<img src="https://learn.gencore.bio.nyu.edu/wp-content/uploads/2018/01/scRNA-overview.jpg">

For a more in-depth understanding of the various techniques used in scRNA-seq and the challenges associated with the field, refer to the comprehensive review by Hwang et al. (2018) published in Experimental & Molecular Medicine [5].

## Single-cell RNA sequencing vs Bulk RNA sequencing

RNA-seq (short for RNA sequencing) is a technique used to identify and quantify RNA molecules in a biological sample, providing a snapshot of the transcriptome at a specific time. Sequencing of RNA can be conducted in two main ways: either by sequencing the mixed RNA pooled across the cells of the source of interest (bulk sequencing) or by sequencing the transcriptome of each cell individually (single-cell sequencing).

Bulk RNA-seq yields expression profiles averaged across cells. These are generally easier to analyze, but they hide cell-to-cell heterogeneity in expression, which may be exactly what is needed to answer the question of interest. For instance, some drugs or perturbations affect only specific cell types or the interactions between cell types.

For example, in oncology, rare drug-resistant tumor cells can cause relapse, yet they are difficult to identify with bulk RNA-seq, even on cultured cells. To uncover such relationships, it is vital to examine gene expression at the single-cell level [3].
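
The averaging effect can be illustrated with a toy calculation. The numbers below are invented for illustration (they are not taken from any dataset): a small group of cells expresses a gene strongly, but the pooled average barely moves.

```python
import numpy as np

# Hypothetical toy data: 1,000 cells, one gene of interest.
# 990 "sensitive" cells carry 2 counts each, 10 "resistant" cells carry 50.
n_sensitive, n_resistant = 990, 10
expression = np.concatenate([
    np.full(n_sensitive, 2.0),
    np.full(n_resistant, 50.0),
])

# Bulk RNA-seq effectively reports one pooled value for the whole sample.
bulk_average = expression.mean()

# Single-cell data keeps per-cell values, so the rare group stays visible.
n_high_cells = int((expression > 10).sum())

print(f"bulk average: {bulk_average:.2f}")        # 2.48
print(f"cells above 10 counts: {n_high_cells}")   # 10
```

The pooled value of 2.48 is close to the level of the majority population, while the per-cell view still shows the ten high-expressing cells.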

## Libraries commonly used for scRNA-seq analysis

[ScanPy](https://scanpy.readthedocs.io/en/stable/) is a scalable toolkit for analyzing single-cell gene expression data. It includes methods for preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks.

[AnnData](https://github.com/theislab/anndata) was presented alongside ScanPy as a generic class for handling annotated data matrices that can deal with the sparsity inherent in gene expression data.
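
To give a feel for why sparsity matters, here is a NumPy-only sketch on simulated counts. It does not use AnnData or its storage formats; the matrix size, the Poisson rate, and the simple coordinate-style storage are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count matrix: 200 cells x 500 genes, mostly zeros.
counts = rng.poisson(lam=0.1, size=(200, 500)).astype(np.float32)

# Fraction of entries that are exactly zero.
sparsity = float((counts == 0).mean())

# Coordinate-style storage keeps only the non-zero entries:
# one row index, one column index and one value per entry.
rows, cols = np.nonzero(counts)
values = counts[rows, cols]

dense_bytes = counts.nbytes
sparse_bytes = rows.nbytes + cols.nbytes + values.nbytes

print(f"zeros: {sparsity:.1%}")
print(f"dense: {dense_bytes} bytes, coordinate storage: {sparse_bytes} bytes")
```

When most entries are zero, storing only the non-zero values and their positions takes less memory than the dense array, which is the motivation for sparse-aware containers.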

This is also a good point to explore DeepChem's [tutorial](https://deepchem.io/tutorials/scanpy/) on building an scRNA-seq analysis pipeline using ScanPy.



## ACTINN (automated identification of cell types in single cell RNA sequencing)

This tutorial guides you through working with single-cell RNA sequencing (scRNA-seq) data and demonstrates how to train and evaluate a model using DeepChem. The goal is to replicate the cell type identification experiments presented in the [ACTINN](https://academic.oup.com/bioinformatics/article/36/2/533/5540320) paper.

**Background**

Cell type identification is a key task in scRNA-seq analysis. Traditional methods involve:
- Unsupervised clustering of cells,
- Identification of signature genes within each cluster,
- Manual annotation of clusters by referencing literature and public databases.

However, these methods come with several limitations:
- Clustering can be influenced by unwanted sources of variation.
- Canonical markers may be missing or ambiguous for some cell types.

**ACTINN (Automated Cell Type Identification using Neural Networks)** addresses these challenges with a supervised learning approach. Its architecture includes a neural network with three hidden layers. [4]


## Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Cell_type_identification_using_scRNAseq_data.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. You can of course run this tutorial locally if you prefer. In that case, don't run these cells, since they will download and install DeepChem on your local machine again.

In [1]:
#!pip install deepchem
import deepchem as dc
dc.__version__


'2.8.1.dev'

## Data Loading

A subset of the 'tma_ss2_cleaned' dataset from this [source](https://figshare.com/articles/ACTINN/8967116), which was originally used in ACTINN, is utilized here.

**Data Source and Preparation:**
- The data was sourced from publicly available scRNA-seq datasets, typically provided in 10X Genomics formats such as 10X_V2 or 10X_V3.
- These datasets are loaded using libraries like scipy, which handle sparse matrix formats efficiently.
- After loading, the expression matrices are converted into pandas DataFrames.
- For ease of storage and reproducibility, the DataFrames are saved in .h5 format using df.to_hdf().

This preprocessed .h5 dataset is used as input for model training and evaluation.

The data preprocessing pipeline is outlined in the original implementation of ACTINN [here](https://github.com/mafeiyang/ACTINN).


**Data Splitting**

There are two approaches to creating the train and test sets:
1. Splitting a single dataset into training and testing portions.
2. Using datasets from different sources as the train and test sets.
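
For the first approach, it usually helps to keep the proportion of each cell type similar in both portions. The sketch below shows one way to do that with NumPy. The function `stratified_split` and the toy labels are hypothetical and written for illustration; this is not how DeepChem's splitters are implemented.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cell_type in np.unique(labels):
        # Shuffle the cells of this type and send a fixed share to the test set.
        idx = np.flatnonzero(labels == cell_type)
        rng.shuffle(idx)
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

# Hypothetical labels: 80 B cells and 20 monocytes.
toy_labels = np.array(["B cell"] * 80 + ["Monocyte"] * 20)
train_idx, test_idx = stratified_split(toy_labels)

print(len(train_idx), len(test_idx))                      # 80 20
print(int((toy_labels[test_idx] == "Monocyte").sum()))    # 4
```

With a plain random split, a rare cell type could end up almost entirely in one portion; splitting per cell type avoids that.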

In [2]:
import os
import pandas as pd
import numpy as np
current_dir = os.getcwd()  # notebooks do not define __file__, so files are stored in the working directory

In [5]:
file_names = ['train_sample.h5', 'train_label_sample.txt', 'test_sample.h5', 'test_label_sample.txt']
base_url = 'https://raw.githubusercontent.com/Harindhar10/deepchem/actinn_tutorial/examples/tutorials/assets/scRNAseq'

for file_name in file_names:
    dc.utils.download_url(
        f"{base_url}/{file_name}",  # remote URL
        current_dir,                # destination directory
        file_name                   # local file name
    )

### 1. Same source

In [3]:
labels = pd.read_csv(os.path.join(current_dir,'train_label_sample.txt'), header=None, sep='\t') 
dataset = pd.read_hdf(os.path.join(current_dir,'train_sample.h5'))

In [4]:
dataset

Unnamed: 0,tma_facs_14301,tma_facs_4756,tma_facs_14382,tma_facs_423,tma_facs_15612,tma_facs_4729,tma_facs_8174,tma_facs_9855,tma_facs_9055,tma_facs_8668,...,tma_facs_5182,tma_facs_19106,tma_facs_11386,tma_facs_14006,tma_facs_7477,tma_facs_15352,tma_facs_1314,tma_facs_2172,tma_facs_1625,tma_facs_12420
Cts7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Sdr42e1,0,0,0,0,0,0,0,46,52,0,...,0,0,0,0,0,0,0,0,0,0
Commd5,0,0,0,0,0,0,0,89,0,0,...,0,0,0,0,0,0,0,0,50,0
Fam170a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Olfr748,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Bcar3,0,0,0,0,0,0,124,34,0,0,...,0,0,0,0,0,0,8,0,0,0
Hk1,0,0,5,0,2,0,71,149,37,0,...,0,0,10,0,0,149,0,0,3,0
Sh3glb1,18,0,10,0,6,678,205,79,86,4,...,187,0,408,69,12,0,11,0,13,36
Calr4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
labels # Columns: cell ID and cell type

Unnamed: 0,0,1
0,tma_facs_14301,Monocyte
1,tma_facs_4756,Endothelial cell
2,tma_facs_14382,Monocyte
3,tma_facs_423,B cell
4,tma_facs_15612,Monocyte
...,...,...
3995,tma_facs_15352,Monocyte
3996,tma_facs_1314,B cell
3997,tma_facs_2172,B cell
3998,tma_facs_1625,B cell


In [6]:
n_types = 0  # Number of cell types in the dataset; needed later to define the output layer of the model.

# Converts cell type labels from strings ('B cell', 'T cell', etc.) to integers (0, 1, etc.)
def convert_type2label(types):
    global n_types
    types = list(types)
    unique_types = sorted(set(types)) # Sorting ensures consistent label order
    n_types = len(unique_types)

    type_to_label_dict = {t: i for i, t in enumerate(unique_types)}
    labels = np.array([type_to_label_dict[t] for t in types])

    return labels

labels_int = convert_type2label(labels[1])

In [7]:
# Reshaping labels from (n,) to (n, 1) to be compatible with DeepChem's Dataset class
labels_int = labels_int.reshape(labels_int.shape[0],1)
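
As an aside, NumPy can produce the same kind of sorted-order encoding in a single call. This is an alternative sketch on made-up labels, not a replacement for the function used in the tutorial.

```python
import numpy as np

cell_types = np.array(["Monocyte", "Endothelial cell", "Monocyte", "B cell"])

# unique_types is sorted; codes[i] is the position of cell_types[i] in it.
unique_types, codes = np.unique(cell_types, return_inverse=True)

# Shape (n_cells, 1), matching the reshape used in the tutorial.
labels_2d = codes.reshape(-1, 1)

print(unique_types.tolist())  # ['B cell', 'Endothelial cell', 'Monocyte']
print(codes.tolist())         # [2, 1, 2, 0]
```

Indexing `unique_types[codes]` recovers the original names, which is handy when translating predictions back into cell types.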

In [8]:
# gene IDs are stored
train_genes = dataset.index.to_numpy()

# genes x cells (features x samples) -> cells x genes (samples x features)
dataset = np.transpose(dataset)

# Creating DeepChem 'Dataset' object
dataset = dc.data.NumpyDataset(X=np.array(dataset,dtype=np.float32),y = labels_int, ids= labels[0])

In [9]:
splitter = dc.splits.SingletaskStratifiedSplitter()
train_dataset, test_dataset = splitter.train_test_split(dataset)

### 2. Different sources

When using datasets from different sources, the gene sets may not completely overlap. Since the ACTINN model defines its input layer based on the gene set, it's essential that both the training and testing sets contain the same genes. To ensure this, we first identify the genes common to both datasets and filter out those that are not shared.

We will now demonstrate how to prepare data from different sources, which is the method we'll use for the rest of this tutorial.

In [10]:
train_set = pd.read_hdf(os.path.join(current_dir,'train_sample.h5'))
train_labels = pd.read_csv(os.path.join(current_dir,'train_label_sample.txt'), header=None, sep='\t')
test_set = pd.read_hdf(os.path.join(current_dir,'test_sample.h5'))
test_labels = pd.read_csv(os.path.join(current_dir,'test_label_sample.txt'), header=None, sep='\t')

common_genes = train_set.index.intersection(test_set.index)
common_genes = sorted(common_genes)

train_set = train_set.loc[common_genes]
test_set = test_set.loc[common_genes]
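
The same idea can be shown on a toy example with NumPy. The gene lists and annotations below are hypothetical. The second half illustrates a related precaution that is an assumption of this sketch rather than something the tutorial does: building one label mapping from the training labels and reusing it for the test labels, so that a given integer always refers to the same cell type.

```python
import numpy as np

# Hypothetical gene lists from two sources.
train_gene_ids = np.array(["Hk1", "Bcar3", "Calr4", "Commd5"])
test_gene_ids = np.array(["Commd5", "Hk1", "Sdr42e1", "Bcar3"])

# Sorted genes present in both datasets.
common = np.intersect1d(train_gene_ids, test_gene_ids)
print(common.tolist())  # ['Bcar3', 'Commd5', 'Hk1']

# Hypothetical cell type annotations.
train_types = ["B cell", "Monocyte", "T cell"]
test_types = ["T cell", "B cell"]

# One mapping, built from the training labels and reused for the test labels.
type_to_label = {t: i for i, t in enumerate(sorted(set(train_types)))}
train_y = np.array([type_to_label[t] for t in train_types])
test_y = np.array([type_to_label[t] for t in test_types])

print(train_y.tolist(), test_y.tolist())  # [0, 1, 2] [2, 0]
```

If the test set were encoded on its own here, "T cell" would become 1 instead of 2, and predictions would be scored against the wrong classes.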

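Convert cell type labels from strings ('B cell', 'T cell', etc.) to integers (0, 1, etc.). Because `convert_type2label` builds its mapping independently on each call, this step assumes that the training and test sets contain the same set of cell types; each call also overwrites `n_types`.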

In [11]:
train_labels_int = convert_type2label(train_labels[1])
test_labels_int = convert_type2label(test_labels[1])

# Reshaping labels from (n,) to (n, 1) to be compatible with DeepChem's Dataset class
train_labels_int = train_labels_int.reshape(train_labels_int.shape[0],1)
test_labels_int = test_labels_int.reshape(test_labels_int.shape[0],1)

In [12]:
# gene IDs are stored
train_genes = train_set.index.to_numpy()
test_genes = test_set.index.to_numpy()

In [13]:
print(len(train_genes),len(test_genes))

2000 2000


In [14]:
# genes x cells (features x samples) -> cells x genes (samples x features)
train_set = np.transpose(train_set)
test_set = np.transpose(test_set)

In [15]:
# Creating DeepChem 'Dataset' object
train_dataset = dc.data.NumpyDataset(X=np.array(train_set,dtype=np.float32),y = train_labels_int, ids= train_labels[0])
test_dataset = dc.data.NumpyDataset(X=np.array(test_set,dtype=np.float32), y = test_labels_int, ids= test_labels[0])

In [16]:
train_dataset

<NumpyDataset X.shape: (4000, 2000), y.shape: (4000, 1), w.shape: (4000, 1), task_names: [0]>

In [17]:
test_dataset

<NumpyDataset X.shape: (1000, 2000), y.shape: (1000, 1), w.shape: (1000, 1), task_names: [0]>

## Data Preprocessing

Starting from raw counts, an scRNA-seq data analysis typically includes normalization, feature selection, and dimension reduction steps.

In a standard RNA-seq workflow, RNA molecules are first captured from cells, reverse transcribed (i.e., RNA is converted to DNA) into complementary DNA, and sequenced. The resulting short reads are then computationally aligned to reference genes to obtain count data. Each stage of this process introduces some degree of technical variability, even among cells that are biologically identical.

The preprocessing pipeline includes the following steps:
1) Library-size normalize to 10,000 counts per cell and Log2(x+1) transform
2) Filter genes by total expression (1st–99th percentile)
3) Filter genes by coefficient of variation (1st–99th percentile)
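
The three steps can be sketched as a single NumPy function. This is a compact illustration under stated assumptions, not the ACTINN reference code: the function name `preprocess_counts` and its parameters are hypothetical, every cell is assumed to have at least one count, and the simulated input is invented.

```python
import numpy as np

def preprocess_counts(counts, scale=10_000, lower=1, upper=99):
    """counts: array of shape (n_cells, n_genes) holding raw counts.

    Returns the transformed matrix restricted to the kept genes and a
    boolean mask over the original genes.
    """
    counts = np.asarray(counts, dtype=np.float64)

    # 1) library-size normalization followed by log2(x + 1)
    x = counts / counts.sum(axis=1, keepdims=True) * scale
    x = np.log2(x + 1)

    # 2) keep genes whose total expression lies within the percentile range
    total = x.sum(axis=0)
    low, high = np.percentile(total, [lower, upper])
    keep = (total >= low) & (total <= high)

    # Drop genes with zero mean so the coefficient of variation is defined.
    mean = x.mean(axis=0)
    keep &= mean > 0

    # 3) among the remaining genes, keep those whose CV lies within range
    idx = np.flatnonzero(keep)
    cv = x[:, idx].std(axis=0) / mean[idx]
    low_cv, high_cv = np.percentile(cv, [lower, upper])
    keep[idx[(cv < low_cv) | (cv > high_cv)]] = False

    return x[:, keep], keep

# Hypothetical simulated counts: 50 cells x 300 genes.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(50, 300))
x_filtered, gene_mask = preprocess_counts(toy_counts)
print(x_filtered.shape, int(gene_mask.sum()))
```

Returning the boolean mask alongside the matrix makes it possible to apply exactly the same gene selection to another dataset later.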

**Normalization** seeks to adjust for differences in experimental conditions between samples (individual cells), so that these do not confound true biological differences [6].
This step adjusts the raw counts for variable sampling effects (the number of molecules captured and sequenced differs from cell to cell) so that expression values are comparable across cells. Several normalization techniques of varying complexity are used in practice. For our analysis, we'll use the shifted logarithm described in [7]: counts are scaled by library size and then transformed with log2(x + 1). The log transformation reduces the skewness of the data, which makes it better suited to machine learning applications.
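
Written out, the transformation used in this tutorial maps the raw count $x_{cg}$ of gene $g$ in cell $c$ to a normalized value $y_{cg}$. The symbols are introduced here for illustration and do not come from the cited references.

```latex
y_{cg} = \log_2\!\left( 10^{4} \cdot \frac{x_{cg}}{\sum_{g'} x_{cg'}} + 1 \right)
```

As a worked example with made-up numbers, a gene with 5 counts in a cell with 2,000 total counts is scaled to $10^{4} \cdot 5 / 2000 = 25$ and then transformed to $\log_2(26) \approx 4.70$. A gene with 0 counts maps to $\log_2(1) = 0$, so zeros stay zeros.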

The image below illustrates how the distribution of counts changes after applying the shifted logarithm, compared to the total counts in the raw dataset.

<img src="https://www.sc-best-practices.org/_images/e7db84d20620d812e8d3b77a196d247d0ac339a306ac38eacb69c4a0a69f0321.png">

**Feature selection**, or identification of informative genes, is accomplished by ranking genes by total expression and by coefficient of variation (CV = standard deviation / mean), and removing the top and bottom 1% in each ranking. Very highly expressed genes may be housekeeping genes (genes that are essential for basic cellular functions and expressed in all cells), while genes with very low counts are likely to reflect technical noise. Genes with a low CV have stable expression across all cells relative to their mean; like housekeeping genes, their lack of variation makes them uninformative. Extremely high CV values are often artifacts of very low average expression: when just a few transcripts are detected in only a handful of cells, the standard deviation is large relative to a near-zero mean, which artificially inflates the CV.
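
The two extremes of the CV can be reproduced with invented numbers. In the sketch below, one hypothetical gene is detected in only 3 of 1,000 cells, and another is expressed at a nearly constant level in every cell.

```python
import numpy as np

n_cells = 1000

# Hypothetical gene A: detected in only 3 of 1,000 cells.
rare_gene = np.zeros(n_cells)
rare_gene[:3] = 4.0

# Hypothetical gene B: housekeeping-like, similar level in every cell.
rng = np.random.default_rng(0)
stable_gene = 8.0 + rng.normal(scale=0.2, size=n_cells)

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return values.std() / values.mean()

cv_rare = coefficient_of_variation(rare_gene)
cv_stable = coefficient_of_variation(stable_gene)

print(f"rare gene CV:   {cv_rare:.1f}")
print(f"stable gene CV: {cv_stable:.3f}")
```

The rare gene has a mean of only 0.012, so its CV comes out above 18, while the stable gene's CV is a few hundredths. Trimming both tails of the CV ranking removes genes of both kinds.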


## Training Data Preprocessing

Each cell’s expression values are normalized to the cell’s total expression and multiplied by a scale factor of 10,000.
1) Library-size normalize to 10,000 and apply the log2(x + 1) transform

In [18]:
# extract gene names & data array
gene_names = train_genes
X = train_dataset.X

row_sums = X.sum(axis=1, keepdims=True)  # shape (n_cells, 1)
X /= row_sums # broadcast divide
X *= 10000

X.shape # (number of cells, number of genes)

(4000, 2000)

The counts are increased by 1, and the log2 value is calculated.

In [19]:
# log2(x + 1) transform
X = np.log2(X + 1)

2) To filter out outlier genes, the genes with the highest 1% and lowest 1% of total expression are removed.

In [20]:
expr = X.sum(axis=0)
low, high = np.percentile(expr, [1, 99])
mask_expr = (expr >= low) & (expr <= high)

X = X[:, mask_expr]
gene_names = gene_names[mask_expr]


Next, genes with zero mean (genes that aren't expressed in any of the cells) are removed, so that the coefficient of variation is well defined.

In [21]:
mean_expr = X.mean(axis=0)
mask_mean = mean_expr > 0
X = X[:, mask_mean]
gene_names = gene_names[mask_mean]

3) The genes with the highest 1% and the lowest 1% coefficient of variation (standard deviation / mean) are removed.

In [22]:
mean_expr = X.mean(axis=0)
cv = X.std(axis=0) / mean_expr
low_cv, high_cv = np.percentile(cv, [1, 99])
mask_cv = (cv >= low_cv) & (cv <= high_cv)

X = X[:, mask_cv]
gene_names = gene_names[mask_cv]
train_genes = gene_names

In [23]:
# Creating DeepChem 'Dataset' object
train_dataset = dc.data.NumpyDataset(X=X,y = train_labels_int, ids= train_labels[0])

In [24]:
train_dataset

<NumpyDataset X.shape: (4000, 1865), y.shape: (4000, 1), w.shape: (4000, 1), task_names: [0]>

## Test Data Preprocessing

The same normalization approach used for the training set is applied to the test set.

First, each cell’s expression values are normalized to the cell’s total expression and multiplied by a scale factor of 10,000. The counts are then increased by 1, and the log2 value is calculated.

The gene list obtained from the train set filtering steps is used to mask the test set. In the original ACTINN implementation, genes were filtered using both the training and test sets. To avoid potential information leakage, we instead apply filtering based solely on the training set.
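
A minimal sketch of applying a training-derived gene list to a test matrix is shown below, on hypothetical gene names and values. It selects test columns by looking up each kept training gene, so the resulting column order follows the training features even if the test genes were stored in a different order.

```python
import numpy as np

# Hypothetical gene names; kept_train_genes stands in for the output of
# the training-set filtering steps.
test_gene_ids = np.array(["Bcar3", "Calr4", "Commd5", "Hk1", "Sh3glb1"])
kept_train_genes = np.array(["Bcar3", "Commd5", "Sh3glb1"])

# Hypothetical test matrix: 2 cells x 5 genes (columns follow test_gene_ids).
X_test = np.arange(10, dtype=np.float32).reshape(2, 5)

# For each kept training gene, look up its column in the test matrix.
column_of = {gene: j for j, gene in enumerate(test_gene_ids)}
columns = [column_of[g] for g in kept_train_genes]  # KeyError if a gene is missing

X_test_filtered = X_test[:, columns]
print(X_test_filtered.tolist())  # [[0.0, 2.0, 4.0], [5.0, 7.0, 9.0]]
```

No statistic of the test set enters the selection, which is the point of filtering on the training set only. In this tutorial both gene lists come from the same sorted list of common genes, so a boolean mask over the test genes yields the same column order.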

In [25]:
# extract data array
X = test_dataset.X

# library-size normalize to 10,000 (in-place)
row_sums = X.sum(axis=1, keepdims=True)  # shape (n_cells, 1)
X /= row_sums  # broadcast divide
X *= 10000

# log2(x + 1) transform
X = np.log2(X + 1, out=X)


In [26]:
print(len(train_genes), len(test_genes))

1865 2000


In [27]:
# The gene list obtained from the train set filtering steps is used to mask the test set.

test_gene_mask = np.isin(test_genes, train_genes)
X = X[:,test_gene_mask]

In [28]:
test_dataset = dc.data.NumpyDataset(X=np.array(X,dtype=np.float32), y=test_labels_int, ids=test_labels[0])

In [29]:
test_dataset

<NumpyDataset X.shape: (1000, 1865), y.shape: (1000, 1), w.shape: (1000, 1), task_names: [0]>

## Model implementation

ACTINN is implemented here as a PyTorch module wrapped in DeepChem's `TorchModel` class.

In [31]:
import torch.nn as nn
import torch.nn.functional as F
from deepchem.models.torch_models import TorchModel
from deepchem.models.losses import SparseSoftmaxCrossEntropy
from deepchem.models.optimizers import Adam
from deepchem.models.optimizers import ExponentialDecay
from deepchem.metrics import from_one_hot


class ActinnClassifier(nn.Module):

    def __init__(self, output_dim=None, input_size=None):

        if output_dim is None or input_size is None:
            raise ValueError('Must explicitly declare input dim (num features) and output dim (number of classes)')

        super(ActinnClassifier, self).__init__()
        self.inp_dim = input_size
        self.out_dim = output_dim

        # feed forward layers
        self.classifier = nn.Sequential(
                                        nn.Linear(self.inp_dim, 100),
                                        nn.ReLU(),

                                        nn.Linear(100, 50),
                                        nn.ReLU(),

                                        nn.Linear(50, 25),
                                        nn.ReLU(),

                                        nn.Linear(25, output_dim)
                                        )

    def forward(self, x):
        """
        Forward pass of the classifier
        """
        logits = self.classifier(x)
        probabilities = F.softmax(logits, dim=1)
        predictions = from_one_hot(probabilities.cpu().detach(), axis=1)
        return (predictions, logits)
    

class ACTINNModel(TorchModel):
    def __init__(self, output_dim = None, input_size = None, **kwargs):

        self.model = ActinnClassifier(output_dim, input_size)
        self.criterion = SparseSoftmaxCrossEntropy()
        cf_optimizer = Adam(learning_rate=0.0001,
                            beta1=0.9,
                            beta2=0.999,
                            epsilon=1e-08,
                            weight_decay=0.005,
                            )

        cf_decayRate = 0.95
        cf_lr_scheduler = ExponentialDecay(initial_rate=0.0001, decay_rate=cf_decayRate, decay_steps=1000)
        super(ACTINNModel,
              self).__init__(self.model,
                             loss=self.criterion,
                             optimizer=cf_optimizer,
                             learning_rate=cf_lr_scheduler,
                             output_types=['prediction', 'loss'],
                             **kwargs)
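
To make the last step of the classifier concrete, the following NumPy sketch shows how logits turn into class probabilities, predicted labels, and a cross-entropy loss against integer labels. The helper names and toy logits are hypothetical, and the exact reduction and weighting that DeepChem applies to the loss may differ from the plain mean used here.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(logits, labels):
    """Mean negative log-probability of the true class (integer labels)."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

# Hypothetical logits for 2 cells and 3 cell types.
toy_logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
toy_labels = np.array([0, 2])

probs = softmax(toy_logits)
predicted = probs.argmax(axis=1)  # index of the most probable cell type
loss = sparse_cross_entropy(toy_logits, toy_labels)

print(predicted.tolist())  # [0, 2]
print(round(loss, 3))
```

"Sparse" here refers to the labels being class indices rather than one-hot vectors, which is why the labels were kept as integers during data preparation.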


## Model training and evaluation

In [32]:
model = ACTINNModel(output_dim= n_types,input_size= len(train_genes))

In [161]:
model.fit(train_dataset, nb_epoch=7)
predictions = model.predict(test_dataset)

In [162]:
classification_metric = dc.metrics.Metric(dc.metrics.accuracy_score)
scores = model.evaluate(test_dataset, [classification_metric], n_classes=n_types)

In [163]:
print(scores['accuracy_score'])

0.971
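
Overall accuracy can hide weak performance on rare cell types, so it can be useful to look at per-class results as well. The sketch below uses made-up labels, not the tutorial's predictions, to compute accuracy, a confusion matrix, and per-class recall with NumPy.

```python
import numpy as np

# Hypothetical true and predicted integer labels for 8 cells, 3 cell types.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])
n_classes = 3

accuracy = float((y_true == y_pred).mean())

# confusion[i, j] counts cells of true type i predicted as type j.
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# Fraction of each true cell type that was recovered (per-class recall).
per_class_recall = confusion.diagonal() / confusion.sum(axis=1)

print(accuracy)                   # 0.75
print(confusion.tolist())         # [[3, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_recall.tolist())  # [0.75, 1.0, 0.5]
```

In this toy case the least frequent type (label 2) is recovered only half the time, even though overall accuracy is 0.75.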


### References
[1] Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., ... & Surani, M. A. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nature methods, 6(5), 377-382.

[2] Alberts, B., Johnson, A., Lewis, J., Morgan, D., Raff, M., Roberts, K., & Walter, P. (2014). Molecular Biology of the Cell (6th ed.). Garland Science.

[3] Tzec-Interián, J. A., González-Padilla, D., & Góngora-Castillo, E. B. (2025). Bioinformatics perspectives on transcriptomics: A comprehensive review of bulk and single-cell RNA sequencing analyses.

[4] Ma, F., & Pellegrini, M. (2019). ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics, 36(2), January 2020, 533–538.

[5] Hwang, B., Lee, J. H., & Bang, D. (2018). Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & Molecular Medicine, 50(8), 96. https://doi.org/10.1038/s12276-018-0071-8

[6] https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6

[7] Ahlmann-Eltze, C., & Huber, W. (2023). Comparison of transformations for single-cell RNA-seq data. Nature Methods, 20(5), 665–672. https://doi.org/10.1038/s41592-023-01814-1


# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Discord
The DeepChem [Discord](https://discord.gg/cGzwCdrUqS) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

In [None]:
@manual {Bioinformatics,
 title={Identifying Cell Types using scRNA-seq Data with ACTINN and Deepchem},
 organization={DeepChem},
 author={Harindhar, Rakshit, Bharath},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Cell_type_identification_using_scRNAseq_data.ipynb}},
 year={2025},
}