# Introduction to cell type identification from scRNA-seq data using ACTINN

Cell type identification is one of the major goals in single cell RNA sequencing (scRNA-seq). Traditional methods for assigning cell types typically involve the use of unsupervised clustering, the identification of signature genes in each cluster, followed by a manual lookup of these genes in the literature and databases to assign cell types. However, there are several limitations associated with these approaches, such as unwanted sources of variation that influence clustering and a lack of canonical markers for certain cell types.[1]

In this tutorial, we demonstrate the use of ACTINN (Automated Cell Type Identification using Neural Networks), a neural network with three hidden layers. It is trained on datasets with predefined cell types and then predicts cell types for other datasets using the trained parameters.

Here, ACTINN is implemented with a DeepChem wrapper.

## Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Advanced_Model_Training.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. You can of course run this tutorial locally if you prefer. In that case, don't run these cells, since they will download and install DeepChem on your local machine again.

In [None]:
#!pip install --pre deepchem
import deepchem as dc
dc.__version__


'2.8.1.dev'

In [None]:
import os
# Notebooks do not define __file__, so use the current working directory
current_dir = os.getcwd()

In [None]:
# Use GitHub /raw/ URLs so that the files themselves are downloaded,
# not the HTML pages served by /tree/ URLs.
dc.utils.download_url(
    'https://github.com/Harindhar10/deepchem/raw/actinn/deepchem/feat/tests/data/sc_rna_seq_data/scRNAseq_sample_1.h5',
    current_dir,
    'sample_1.h5'
)

dc.utils.download_url(
    'https://github.com/Harindhar10/deepchem/raw/actinn/deepchem/feat/tests/data/sc_rna_seq_data/labels_1.txt',
    current_dir,
    'labels_1.txt'
)

dc.utils.download_url(
    'https://github.com/Harindhar10/deepchem/raw/actinn/deepchem/feat/tests/data/sc_rna_seq_data/scRNAseq_sample_2.h5',
    current_dir,
    'sample_2.h5'
)

dc.utils.download_url(
    'https://github.com/Harindhar10/deepchem/raw/actinn/deepchem/feat/tests/data/sc_rna_seq_data/labels_2.txt',
    current_dir,
    'labels_2.txt'
)

## Data Loading

Subsets of data obtained from [Tabula Muris](https://tabula-muris.ds.czbiohub.org) are used for simplicity.

There are two approaches to creating the train and test sets:
1. Splitting a single dataset into training and testing portions.
2. Using datasets from different sources as the train and test sets.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### 1. Same source

In [None]:
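labels = pd.read_csv(os.path.join(current_dir,'labels_1.txt'), header=None, sep='\t')
dataset = pd.read_hdf(os.path.join(current_dir,'sample_1.h5'))

# Stratified split of the cell ids based on cell type
train_ids, test_ids = train_test_split(
    labels[0],
    test_size=0.2,
    stratify=labels[1],
    random_state=42
)

# train_test_split shuffles the ids, so look the labels up in the same order
# as the ids. This keeps labels aligned with the columns selected below
# (assumes cell ids are unique).
labels_by_id = labels.set_index(0)
train_labels = labels_by_id.loc[train_ids].reset_index()
test_labels = labels_by_id.loc[test_ids].reset_index()

train_set = dataset.loc[:, train_ids]
test_set = dataset.loc[:, test_ids]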

Note: the data is stored as genes x cells (features x samples), which isn't compatible with DeepChem's splitter and dataset objects, so scikit-learn's `train_test_split` is used here. TODO: use a DeepChem splitter instead.

`labels` has two columns: the cell id (column 0) and the cell type (column 1).
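As a rough illustration of the layout issue noted above, the sketch below transposes a toy genes x cells table into cells x genes and reorders the labels to match the rows. The gene names, cell ids and cell types are made up for the example; only the orientation and the two-column label layout follow this tutorial.

```python
import pandas as pd

# Toy genes x cells matrix (hypothetical gene and cell names).
toy = pd.DataFrame(
    [[1, 0, 3], [0, 2, 4]],
    index=["GeneA", "GeneB"],
    columns=["cell_1", "cell_2", "cell_3"],
)

# Toy label table: column 0 is the cell id, column 1 is the cell type.
toy_labels = pd.DataFrame({
    0: ["cell_3", "cell_1", "cell_2"],
    1: ["B cell", "T cell", "T cell"],
})

# Transpose to cells x genes so that each row is one sample.
cells_by_genes = toy.T

# Reorder the labels to follow the row order of the matrix.
aligned = toy_labels.set_index(0).loc[cells_by_genes.index, 1]

X_toy = cells_by_genes.to_numpy()
y_toy = aligned.to_numpy()
print(X_toy.shape, list(y_toy))
```

With one row per cell, row-wise splitters and dataset wrappers can treat each cell as a sample.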

### 2. Different sources

When using datasets from different sources, the gene sets may not completely overlap. Since the ACTINN model defines its input layer based on the gene set, it's essential that both the training and testing sets contain the same genes in the same order. To ensure this, we first identify the genes common to both datasets and filter out those that are not shared.

In [None]:
train_set = pd.read_hdf(os.path.join(current_dir,'sample_1.h5'))
train_labels = pd.read_csv(os.path.join(current_dir,'labels_1.txt'), header=None, sep='\t')
test_set = pd.read_hdf(os.path.join(current_dir,'sample_2.h5'))
test_labels = pd.read_csv(os.path.join(current_dir,'labels_2.txt'), header=None, sep='\t')

common_genes = train_set.index.intersection(test_set.index)
common_genes = sorted(common_genes)

train_set = train_set.loc[common_genes]
test_set = test_set.loc[common_genes]
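The effect of the gene intersection can be checked on a toy example. The two small tables below are hypothetical (the gene names and values are only for illustration); after filtering, both share the same genes in the same sorted order.

```python
import pandas as pd

# Two toy genes x cells tables with partially overlapping genes.
a = pd.DataFrame({"c1": [1, 2, 3]}, index=["Gapdh", "Actb", "Cd3e"])
b = pd.DataFrame({"c9": [4, 5, 6]}, index=["Cd3e", "Gapdh", "Cd19"])

# Keep only shared genes, sorted so both tables use the same row order.
shared = sorted(a.index.intersection(b.index))
a_filtered = a.loc[shared]
b_filtered = b.loc[shared]

print(list(a_filtered.index), list(b_filtered.index))
```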

### Converting cell type labels from strings to integers

Labels are typically present as strings such as 'B Cell' or 'T Cell'.

In [None]:
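def build_type_to_label(types):
    """Map each distinct cell type to an integer (sorted for reproducibility)."""
    return {cell_type: i for i, cell_type in enumerate(sorted(set(types)))}


def convert_type2label(types, type_to_label_dict):
    """Convert cell type strings to integer labels using a given mapping."""
    return np.array([type_to_label_dict[cell_type] for cell_type in types])


# Build the mapping once, from the training labels, and reuse it for the test
# labels so that the same integer always denotes the same cell type.
# A cell type that appears only in the test set raises a KeyError.
type_to_label_dict = build_type_to_label(train_labels[1])
n_types = len(type_to_label_dict)

train_labels = convert_type2label(train_labels[1], type_to_label_dict)
test_labels = convert_type2label(test_labels[1], type_to_label_dict)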

## Train set data transformation and gene filtering

Next, each cell’s expression values are normalized to the cell’s total expression and multiplied by a scale factor of 10,000.

In [None]:
# 1) extract gene names & data array
gene_names = train_set.index.to_numpy()
X = np.array(train_set, dtype=np.float32)

# 2) library-size normalize to 10,000 (in-place)
col_sums = X.sum(axis=0, keepdims=True)  # shape (1, n_cells)
X /= col_sums  # broadcast divide
X *= 10000

The normalized values are then increased by 1, and the log2 value is calculated.

In [None]:
# 3) log2(x + 1) transform (in-place)
np.log2(X + 1, out=X)
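To make the two steps above concrete, here is a small worked example on a toy genes x cells count matrix. The numbers are invented for illustration. Note that a cell with zero total counts would cause a division by zero; this sketch assumes such cells have already been removed.

```python
import numpy as np

# Toy genes x cells count matrix: 2 genes, 2 cells.
counts = np.array([[1, 0],
                   [3, 5]], dtype=np.float32)

# Library-size normalization: every cell (column) sums to 10,000 afterwards.
scaled = counts / counts.sum(axis=0, keepdims=True) * 10000

# log2(x + 1) keeps zeros at zero and compresses large values.
logged = np.log2(scaled + 1)

print(scaled)
print(logged)
```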

To filter out outlier genes, the genes with the highest 1% and lowest 1% total expression are removed.

In [None]:
# 4) filter by total expression
expr = X.sum(axis=1)  # total per gene
low, high = np.percentile(expr, [1, 99])
mask_expr = (expr >= low) & (expr <= high)
X = X[mask_expr, :]
gene_names = gene_names[mask_expr]
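As a quick check of how the percentile thresholds behave, the sketch below applies the same filter to 100 hypothetical genes whose total expression is simply 1 to 100. With NumPy's default linear interpolation, only the lowest and the highest gene fall outside the thresholds.

```python
import numpy as np

# Hypothetical total expression for 100 genes: 1, 2, ..., 100.
toy_expr = np.arange(1, 101, dtype=np.float64)

# Thresholds at the 1st and 99th percentiles.
toy_low, toy_high = np.percentile(toy_expr, [1, 99])

# Keep genes whose total expression lies within the thresholds.
toy_mask = (toy_expr >= toy_low) & (toy_expr <= toy_high)

print(toy_low, toy_high, toy_mask.sum())
```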

Genes with the highest 1% and the lowest 1% coefficient of variation (standard deviation divided by mean expression) are also removed.

In [None]:
# 5) filter by coefficient of variation
mean_expr = X.mean(axis=1)
cv = X.std(axis=1) / mean_expr
low_cv, high_cv = np.percentile(cv, [1, 99])
mask_cv = (cv >= low_cv) & (cv <= high_cv)
X = X[mask_cv, :]
gene_names = gene_names[mask_cv]

# genes x cells to cells x genes
train_set = np.transpose(X)

train_genes = gene_names
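The coefficient of variation used in this filter measures variability relative to the mean, so it is not affected by a gene's overall expression level. A minimal sketch with two made-up genes, one of which is exactly ten times the other:

```python
import numpy as np

# Two toy genes across three cells; the second is 10x the first.
toy_genes = np.array([[2.0, 4.0, 6.0],
                      [20.0, 40.0, 60.0]])

toy_sd = toy_genes.std(axis=1)            # population standard deviation per gene
toy_cv = toy_sd / toy_genes.mean(axis=1)  # coefficient of variation per gene

print(toy_sd, toy_cv)
```

The standard deviations differ by a factor of ten, while the coefficients of variation are identical.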

## Test set transformation and filtering

The same normalization approach used for the train set is applied to the test set. First, each cell’s expression values are normalized to the cell’s total expression and multiplied by a scale factor of 10,000. The normalized values are then increased by 1, and the log2 value is calculated.

In [None]:
# 1) extract gene names & data array
test_genes = test_set.index.to_numpy()
X = np.array(test_set, dtype=np.float32)

# 2) library-size normalize to 10,000 (in-place)
col_sums = X.sum(axis=0, keepdims=True)  # shape (1, n_cells)
X /= col_sums  # broadcast divide
X *= 10000

# 3) log2(x + 1) transform (in-place)
np.log2(X + 1, out=X)

# genes x cells --> cells x genes
test_set = np.transpose(X)


The gene list obtained from the train set filtering steps is used to mask the test set. In the original ACTINN implementation, genes were filtered using both the training and test sets. To avoid potential information leakage, we instead apply filtering based solely on the training set.

In [None]:
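# test_set is now a NumPy array (cells x genes), so select columns by position.
# Every gene in train_genes is present in test_genes for both loading approaches above.
test_gene_positions = pd.Index(test_genes).get_indexer(train_genes)
test_set = test_set[:, test_gene_positions]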
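Selecting columns of a NumPy array by gene name requires translating names into positions. The sketch below shows one way to do this with `pandas.Index.get_indexer`, on made-up gene names; it also guards against genes that are missing from the test set, which `get_indexer` reports as -1.

```python
import numpy as np
import pandas as pd

# Hypothetical gene names and a toy cells x genes matrix (2 cells, 4 genes).
toy_test_genes = np.array(["A", "B", "C", "D"])
toy_train_genes = np.array(["D", "B"])
toy_test = np.arange(8).reshape(2, 4)

# Position of each training gene within the test gene list (-1 if absent).
positions = pd.Index(toy_test_genes).get_indexer(toy_train_genes)
if (positions < 0).any():
    raise KeyError("some training genes are missing from the test set")

# Columns now follow the order of the training genes.
toy_filtered = toy_test[:, positions]
print(toy_filtered)
```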

## Model Definition

ACTINN is implemented as a PyTorch module wrapped in a DeepChem `TorchModel`.

In [None]:
import torch.nn as nn
import torch
from deepchem.models.torch_models import TorchModel
from deepchem.models.optimizers import Adam
from deepchem.models.optimizers import ExponentialDecay
from typing import List


class ActinnClassifier(nn.Module):

    def __init__(self, output_dim=None, input_size=None):
        """
        The classifier class: a feed-forward network similar to ACTINN.
        """
        if output_dim is None or input_size is None:
            raise ValueError('Must explicitly declare input dim (num features) and output dim (number of classes)')

        super(ActinnClassifier, self).__init__()
        self.inp_dim = input_size
        self.out_dim = output_dim

        # feed forward layers
        self.classifier_sequential = nn.Sequential(
                                        nn.Linear(self.inp_dim, 100),
                                        nn.ReLU(),

                                        nn.Linear(100, 50),
                                        nn.ReLU(),

                                        nn.Linear(50, 25),
                                        nn.ReLU(),

                                        nn.Linear(25, output_dim)
                                        )

    def forward(self, x):
        """
        Forward pass of the classifier
        """
        out = self.classifier_sequential(x)
        return out

class ACTINNModel(TorchModel):
    def __init__(self, output_dim = None, input_size = None, **kwargs):
        """
        Wrap ActinnClassifier in a TorchModel, trained with Adam and an
        exponentially decaying learning rate.
        """
        self.model = ActinnClassifier(output_dim, input_size)

        cf_decayRate = 0.95
        cf_lr_scheduler = ExponentialDecay(initial_rate=0.0001, decay_rate=cf_decayRate, decay_steps=1000)

        # The schedule is given to the optimizer: when an optimizer is supplied,
        # TorchModel takes the learning rate from it rather than from its own
        # learning_rate argument.
        cf_optimizer = Adam(learning_rate=cf_lr_scheduler,
                            beta1=0.9,
                            beta2=0.999,
                            epsilon=1e-08,
                            weight_decay=0.005,
                            )

        super(ACTINNModel,
              self).__init__(self.model,
                             loss=self.loss_fn,
                             optimizer=cf_optimizer,
                             output_types=['prediction'],
                             **kwargs)

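    def loss_fn(self, outputs: List, labels: List[torch.Tensor],
                    weights: List[torch.Tensor]) -> torch.Tensor:
        # TorchModel passes lists of tensors; labels have shape (batch, 1)
        outputs = outputs[0]
        # CrossEntropyLoss expects integer class indices of shape (batch,)
        labels = labels[0][:, 0].long()
        return nn.CrossEntropyLoss()(outputs, labels)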
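To get a feel for the size of this network, the number of trainable parameters can be computed by hand from the layer widths (100, 50, 25) used in the classifier above. The input size of 1000 genes and the 12 output classes below are hypothetical values chosen only for the arithmetic.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


input_size = 1000  # hypothetical number of genes after filtering
n_classes = 12     # hypothetical number of cell types

n_params = count_dense_parameters([input_size, 100, 50, 25, n_classes])
print(n_params)
```

Almost all parameters sit in the first layer, so the number of genes kept after filtering largely determines the model size.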


## Model training and evaluation

In [None]:
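# Wrap the cells x genes arrays and integer labels in DeepChem datasets
train_dataset = dc.data.NumpyDataset(X=train_set, y=train_labels)
test_dataset = dc.data.NumpyDataset(X=test_set, y=test_labels)

n_types = len(np.unique(train_labels))  # number of cell type classes
model = ACTINNModel(output_dim=n_types, input_size=train_set.shape[1])

In [None]:
model.fit(train_dataset, nb_epoch=5)
logits = model.predict(test_dataset)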

In [None]:
import torch.nn.functional as F

In [None]:
logits_tensor = torch.from_numpy(logits)
probabilities = F.softmax(logits_tensor, dim=1)  # shape: (n_cells, n_types)
predictions = probabilities.argmax(dim=1).numpy()
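The conversion from logits to predicted classes can be illustrated without a trained model. The sketch below uses a hand-written NumPy softmax on made-up logits; it is only an illustration of the step above, not part of the DeepChem API.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for 2 cells and 3 cell types.
toy_logits = np.array([[2.0, 1.0, 0.1],
                       [0.1, 0.2, 3.0]])

toy_probs = softmax(toy_logits)
toy_pred = toy_probs.argmax(axis=1)

print(toy_probs.sum(axis=1), toy_pred)
```

Because softmax preserves the ordering within each row, taking the argmax of the logits directly gives the same predictions; the probabilities are mainly useful when a confidence value is needed.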

In [None]:
from sklearn.metrics import accuracy_score

# Compute accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.4f}")
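Overall accuracy can hide poor performance on individual cell types. As a hedged sketch on made-up labels (not results from this tutorial), the fraction of correctly predicted cells can also be computed per true class:

```python
import numpy as np

# Made-up true and predicted labels for five cells and three classes.
y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])

overall = float((y_true == y_pred).mean())

# For each true class, the fraction of its cells that were predicted correctly.
per_class = {
    int(c): float((y_pred[y_true == c] == c).mean())
    for c in np.unique(y_true)
}

print(overall, per_class)
```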

### References
1. [ACTINN: automated identification of cell types in single cell RNA sequencing](https://academic.oup.com/bioinformatics/article/36/2/533/5540320)