<a href="https://colab.research.google.com/github/feiyoung/ReadPapers/blob/master/dance_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Installation

DANCE is published on [PyPI](https://pypi.org/project/pydance/). Thus, installing DANCE is as easy as

```bash
pip install pydance
```

Alternatively, install the latest development version from GitHub:

```bash
pip install git+https://github.com/OmicsML/dance
```

Because DANCE includes many deep-learning-based methods, it also depends on deep learning libraries such as [PyTorch](https://pytorch.org/), [PyG](https://www.pyg.org/), and [DGL](https://www.dgl.ai/). We walk through the installation process below.

In [1]:
!pip install git+https://github.com/OmicsML/dance

Collecting git+https://github.com/OmicsML/dance
  Cloning https://github.com/OmicsML/dance to /tmp/pip-req-build-alosdfg8
  Running command git clone --filter=blob:none --quiet https://github.com/OmicsML/dance /tmp/pip-req-build-alosdfg8
  Resolved https://github.com/OmicsML/dance to commit 1d94be91625a352ad8510c16d5d23b9ce5e02b53
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting KDEpy (from pydance==1.1.0.dev0)
  Downloading KDEpy-1.1.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (553 kB)
Collecting igraph (from pydance==1.1.0.dev0)
  Downloading igraph-0.11.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)

### 1.1. Install torch related dependencies

In [None]:
# Colab comes with torch installed, so we do not need to install pytorch here
# !pip3 install torch torchvision torchaudio

!pip install -q torch_geometric==2.3.1
!pip install -q dgl==1.1.0 -f https://data.dgl.ai/wheels/cu117/repo.html
!pip install -q torchnmf==0.3.4

### 1.2 Install DANCE v1.0.0

In [None]:
!pip install -q pydance==1.0.0

### 1.3 Check if DANCE is installed successfully

In [None]:
import dance
print(f"Installed DANCE version {dance.__version__}")

## 2. Data loading and processing

DANCE comes with several benchmarking datasets in a unified dataset object format. This makes data downloading, processing, and caching easy for users through our dataset object interface.

### 2.1. Check available data options and load data object

In [None]:
import os
os.environ["DGLBACKEND"] = "pytorch"
from pprint import pprint
from dance.datasets.singlemodality import ClusteringDataset, ScDeepSortDataset

In [None]:
print("Available dataset option for ClusteringDataset:")
pprint(ClusteringDataset.get_available_data())

In [None]:
print("Available dataset option for ScDeepSortDataset:")
pprint(ScDeepSortDataset.get_available_data())

#### Example: ClusteringDataset

In [None]:
dataset = ClusteringDataset("10X_PBMC")
print(dataset)

In [None]:
# The dataset object does not contain any data itself; the data are loaded
# only when the load_data function is called
data = dataset.load_data()

#### Example: ScDeepSortDataset

In [None]:
dataset = ScDeepSortDataset(species="mouse", tissue="Brain",
                            train_dataset=["3285", "753"], test_dataset=["2695"])
data = dataset.load_data()

### 2.2. A quick primer on AnnData

<img
  src="https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg"
  align="right" width="450" alt="image"
/>

The [dance data object](https://github.com/OmicsML/dance/blob/912405cb5ab43caf16eb22b9216865c7e3976eaf/dance/data/base.py#L40) is heavily built on top of [AnnData](https://anndata.readthedocs.io/en/latest/), which is a widely used data object to represent, store, and manipulate large annotated matrices.

> anndata is a Python package for handling annotated data matrices in memory and on disk, positioned between pandas and xarray. anndata offers a broad range of computationally efficient features...

AnnData is part of the scverse ecosystem, which makes it easy to handle single-cell data with tools such as [Scanpy](https://scanpy.readthedocs.io/en/stable/).

The dance data object essentially wraps around an AnnData object,
which can be accessed in the `.data` attribute.

In [None]:
adata = data.data
print(adata)

In [None]:
num_cells, num_genes = adata.shape
print(f"There are {num_cells:,} cells and {num_genes:,} genes in this data object.")

There are several key attributes in AnnData objects. For example, `.X` typically holds the main data matrix, such as gene expression, while `.obs` and `.obsm` hold metadata for each observation (here, a cell).

In [None]:
adata.X

In [None]:
adata.obsm["cell_type"]

### 2.3. Data pre-processing using transforms

Applying individual in-place transformations to data


In [None]:
import scanpy as sc
from dance.transforms import AnnDataTransform, FilterGenesPercentile

In [None]:
print(f"Library sizes before normalization: {data.data.X.sum(1).round(0)}")

# Library size normalization
AnnDataTransform(sc.pp.normalize_total, target_sum=1e4)(data)

print(f"Library sizes after normalization: {data.data.X.sum(1).round(0)}")

In [None]:
# Shifted log transformation
AnnDataTransform(sc.pp.log1p)(data)

print(f"Sum of expression per cell after log1p transformation: {data.data.X.sum(1)}")
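
To make the two transforms above concrete, here is a minimal pure-Python sketch of what library-size normalization followed by a shifted log does to a tiny count matrix. It is an illustration only, not how Scanpy or DANCE implement these steps; the names `normalize_total_toy` and `log1p_toy` and the toy counts are assumptions made for this example.

```python
import math

def normalize_total_toy(counts, target_sum=1e4):
    """Scale each row (cell) so that its counts sum to target_sum."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Leave all-zero cells as zeros to avoid dividing by zero
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in row])
    return normalized

def log1p_toy(matrix):
    """Apply log(1 + x) element-wise."""
    return [[math.log1p(value) for value in row] for row in matrix]

# Toy count matrix: 2 cells x 3 genes (hypothetical values)
toy_counts = [[1, 3, 6], [20, 0, 20]]
toy_normalized = normalize_total_toy(toy_counts)
toy_logged = log1p_toy(toy_normalized)

print([round(sum(row)) for row in toy_normalized])  # [10000, 10000]
```

After normalization every cell has the same library size, so differences between cells reflect relative rather than absolute expression. Zeros stay zero under `log1p`, which keeps sparse matrices sparse.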

In [None]:
print(f"Number of genes before filtering: {data.shape[1]:,}")

# Filter out genes whose total expression (mode="sum") falls outside the
# 1st to 99th percentile range
FilterGenesPercentile(min_val=1, max_val=99, mode="sum")(data)

print(f"Number of genes after filtering: {data.shape[1]:,}")
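
The idea behind percentile-based gene filtering can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions (linear percentile interpolation, inclusive bounds, filtering on per-gene sums), not the DANCE implementation; `percentile_toy` and `filter_genes_by_sum_toy` are hypothetical helpers.

```python
def percentile_toy(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def filter_genes_by_sum_toy(matrix, min_val=1, max_val=99):
    """Return indices of genes whose total expression lies within the percentile band."""
    n_genes = len(matrix[0])
    gene_sums = [sum(row[j] for row in matrix) for j in range(n_genes)]
    ordered = sorted(gene_sums)
    low = percentile_toy(ordered, min_val)
    high = percentile_toy(ordered, max_val)
    return [j for j, total in enumerate(gene_sums) if low <= total <= high]

# Toy matrix: 2 cells x 5 genes; gene 0 is silent and gene 4 dominates
toy_matrix = [[0, 5, 6, 7, 900], [0, 4, 6, 8, 800]]
kept_genes = filter_genes_by_sum_toy(toy_matrix, min_val=10, max_val=90)
print(kept_genes)  # [1, 2, 3]
```

Wider percentile bounds are used here than in the cell above only because the toy matrix has five genes; with thousands of genes, a 1 to 99 band removes a small tail on each side.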

Composing transformations into a pre-processing pipeline (feat. caching)

In [None]:
from dance.transforms import Compose

preprocessing_pipeline = Compose(
    AnnDataTransform(sc.pp.normalize_total, target_sum=1e4),
    AnnDataTransform(sc.pp.log1p),
    FilterGenesPercentile(min_val=1, max_val=99, mode="sum"),
)

# Now we can apply the preprocessing pipeline transformation to our data
# data = dataset.load_data()
# preprocessing_pipeline(data)

# Alternatively, we can also pass the transformation to the loading function
data = dataset.load_data(transform=preprocessing_pipeline, cache=True)
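
Conceptually, composing transforms means applying them one after another in the order given. The sketch below shows that idea with a hypothetical `ToyCompose` class and two toy functions; unlike the DANCE transforms above, which modify the data object in place, these toy functions return new lists.

```python
class ToyCompose:
    """Chain callables so that each one receives the output of the previous one."""

    def __init__(self, *transforms):
        self.transforms = transforms

    def __call__(self, data):
        for transform in self.transforms:
            data = transform(data)
        return data

def double(values):
    return [2 * v for v in values]

def add_one(values):
    return [v + 1 for v in values]

double_then_add = ToyCompose(double, add_one)
add_then_double = ToyCompose(add_one, double)

print(double_then_add([1, 2, 3]))  # [3, 5, 7]
print(add_then_double([1, 2, 3]))  # [4, 6, 8]
```

The two results differ, which is why the order of steps in a pre-processing pipeline (for example, normalizing before taking the log) matters.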

In [None]:
# Reloading the data with cache enabled and the same transformation as
# before can significantly reduce the data loading and pre-processing
# time. This makes it easier for researchers to run evaluations many times
# with different configurations on the same pre-processed data
data = dataset.load_data(transform=preprocessing_pipeline, cache=True)
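
One common way to implement this kind of caching is to key the stored result on a description of the pipeline, so that the same pipeline maps to the same cache file. The sketch below illustrates that general pattern only; how DANCE builds its cache key and which storage format it uses are not covered here, and `load_with_cache`, `expensive_loader`, and the description string are all hypothetical.

```python
import hashlib
import os
import pickle
import tempfile

def load_with_cache(raw_loader, pipeline_description, cache_dir):
    """Return (data, cache_hit), reusing a pickle keyed by the pipeline description."""
    key = hashlib.sha256(pipeline_description.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, f"{key}.pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle), True  # cache hit
    data = raw_loader()  # the expensive load-and-preprocess step
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    return data, False  # cache miss

def expensive_loader():
    # Stand-in for downloading and pre-processing a dataset
    return [value ** 2 for value in range(5)]

with tempfile.TemporaryDirectory() as cache_dir:
    description = "normalize_total(1e4)|log1p|filter(1,99)"
    first, first_hit = load_with_cache(expensive_loader, description, cache_dir)
    second, second_hit = load_with_cache(expensive_loader, description, cache_dir)

print(first_hit, second_hit, first == second)  # False True True
```

Because the key depends on the pipeline description, changing any pre-processing parameter produces a different key and triggers a fresh computation instead of returning stale data.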

## 3. Single modality tasks

### 3.1 Example: ACTINN for Cell Type Annotation

#### Model structure

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/singlemodality/mlp_visualization.png)

#### Visualization of annotation results

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/singlemodality/cell_type_visualization.png)

#### Load data

In [None]:
print("Available dataset option for ScDeepSortDataset:")
pprint(ScDeepSortDataset.get_available_data())

In [None]:
import numpy as np

from dance.modules.single_modality.cell_type_annotation.actinn import ACTINN
from dance.utils import set_seed

# Initialize model and get model specific preprocessing pipeline
model = ACTINN(hidden_dims=[256, 256], lambd=0.01, device='cuda')
preprocessing_pipeline = model.preprocessing_pipeline(normalize=True, filter_genes=True)

# Load data and perform necessary preprocessing
dataset = ScDeepSortDataset(species="mouse", tissue="Brain",
                            train_dataset=["3285", "753"], test_dataset=["2695"])
data = dataset.load_data(transform=preprocessing_pipeline, cache=True)

#### Train and evaluate model

In [None]:
# Obtain training and testing data
x_train, y_train = data.get_train_data(return_type="torch")
x_test, y_test = data.get_test_data(return_type="torch")

In [None]:
print(x_train)

In [None]:
print(y_train)

In [None]:
# Train and evaluate model
set_seed(42)
model.fit(x_train, y_train, lr=0.001, num_epochs=21,
          batch_size=1000, print_cost=True)
print(f"ACC: {model.score(x_test, y_test):.4f}")

In [None]:
print(model.model)

### 3.2 Example: GraphSCI for Imputation

#### Model structure

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/singlemodality/graphsci_visualization.png)

#### Reported results

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/singlemodality/imputation_results_example.png)

#### Load data

In [None]:
import torch

from dance.datasets.singlemodality import ImputationDataset
from dance.modules.single_modality.imputation.graphsci import GraphSCI
from dance.utils import set_seed

# Load data and perform preprocessing
set_seed(42)
dataloader = ImputationDataset(data_dir='./data', dataset='pbmc_data', train_size=0.9)
preprocessing_pipeline = GraphSCI.preprocessing_pipeline(mask=True, mask_rate=0.1)
data = dataloader.load_data(transform=preprocessing_pipeline, cache=True)

In [None]:
data.data.layers['train_mask']

In [None]:
data.data.layers['valid_mask']

#### Train and evaluate model

In [None]:
# Obtain training and testing data
X, X_raw, g, mask = data.get_x(return_type="default")
device = 'cuda:0'
X = torch.tensor(X.toarray()).to(device)
X_raw = torch.tensor(X_raw.toarray()).to(device)
g = g.to(device)
train_idx = data.train_idx
test_idx = data.test_idx

# Train and evaluate model
model = GraphSCI(num_cells=X.shape[0], num_genes=X.shape[1],
                 dataset='pbmc_data', gpu=0)
model.fit(X, X_raw, g, train_idx, mask, n_epochs=10, la=1e-7)
model.load_model()
imputed_data = model.predict(X, X_raw, g, mask)
score = model.score(X_raw, imputed_data, test_idx, mask, metric='RMSE')
print("RMSE: %.4f" % score)
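
For intuition, an imputation error of this kind can be computed by comparing the imputed values with the original values on held-out entries only. The sketch below assumes exactly that (RMSE restricted to entries selected by a boolean mask); it is not the `GraphSCI.score` implementation, and `masked_rmse` and the toy matrices are hypothetical.

```python
import math

def masked_rmse(truth, imputed, eval_mask):
    """Root mean squared error over the entries where eval_mask is True."""
    squared_errors = [
        (t - p) ** 2
        for truth_row, pred_row, mask_row in zip(truth, imputed, eval_mask)
        for t, p, m in zip(truth_row, pred_row, mask_row)
        if m
    ]
    if not squared_errors:
        raise ValueError("eval_mask selects no entries")
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy 2 x 2 example: only the second column is held out for evaluation
toy_truth = [[1.0, 2.0], [3.0, 4.0]]
toy_imputed = [[1.0, 4.0], [3.0, 0.0]]
toy_eval_mask = [[False, True], [False, True]]

toy_rmse = masked_rmse(toy_truth, toy_imputed, toy_eval_mask)
print(f"RMSE: {toy_rmse:.4f}")
```

Restricting the error to held-out entries prevents a model from scoring well simply by copying the values it was shown during training.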

### 3.3 Example: scDeepCluster for Clustering

#### Model structure

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/singlemodality/scdeepcluster_visualization.png)

#### Reported results

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/singlemodality/clustering_results_example.png)

#### Load data

In [None]:
from dance.datasets.singlemodality import ClusteringDataset
from dance.modules.single_modality.clustering.scdeepcluster import ScDeepCluster
from dance.utils import set_seed


# Load data and perform necessary preprocessing
dataloader = ClusteringDataset('./data', '10X_PBMC')
preprocessing_pipeline = ScDeepCluster.preprocessing_pipeline()
data = dataloader.load_data(transform=preprocessing_pipeline)

#### Train and evaluate model

In [None]:
# inputs: x, x_raw, n_clusters
inputs, y = data.get_train_data()
n_clusters = len(np.unique(y))
in_dim = inputs[0].shape[1]

# Build and train model
set_seed(42)
model = ScDeepCluster(input_dim=in_dim, z_dim=32, encodeLayer=[256, 64], decodeLayer=[64, 256], device='cuda')
model.fit(inputs, y, n_clusters=n_clusters, lr=0.01, epochs=3, pt_epochs=3)

# Evaluate model predictions
score = model.score(None, y)
print(f"ARI: {score:.4f}")
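
The adjusted Rand index (ARI) reported above compares two partitions of the same cells while correcting for chance agreement. As a reference for how the number is obtained, here is a minimal pure-Python sketch of the standard pair-counting formula; the function name `adjusted_rand_index_toy` is hypothetical and this is not the scoring code used by DANCE.

```python
from collections import Counter
from math import comb

def adjusted_rand_index_toy(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)

# Cluster IDs are arbitrary, so a relabeled but identical partition scores 1.0
print(adjusted_rand_index_toy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index_toy([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

An ARI of 1 means the two partitions agree perfectly, values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance.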

## 4. Multi-modality tasks

### 4.1 Modality Prediction

#### Task and Model Description

Modality prediction aims to predict one modality from another, following the flow of information from DNA to RNA and from RNA to protein.

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/multimodality/modality_prediction_visualization.svg)

In this section, we take RNA-to-Protein as an example task, where the data are obtained with CITE-seq technology. We use the BABEL [1] model as an example to demonstrate the workflow of the DANCE package.

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/multimodality/babel_visualization.jpeg)

[1] Wu, Kevin E., et al. "BABEL enables cross-modality translation between multiomic profiles at single-cell resolution." Proceedings of the National Academy of Sciences 118.15 (2021): e2023070118.

#### Import packages and initializations

In [None]:
import argparse
import os
import random

import anndata
import mudata
import scanpy as sc
import torch
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

from dance import logger
from dance.data import Data
from dance.datasets.multimodality import ModalityPredictionDataset
from dance.modules.multi_modality.predict_modality.babel import BabelWrapper
from dance.utils import set_seed

set_seed(42)
device = 'cuda'

#### Load data and perform necessary preprocessing

In [None]:
dataset = ModalityPredictionDataset("openproblems_bmmc_cite_phase2_rna_subset")
data = dataset.load_data()

In [None]:
# Construct data object
data.set_config(feature_mod="mod1", label_mod="mod2")

# Obtain training and testing data
x_train, y_train = data.get_train_data(return_type="torch")
x_test, y_test = data.get_test_data(return_type="torch")

In [None]:
x_test, y_test, x_test.shape, y_test.shape

#### Specify hyperparameters and initialize the model

In [None]:
parser = argparse.ArgumentParser()

######## Important hyperparameters
parser.add_argument("--subtask", default="openproblems_bmmc_cite_phase2_rna_subset")
parser.add_argument("--max_epochs", type=int, default=40)
parser.add_argument("--lr", "-l", type=float, default=0.01, help="Learning rate")
parser.add_argument("--batchsize", "-b", type=int, default=64, help="Batch size")
parser.add_argument("--hidden", type=int, default=64, help="Hidden dimensions")
parser.add_argument("--earlystop", type=int, default=2, help="Early stopping after N epochs")
parser.add_argument("--naive", "-n", action="store_true", help="Use a naive model instead of lego model")
parser.add_argument("--lossweight", type=float, default=1., help="Relative loss weight")
########

parser.add_argument("--model_folder", default="./")
parser.add_argument("--outdir", "-o", default="./", help="Directory to output to")
parser.add_argument("--resume", action="store_true")
parser.add_argument("--device", default="cuda")
parser.add_argument("--cpus", default=1, type=int)
parser.add_argument("--rnd_seed", default=42, type=int)

args_defaults = parser.parse_args([])
args = argparse.Namespace(**vars(args_defaults))
args
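
The cell above relies on a standard `argparse` idiom for notebooks: passing an explicit empty list to `parse_args` returns all defaults without reading the notebook kernel's own command-line arguments. A minimal sketch with a hypothetical two-option parser:

```python
import argparse

toy_parser = argparse.ArgumentParser()
toy_parser.add_argument("--lr", type=float, default=0.01)
toy_parser.add_argument("--max_epochs", type=int, default=40)

# An explicit empty list stops argparse from reading sys.argv
toy_args = toy_parser.parse_args([])

# Individual values can then be overridden directly on the namespace
toy_args.max_epochs = 5

# The same parser also accepts a list of strings, as on the command line
cli_like_args = toy_parser.parse_args(["--lr", "0.001"])

print(toy_args.lr, toy_args.max_epochs, cli_like_args.lr)  # 0.01 5 0.001
```

This keeps a single hyperparameter definition that works both in a notebook and in a script run from the command line.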

In [None]:
model = BabelWrapper(args, dim_in=x_train.shape[1], dim_out=y_train.shape[1])

#### Train and evaluate model

In [None]:
model.fit(x_train.float(), y_train.float(), val_ratio=0.15)

In [None]:
model.predict(x_test.float())

In [None]:
model.score(x_test.float(), y_test.float())

### 4.2 Modality Matching

Modality matching aims to match the profiles of the same cell measured in different modalities.

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/multimodality/modality_matching_visualization.jpeg)

In this section, we take RNA-to-Protein as an example task, where the data are obtained with CITE-seq technology. We use the scMoGNN [2] model as an example to demonstrate the workflow of the DANCE package.

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/multimodality/scmogcn_visualization.jpeg)

[2] Wen, Hongzhi, et al. "Graph neural networks for multimodal single-cell data integration." Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022.

#### Load data and perform necessary preprocessing

In [None]:
from dance.datasets.multimodality import ModalityMatchingDataset
from dance.modules.multi_modality.match_modality.scmogcn import ScMoGCNWrapper
from dance.transforms.graph.cell_feature_graph import CellFeatureBipartiteGraph
import numpy as np
import torch.nn.functional as F

dataset = ModalityMatchingDataset('openproblems_bmmc_cite_phase2_rna_subset', root='./data', preprocess="pca", pkl_path='lsi_input_pca_count.pkl')
data = dataset.load_data()

In [None]:
# ScMoGNN graph construction
data = CellFeatureBipartiteGraph(cell_feature_channel="X_pca", mod="mod1")(data)
data = CellFeatureBipartiteGraph(cell_feature_channel="X_pca", mod="mod2")(data)
data.set_config(feature_mod=["mod1", "mod2", "mod1", "mod2"], feature_channel_type=["uns", "uns", "obs", "obs"],
                feature_channel=["g", "g", "batch", "batch"], label_mod="mod1", label_channel="labels")

In [None]:
(g_mod1, g_mod2, batch_mod1, batch_mod2), z = data.get_data(return_type="default")
train_size = len(data.get_split_idx("train"))
test_idx = np.arange(train_size, g_mod1.num_nodes("cell"))
z_test = F.one_hot(torch.from_numpy(z[train_size:]).long())
labels1 = torch.argmax(z_test, dim=0).to(device)
labels2 = torch.argmax(z_test, dim=1).to(device)
g_mod1 = g_mod1.to(device)
g_mod2 = g_mod2.to(device)

#### Specify hyperparameters and initialize the model

In [None]:
parser = argparse.ArgumentParser()
parser.add_argument("--layers", default=4, type=int, choices=[3, 4, 5, 6, 7])
parser.add_argument("--learning_rate", default=6e-4, type=float)
parser.add_argument("--disable_propagation", default=0, type=int, choices=[0, 1, 2])
parser.add_argument("--auxiliary_loss", default=True, type=bool)
parser.add_argument("--epochs", default=2000, type=int)
parser.add_argument("--hidden_size", default=64, type=int)
parser.add_argument("--temperature", default=2.739896, type=float)
parser.add_argument("--device", default='cuda', type=str)
parser.add_argument("--rnd_seed", default=42, type=int)

args_defaults = parser.parse_args([])
args = argparse.Namespace(**vars(args_defaults))
data_folder = './data/'
device = 'cuda'
args
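
A note on `type=bool` in the parser above: it works here because only the default is used, but `argparse` converts command-line strings with `bool(...)`, and any non-empty string, including `"False"`, is truthy. The sketch below demonstrates the pitfall and one possible alternative, `argparse.BooleanOptionalAction` (available from Python 3.9); the toy parsers are illustrative and not part of DANCE.

```python
import argparse

# Pitfall: bool("False") is True because the string is non-empty
toy_parser = argparse.ArgumentParser()
toy_parser.add_argument("--auxiliary_loss", default=True, type=bool)
surprising = toy_parser.parse_args(["--auxiliary_loss", "False"])

# Alternative: generate paired --auxiliary_loss / --no-auxiliary_loss flags
flag_parser = argparse.ArgumentParser()
flag_parser.add_argument("--auxiliary_loss", default=True,
                         action=argparse.BooleanOptionalAction)
default_on = flag_parser.parse_args([])
disabled = flag_parser.parse_args(["--no-auxiliary_loss"])

print(surprising.auxiliary_loss, default_on.auxiliary_loss, disabled.auxiliary_loss)
# True True False
```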

In [None]:
model = ScMoGCNWrapper(
    args,
    [
        [(g_mod1.num_nodes("feature"), 512, 0.25), (512, 512, 0.25), (512, args.hidden_size)],
        [(g_mod2.num_nodes("feature"), 512, 0.2), (512, 512, 0.2), (512, args.hidden_size)],
        [(args.hidden_size, 512, 0.2), (512, g_mod1.num_nodes("feature"))],
        [(args.hidden_size, 512, 0.2), (512, g_mod2.num_nodes("feature"))],
    ],
    args.temperature,
)

#### Train and evaluate model

In [None]:
model.fit(g_mod1, g_mod2, labels1, labels2, train_size=train_size)

In [None]:
model.predict(test_idx, enhance=True, batch1=batch_mod1, batch2=batch_mod2)

In [None]:
model.score(test_idx, labels_matrix=z_test, enhance=True, batch1=batch_mod1, batch2=batch_mod2)

## 5. Spatial tasks

### 5.1 Spatial Domain

#### SpaGCN model for spatial domain identification

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/spatial/spagcn_framework.png)

In [None]:
from dance.datasets.spatial import SpatialLIBDDataset
from dance.modules.spatial.spatial_domain.spagcn import SpaGCN
from dance.transforms import CellPCA, Compose, FilterGenesMatch, SetConfig
from dance.transforms.graph import SpaGCNGraph, SpaGCNGraph2D
from dance.utils import set_seed

#### Initialize model and get model-specific preprocessing pipeline

In [None]:
model = SpaGCN()
# In SpaGCN, alpha and beta are used for graph construction
preprocessing_pipeline = model.preprocessing_pipeline(alpha=1, beta=49)

#### User-defined custom transform

In [None]:
preprocessing_pipeline = Compose(
    FilterGenesMatch(prefixes=["ERCC", "MT-"]),
    SpaGCNGraph(alpha=1, beta=49),
    SpaGCNGraph2D(),
    CellPCA(n_components=40),
    SetConfig({
        "feature_channel": ["CellPCA", "SpaGCNGraph", "SpaGCNGraph2D"],
        "feature_channel_type": ["obsm", "obsp", "obsp"],
        "label_channel": "label",
        "label_channel_type": "obs"
    }),
)

#### Load data and perform necessary preprocessing

In [None]:
dataloader = SpatialLIBDDataset(data_id="151673")
data = dataloader.load_data(transform=preprocessing_pipeline, cache=True)

In [None]:
data

In [None]:
data.data.obsp["SpaGCNGraph"].shape

In [None]:
data.x[0].shape

In [None]:
data.x[1].shape

In [None]:
(x, adj, adj_2d), y = data.get_train_data()

In [None]:
x, x.shape

In [None]:
adj, adj.shape

In [None]:
y, y.shape

#### Train and evaluate model

In [None]:
l = model.search_l(0.05, adj, start=0.01, end=1000, tol=5e-3, max_run=200)
model.set_l(l)
res = model.search_set_res((x, adj), l=l, target_num=7, start=0.4, step=0.1,
                           tol=5e-3, lr=0.05, epochs=200, max_run=200)
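
The `start`, `end`, `tol`, and `max_run` arguments above suggest an iterative bracketing search for a parameter value that hits a target. As a rough illustration of that pattern, the sketch below runs a bisection search on a toy increasing function. It is an assumption-laden stand-in, not the `search_l` or `search_set_res` implementation; `bisection_search` and `toy_curve` are hypothetical.

```python
def bisection_search(func, target, start, end, tol=5e-3, max_run=200):
    """Find x in [start, end] with |func(x) - target| <= tol, assuming func is increasing."""
    low, high = start, end
    for _ in range(max_run):
        mid = (low + high) / 2
        value = func(mid)
        if abs(value - target) <= tol:
            return mid
        if value < target:
            low = mid  # target lies in the upper half
        else:
            high = mid  # target lies in the lower half
    return None  # no acceptable value found within max_run iterations

def toy_curve(x):
    return x * x  # increasing for x >= 0

found = bisection_search(toy_curve, target=2.0, start=0.01, end=1000)
print(found is not None and abs(toy_curve(found) - 2.0) <= 5e-3)  # True
```

The tolerance and the iteration cap trade accuracy against run time: a tighter `tol` needs more iterations, and `max_run` guards against searches that never reach the target.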

In [None]:
pred = model.fit_predict((x, adj), init_spa=True, init="louvain", tol=5e-3,
                         lr=0.05, epochs=200, res=res)

In [None]:
score = model.default_score_func(y, pred)
print(f"ARI: {score:.4f}")

### 5.2 Cell Type Deconvolution

#### DSTG model for cell type deconvolution

![image](https://github.com/OmicsML/dance-tutorials/raw/main/imgs/tutorial_v1/spatial/dstg_framework.png)

In [None]:
import numpy as np
import torch
from dance.datasets.spatial import CellTypeDeconvoDataset
from dance.modules.spatial.cell_type_deconvo import DSTG
from dance.utils import set_seed

#### Get model-specific preprocessing pipeline

In [None]:
preprocessing_pipeline = DSTG.preprocessing_pipeline(
    n_pseudo=500,
    n_top_genes=2000,
    k_filter=200,
    num_cc=30,
)
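
The `n_pseudo` argument, together with the `"pseudo"` split used later in this section, points to synthetic spots with known composition being generated from reference single cells. The sketch below illustrates that general idea only: it mixes a random handful of toy cells into one pseudo-spot and records the resulting cell type proportions. The mixing rule, the 2 to 8 cells per spot, and all names are assumptions for illustration, not the DSTG pipeline in DANCE.

```python
import random
from collections import Counter

def make_pseudo_spot(cells, rng, min_cells=2, max_cells=8):
    """Sum the expression of randomly drawn cells and record their type proportions."""
    n_cells = rng.randint(min_cells, max_cells)
    drawn = [rng.choice(cells) for _ in range(n_cells)]
    n_genes = len(drawn[0]["expression"])
    expression = [sum(cell["expression"][g] for cell in drawn) for g in range(n_genes)]
    type_counts = Counter(cell["cell_type"] for cell in drawn)
    proportions = {name: count / n_cells for name, count in type_counts.items()}
    return expression, proportions

# Toy reference: three cells with three genes each (hypothetical values)
toy_cells = [
    {"cell_type": "A", "expression": [5, 0, 1]},
    {"cell_type": "B", "expression": [0, 7, 2]},
    {"cell_type": "B", "expression": [1, 6, 0]},
]

rng = random.Random(42)
pseudo_spots = [make_pseudo_spot(toy_cells, rng) for _ in range(500)]
spot_expression, spot_proportions = pseudo_spots[0]
print(spot_expression, spot_proportions)
```

Because the composition of every pseudo-spot is known by construction, such spots can serve as labeled training examples for predicting the composition of real spots.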

#### Load data and perform necessary preprocessing

In [None]:
dataset = CellTypeDeconvoDataset(data_dir="data/spatial", data_id="CARD_synthetic")
data = dataset.load_data(transform=preprocessing_pipeline, cache=True)

In [None]:
data.x

In [None]:
len(data.x)

In [None]:
data.x[0], data.x[0].shape

In [None]:
data.x[1], data.x[1].shape

In [None]:
(adj, x), y = data.get_data(return_type="default")
x, y = torch.FloatTensor(x), torch.FloatTensor(y.values)
adj = torch.sparse.FloatTensor(torch.LongTensor([adj.row.tolist(), adj.col.tolist()]),
                               torch.FloatTensor(adj.data.astype(np.int32)))

In [None]:
x, x.shape

In [None]:
adj, adj.shape

In [None]:
y, y.shape

In [None]:
train_mask = data.get_split_mask("pseudo", return_type="torch")
inputs = (adj, x, train_mask)
train_mask, train_mask.shape

#### Train and evaluate model

In [None]:
model = DSTG(nhid=16, bias=False, dropout=0, device="auto")
pred = model.fit_predict(inputs, y, lr=0.01, max_epochs=25, weight_decay=0.0001)
pred, pred.shape

In [None]:
test_mask = data.get_split_mask("test", return_type="torch")
test_mask, test_mask.shape

In [None]:
score = model.default_score_func(y[test_mask], pred[test_mask])
print(f"MSE: {score:7.4f}")