scJoint

scJoint is a transfer learning method for integrating atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. For more information, please see the scJoint manuscript: https://doi.org/10.1101/2020.12.31.424916.

scJoint is developed using PyTorch 1.0.0 and has been tested under both PyTorch 1.0.0 and 1.4.0. scJoint requires 1 GPU to run.

Tutorials

A step-by-step tutorial using CITE-seq and ASAP-seq PBMC data from the control condition generated by Mimitou et al. 2020 (GSE156478) is available here: link
Tutorial for 10x Genomics data:
- Process data from a SingleCellExperiment object into scJoint's input format: link
- scJoint integration analysis: link

Installation

scJoint can be obtained by cloning the GitHub repository:

git clone https://github.com/SydneyBioX/scJoint.git

The following Python packages must be installed before running scJoint: h5py, torch, scipy, and numpy. scJoint also uses itertools, os, random, sys, time, and datetime, which are part of the Python standard library and need no separate installation.
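Before running scJoint, it can help to confirm that the third-party packages are importable in the current environment. The following is a minimal sketch and not part of the scJoint repository; the helper name missing_packages is hypothetical, and it relies only on the standard-library importlib.

```python
from importlib.util import find_spec

# Third-party packages named in the requirements above.
REQUIRED = ["h5py", "torch", "scipy", "numpy"]


def missing_packages(names):
    """Return the package names that the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Any package reported as missing can then be installed with the package manager used for the environment.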

Preparing input for scJoint

scJoint's main function takes expression data in .npz format and cell type labels in .txt format. To prepare the input for scJoint, modify the dataset paths in process_db.py, which:

takes .h5 files of expression matrices stored in matrix/data as input and generates a .npz file for each expression matrix.
transforms .csv files of cell type labels into numeric labels stored in .txt files, and outputs a label_to_idx.txt file that records the correspondence between the numeric labels and the cell type labels.

Note:

The expression matrix for scRNA-seq data is the gene expression matrix (either normalised or raw), and for scATAC-seq data it is the gene activity matrix.
Cell type labels are required for scRNA-seq data, while labels for scATAC-seq data are optional and are used only for accuracy calculation.
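To illustrate the label conversion step described above, here is a minimal sketch of turning a column of cell type names into numeric labels plus a label-to-index mapping. It is not the code in process_db.py: the column name cell_type, the helper names, and the choice to assign indices in sorted order are all assumptions made for this example.

```python
import csv
import io


def read_label_column(csv_text, column):
    """Read one column of cell type names from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


def encode_labels(labels):
    """Assign an integer to each distinct label (sorted order, an assumption)
    and return the mapping together with the encoded label list."""
    label_to_idx = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    numeric = [label_to_idx[label] for label in labels]
    return label_to_idx, numeric


# Hypothetical input; real files would be read from disk.
csv_text = "cell,cell_type\nc1,B cell\nc2,T cell\nc3,B cell\n"
labels = read_label_column(csv_text, "cell_type")
label_to_idx, numeric = encode_labels(labels)
```

The mapping plays the role of label_to_idx.txt, and the numeric list plays the role of the per-dataset .txt label file.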

Running scJoint

Edit config.py according to the data input (see the Arguments section for more details).

In a terminal, run

python main.py

The output will be saved in the ./output folder.

Arguments

The script config.py specifies the arguments for scJoint and needs to be modified according to the data.

Dataset information

DB: Name of the study
number_of_class: Number of cell types in the training data (scRNA-seq data)
input_size: Number of genes in both training and test data
rna_paths: A list of file paths of the .npz files of scRNA-seq gene expression datasets
rna_labels: A list of file paths of the .txt files of scRNA-seq cell type information
atac_paths: A list of file paths of the .npz files of scATAC-seq gene activity expression datasets
atac_labels: A list of file paths of the .txt files of scATAC-seq cell type information (optional; if atac_labels are provided, the accuracy after KNN is reported)
rna_protein_paths: A list of paths of the .npz files of protein expression data for CITE-seq data (optional)
atac_protein_paths: A list of paths of the .npz files of protein expression data for ASAP-seq data (optional)
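The dataset fields listed above can be pictured as a single configuration object. The sketch below is illustrative only: the actual config.py may be organised differently, the DatasetConfig class and check_dataset_config helper are hypothetical, all paths and values are placeholders, and the rule of one label file per expression file is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetConfig:
    """Hypothetical container mirroring the documented dataset fields."""
    DB: str
    number_of_class: int
    input_size: int
    rna_paths: List[str]
    rna_labels: List[str]
    atac_paths: List[str]
    atac_labels: List[str] = field(default_factory=list)         # optional
    rna_protein_paths: List[str] = field(default_factory=list)   # optional
    atac_protein_paths: List[str] = field(default_factory=list)  # optional


def check_dataset_config(cfg):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if cfg.number_of_class <= 0 or cfg.input_size <= 0:
        problems.append("number_of_class and input_size must be positive")
    # Assumption: one label file per scRNA-seq expression file.
    if len(cfg.rna_labels) != len(cfg.rna_paths):
        problems.append("rna_labels needs one entry per rna_paths entry")
    # scATAC-seq labels are optional, so only check them when given.
    if cfg.atac_labels and len(cfg.atac_labels) != len(cfg.atac_paths):
        problems.append("atac_labels needs one entry per atac_paths entry")
    return problems


# Placeholder values, not taken from the repository.
cfg = DatasetConfig(
    DB="my_study",
    number_of_class=7,
    input_size=15000,
    rna_paths=["data/rna_expr.npz"],
    rna_labels=["data/rna_labels.txt"],
    atac_paths=["data/atac_activity.npz"],
)
problems = check_dataset_config(cfg)
```

Running a check like this before training catches mismatched path lists early, before any data is loaded.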

Training config

batch_size: Batch size (set as 256 by default)
lr_stage1: Learning rate for stage 1
lr_stage3: Learning rate for stage 3
lr_decay_epoch: Number of epochs for learning rate decay
epochs_stage1: Number of epochs for stage 1
epochs_stage3: Number of epochs for stage 3
p: The fraction of data pairs expected to have high cosine similarity scores (set as 0.8 by default)
embedding_size: Number of nodes in the embedding (hidden) layer (set as 64 by default)
momentum: Momentum for SGD (set as 0.9 by default)
center_weight: The weight for center loss (set as 1 by default)
num_threads: Number of threads used (set as 1 by default)
seed: Random seed to be used (set as none by default)
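The description of lr_decay_epoch is brief, so the following sketch shows one common way such a parameter is used: a step schedule that multiplies the learning rate by a fixed factor every lr_decay_epoch epochs. Both the step schedule and the factor of 0.1 are assumptions for illustration; the scJoint training code should be consulted for the schedule it actually applies.

```python
def decayed_lr(base_lr, epoch, lr_decay_epoch, factor=0.1):
    """Step decay: scale base_lr by `factor` once per completed decay interval.

    The schedule shape and the default factor are assumptions, not values
    confirmed by the scJoint documentation.
    """
    return base_lr * factor ** (epoch // lr_decay_epoch)


# With a hypothetical base rate of 0.01 and lr_decay_epoch of 20, the rate
# stays at 0.01 for epochs 0-19 and drops by the factor from epoch 20 onward.
schedule = [decayed_lr(0.01, epoch, 20) for epoch in (0, 19, 20, 40)]
```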

The configuration we used in our paper can be found in link.

Output

scJoint will output 4 types of .txt files:

_embeddings.txt: Output of the embedding layer for each dataset
_knn_predictions.txt: Predicted results of KNN for each scATAC-seq dataset (final predictions), where the numeric labels correspond to those in the label_to_idx.txt file.
_knn_probs.txt: Probability of the KNN predictions for each scATAC-seq dataset
_predictions.txt: Output of prediction layer for each dataset
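Since the KNN predictions are numeric, they usually need to be mapped back to cell type names through label_to_idx.txt. The sketch below shows one way to do that, under two assumptions about file layout that are not stated above: each line of label_to_idx.txt ends with the integer index preceded by the label text, and the predictions file holds one numeric label per cell separated by whitespace. All function names are hypothetical.

```python
def parse_label_to_idx(text):
    """Build an index -> label dict from lines of the form '<label> <index>'."""
    idx_to_label = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace so labels may contain spaces.
        label, idx = line.rsplit(maxsplit=1)
        idx_to_label[int(idx)] = label
    return idx_to_label


def decode_predictions(pred_text, idx_to_label):
    """Convert whitespace-separated numeric predictions to label names."""
    return [idx_to_label[int(float(tok))] for tok in pred_text.split()]


def accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true one."""
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


# Hypothetical file contents; real ones would be read from the output folder.
idx_to_label = parse_label_to_idx("B cell 0\nT cell 1\n")
predicted = decode_predictions("0\n1\n1\n", idx_to_label)
acc = accuracy(predicted, ["B cell", "T cell", "B cell"])
```

The accuracy helper mirrors the optional evaluation that becomes possible when scATAC-seq labels are available.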

Visualisation

To generate tSNE and UMAP plots for the output data using R, run the following command in a terminal:

Rscript embedding_visualisation_R.R --output_dir output/ --input_dir data/ --TSNE TRUE --UMAP TRUE --proportion 1

where

output_dir: Directory of the output folder
input_dir: Directory of the input folder (where the label_to_idx.txt file is saved)
TSNE: TRUE/FALSE indicates whether to run TSNE
UMAP: TRUE/FALSE indicates whether to run UMAP
proportion: Proportion of cells used in visualisation
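When the visualisation step is launched from a Python workflow, the command above can be assembled programmatically. This is a small sketch that only builds the argument list from the documented flags; the helper name build_visualisation_command is hypothetical and the defaults simply repeat the example command.

```python
import shlex


def build_visualisation_command(output_dir="output/", input_dir="data/",
                                tsne=True, umap=True, proportion=1):
    """Assemble the Rscript call using the flags documented above."""
    return [
        "Rscript", "embedding_visualisation_R.R",
        "--output_dir", output_dir,
        "--input_dir", input_dir,
        "--TSNE", "TRUE" if tsne else "FALSE",
        "--UMAP", "TRUE" if umap else "FALSE",
        "--proportion", str(proportion),
    ]


# Example: skip UMAP and plot half of the cells.
cmd = build_visualisation_command(umap=False, proportion=0.5)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, provided Rscript is on the PATH.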

Note:

The script assumes the output folder contains results from only one study.
Before running the embedding_visualisation_R.R script, install the following packages by running the following code in R:

install.packages(c("ggplot2", "ggthemes", "scattermore", "ggpubr", "Rtsne", "uwot", "pals", "grDevices", "optparse"))

Output of embedding_visualisation_R.R:

TSNE and/or UMAP embeddings will be generated in the output_dir folder: tsne_embedding.txt, umap_embedding.txt
Visualisation of TSNE and UMAP: TSNE_plot.pdf, UMAP_plot.pdf
