scJoint

scJoint is a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. For more information, please see the scJoint manuscript: https://doi.org/10.1101/2020.12.31.424916.

scJoint is developed using PyTorch 1.0.0 and has been tested under both PyTorch 1.0.0 and 1.4.0. scJoint requires 1 GPU to run.

Tutorials

  • A step-by-step tutorial using CITE-seq and ASAP-seq PBMC data from the control condition generated by Mimitou et al. 2020 (GSE156478) is available here: link
  • Tutorial for 10x Genomics data:
    • Processing data from SingleCellExperiment into scJoint's input format: link
    • scJoint integration analysis: link

Installation

scJoint can be obtained by cloning the GitHub repository:

git clone https://github.com/SydneyBioX/scJoint.git

The following Python packages must be installed before running scJoint: h5py, torch, scipy, and numpy. scJoint also uses the standard-library modules itertools, os, random, sys, time, and datetime, which ship with Python and need no separate installation.
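
A quick way to confirm that the third-party dependencies are importable before running scJoint is sketched below. The helper name `missing_packages` is hypothetical and not part of scJoint; it relies only on the standard-library `importlib`.

```python
import importlib.util

# Third-party packages scJoint depends on; the other listed modules
# (itertools, os, random, sys, time, datetime) are standard library.
REQUIRED = ["h5py", "torch", "scipy", "numpy"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```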

Preparing input for scJoint

scJoint's main function takes expression data in .npz format and cell type labels in .txt format. To prepare the input for scJoint, modify the dataset paths in process_db.py, which:

  1. takes .h5 files of expression matrices stored in matrix/data as input and generates an .npz file for each expression matrix.
  2. converts the cell type labels in .csv files to numeric labels stored in .txt files, and writes a label_to_idx.txt file that gives the correspondence between the numeric labels and the cell type labels.

Note:

  1. The expression matrix for scRNA-seq data is the gene expression matrix (either normalised or raw data); for scATAC-seq data it is the gene activity matrix.
  2. The cell type labels for scRNA-seq data are required, while the labels for scATAC-seq data are optional and are only used to calculate accuracy.
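
The label conversion described in step 2 can be sketched as follows. This is a minimal illustration, not the code in process_db.py: the function names `encode_labels` and `write_labels`, the sorted ordering of cell types, and the `name index` line layout of label_to_idx.txt are all assumptions.

```python
def encode_labels(cell_types):
    """Map cell type names to integer labels.

    Returns (label_to_idx, numeric_labels). Sorting the names is an
    assumption made here so that the mapping is reproducible.
    """
    label_to_idx = {name: i for i, name in enumerate(sorted(set(cell_types)))}
    numeric = [label_to_idx[name] for name in cell_types]
    return label_to_idx, numeric


def write_labels(cell_types, labels_path, mapping_path):
    """Write numeric labels (one per line) and the name-to-index mapping."""
    label_to_idx, numeric = encode_labels(cell_types)
    with open(labels_path, "w") as f:
        f.write("\n".join(str(i) for i in numeric) + "\n")
    with open(mapping_path, "w") as f:
        for name, idx in label_to_idx.items():
            # Assumed layout: "<cell type name> <index>" on each line.
            f.write(f"{name} {idx}\n")
    return label_to_idx
```

Whatever layout is used, the same mapping must be applied to every dataset in a study so that a given number always refers to the same cell type.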

Running scJoint

Edit config.py according to the data input (see the Arguments section for more details).

In a terminal, run:

python main.py

The output will be saved in the ./output folder.

Arguments

The script config.py specifies the arguments for scJoint, which need to be modified according to the data.

Dataset information

  • DB: name of the study
  • number_of_class: Number of cell types in the training data (scRNA-seq data)
  • input_size: Number of genes in both training and test data
  • rna_paths: A list of file paths of the .npz files of scRNA-seq gene expression datasets
  • rna_labels: A list of file paths of the .txt files of scRNA-seq cell type information
  • atac_paths: A list of file paths of the .npz files of scATAC-seq gene activity expression datasets
  • atac_labels: A list of file paths of the .txt files of scATAC-seq cell type information (optional; if atac_labels are provided, the accuracy after KNN is reported)
  • rna_protein_paths: A list of paths of the .npz files of protein expression data for CITE-seq data (optional)
  • atac_protein_paths: A list of paths of the .npz files of protein expression data for ASAP-seq data (optional)
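
Because the path and label fields are lists that go together, a mismatch between them is an easy mistake to make. The check below is a hedged sketch and not part of scJoint: `check_dataset_config` is a hypothetical helper, and the rule that each list of labels matches its list of paths one-to-one is an assumption based on the descriptions above.

```python
def check_dataset_config(rna_paths, rna_labels, atac_paths, atac_labels=()):
    """Return a list of problems found in the dataset fields (empty if none)."""
    problems = []
    if not rna_paths:
        problems.append("rna_paths is empty")
    # scRNA-seq labels are required: assume one label file per expression file.
    if len(rna_labels) != len(rna_paths):
        problems.append("rna_labels must have one entry per rna_paths entry")
    # scATAC-seq labels are optional; if given, assume they match atac_paths.
    if atac_labels and len(atac_labels) != len(atac_paths):
        problems.append("atac_labels must have one entry per atac_paths entry")
    for path in list(rna_paths) + list(atac_paths):
        if not path.endswith(".npz"):
            problems.append(f"expected a .npz file: {path}")
    for path in list(rna_labels) + list(atac_labels):
        if not path.endswith(".txt"):
            problems.append(f"expected a .txt file: {path}")
    return problems
```

Calling it with the lists from config.py before training gives an early, readable error instead of a failure partway through data loading.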

Training config

  • batch_size: Batch size (set as 256 by default)
  • lr_stage1: Learning rate for stage 1
  • lr_stage3: Learning rate for stage 3
  • lr_decay_epoch: Number of epochs after which the learning rate decays
  • epochs_stage1: Number of epochs for stage 1
  • epochs_stage3: Number of epochs for stage 3
  • p: The fraction of data pairs expected to have high cosine similarity scores (set as 0.8 by default)
  • embedding_size: Number of nodes in the embedding (hidden) layer (set as 64 by default)
  • momentum: Momentum for SGD (set as 0.9 by default)
  • center_weight: The weight for center loss (set as 1 by default)
  • num_threads: Number of threads used (set as 1 by default)
  • seed: Random seed to be used (set as none by default)
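
The list above does not spell out how lr_decay_epoch is applied. One common reading is step decay, where the learning rate is multiplied by a fixed factor every lr_decay_epoch epochs. The sketch below illustrates that reading only; the function name `step_decay_lr` and the factor of 0.1 are assumptions, not values taken from scJoint.

```python
def step_decay_lr(base_lr, epoch, lr_decay_epoch, factor=0.1):
    """Learning rate at a given 0-based epoch under step decay (assumed scheme)."""
    # Integer division counts how many decay steps have happened so far.
    return base_lr * factor ** (epoch // lr_decay_epoch)
```

With a hypothetical base rate of 0.01 and lr_decay_epoch of 20, epochs 0 to 19 would train at 0.01 and epochs 20 to 39 at roughly 0.001.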

The configuration we used in our paper can be found in link.

Output

scJoint will output 4 types of .txt files:

  • _embeddings.txt: Output of the embedding layer for each dataset
  • _knn_predictions.txt: KNN predictions for each scATAC-seq dataset (the final predictions), where the numeric labels correspond to those in the label_to_idx.txt file
  • _knn_probs.txt: Probability of the KNN predictions for each scATAC-seq dataset
  • _predictions.txt: Output of the prediction layer for each dataset
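
To read the final predictions as cell type names, the numeric values in a _knn_predictions.txt file can be mapped back through label_to_idx.txt. The sketch below assumes one numeric prediction per line and a `name index` layout for label_to_idx.txt; both layouts, and the helper names, are assumptions that should be checked against the actual files.

```python
def load_idx_to_label(mapping_path):
    """Read label_to_idx.txt into {index: cell type name}."""
    idx_to_label = {}
    with open(mapping_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the last space so names containing spaces survive.
            name, idx = line.rsplit(" ", 1)
            idx_to_label[int(idx)] = name
    return idx_to_label


def load_predicted_names(predictions_path, mapping_path):
    """Return the predicted cell type name for each cell, in file order."""
    idx_to_label = load_idx_to_label(mapping_path)
    with open(predictions_path) as f:
        # int(float(...)) tolerates values written as "3" or "3.0".
        return [idx_to_label[int(float(line))] for line in f if line.strip()]
```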

Visualisation

To generate tSNE and UMAP plots for the output data using R, run the following command in a terminal:

Rscript embedding_visualisation_R.R --output_dir output/ --input_dir data/ --TSNE TRUE --UMAP TRUE --proportion 1

where

  • output_dir: Directory of the output folder
  • input_dir: Directory of the input folder (where the label_to_idx.txt file is saved)
  • TSNE: TRUE/FALSE, indicating whether to run TSNE
  • UMAP: TRUE/FALSE, indicating whether to run UMAP
  • proportion: Proportion of cells used in visualisation

Note:

  • The script assumes the output folder only has results from one study
  • Please install the following packages before running the embedding_visualisation_R.R script by running the following code in R:
install.packages(c("ggplot2", "ggthemes", "scattermore", "ggpubr", "Rtsne", "uwot", "pals", "grDevices", "optparse"))

Output of embedding_visualisation_R.R:

  • TSNE and/or UMAP embeddings will be generated in the output_dir folder: tsne_embedding.txt, umap_embedding.txt
  • Visualisation of TSNE and UMAP: TSNE_plot.pdf, UMAP_plot.pdf
