Skip to content

chenhcs/GIANT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GIANT

License: MIT

GIANT (Gene-based data Integration and ANalysis Technique) is a method for unified analysis of atlas-level single cell data. The method generates a unified gene embedding space across multiple data modalities and tissues. GIANT first constructs gene graphs for cell clusters of each tissue from each data modality. A dendrogram is then built to connect the gene graphs in a hierarchy. GIANT next combines the gene graphs and the dendrogram to recursively embed genes in the graphs to a common latent space. Locations of genes in the space reflect their functions in their cell clusters.

System requirements

Operating system

The software has been tested on the CentOS Linux 7 system.

Software requirements

  • python 3.9.7
  • anndata 0.7.5
  • cython 0.29.30
  • goatools 1.1.12
  • joblib 1.1.0
  • leidenalg 0.8.10
  • matplotlib 3.5.3
  • networkx 2.6.3
  • numpy 1.22.4
  • pandas 1.4.3
  • pyensembl 1.9.4
  • scanpy 1.8.2
  • scikit-learn 1.0.1
  • scipy 1.9.0
  • setuptools 63.4.1

Installation

It is recommended to create a virtual environment using Conda. After successfully installing Anaconda/Miniconda, create an environment using the provided environment.yml file:

conda env create -f environment.yml
conda activate GIANT-env

GIANT supports distributed training based on the Gensim package implemented in Cython. Following the steps to compile the Cython code:

cd src/embedding
python setup.py build_ext --inplace

Usage

Below is the instruction of command usages on example data. Use python script.py --help to check the specific usage of each command.

Build gene graphs

  • Unzip the .h5ad files in the example_data folder.
  • The .h5ad files are in the standard output format of the Scanpy pipelines.
  • Run the following commands to build co-expression graphs for scRNA-seq data or Slide-seq data, the graphs will be saved in the specified --outdir:
cd src/build_graphs
python coexpression_knn_graph.py --K 10 --datatype "rna" --tissue 'Heart' --datadir "../../example_data/secondary_analysis.h5ad" --genescopefile "../../example_data/gene_scope.txt" --outdir "../../graphs/edgelists"
  • The .narrowPeak files used in this step are generated from snap files using the SnapATAC package. Peaks that contain TF motifs (*_peak_by_tf.txt) can be identified using the motifmatchr package with TF binding motifs from the JASPAR database(*_tf_motifs.txt).
  • Run the following commands to build gene-TF hypergraphs for scATAC-seq data, the graphs will be saved in the specified --outdir:
cd src/build_graphs
python atacseq_hypergraph.py --bpupstream 500 --tissue "Thymus" --datadir "../../example_data/atac_peaks" --genescopefile "../../example_data/gene_scope.txt" --generangefile "../../example_data/GenesRanges.csv" --outdir "../../graphs/edgelists/"
  • Run the following commands to build spatial co-expression hypergraphs for Slide-seq data, the graphs will be saved in the specified --outdir:
cd src/build_graphs
python spatial_coexpression_knn_graph.py --K 10 --tissue "Kidney" --datadir "../../example_data/spatial_secondary_analysis.h5ad" --genescopefile "../../example_data/gene_scope.txt" --outdir "../../graphs/edgelists/"

Build dendrogram of graphs

  • The differential genes used in this step can be computed from the gene expression/activity matrices using the Scanpy package.
  • Run the following commands to build dendrogram, the dendrogram edges will be saved in the specified --outdir, a figure of the dendrogram will be saved in the specified --figdir:
cd src/build_dendrogram
python build_dendrogram.py --K 50 --degenedir "../../example_data/DE_genes/" --graphdir "../../graphs/edgelists" --outdir "../../graphs/" --figdir "../../graphs/"

Learn gene embeddings

  • Run the following commands to learn gene embeddings, the generated embedding file gene_vectors.emb will be saved in the specified --outdir:
cd src/embedding
python change_id.py --graphdir "../../graphs/edgelists/" #Map gene IDs to numbers for the model input
python main.py --input "../../graphs/graph.list" --outdir "emb" --hierarchy "../../graphs/graph.hierarchy" --iter 20 --regstrength 1 --workers 8 --dimension 128

Identify enriched functions or gene regulons in different parts of the embedding space

  • Download gene annotations from the GOA database.
  • Download the ontology from the Gene Ontology database.
  • Run the following commands to split the embedding space into components (cluster genes in the whole space), then run enrichment analysis for each embedding component:
cd src/analysis
python format_emb_file.py --emb '../embedding/emb/gene_vectors.emb' --idmap '../../graphs/edgelists/gmap.map' --outdir './'
python cluster_genes.py --emb "../../example_data/gene.emb" --resolution 10 --nneighbors 30 --annotationfile "goa_human.gaf" --outdir "outdir"
python GO_TF_enrichment.py "outdir/cluster_0.txt" "outdir/population.txt" "goa_human_graph.gaf" --ev_exc=IEA --pval=0.05 --method=fdr_bh --pval_field=fdr_bh

Infer novel gene functions

  • Run the following commands to cluster each gene's embeddings of different graphs, then infer the functions of each gene in each cluster by running enrichment analysis on its neighbor genes in the cluster:
cd src/analysis
python neighbors_in_graphs.py --emb "../../example_data/gene.emb" --annotationfile "goa_human.gaf" --outdir "gene_list_outdir"
python GO_enrichment_in_neighbors.py "gene_list_outdir/A2M_nlist_0.txt" "gene_list_outdir/population.txt" "goa_human_name.gaf" --ev_exc=IEA --pval=0.05 --method=fdr_bh --pval_field=fdr_bh

Pretrained embeddings

Pretrained gene embeddings for both the HuBMAP dataset and the human fetal dataset are available in the embeddings folder.

Credits

The software is an implementation of the method GIANT, jointly developed by Hao Chen, Nam D. Nguyen, Matthew Ruffalo, and Ziv Bar-Joseph from the System Biology Group @ Carnegie Mellon University.

Contact

Contact us if you have any questions:
Hao Chen: hchen4 at andrew.cmu.edu
Ziv Bar-Joseph: zivbj at andrew.cmu.edu

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

GIANT (Gene-based data Integration and ANalysis Technique) is a method for large-scale joint analyses of atlas-level single cell data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published