cancerAlign

PyTorch implementation of cancerAlign, which uses adversarial learning to align different cancer types.

If you use any source code or datasets included in this toolkit in your work, please cite the following paper. The BibTeX entry is listed below:

@article{cancerAlign,
  title={cancerAlign: Stratifying tumors by unsupervised alignment across cancer types},
  author={Gao, Bowen and Luo, Yunan and Ma, Jianzhu and Wang, Sheng},
  journal={arXiv preprint arXiv:2011.xxxxx},
  year={2020}
}

Abstract

Tumor stratification, which aims at clustering tumors into biologically meaningful subtypes, is a key step towards personalized treatment. Large-scale profiled cancer genomics data enables us to develop computational methods for tumor stratification. However, most existing approaches considered only tumors from an individual cancer type during clustering, which overlooks common patterns across cancer types and leaves the clustering vulnerable to noise within that cancer type. To address these challenges, we proposed cancerAlign to map tumors of the target cancer type into the latent spaces of other source cancer types. These tumors were then clustered in each latent space rather than in the original space in order to exploit shared patterns across cancer types. Because aligned tumor samples across cancer types are not available, cancerAlign used adversarial learning to learn the mapping at the population level. It then used consensus clustering to integrate cluster labels from different source cancer types. We evaluated cancerAlign on 7,134 tumors spanning 24 cancer types from TCGA and observed substantial improvement on tumor stratification and cancer gene prioritization. We further revealed the transferability across cancer types, which reflected the similarity among them based on the somatic mutation profile. cancerAlign is an unsupervised approach that provides deeper insights into the heterogeneous and rapidly accumulating somatic mutation profile and can also be applied to other genome-scale molecular information.

Model Architecture

Dataset

Download the data from http://gdac.broadinstitute.org/
In total, we collected somatic mutation profiles of 7,134 tumors belonging to 24 different cancer cohorts: BRCA, BLCA, CESC, CHOL, COAD, DLBC, GBM, HNSC, KICH, KIRC, LGG, LIHC, LUAD, LUSC, OV, PAAD, PRAD, READ, SARC, STES, TFCT, THCA, UCEC, UVM.

Patients' mutation data are in raw_survival/*_mut.txt, where * is the cancer type name.

Patients' survival data are in raw_survival/*_surv.txt_clean, where * is the cancer type name.
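
The file naming pattern above can be used to discover which cancer types are available locally. The following is a minimal sketch, not part of the toolkit: the helper name `list_cancer_types` is hypothetical, and it assumes only the two file suffixes described above.

```python
from pathlib import Path
import tempfile

MUT_SUFFIX = "_mut.txt"
SURV_SUFFIX = "_surv.txt_clean"


def list_cancer_types(data_dir):
    """Return sorted cancer type names that have both a mutation and a survival file.

    Hypothetical helper: it relies only on the *_mut.txt and *_surv.txt_clean
    naming pattern and does not read the file contents.
    """
    data_dir = Path(data_dir)
    mut = {p.name[: -len(MUT_SUFFIX)] for p in data_dir.glob("*" + MUT_SUFFIX)}
    surv = {p.name[: -len(SURV_SUFFIX)] for p in data_dir.glob("*" + SURV_SUFFIX)}
    return sorted(mut & surv)


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["BLCA_mut.txt", "BLCA_surv.txt_clean", "BRCA_mut.txt"]:
        (Path(tmp) / name).touch()
    demo_types = list_cancer_types(tmp)
    print(demo_types)  # BRCA is skipped because its survival file is missing
```

In a checkout of the repository, calling `list_cancer_types("raw_survival")` should list the cohorts for which both files are present.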

A preprocessed file that contains the cancer type and mutated genes for each patient: data.csv

A known cancer gene list: known_cancer_genes.csv

Experiments

To run cancerAlign for a specific target cancer type:

python3 run.py --target="cancer type name"
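
To run several target cancer types in sequence, the command above can be wrapped in a short script. This is a sketch under assumptions: the helper names `build_run_command` and `run_targets` are hypothetical, and it assumes the `--target` value is the cohort abbreviation (for example BLCA). By default it only builds the commands; pass `dry_run=False` from the repository root to actually execute them.

```python
import subprocess


def build_run_command(target):
    """Build the documented run.py invocation for one target cancer type."""
    return ["python3", "run.py", f"--target={target}"]


def run_targets(targets, dry_run=True):
    """Return the commands for all targets, executing them when dry_run is False."""
    commands = [build_run_command(t) for t in targets]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing run instead of continuing silently.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands that would be executed.
planned = run_targets(["BLCA", "BRCA"])
for cmd in planned:
    print(" ".join(cmd))
```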

To generate the final cancerAlign clustering labels for a target cancer type with k clusters:

python3 final_cluster.py --target="cancer type name" --num_clusters=k

This generates a file named (target cancer type)_k.txt. For example, for BLCA and k=2, the file name is BLCA_2.txt. In the file, the first column contains the patient names and the second column contains the corresponding cluster labels.
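
The label file can then be loaded for downstream analysis. The sketch below is illustrative only: the helper name `read_cluster_labels` is hypothetical, and the column delimiter is an assumption (whitespace or commas), since the format is described here only as two columns. The patient names in the demo are made up.

```python
from collections import Counter
import os
import tempfile


def read_cluster_labels(path):
    """Read a two-column label file into a {patient_name: label} dict.

    Assumption: columns are separated by whitespace or commas.
    Labels are kept as strings; lines with fewer than two fields are skipped.
    """
    labels = {}
    with open(path) as handle:
        for line in handle:
            parts = line.replace(",", " ").split()
            if len(parts) < 2:
                continue
            labels[parts[0]] = parts[1]
    return labels


# Demo with a made-up file standing in for BLCA_2.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "BLCA_2.txt")
    with open(demo_path, "w") as handle:
        handle.write("patient_a\t0\npatient_b\t1\npatient_c\t0\n")
    demo_labels = read_cluster_labels(demo_path)

cluster_sizes = Counter(demo_labels.values())
print(cluster_sizes)
```

Counting the labels this way gives a quick check that the number of distinct clusters matches the requested k.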

To list the cancer genes produced by cancerAlign:

python3 cancer_genes.py --target="cancer type name"

This prints the top 10 genes generated by cancerAlign for the target cancer type.

Questions

For questions about the data and code, please contact bgao@caltech.edu. We will do our best to provide support and address any issues. We appreciate your feedback!
