PopPhy-CNN

PopPhy-CNN is a novel convolutional neural network (CNN) learning architecture that effectively exploits the phylogenetic structure of microbial taxa. Its input is a 2D matrix created by embedding the phylogenetic tree, populated with the relative abundance of microbial taxa in a metagenomic sample. This conversion lets CNNs explore the spatial relationships of the taxonomic annotations on the tree together with their quantitative characteristics in metagenomic data.
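To make the tree-to-matrix idea concrete, here is a minimal sketch of one plausible reading of the embedding. It is not the code from src. It assumes that an internal node's value is the sum of its children's abundances and that each matrix row holds one tree level, left-aligned and zero-padded. The tree encoding (nested `(name, children)` tuples) and the helper names `populate` and `to_matrix` are hypothetical.

```python
def populate(node, abundance):
    """Attach a value to every node of a (name, children) tree.

    Assumption: leaves take their relative abundance and each internal
    node takes the sum of its children's values.
    """
    name, children = node
    kids = [populate(child, abundance) for child in children]
    value = abundance.get(name, 0.0) + sum(kid[1] for kid in kids)
    return (name, value, kids)


def to_matrix(populated):
    """Lay the populated tree out as a 2D list, one row per tree level."""
    rows = []
    level = [populated]
    while level:
        rows.append([value for _, value, _ in level])
        # Children are visited in parent order, so siblings stay adjacent.
        level = [kid for _, _, kids in level for kid in kids]
    width = max(len(row) for row in rows)
    # Pad shorter rows with zeros so every row has the same width.
    return [row + [0.0] * (width - len(row)) for row in rows]


# Toy tree: root -> (A -> a1, a2), (B -> b1)
tree = ("root", [("A", [("a1", []), ("a2", [])]), ("B", [("b1", [])])])
matrix = to_matrix(populate(tree, {"a1": 0.25, "a2": 0.25, "b1": 0.5}))
```

Because siblings sit next to each other in a row and parents sit above their children, a convolution window over this matrix covers taxa that are close on the tree.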

Publication:

Derek Reiman, Ahmed A. Metwally, Yang Dai. "PopPhy-CNN: A Phylogenetic Tree Embedded Architecture for Convolution Neural Networks for Metagenomic Data", bioRxiv, 2018. [paper]

Execution:

Prerequisites

Python 2.7.14
Libraries: pip install theano numpy pandas joblib xmltodict untangle sklearn network

Datasets

Datasets are stored in respective folders under the data directory. Each dataset needs the following:

count_matrix.csv
labels.txt
otu.csv
newick.txt

The file count_matrix.csv is a comma-separated file representing the count table. Each row represents a sample and each column represents the abundance of an OTU. The file must not contain a header row or an index column. The file labels.txt contains the class labels, one per line, with samples in the same order as the rows of count_matrix.csv. The file otu.csv contains all the OTU features as a single comma-separated list, in the same order as the columns of count_matrix.csv. The file newick.txt is the Newick-formatted text file for the phylogenetic taxonomic tree.
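The consistency rules above can be checked before running any of the scripts. The following is a sketch only: `check_dataset` is a hypothetical helper, not part of this repository, and it assumes the four file names listed above sit directly inside the dataset folder.

```python
import csv
import os


def check_dataset(folder):
    """Verify that the four input files agree; return (n_samples, n_otus)."""
    with open(os.path.join(folder, "count_matrix.csv")) as f:
        rows = [row for row in csv.reader(f) if row]
    with open(os.path.join(folder, "labels.txt")) as f:
        labels = [line.strip() for line in f if line.strip()]
    with open(os.path.join(folder, "otu.csv")) as f:
        otus = [name.strip() for name in f.read().strip().split(",")]
    with open(os.path.join(folder, "newick.txt")) as f:
        newick = f.read().strip()

    widths = set(len(row) for row in rows)
    if len(widths) != 1:
        raise ValueError("count_matrix.csv rows have differing lengths")
    n_otus = widths.pop()
    if len(labels) != len(rows):
        raise ValueError("labels.txt needs exactly one label per sample")
    if len(otus) != n_otus:
        raise ValueError("otu.csv needs exactly one name per count column")
    # A Newick tree description is terminated by a semicolon.
    if not newick.endswith(";"):
        raise ValueError("newick.txt does not end with ';'")
    return len(rows), n_otus
```

For example, `check_dataset("data/Cirrhosis")` would return the number of samples and OTUs, assuming the Cirrhosis dataset lives at that path relative to the working directory.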

To generate 10 times 10-fold cross validation sets for the Cirrhosis dataset:

python prepare_data.py -d=Cirrhosis -m=CV -n=10 -s=10
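For readers unfamiliar with the term, "10 times 10-fold cross validation" means repeating a 10-fold split ten times with a different shuffle each time, giving 100 train/test partitions. The sketch below illustrates that idea with a stratified split in plain Python. It is an illustration under assumptions, not what prepare_data.py necessarily does: whether the script stratifies by class, how it seeds the shuffle, and the function name `repeated_stratified_folds` are all hypothetical here.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_repeats=10, n_folds=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for label in sorted(by_class):
            indices = by_class[label][:]
            rng.shuffle(indices)
            # Deal each class round-robin so folds keep the class ratio.
            for position, index in enumerate(indices):
                folds[position % n_folds].append(index)
        for k in range(n_folds):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j in range(n_folds) if j != k for i in folds[j]
            )
            yield repeat, k, train_idx, test_idx


# Toy labels: 50 samples per class.
splits = list(repeated_stratified_folds(["0"] * 50 + ["1"] * 50))
```

Each sample appears in exactly one test fold per repeat, so every repeat yields a complete set of out-of-fold predictions.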

To train PopPhy-CNN using the generated 10 times 10-fold cross validation Cirrhosis sets for 400 epochs with early stopping of 20 epochs:

python train.py -d=Cirrhosis -m=CV -n=10 -s=10 -e=400 -p=20
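The interaction between the epoch limit and early stopping can be sketched as a patience loop. This is a generic illustration, not the logic of train.py: it assumes that "early stopping of 20 epochs" means training halts once a monitored validation loss has failed to improve for 20 consecutive epochs, and `run_epoch` is a hypothetical callback that trains for one epoch and returns that loss.

```python
def train_with_early_stopping(run_epoch, max_epochs=400, patience=20):
    """Run up to max_epochs; stop after `patience` epochs without improvement.

    Returns (best_epoch, best_loss, epochs_run). best_epoch is -1 if no
    epoch ever produced a finite improvement.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run += 1
        if val_loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss, epochs_run


# Toy loss curve that bottoms out at epoch 30 and then rises.
result = train_with_early_stopping(lambda epoch: abs(epoch - 30) + 1.0)
```

With the toy curve, training stops 20 epochs after the minimum rather than running all 400 epochs.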

To extract feature importance scores from the learned models:

python feature_map_analysis.py -d=Cirrhosis -m=CV -n=10 -s=10

To generate files to use for Cytoscape visualization:

python generate_tree_scores.py -d=Cirrhosis
