Meta-Signer

Meta-Signer is a machine learning aggregated approach for feature evaluation of metagenomic datasets. It trains random forest, support vector machine, logistic regression, and multi-layer neural network models. Features are then aggregated across models and partitions into a single ranked list of the top k features.

Execution:

We provide a Python environment which can be imported using the Conda package manager.

Deep learning models are built using TensorFlow. Meta-Signer was designed using TensorFlow v1.14.0.

To fully utilize GPUs for faster training of the deep learning models, users will need to be sure that both CUDA and cuDNN are properly installed.

Other dependencies should be downloaded upon importing the provided environment.

Clone Repository

```
git clone https://github.com/YDaiLab/Meta-Signer.git
cd Meta-Signer
```

Import Conda Environment

```
conda env create -f meta-signer.yml
source activate meta-signer
```

Meta-Signer's Required Input

To use Meta-Signer on a dataset, first create a directory in the data folder. This directory requires two files:

| File | Description |
| --- | --- |
| abundance.tsv | A tab-separated file where each row is a feature and each column is a sample. The first column should be the feature ID. There should be no header row of sample IDs. |
| labels.txt | A text file where each row is the class value of one sample. Rows should be in the same order as the columns in abundance.tsv. |

Examples can be found in the PRISM and PRISM_3 datasets provided.
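
Before running the pipeline it can help to confirm that the two files agree with each other. The following is a minimal sketch, not part of Meta-Signer: `check_dataset` is a hypothetical helper that assumes the layout described above (no header row, feature ID in the first column, one label per line) and only checks sample counts and that abundances are numeric.

```python
import csv
from pathlib import Path


def check_dataset(dataset_dir):
    """Check that abundance.tsv and labels.txt agree on the number of samples.

    Hypothetical helper, not shipped with Meta-Signer.
    Returns (n_features, n_samples). Raises ValueError on a mismatch.
    """
    dataset_dir = Path(dataset_dir)

    # labels.txt: one class value per line, one line per sample.
    with open(dataset_dir / "labels.txt") as handle:
        labels = [line.strip() for line in handle if line.strip()]

    # abundance.tsv: no header row; column 0 is the feature ID,
    # the remaining columns are per-sample abundances.
    n_features = 0
    with open(dataset_dir / "abundance.tsv", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for row_number, row in enumerate(reader, start=1):
            if not row:
                continue
            feature_id, values = row[0], row[1:]
            if len(values) != len(labels):
                raise ValueError(
                    f"row {row_number} ({feature_id}): {len(values)} samples, "
                    f"but labels.txt has {len(labels)} labels"
                )
            for value in values:
                float(value)  # raises ValueError if an abundance is not numeric
            n_features += 1

    return n_features, len(labels)
```

For example, `check_dataset("data/PRISM")` would be called from the repository root, assuming the provided PRISM directory sits in the data folder as described above.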

Set configuration settings

Meta-Signer offers a flexible framework which can be customized in the configuration file. The configuration file offers the following parameters:

| Section | Parameter | Description |
| --- | --- | --- |
| Evaluation | NumberTestSplits | Number of partitions for cross-validation |
| Evaluation | NumberRuns | Number of independent iterations of cross-validation to run |
| Evaluation | Normalization | Normalization method applied to the data (Standard or MinMax) |
| Evaluation | DataSet | Directory within the data directory to load data from |
| Evaluation | FilterThreshCount | Remove features that are present in fewer than the specified fraction of samples |
| Evaluation | FilterThreshMean | Remove features with a mean value less than the specified value |
| Evaluation | MaxK | The maximum number of features to generate in the rank aggregation |
| Evaluation | AggregateMethod | The method used for rank aggregation (GA or CE) |
| RF | Train | Use random forest for feature ranking and aggregation |
| RF | NumberTrees | Number of decision trees per forest |
| RF | ValidationModels | Number of partitions for internal cross-validation for tuning |
| SVM | Train | Use SVM for feature ranking and aggregation |
| SVM | MaxIterations | Maximum number of iterations to train |
| SVM | GridCV | Number of partitions for internal cross-validation for tuning |
| Logistic Regression | Train | Use logistic regression for feature ranking and aggregation |
| Logistic Regression | MaxIterations | Maximum number of iterations to train |
| Logistic Regression | GridCV | Number of partitions for internal cross-validation for tuning |
| MLPNN | Train | Use MLPNN for feature ranking and aggregation |
| MLPNN | LearningRate | Learning rate for neural network models |
| MLPNN | BatchSize | Size of each batch during neural network training |
| MLPNN | Patience | Number of epochs without improvement after which training stops |
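
To make the filtering and normalization parameters concrete, here is a rough sketch of what FilterThreshCount, FilterThreshMean, and the MinMax option describe. It is an illustration only, not Meta-Signer's code: the function names are hypothetical, "present" is assumed to mean an abundance greater than zero, and the order in which the two filters are applied is also an assumption.

```python
def filter_features(abundance, thresh_count, thresh_mean):
    """Drop features that are too rare or too low in mean abundance.

    abundance maps feature ID -> list of per-sample values.
    Assumption: a feature is "present" in a sample when its value is > 0.
    """
    kept = {}
    for feature_id, values in abundance.items():
        present_fraction = sum(1 for v in values if v > 0) / len(values)
        mean_value = sum(values) / len(values)
        # FilterThreshCount: present in fewer than this fraction -> removed.
        if present_fraction < thresh_count:
            continue
        # FilterThreshMean: mean below this value -> removed.
        if mean_value < thresh_mean:
            continue
        kept[feature_id] = values
    return kept


def min_max_scale(values):
    """MinMax normalization of one feature to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - low) / (high - low) for v in values]
```

With a count threshold of 0.5, a feature that is nonzero in only one of four samples would be dropped, while a feature present in every sample survives as long as its mean clears the mean threshold.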

Run the Meta-Signer pipeline:

Once the configuration is set to desired values, generate the aggregated feature list using:

```
cd src
python generate_feature_ranking.py
```

Upon completion, Meta-Signer creates a directory in the results folder named after the value of the DataSet parameter in the configuration file. This directory contains several files of interest, including:

| File | Description |
| --- | --- |
| training_performance.html | A portable HTML file showing cross-validated evaluation of ML methods |
| feature_evaluation/ensemble_rank_table.csv | Ranked lists of features for each method and each cross-validated run |
| feature_evaluation/aggregated_rank_table.csv | Aggregated ranked list of features |
| prediction_evaluation/results.tsv | Results table for cross-validated evaluation of ML methods |
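
The step from the per-method ranked lists to the single aggregated list can be pictured with a much simpler scheme than the one Meta-Signer uses. The sketch below is a Borda count, deliberately swapped in for illustration; it is not the GA or CE method selected by AggregateMethod, and `borda_aggregate` is a hypothetical name.

```python
from collections import defaultdict


def borda_aggregate(ranked_lists, k):
    """Combine several ranked feature lists into one top-k list via Borda count.

    NOT Meta-Signer's GA or CE method: a simpler stand-in for illustration.
    """
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            # Top position earns n points, the next n - 1, and so on.
            scores[feature] += n - position
    # Sort by descending score; break ties alphabetically for determinism.
    ordered = sorted(scores, key=lambda feature: (-scores[feature], feature))
    return ordered[:k]
```

Each inner list plays the role of one method's ranking from one cross-validation run, and `k` corresponds to the number of features kept in the final list.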

Once the features have been aggregated into a single ranked list, the user can decide how many features to use for the final training of the ML models. Meta-Signer generates these final trained models with a user-specified number of features using:

```
cd src
python generate_models.py <DataSet> <k>
```

Where DataSet is the directory in the results folder to use and k is the final number of features to use during training. Additionally, the trained models can be evaluated on an external dataset using:

```
cd src
python generate_models.py <DataSet> <k> -e <ExternalDataSet>
```

Where ExternalDataSet is a directory in the data folder with abundance.tsv and labels.txt files.

Upon completion, Meta-Signer will create a directory within the dataset's results directory that will contain:

| File | Description |
| --- | --- |
| feature_ranking.html | A portable HTML file listing the ranked features up to the specified value of k |
| rf_model.pkl | The trained random forest model in pickle format |
| logistic_regression_model.pkl | The trained logistic regression model in pickle format |
| svm_model.pkl | The trained SVM model in pickle format |
| mlpnn.h5 | The trained neural network model in H5 format |
| training_results.tsv | The performance of trained models on the training set |
| external_results.tsv | The performance of trained models on the external test set |
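
The pickled models can be read back with Python's standard `pickle` module. The sketch below is one possible way to do this and is not part of Meta-Signer: `load_pickled_models` is a hypothetical helper, the directory passed to it is whatever output directory Meta-Signer created, and unpickling presumably requires the same libraries (and compatible versions) that were used for training. The mlpnn.h5 file is not a pickle and would need a Keras-compatible loader instead.

```python
import pickle
from pathlib import Path


def load_pickled_models(model_dir):
    """Load whichever of the pickled Meta-Signer models exist in model_dir.

    Hypothetical helper. Only unpickle files you trust: pickle can
    execute arbitrary code while loading.
    """
    names = ["rf_model.pkl", "logistic_regression_model.pkl", "svm_model.pkl"]
    models = {}
    for name in names:
        path = Path(model_dir) / name
        if path.exists():
            with open(path, "rb") as handle:
                models[name] = pickle.load(handle)
    return models
```

The returned dictionary is keyed by file name, so a missing model (for example, one whose Train setting was disabled) is simply absent rather than causing an error.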