
Training

Shawn Baker edited this page Sep 5, 2019 · 1 revision

Quick Use

Once you have generated the binary files with 'spin', use the 'train' operation to select features and train a set of predefined classifiers on them. The important flags are listed below. The location of the binary files must be given as the last positional argument.

Important Flags

-l or --label

You must specify the label column to select and train against (e.g. '-l super_pop').

-m or --maximum-features

Specify the maximum number of features to select, train, and output. For the ILP, this sets a constraint on the number of selected features before an optimal solution is found. For the other algorithms, this picks the top K features by information gain.

-i or --information-gain

Specify a minimum mutual information for consideration as a floating point number between 0 and 1. For the ILP, this prevents any feature from being added to the ILP if it's below the threshold. Set this to a higher value if you're spending too long finding an optimal solution, but set it to a lower value if you find it selects features too close to the threshold. For all other algorithms, this is a post-processing step that removes selected features below the threshold.
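The non-ILP behaviour described for -m and -i (rank by information gain, keep the top K, then drop anything below the threshold) can be sketched in Python. This is only an illustration under stated assumptions, not the tool's implementation: the function names, the dictionary input format, and the choice of base-2 logarithms are all hypothetical, and the sketch does not model how the tool keeps values between 0 and 1 for non-binary features.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information I(feature; labels) in bits for two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), c in joint.items():
        # p(f, l) * log2( p(f, l) / (p(f) * p(l)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (feature_counts[f] * label_counts[l]))
    return mi


def select_top_k(features, labels, max_features, min_gain):
    """Hypothetical helper: features maps a feature name to its value vector.

    Keeps the max_features highest-scoring features, then removes any whose
    score is below min_gain (the post-processing step described above).
    """
    scored = [(mutual_information(values, labels), name)
              for name, values in features.items()]
    # Highest gain first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top_k = scored[:max_features]
    return [name for gain, name in top_k if gain >= min_gain]


if __name__ == "__main__":
    labels = ["A", "A", "B", "B"]
    features = {
        "perfect": [1, 1, 0, 0],   # fully determines the label
        "partial": [1, 1, 1, 0],   # weakly informative
        "noise":   [1, 0, 1, 0],   # independent of the label
    }
    print(select_top_k(features, labels, max_features=2, min_gain=0.3))
```

Raising min_gain in this sketch shrinks the result even when max_features would allow more. That mirrors the trade-off described above between a threshold that is too permissive and one that is too strict.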

-p or --permutation-test

Run every node of the feature graph through a permutation test, where the information gain of feature vectors with shuffled labels is compared against the information gain of the feature vectors with the true labels.
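As a rough illustration of the idea, the following Python sketch estimates a permutation p-value for a single feature by comparing its information gain under the true labels with its gain under shuffled labels. It is a minimal sketch under assumptions: the function names, the number of permutations, and the p-value formula are hypothetical choices, and the source does not say how the tool turns the comparison into a decision.

```python
import math
import random
from collections import Counter


def information_gain(feature, labels):
    """Mutual information between two discrete sequences, in bits."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (feature_counts[f] * label_counts[l]))
        for (f, l), c in joint.items()
    )


def permutation_p_value(feature, labels, n_permutations=1000, seed=0):
    """Fraction of label shuffles whose gain matches or beats the observed gain."""
    rng = random.Random(seed)           # fixed seed keeps the sketch reproducible
    observed = information_gain(feature, labels)
    shuffled = list(labels)             # shuffle a copy, never the caller's labels
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point summation order.
        if information_gain(feature, shuffled) >= observed - 1e-12:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


if __name__ == "__main__":
    labels = ["A"] * 10 + ["B"] * 10
    informative = [0] * 10 + [1] * 10
    constant = [1] * 20
    print(permutation_p_value(informative, labels, n_permutations=200))
    print(permutation_p_value(constant, labels, n_permutations=200))
```

A feature whose gain is rarely matched by shuffled labels gets a small p-value, while an uninformative feature gets a value near 1.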

-X or --cross-validation

Run N-fold cross-validation (e.g. -X 5). You must specify the total number of folds! This outputs N different solution sets, which can be used to avoid overfitting or as input to 'reselect' to find robust features.
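To make the N solution sets concrete, here is a small Python sketch that splits sample indices into N contiguous folds and then keeps the features that appear in at least a given number of per-fold solution sets. Both helpers are hypothetical: the source does not describe how the tool partitions samples or what criterion 'reselect' applies, so the counting rule below is only one simple way to define "robust".

```python
from collections import Counter


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder evenly
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def robust_features(solution_sets, min_folds):
    """Assumed criterion: keep features selected in at least min_folds folds."""
    counts = Counter(name for solution in solution_sets for name in set(solution))
    return sorted(name for name, count in counts.items() if count >= min_folds)


if __name__ == "__main__":
    for train, test in k_fold_indices(10, 5):
        print(len(train), test)
    # Made-up feature names standing in for three per-fold solution sets.
    print(robust_features([["a", "b"], ["a", "c"], ["a", "b"]], min_folds=2))
```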

-s or --feature-selection

Explicitly specify a feature selection algorithm (defaults to ILP). This is useful if you want to benchmark against another hierarchical feature selection algorithm (e.g. SHSEL), or against the pre-processing step (information gain threshold, graph collapse, etc.).

Examples

If you have produced binary files from the 1000 Genomes Project with the examples given in the 'Spinning' section, you can output relevant features and train stock models with the ancestry labels the project provides. Assuming you have placed your output in the directory 'tgs_data', substitute that path for <data-dir> in the following commands:

Run the ILP feature selector to find up to 500 features, each with an information gain of at least 0.3, that predict the five continental labels:

./HarvestmanConsole train -l super_pop -m 500 -i 0.3 -v <data-dir>

Run the SHSEL feature selector to do the same:

./HarvestmanConsole train -l super_pop -s shsel -m 500 -i 0.3 -v <data-dir>

Run the ILP selector with five-fold cross-validation:

./HarvestmanConsole train -l super_pop -X 5 -m 500 -v <data-dir>
