Shawn Baker edited this page Sep 5, 2019 · 2 revisions

Introduction

Harvestman is a program for extracting informative features from genomic sequence data. The selected features can be used for insight into how genotypes are translated into phenotypes, or they can simply be used to train ML classifiers. There are three major modes of operation: a VCF preprocessing mode, a feature selection and training mode, and a mode for selecting robust features from previous runs.

VCF Pre-Processing (spin)

Harvestman first converts the input VCF files into a special binary format so that feature selection runs faster. This step requires a reference genome, one or more well-formatted VCF files, and a list of labels covering every sample in those VCF files.
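Because every sample needs a label, it can help to check coverage before pre-processing. The following is a minimal sketch, not part of Harvestman: it reads sample names from the standard VCF `#CHROM` header line and reports any sample without a label. The function names, the in-memory example VCF, and the dictionary used for labels are all assumptions made for illustration; the label file format Harvestman actually expects is not described here.

```python
def vcf_sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM through FORMAT);
            # every column after them is a sample name.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


def unlabeled_samples(vcf_lines, labels):
    """List samples that appear in the VCF header but have no label."""
    return [name for name in vcf_sample_names(vcf_lines) if name not in labels]


# Hypothetical three-sample VCF and a label mapping that misses one sample.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\n",
]
example_labels = {"S1": "case", "S2": "control"}

missing = unlabeled_samples(example_vcf, example_labels)
print("Samples without a label:", missing)
```

With a real file, the same functions accept an open file handle in place of the list of lines.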

Feature selection (train)

Using the files produced by the pre-processing step, Harvestman's 'train' operation selects features and trains a few predefined classifiers on them. It outputs a JSON file containing the selected features, the feature vectors associated with them, and other metadata.

Robust feature selection (reselect)

Once you have produced several output files with the 'train' operation, you can intersect the features selected in each run to keep only the most relevant ones. The input feature sets may come from N-fold cross-validation or from multiple solutions to a single ILP formulation.
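The idea of intersecting feature sets across runs can be sketched in a few lines. This is an illustration of the general technique under stated assumptions, not Harvestman's 'reselect' implementation: the function name `robust_features`, the `min_runs` parameter, and the feature identifiers are hypothetical, and the relaxed threshold (keeping features seen in at least `min_runs` runs rather than all of them) is an added variation that the text above does not describe.

```python
from collections import Counter


def robust_features(runs, min_runs=None):
    """Return features selected in at least `min_runs` of the given runs.

    With min_runs=None, a feature must appear in every run, which is the
    plain set intersection.
    """
    if not runs:
        raise ValueError("at least one run is required")
    if min_runs is None:
        min_runs = len(runs)
    # Count each feature once per run, even if a run lists it twice.
    counts = Counter(feature for run in runs for feature in set(run))
    return sorted(f for f, n in counts.items() if n >= min_runs)


# Hypothetical feature sets from three cross-validation folds.
folds = [
    {"rs1", "rs2", "rs3"},
    {"rs2", "rs3", "rs4"},
    {"rs1", "rs2", "rs5"},
]

strict = robust_features(folds)               # present in all three folds
relaxed = robust_features(folds, min_runs=2)  # present in at least two folds
print(strict, relaxed)
```

A strict intersection can become empty when folds disagree, which is why a count threshold is sometimes a useful fallback.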
