Home
Harvestman is a program for extracting informative features from genomic sequence data. The selected features can offer insight into how genotypes are translated into phenotypes, or they can simply be used to train ML classifiers. Harvestman has three major modes of operation: a VCF pre-processing mode, a feature selection and training mode, and a mode for selecting robust features from previous runs.
Harvestman first needs to pre-process input VCF files into a special binary format that makes feature selection faster. This step requires a reference genome, one or more well-formatted VCF files, and a list of labels for every sample in the VCF files.
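Because pre-processing needs labels for every sample, it can help to check the label list against the VCF before starting. The following is a minimal sketch, not part of Harvestman itself. It relies on the standard VCF layout, where sample names follow the nine fixed columns of the `#CHROM` header line. The label file format (one `sample<TAB>label` pair per line, one label per sample) and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def vcf_sample_names(lines: Iterable[str]) -> List[str]:
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # The first nine columns (CHROM through FORMAT) are fixed;
            # every column after them is a sample name.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def parse_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Parse the assumed 'sample<TAB>label' format into a dict."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample, label = line.split("\t")
        labels[sample] = label
    return labels


def unlabeled_samples(samples: List[str], labels: Dict[str, str]) -> List[str]:
    """Return the samples that have no entry in the label mapping."""
    return [s for s in samples if s not in labels]


# Hypothetical in-memory inputs; real data would be read from files.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\t1/1\n",
]
label_lines = ["S1\tcase\n", "S2\tcontrol\n"]

samples = vcf_sample_names(vcf_lines)
missing = unlabeled_samples(samples, parse_labels(label_lines))
print("samples without a label:", missing)
```

Here `S3` appears in the VCF header but not in the label list, so it is reported as missing.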
Using the files produced by the pre-processing step, Harvestman's 'train' operation selects features and trains a few predefined classifiers on the data. It outputs a JSON file with the selected features, the feature vectors associated with them, and other metadata.
Once you have produced several output files using the 'train' operation, you can take the intersection of the selected features to narrow them down to the most relevant ones. The input can be feature selection sets from N-fold cross-validation, or multiple solutions from a single ILP formulation.
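The idea of intersecting feature sets across runs can be sketched in a few lines. This is an illustration of the concept, not Harvestman's own implementation. The JSON layout (a top-level `"features"` list) and the feature identifiers are hypothetical, since the actual output schema is not described here. The `min_runs` parameter is an added relaxation: leaving it unset gives the strict intersection.

```python
import json
from collections import Counter
from typing import Dict, List, Optional


def robust_features(runs: List[Dict], min_runs: Optional[int] = None) -> List[str]:
    """Return features selected in at least `min_runs` of the given runs.

    With min_runs left as None, a feature must appear in every run,
    which is the plain intersection of the selected feature sets.
    """
    if min_runs is None:
        min_runs = len(runs)
    counts = Counter()
    for run in runs:
        # Use a set so a feature counts at most once per run.
        counts.update(set(run["features"]))
    return sorted(f for f, c in counts.items() if c >= min_runs)


# Hypothetical outputs from three 'train' runs (e.g. three CV folds).
raw_outputs = [
    '{"features": ["chr1:100", "chr2:200", "chr3:300"]}',
    '{"features": ["chr1:100", "chr3:300"]}',
    '{"features": ["chr1:100", "chr2:200", "chr3:300", "chr4:400"]}',
]
runs = [json.loads(text) for text in raw_outputs]

strict = robust_features(runs)               # present in all three runs
relaxed = robust_features(runs, min_runs=2)  # present in at least two runs
print(strict)
print(relaxed)
```

The strict intersection keeps only features that every run agreed on, while a lower `min_runs` tolerates a feature being dropped in some folds.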