This is a machine learning tool to classify the function of proteins from obscure datasets.
The pipeline takes two lists of genes that differ in some distinguishing characteristic, trains a machine learning model to tell them apart, and then works through a new, unidentified gene set, predicting which category each gene belongs to. The tool can pick up patterns in the data that are too complicated for humans to parse and use them to suggest potential functions for unidentified genes.
The Woolf pipeline uses scikit-learn to provide the following machine learning algorithms:
- Random forest
- k-nearest neighbors (kNN)
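To give a feel for what a kNN classifier does with a feature table, here is a minimal, dependency-free sketch. It is not the Woolf implementation, which relies on scikit-learn; the function name `knn_predict` and the toy two-feature rows are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    # Euclidean distance from the query to every training row.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Keep the k nearest neighbors and take a majority vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature rows: two made-up composition features per sequence.
train_rows = [
    (0.10, 0.30), (0.12, 0.28), (0.11, 0.33),  # positive class (1)
    (0.40, 0.05), (0.42, 0.07), (0.38, 0.04),  # negative class (0)
]
train_labels = [1, 1, 1, 0, 0, 0]

positive_guess = knn_predict(train_rows, train_labels, (0.11, 0.29))
negative_guess = knn_predict(train_rows, train_labels, (0.41, 0.06))
print(positive_guess, negative_guess)
```

A random forest plays the same role in the pipeline: it is trained on the labeled rows and then asked to label new ones.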
For a complete tutorial please see the User Guide.
These dependencies should be installed automatically if they are not already present:
- pandas
- argparse
- biopython
- scikit-learn
- numpy
$ pip install woolf
Run the following on the command line, with both FASTA files in the current working directory:
featureTable [-h] [-c COMPARISONFILENAME] [-f FOLDER]
[-b | -t] [-p POSFASTA [POSFASTA ...]]
[-n NEGFASTA [NEGFASTA ...]]
Options:
-h, --help Show help message and exit.
-c, --comparisonFileName COMPARISONFILENAME An identifying tag for all output files.
-f, --folder FOLDER A folder to contain the output files.
-b, --binary Creates a feature table with binary class markers.
-t, --predict Creates a feature table with no class markers for use in prediction.
-p, --posFasta POSFASTA ... One or more FASTA files containing amino acid sequences belonging to the positive class.
-n, --negFasta NEGFASTA ... One or more FASTA files containing amino acid sequences belonging to the negative class.
-u, --unknownFasta UNKNOWNFASTA ... One or more FASTA files containing amino acid sequences of unknown function.
Note that posFasta and negFasta should contain genes from a similar class with one distinct functional difference.
The output file outputFile will contain a feature table with percent amino acid composition and sequence length as features. Binary tables also contain a binary class marker; predict tables do not.
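As a rough illustration of what one row of such a feature table could look like, the sketch below computes percent amino acid composition and length for a single sequence. The function name `sequence_features`, the column names, and the 20-letter alphabet are assumptions made for this example; the actual featureTable output format may differ.

```python
# The 20 standard amino acids (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(sequence, class_marker=None):
    """Build one feature row: sequence length plus percent composition."""
    seq = sequence.upper()
    length = len(seq)
    row = {"length": length}
    for aa in AMINO_ACIDS:
        # Percent of the sequence made up of this residue (0.0 if empty).
        row[aa] = 100.0 * seq.count(aa) / length if length else 0.0
    # Binary tables carry a class marker; predict tables leave it out.
    if class_marker is not None:
        row["class"] = class_marker
    return row


binary_row = sequence_features("MKKA", class_marker=1)
predict_row = sequence_features("MKKA")
print(binary_row["length"], binary_row["K"], "class" in predict_row)
```

Passing `class_marker=1` or `class_marker=0` mimics a binary table row, while omitting it mimics a predict table row.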
Run the following on the command line, with the feature table(s) in the current working directory:
trainWoolf [-h] [-k | -f] [-n NNEIGHBORS] [-t NTREES] [-l MINLEAFSIZE]
[-s FEATURESCALER] [-c CROSSVALIDATIONFOLDS]
[-a ACCURACYMETRIC] [-p PREDICTFEATURETABLE] [-e] [-v]
featureTable
Options:
-h, --help Show help message and exit.
-k, --kNN Select a kNN algorithm for training.
-f, --randomForest Select a random forest algorithm for training.
-n, --nNeighbors NNEIGHBORS Number of neighbors for the kNN classifier. Ranges are expressed as low-hi,jump (1-7,2 would test 1, 3, 5 and 7).
-t, --nTrees NTREES Number of trees for the random forest classifier. Ranges are expressed as low-hi,jump (1-7,2 would test 1, 3, 5 and 7).
-l, --minLeafSize MINLEAFSIZE Minimum size of leaves in each tree of the random forest classifier. Ranges are expressed as low-hi,jump (1-7,2 would test 1, 3, 5 and 7).
-s, --featureScaler FEATURESCALER A scikit-learn scaler object used to scale the input features.
-c, --crossValidationFolds CROSSVALIDATIONFOLDS The number of cross validation folds to execute.
-a, --accuracyMetric ACCURACYMETRIC A scikit-learn accuracy metric for training.
-p, --predictFeatureTable PREDICTFEATURETABLE An unclassified feature table to be predicted by the model.
-e, --listErrors Include to see a list of which sequences in the training dataset were misclassified.
-v, --verbose Include to get more detailed output.
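The low-hi,jump range notation used by the nNeighbors, nTrees and minLeafSize options can be read as in the sketch below. The helper `parse_range` is hypothetical and is not taken from the Woolf source. It assumes that the upper bound is inclusive (as in the 1-7,2 example), that a bare integer means a single value, and that a missing jump defaults to 1.

```python
def parse_range(text):
    """Expand 'low-hi,jump' into the list of values it describes."""
    # Assumption: a bare integer such as "5" means a single value.
    if "-" not in text:
        return [int(text)]
    # Split "1-7,2" into the bounds "1-7" and the jump "2".
    bounds, _, jump = text.partition(",")
    low, high = (int(part) for part in bounds.split("-"))
    # Assumption: the jump defaults to 1 when it is omitted.
    step = int(jump) if jump else 1
    # The upper bound is inclusive, matching the 1-7,2 example.
    return list(range(low, high + 1, step))


neighbors_to_test = parse_range("1-7,2")
print(neighbors_to_test)
```

Each value in the resulting list would then be one candidate setting to evaluate during training.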
These presentations were given either to the VKC Lab during lab meetings or to my thesis committee. They contain background information and document my progress.
- 12/11/18 Thesis Committee Meeting: an overview of my fall semester work
- 12/3/18 Machine Algorithm Research: algorithms under consideration presented in lab meeting
- 11/5/18 Biological Background and Initial Project Plan: the biological introduction to the project
- 10/2/18 Initial Project Proposals: the initial ideas for this project and other potential theses
See references.txt