Skip to content

afarrellsherman/Woolf

Repository files navigation

The Woolf Classifier Building Pipeline: A Tool for Protein Differentiation through Machine Learning

This is a machine learning tool to classify the function of proteins from obscure datasets.

Overview

The pipeline takes two lists of genes with some distinguishing characteristic, uses machine learning to categorize the difference between them, and then parses through a new unidentified gene set making predictions about which category genes belong to. The tool can identify patterns in the data too complicated for humans to parse, and use them to identify potential functions in unidentified genes.

Algorithms

The Woolf pipeline uses sci-kit learn to provide the following machine learning alrogithms:

  • Random forest trees
  • k nearest neighboor (kNN)

Getting Started

For a complete tutorial please see the User Guide.

Dependancies

These dependancies should be installed automatically if they are not already present:

  • pandas
  • argparse
  • biopython
  • scikit-learn
  • numpy

Installation

$ pip install woolf

Basic Usege

Creating a Feature Table

Run the following on the comand line, with both FASTA files in the current working directory:

featureTable [-h] [-c COMPARISONFILENAME] [-f FOLDER]
                             [-b | -t] [-p POSFASTA [POSFASTA ...]]
                             [-n NEGFASTA [NEGFASTA ...]]

Options:

-h, --help Show help message and exit.

-c, --comparisonFileName COMPARISONFILENAME An identifying tag for all output files.

-f, --folder FOLDER A folder to contain the output files.

-b, --binary Creates a feature table with binary class markers.

-t, --predict Creates a feature table with no class markers for use in prediction.

-p, --posFasta POSFASTA ... One or more FASTA files containing amino acid sequences belonging to the positive class.

-n, --negFasta NEGFASTA... One or more FASTA files containing amino acid sequences belonging to the negative class.

-u, --unknownFasta UNKNOWNFASTA... One or more FASTA files containing amino acid sequences of unknown function.

Note that posFasta and negFasta should contain genes from a similar class with one distinct functional difference.

The output file outputFile will contain a feature table with percent composition amino acids and length as features. Binary tables will also contain a binary class marker, predict tables will not.

Training a Woolf Classifier:

Run the following on the comand line, with the feature table/tables in the current working directory:

trainWoolf [-h] [-k | -f] [-n NNEIGHBORS] [-t NTREES] [-l MINLEAFSIZE]
                  [-s FEATURESCALER] [-c CROSSVALIDATIONFOLDS]
                  [-a ACCURACYMETRIC] [-p PREDICTFEATURETABLE] [-e] [-v]
                  featureTable

Options:

-h, --help Show help message and exit.

-k, --kNN Select a kNN algorithm for training.

-f, --randomForest Select a random forest algorithm for training.

-n, --nNeighbors NNEIGHBORS Number of neighboors for kNN classifier. Ranges are expresed as low-hi,jump (1-7,2 would test 1,3,5 and 7).

-t, --nTrees NTREES Number of trees for random forest classifier. Ranges are expresed as low-hi,jump (1-7,2 would test 1,3,5 and 7).

-l, --minLeafSize MINLEAFSIZE Minimum size of leaves in each tree of the random forest classifier. Ranges are expresed as low-hi,jump (1-7,2 would test 1,3,5 and 7).

-s, --featureScaler FEATURESCALER A scikit learn scaler object to scale in the input features.

-c, --crossValidationFolds CROSSVALIDATIONFOLDS The number of cross validation folds to execute.

-a, --accuracyMetric ACCURACYMETRIC A scikit learn accuracy metric for training.

-p, --predictFeatureTable PREDICTFEATURETABLE A unclassified feature table to be predicted by the model.

-e, --listErrors Include to see a list of which sequences in the training dataset were missclassified.

-v, --verbose Inlcude to get more detailed output.

Lab presentations and Thesis Committee Presentations

These presentations were given to either the VKC Lab durring lab meetings, or to my thesis committee. They contain background information, and document my working progress.

References

See references.txt

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages