Machine learning analysis of BglB data set

Predicting effects of mutations on enzyme function, stability, and structure using molecular modeling and machine learning tools

Molecular modeling protocols

Each molecular modeling protocol lives in a self-contained directory that holds all the scripts required to run it. Currently, most (but not all) protocols have the following files:

  1. a make_list.py preprocessing script that creates the input files for parallelization
  2. a submit script sub.sh to submit the parallel runs to SLURM
  3. two directories, logs and out, where the output goes
  4. a data_processing.py script to transform the output of the protocol into a feature set

Each protocol has its own specialized version of each of these scripts, to account for all the different kinds of output the protocol produces. Details of each protocol are below.
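
As a rough illustration of the preprocessing step, a make_list.py-style script might enumerate every single point mutant and write one label per line, so that each parallel task can pick up its own line. This is a minimal sketch, not the repository's script: the function names, the one-label-per-line format, and the toy sequence are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def enumerate_point_mutants(sequence):
    """Return labels such as 'M1A' for every single point mutant of `sequence`.

    Each label is the native residue, the 1-based position, then the new residue.
    """
    mutants = []
    for position, native in enumerate(sequence, start=1):
        for new in AMINO_ACIDS:
            if new != native:  # skip the wild-type residue at this position
                mutants.append(f"{native}{position}{new}")
    return mutants


def write_list(mutants, path):
    """Write one mutant label per line (hypothetical input format for the parallel runs)."""
    with open(path, "w") as handle:
        for label in mutants:
            handle.write(label + "\n")


if __name__ == "__main__":
    toy_sequence = "MSE"  # hypothetical three-residue sequence, for illustration only
    labels = enumerate_point_mutants(toy_sequence)
    print(len(labels), labels[:3])
```

Each position contributes 19 substitutions, so the length of the list grows linearly with the length of the sequence.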

Rosetta protocols

Benchmark modeling set

  • Files: rosetta_runs/benchmark

  • This feature set contains 45 features from Rosetta's enzyme design protocols, computed with the Talaris 2014 score function. One hundred structures for each single point mutant are created using the MutateResidue mover, repacked and minimized by the EnzRepackMinimize mover (10 Monte Carlo trials to minimize total system energy), and scored using -jd2:enzdes_out. Features from the 10 lowest-scoring structures for each mutant are averaged.

  • Run time on Cabernet: about 6 days for all 9000 possible point mutants
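
The averaging step in the benchmark protocol can be sketched in plain Python. This is only an illustration under stated assumptions: it ranks structures by a column named total_score, and the names mutant, total_score, and average_lowest are hypothetical. The repository's data_processing.py may select and average structures differently.

```python
from collections import defaultdict
from statistics import mean


def average_lowest(rows, n_keep=10, score_key="total_score", id_key="mutant"):
    """Average every numeric feature over the n_keep lowest-scoring structures per mutant.

    rows: iterable of dicts, one dict per modeled structure.
    Returns a dict mapping each mutant label to a dict of averaged features.
    """
    # Group the per-structure rows by mutant label.
    by_mutant = defaultdict(list)
    for row in rows:
        by_mutant[row[id_key]].append(row)

    averaged = {}
    for mutant, structures in by_mutant.items():
        # Keep the n_keep structures with the lowest score (assumed ranking criterion).
        best = sorted(structures, key=lambda r: r[score_key])[:n_keep]
        feature_names = [key for key in best[0] if key != id_key]
        averaged[mutant] = {key: mean(r[key] for r in best) for key in feature_names}
    return averaged


if __name__ == "__main__":
    # Toy values for illustration only; these are not real Rosetta output.
    toy_rows = [
        {"mutant": "M1A", "total_score": -3.0, "feature_x": 1.0},
        {"mutant": "M1A", "total_score": -1.0, "feature_x": 5.0},
        {"mutant": "M1A", "total_score": -2.0, "feature_x": 3.0},
    ]
    print(average_lowest(toy_rows, n_keep=2))
```

With n_keep=10 and one hundred structures per mutant, this reduces each mutant to a single row of averaged features.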

Shells (shells)

  • Files: rosetta_runs/benchmark

  • Same as the benchmark protocol, except that constraint energy is optimized by Monte Carlo, and there is an additional constraint optimization step in which the protein is mutated to all alanine and the constraint energy is minimized.

  • Run time on Cabernet: about 6 days for all 9000 possible point mutants

Reduced sampling protocol (new_protocol)

  • Files: rosetta_runs/new_protocol

  • This feature set contains 45 features from Rosetta's enzyme design protocols. Ten structures for each single point mutant are created using the MutateResidue mover, repacked and minimized by the EnzRepackMinimize mover (10 Monte Carlo trials to minimize total system energy), and scored using -jd2:enzdes_out.

  • Run time on Cabernet: about 8 hours for all 9000 possible point mutants

FoldX protocols

Position-specific scoring matrix (PSSM) feature set

  • Files: foldx_runs/pssm

  • This feature set contains 13 features output by the position-specific scoring matrix (PSSM) function of FoldX

  • Run time on Cabernet: about 24 hours for all possible point mutants

Machine learning

Prediction of kinetic constants and thermal stability (regression)

Prediction of protein expression in Escherichia coli (classification)

  • SVM classifiers for prediction of protein expression from the feature sets above (see machine_learning/protein_expression)