In brief, this software pipeline will build comparative models of the members of an enzyme family, choose low-energy models, dock a transition state structure appropriate for the enzyme mechanism, model each point mutation to the protein, and generate features for each mutation. This pipeline is designed to handle thousands of input sequences and run on Cabernet in a few days. The requirements:
- `phmmer` (HMMER)
- `promals` (Promals3D)
- a Python 2 environment
- Rosetta
- PyRosetta

Most of the glue code is written in Bash and Python.
The glycosyl hydrolase 1 family consists of over 11,000 sequences found by searching genomic databases using Pfam's HMM. To choose which of these we are able to model, we first inspect the alignment for the known catalytic residues in this family (information which Pfam does not use). We select only proteins that have all three known catalytic residues in the alignment. This code is located in `alignment`, with the chosen sequences in `target.fa`.
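The selection step can be sketched roughly as follows. This is a minimal illustration and not the code in `alignment`: the function names, the column indices, and the expected residues are all hypothetical, and the alignment is assumed to be in aligned FASTA format.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def has_catalytic_residues(aligned_seq, expected):
    """True if every alignment column in `expected` holds an allowed residue.

    `expected` maps a 0-based alignment column to a string of allowed
    one-letter residue codes (hypothetical values in this sketch).
    """
    return all(
        col < len(aligned_seq) and aligned_seq[col] in allowed
        for col, allowed in expected.items()
    )


def select_targets(alignment, expected):
    """Keep sequences with all catalytic residues and strip gap characters."""
    return {
        name: seq.replace("-", "").replace(".", "")
        for name, seq in alignment.items()
        if has_catalytic_residues(seq, expected)
    }


# Toy alignment; columns 1, 4, and 6 are made-up catalytic positions.
toy = """>seqA
MEA-YTE
>seqB
MQA-YTE
>seqC
ME--F-E
""".splitlines()

expected = {1: "E", 4: "Y", 6: "E"}
targets = select_targets(read_fasta(toy), expected)
```

In this toy input only `seqA` carries the expected residue in all three columns, so it is the only sequence kept, with its gaps removed.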
This family (TIM-barrel) is very well represented in structure databases. We first search each sequence against the PDB and select 10 template structures based on coverage and identity. Each target sequence is then computationally folded into a 3D structure using Rosetta's Hybridize protocol.
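One way to picture the template selection is the sketch below. The text above says only that templates are chosen on coverage and identity, so the ranking rule used here (the product of the two), the `Hit` record, and the example values are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record for one search hit against the PDB.
Hit = namedtuple("Hit", ["pdb_id", "coverage", "identity"])


def pick_templates(hits, n=10, min_coverage=0.0):
    """Return up to `n` hits, best first, ranked by coverage * identity.

    The product is an assumed ranking rule, not necessarily the pipeline's.
    """
    eligible = [h for h in hits if h.coverage >= min_coverage]
    ranked = sorted(eligible, key=lambda h: h.coverage * h.identity, reverse=True)
    return ranked[:n]


# Made-up hits; coverage and identity are fractions between 0 and 1.
hits = [
    Hit("tmplA", 0.95, 0.40),
    Hit("tmplB", 0.60, 0.90),
    Hit("tmplC", 0.98, 0.50),
    Hit("tmplD", 0.30, 0.99),
]
best = pick_templates(hits, n=2, min_coverage=0.5)
```

A coverage floor such as `min_coverage` keeps a short, nearly identical fragment (like `tmplD` here) from outranking a template that spans most of the target.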
Next, the target substrate or substrates are created, parameterized for Rosetta, and then docked into the active site. In order to do this, we infer the identity of the catalytic residues based on a multiple sequence alignment. A defined set of distance, angle, and dihedral restraints derived from quantum mechanical modeling of the enzyme reaction is used to place the modeled transition state structure in the binding pocket.
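To make the three kinds of restraint concrete, the sketch below measures a distance, an angle, and a dihedral from Cartesian coordinates and reports how far each falls outside a target window. It is plain Python rather than Rosetta's own restraint machinery, and the coordinates, targets, and tolerances are invented for the example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def distance(p1, p2):
    """Distance between two points, in the units of the input coordinates."""
    return _norm(_sub(p1, p2))


def angle(p1, p2, p3):
    """Angle at p2 formed by p1-p2-p3, in degrees."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral(p1, p2, p3, p4):
    """Signed torsion about the p2-p3 bond, in degrees."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.degrees(math.atan2(_norm(b2) * _dot(b1, n2), _dot(n1, n2)))


def violation(measured, target, tolerance, periodic=False):
    """Amount by which a value lies outside target +/- tolerance (0 if inside)."""
    delta = measured - target
    if periodic:  # wrap dihedral differences into [-180, 180)
        delta = (delta + 180.0) % 360.0 - 180.0
    return max(0.0, abs(delta) - tolerance)


# Made-up atom positions and restraint targets.
a, b, c, d = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
checks = [
    violation(distance(a, b), target=1.0, tolerance=0.1),
    violation(angle(a, b, c), target=109.5, tolerance=10.0),
    violation(dihedral(a, b, c, d), target=180.0, tolerance=30.0, periodic=True),
]
```

A placement that satisfies every restraint gives zero for all entries in `checks`; nonzero entries show which geometric term is pulling the transition state away from the modeled geometry.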
Once we have a complete model of the enzyme with transition state structure, we can model all possible point mutations to the enzyme-substrate complex (including different chemical groups on the substrate) to predict function. For now, we perform a computational deep mutational scan and assess structural features for the mutated structures.
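The size of a deep mutational scan is easy to see by enumerating the substitutions. The sketch below lists every single amino acid change for a sequence; the function names and the three-residue example sequence are hypothetical, and the actual modeling of each mutant in Rosetta is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def point_mutations(sequence):
    """Yield (wild_type, position, mutant) for every single substitution.

    Positions are 1-based, matching the usual mutation notation.
    """
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield (wild_type, position, mutant)


def label(mutation):
    """Format a mutation tuple in the conventional form, such as E164A."""
    wild_type, position, mutant = mutation
    return "{}{}{}".format(wild_type, position, mutant)


# A made-up three-residue sequence gives 3 * 19 substitutions.
scan = list(point_mutations("MEY"))
labels = [label(m) for m in scan]
```

For a sequence of standard residues with length N, the scan contains 19 × N mutants, so each structure contributes thousands of models to the feature-generation step.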
Using our training data set, which consists of quantitatively determined enzyme kinetic constants for about 200 point mutants of BglB (a member of GH1), we are able to train a machine learning model that uses the structural features generated by our molecular modeling to predict the functional effect of mutations across the enzyme family.