A fast method for predicting the stability change upon mutation from a 3D structure. Predicting protein stability changes upon mutation using a simple orientational potential. I. Martín-Hernández, Y. Dehouck, U. Bastolla, J.R. López-Blanco and P. Chacón. 2023. Bioinformatics, 39, 1, 2023
Git users: Be sure that LFS is installed to support long files, otherwise download the repo as a zip file.
The size of korp6Dv1.bin should be 331.8MB
The usage is very simple:
sbg/bin/korpm input.txt --dir Ssym --score_file pot/korp6Dv1.bin -o out.txt
it only requires: 1) a two-column input file specifying both the PDB file and mutation, i.e. 1BNI IA76A, and 2) the path where the PDB files are located (--dir) and the KORP potential file (--score_file). The results are also stored in the out.txt (-o option) file.
more out.txt
1BNI IA76A -1.396
1EY0 TA44V 0.108
1IHB FA82Q -0.727
The mutation columns stand for: the 1st letter is the wild-type amino acid, 2nd is the chain ID, the digits correspond to the PDB residue position, and the last letter is the mutated amino acid. We follow the standard convention ΔΔG >= 0 (positives) is stabilizing and ΔΔG < 0 (negatives) is destabilizing.
The complete source C++ code of KORPM is under sbg directory. Just execute compile_korpm.sh script.
We extracted from Thermomut and ProThermDB unique mutations trying to avoid entries that potentially interact with ligands or belong to protein-protein interfaces, and removing entries measured at extreme temperature or pH conditions. The initial curated database data comprise 3824 mutations from 139 protein families (seq. identity <25%) with an average of ΔΔG -1.0 Kcal/mol and a standard deviation of 1.6 Kcal/mol. In total, 73% are destabilizing (ΔΔG>0) and 27% are stabilizing (ΔΔG<0). By removing mainly alanines' destabilizing mutations, we obtain a more balanced subset that includes 2371 mutations from 129 protein families, 58% destabilizing and 42% stabilizing with an average of ΔΔG -0.7 Kcal/mol and a standard deviation of 1.6 Kcal/mol. This subset, named Id25c03_1merNCLB.txt, was used for extract training and validation datasets for k-fold cross-validation experiments. Note that this subset is far from being perfectly balanced, e.g., the most frequent amino acid involved in the mutation still is alanine, and cysteines, tryptophans, and, prolines still are underpopulated. This dataset does not include any of the mutations of the Ssym test set. There are two additional subsets excluding all the proteins presenting at least 25% of the sequence of identity with either Ssym dataset Id25c03_1merNCLB_Ssym.txt or S641 dataset Id25c03_1merNCLB_S461.txt. In our case, these different datasets were employed to test the absence of significant overfitting in our simple potential.
Id25c03_1merNCL.txt | Id25c03_1merNCLB.txt |
In the directory Db you can find all the corresponding PDB files.
Ssym is a data set with an equal number of stabilizing and destabilizing mutations compiled by Pucci et al. (https://doi.org/10.1093/bioinformatics/bty348) for which the structure of both the wild-type and mutant protein are available.
sbg/bin/korpm Ssym.txt --dexp --dir Ssym --score_file pot/korp6Dv1.bin -o Ssym_all.txt
Where Ssym.txt is the mutations input file and the Ssym directory is where the input PDB files are stored. Since this input contains the experimental ΔΔG (see Appendix for small corrections) you can cross-check the predictions by:
scripts/Mstat.pl Ssym_all.txt 10 11 2
# Current Positives|Negatives threshold (thr) is 0 (ddG >= 0 are not-destabilizing [positives] and ddG < 0 are destabilizing [negatives]).
aa S D T TP avg err FP TN avg err FN P N SEN SPE PPV NPV ACC accn RMSE MAE PCC Sc Ob1 Ob2 MCC
X 342 342 684 264 1.5 0.9 71 271 -1.5 0.8 78 335 349 0.772 0.792 0.788 0.777 0.782 0.782 1.331 0.969 0.695 64.6 34.6 0.7 0.56
A 97 97 194 80 1.9 1.0 15 82 -1.9 0.9 17 95 99 0.825 0.845 0.842 0.828 0.835 0.835 1.525 1.089 0.744 67.0 32.0 1.0 0.67
V 106 106 212 82 1.2 0.8 24 82 -1.2 0.8 24 106 106 0.774 0.774 0.774 0.774 0.774 0.774 1.187 0.871 0.691 64.6 35.4 0.0 0.55
I 68 68 136 57 1.6 0.8 11 57 -1.6 0.7 11 68 68 0.838 0.838 0.838 0.838 0.838 0.838 1.128 0.890 0.810 66.9 31.6 1.5 0.68
L 41 41 82 30 1.7 1.0 12 29 -1.7 1.0 11 42 40 0.732 0.707 0.714 0.725 0.720 0.720 1.507 1.123 0.638 62.2 37.8 0.0 0.44
.....etc
sbg/bin/korpm SsymD.txt --dexp --dir Ssym --score_file pot/korp6Dv1.bin -o Ssym_dir.txt
sbg/bin/korpm SsymR.txt --dexp --dir Ssym --score_file pot/korp6Dv1.bin -o Ssym_rev.txt
paste Ssym_dir.txt Ssym_rev.txt > temp
awk 'function abs(x){return (x < 0) ? -x : x;} {printf "%s %s %s %s %s %s %s %f %f %s %s\n",$1,$19, $2, $20, $10, $11,$29, ($11+$29), abs(($11+$29)), $3, $4 }' temp > KORPM_Ssym.txt
you can see the results in your favourite plot, for example in gnuplot:
plot "KORPM_Ssym.txt" u 6:7 stat "KORPM_Ssym.txt" u 6:7 ... Linear Model: y = -0.8207 x + 0.03672 Slope: -0.8207 +- 0.02468 Intercept: 0.03672 +- 0.03807 Correlation: r = -0.8745 ... |
Here you can find some comparative results with state-of-the-art stability prediction programs:
METHOD | RMSE | MAE | PCC | Sc | Ob1 | Ob2 | TPR | TNR | PPV | NPV | ACC | MCC | AROC | APRC |
KORPM | 1.33 | 0.97 | 0.69 | 64.6 | 34.6 | 0.7 | 0.77 | 0.79 | 0.79 | 0.78 | 0.78 | 0.56 | 0.86 | 0.86 |
ACDCNN | 1.38 | 1.01 | 0.69 | 61.5 | 38.1 | 0.0 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.40 | 0.80 | 0.80 |
FoldX | 1.86 | 1.29 | 0.54 | 60.1 | 34.5 | 5.4 | 0.55 | 0.78 | 0.71 | 0.63 | 0.66 | 0.33 | 0.74 | 0.75 |
EvoFF | 1.56 | 1.12 | 0.54 | 61.7 | 34.9 | 3.4 | 0.61 | 0.70 | 0.67 | 0.64 | 0.66 | 0.31 | 0.74 | 0.75 |
PopMusic-S | 1.58 | 1.15 | 0.52 | 56.6 | 42.4 | 1.0 | 0.67 | 0.71 | 0.70 | 0.68 | 0.69 | 0.38 | 0.76 | 0.74 |
Dynamut | 1.88 | 1.37 | 0.38 | 54.4 | 38.2 | 7.5 | 0.21 | 0.88 | 0.64 | 0.53 | 0.55 | 0.13 | 0.62 | 0.62 |
DDGun3D | 1.43 | 1.04 | 0.63 | 61.8 | 37.4 | 0.7 | 0.68 | 0.69 | 0.69 | 0.69 | 0.69 | 0.37 | 0.75 | 0.76 |
ThermoNet | 1.53 | 1.09 | 0.55 | 58.2 | 40.9 | 0.9 | 0.65 | 0.70 | 0.69 | 0.67 | 0.68 | 0.35 | 0.75 | 0.74 |
Cartddg | 3.44 | 2.63 | 0.63 | 52.3 | 41.1 | 6.6 | 0.58 | 0.87 | 0.82 | 0.67 | 0.73 | 0.47 | 0.81 | 0.82 |
you can find complete results in Ssym_results
you can find complete results in S461_results
Corrections of original Ssym dataset based on ThermoMutDB data.
PDB | Mutation | Original | Corrected | Medline References from ThermoMutDB |
1BNI | IA96V | -3.1 | -0.9 | 2669964 (-0.90); 1569557 (0.95); 9551101 (-0.80) |
1BNI | SA91A | -2.4 | -1.8 | 14516751 |
1L63 | SA44T | 0.0 | 0.01 | 8289284 |
1L63 | SA38N | 0.0 | -0.01 | 1911773 |
1L63 | LA91A | -3.9 | -2.6 | 10545167 |
1L63 | AA130S | 1.0 | -1.0 | 8218201 |
1LZ1 | VA2G | -1.3 | -2.29 | 11087397 |
1LZ1 | VA2L | 0.3 | -0.05 | 11087397 |
1LZ1 | IA56T | -4.3 | -3.6 | 9010773; 10556244 |
1LZ1 | VA74I | -1.9 | 0.45 | 11087397 |
1LZ1 | VA74L | -0.4 | 0.19 | 11087397 |
1LZ1 | VA74M | -0.4 | 0.65 | 11927576; 11087397 |
1LZ1 | VA110G | -2.2 | 0.48 | 11927576; 11087397 |
1LZ1 | VA110I | -0.8 | 0.86 | 11927576; 11087397 |
1LZ1 | VA110F | -1.9 | -0.05 | 11927576; 11087397 |
2LZM | IA3C | 0.0 | 1.2 | 3405287 |
2LZM | RA119E | 0.0 | -0.04 | 1942034 |
4LYZ | GA49A | -0.7 | -1.9 | 11112507; 8771183 |
4LYZ | GA71A | -2.1 | -0.38 | 11112507 |
4LYZ | GA102A | -1.2 | 0.02 | 11112507 |
4LYZ | GA117A | -0.8 | -1.46 | 11112507 |
1RN1 | QC25K | 1.4 | 0.93 | 2663837 |
2LZM | RA96K | 0.0 | -0.001 | Avoiding zero for the binary classification |
1L63 | SA44E | 0.0 | 0.001 | Avoiding zero for the binary classification |
2LZM | KA60P | 0.0 | -0.001 | Avoiding zero for the binary classification |
This work was supported by Spanish grants PID2019-109041GB-C21/AEI/10.13039/501100011033