This paper proposes GBDT_Kglu, a new prediction model for lysine glutarylation (Kglu) sites. It uses seven feature encoding methods to convert protein sequences into numerical features: BE, BLOSUM62, EAAC, CTDC, PSSM, CKSAAP, and secondary structure information. The NearMiss-3 method is then used to handle the class imbalance in the data set, and Elastic Net is used to filter redundant information from the features. Finally, a GBDT-based prediction model for identifying Kglu sites is built.
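The last two stages of this pipeline (Elastic Net feature filtering followed by a GBDT classifier) can be sketched with scikit-learn as follows. This is only an illustration on synthetic data, not the code shipped in this repository: the hyperparameters (alpha, l1_ratio, the selection threshold) and the synthetic feature matrix are assumptions, and the NearMiss-3 undersampling step is left out because it would need the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fused feature matrix (assumption: real features
# would come from the seven encodings described above).
X, y = make_classification(n_samples=400, n_features=60,
                           n_informative=8, random_state=0)

# Elastic Net shrinks the coefficients of uninformative features to zero;
# SelectFromModel keeps the features whose |coefficient| exceeds the threshold.
selector = SelectFromModel(
    ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10000),  # assumed values
    threshold=1e-5,
)

# Scale, filter features, then fit the gradient boosting classifier.
model = make_pipeline(StandardScaler(), selector,
                      GradientBoostingClassifier(random_state=0))
model.fit(X, y)

n_kept = int(model.named_steps["selectfrommodel"].get_support().sum())
proba = model.predict_proba(X[:5])[:, 1]  # probability of the positive class
print(f"features kept: {n_kept} of {X.shape[1]}")
```

Putting the selector inside the pipeline means the same feature subset is applied automatically at prediction time.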
Backend = TensorFlow (1.14.0)
Keras (2.3.1)
NumPy (1.20.2)
scikit-learn (1.0.2)
pandas (1.3.5)
matplotlib (3.5.2)
The DataSet folder contains the original data before it was split into training and test sets: 707 positive samples and 4369 negative samples, each a sequence of length 33, where X denotes a virtual amino acid. Glutarylation.csv is the original dataset; Glutarylation208.csv is obtained from it by removing redundant data with CD-HIT and contains 208 proteins in total. The Train folder contains all training data, and the Test folder contains all independent test data.
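For illustration, samples of length 33 like the ones described above could be cut from a full protein sequence as in the sketch below. It assumes that each sample is centered on a candidate lysine (K) and that X fills positions beyond the sequence ends; the function name extract_windows is hypothetical and is not part of this repository.

```python
def extract_windows(sequence, window=33, pad="X"):
    """Return (1-based position, peptide) for every lysine (K) in a sequence."""
    half = window // 2
    # Pad both ends so that lysines near the termini still get a full window.
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + half of the padded string, so the
            # slice [i, i + window) is centered on it.
            windows.append((i + 1, padded[i:i + window]))
    return windows


samples = extract_windows("MKTAYK")
for position, peptide in samples:
    print(position, peptide)
```

Each returned peptide has exactly 33 characters, with the candidate lysine at the 17th position.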
The GBDT_KgluSite model uses seven features. Two of them are generated by one_hot.py and CKSAAP.py, the PSSM feature is generated by PSI-BLAST, and the rest are obtained with iLearnPlus.
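As a rough idea of what the two script-generated features look like, the sketch below implements a binary (one-hot) encoding and a CKSAAP encoding in plain Python. It is not the code in one_hot.py or CKSAAP.py: the 20-letter alphabet, the treatment of X (an all-zero vector in the binary encoding, skipped in CKSAAP), the maximum gap of 5, and the function names are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def binary_encode(peptide):
    """One-hot encode each residue; unknown residues such as X become all zeros."""
    vector = []
    for residue in peptide:
        row = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            row[index] = 1
        vector.extend(row)
    return vector


def cksaap_encode(peptide, max_gap=5):
    """Frequencies of residue pairs separated by 0..max_gap other residues."""
    features = []
    for gap in range(max_gap + 1):
        counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
        total = 0
        for i in range(len(peptide) - gap - 1):
            pair = peptide[i] + peptide[i + gap + 1]
            if pair in counts:  # pairs containing X are skipped
                counts[pair] += 1
                total += 1
        # Normalize by the number of counted pairs for this gap.
        features.extend(counts[p] / total if total else 0.0 for p in counts)
    return features


be = binary_encode("A" * 33)
ck = cksaap_encode("AKA", max_gap=1)
print(len(be), len(ck))
```

With these assumptions a 33-residue sample yields 33 × 20 = 660 binary features, and CKSAAP yields 400 features per gap value.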
GBDT_KgluSite.py can be used directly to predict glutarylation sites once the pretrained model GBDT_KgluSite.pickle is loaded.
Feel free to contact us if you need any help: flyinsky6@gmail.com