dfdt2/BioinformaticsProject

Classification of contacts in protein structures

This software uses machine learning algorithms to classify residue-residue contacts within protein structures into categories as defined by the RING software (HBOND, VDW, SBOND, etc.) using structural and physicochemical features.

Project structure

BioinformaticsProject-main/
├── 3di_model/             # FoldSeek model files
├── config/                # Configuration files
├── data/                  # Input data and PDB structures
├── models/                # Trained models
├── output_3di/            # 3Di encoding output
├── output_features/       # Structural feature output
├── report/                # Figures and plots for analysis
├── scripts/               # Main scripts for feature extraction, training, evaluation
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation

Description of features and data

Features used

Each contact is described using:

s_ss8 : Secondary structure from DSSP
s_rsa : Relative Solvent Accessibility
s_phi : Phi torsion angle
s_psi : Psi torsion angle
s_a1 : Atchley factor 1 - Hydrophilicity vs. hydrophobicity
s_a2 : Atchley factor 2 - Secondary structure propensity
s_a3 : Atchley factor 3 - Molecular size
s_a4 : Atchley factor 4 - Codon usage / Polarizability
s_a5 : Atchley factor 5 - Electrostatic charge
s_3di_state : 3Di FoldSeek numeric state ID
s_3di_letter : 3Di FoldSeek letter

These are computed for both the source and target residues.
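
As a minimal sketch of how the per-residue features expand into one row per contact, the snippet below builds the full list of column names. The `s_` prefix comes from the feature names above; the `t_` prefix for the target residue is an assumption, since this README does not name the target columns, and `contact_columns` is a hypothetical helper.

```python
# Per-residue feature suffixes, taken from the list above.
RESIDUE_FEATURES = [
    "ss8", "rsa", "phi", "psi",
    "a1", "a2", "a3", "a4", "a5",
    "3di_state", "3di_letter",
]


def contact_columns(source_prefix="s_", target_prefix="t_"):
    """Return feature column names for both residues of a contact.

    The "t_" target prefix is an assumption; adjust it to match your data.
    """
    source = [source_prefix + name for name in RESIDUE_FEATURES]
    target = [target_prefix + name for name in RESIDUE_FEATURES]
    return source + target


columns = contact_columns()
print(len(columns))  # 11 features per residue, two residues per contact
print(columns[:3])
```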

Prediction classes

The following are the supported contact types, matching RING nomenclature:

  • Hydrogen bond (HBOND)
  • Van der Waals (VDW)
  • Disulfide bridge (SBOND)
  • Salt bridge (IONIC)
  • π–π stacking (PIPISTACK)
  • π–cation (PICATION)
  • π–hydrogen bond (PIHBOND)
  • Metal ion coordination (METAL_ION)
  • Halogen bond (HALOGEN)
  • Unclassified
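
Most classifiers expect these labels as integers. The following sketch shows one possible encoding; the integer order, the `encode_label` helper, and the fallback to `Unclassified` for missing or unknown labels are assumptions, not necessarily the encoding used by the notebooks.

```python
# Contact types in the order listed above.
CONTACT_CLASSES = [
    "HBOND", "VDW", "SBOND", "IONIC", "PIPISTACK",
    "PICATION", "PIHBOND", "METAL_ION", "HALOGEN", "Unclassified",
]
LABEL_TO_ID = {label: i for i, label in enumerate(CONTACT_CLASSES)}


def encode_label(label):
    """Map a RING contact label to an integer id.

    Missing or unknown labels fall back to "Unclassified" (an assumption).
    """
    return LABEL_TO_ID.get(label, LABEL_TO_ID["Unclassified"])


print(encode_label("HBOND"))
print(encode_label(None))
```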

Instructions for software setup and usage

1. Install dependencies and DSSP

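To install all Python packages used, run:

pip install -r requirements.txt

DSSP and FoldSeek, which are required for feature extraction, must also be installed and configured on your system.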

2. Configuration

All paths and parameters are stored in config/config.json. Update this file to match your system setup.
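
A minimal sketch of reading the configuration and failing early when an expected entry is absent. The key names `data_dir` and `model_path` are hypothetical placeholders, as this README does not list the actual keys; replace them with the ones in your config/config.json.

```python
import json
from pathlib import Path


def load_config(path="config/config.json", required=("data_dir", "model_path")):
    """Load the JSON config and raise if an expected key is missing.

    The key names in `required` are hypothetical placeholders.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config
```

Calling `load_config()` from the repository root would then return the settings as a dictionary, or raise a `KeyError` naming the missing entries.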

3. Execution

a. Models and model retraining

Model training is performed in the following Jupyter notebook: train_model.ipynb

This notebook performs the following steps:

  1. Loads training data from the path specified in config/config.json
  2. Preprocesses the features
  3. Visualizes feature distributions and correlations
  4. Trains and evaluates multiple models, including:
    • Naive Bayes
    • Random Forest
    • LightGBM
  5. Saves trained models as .pkl files into the models/ directory
  6. Generates evaluation metrics such as:
    • Classification report
    • Confusion matrix
    • Matthews Correlation Coefficient (MCC)
    • Balanced Accuracy
    • Average Precision Score
    • Area Under ROC Curve (AUC)
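
To make two of these metrics concrete, the sketch below computes balanced accuracy (the mean of per-class recall) and the multiclass Matthews Correlation Coefficient from true and predicted labels in plain Python. It illustrates the definitions only; the notebook may well rely on a library such as scikit-learn instead, and the toy labels are invented.

```python
import math
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare contact types weigh as much as common ones."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from per-class counts of true and predicted labels."""
    n = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified samples
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    cov = c * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    denom = math.sqrt(
        (n * n - sum(v * v for v in pred_counts.values()))
        * (n * n - sum(v * v for v in true_counts.values()))
    )
    # By convention, return 0 when one side has a single class only.
    return cov / denom if denom else 0.0


# Toy labels, invented for illustration only.
y_true = ["HBOND", "HBOND", "HBOND", "VDW", "VDW", "IONIC"]
y_pred = ["HBOND", "HBOND", "VDW", "VDW", "VDW", "IONIC"]
print(round(balanced_accuracy(y_true, y_pred), 4))  # 8/9
print(round(multiclass_mcc(y_true, y_pred), 4))     # 17/22
```

Both metrics are less sensitive to class imbalance than plain accuracy, which matters when a few contact types dominate the data.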

To run the full training and evaluation process, open the notebook with this command in your terminal and run all cells:

jupyter notebook train_model.ipynb

b. Model application to a new .pdb structure

Prediction is performed using the notebook: model_prediction.ipynb

This notebook performs the following steps:

  1. Extracts structural and 3Di features from the input PDB file using:
    • scripts/calc_features.py
    • scripts/calc_3di.py (within run_feature_extraction from scripts/runFeatureExtraction.py)
  2. Visualizes extracted features using:
    • scripts/data_visualization.py
  3. Loads a pre-trained model from the path defined in config/config.json (e.g., a Random Forest or Naive Bayes model in the models/ directory)
  4. Applies the model to the newly extracted contact features
  5. Outputs predicted contact types and their classification scores
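
Step 5 can be pictured as follows: given one row of per-class scores per contact, the predicted type is the class with the highest score. The sketch uses plain lists and invented scores, and `top_predictions` is a hypothetical helper; in practice the scores would come from the loaded model (for example through a `predict_proba`-style method), which this README does not specify.

```python
def top_predictions(scores, classes):
    """Return (predicted class, score) for each row of per-class scores."""
    results = []
    for row in scores:
        # Index of the highest score in this row.
        best = max(range(len(classes)), key=lambda i: row[i])
        results.append((classes[best], row[best]))
    return results


classes = ["HBOND", "VDW", "IONIC"]            # subset of the supported types
scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]    # invented scores, one row per contact
for label, score in top_predictions(scores, classes):
    print(label, score)
```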

To apply a model to a new protein structure, open the notebook with this command and run all cells:

jupyter notebook model_prediction.ipynb

Ensure that:

  • The input file exists in the data/pdbs/ directory
  • Paths in config/config.json are set correctly
  • Feature extraction tools (DSSP, FoldSeek) are installed and configured
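
The checks above can be automated before opening the notebook. This is a sketch under stated assumptions: `preflight` is a hypothetical helper, and the executable names `mkdssp` and `foldseek` are guesses at how the tools are installed on your system; adjust them as needed.

```python
import shutil
from pathlib import Path


def preflight(pdb_name, pdb_dir="data/pdbs", config_path="config/config.json",
              tools=("mkdssp", "foldseek")):
    """Return a list of problems; an empty list means the checks passed.

    The executable names in `tools` are assumptions about the local setup.
    """
    problems = []
    pdb_path = Path(pdb_dir) / pdb_name
    if not pdb_path.is_file():
        problems.append(f"input file not found: {pdb_path}")
    if not Path(config_path).is_file():
        problems.append(f"config not found: {config_path}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"executable not on PATH: {tool}")
    return problems
```

For example, `preflight("example.pdb")` (with a hypothetical file name) returns one message per failed check, so the list can be printed before running the prediction notebook.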

Authors

Christina Caporale
Natalya Lavrenchuk
