# Structure checking tutorial

A complete checking analysis of a single structure follows.  
Use .revert_changes() at any time to recover the original structure.

Structure checking is a key step before setting up a protein system for simulations. 
A number of common issues found in structures from the Protein Data Bank may compromise the success of the simulation, or may suggest that longer equilibration procedures are necessary.

The biobb_structure_checking modules allow you to:
- Do basic manipulations on structures (selection of models, chains, alternative locations)
- Detect and fix amide assignments and wrong chirality
- Detect and fix protein backbone issues (missing fragments and atoms, capping)
- Detect and fix missing side-chain atoms
- Add hydrogen atoms according to several criteria
- Detect and classify clashes
- Detect possible SS bonds

biobb_structure_checking modules can also be used from the command line via biobb_structure_checking/bin/check_structure


In [1]:
%load_ext autoreload
%autoreload 2

## Installation

#### Basic imports and initialization

In [2]:
import biobb_structure_checking as bsch
from biobb_structure_checking.structure_checking import StructureChecking
from biobb_structure_checking.constants import help, set_defaults
base_dir_path = bsch.__path__[0]
args = set_defaults(base_dir_path)

## General help

In [3]:
help()


BioBB's check_structure.py performs MDWeb structure checking set as a command line
utility.

commands:     Help on available commands
command_list: Run all tests from conf file or command line list
checkall:     Perform all checks without fixes
load:         Stores structure on local cache and provides basic statistics

1. System Configuration
sequences 
    Print canonical and structure sequences in FASTA format
models [--select model_num]
    Detect/Select Models
chains [--select chain_ids | molecule_type]
    Detect/Select Chains
inscodes 
    Detects residues with insertion codes. No fix provided (yet)
altloc [--select occupancy| alt_id | list of res_id:alt_id]
    Detect/Select Alternative Locations
metals [--remove All | None | Met_ids_list | Residue_list]
    Detect/Remove Metals
ligands [--remove All | None | Res_type_list | Residue_list]
    Detect/Remove Ligands
getss      Detect SS Bonds
    --mark Replace relevant CYS by CYX to mark SS Bond (HG atom removed if present)
wat

Set the input (a PDB entry or a local file, in pdb or mmCIF format) and the output (a local file, in pdb format).  
Use pdb:pdbid to download the structure from the PDB (RCSB).
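
The `pdb:pdbid` convention described above can be illustrated with a small parser. This is only a sketch: `parse_input_spec` is a hypothetical helper written for this tutorial, not a function of biobb_structure_checking, and the library may resolve inputs differently.

```python
def parse_input_spec(spec):
    """Split an input spec into (source, value).

    Hypothetical helper: 'pdb:6axg' is read as a PDB id to download,
    anything else is treated as a path to a local file.
    """
    prefix = "pdb:"
    if spec.lower().startswith(prefix):
        return "pdb", spec[len(prefix):].lower()
    return "file", spec


print(parse_input_spec("pdb:6axg"))
print(parse_input_spec("6axg_fixed.pdb"))
```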

In [4]:
args['input_structure_path'] = 'pdb:6axg'
args['output_structure_path'] = '6axg_fixed.pdb'
args['output_structure_path_pdbqt'] = '6axg_fixed.pdbqt'

Initializing checking engine, loading structure and showing statistics

In [5]:
structure = StructureChecking(base_dir_path,args)

Structure exists: 'tmpPDB/ax/6axg.cif' 
Structure pdb:6axg loaded
 PDB id: 6AXG
 Title: Structure of RasGRP4 in complex with HRas
 Experimental method: X-RAY DIFFRACTION
 Keywords: SIGNALING PROTEIN
 Resolution (A): 3.302

 Num. models: 1
 Num. chains: 12 (A: Protein, B: Protein, C: Protein, D: Protein, E: Protein, F: Protein, G: Protein, H: Protein, I: Protein, J: Protein, K: Protein, L: Protein)
 Num. residues:  2936
 Num. residues with ins. codes:  0
 Num. HETATM residues:  0
 Num. ligands or modified residues:  0
 Num. water mol.:  0
 Num. atoms:  23168



#### models
Checks for the presence of models in the structure. 
MD simulations require a single structure. However, some structures (e.g. biounits) are defined as a series of models, and in that case all of them are usually required.  
Use models('--select N') to select model number N for further analysis.

In [6]:
structure.models()

Running models.
1 Model(s) detected
Single model found


#### chains
Checks for chains (also reported by print_stats) and allows selecting one or more of them.   
MD simulations are usually performed with complete structures. However, the input structure may contain several copies of the system, or additional chains such as peptides or nucleic acids that may be removed. 
Use chains('X,Y') to select chain(s) X and Y and proceed.

In [7]:
structure.chains()

Running chains.
12 Chain(s) detected
 A: Protein
 B: Protein
 C: Protein
 D: Protein
 E: Protein
 F: Protein
 G: Protein
 H: Protein
 I: Protein
 J: Protein
 K: Protein
 L: Protein


6axg has 6 copies in the crystal asymmetric unit. To get a single copy, select only chains A and B.

In [8]:
structure.chains('A,B')

Running chains. Options: A,B
12 Chain(s) detected
 A: Protein
 B: Protein
 C: Protein
 D: Protein
 E: Protein
 F: Protein
 G: Protein
 H: Protein
 I: Protein
 J: Protein
 K: Protein
 L: Protein
Selecting chain(s) A,B
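
What a chain selection such as 'A,B' amounts to can be sketched on plain PDB text, where the chain identifier sits in column 22 of ATOM and HETATM records. The helpers `atom_line` and `select_chains` are hypothetical and only illustrate the idea; the library works on a parsed structure, not on raw lines.

```python
def atom_line(serial, name, res_name, chain, res_seq):
    """Build the first 26 columns of a PDB ATOM record (coordinates omitted)."""
    return f"ATOM  {serial:>5} {name:^4} {res_name:>3} {chain}{res_seq:>4}"


def select_chains(pdb_lines, selection):
    """Keep ATOM/HETATM records whose chain id (column 22) is in the selection."""
    wanted = {c.strip() for c in selection.split(",") if c.strip()}
    kept = []
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM")) and len(line) > 21
        if is_atom and line[21] not in wanted:
            continue  # atom belongs to a chain that was not selected
        kept.append(line)
    return kept


lines = [
    atom_line(1, "CA", "SER", "A", 74),
    atom_line(2, "CA", "ASN", "B", 26),
    atom_line(3, "CA", "GLY", "C", 10),  # made-up record for a discarded chain
]
for line in select_chains(lines, "A,B"):
    print(line)
```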


#### altloc
Checks for the presence of residues with alternative locations. Atoms with alternative coordinates and their occupancy are reported.  
MD simulations require a single position for each atom.  
Use altloc('occupancy | alt_ids | list of res:id') to select the alternative.


In [9]:
structure.altloc()

Running altloc.
No residues with alternative location labels detected
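
6axg has no alternative locations, so the `occupancy` option is not exercised here. As a rough sketch of what selecting by occupancy means, the snippet below keeps, for each atom, the alternative with the highest occupancy. The function name and the atom records are invented for illustration and do not come from the library or from this structure.

```python
def pick_by_occupancy(atoms):
    """Choose one alt_id per atom, preferring the highest occupancy.

    atoms: iterable of (residue_id, atom_name, alt_id, occupancy).
    Ties keep the alternative that appears first.
    """
    best = {}
    for res_id, atom_name, alt_id, occupancy in atoms:
        key = (res_id, atom_name)
        if key not in best or occupancy > best[key][1]:
            best[key] = (alt_id, occupancy)
    return {key: alt_id for key, (alt_id, _) in best.items()}


# Made-up records: two atoms, each with alternatives A and B
atoms = [
    ("A10", "CB", "A", 0.6),
    ("A10", "CB", "B", 0.4),
    ("A25", "OG", "A", 0.3),
    ("A25", "OG", "B", 0.7),
]
print(pick_by_occupancy(atoms))
```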


#### metals
Detects HETATM records that are metal ions and allows selectively removing them.  
To remove them, use metals('All | None | metal_type list | residue list')

In [10]:
structure.metals()

Running metals.
No metal ions found


#### ligands
Detects HETATM residues (excluding water molecules) and allows selectively removing them.  
To remove them, use ligands('All | None | Residue List (by id, by num)')


In [11]:
structure.ligands()

Running ligands.
No ligands found
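
For a structure that does contain ligands, the detection step could look roughly like the sketch below, which counts HETATM atoms per residue name (columns 18-20 of a PDB record) and skips water. The function, the water names and the sample records are assumptions made for illustration; they are not taken from biobb_structure_checking or from 6axg.

```python
from collections import Counter

WATER_NAMES = {"HOH", "WAT"}  # assumed water residue names


def ligand_counts(pdb_lines):
    """Count HETATM atoms per residue name, ignoring water molecules."""
    counts = Counter()
    for line in pdb_lines:
        if line.startswith("HETATM") and len(line) >= 20:
            res_name = line[17:20].strip()
            if res_name not in WATER_NAMES:
                counts[res_name] += 1
    return counts


# Made-up, truncated HETATM records (coordinates omitted)
sample = [
    "HETATM    1  PB  GDP A 201",
    "HETATM    2  O1B GDP A 201",
    "HETATM    4  O   HOH A 301",
]
print(ligand_counts(sample))
```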


#### rem_hydrogen
Detects and removes hydrogen atoms. 
MD setup can be done with the original H atoms; however, removing them is safer because it avoids non-standard labelling.  
To remove them, use rem_hydrogen('yes')


In [12]:
structure.rem_hydrogen()

Running rem_hydrogen.
No residues with Hydrogen atoms found


#### water
Detects water molecules and allows removing them.  
Crystallographic water molecules may be relevant for keeping the structure; however, in most cases only some of them are required. These can be added later using other methods (titration) or manually.

To remove water molecules use water('yes')


In [13]:
structure.water()

Running water.
No water molecules found


#### amide
Amide terminal atoms in Asn and Gln residues can be labelled incorrectly.  
amide suggests possible fixes by checking the surrounding environment.

To fix, use amide('All | None | residue_list')

Note that the inversion of amide atoms may trigger additional contacts. 

In [14]:
structure.amide()

Running amide.
9 unusual contact(s) involving amide atoms found
 LEU A103.O   GLN A107.OE1    2.782 A
 THR A104.O   GLN A107.OE1    2.867 A
 GLU A134.OE2 GLN A223.OE1    3.066 A
 ASP A139.OD1 GLN A141.OE1    2.784 A
 ASN A248.OD1 GLU B63.O       2.760 A
 PRO A262.O   GLN A266.OE1    3.066 A
 GLN A285.OE1 ALA A344.O      2.867 A
 PRO A377.O   ASN A381.OD1    2.976 A
 ALA A391.O   GLN A395.OE1    2.601 A


Fix all amide residues and recheck

In [15]:
structure.amide('all')

Running amide. Options: all
9 unusual contact(s) involving amide atoms found
 LEU A103.O   GLN A107.OE1    2.782 A
 THR A104.O   GLN A107.OE1    2.867 A
 GLU A134.OE2 GLN A223.OE1    3.066 A
 ASP A139.OD1 GLN A141.OE1    2.784 A
 ASN A248.OD1 GLU B63.O       2.760 A
 PRO A262.O   GLN A266.OE1    3.066 A
 GLN A285.OE1 ALA A344.O      2.867 A
 PRO A377.O   ASN A381.OD1    2.976 A
 ALA A391.O   GLN A395.OE1    2.601 A
Amide residues fixed all (8)
Rechecking
4 unusual contact(s) involving amide atoms found
 GLU A134.OE2 GLN A223.OE1    2.730 A
 VAL A244.O   ASN A248.OD1    2.772 A
 ASN A248.ND2 SER B65.N       3.064 A
 ASN A248.OD1 GLU B63.O       3.071 A


Comparing both checks shows that GLN A223 and ASN A248 are now in a worse situation, so they should be changed back to the original labelling.

In [16]:
structure.amide('A223,A248')

Running amide. Options: A223,A248
4 unusual contact(s) involving amide atoms found
 GLU A134.OE2 GLN A223.OE1    2.730 A
 VAL A244.O   ASN A248.OD1    2.772 A
 ASN A248.ND2 SER B65.N       3.064 A
 ASN A248.OD1 GLU B63.O       3.071 A
Amide residues fixed A223,A248 (2)
Rechecking
2 unusual contact(s) involving amide atoms found
 GLU A134.OE2 GLN A223.OE1    3.066 A
 ASN A248.OD1 GLU B63.O       2.760 A
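
The comparison that led to reverting A223 and A248 can be reproduced by parsing the two reports printed above. The sketch below is a simplification: it flags any Asn/Gln residue whose amide atoms still appear in a contact after the fix, whereas a careful decision would also look at the distances. `amide_residues_in_contacts` is a hypothetical helper, not a library function.

```python
AMIDE_ATOMS = {"ASN": {"OD1", "ND2"}, "GLN": {"OE1", "NE2"}}


def amide_residues_in_contacts(report):
    """Return ids of Asn/Gln residues whose amide atoms appear in a report."""
    found = set()
    for line in report.strip().splitlines():
        name1, atom1, name2, atom2, _dist, _unit = line.split()
        for res_name, atom in ((name1, atom1), (name2, atom2)):
            res_id, atom_name = atom.split(".")  # 'A223.OE1' -> 'A223', 'OE1'
            if atom_name in AMIDE_ATOMS.get(res_name, ()):
                found.add(res_id)
    return found


before = """
 LEU A103.O   GLN A107.OE1    2.782 A
 THR A104.O   GLN A107.OE1    2.867 A
 GLU A134.OE2 GLN A223.OE1    3.066 A
 ASP A139.OD1 GLN A141.OE1    2.784 A
 ASN A248.OD1 GLU B63.O       2.760 A
 PRO A262.O   GLN A266.OE1    3.066 A
 GLN A285.OE1 ALA A344.O      2.867 A
 PRO A377.O   ASN A381.OD1    2.976 A
 ALA A391.O   GLN A395.OE1    2.601 A
"""
after_fix_all = """
 GLU A134.OE2 GLN A223.OE1    2.730 A
 VAL A244.O   ASN A248.OD1    2.772 A
 ASN A248.ND2 SER B65.N       3.064 A
 ASN A248.OD1 GLU B63.O       3.071 A
"""

fixed = amide_residues_in_contacts(before)
still_flagged = sorted(fixed & amide_residues_in_contacts(after_fix_all))
print(len(fixed), "residues fixed; candidates to revert:", ",".join(still_flagged))
```

The joined list can be passed back to amide() as a residue list, which is what the cell above does.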


#### chiral
Side chains of Thr and Ile are chiral; incorrect atom labelling leads to the wrong chirality.  
To fix use chiral('All | None | residue_list')

In [17]:
structure.chiral()

Running chiral.
No residues with incorrect side-chain chirality found


#### backbone
Detects and fixes several problems with the backbone.  
Use any of:
- --fix_atoms All|None|Residue List
- --fix_chain All|None|Break list
- --add_caps All|None|Terms|Breaks|Residue list
- --no_recheck
- --no_check_clashes


In [18]:
structure.backbone()

Running backbone.
7 Residues with missing backbone atoms found
 SER A74    OXT
 LYS A108   OXT
 ASP A169   OXT
 GLU A430   OXT
 ASN B26    OXT
 LYS B117   OXT
 HIS B166   OXT
5 Backbone breaks found
 SER A74    - ASP A79    
 LYS A108   - ASP A112   
 ASP A169   - LEU A195   
 ASN B26    - ASP B33    
 LYS B117   - ALA B122   
No unexpected backbone links
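
One simple way to spot candidate backbone breaks is to look for jumps in residue numbering within a chain. The sketch below applies that idea to part of chain B, with residue numbers reconstructed from the breaks reported above. It is only a heuristic under stated assumptions: numbering jumps are not always real breaks, and the later output shows that the library also reports the distance between consecutive residues. `numbering_gaps` is a hypothetical helper.

```python
def numbering_gaps(res_nums):
    """Return (previous, next) pairs where residue numbers are not consecutive."""
    ordered = sorted(res_nums)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > 1]


# Part of chain B, rebuilt from the reported breaks B26-B33 and B117-B122
chain_b = list(range(20, 27)) + list(range(33, 118)) + list(range(122, 130))
print(numbering_gaps(chain_b))
```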


Re-building backbone breaks with Modeller (Modeller requires a license key)

In [30]:
args['modeller_key'] = 'XXXXXXXXX'  # Placeholder: register with Modeller to obtain a license key
opts = {
    'fix_chain': 'all',
    'add_caps' : 'none',
    'fix_atoms': 'none',
    'no_recheck': True
}
structure.backbone(opts)
#structure.backbone('--fix_chain all --add_caps none --fix_atoms none --no_recheck')

Running backbone. Options: {'fix_chain': 'all', 'add_caps': 'none', 'fix_atoms': 'none', 'no_recheck': True}
1 Residues with missing backbone atoms found
 ASP A169   OXT
1 Backbone breaks found
 ASP A169   - LEU A170   
No unexpected backbone links
Consecutive residues too far away to be covalently linked
 ASP A169   - LEU A170  , bond distance    6.793 
Main chain fixes
Fixing chain/model A/0
0 atoms in HETATM/BLK residues constrained
to protein atoms within 2.30 angstroms
and protein CA atoms within 10.00 angstroms
0 atoms in residues without defined topology
constrained to be rigid bodies
>> Model assessment by DOPE potential
DOPE score               : -41859.894531

>> Summary of successfully produced models:
Filename                          molpdf     DOPE score    GA341 score
----------------------------------------------------------------------
target.B99990001.pdb          1722.17029   -41859.89453        1.00000

Fixing ASP A169 - LEU A170
  Adding ASP A169
  Adding LEU A170
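
The cell above shows the same backbone options in two forms: a dictionary and a commented-out command-line style string. The correspondence between them can be sketched as follows. `opts_to_cli` is a hypothetical helper written for this tutorial; it assumes that a value of True stands for a bare flag, which matches the example but is not confirmed for every option.

```python
def opts_to_cli(opts):
    """Render an options dict as a command-line style string."""
    parts = []
    for key, value in opts.items():
        if value is True:
            parts.append(f"--{key}")  # boolean switch, no value
        elif value is False or value is None:
            continue  # switch not set
        else:
            parts.append(f"--{key} {value}")
    return " ".join(parts)


opts = {
    'fix_chain': 'all',
    'add_caps': 'none',
    'fix_atoms': 'none',
    'no_recheck': True,
}
print(opts_to_cli(opts))
```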


#### fixside
Detects and rebuilds missing protein side chains.   
To fix use fixside('All | None | residue_list')

In [31]:
structure.fixside()

Running fixside.
No residues with missing or unknown side chain atoms found


#### getss
Detects possible -S-S- bonds based on distance criteria.
Proper simulation requires those bonds to be correctly set.

In [32]:
structure.getss()

Running getss.
No SS bonds detected
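
A distance criterion for disulfide bonds can be sketched by comparing the SG atoms of all cysteine pairs. A typical S-S bond is about 2.05 Å long; the 2.5 Å cutoff, the function name and the coordinates below are assumptions chosen for illustration, not values taken from the library or from 6axg.

```python
import math
from itertools import combinations


def possible_ss_bonds(sg_atoms, max_dist=2.5):
    """List Cys pairs whose SG atoms are within max_dist (in angstroms).

    sg_atoms: dict mapping residue id -> (x, y, z) of the SG atom.
    """
    bonds = []
    for (res1, xyz1), (res2, xyz2) in combinations(sorted(sg_atoms.items()), 2):
        dist = math.dist(xyz1, xyz2)
        if dist <= max_dist:
            bonds.append((res1, res2, round(dist, 3)))
    return bonds


# Made-up SG coordinates: one close pair and one distant cysteine
sg_atoms = {
    "A10": (0.0, 0.0, 0.0),
    "A50": (2.04, 0.0, 0.0),
    "A90": (10.0, 0.0, 0.0),
}
print(possible_ss_bonds(sg_atoms))
```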


#### add_hydrogen
Adds hydrogen atoms. Available modes:
- auto: standard changes at pH 7.0, His->Hie
- pH: set the pH value
- list: explicit list as [*:]HisXXHid
- interactive[_his]: prompts for all selectable residues

Fixes missing side chain atoms unless --no_fix_side is set.  
Existing hydrogen atoms are removed before adding new ones unless --keep_h is set.

In [33]:
structure.add_hydrogen()

Running add_hydrogen.
149 Residues requiring selection on adding H atoms
 CYS A65,A76,A122,A237,A298,A343,A399,B51,B80,B118
 ASP A70,A79,A97,A112,A139,A166,A169,A198,A224,A272,A307,A315,A357,A367,A371,A404,A414,A420,B30,B33,B38,B47,B54,B57,B69,B92,B105,B107,B108,B119,B132,B154
 GLU A63,A78,A115,A134,A143,A144,A156,A201,A204,A213,A241,A322,A325,A363,A388,A403,A419,A421,A424,A430,B3,B31,B37,B49,B62,B63,B76,B91,B98,B126,B143,B153,B162
 LYS A64,A108,A189,A190,A192,A273,A306,A318,A356,A378,B5,B16,B42,B88,B101,B104,B117,B147
 ARG A101,A117,A118,A126,A131,A148,A155,A162,A163,A191,A215,A226,A235,A252,A261,A267,A280,A304,A335,A337,A338,A347,A368,A373,A385,A429,B41,B68,B73,B97,B102,B123,B128,B135,B149,B161,B164
 TYR A106,A127,A211,A228,A333,A336,A383,A417,A423,A427,B4,B32,B40,B64,B71,B96,B137,B141,B157


#### clashes
Detects steric clashes based on distance criteria.  
Contacts are classified as: 
* Severe: Atoms too close together, usually indicating superimposed structures or badly modelled regions. Should be fixed.
* Apolar: Van der Waals collisions. Usually fixed during the simulation.
* Polar and ionic: Usually indicate wrong side chain conformations. Usually fixed during the simulation.
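
The classification above can be sketched as a small function over a pair of atoms. Everything in it is an assumption made for illustration: the cutoff distances are placeholders, and the rules (like charges for ionic, N/O pairs for polar, a carbon atom for apolar) are a simplified reading of the categories, not the criteria implemented in biobb_structure_checking.

```python
# Placeholder cutoffs in angstroms, chosen only for this example
ILLUSTRATIVE_CUTOFFS = {"severe": 1.0, "apolar": 2.9, "polar": 3.1, "ionic": 3.5}
POLAR_ELEMENTS = {"N", "O"}


def classify_contact(elem1, elem2, dist, charge1=0, charge2=0,
                     cutoffs=ILLUSTRATIVE_CUTOFFS):
    """Return 'severe', 'ionic', 'polar', 'apolar' or None for a contact."""
    if dist < cutoffs["severe"]:
        return "severe"  # atoms almost superimposed
    if charge1 * charge2 > 0 and dist < cutoffs["ionic"]:
        return "ionic"  # two charges of the same sign too close
    if elem1 in POLAR_ELEMENTS and elem2 in POLAR_ELEMENTS:
        return "polar" if dist < cutoffs["polar"] else None
    if "C" in (elem1, elem2) and dist < cutoffs["apolar"]:
        return "apolar"
    return None


print(classify_contact("C", "C", 0.5))
print(classify_contact("O", "O", 2.7))
print(classify_contact("C", "O", 2.5))
print(classify_contact("N", "N", 3.3, charge1=1, charge2=1))
```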


In [34]:
structure.clashes()

Running clashes.


Complete check in a single method

In [None]:
structure.checkall()

Running models.
1 Model(s) detected
Single model found
Running chains.
2 Chain(s) detected
 A: Protein
 B: Protein
Running inscodes.
No residues with insertion codes found
Running altloc.
No residues with alternative location labels detected
Running rem_hydrogen.
No residues with Hydrogen atoms found
Running add_hydrogen.
171 Residues requiring selection on adding H atoms
 CYS A65,A76,A122,A237,A298,A343,A399,B51,B80,B118
 ASP A70,A79,A97,A112,A139,A166,A169,A198,A224,A272,A307,A315,A357,A367,A371,A404,A414,A420,B30,B33,B38,B47,B54,B57,B69,B92,B105,B107,B108,B119,B132,B154
 GLU A63,A78,A115,A134,A143,A144,A156,A201,A204,A213,A241,A322,A325,A363,A388,A403,A419,A421,A424,A430,B3,B31,B37,B49,B62,B63,B76,B91,B98,B126,B143,B153,B162
 HIS A77,A80,A89,A123,A132,A137,A199,A208,A276,A282,A299,A309,A311,A330,A354,A362,A375,A396,A407,B27,B94,B166
 LYS A64,A108,A189,A190,A192,A273,A306,A318,A356,A378,B5,B16,B42,B88,B101,B104,B117,B147
 ARG A101,A117,A118,A126,A131,A148,A155,A162,A163,A191,A215,A22

In [None]:
structure.save_structure(args['output_structure_path'])

'6axg_fixed.pdb'

In [None]:
import nglview as nv
nv.show_biopython(structure.strucm.st[0])



NGLWidget()

In [27]:
#structure.backbone('--fix_atoms A430 --fix_chain all --add_caps none --no_recheck')
opts = {
    'fix_atoms':'A430',
    'fix_chain':'all',
    'add_caps':'none',
    'no_recheck': True,
}
structure.backbone(opts)

Running backbone. Options: {'fix_atoms': 'A430', 'fix_chain': 'all', 'add_caps': 'none', 'no_recheck': True}
4 Residues with missing backbone atoms found
 SER A168   OXT
 LEU A195   OXT
 GLU A430   OXT
 HIS B166   OXT
2 Backbone breaks found
 SER A168   - ASP A169   
 LEU A195   - LEU A196   
No unexpected backbone links
Consecutive residues too far away to be covalently linked
 SER A168   - ASP A169  , bond distance    6.291 
 LEU A195   - LEU A196  , bond distance    4.205 
Main chain fixes
Fixing chain/model A/0
0 atoms in HETATM/BLK residues constrained
to protein atoms within 2.30 angstroms
and protein CA atoms within 10.00 angstroms
0 atoms in residues without defined topology
constrained to be rigid bodies
>> Model assessment by DOPE potential
DOPE score               : -42173.949219

>> Summary of successfully produced models:
Filename                          molpdf     DOPE score    GA341 score
----------------------------------------------------------------------
target.B999

In [28]:
opts = {
    'add_mode':'auto',
    'add_charges': 'ADT'
}

structure.add_hydrogen(opts)

#structure.add_hydrogen('--add_mode auto --add_charges ADT')

Running add_hydrogen. Options: {'add_mode': 'auto', 'add_charges': 'ADT'}
171 Residues requiring selection on adding H atoms
 CYS A65,A76,A122,A237,A298,A343,A399,B51,B80,B118
 ASP A70,A79,A97,A112,A139,A166,A169,A198,A224,A272,A307,A315,A357,A367,A371,A404,A414,A420,B30,B33,B38,B47,B54,B57,B69,B92,B105,B107,B108,B119,B132,B154
 GLU A63,A78,A115,A134,A143,A144,A156,A201,A204,A213,A241,A322,A325,A363,A388,A403,A419,A421,A424,B3,B31,B37,B49,B62,B63,B76,B91,B98,B126,B143,B153,B162
 HIS A77,A80,A89,A123,A132,A137,A199,A208,A276,A282,A299,A309,A311,A330,A354,A362,A375,A396,A407,B27,B94,B166
 LYS A64,A108,A189,A190,A192,A273,A306,A318,A356,A378,B5,B16,B42,B88,B101,B104,B117,B147
 ARG A101,A117,A118,A126,A131,A148,A155,A162,A163,A191,A215,A226,A235,A252,A261,A267,A280,A304,A335,A337,A338,A347,A368,A373,A385,A429,B41,B68,B73,B97,B102,B123,B128,B135,B149,B161,B164
 TYR A106,A127,A211,A228,A333,A336,A383,A417,A423,A427,B4,B32,B40,B64,B71,B96,B137,B141,B157
Running fixside. Options: --fix all
11 

In [29]:
structure.save_structure('6axg.pdbqt')

'6axg.pdbqt'