```
This script can be used for any purpose without limitation subject to the
conditions at http://www.ccdc.cam.ac.uk/Community/Pages/Licences/v2.aspx

This permission notice and the following statement of attribution must be
included in all copies or substantial portions of this script.

2023-05-15: Made available by the Cambridge Crystallographic Data Centre.

```

# Protein Features in the CSD Python API

In [1]:
from pathlib import Path
import sys
sys.path.append('../..')
from ccdc_notebook_utilities import run_hermes, create_logger

import os
from time import time

import warnings
from ccdc.protein import Protein

In [2]:
from IPython.display import HTML

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### Initialization

In [3]:
logger = create_logger()

[23-05-19 10:26:50 INFO   ] 
Platform:                     Windows-10-10.0.19045-SP0

Python exe:                   C:\Users\cole\Anaconda3\envs\latest_csd_python_api\python.exe
Python version:               3.9.16

CSD version:                  544
CSD directory:                C:/Users/cole/CCDC/ccdc-data/csd
API version:                  3.0.15

CSDHOME:                      C:/Users/cole/CCDC/ccdc-data/csd
CCDC_LICENSING_CONFIGURATION: Not set



### Getting started

Let's start by importing the Protein class. This, unsurprisingly, is part of the protein module in the CSD Python API.

In [4]:
from ccdc.protein import Protein

First, we need to load a protein. We can do this by using the from_file() method:

In [5]:
my_protein_pdb = Protein.from_file('3kk6.pdb')

Proteins can be read from files in PDB, mmCIF or Mol2 format:

In [6]:
my_protein_cif = Protein.from_file('3kk6.cif')

You can write PDB or Mol2 files in the usual way, but note that this only saves the 3D coordinate information.

In [7]:
from ccdc.io import EntryWriter
with EntryWriter('3kk6_out.pdb') as ew:
    ew.write(my_protein_cif)
    
with open('3kk6_out.pdb','r') as outf:
    lines = outf.readlines()
    print("The output file looks like this (note that it uses orthogonal coordinates):\n")
    for line in lines[:10]:
        print(line[:-1])
    print("... etc ...")

The output file looks like this (note that it uses orthogonal coordinates):

HEADER    CSD ENTRY 3KK6
CRYST1   1.0000   1.0000   1.0000  90.00  90.00  90.00          
SCALE1      1.000000  0.000000  0.000000       0.000000
SCALE2      0.000000  1.000000  0.000000       0.000000
SCALE3      0.000000  0.000000  1.000000       0.000000
ATOM      1  N   PRO A   1      -1.471  73.907  -4.091  1.00111.70           N  
ATOM      2  CA  PRO A   1      -2.058  73.750  -2.771  1.00118.46           C  
ATOM      3  C   PRO A   1      -3.273  72.889  -2.915  1.00124.49           C  
ATOM      4  O   PRO A   1      -4.153  72.917  -2.071  1.00121.31           O  
ATOM      5  CB  PRO A   1      -2.500  75.159  -2.446  1.00116.04           C  
... etc ...
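
The ATOM records above follow the fixed-column PDB layout, and neighbouring fields can run together (here the occupancy 1.00 and the B-factor 111.70), so splitting on whitespace is unreliable. As a minimal sketch that does not use the CSD Python API, the hypothetical helper parse_atom_line below slices one record by column position:

```python
# Illustrative only: a hand-rolled parser for one PDB ATOM record.
# parse_atom_line is a hypothetical helper, not part of the CSD Python API.

def parse_atom_line(line):
    """Slice a PDB ATOM/HETATM record by its fixed column positions."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }

# First ATOM record from the output above.
record = "ATOM      1  N   PRO A   1      -1.471  73.907  -4.091  1.00111.70           N  "

atom = parse_atom_line(record)
print(atom["name"], atom["residue_name"], atom["chain_id"], atom["residue_number"])
print(atom["occupancy"], atom["b_factor"])

# Whitespace splitting fuses occupancy and B-factor into a single token.
print("1.00111.70" in record.split())
```

Slicing by column keeps the two numeric fields separate even when no space is written between them.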


We can access the usual attributes of a protein through the returned object. First, we have the chains:

In [8]:
chains = my_protein_cif.chains
print(chains)

(<ccdc.protein.Protein.Chain object at 0x0000029AB69E31F0>, <ccdc.protein.Protein.Chain object at 0x0000029AB69E3940>)


This protein has two chain objects. Let's look at what's inside a chain:

In [9]:
print([x for x in dir(chains[0]) if not x.startswith('_')])

['identifier', 'index', 'residues', 'sequence']


Each chain has an identifier, an index, a list of individual residues, and a sequence:

In [10]:
print(chains[0].sequence)

PVNPCCYYPCQHQGICVRFGLDRYQCDCTRTGYSGPNCTIPEIWTWLRTTLRPSPSFIHFLLTHGRWLWDFVNATFIRDTLMRLVLTVRSNLIPSPPTYNIAHDYISWESFSNVSYYTRILPSVPRDCPTPMGTKGKKQLPDAEFLSRRFLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTSGKMGPGFTKALGHGVDLGHIYGDNLERQYQLRLFKDGKLKYQMLNGEVYPPSVEEAPVLMHYPRGIPPQSQMAVGQEVFGLLPGLMLYATIWLREHNRVCDLLKAEHPTWGDEQLFQTARLILIGETIKIVIEEYVQQLSGYFLQLKFDPELLFGAQFQYRNRIAMEFNQLYHWHPLMPDSFRVGPQDYSYEQFLFNTSMLVDYGVEALVDAFSRQPAGRIGGGRNIDHHILHVAVDVIKESRVLRLQPFNEYRKRFGMKPYTSFQELTGEKEMAAELEELYGDIDALEFYPGLLLEKCHPNSIFGESMIEMGAPFSLKGLLGNPICSPEYWKASTFGGEVGFNLVKTATLKKLVCLNTKTCPYVSFHVPD


The sequence is a direct reflection of the residues that have atoms in the input; it does not reflect the SEQRES record in the PDB file. For example, here we read the same PDB file with only the first 6 residues included, and we get a tiny chain.

In [11]:
tiny = Protein.from_file('tiny.pdb')

In [12]:
print(tiny.chains[0].sequence)

PVNPCC
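
To make the distinction concrete, here is a plain-Python sketch that reads a sequence two ways from a small, made-up PDB fragment: once from the SEQRES record and once from the residues that actually have ATOM records. It illustrates the idea only and is not how the CSD Python API is implemented; atom_line, sequence_from_seqres and sequence_from_atoms are hypothetical helpers.

```python
# Illustrative only: a plain-Python comparison of SEQRES and ATOM-derived
# sequences. None of these helpers is part of the CSD Python API.

THREE_TO_ONE = {"PRO": "P", "VAL": "V", "ASN": "N", "CYS": "C", "TYR": "Y"}

def atom_line(serial, name, residue_name, chain_id, residue_number):
    """Build a minimal fixed-column ATOM record (coordinates set to zero)."""
    return (f"ATOM  {serial:>5} {name:<4} {residue_name:>3} {chain_id}"
            f"{residue_number:>4}    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")

# Hypothetical file: SEQRES declares eight residues, but only three have atoms.
pdb_lines = [
    "SEQRES   1 A    8  PRO VAL ASN PRO CYS CYS TYR TYR",
    atom_line(1, "N", "PRO", "A", 1),
    atom_line(2, "CA", "PRO", "A", 1),
    atom_line(3, "N", "VAL", "A", 2),
    atom_line(4, "CA", "VAL", "A", 2),
    atom_line(5, "N", "ASN", "A", 3),
]

def sequence_from_seqres(lines, chain_id):
    """Read the declared sequence from SEQRES records."""
    residues = []
    for line in lines:
        fields = line.split()
        if fields[:1] == ["SEQRES"] and fields[2] == chain_id:
            residues.extend(fields[4:])
    return "".join(THREE_TO_ONE[r] for r in residues)

def sequence_from_atoms(lines, chain_id):
    """Read the sequence from the residues that actually have ATOM records."""
    seen, letters = set(), []
    for line in lines:
        if line.startswith("ATOM") and line[21] == chain_id:
            key = line[22:27]  # residue number plus insertion code
            if key not in seen:
                seen.add(key)
                letters.append(THREE_TO_ONE[line[17:20]])
    return "".join(letters)

declared = sequence_from_seqres(pdb_lines, "A")
observed = sequence_from_atoms(pdb_lines, "A")
print(declared)
print(observed)
```

The second value is shorter because it only counts residues with coordinates, which is the behaviour described for the chain sequence above.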


Let's look at the chain identifiers and sequences:

In [13]:
for chain in my_protein_cif.chains:
    print(chain.identifier)
    print(chain.sequence)

A
PVNPCCYYPCQHQGICVRFGLDRYQCDCTRTGYSGPNCTIPEIWTWLRTTLRPSPSFIHFLLTHGRWLWDFVNATFIRDTLMRLVLTVRSNLIPSPPTYNIAHDYISWESFSNVSYYTRILPSVPRDCPTPMGTKGKKQLPDAEFLSRRFLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTSGKMGPGFTKALGHGVDLGHIYGDNLERQYQLRLFKDGKLKYQMLNGEVYPPSVEEAPVLMHYPRGIPPQSQMAVGQEVFGLLPGLMLYATIWLREHNRVCDLLKAEHPTWGDEQLFQTARLILIGETIKIVIEEYVQQLSGYFLQLKFDPELLFGAQFQYRNRIAMEFNQLYHWHPLMPDSFRVGPQDYSYEQFLFNTSMLVDYGVEALVDAFSRQPAGRIGGGRNIDHHILHVAVDVIKESRVLRLQPFNEYRKRFGMKPYTSFQELTGEKEMAAELEELYGDIDALEFYPGLLLEKCHPNSIFGESMIEMGAPFSLKGLLGNPICSPEYWKASTFGGEVGFNLVKTATLKKLVCLNTKTCPYVSFHVPD
B
PVNPCCYYPCQHQGICVRFGLDRYQCDCTRTGYSGPNCTIPEIWTWLRTTLRPSPSFIHFLLTHGRWLWDFVNATFIRDTLMRLVLTVRSNLIPSPPTYNIAHDYISWESFSNVSYYTRILPSVPRDCPTPMGTKGKKQLPDAEFLSRRFLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTSGKMGPGFTKALGHGVDLGHIYGDNLERQYQLRLFKDGKLKYQMLNGEVYPPSVEEAPVLMHYPRGIPPQSQMAVGQEVFGLLPGLMLYATIWLREHNRVCDLLKAEHPTWGDEQLFQTARLILIGETIKIVIEEYVQQLSGYFLQLKFDPELLFGAQFQYRNRIAMEFNQLYHWHPLMPDSFRVGPQDYSYEQFLFNTSMLVDYGVEALVDAFSRQPAGRIGGGRNIDHHILHVAVDVIKESRVLRLQPFNEYRKRFGMK

Sequences are plain strings, so we can compare them directly.

In [14]:
if my_protein_cif.chains[0].sequence == my_protein_cif.chains[1].sequence:
    print("The chain sequences are identical")

The chain sequences are identical
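
When two sequences are not identical, it is often more useful to know where they diverge than simply that they do. A minimal sketch, using made-up strings and a hypothetical helper first_difference (not part of the CSD Python API):

```python
# Illustrative only: locate where two sequences first diverge.
# first_difference is a hypothetical helper, and the strings are made up.

def first_difference(seq_a, seq_b):
    """Return the index of the first differing position, or None if identical."""
    for index, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            return index
    if len(seq_a) != len(seq_b):
        # One sequence is a prefix of the other.
        return min(len(seq_a), len(seq_b))
    return None

full = "PVNPCCYYPC"
truncated = "PVNPCCYY"
mutated = "PVNPACYYPC"

print(first_difference(full, full))
print(first_difference(full, truncated))
print(first_difference(full, mutated))
```

The same function could be applied to the sequence strings of two chains, since those are ordinary Python strings.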


### Editing Protein Objects

We can edit a protein using the add and remove methods. For example, let's remove a chain:

In [15]:
chains = my_protein_cif.chains
my_protein_cif.remove_chain(chains[1].identifier)
for chain in my_protein_cif.chains:
    print(chain.identifier)
    print(chain.sequence)

A
PVNPCCYYPCQHQGICVRFGLDRYQCDCTRTGYSGPNCTIPEIWTWLRTTLRPSPSFIHFLLTHGRWLWDFVNATFIRDTLMRLVLTVRSNLIPSPPTYNIAHDYISWESFSNVSYYTRILPSVPRDCPTPMGTKGKKQLPDAEFLSRRFLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTSGKMGPGFTKALGHGVDLGHIYGDNLERQYQLRLFKDGKLKYQMLNGEVYPPSVEEAPVLMHYPRGIPPQSQMAVGQEVFGLLPGLMLYATIWLREHNRVCDLLKAEHPTWGDEQLFQTARLILIGETIKIVIEEYVQQLSGYFLQLKFDPELLFGAQFQYRNRIAMEFNQLYHWHPLMPDSFRVGPQDYSYEQFLFNTSMLVDYGVEALVDAFSRQPAGRIGGGRNIDHHILHVAVDVIKESRVLRLQPFNEYRKRFGMKPYTSFQELTGEKEMAAELEELYGDIDALEFYPGLLLEKCHPNSIFGESMIEMGAPFSLKGLLGNPICSPEYWKASTFGGEVGFNLVKTATLKKLVCLNTKTCPYVSFHVPD


As you can see, we now have only one chain. Protein has several 'add' and 'remove' methods for adding and removing components of a protein. Let's remove the first 50 residues:

In [16]:
print(f"Currently we have {len(my_protein_cif.residues)} residues and {len(my_protein_cif.atoms)} atoms")

for r in my_protein_cif.residues[:50]:
    my_protein_cif.remove_residue(r.identifier)
    
print(f"After editing, we have {len(my_protein_cif.residues)} residues and {len(my_protein_cif.atoms)} atoms")
print()
print(my_protein_cif.chains[0].sequence)

Currently we have 553 residues and 9380 atoms
After editing, we have 503 residues and 8622 atoms

LRPSPSFIHFLLTHGRWLWDFVNATFIRDTLMRLVLTVRSNLIPSPPTYNIAHDYISWESFSNVSYYTRILPSVPRDCPTPMGTKGKKQLPDAEFLSRRFLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTSGKMGPGFTKALGHGVDLGHIYGDNLERQYQLRLFKDGKLKYQMLNGEVYPPSVEEAPVLMHYPRGIPPQSQMAVGQEVFGLLPGLMLYATIWLREHNRVCDLLKAEHPTWGDEQLFQTARLILIGETIKIVIEEYVQQLSGYFLQLKFDPELLFGAQFQYRNRIAMEFNQLYHWHPLMPDSFRVGPQDYSYEQFLFNTSMLVDYGVEALVDAFSRQPAGRIGGGRNIDHHILHVAVDVIKESRVLRLQPFNEYRKRFGMKPYTSFQELTGEKEMAAELEELYGDIDALEFYPGLLLEKCHPNSIFGESMIEMGAPFSLKGLLGNPICSPEYWKASTFGGEVGFNLVKTATLKKLVCLNTKTCPYVSFHVPD
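
The cell above slices residues[:50] before removing anything, so the removals work from a fixed snapshot rather than from a collection that is changing underneath the loop. The sketch below shows that pattern with ToyProtein, a hypothetical stand-in class; it is not ccdc.protein.Protein, and the identifier strings are made up for illustration.

```python
# Illustrative only: ToyProtein is a hypothetical stand-in, not the
# ccdc.protein.Protein class, and the identifiers are made up.

class ToyProtein:
    def __init__(self, identifiers):
        self._residues = list(identifiers)

    @property
    def residues(self):
        # Return a fresh tuple so callers never hold the internal list.
        return tuple(self._residues)

    def remove_residue(self, identifier):
        self._residues.remove(identifier)

toy = ToyProtein(["A:PRO1", "A:VAL2", "A:ASN3", "A:PRO4", "A:CYS5"])

# Take the snapshot once, then mutate the protein inside the loop.
to_remove = toy.residues[:2]
for identifier in to_remove:
    toy.remove_residue(identifier)

print(toy.residues)
```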


Similar methods exist for adding and removing ligands, cofactors, nucleotides, metals and waters. We can also remove individual atoms, although removing atoms can currently be a bit slow.

Let's remove all the sidechain atoms in our protein. First we build the list of atoms to remove:

In [17]:
atoms_to_remove = [ atom for residue in my_protein_cif.residues for atom in residue.sidechain_atoms ]

In [18]:
print(f"First ten atoms are {atoms_to_remove[:10]}")

First ten atoms are [Atom(CB), Atom(CG), Atom(CD1), Atom(CD2), Atom(HA), Atom(HB2), Atom(HB3), Atom(HG), Atom(HD11), Atom(HD12)]
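
For intuition about what counts as a sidechain atom, here is a simplified, name-based split written in plain Python. The set BACKBONE_NAMES is an assumption chosen so that HA lands on the sidechain side, as it does in the output above; the API's own sidechain_atoms may apply different rules.

```python
# Illustrative only: a name-based split of backbone and sidechain atoms.
# BACKBONE_NAMES is an assumption; the API's sidechain_atoms may differ.

BACKBONE_NAMES = {"N", "CA", "C", "O", "H"}

def sidechain_names(atom_names):
    """Keep every atom name that is not in the assumed backbone set."""
    return [name for name in atom_names if name not in BACKBONE_NAMES]

# Atom names of a hypothetical leucine residue with some hydrogens.
leucine = ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2",
           "H", "HA", "HB2", "HB3", "HG"]

print(sidechain_names(leucine))
```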


In [19]:
my_protein_cif.remove_atoms(atoms_to_remove)

Let's write it out and view it in Hermes:

In [24]:
from ccdc.io import MoleculeWriter
with MoleculeWriter('saved.mol2') as mw:
    mw.write(my_protein_cif)
    
run_hermes('saved.mol2')

As you can see, the protein now just has the backbone atoms left.
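
If no viewer is at hand, the same check can be made by reading the atom names from the Mol2 text. The sketch below parses the @<TRIPOS>ATOM section of a small made-up Mol2 fragment; it is not the contents of saved.mol2, and mol2_atom_names is a hypothetical helper. Treating N, CA, C and O as the backbone names is an assumption.

```python
# Illustrative only: list atom names in the ATOM section of Mol2 text.
# The fragment below is made up; it is not the contents of saved.mol2.

mol2_text = """@<TRIPOS>MOLECULE
toy
4 3 1
PROTEIN
NO_CHARGES

@<TRIPOS>ATOM
1 N -1.471 73.907 -4.091 N.3 1 PRO1
2 CA -2.058 73.750 -2.771 C.3 1 PRO1
3 C -3.273 72.889 -2.915 C.2 1 PRO1
4 O -4.153 72.917 -2.071 O.2 1 PRO1
@<TRIPOS>BOND
1 1 2 1
2 2 3 1
3 3 4 2
"""

def mol2_atom_names(text):
    """Collect the second column of every line in the ATOM section."""
    names, in_atoms = [], False
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
        elif in_atoms and line.strip():
            names.append(line.split()[1])
    return names

names = mol2_atom_names(mol2_text)
print(names)

# True when every atom name belongs to the assumed backbone set.
print(set(names) <= {"N", "CA", "C", "O"})
```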