# Exercise 2 - databases, molecular modelling
The goal of this exercise is to extract data from the UniProt database.

Partially based on:<br>
https://github.com/volkamerlab/TeachOpenCADD <br>
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0351-x

## databases

In [1]:
# importing 
import requests
import xml.etree.ElementTree as ET
import pandas as pd
import math
import numpy as np
from matplotlib import pyplot as plt

In [2]:
search_dict = dict(offset=0, size=-1, protein='ABL1', reviewed='true', organism='human')
#perform a search in uniprot
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?" 
response = requests.get(requestURL, params=search_dict, headers={ "Accept" : "application/json"})
data = response.json()

print('entries found', len(data))

entries found 7
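To make the role of `params` more concrete, the sketch below builds the query string by hand with the standard library. This is roughly what `requests` does internally before sending the request; `base_url` and `full_url` are names introduced here for illustration only.

```python
from urllib.parse import urlencode

# same search parameters as in the cell above
search_dict = dict(offset=0, size=-1, protein='ABL1', reviewed='true', organism='human')

base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
# urlencode turns the dictionary into key=value pairs joined by '&'
full_url = base_url + "?" + urlencode(search_dict)
print(full_url)
```

Pasting the printed URL into a browser is a quick way to check a query before running it from Python.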


In [3]:
data[0].keys()

dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', 'secondaryAccession', 'protein', 'gene', 'comments', 'features', 'dbReferences', 'keywords', 'references', 'sequence'])

In [4]:
data[0]

{'accession': 'Q9BTV7',
 'id': 'CABL2_HUMAN',
 'proteinExistence': 'Evidence at protein level',
 'info': {'type': 'Swiss-Prot',
  'created': '2003-02-12',
  'modified': '2022-10-12',
  'version': 165},
 'organism': {'taxonomy': 9606,
  'names': [{'type': 'scientific', 'value': 'Homo sapiens'},
   {'type': 'common', 'value': 'Human'}],
  'lineage': ['Eukaryota',
   'Metazoa',
   'Chordata',
   'Craniata',
   'Vertebrata',
   'Euteleostomi',
   'Mammalia',
   'Eutheria',
   'Euarchontoglires',
   'Primates',
   'Haplorrhini',
   'Catarrhini',
   'Hominidae',
   'Homo']},
 'secondaryAccession': ['Q5JWL0', 'Q9BYK0'],
 'protein': {'recommendedName': {'fullName': {'value': 'CDK5 and ABL1 enzyme substrate 2'}},
  'alternativeName': [{'fullName': {'value': 'Interactor with CDK3 2'},
    'shortName': [{'value': 'Ik3-2'}]}]},
 'gene': [{'name': {'value': 'CABLES2'},
   'synonyms': [{'value': 'C20orf150'}]}],
 'comments': [{'type': 'FUNCTION',
   'text': [{'value': 'Unknown. Probably involved in 

In [5]:
# basic info for each entry
for entry in data:
    print(entry['accession'], entry['id'])

Q9BTV7 CABL2_HUMAN
Q8TDN4 CABL1_HUMAN
Q8TDN4-3 CABL1_HUMAN
Q8TDN4-2 CABL1_HUMAN
P00519-2 ABL1_HUMAN
P00519 ABL1_HUMAN
Q8TDN4-4 CABL1_HUMAN
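The list above mixes canonical entries with isoforms (accessions with a `-N` suffix). A minimal sketch of how one might summarise such entries and keep only the canonical ones is given below. It runs on a small hand-written sample with the same nesting as the entries shown above, and `sample_entries`, `summarise` and `canonical` are hypothetical names, not part of the API.

```python
# small hand-written sample with the same nesting as the entries shown above
sample_entries = [
    {'accession': 'Q9BTV7', 'id': 'CABL2_HUMAN',
     'protein': {'recommendedName': {'fullName': {'value': 'CDK5 and ABL1 enzyme substrate 2'}}},
     'gene': [{'name': {'value': 'CABLES2'}}]},
    {'accession': 'P00519-2', 'id': 'ABL1_HUMAN'},
    {'accession': 'P00519', 'id': 'ABL1_HUMAN'},
]

def summarise(entry):
    """Return (accession, id, recommended name or None) for one entry."""
    # chained .get() calls avoid a KeyError when a level is missing
    name = (entry.get('protein', {})
                 .get('recommendedName', {})
                 .get('fullName', {})
                 .get('value'))
    return entry['accession'], entry['id'], name

# isoform accessions carry a '-N' suffix, canonical ones do not
canonical = [summarise(e) for e in sample_entries if '-' not in e['accession']]
for row in canonical:
    print(row)
```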


In [6]:
data[0]['organism']

{'taxonomy': 9606,
 'names': [{'type': 'scientific', 'value': 'Homo sapiens'},
  {'type': 'common', 'value': 'Human'}],
 'lineage': ['Eukaryota',
  'Metazoa',
  'Chordata',
  'Craniata',
  'Vertebrata',
  'Euteleostomi',
  'Mammalia',
  'Eutheria',
  'Euarchontoglires',
  'Primates',
  'Haplorrhini',
  'Catarrhini',
  'Hominidae',
  'Homo']}

In [7]:
search_dict = dict(offset=0, size=-1, protein='ABL1', reviewed='true', organism='human')
#perform a search in uniprot
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?" 
response = requests.get(requestURL, params=search_dict, headers={ "Accept" : "application/xml"})
data_xml = response.text

print(data_xml)

<?xml version='1.0' encoding='UTF-8'?><uniprot xmlns="http://uniprot.org/uniprot" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2003-02-12" modified="2022-10-12" version="165"><accession>Q9BTV7</accession><accession>Q5JWL0</accession><accession>Q9BYK0</accession><name>CABL2_HUMAN</name><protein><recommendedName><fullName>CDK5 and ABL1 enzyme substrate 2</fullName></recommendedName><alternativeName><fullName>Interactor with CDK3 2</fullName><shortName>Ik3-2</shortName></alternativeName></protein><gene><name type="primary">CABLES2</name><name type="synonym">C20orf150</name></gene><organism><name type="scientific">Homo sapiens</name><name type="common">Human</name><dbReference type="NCBI Taxonomy" id="9606"/><lineage><taxon>Eukaryota</taxon><taxon>Metazoa</taxon><taxon>Chordata</taxon><taxon>Craniata</taxon><taxon>V

In [8]:
# let’s retrieve the data
formats = ('xml', 'json', 'txt') # some of the formats supported by UniProt
uniprot_id = 'P00519'
data = []
for data_format in formats:
    # the entry ID and the format are both part of the URL, so no query parameters are needed
    temp_response = requests.get('https://www.uniprot.org/uniprot/{:}.{:}'.format(uniprot_id, data_format))
    data.append(temp_response.text)

This is what we got.<br>
You can also take a look at it online:<br>
https://www.uniprot.org/uniprot/P00519.xml<br>
https://www.uniprot.org/uniprot/P00519.json<br>
https://www.uniprot.org/uniprot/P00519.txt<br>

Additionally, one could also retrieve the HTML source (but it is very tricky to extract the data from it):
https://www.uniprot.org/uniprot/P00519

In [9]:
print(data[0])
#https://www.uniprot.org/uniprot/P00519.xml

<?xml version="1.0" encoding="UTF-8"  standalone="no" ?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="1986-07-21" modified="2022-10-12" version="276" xmlns="http://uniprot.org/uniprot">
  <accession>P00519</accession>
  <accession>A3KFJ3</accession>
  <accession>Q13869</accession>
  <accession>Q13870</accession>
  <accession>Q16133</accession>
  <accession>Q17R61</accession>
  <accession>Q45F09</accession>
  <name>ABL1_HUMAN</name>
  <protein>
    <recommendedName>
      <fullName>Tyrosine-protein kinase ABL1</fullName>
      <ecNumber evidence="44 50">2.7.10.2</ecNumber>
    </recommendedName>
    <alternativeName>
      <fullName>Abelson murine leukemia viral oncogene homolog 1</fullName>
    </alternativeName>
    <alternativeName>
      <fullName>Abelson tyrosine-protein kinase 1</fullName>
    </alternati

In [10]:
print(data[1])
#https://www.uniprot.org/uniprot/P00519.json

{"entryType":"UniProtKB reviewed (Swiss-Prot)","primaryAccession":"P00519","secondaryAccessions":["A3KFJ3","Q13869","Q13870","Q16133","Q17R61","Q45F09"],"uniProtkbId":"ABL1_HUMAN","entryAudit":{"firstPublicDate":"1986-07-21","lastAnnotationUpdateDate":"2022-10-12","lastSequenceUpdateDate":"2006-01-24","entryVersion":276,"sequenceVersion":4},"annotationScore":5.0,"organism":{"scientificName":"Homo sapiens","commonName":"Human","taxonId":9606,"lineage":["Eukaryota","Metazoa","Chordata","Craniata","Vertebrata","Euteleostomi","Mammalia","Eutheria","Euarchontoglires","Primates","Haplorrhini","Catarrhini","Hominidae","Homo"]},"proteinExistence":"1: Evidence at protein level","proteinDescription":{"recommendedName":{"fullName":{"value":"Tyrosine-protein kinase ABL1"},"ecNumbers":[{"evidences":[{"evidenceCode":"ECO:0000269","source":"PubMed","id":"20357770"},{"evidenceCode":"ECO:0000269","source":"PubMed","id":"28428613"}],"value":"2.7.10.2"}]},"alternativeNames":[{"fullName":{"value":"Abelson
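Since cell 8 stored `response.text`, `data[1]` is a plain string rather than a dictionary. The sketch below shows how such a string can be turned into nested Python objects with `json.loads`. It uses a shortened copy of the text printed above, and `json_text` and `entry` are names chosen here. Note that the field names differ from those returned by the EBI Proteins API earlier (e.g. `primaryAccession` instead of `accession`).

```python
import json

# shortened copy of the JSON text printed above (only a few fields kept)
json_text = ('{"primaryAccession":"P00519",'
             '"secondaryAccessions":["A3KFJ3","Q13869"],'
             '"uniProtkbId":"ABL1_HUMAN",'
             '"organism":{"scientificName":"Homo sapiens","taxonId":9606}}')

entry = json.loads(json_text)   # str -> nested dict/list
print(entry['primaryAccession'], entry['uniProtkbId'])
print(entry['organism']['scientificName'], entry['organism']['taxonId'])
```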

In [11]:
print(data[2])
#https://www.uniprot.org/uniprot/P00519.txt

ID   ABL1_HUMAN              Reviewed;        1130 AA.
AC   P00519; A3KFJ3; Q13869; Q13870; Q16133; Q17R61; Q45F09;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   24-JAN-2006, sequence version 4.
DT   12-OCT-2022, entry version 276.
DE   RecName: Full=Tyrosine-protein kinase ABL1;
DE            EC=2.7.10.2 {ECO:0000269|PubMed:20357770, ECO:0000269|PubMed:28428613};
DE   AltName: Full=Abelson murine leukemia viral oncogene homolog 1;
DE   AltName: Full=Abelson tyrosine-protein kinase 1;
DE   AltName: Full=Proto-oncogene c-Abl;
DE   AltName: Full=p150;
GN   Name=ABL1; Synonyms=ABL, JTK7;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM IA), ALTERNATIVE SPLICING, CHROMOSOMAL
RP   TRANSLOCATION WITH BRC, AND VARIANT PRO-140.
RX   PubMed=3021337; DOI=10.1016/0
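The text format is line-oriented: each line starts with a two-letter code (`ID`, `AC`, `GN`, ...) followed by the content. A minimal sketch of reading it by hand is shown below, using a few lines copied from the output above. `flat_text`, `records`, `accessions` and `entry_name` are hypothetical names, and the column offsets are an assumption based on the layout seen here.

```python
from collections import defaultdict

# a few lines copied from the text output above
flat_text = """ID   ABL1_HUMAN              Reviewed;        1130 AA.
AC   P00519; A3KFJ3; Q13869; Q13870; Q16133; Q17R61; Q45F09;
GN   Name=ABL1; Synonyms=ABL, JTK7;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;"""

records = defaultdict(list)
for line in flat_text.splitlines():
    code, content = line[:2], line[5:]   # two-letter code, then the payload
    records[code].append(content)

# accessions are separated by ';' and may span several AC lines
accessions = [acc.strip() for acc in " ".join(records['AC']).split(';') if acc.strip()]
entry_name = records['ID'][0].split()[0]
print(entry_name, accessions)
```

For anything beyond a quick look, the XML or JSON formats are easier to query, which is why the rest of this section works with XML.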

### xml.etree.ElementTree

In [12]:
# here we use xml.etree.ElementTree to parse the retrieved xml table
data_xml = ET.fromstring(data[0])

In [13]:
# loop over the tags at the first level - this corresponds to the 2 tags both called tag1 in TABLE 1,
# one that contains more child tags and the other one which is empty
# here these are the entry tag (which holds all the data) and the copyright tag
for child in data_xml:
    print(child.tag, child.attrib)
    print('\n')

{http://uniprot.org/uniprot}entry {'dataset': 'Swiss-Prot', 'created': '1986-07-21', 'modified': '2022-10-12', 'version': '276'}


{http://uniprot.org/uniprot}copyright {}




In [14]:
# let’s now loop over children of the first tag
# we get uniprot ids related to our protein, with the first one being the ID of our protein
counter = 0
for child in data_xml[0]:
    print(child.tag, child.attrib, child.text)
    if counter ==3:
        break
    counter+=1

{http://uniprot.org/uniprot}accession {} P00519
{http://uniprot.org/uniprot}accession {} A3KFJ3
{http://uniprot.org/uniprot}accession {} Q13869
{http://uniprot.org/uniprot}accession {} Q13870


In [15]:
# this xml is structured such that all tags start with {http://uniprot.org/uniprot}
# which is a bit annoying, so let’s remove this

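def remove_xmlns(line):
    # cut the line at the first 'xmlns' and close the tag again
    # if 'xmlns' is absent, find() returns -1, so only the last character (here '>') is cut and re-added
    return line[:line.find('xmlns')] + '>'

# only the first three lines (xml declaration, uniprot tag, entry tag) need fixing
N_lines_2_fix = 3
temp = data[0].split('\n', N_lines_2_fix)
xml_txt = ''
for line in temp[:-1]:
    xml_txt += remove_xmlns(line) + '\n'
xml_txt += temp[-1]

xml = ET.fromstring(xml_txt)
xml = xml[0] # the entry tag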

In [16]:
# quick check that {http://uniprot.org/uniprot} is removed from the tag names
for child in xml:
    print(child.tag, child.attrib, child.text)
    break

accession {} P00519
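Editing the raw text works here, but it depends on the exact line layout of the file. An alternative is to keep the namespace and pass a prefix mapping to `find` and `findall`. The sketch below shows this on a tiny stand-in for the UniProt XML. The prefix `up` and the names `sample_xml`, `root`, `ns`, `primary` and `all_acc` are arbitrary choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# tiny stand-in for the UniProt XML, keeping the default namespace
sample_xml = """<uniprot xmlns="http://uniprot.org/uniprot">
<entry dataset="Swiss-Prot">
  <accession>P00519</accession>
  <accession>A3KFJ3</accession>
  <name>ABL1_HUMAN</name>
</entry>
</uniprot>"""

root = ET.fromstring(sample_xml)
# 'up' is an arbitrary prefix chosen here; only the URI has to match
ns = {'up': 'http://uniprot.org/uniprot'}

primary = root.find('up:entry/up:accession', ns).text
all_acc = [a.text for a in root.findall('up:entry/up:accession', ns)]
print(primary, all_acc)
```

The price is that every step of a search path needs the prefix, which is why the notebook continues with the stripped version.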


In [17]:
# the power of such tables lies in the ability to search for different tags using tag names or attributes
# let’s retrieve the protein name
print(xml.findall('protein')[0]) # retrieving the protein tag

<Element 'protein' at 0x7f8ebe78f470>


In [18]:
prot_recname_tag = xml.find('protein/recommendedName') # tag named recommendedName, which is child of the tag named protein
list(prot_recname_tag) # list of child tags of the tag named recommendedName

[<Element 'fullName' at 0x7f8ebe78f510>,
 <Element 'ecNumber' at 0x7f8ebe78f5b0>]

In [19]:
element = xml.find('protein/recommendedName/fullName') # we go one level more, to get to the fullName tag
print(element.text)

Tyrosine-protein kinase ABL1


In [20]:
# loop over all names
for child in xml.findall('protein')[0]:
    print(child.tag, child.find('fullName').text)

recommendedName Tyrosine-protein kinase ABL1
alternativeName Abelson murine leukemia viral oncogene homolog 1
alternativeName Abelson tyrosine-protein kinase 1
alternativeName Proto-oncogene c-Abl
alternativeName p150


In [21]:
# let’s now find tags that report on pdb structures
# if you look at https://www.uniprot.org/uniprot/P00519.xml, pdb data is stored in tags with dbReference name
all_dbReference_tags = xml.findall('dbReference')
len(all_dbReference_tags)
# however, there are many such tags, where not all of them are related to pdb structures

467

In [22]:
# tags that are related to pdb structures have an attribute called type that is equal to PDB
# this is how we search for these tags
# findall('./*[@attribute="attribute_value"]')
pdb_tags = xml.findall('./*[@type="PDB"]')
len(pdb_tags)

70

In [23]:
# there is one more attribute for each of these tags, which stores the pdb id
for element in pdb_tags:
    print(element.attrib)

{'type': 'PDB', 'id': '1AB2'}
{'type': 'PDB', 'id': '1AWO'}
{'type': 'PDB', 'id': '1BBZ'}
{'type': 'PDB', 'id': '1JU5'}
{'type': 'PDB', 'id': '1OPL'}
{'type': 'PDB', 'id': '1ZZP'}
{'type': 'PDB', 'id': '2ABL'}
{'type': 'PDB', 'id': '2E2B'}
{'type': 'PDB', 'id': '2F4J'}
{'type': 'PDB', 'id': '2FO0'}
{'type': 'PDB', 'id': '2G1T'}
{'type': 'PDB', 'id': '2G2F'}
{'type': 'PDB', 'id': '2G2H'}
{'type': 'PDB', 'id': '2G2I'}
{'type': 'PDB', 'id': '2GQG'}
{'type': 'PDB', 'id': '2HIW'}
{'type': 'PDB', 'id': '2HYY'}
{'type': 'PDB', 'id': '2HZ0'}
{'type': 'PDB', 'id': '2HZ4'}
{'type': 'PDB', 'id': '2HZI'}
{'type': 'PDB', 'id': '2O88'}
{'type': 'PDB', 'id': '2V7A'}
{'type': 'PDB', 'id': '3CS9'}
{'type': 'PDB', 'id': '3EG0'}
{'type': 'PDB', 'id': '3EG1'}
{'type': 'PDB', 'id': '3EG2'}
{'type': 'PDB', 'id': '3EG3'}
{'type': 'PDB', 'id': '3EGU'}
{'type': 'PDB', 'id': '3K2M'}
{'type': 'PDB', 'id': '3PYY'}
{'type': 'PDB', 'id': '3QRI'}
{'type': 'PDB', 'id': '3QRJ'}
{'type': 'PDB', 'id': '3QRK'}
{'type': '

Now we can go deeper into each of these tags to find which method was used for structure determination, the resolution, which chain the structure corresponds to, and which part of the protein (which segment of the amino-acid sequence) it covers.<br><br>
Note: if a PDB code represents a structure of 2 or more proteins bound together, each individual protein is assigned to a different chain, e.g. 5DC0 reports a structure of Fibronectin (chain A) and ABL1 (chain B).

In [24]:
c = 0
for element in xml.findall('./*[@type="PDB"]'):
    print(element.attrib)
    for child in element:
        print('\t',child.attrib)
    print('')
    c+=1
    if c==5:
        break

{'type': 'PDB', 'id': '1AB2'}
	 {'type': 'method', 'value': 'NMR'}
	 {'type': 'chains', 'value': 'A=120-220'}

{'type': 'PDB', 'id': '1AWO'}
	 {'type': 'method', 'value': 'NMR'}
	 {'type': 'chains', 'value': 'A=65-119'}

{'type': 'PDB', 'id': '1BBZ'}
	 {'type': 'method', 'value': 'X-ray'}
	 {'type': 'resolution', 'value': '1.65 A'}
	 {'type': 'chains', 'value': 'A/C/E/G=64-121'}

{'type': 'PDB', 'id': '1JU5'}
	 {'type': 'method', 'value': 'NMR'}
	 {'type': 'chains', 'value': 'C=62-122'}

{'type': 'PDB', 'id': '1OPL'}
	 {'type': 'method', 'value': 'X-ray'}
	 {'type': 'resolution', 'value': '3.42 A'}
	 {'type': 'chains', 'value': 'A/B=27-512'}
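The `chains` value packs two pieces of information into one string: the chain identifiers (separated by `/`) and the covered residue range. A minimal sketch of a helper that splits it is shown below. `parse_chains` is a hypothetical function, and it assumes a single `chains=start-end` segment as in the values printed above; any other layout would need extra handling.

```python
def parse_chains(value):
    """Split e.g. 'A/C/E/G=64-121' into (['A', 'C', 'E', 'G'], 64, 121)."""
    chain_part, range_part = value.split('=')
    start, end = range_part.split('-')
    return chain_part.split('/'), int(start), int(end)

# values copied from the output above
for value in ('A=120-220', 'A/C/E/G=64-121', 'A/B=27-512'):
    chains, start, end = parse_chains(value)
    print(chains, start, end, 'length:', end - start + 1)
```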



In [25]:
# more complex search to find only NMR structures
NMR_tags = xml.findall('./*[@type="PDB"]/*[@value="NMR"]')
for element in NMR_tags:
    print(element.tag, element.attrib)
# this search finds the tags that have an attribute called value, which is equal to "NMR",
# and that are children of a tag that has an attribute called type, which is equal to "PDB"

property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}
property {'type': 'method', 'value': 'NMR'}


In [26]:
# similarly, this search finds the tags that have an attribute called type, which is equal to "PDB",
# and that are parents of a tag that has an attribute called value, which is equal to "NMR"
PDB_tags_NMR_only = xml.findall('.//*[@value="NMR"]/..[@type="PDB"]')
for element in PDB_tags_NMR_only:
    print(element.tag, element.attrib)
# these are the parents
# the search above gave the children

dbReference {'type': 'PDB', 'id': '1AB2'}
dbReference {'type': 'PDB', 'id': '1AWO'}
dbReference {'type': 'PDB', 'id': '1JU5'}
dbReference {'type': 'PDB', 'id': '1ZZP'}
dbReference {'type': 'PDB', 'id': '6AMV'}
dbReference {'type': 'PDB', 'id': '6AMW'}
dbReference {'type': 'PDB', 'id': '6XR6'}
dbReference {'type': 'PDB', 'id': '6XR7'}
dbReference {'type': 'PDB', 'id': '6XRG'}
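The same parent selection can also be written as an explicit loop, which some readers may find easier to follow than the `..` step in the search path. The sketch below does this on a hand-made fragment that mimics three of the `dbReference` tags shown earlier. `sample` and `nmr_ids` are names introduced here.

```python
import xml.etree.ElementTree as ET

# hand-made fragment mimicking three of the dbReference tags shown above
sample = ET.fromstring("""<entry>
  <dbReference type="PDB" id="1AB2">
    <property type="method" value="NMR"/>
    <property type="chains" value="A=120-220"/>
  </dbReference>
  <dbReference type="PDB" id="1BBZ">
    <property type="method" value="X-ray"/>
    <property type="resolution" value="1.65 A"/>
    <property type="chains" value="A/C/E/G=64-121"/>
  </dbReference>
  <dbReference type="PDB" id="1AWO">
    <property type="method" value="NMR"/>
    <property type="chains" value="A=65-119"/>
  </dbReference>
</entry>""")

nmr_ids = []
for ref in sample.findall('dbReference[@type="PDB"]'):
    # look for a direct child whose value attribute is "NMR"
    if ref.find('property[@value="NMR"]') is not None:
        nmr_ids.append(ref.attrib['id'])
print(nmr_ids)
```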


In [27]:
# let’s loop over X-ray structures and get some information (pdb id, resolution, chains, sequence range)
# the break stops the loop after the first structure
for pdb_elem in xml.findall('.//*[@value="X-ray"]/..[@type="PDB"]'):
    pdb_id = pdb_elem.attrib['id']
    resolution = pdb_elem.find('./*[@type="resolution"]').attrib['value']
    chain_seq = pdb_elem.find('./*[@type="chains"]').attrib['value']
    print(pdb_id, resolution, chain_seq)
    break

1BBZ 1.65 A A/C/E/G=64-121
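The resolution is stored as text (`'1.65 A'`), so it has to be converted to a number before structures can be compared. A minimal sketch that collects X-ray structures and sorts them by resolution is given below. It runs on a hand-made fragment built from values shown earlier, and `sample` and `xray` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# hand-made fragment built from values shown earlier in this section
sample = ET.fromstring("""<entry>
  <dbReference type="PDB" id="1OPL">
    <property type="method" value="X-ray"/>
    <property type="resolution" value="3.42 A"/>
    <property type="chains" value="A/B=27-512"/>
  </dbReference>
  <dbReference type="PDB" id="1AB2">
    <property type="method" value="NMR"/>
    <property type="chains" value="A=120-220"/>
  </dbReference>
  <dbReference type="PDB" id="1BBZ">
    <property type="method" value="X-ray"/>
    <property type="resolution" value="1.65 A"/>
    <property type="chains" value="A/C/E/G=64-121"/>
  </dbReference>
</entry>""")

xray = []
for ref in sample.findall('dbReference[@type="PDB"]'):
    if ref.find('property[@value="X-ray"]') is None:
        continue                       # skip structures solved by other methods
    res_tag = ref.find('property[@type="resolution"]')
    if res_tag is None:
        continue                       # be defensive if the resolution is missing
    # '1.65 A' -> 1.65 (the unit is angstrom)
    resolution = float(res_tag.attrib['value'].split()[0])
    xray.append((resolution, ref.attrib['id']))

xray.sort()                            # lowest value (best resolution) first
print(xray)
```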


In [28]:
# look for comments on post-translational modifications (PTMs)
xml.findall('./*[@type="PTM"]')

[<Element 'comment' at 0x7f8ebe691940>,
 <Element 'comment' at 0x7f8ebe6919e0>,
 <Element 'comment' at 0x7f8ebe691a80>]

In [29]:
for PTM_elem in xml.findall('./*[@type="PTM"]'):
    print(PTM_elem[0].text)
    print('\n')
# this is more general information

Acetylated at Lys-711 by EP300 which promotes the cytoplasmic translocation.


Phosphorylation at Tyr-70 by members of the SRC family of kinases disrupts SH3 domain-based autoinhibitory interactions and intermolecular associations, such as that with ABI1, and also enhances kinase activity. Phosphorylation at Tyr-226 and Tyr-393 correlate with increased activity. DNA damage-induced activation of ABL1 requires the function of ATM and Ser-446 phosphorylation (By similarity). Phosphorylation at Ser-569 has been attributed to a CDC2-associated kinase and is coupled to cell division (By similarity). Phosphorylation at Ser-618 and Ser-619 by PAK2 increases binding to CRK and reduces binding to ABI1. Phosphorylation on Thr-735 is required for binding 14-3-3 proteins for cytoplasmic translocation. Phosphorylated by PRKDC (By similarity).


Polyubiquitinated. Polyubiquitination of ABL1 leads to degradation.




In [30]:
# this one gives a list of known PTM sites in a protein
for mod in xml.findall('./*[@type="modified residue"]'):
    mod_type = mod.attrib['description']
    position = mod.find('.//position').attrib['position'] # the position in the sequence is stored in one of the descendant tags
    print(mod_type, position)

Phosphoserine 50
Phosphotyrosine; by autocatalysis 70
Phosphotyrosine 115
Phosphotyrosine 128
Phosphotyrosine 139
Phosphotyrosine 172
Phosphotyrosine 185
Phosphotyrosine 215
Phosphotyrosine; by autocatalysis 226
Phosphoserine 229
Phosphotyrosine 253
Phosphotyrosine 257
Phosphotyrosine; by autocatalysis and SRC-type Tyr-kinases 393
Phosphotyrosine 413
Phosphoserine 446
Phosphoserine 559
Phosphoserine 569
Phosphoserine; by PAK2 618
Phosphoserine; by PAK2 619
Phosphoserine 620
Phosphoserine 659
Phosphoserine 683
N6-acetyllysine; by EP300 711
Phosphoserine 718
Phosphothreonine 735
Phosphothreonine 751
Phosphothreonine 781
Phosphothreonine 814
Phosphothreonine 823
Phosphothreonine 844
Phosphothreonine 852
Phosphoserine 855
Phosphoserine 917
Phosphoserine 977
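A list like this is often easier to read as counts per modification type. The sketch below tallies the types with `collections.Counter`, dropping the part after `;` that names the responsible enzyme. It uses a hand-made fragment with four of the residues listed above. The `feature/location/position` nesting is an assumption consistent with the `.//position` search used in the cell, and `sample` and `counts` are names chosen here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# hand-made fragment with four of the modified residues listed above
sample = ET.fromstring("""<entry>
  <feature type="modified residue" description="Phosphoserine">
    <location><position position="50"/></location>
  </feature>
  <feature type="modified residue" description="Phosphotyrosine; by autocatalysis">
    <location><position position="70"/></location>
  </feature>
  <feature type="modified residue" description="Phosphotyrosine">
    <location><position position="115"/></location>
  </feature>
  <feature type="modified residue" description="Phosphoserine; by PAK2">
    <location><position position="618"/></location>
  </feature>
</entry>""")

counts = Counter()
for mod in sample.findall('feature[@type="modified residue"]'):
    # keep only the part before ';' so that 'Phosphoserine; by PAK2' counts as 'Phosphoserine'
    base_type = mod.attrib['description'].split(';')[0].strip()
    counts[base_type] += 1
print(counts.most_common())
```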


In [31]:
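# positions of residues whose description is exactly "Phosphoserine"
# (entries such as "Phosphoserine; by PAK2" at 618 and 619 are not matched)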
for pos in xml.findall('./*[@type="modified residue"][@description="Phosphoserine"].//position'):
    print(pos.attrib['position'])

50
229
446
559
569
620
659
683
718
855
917
977
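Comparing this output with the full list of modified residues shows that positions 618 and 619 are missing: their description is `Phosphoserine; by PAK2`, and an attribute test in the search path only matches the exact string. A minimal sketch of a prefix-based filter in plain Python is shown below, on a hand-made fragment. The nesting of the `position` tag is assumed as before, and `sample`, `exact` and `prefix` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# hand-made fragment with five of the modified residues listed above
sample = ET.fromstring("""<entry>
  <feature type="modified residue" description="Phosphoserine">
    <location><position position="569"/></location>
  </feature>
  <feature type="modified residue" description="Phosphoserine; by PAK2">
    <location><position position="618"/></location>
  </feature>
  <feature type="modified residue" description="Phosphoserine; by PAK2">
    <location><position position="619"/></location>
  </feature>
  <feature type="modified residue" description="Phosphoserine">
    <location><position position="620"/></location>
  </feature>
  <feature type="modified residue" description="Phosphothreonine">
    <location><position position="735"/></location>
  </feature>
</entry>""")

exact, prefix = [], []
for mod in sample.findall('feature[@type="modified residue"]'):
    description = mod.attrib['description']
    position = int(mod.find('.//position').attrib['position'])
    if description == 'Phosphoserine':            # same behaviour as the attribute test
        exact.append(position)
    if description.startswith('Phosphoserine'):   # also catches '...; by PAK2'
        prefix.append(position)

print('exact match: ', exact)
print('prefix match:', prefix)
```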
