Create a data frame with information about the proteins that individual drugs interact with. These proteins are called targets. The data frame should contain at least:
- the DrugBank ID of the target,
- the external database the record comes from (*source*, e.g. Swiss-Prot),
- the identifier in that external database,
- the polypeptide name,
- the name of the gene encoding the polypeptide,
- the GenAtlas ID of the gene,
- the chromosome number,
- the cellular location.

In [29]:
import pandas as pd
import lxml
import xml.etree.ElementTree as ET
import networkx as nx
import matplotlib.pyplot as plt
from pprint import pprint

In [30]:
path = 'data/drugbank_partial.xml'
ns = {'db': 'http://www.drugbank.ca'}

In [31]:
tree = ET.parse(path)
root = tree.getroot()

In [32]:
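def unwrap(field, node: ET.Element):
    # Text of the child element `field`; None if the node or the child is missing.
    child = None if node is None else node.find(f"db:{field}", ns)
    return getattr(child, "text", None)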
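
A minimal sketch of how the `getattr(..., "text", None)` idiom in `unwrap` behaves. It assumes a tiny hand-written XML fragment (the element values are placeholders, not DrugBank data): `find` returns `None` for a child that does not exist, and `getattr` turns that into a `None` result instead of an `AttributeError`.

```python
import xml.etree.ElementTree as ET

ns = {"db": "http://www.drugbank.ca"}

# Hypothetical fragment written by hand for illustration only.
xml_text = """
<target xmlns="http://www.drugbank.ca">
  <id>BE_DEMO</id>
  <name>Example protein</name>
</target>
"""
node = ET.fromstring(xml_text)


def unwrap(field, node):
    # find() yields None when the child is absent; getattr then falls back to None.
    return getattr(node.find(f"db:{field}", ns), "text", None)


present = unwrap("id", node)        # child exists, so its text is returned
missing = unwrap("organism", node)  # child absent, so the result is None
print(present, missing)
```

Note that the idiom only guards against a missing child. If `node` itself is `None`, the call to `node.find` still fails, so the caller has to check for that case.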

In [33]:
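def unwrap_attrib(attrib, node: ET.Element):
    # Value of the XML attribute `attrib`; None if the node or the attribute is missing.
    return None if node is None else node.attrib.get(attrib)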

In [40]:
def targets_df(drug_id):
    # One row per target of the given drug; fields missing in the XML become None/NaN.
    data = []
    for target in root.findall(f"db:drug[db:drugbank-id='{drug_id}']/db:targets/db:target", ns):
        polypeptide = target.find('db:polypeptide', ns)
        if polypeptide is None:
            # Target without a polypeptide record: keep only its DrugBank ID.
            data.append({"ID": unwrap("id", target)})
            continue
        # Not every polypeptide has a GenAtlas entry, so this may be None.
        genatlas = polypeptide.find("db:external-identifiers/db:external-identifier[db:resource='GenAtlas']", ns)
        data.append({
            "ID": unwrap("id", target),
            "Source": unwrap_attrib("source", polypeptide),
            "Source ID": unwrap_attrib("id", polypeptide),
            "Polypeptide Name": unwrap("name", polypeptide),
            "Gene Name": unwrap("gene-name", polypeptide),
            "GenAtlas ID": unwrap("identifier", genatlas) if genatlas is not None else None,
            "Chromosome No": unwrap("chromosome-location", polypeptide),
            "Cellular Loc": unwrap("cellular-location", polypeptide),
        })

    return pd.DataFrame(data)
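
The GenAtlas lookup in `targets_df` relies on an ElementTree path predicate, `[db:resource='GenAtlas']`, which keeps only the `external-identifier` element whose `resource` child has that text. The sketch below isolates this step. The XML fragment and the helper name `external_id` are made up for illustration, and the identifier values are placeholders.

```python
import xml.etree.ElementTree as ET

ns = {"db": "http://www.drugbank.ca"}

# Hand-written fragment; the identifier values are placeholders, not real records.
xml_text = """
<polypeptide xmlns="http://www.drugbank.ca" id="P_DEMO" source="Swiss-Prot">
  <external-identifiers>
    <external-identifier>
      <resource>GenAtlas</resource>
      <identifier>DEMO1</identifier>
    </external-identifier>
    <external-identifier>
      <resource>GenBank Gene Database</resource>
      <identifier>X00000</identifier>
    </external-identifier>
  </external-identifiers>
</polypeptide>
"""
polypeptide = ET.fromstring(xml_text)


def external_id(polypeptide, resource):
    """Identifier for the given external resource, or None if there is no such entry."""
    path = f"db:external-identifiers/db:external-identifier[db:resource='{resource}']"
    entry = polypeptide.find(path, ns)
    if entry is None:
        return None
    return entry.findtext("db:identifier", default=None, namespaces=ns)


print(external_id(polypeptide, "GenAtlas"))
print(external_id(polypeptide, "UniProtKB"))
```

Returning `None` for an absent resource keeps the row in the data frame with an empty cell rather than raising an error halfway through the loop.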

In [43]:
targets_df("DB00002")

,ID,Source,Source ID,Polypeptide Name,Gene Name,GenAtlas ID,Chromosome No,Cellular Loc
0,BE0000767,Swiss-Prot,P00533,Epidermal growth factor receptor,EGFR,EGFR,7,Cell membrane
1,BE0000901,Swiss-Prot,O75015,Low affinity immunoglobulin gamma Fc region re...,FCGR3B,FCGR3B,1,Cell membrane
2,BE0002094,Swiss-Prot,P02745,Complement C1q subcomponent subunit A,C1QA,C1QA,1,Secreted
3,BE0002095,Swiss-Prot,P02746,Complement C1q subcomponent subunit B,C1QB,C1QB,1,Secreted
4,BE0002096,Swiss-Prot,P02747,Complement C1q subcomponent subunit C,C1QC,C1QC,1,Secreted
5,BE0002097,Swiss-Prot,P08637,Low affinity immunoglobulin gamma Fc region re...,FCGR3A,FCGR3A,1,Cell membrane
6,BE0000710,Swiss-Prot,P12314,High affinity immunoglobulin gamma Fc receptor I,FCGR1A,FCGR1A,1,Cell membrane
7,BE0002098,Swiss-Prot,P12318,Low affinity immunoglobulin gamma Fc region re...,FCGR2A,FCGR2A,1,Cell membrane
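
The call above covers a single drug, while the task statement asks about the targets of individual drugs in general. One possible way to build a single table for all drugs is to iterate over every `drug` element and record the drug's ID next to each target. The sketch below assumes a hand-written miniature of the DrugBank layout (all IDs and names are placeholders), a hypothetical helper `text_of`, and that the first `drugbank-id` child is the one of interest. It only collects a subset of the columns.

```python
import xml.etree.ElementTree as ET

import pandas as pd

ns = {"db": "http://www.drugbank.ca"}

# Hand-written, hypothetical miniature of the DrugBank layout (not real data).
xml_text = """
<drugbank xmlns="http://www.drugbank.ca">
  <drug>
    <drugbank-id primary="true">DB_A</drugbank-id>
    <targets>
      <target>
        <id>BE_1</id>
        <polypeptide id="P_1" source="Swiss-Prot">
          <name>Protein one</name>
          <gene-name>GENE1</gene-name>
        </polypeptide>
      </target>
    </targets>
  </drug>
  <drug>
    <drugbank-id primary="true">DB_B</drugbank-id>
    <targets>
      <target>
        <id>BE_2</id>
      </target>
    </targets>
  </drug>
</drugbank>
"""
demo_root = ET.fromstring(xml_text)


def text_of(node, field):
    """Text of a child element, or None if the node or the child is missing."""
    if node is None:
        return None
    child = node.find(f"db:{field}", ns)
    return None if child is None else child.text


rows = []
for drug in demo_root.findall("db:drug", ns):
    drug_id = text_of(drug, "drugbank-id")  # first drugbank-id child (assumption)
    for target in drug.findall("db:targets/db:target", ns):
        polypeptide = target.find("db:polypeptide", ns)  # None for BE_2
        rows.append({
            "Drug ID": drug_id,
            "ID": text_of(target, "id"),
            "Source": None if polypeptide is None else polypeptide.get("source"),
            "Gene Name": text_of(polypeptide, "gene-name"),
        })

all_targets = pd.DataFrame(rows)
print(all_targets)
```

Keeping the drug ID in its own column means the same target can appear once per drug that binds it, which is what a later join or group-by on drugs would need.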
