## Module 2 

**Task description:**

Please use your favourite programming language (for instance shell scripts, Python, or R) and the APIs (Application Programming Interfaces) of databases to perform the following operations. Submit your code.

1. Retrieve all approved drugs from the ChEMBL database and sort them by approval year and name (a [Python example is here](https://github.com/chembl/chembl_webresource_client); documentation for the ChEMBL API can be found [here](https://www.ebi.ac.uk/chembl/api/data/docs));
  
2. For each approved drug since 2014 that you identified in step (1), retrieve a list of UniProt accession numbers, namely protein targets associated with the drug;

3.  For each protein with a UniProt accession number that you identified in step (2), retrieve **UniProt keywords** associated with it. You can use the UniProt API, documented [here](https://www.ebi.ac.uk/proteins/api/doc/#!/proteins/search). Python and R clients are also available.

## 1. Retrieve and sort all approved drugs from the ChEMBL DB

In [1]:
from chembl_webresource_client.new_client import new_client
import pandas as pd

molecule = new_client.molecule
approved_drugs = molecule.filter(max_phase=4).order_by('first_approval','pref_name')

# convert to a DataFrame for better display
df = pd.DataFrame(approved_drugs[0:20])  # only the first 20 drugs; approved_drugs holds the full result set

In [2]:
df[['molecule_chembl_id', 'pref_name', 'first_approval', 'molecule_type']]

| | molecule_chembl_id | pref_name | first_approval | molecule_type |
|---|---|---|---|---|
| 0 | CHEMBL449 | BUTABARBITAL | 1939 | Small molecule |
| 1 | CHEMBL1200982 | BUTABARBITAL SODIUM | 1939 | Small molecule |
| 2 | CHEMBL1200542 | DESOXYCORTICOSTERONE ACETATE | 1939 | Small molecule |
| 3 | CHEMBL821 | GUANIDINE | 1939 | Small molecule |
| 4 | CHEMBL1200728 | GUANIDINE HYDROCHLORIDE | 1939 | Small molecule |
| 5 | CHEMBL1201657 | HEPARIN SODIUM | 1939 | Oligosaccharide |
| 6 | CHEMBL90 | HISTAMINE | 1939 | Small molecule |
| 7 | CHEMBL3989520 | HISTAMINE PHOSPHATE | 1939 | Small molecule |
| 8 | CHEMBL700 | SULFAPYRIDINE | 1939 | Small molecule |
| 9 | CHEMBL1370561 | AMINOPHYLLINE | 1940 | Small molecule |

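The `order_by('first_approval', 'pref_name')` call asks the server to sort. As a rough client-side equivalent, the sketch below sorts plain dictionaries with the standard library. Three sample records are copied from the output above; the fourth is made up. Placing records without an approval year last is an assumption of this sketch, not documented ChEMBL behaviour.

```python
# Sample records copied from the output above, shuffled on purpose.
records = [
    {"pref_name": "GUANIDINE", "first_approval": 1939},
    {"pref_name": "AMINOPHYLLINE", "first_approval": 1940},
    {"pref_name": "BUTABARBITAL", "first_approval": 1939},
    {"pref_name": "HYPOTHETICAL DRUG", "first_approval": None},  # made-up row
]


def approval_sort_key(record):
    """Sort by approval year, then by name; records without a year go last."""
    year = record.get("first_approval")
    name = record.get("pref_name") or ""
    return (year is None, year or 0, name)


sorted_records = sorted(records, key=approval_sort_key)
for record in sorted_records:
    print(record["first_approval"], record["pref_name"])
```

The tuple key keeps the two sort levels explicit, and the leading boolean avoids comparing `None` with integers.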

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 36 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   atc_classifications   20 non-null     object 
 1   availability_type     20 non-null     int64  
 2   biotherapeutic        0 non-null      object 
 4   chebi_par_id          11 non-null     float64
 5   chemical_probe        20 non-null     int64  
 6   chirality             20 non-null     int64  
 7   cross_references      20 non-null     object 
 8   dosed_ingredient      20 non-null     bool   
 9   first_approval        20 non-null     int64  
 10  first_in_class        20 non-null     int64  
 11  helm_notation         0 non-null      object 
 12  indication_class      16 non-null     object 
 13  inorganic_flag        20 non-null     int64  
 14  max_phase             20 non-null     object 
 15  molecule_chembl_id    20 non-null     object 
 16  molecule_hierarchy    20 

## 2. Retrieve a list of UniProt accession numbers

In [4]:
approved_drugs_since_2014 = approved_drugs.filter(first_approval__gte=2014)

pd.DataFrame(approved_drugs_since_2014[0:20])[['molecule_chembl_id', 'pref_name', 'first_approval', 'molecule_type']]

| | molecule_chembl_id | pref_name | first_approval | molecule_type |
|---|---|---|---|---|
| 0 | CHEMBL441738 | AFAMELANOTIDE | 2014 | Protein |
| 1 | CHEMBL4297213 | AFAMELANOTIDE ACETATE | 2014 | Protein |
| 2 | CHEMBL2107841 | ALBIGLUTIDE | 2014 | Protein |
| 3 | CHEMBL514800 | APREMILAST | 2014 | Small molecule |
| 4 | CHEMBL2105735 | ASUNAPREVIR | 2014 | Small molecule |
| 5 | CHEMBL256997 | ATALUREN | 2014 | Small molecule |
| 6 | CHEMBL408513 | BELINOSTAT | 2014 | Small molecule |
| 7 | CHEMBL1742992 | BLINATUMOMAB | 2014 | Antibody |
| 8 | CHEMBL2103872 | CEFTOLOZANE | 2014 | Small molecule |
| 9 | CHEMBL1213250 | CEFTOLOZANE SULFATE | 2014 | Small molecule |


In [5]:
# request only the ID field to keep the responses small
drug_ids = [record['molecule_chembl_id'] for record in approved_drugs_since_2014.only(['molecule_chembl_id'])]
drug_ids[:10]

['CHEMBL441738',
 'CHEMBL4297213',
 'CHEMBL2107841',
 'CHEMBL514800',
 'CHEMBL2105735',
 'CHEMBL256997',
 'CHEMBL408513',
 'CHEMBL1742992',
 'CHEMBL2103872',
 'CHEMBL1213250']

In [6]:
from chembl_webresource_client.new_client import new_client

# In 2014, UniProt introduced 10-character accession numbers to accommodate the growing number of entries

# Lookup path: molecule -> mechanism -> target
# The molecule_chembl_id values collected above are first mapped to their
# mechanisms of action via the .mechanism endpoint. Each mechanism names a target,
# and the target's components carry the UniProt accession numbers.

target = new_client.target

uniprot_ids_all = {}
counter = 0

for drug in drug_ids:
    # ChEMBL molecule -> mechanism
    mechanisms = new_client.mechanism.filter(molecule_chembl_id=drug)

    for mechanism in mechanisms:
        target_id = mechanism['target_chembl_id']  # can be None!

        if not target_id:
            print(f"No target id found in DB for {drug}")
            continue  # skip, so no stale or undefined target list is used below

        # ChEMBL target -> UniProt
        target_data = target.get(target_id)
        uniprots_target = [data['accession'] for data in target_data['target_components']]

        # NOTE: for a drug with several mechanisms, only the targets of the last one are kept
        uniprot_ids_all[drug] = uniprots_target

        if counter < 20:
            print(f"Drug: {drug},  UniProt: {uniprots_target}")
            counter += 1

Drug: CHEMBL441738,  UniProt: ['Q01726']
Drug: CHEMBL4297213,  UniProt: ['Q01726']
Drug: CHEMBL2107841,  UniProt: ['P43220']
Drug: CHEMBL514800,  UniProt: ['P27815', 'Q08499', 'Q07343', 'Q08493']
Drug: CHEMBL2105735,  UniProt: ['D2K2A8', 'A3EZI9']
Drug: CHEMBL256997,  UniProt: ['P08865', 'P42677', 'P62753', 'P15880', 'P23396', 'P61247', 'P62701', 'P22090', 'P46782', 'P62081', 'P62241', 'P46781', 'P46783', 'P62280', 'P25398', 'P62277', 'P62263', 'P62841', 'P62244', 'P62249', 'P08708', 'P62269', 'P39019', 'P60866', 'P63220', 'P62266', 'P62847', 'P62851', 'P62854', 'P62979', 'P62857', 'P62273', 'P62861', 'P39023', 'P36578', 'P46777', 'Q02878', 'P18124', 'P62424', 'P62917', 'P32969', 'P27635', 'P62906', 'P62913', 'P30050', 'P26373', 'P40429', 'P50914', 'P61313', 'P18621', 'Q07020', 'Q02543', 'P84098', 'P46778', 'P35268', 'P62829', 'P62750', 'P83731', 'P61254', 'P61353', 'P46776', 'P46779', 'P47914', 'P62888', 'P62899', 'P62910', 'P49207', 'P42766', 'P18077', 'Q9Y3U8', 'P83881', 'P61927', '

In [7]:
print(f"Initial number of drugs: {len(drug_ids)}")
print(f"Number of drugs after UniProt accession numbers retrieval: {len(uniprot_ids_all)}")

Initial number of drugs: 722
Number of drugs after UniProt accession numbers retrieval: 566

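The drop from 722 to 566 drugs is what the loop would produce for drugs that have no mechanism with a target, since they never receive an entry. The loop also assigns `uniprot_ids_all[drug]` once per mechanism, so a later mechanism replaces an earlier one. A minimal sketch of merging targets across mechanisms follows. The records are hypothetical stand-ins for the API responses (made-up IDs, not ChEMBL data), and `accessions_by_drug` is a name introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for the mechanism and target responses (made-up IDs).
mechanisms = [
    {"molecule_chembl_id": "DRUG_A", "target_chembl_id": "TARGET_1"},
    {"molecule_chembl_id": "DRUG_A", "target_chembl_id": "TARGET_2"},
    {"molecule_chembl_id": "DRUG_B", "target_chembl_id": None},
]
target_components = {
    "TARGET_1": ["ACC_1", "ACC_2"],
    "TARGET_2": ["ACC_2", "ACC_3"],
}

accessions_by_drug = defaultdict(list)
for mechanism in mechanisms:
    target_id = mechanism["target_chembl_id"]
    if target_id is None:
        continue  # mechanism without a target: nothing to record
    drug = mechanism["molecule_chembl_id"]
    for accession in target_components.get(target_id, []):
        # append instead of overwrite, skipping accessions already seen
        if accession not in accessions_by_drug[drug]:
            accessions_by_drug[drug].append(accession)

print(dict(accessions_by_drug))
```

In this sketch `DRUG_B` gets no entry at all, which mirrors how drugs without a target fall out of the count.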

In [9]:
# a drug can act on multiple targets, so flatten the per-drug lists
# (the same accession can occur for several drugs; this list is not deduplicated)
uniprot_ids = [item for sublist in uniprot_ids_all.values() for item in sublist if item is not None]
print(f"Number of UniProt accession numbers (duplicates included): {len(uniprot_ids)}")


Number of UniProt accession numbers (duplicates included): 1371

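The flattened list can contain the same accession more than once; in the output above, Q01726 appears for both afamelanotide entries. A small sketch of deduplicating while preserving first-seen order is shown below. The mapping is a subset of the printed results, with one `None` added purely for illustration, and `uniprot_ids_sample` is a name introduced here.

```python
# Subset of the drug -> accession mapping printed earlier.
uniprot_ids_sample = {
    "CHEMBL441738": ["Q01726"],
    "CHEMBL4297213": ["Q01726"],
    "CHEMBL2107841": ["P43220"],
    "CHEMBL2105735": ["D2K2A8", "A3EZI9", None],  # None added for illustration
}

# Flatten the per-drug lists and drop missing accessions.
flat = [acc for accs in uniprot_ids_sample.values() for acc in accs if acc is not None]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique = list(dict.fromkeys(flat))

print(len(flat), len(unique))
```

Looping over the deduplicated list in step 3 would also avoid requesting the same UniProt entry repeatedly.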

## 3. Retrieve UniProt keywords

In [10]:
from bioservices import UniProt
# source: https://widdowquinn.github.io/2018-03-06-ibioic/02-sequence_databases/07-uniprot_programming.html 
u = UniProt()

def get_uniprot_keywords(accession):
    try:
        # retrieve entry in JSON format
        entry = u.retrieve(accession, frmt="json")
        
        # check if entry is a dictionary before using get()
        if isinstance(entry, dict):
            return entry.get("keywords", [])
        else:
            print(f"Warning: Could not retrieve keywords for {accession}. Response type: {type(entry)}")
            return []
    except Exception as e:
        print(f"Error retrieving data for {accession}: {str(e)}")
        return []

i=0
uniprot_keywords = {}

# Keywords classify different aspects of a protein (function, location, PTMs, ...),
# so a single UniProt entry usually carries several keywords from several categories

for idx in uniprot_ids:
    item = get_uniprot_keywords(idx)
    uniprot_keywords[idx] = item
    
    if i<10:
        print("UniProt:", idx, "Keywords:", item, "\n")  
        i+=1


UniProt: Q01726 Keywords: [{'id': 'KW-0002', 'category': 'Technical term', 'name': '3D-structure'}, {'id': 'KW-1003', 'category': 'Cellular component', 'name': 'Cell membrane'}, {'id': 'KW-0225', 'category': 'Disease', 'name': 'Disease variant'}, {'id': 'KW-0297', 'category': 'Molecular function', 'name': 'G-protein coupled receptor'}, {'id': 'KW-0325', 'category': 'PTM', 'name': 'Glycoprotein'}, {'id': 'KW-0449', 'category': 'PTM', 'name': 'Lipoprotein'}, {'id': 'KW-0472', 'category': 'Cellular component', 'name': 'Membrane'}, {'id': 'KW-0564', 'category': 'PTM', 'name': 'Palmitate'}, {'id': 'KW-1267', 'category': 'Technical term', 'name': 'Proteomics identification'}, {'id': 'KW-0675', 'category': 'Molecular function', 'name': 'Receptor'}, {'id': 'KW-1185', 'category': 'Technical term', 'name': 'Reference proteome'}, {'id': 'KW-0807', 'category': 'Molecular function', 'name': 'Transducer'}, {'id': 'KW-0812', 'category': 'Domain', 'name': 'Transmembrane'}, {'id': 'KW-1133', 'category'

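Each keyword record carries an `id`, a `category` and a `name`, so the raw list can be condensed into keyword names per category. The sketch below does this for a few of the Q01726 records copied from the output above; `group_keywords` is a helper introduced here for illustration, not part of bioservices.

```python
from collections import defaultdict

# A few keyword records for Q01726, copied from the output above.
keywords = [
    {"id": "KW-0002", "category": "Technical term", "name": "3D-structure"},
    {"id": "KW-1003", "category": "Cellular component", "name": "Cell membrane"},
    {"id": "KW-0297", "category": "Molecular function", "name": "G-protein coupled receptor"},
    {"id": "KW-0472", "category": "Cellular component", "name": "Membrane"},
]


def group_keywords(keyword_records):
    """Return a {category: [keyword names]} mapping for one UniProt entry."""
    grouped = defaultdict(list)
    for record in keyword_records:
        grouped[record.get("category", "Unknown")].append(record["name"])
    return dict(grouped)


keywords_by_category = group_keywords(keywords)
for category, names in keywords_by_category.items():
    print(f"{category}: {', '.join(names)}")
```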
In [12]:
uniprot_keywords["P08865"]

[{'id': 'KW-0002', 'category': 'Technical term', 'name': '3D-structure'},
 {'id': 'KW-0007', 'category': 'PTM', 'name': 'Acetylation'},
 {'id': 'KW-1003', 'category': 'Cellular component', 'name': 'Cell membrane'},
 {'id': 'KW-0963', 'category': 'Cellular component', 'name': 'Cytoplasm'},
 {'id': 'KW-0903',
  'category': 'Technical term',
  'name': 'Direct protein sequencing'},
 {'id': 'KW-0225', 'category': 'Disease', 'name': 'Disease variant'},
 {'id': 'KW-1183',
  'category': 'Molecular function',
  'name': 'Host cell receptor for virus entry'},
 {'id': 'KW-0945',
  'category': 'Biological process',
  'name': 'Host-virus interaction'},
 {'id': 'KW-1017', 'category': 'PTM', 'name': 'Isopeptide bond'},
 {'id': 'KW-0472', 'category': 'Cellular component', 'name': 'Membrane'},
 {'id': 'KW-0539', 'category': 'Cellular component', 'name': 'Nucleus'},
 {'id': 'KW-0597', 'category': 'PTM', 'name': 'Phosphoprotein'},
 {'id': 'KW-1267',
  'category': 'Technical term',
  'name': 'Proteomics id

In [15]:
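uniprot_keywords["ENSG00000205571"]
# ENSG00000205571 has the format of an Ensembl gene ID rather than a UniProt accession,
# so retrieval by accession returns nothing.
# https://www.uniprot.org/uniprotkb?query=ENSG00000205571 does find entries, presumably because
# the website runs a text search that also matches cross-references; the empty result is
# therefore probably not a problem with the bioservices client.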

[]
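
Identifiers of this kind can be screened before they are sent to UniProt. The sketch below is one possible check, not part of the original workflow. The UniProt pattern follows the accession format published in the UniProt help pages; the Ensembl pattern (`ENSG` plus 11 digits, as in the identifier above) and the helper name `classify_identifier` are assumptions made for this example.

```python
import re

# UniProt accession format as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed pattern for human Ensembl gene IDs such as ENSG00000205571.
ENSEMBL_GENE = re.compile(r"ENSG[0-9]{11}")


def classify_identifier(identifier):
    """Label an identifier as 'uniprot', 'ensembl_gene' or 'unknown'."""
    if UNIPROT_ACCESSION.fullmatch(identifier):
        return "uniprot"
    if ENSEMBL_GENE.fullmatch(identifier):
        return "ensembl_gene"
    return "unknown"


for identifier in ["Q01726", "D2K2A8", "ENSG00000205571"]:
    print(identifier, classify_identifier(identifier))
```

Identifiers labelled `ensembl_gene` could then be skipped or mapped to UniProt accessions separately, instead of being stored with an empty keyword list.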