# Exploration

For web scraping, I am using BeautifulSoup, and I'll keep the data in a pandas DataFrame.

My first task is to work out how DrugBank stores its information. To do this, I'll load the HTML with BeautifulSoup and walk the HTML tree to see where I can isolate the information on drug targets.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

I originally used html.parser with BeautifulSoup but ran into an issue with the find_all function, so I use lxml below.

In [2]:
#I will start out by testing 'DB00619'
drug = 'DB00619'

url = 'http://www.drugbank.ca'
page = requests.get(url+'/drugs/'+drug)

soup = BeautifulSoup(page.content, 'lxml')

I want to extract the drug name and accession number from a reliable source; the page metadata is a likely candidate.

In [3]:
drug_name = soup.find('meta', {'name':'dc.title'})['content']
accession_number = soup.find('meta', {'name':'dc.identifier'})['content']

In [4]:
#print(soup.prettify())

To explore the HTML, I used the browser's DevTools on the actual page (https://www.drugbank.ca/drugs/DB00619) to navigate to the section that deals with targets. In parallel, I navigate the same HTML with BeautifulSoup to reach that section. The goal is to isolate the drug-target data and extract the target names.

After some digging around, I noticed that the 'bond-list' contains multiple 'bond card' elements.
Each 'bond card' has an id attribute, which I use as the target_id. I still need the target_name.

Searching the page for other occurrences of the target_id, I came across a link whose href attribute is '#target_id' (e.g. #BE0004743) and whose text is the target_name.

In [5]:
#navigating through html to only select target information
container_targets = soup.find('div', class_='bond-list-container targets')
bond_cards = container_targets.find_all('div', class_='bond card')
len(bond_cards)

9

From manually looking through the page, I know there are 9 different targets for DB00619 (Imatinib). This confirms that I've isolated the targets.

In [6]:
#bond cards hold information about the target. There is an id attribute with the target_id
target_ids = []
for bond_card in bond_cards:
    target_ids.append(bond_card.get('id'))
target_ids

['BE0004743',
 'BE0000453',
 'BE0001104',
 'BE0001039',
 'BE0000853',
 'BE0000852',
 'BE0001124',
 'BE0000014',
 'BE0000205']

In [7]:
target_names = []
for target_id in target_ids:
    target_names.append(soup.find('a', href=('#'+target_id)).text)
target_names

['Breakpoint cluster region protein',
 'Mast/stem cell growth factor receptor Kit',
 'RET proto-oncogene',
 'High affinity nerve growth factor receptor',
 'Macrophage colony-stimulating factor 1 receptor',
 'Platelet-derived growth factor receptor alpha',
 'Epithelial discoidin domain-containing receptor 1',
 'Tyrosine-protein kinase ABL1',
 'Platelet-derived growth factor receptor beta']
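The two lookups above (card id first, then the anchor whose href points back at that id) can be tried offline on a small fragment. The sketch below is only an illustration: `SAMPLE_HTML` is hand-written to mimic the structure described here and is not copied from DrugBank, `extract_targets` is a hypothetical helper name, and it uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption); the real DrugBank markup may differ.
SAMPLE_HTML = """
<a href="#BE0004743">Breakpoint cluster region protein</a>
<a href="#BE0000453">Mast/stem cell growth factor receptor Kit</a>
<div class="bond-list-container targets">
  <div class="bond card" id="BE0004743"></div>
  <div class="bond card" id="BE0000453"></div>
</div>
"""


def extract_targets(html):
    """Return (target_id, target_name) pairs found in an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_='bond-list-container targets')
    if container is None:
        # No targets section on this page.
        return []
    pairs = []
    for card in container.find_all('div', class_='bond card'):
        target_id = card.get('id')
        if target_id is None:
            continue
        # The name lives in a link elsewhere on the page that points at the card.
        link = soup.find('a', href='#' + target_id)
        name = link.get_text(strip=True) if link is not None else None
        pairs.append((target_id, name))
    return pairs


sample_pairs = extract_targets(SAMPLE_HTML)
print(sample_pairs)
```

Returning the id and name together as pairs keeps the two lists from drifting out of step if a link is ever missing for one of the cards.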

In [8]:
#creating a data structure to then convert to a dataframe
result = []
for i in range(0, len(target_names)):
    result.append((accession_number, drug_name, target_ids[i], target_names[i]))

In [9]:
#creating an empty full_df to later consolidate all information, and df with the current extracted information
full_df = pd.DataFrame(columns = ['accession_number', 'drug_name', 'target_id', 'target_name'])
df = pd.DataFrame(result, columns = ['accession_number', 'drug_name', 'target_id', 'target_name'])
df.head()

| | accession_number | drug_name | target_id | target_name |
|---|---|---|---|---|
| 0 | DB00619 | Imatinib | BE0004743 | Breakpoint cluster region protein |
| 1 | DB00619 | Imatinib | BE0000453 | Mast/stem cell growth factor receptor Kit |
| 2 | DB00619 | Imatinib | BE0001104 | RET proto-oncogene |
| 3 | DB00619 | Imatinib | BE0001039 | High affinity nerve growth factor receptor |
| 4 | DB00619 | Imatinib | BE0000853 | Macrophage colony-stimulating factor 1 receptor |


In [10]:
full_df = pd.concat([full_df, df], axis=0)
full_df.head()

| | accession_number | drug_name | target_id | target_name |
|---|---|---|---|---|
| 0 | DB00619 | Imatinib | BE0004743 | Breakpoint cluster region protein |
| 1 | DB00619 | Imatinib | BE0000453 | Mast/stem cell growth factor receptor Kit |
| 2 | DB00619 | Imatinib | BE0001104 | RET proto-oncogene |
| 3 | DB00619 | Imatinib | BE0001039 | High affinity nerve growth factor receptor |
| 4 | DB00619 | Imatinib | BE0000853 | Macrophage colony-stimulating factor 1 receptor |


# Web Scraping for predefined set

In [11]:
#Creating a function and bringing all the previous steps in
drug_target = pd.DataFrame(columns = ['accession_number', 'drug_name', 'target_id', 'target_name'])
def find_drug_target(drug):
    global drug_target
    #id of target (i.e. 'BE0004743')
    target_ids = []
    #name of target(i.e. Breakpoint cluster region protein)
    target_names = []
    result = []
    url = 'http://www.drugbank.ca'
    page = requests.get(url+'/drugs/'+drug)
    soup = BeautifulSoup(page.content, 'lxml')
    
    drug_name = soup.find('meta', {'name':'dc.title'})['content']
    accession_number = soup.find('meta', {'name':'dc.identifier'})['content']
    
    #Navigating through the html to find the targets for the specific drug.
    container_targets = soup.find('div', class_='bond-list-container targets')
    #if there are no targets for the drug, skip it and continue with the next one
    if container_targets is None:
        return
    bond_cards = container_targets.find_all('div', class_='bond card')
    
    #Iterating through the available targets and collecting the ids.
    for bond_card in bond_cards:
        target_ids.append(bond_card.get('id'))
    #Iterate through the ids, and find the corresponding name
    for target_id in target_ids:
        target_names.append(soup.find('a', href=('#'+target_id)).text)
    #setting up a list to convert to dataframe.
    for i in range(0, len(target_names)):
        result.append((accession_number, drug_name, target_ids[i], target_names[i]))
    
    #Converting to a dataframe and appending it vertically to the global drug_target
    df = pd.DataFrame(result, columns = ['accession_number', 'drug_name', 'target_id', 'target_name'])
    drug_target = pd.concat([drug_target, df], axis=0)
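The function above grows a global DataFrame with `pd.concat` on every call. One alternative, sketched here under the assumption that the scraping function were changed to return its `result` list instead (that change is not part of this notebook), is to gather the rows for all drugs and build the DataFrame once at the end. `rows_to_frame` and `COLUMNS` are hypothetical names, and the sample rows are taken from the outputs shown in this notebook.

```python
import pandas as pd

COLUMNS = ['accession_number', 'drug_name', 'target_id', 'target_name']


def rows_to_frame(rows_per_drug):
    """Flatten per-drug row lists and build a single DataFrame from them."""
    all_rows = [row for rows in rows_per_drug for row in rows]
    return pd.DataFrame(all_rows, columns=COLUMNS)


# One list of row tuples per drug; an empty list stands for a drug without targets.
rows_per_drug = [
    [
        ('DB00619', 'Imatinib', 'BE0004743', 'Breakpoint cluster region protein'),
        ('DB00619', 'Imatinib', 'BE0000453', 'Mast/stem cell growth factor receptor Kit'),
    ],
    [],
    [('DB01048', 'Abacavir', 'BE0004136', 'Reverse transcriptase/RNaseH')],
]

combined = rows_to_frame(rows_per_drug)
print(combined)
```

Building the frame once avoids copying the accumulated data on every iteration and yields a clean 0..n-1 index without a separate `reset_index` call.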

In [12]:
predefined_set = ('DB00619', 'DB01048', 'DB14093', 'DB00173', 'DB00734', 'DB00218', 'DB05196', 'DB09095', 'DB01053', 'DB00274')

for drug in predefined_set:
    find_drug_target(drug)

In [13]:
#We have the drug and target relation in the dataframe for the requested drug ids.
drug_target = drug_target.reset_index(drop=True)
drug_target

| | accession_number | drug_name | target_id | target_name |
|---|---|---|---|---|
| 0 | DB00619 | Imatinib | BE0004743 | Breakpoint cluster region protein |
| 1 | DB00619 | Imatinib | BE0000453 | Mast/stem cell growth factor receptor Kit |
| 2 | DB00619 | Imatinib | BE0001104 | RET proto-oncogene |
| 3 | DB00619 | Imatinib | BE0001039 | High affinity nerve growth factor receptor |
| 4 | DB00619 | Imatinib | BE0000853 | Macrophage colony-stimulating factor 1 receptor |
| 5 | DB00619 | Imatinib | BE0000852 | Platelet-derived growth factor receptor alpha |
| 6 | DB00619 | Imatinib | BE0001124 | Epithelial discoidin domain-containing receptor 1 |
| 7 | DB00619 | Imatinib | BE0000014 | Tyrosine-protein kinase ABL1 |
| 8 | DB00619 | Imatinib | BE0000205 | Platelet-derived growth factor receptor beta |
| 9 | DB01048 | Abacavir | BE0004136 | Reverse transcriptase/RNaseH |


In [None]:
#converting to csv format
drug_target.to_csv(r'drug_target.csv')

# Querying all drugs within DrugBank

Here I scrape all of the accession numbers from DrugBank.

In [14]:
#Manually looked up total number of approved drugs. Total drugs:11354 Total pages:455 (25 per page)
drug_accession_numbers = []
#setting the range to a higher number to leave room for drugs added in the future
for i in range(1, 1000):
    url = 'https://www.drugbank.ca/drugs?approved=0&c=name&ca=0&d=up&eu=0&experimental=0&illicit=0&investigational=0&nutraceutical=0&page='+str(i)+'&us=0&withdrawn=0'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    
    #the link contains href of the accession number (i.e. <a href="/drugs/DB01557">α-Methylfentanyl</a>)
    links = soup.find_all("a", href=lambda href: href and "/drugs/" in href)
    #when there are no drugs on the page, it will exit the loop
    if len(links) == 0:
        break
    for a in links:
        drug_accession_numbers.append(a['href'][-7:])

In [15]:
len(drug_accession_numbers)

11354
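The slice `a['href'][-7:]` takes the last seven characters of any matching link, whatever they are. A stricter variant is sketched below. It assumes that accession numbers are always 'DB' followed by five digits, as in every example in this notebook; `accession_from_href` is a hypothetical helper, and the last two sample hrefs are made up to show what gets rejected.

```python
import re

# Assumption: an accession number is 'DB' followed by exactly five digits.
ACCESSION_RE = re.compile(r'/drugs/(DB\d{5})$')


def accession_from_href(href):
    """Return the accession number at the end of a drug link, or None."""
    match = ACCESSION_RE.search(href)
    return match.group(1) if match else None


sample_hrefs = [
    '/drugs/DB01557',
    '/drugs/DB00619',
    '/drugs/DB00619',               # duplicate link to the same drug
    '/drugs/hypothetical-subpage',  # made-up non-drug link
    '/drugs?page=2',                # made-up pagination link
]

found = [accession_from_href(href) for href in sample_hrefs]
# Drop non-matches and duplicates while keeping first-seen order.
unique_accessions = list(dict.fromkeys(a for a in found if a is not None))
print(unique_accessions)
```

Since the loop above collected exactly 11354 entries, matching the count looked up manually, the simple slice appears to have been enough for this run; the pattern only adds a safeguard if the listing markup changes.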

Having pulled all of the drugs available in the database, I'm now ready to rerun my function for each individual drug.

In [16]:
#resetting the drug_target dataframe to be empty.
drug_target = pd.DataFrame(columns = ['accession_number', 'drug_name', 'target_id', 'target_name'])
for drug in drug_accession_numbers:
    find_drug_target(drug)
drug_target = drug_target.reset_index(drop=True)
drug_target.info()

TypeError: 'NoneType' object is not subscriptable

After a trial run, it seems that some of the pages (e.g. https://www.drugbank.ca/drugs/DB14136) are incomplete. At first glance, I think this page was created in a previous version of DrugBank and will be updated in the future. Either way, the code would need further changes to handle this other page layout, so I'll stop querying the drugs here.

I'll move on to using Flask to set up a local API (filename 'api.py'). It will read the CSV, accept incoming queries by accession number, and return the targets associated with the drug.

For instructions on setting it up locally, refer to the README.
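The TypeError comes from subscripting the `None` that `soup.find` returns when an expected tag is missing. One way to keep a full run going, sketched here as a possibility rather than something this notebook does, is to catch the error per drug and record which accession numbers failed. `scrape_all` and `fake_scraper` are hypothetical names; `fake_scraper` stands in for `find_drug_target` so the example runs without network access.

```python
def scrape_all(accession_numbers, scrape_one):
    """Call scrape_one for each accession number; return those that raised TypeError."""
    failed = []
    for accession in accession_numbers:
        try:
            scrape_one(accession)
        except TypeError:
            # A lookup returned None and was subscripted, so the page
            # does not have the markup the scraper expects.
            failed.append(accession)
    return failed


def fake_scraper(accession):
    """Stand-in for find_drug_target: fails the same way on one chosen ID."""
    meta = {'content': 'Imatinib'} if accession != 'DB14136' else None
    return meta['content']


failed = scrape_all(['DB00619', 'DB14136', 'DB01048'], fake_scraper)
print(failed)
```

The list of failed IDs then gives a concrete set of pages to inspect when adapting the parser to the other page layout.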
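The lookup such an API has to perform can be sketched with the standard library alone. This is not the contents of 'api.py', which is not shown here, and the Flask routing layer is left out on purpose. `build_index` and `SAMPLE_CSV` are hypothetical names, and the sample text mirrors the column layout that `to_csv` produces (an unnamed index column followed by the four data columns).

```python
import csv
import io
from collections import defaultdict

# Small stand-in for drug_target.csv, using rows shown earlier in this notebook.
SAMPLE_CSV = """,accession_number,drug_name,target_id,target_name
0,DB00619,Imatinib,BE0004743,Breakpoint cluster region protein
1,DB00619,Imatinib,BE0000453,Mast/stem cell growth factor receptor Kit
9,DB01048,Abacavir,BE0004136,Reverse transcriptase/RNaseH
"""


def build_index(csv_file):
    """Map each accession number to the list of its targets."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        index[row['accession_number']].append(
            {'target_id': row['target_id'], 'target_name': row['target_name']}
        )
    return dict(index)


target_index = build_index(io.StringIO(SAMPLE_CSV))

# An unknown accession number yields an empty list rather than an error.
print(target_index.get('DB01048', []))
print(target_index.get('DB99999', []))
```

Building the index once at startup means each query is a dictionary lookup instead of a scan over the whole file.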