# HackMed 21: FDA Drug Shortage Web Scraper
Start: 24.04.2021 | author Camillo Moschner (cm967)

Source data: https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm

## Import statements

In [1]:
import numpy as np
import pandas as pd
import re
import pickle

In [2]:
import requests
from bs4 import BeautifulSoup
from pandas.io.html import read_html

In [3]:
from HackMed21_0_helper import find_name
from HackMed21_0_helper import find_status
from HackMed21_0_helper import find_CAS_number_and_ChEBI

Additional dependencies:

In [4]:
# !pip install lxml
# !pip install html5lib

## Wikipedia Explorer

In [5]:
wikipedia_link = "https://en.wikipedia.org/wiki/"
drug_name = "Acetazolamide"  # taken directly from the FDA Drug Shortage website
drug_wikipedia_link = wikipedia_link + drug_name
drug_page = requests.get(drug_wikipedia_link)
drug_soup = BeautifulSoup(drug_page.content, 'html.parser')
# main article body of the Wikipedia page
p = drug_soup.find(id='mw-content-text')

Explore Wikipedia's molecule infobox to identify useful parameters to scrape:

In [6]:
infoboxes = read_html(drug_wikipedia_link, index_col=0, attrs={"class":"infobox"})
pd.set_option('display.max_columns', None)
infoboxes[0].T

Unnamed: 0,NaN,NaN.1,Clinical data,Trade names,AHFS/Drugs.com,Pregnancycategory,Routes ofadministration,ATC code,Legal status,Legal status.1,Pharmacokinetic data,Protein binding,Metabolism,Elimination half-life,Excretion,Identifiers,"IUPAC name N-(5-Sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide",CAS Number,PubChem CID,IUPHAR/BPS,DrugBank,ChemSpider,UNII,KEGG,ChEBI,ChEMBL,PDB ligand,CompTox Dashboard (EPA),ECHA InfoCard,Chemical and physical data,Formula,Molar mass,3D model (JSmol),Melting point,SMILES NS(=O)(=O)c1nnc(s1)NC(=O)C,"InChI InChI=1S/C4H6N4O3S2/c1-2(9)6-3-7-8-4(12-3)13(5,10)11/h1H3,(H2,5,10,11)(H,6,7,9) Key:BZKPWHYZMXOIDC-UHFFFAOYSA-N",.mw-parser-output .nobold{font-weight:normal} (verify)
1,,,Clinical data,"Diamox, Diacarb, others",Monograph,AU: B3,by mouth or intravenous,S01EC01 (WHO),Legal status,AU: S4 (Prescription only) CA: ℞-only UK: POM ...,Pharmacokinetic data,70–90%[1],None[1],2–4 hours[1],Urine (90%)[1],Identifiers,"IUPAC name N-(5-Sulfamoyl-1,3,4-thiadiazol-2-y...",59-66-5,1986,6792,DB00819,1909,O3FX965V0I,D00218,CHEBI:27690,ChEMBL20,"AZM (PDBe, RCSB PDB)",DTXSID7022544,100.000.400,Chemical and physical data,C4H6N4O3S2,222.24 g·mol−1,Interactive image,258 to 259 °C (496 to 498 °F),SMILES NS(=O)(=O)c1nnc(s1)NC(=O)C,InChI InChI=1S/C4H6N4O3S2/c1-2(9)6-3-7-8-4(12-...,.mw-parser-output .nobold{font-weight:normal} ...


Retrieval example (for the full functions, see the "HackMed21_0_helper.py" file):

In [7]:
infoboxes[0].xs(u'ChEBI').values[0]

'CHEBI:27690'
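
The lookup above raises a KeyError when an infobox has no row for the requested field. A minimal sketch of a more defensive lookup follows. It assumes the infobox is a DataFrame indexed by field name, as produced by read_html with index_col=0. The function name infobox_value and the 'N/A' default are illustrative assumptions, not taken from the helper file, and the small demo frame stands in for a real infobox so the sketch runs offline.

```python
import pandas as pd


def infobox_value(infobox, key, default="N/A"):
    """Return the first value stored under `key`, or `default` if the row is absent.

    Hypothetical helper for illustration; not part of the original helper module.
    """
    if key not in infobox.index:
        return default
    # .loc[[key]] always returns a DataFrame, even if the label occurs more than once
    return infobox.loc[[key]].iloc[0, 0]


# Stand-in for infoboxes[0], using two values shown in the output above
demo = pd.DataFrame({1: ["59-66-5", "CHEBI:27690"]}, index=["CAS Number", "ChEBI"])

print(infobox_value(demo, "ChEBI"))
print(infobox_value(demo, "KEGG"))
```

Returning a marker string instead of raising keeps a long scraping loop running when a single page lacks a field.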

## FDA & Wikipedia Scraper

Load the page to scrape as a requests Response object:

In [8]:
URL = 'https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm'
page = requests.get(URL)

In [9]:
soup = BeautifulSoup(page.content, 'html.parser')

Find the table of drugs in shortage via the HTML id of the content container:

In [10]:
results = soup.find(id="cont").find('tbody')

Collect all table row (tr) elements into a list; each row corresponds to one drug entry:

In [11]:
table_list = results.find_all('tr')
print('The FDA Drug Shortage list has {:} entries.'.format(len(table_list)))

The FDA Drug Shortage list has 161 entries.
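
The row extraction can be tried offline on a small HTML fragment. The markup below is a made-up stand-in, not the real FDA page. It only assumes what the notebook code relies on: a container with id "cont", a tbody of tr rows, a second anchor per row carrying title and href, and a strong tag holding the status.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real FDA page may differ.
SAMPLE_HTML = """
<div id="cont">
  <table>
    <tbody>
      <tr>
        <td><a href="#a">A</a>
            <a href="detail.cfm?id=1" title="Drug A Injection">Drug A Injection</a></td>
        <td><strong>Currently in Shortage</strong></td>
      </tr>
      <tr>
        <td><a href="#b">B</a>
            <a href="detail.cfm?id=2" title="Drug B Tablets">Drug B Tablets</a></td>
        <td><strong>Resolved</strong></td>
      </tr>
    </tbody>
  </table>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_rows = sample_soup.find(id="cont").find("tbody").find_all("tr")

parsed = []
for row in sample_rows:
    anchor = row.find_all("a")[1]                     # second link holds name and URL
    status = row.find("strong").get_text(strip=True)  # text inside <strong>
    parsed.append((anchor["title"], status, anchor["href"]))

for entry in parsed:
    print(entry)
```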


### Main Parser
...definitely "over-the-top, over-engineered, and held together with Sellotape and bits of string…" :P

Parse table_list (a bs4.element.ResultSet) to retrieve each drug's name, status, FDA link, CAS number and ChEBI ID:

In [12]:
# Names containing one of these markers are cut down to their first word.
first_word_markers = ['Injection', 'Capsules', 'Tablets', '(', 'Ointment', 'Solution',
                      'Strips', 'Implant', 'Suspension', 'Emulsion', 'Aerosol', 'System']
base_link = "https://www.accessdata.fda.gov/scripts/drugshortages/"

drug_par = []
for row in table_list:
    # the second anchor of each row carries the drug name and the detail link
    anchor = row.find_all('a')[1]
    # retrieve the drug name:
    name_str = find_name(anchor['title'])
    # special cases come first so the generic first-word rule cannot override them
    if 'Parathyroid' in name_str:
        abbr_name = 'Parathyroid Hormone'
    elif 'Multi-Vitamin' in name_str:
        abbr_name = 'Multivitamin'
    elif 'Histreline' in name_str:
        abbr_name = 'Histrelin'
    elif 'Sterile Water' in name_str:
        abbr_name = 'Water for injection'
    elif any(marker in name_str for marker in first_word_markers):
        abbr_name = name_str.split(' ')[0]
    else:
        abbr_name = name_str
    # retrieve the status:
    status_str = find_status(str(row.find('strong')))
    # retrieve FDA accessdata hyperlink:
    fda_link = base_link + anchor['href']
    # scrape Wikipedia for CAS number and ChEBI ID:
    wiki_info = find_CAS_number_and_ChEBI(abbr_name)
    CAS_number = wiki_info[0]
    ChEBI = wiki_info[1]

    drug_par.append([name_str, abbr_name, status_str, fda_link, CAS_number, ChEBI])

https://en.wikipedia.org/wiki/Acetazolamide
https://en.wikipedia.org/wiki/Amifostine
https://en.wikipedia.org/wiki/Amino_Acids
https://en.wikipedia.org/wiki/Aminophylline
https://en.wikipedia.org/wiki/Amoxapine
https://en.wikipedia.org/wiki/Amphetamine
https://en.wikipedia.org/wiki/Anagrelide
https://en.wikipedia.org/wiki/Asparaginase
https://en.wikipedia.org/wiki/Atropine
https://en.wikipedia.org/wiki/Atropine
https://en.wikipedia.org/wiki/Azacitidine
https://en.wikipedia.org/wiki/Azithromycin
https://en.wikipedia.org/wiki/Azithromycin
https://en.wikipedia.org/wiki/Belatacept
https://en.wikipedia.org/wiki/Bumetanide
https://en.wikipedia.org/wiki/Bupivacaine
https://en.wikipedia.org/wiki/Bupivacaine
https://en.wikipedia.org/wiki/Calcitriol
https://en.wikipedia.org/wiki/Calcium
https://en.wikipedia.org/wiki/Calcium
https://en.wikipedia.org/wiki/Calcium
https://en.wikipedia.org/wiki/Capreomycin
https://en.wikipedia.org/wiki/Cefazolin
https://en.wikipedia.org/wiki/Cefepime
https://en.wiki
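
The log above shows several Wikipedia pages being requested more than once (Atropine, Azithromycin, Bupivacaine, Calcium). One possible way to avoid the repeats is to memoise the lookup, sketched here with functools.lru_cache. The wrapper make_cached_lookup and the stand-in fake_lookup are hypothetical names used so the example runs without network access. In the notebook, the wrapped function would be find_CAS_number_and_ChEBI.

```python
from functools import lru_cache


def make_cached_lookup(fetch):
    """Wrap a one-argument lookup so each distinct name is fetched only once."""
    @lru_cache(maxsize=None)
    def cached(name):
        return fetch(name)
    return cached


# Offline stand-in for the real Wikipedia lookup; it records every actual call.
calls = []


def fake_lookup(name):
    calls.append(name)
    return ("N/A", "N/A")


lookup = make_cached_lookup(fake_lookup)
for name in ["Atropine", "Atropine", "Calcium", "Calcium", "Calcium"]:
    lookup(name)

print(len(calls))  # 2: one real call per distinct name
```

The cache lives only for the current session and is keyed on the exact abbreviated name, so differently spelled names still trigger separate requests.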

Create a Pandas DataFrame combining all the web-scraped information into one easily accessible data structure.

In [13]:
df = pd.DataFrame(drug_par, columns=['full_drug_name','name','status','link','CAS_number','ChEBI'])

List the drugs whose ChEBI lookup returned 'N/A' (probably due to some unwanted code behaviour) to see where the parser needs improving.

In [14]:
df.loc[df['ChEBI']=='N/A']

Unnamed: 0,full_drug_name,name,status,link,CAS_number,ChEBI
2,Amino Acids,Amino Acids,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
7,Asparaginase Erwinia Chrysanthemi (Erwinaze),Asparaginase,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
13,Belatacept (Nulojix) Lyophilized Powder for In...,Belatacept,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
18,"Calcium Chloride Injection, USP",Calcium,Resolved,https://www.accessdata.fda.gov/scripts/drugsho...,,
19,Calcium Disodium Versenate Injection,Calcium,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
20,Calcium Gluconate Injection,Calcium,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
21,"Capreomycin Injection, USP",Capreomycin,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
24,Cefotaxime Sodium Injection,Cefotaxime,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
28,Ceftolozane and Tazobactam (Zerbaxa) Injection,Ceftolozane,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
32,Continuous Renal Replacement Therapy (CRRT) So...,Continuous,Currently in Shortage,https://www.accessdata.fda.gov/scripts/drugsho...,,
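
Some of the rows above were abbreviated to a first word that no longer names the drug, for example "Calcium" for Calcium Chloride and Calcium Gluconate. One possible refinement, sketched below, keeps every word before the first dosage-form keyword instead of only the first word. The keyword tuple DOSAGE_FORMS and the function strip_dosage_form are illustrative assumptions. Whether the resulting names resolve to a Wikipedia page with a ChEBI entry has not been checked here.

```python
import re

# Hypothetical keyword list; extend it to match the forms seen in the FDA table.
DOSAGE_FORMS = ("Injection", "Capsules", "Tablets", "Ointment", "Solution",
                "Suspension", "Emulsion", "Implant")


def strip_dosage_form(full_name):
    """Keep the words that precede the first dosage-form keyword, '(' or ','."""
    # Cut at the first parenthesis (brand names) or comma (suffixes such as ', USP').
    head = re.split(r"[(,]", full_name, maxsplit=1)[0]
    words = []
    for word in head.split():
        if word in DOSAGE_FORMS:
            break
        words.append(word)
    # Fall back to the unchanged name if nothing is left.
    return " ".join(words) or full_name.strip()


for example in ["Calcium Chloride Injection, USP",
                "Calcium Gluconate Injection",
                "Amino Acids"]:
    print(example, "->", strip_dosage_form(example))
```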


## Data Backup

In [17]:
# the with-block closes the file handle once the dump has finished
with open("/Users/camillomoschner/Documents/GitHub/react2drug/drug_shortages.p", "wb") as f:
    pickle.dump(df, f)
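
To confirm that a backup can be read back, a round trip such as the following may be useful. It is a sketch that writes to a temporary directory rather than the path above. The one-row frame and the file name drug_shortages_demo.p are stand-ins for the real df and its backup file.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in for the scraped DataFrame
demo_df = pd.DataFrame(
    [["Acetazolamide", "59-66-5", "CHEBI:27690"]],
    columns=["name", "CAS_number", "ChEBI"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    backup_path = os.path.join(tmp_dir, "drug_shortages_demo.p")  # hypothetical file name
    # write the backup
    with open(backup_path, "wb") as f:
        pickle.dump(demo_df, f)
    # read it back
    with open(backup_path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(demo_df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.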