# Exploratory Drug Label Analysis

The goal of this notebook is to take a first look at some of the XML files from the [DailyMed drug package inserts](https://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm).  We'll try to get a handle on what this data looks like, how to load it into a data structure, and maybe make some plots.  First, let's set up some imports and see whether we can read some XML.

In [1]:
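import os
import json
from pprint import pprint  # pretty-printer, imported directly instead of rebinding the module name
import lxml.etree as etree
import pandas as pd
# pandas >= 1.0 exposes json_normalize at the top level;
# on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize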

# I unpacked some test data here; not all of it is saved to the GitHub repo due to size constraints
test_data_dir = '../test_data/dm_spl_daily_update_12142018/prescription/'
drug_dirs = [a for a in os.listdir(test_data_dir) if '.zip' not in a]
# show the names of the dirs and the total number
pprint(drug_dirs[0:5])
len(drug_dirs)

['20181214_94472ddb-ae1d-4a11-a673-81254d562e6b',
 '20181214_cea3f6aa-cd1c-4152-98bb-bbb7b977497a',
 '20181214_6bdbdc53-5e07-159c-e053-2991aa0a5bcc',
 '20181214_9c6aa884-750e-4b11-9955-bc7ab6e9c2ad',
 '20181214_93d1d088-1ebb-478e-9715-de067052bff3']


246

Excellent, we have 246 directories, and the directory names carry no useful identifiers.

In [3]:
print('dir for drug 1:')
print(drug_dirs[0])
pprint(os.listdir(test_data_dir+drug_dirs[0]))
print()

print('dir for drug 2:')
print(drug_dirs[1])
pprint(os.listdir(test_data_dir+drug_dirs[1]))
print()

dir for drug 1:
20181214_94472ddb-ae1d-4a11-a673-81254d562e6b
['7b291076-36db-9793-e053-2991aa0a9f53.xml',
 'mar08-0004-01.jpg',
 'mar08-0004-02.jpg',
 'mar08-0004-05.jpg',
 'mar08-0004-04.jpg',
 'mar08-0004-03.jpg']

dir for drug 2:
20181214_cea3f6aa-cd1c-4152-98bb-bbb7b977497a
['image-02.jpg',
 'image-05.jpg',
 'image-04.jpg',
 'image-03.jpg',
 '9e027111-4dce-47fb-bafb-90191891a2f8.xml',
 'image-06.jpg',
 'image-01.jpg']



Looks like each directory has an XML file with a seemingly random name that is unrelated to the directory name.  It also has some number of images, whose filenames may or may not include the name of the drug. Let's print an XML file and check it out.
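
Because the XML filename can't be derived from the directory name, one option is to glob for `*.xml` inside each directory rather than hard-coding the path. This is only a minimal sketch: the helper name `find_label_xml` is hypothetical, and it assumes every directory holds exactly one XML file, which the two listings above suggest but don't prove.

```python
import glob
import os


def find_label_xml(drug_dir):
    """Return the path of the single .xml file inside drug_dir.

    Hypothetical helper. Raises ValueError when the directory does not
    contain exactly one XML file (an assumption, not a verified property
    of the data set).
    """
    matches = sorted(glob.glob(os.path.join(drug_dir, '*.xml')))
    if len(matches) != 1:
        raise ValueError(
            f'expected 1 xml file in {drug_dir}, found {len(matches)}'
        )
    return matches[0]
```

With the variables defined earlier, `find_label_xml(os.path.join(test_data_dir, drug_dirs[0]))` should point at the same file that the next cell opens by name.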

In [4]:
test_filename = test_data_dir+drug_dirs[0]+'/7b291076-36db-9793-e053-2991aa0a9f53.xml'
x = etree.parse(test_filename)
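# encoding='unicode' returns a str; the default returns bytes, which print() shows as one escaped b'...' literal
print(etree.tostring(x, pretty_print=True, encoding='unicode'))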




That's too much to deal with.  Let's read it into pandas.
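
As an aside, counting how often each tag occurs gives a compact overview of a file this large without printing the whole tree. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies. The sample document and its namespace are made up for illustration and are not taken from a real label, and `tag_counts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample, NOT a real label; a real file would be loaded with ET.parse(path).
SAMPLE = """
<document xmlns="urn:example:label">
  <title>Example</title>
  <section><title>Dosage</title><table><tr><td>1</td><td>2</td></tr></table></section>
  <section><title>Supply</title></section>
</document>
"""


def tag_counts(root):
    """Count elements by tag name, dropping any '{namespace}' prefix."""
    counts = Counter()
    for el in root.iter():
        # ElementTree reports namespaced tags as '{uri}name'
        counts[el.tag.rsplit('}', 1)[-1]] += 1
    return counts


counts = tag_counts(ET.fromstring(SAMPLE.strip()))
for tag, n in counts.most_common():
    print(f'{tag}: {n}')
```

A high count for `table` elements would be one hint that a table-oriented reader is worth trying.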

In [6]:
df_list = pd.read_html(test_filename, flavor='bs4') # Parses every <table> element in the file into its own DataFrame
print(len(df_list))

7


In [7]:
for i,df in enumerate(df_list):
    print(f'Dataframe: {i}')
    display(df)

Dataframe: 0


Unnamed: 0,0,1
0,Author,Dosage
1,Cavanagh 1,3 mg/kg of body weight per 24 hours by constan...
2,,
3,Dietzman 2,2 to 6 mg/kg of body weight as a single intrav...
4,,
5,Frank 3,40 mg initially followed by repeat intravenous...
6,,
7,Oaks 4,40 mg initially followed by repeat intravenous...
8,,
9,Schumer 5,1 mg/kg of body weight as a single intravenous...


Dataframe: 1


Unnamed: 0,0,1,2,3
0,Product No.,NDC No.,Strength,Vial Size
1,500601,63323-506-01,10 mg/mL,1 mL


Dataframe: 2


Unnamed: 0,0,1,2,3
0,Product No.,NDC No.,Strength,Vial Size
1,501610,63323-516-10,10 mg/mL,10 mL


Dataframe: 3


Unnamed: 0,0
0,THE 0.75% CONCENTRATION OF BUPIVACAINE HYDROCH...


Dataframe: 4


Unnamed: 0,0,1,2,3,4
0,Type of Block,Conc.,Each Dose,Motor Block 1,
1,(mL),(mg),,,
2,Local infiltration,0.25% 4,up to max.,up to max.,—
3,Epidural,"0.75% 2,4",10-20,75-150,complete
4,0.5% 4,10-20,50-100,moderate to complete,
5,0.25% 4,10-20,25-50,partial to moderate,
6,Caudal,0.5% 4,15-30,75-150,moderate to complete
7,0.25% 4,15-30,37.5-75,moderate,
8,Peripheral nerves,0.5% 4,5 to max.,25 to max.,moderate to complete
9,0.25% 4,5 to max.,12.5 to max.,moderate to complete,


Dataframe: 5


Unnamed: 0,0,1,2,3
0,NDC No.,Container,Fill,Quantity
1,,0.25%—Contains 2.5 mg bupivacaine hydrochlorid...,,
2,0409-1559-10,Single-dose vials,10 mL,box of 10
3,0409-1559-30,Single-dose vials,30 mL,box of 10
4,0409-1587-50,Multiple-dose vials,50 mL,box of 1
5,,0.5%—Contains 5 mg bupivacaine hydrochloride p...,,
6,0409-1560-10,Single-dose vials,10 mL,box of 10
7,0409-1560-29,Single-dose vials,30 mL,box of 10
8,0409-1610-50,Multiple-dose vials,50 mL,box of 1
9,,0.75%—Contains 7.5 mg bupivacaine hydrochlorid...,,


Dataframe: 6


Unnamed: 0,0,1,2,3
0,NDC No.,Container,Fill,Quantity
1,"0.25% with epinephrine 1:200,000—Contains 2.5 ...",,,
2,0409-1746-10,Single-dose vials,10 mL,box of 10
3,0409-1746-30,Single-dose vials,30 mL,box of 10
4,0409-1752-50,Multiple-dose vials,50 mL,box of 1
5,"0.5% with epinephrine 1:200,000—Contains 5 mg ...",,,
6,0409-1749-03,Single-dose ampuls,3 mL,box of 10
7,0409-1749-10,Single-dose vials,10 mL,box of 10
8,0409-1749-29,Single-dose vials,30 mL,box of 10
9,0409-1755-50,Multiple-dose vials,50 mL,box of 1


# OpenFDA database

It looks like [openFDA](https://open.fda.gov/tools/downloads/) provides curated JSON files derived from the FDA label XML files.  They probably don't have everything we need, but they are almost certainly a cleaner place to start.

In [19]:
# use a context manager so the file handle is closed after loading
with open('../test_data/drug-label-0001-of-0007.json') as f:
    data = json.load(f)
print(data.keys())
data['meta']

dict_keys(['meta', 'results'])


{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.',
 'last_updated': '2018-12-14',
 'license': 'https://open.fda.gov/license/',
 'results': {'limit': 20000, 'skip': 0, 'total': 131198},
 'terms': 'https://open.fda.gov/terms/'}

Well, we have two keys, 'meta' and 'results'.  A quick peek shows that 'meta' only holds the disclaimer, license, and paging information, so we don't really care about it.  I bet 'results' is more promising...

In [21]:
df = pd.DataFrame(data['results'])
df

Unnamed: 0,abuse,accessories,active_ingredient,active_ingredient_table,adverse_reactions,adverse_reactions_table,animal_pharmacology_and_or_toxicology,ask_doctor,ask_doctor_or_pharmacist,ask_doctor_or_pharmacist_table,...,use_in_specific_populations_table,user_safety_warnings,version,veterinary_indications,warnings,warnings_and_cautions,warnings_and_cautions_table,warnings_table,when_using,when_using_table
0,,,[HPUS Active Ingredients Each dose contains eq...,,,,,,,,...,,,1,,"[Warnings If symptoms persist or worsen, consu...",,,,,
1,,,[Active ingredient Colloidal oatmeal 43%],,,,,,,,...,,,1,,[Warnings For external use only. When using th...,,,,[When using this product Do not get into eyes....,
2,,,,,[6 ADVERSE REACTIONS Use of levocetirizine dih...,"[<table ID=""_RefID0EWPAC"" width=""75%""> <captio...",[13.2 Animal Toxicology Reproductive Toxicolog...,,,,...,,,1,,,[5 WARNINGS AND PRECAUTIONS 1.Avoid engaging i...,,,,
3,,,[Active ingredient (in each extended-release t...,,,,,[Ask a doctor before use if you have •persiste...,,,...,,,1,,[Warnings Do not use •for children under 12 ye...,,,,,
4,,,,,[6 ADVERSE REACTIONS The most common adverse r...,"[<table ID=""_RefID0EOOAG"" width=""100%""> <capti...",,,,,...,,,1,,,[5 WARNINGS AND PRECAUTIONS •Angioedema: incre...,,,,
5,,,[ACTIVE INGREDIENT Active Ingredient: Hamameli...,,,,,,,,...,,,1,,[WARNINGS Warnings: 1. If the following sympto...,,,,,
6,,,[Active ingredient Benzoyl Peroxide 2.5%],,,,,,,,...,,,1,,[Warnings For external use only],,,,[When Using This Product • avoid unnecessary s...,
7,,,"[ACTIVE INGREDIENTS: Arsenicum Album 12X, Calc...",,,,,,,,...,,,1,,"[WARNINGS: If pregnant or breastfeeding, ask a...",,,,,
8,,,[DRUG FACTS Active Ingredient Miconazole Nitra...,,,,,,,,...,,,1,,[WARNINGS For external use only Do not use : O...,,,,[When using this product avoid contact with th...,
9,,,,,[6 ADVERSE REACTIONS Reactions to articaine ar...,"[<table ID=""table2"" width=""75%""> <caption>Tabl...",,,,,...,,,1,,,[5 WARNINGS AND PRECAUTIONS Accidental Intrava...,,,,


We have tabular data now, at least.  More investigation to follow...
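
The many empty cells above suggest that each record carries only a subset of the possible fields. One way to quantify that is to count how many records contain each key. The sketch below is an illustration only: `field_coverage` is a hypothetical helper, and the two records are invented stand-ins for entries of `data['results']`.

```python
from collections import Counter


def field_coverage(records):
    """Return {field: fraction of records containing it}, most common first."""
    counts = Counter(key for rec in records for key in rec)
    total = len(records)
    return {key: n / total for key, n in counts.most_common()}


# Invented records; real ones would come from data['results'].
sample_records = [
    {'version': '1', 'warnings': ['Warnings For external use only']},
    {'version': '1', 'adverse_reactions': ['6 ADVERSE REACTIONS ...']},
]

coverage = field_coverage(sample_records)
for field, frac in coverage.items():
    print(f'{field}: {frac:.0%}')
```

Calling `field_coverage(data['results'])` on the real file would show which label sections appear often enough to build on.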