# FDA - Device Report Downloader

## Background
Medical device reports (MDRs) are submitted to the FDA from across the country. FDA experts gather and review these reports, which are also made available to the public. At the moment, the only easy way to retrieve MDRs is the FDA's MAUDE database (https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm). However, MAUDE has a poor interface and few search features, and it returns at most 500 reports at a time. It is therefore suitable only for analyzing small numbers of reports, and a different way of extracting and using device report data is needed.

The FDA keeps a second repository of medical device reports as raw JSON data on another site (https://open.fda.gov/data/downloads/). Each JSON file sits inside its own ZIP archive and has the same name as all the others, which makes it tedious to manually download, unzip, rename, and merge the files needed to build the base dataset. Furthermore, the download links are hidden behind JavaScript and buttons.

This notebook uses the Selenium WebDriver to get around these challenges. Enter the year range, the product codes, and your download folder, then run the notebook. The end result is a CSV file with the requested data.

## User Input

Enter the year range (both the start year and the end year are included)

In [None]:
start_year = 2020
end_year = 2020

Enter the path of the download folder

In [None]:
path = 'C:/Users/Allen/Documents/FDA'

Enter the product codes needed as a list; use ["All"] to include all product codes

In [None]:
pcode = ["DYE","LWR","MIE","MWH","NPX","OHA","PAL","PAP"]

Trim: set trim = "Yes" to apply some basic formatting, or "No" to opt out. For example, mdr_text is stored in a challenging way, as an array of items within a single cell, and trimming reduces it to plain text.

In [None]:
trim = "Yes"
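
Before running the rest of the notebook, it can help to check the four inputs above. The following is a minimal sketch, not part of the original notebook; the helper name `check_inputs` is hypothetical, and the rules it applies are inferred from how the later cells use these variables.

```python
def check_inputs(start_year, end_year, pcode, trim):
    """Return a list of problems found in the user inputs (empty if none)."""
    problems = []
    # The year range is built with range(start_year, end_year + 1)
    if start_year > end_year:
        problems.append("start_year must not be later than end_year")
    # Later cells read pcode[0], so pcode has to be a non-empty list
    if not isinstance(pcode, list) or not pcode:
        problems.append('pcode must be a non-empty list, e.g. ["All"] or ["DYE", "LWR"]')
    # The trimming cell compares trim against the exact string "Yes"
    if trim not in ("Yes", "No"):
        problems.append('trim must be "Yes" or "No"')
    return problems

print(check_inputs(2020, 2020, ["DYE", "LWR"], "Yes"))
```

An empty list means the inputs look usable; otherwise each entry describes one input to correct.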

## Installing specialized packages

In [None]:
pip install webdriver-manager

In [None]:
pip install selenium

In [None]:
#Import the packages used throughout the notebook
import pandas as pd
import numpy as np
import json
import datetime
from bs4 import BeautifulSoup as bs
import re
import urllib
import time
import os
import string
import nltk
import requests
import zipfile
from io import BytesIO

URL = "https://open.fda.gov/data/downloads/"

## Web Scraping

In [None]:
#The FDA website does not load the downloadable files unless you scroll to that area of the page first
#The web scraping uses the Selenium webdriver to open the site in Chrome, navigate to the
#needed area, and click the correct buttons at the correct time

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import selenium.common.exceptions
from selenium import webdriver
import time

#Start Chrome with a driver that webdriver-manager downloads automatically
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get(URL)
driver.maximize_window()

time.sleep(1)

#Get past the light screen
#The button carries all three classes, so match them with an explicit CSS selector
button1 = driver.find_element(By.CSS_SELECTOR, ".button.bg-primary.clr-white")
button1.click()

time.sleep(1)
 
#Scroll to the button for medical device events
element_link = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="Medical Device Event"]')))

driver.execute_script("arguments[0].scrollIntoView(true)", element_link)

time.sleep(1)

#Click the medical device event button
button2 = driver.find_element(By.XPATH, '//*[@id="Medical Device Event"]/section/button')
button2.click()

time.sleep(1)

#Retrieve the html code now that it displays the links we need
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#print (html)

#quit() ends the whole browser session, not just the current window
driver.quit()

## Downloading Data

In [None]:
#Snip the HTML down to the portion in question: everything between the first "1991"
#and the start of the Medical Device PMA list item
pattern = '1991(.*?)<li id="Medical Device PMA">'
substring = re.search(pattern, html).group(1)

In [None]:
#Collect every link (href) in the snippet into a list
import lxml.html

url_list = lxml.html.fromstring(substring)
url_list = url_list.xpath('//a/@href')

In [None]:
#Determine which links to follow, based on start and end year
year_list = list(range(start_year, end_year+1))

#Keep the index of every link whose URL contains one of the requested years
index_to_download = []

for year in year_list:
    for index, link in enumerate(url_list):
        if str(year) in link:
            index_to_download.append(index)

index_count = len(index_to_download)
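
The substring test above accepts any link that contains the year's digits anywhere in its text. A stricter variant is sketched below. It assumes that each download link contains a folder named like `2020q1` (four-digit year, the letter `q`, a quarter number); that folder pattern, the helper name `select_links_by_year`, and the sample links are all assumptions for illustration and should be checked against the links the scrape actually returns.

```python
import re

# Assumption: each download link contains a folder such as "2020q1".
# Adjust the pattern if the scraped links look different.
YEAR_QUARTER = re.compile(r"/(\d{4})q[1-4]/")

def select_links_by_year(links, start_year, end_year):
    """Return the indexes of links whose year folder lies in [start_year, end_year]."""
    selected = []
    for index, link in enumerate(links):
        match = YEAR_QUARTER.search(link)
        if match and start_year <= int(match.group(1)) <= end_year:
            selected.append(index)
    return selected

# Hypothetical links, for illustration only
sample_links = [
    "https://example.com/device/event/2019q4/device-event-0001-of-0003.json.zip",
    "https://example.com/device/event/2020q1/device-event-0001-of-2020.json.zip",
    "https://example.com/device/event/2021q1/device-event-2020-of-2020.json.zip",
]
print(select_links_by_year(sample_links, 2020, 2020))
```

In the sample, the third link mentions "2020" only in its file name, so the plain substring test would select it while the folder-based pattern does not.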

In [None]:
#Follow links in array to download/process ZIPs
import requests, zipfile
from io import BytesIO

filename = "FDAdata.json"
pathfull = os.path.join(path, filename)

#Report-level fields to repeat on every device row
meta_cols = ["report_number","report_source_code","date_received","event_type","type_of_report","mdr_text"]

#Run loop, downloading each ZIP and loading its JSON
frames = []
for loopnumber, index in enumerate(index_to_download, start=1):
    print('Download ' + str(loopnumber) + " of " + str(index_count) + " started ")
    req = requests.get(url_list[index])
    req.raise_for_status()
    print('Download ' + str(loopnumber) + " completed ")

    #Extract the JSON under a fixed name into the download folder
    #(without a target folder, extract() writes to the current working directory)
    with zipfile.ZipFile(BytesIO(req.content)) as archive:
        for member in archive.filelist:
            member.filename = filename
            archive.extract(member, path)
    print('File ' + str(loopnumber) + ' extracted')

    with open(pathfull, encoding="utf-8") as json_file:
        data = json.load(json_file)["results"]

    #One row per device, with the report-level fields repeated on each row
    print('Adding JSON ' + str(loopnumber) + ' to Dataframe')
    dfnew = pd.json_normalize(data,
              record_path = "device",
              meta = meta_cols,
              record_prefix = "_",
              errors = "ignore")
    if pcode[0] != "All":
        dfnew = dfnew[dfnew["_device_report_product_code"].isin(pcode)]
    frames.append(dfnew)
    print('JSON ' + str(loopnumber) + ' added')

    #Delete the extracted JSON before the next download
    os.remove(pathfull)

#Combine all downloads into one dataframe
dfmain = pd.concat(frames)
print("Dataframe ready")
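
Because every ZIP holds a JSON file with the same name, one way to sidestep the extract, read, and delete cycle is to read the JSON straight from the archive in memory. The sketch below illustrates that idea under the assumption, taken from the loop above, that each JSON has a top-level `"results"` list. The helper name `read_results_from_zip` is hypothetical, and the tiny archive is built only to make the example self-contained.

```python
import json
import zipfile
from io import BytesIO

def read_results_from_zip(zip_bytes):
    """Return the combined "results" lists of every JSON file in a ZIP held in memory."""
    records = []
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                # archive.open() gives a file-like object, so nothing is written to disk
                with archive.open(name) as json_file:
                    records.extend(json.load(json_file)["results"])
    return records

# Build a tiny ZIP in memory to stand in for a downloaded file (illustrative data)
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("device-event.json",
                     json.dumps({"results": [{"report_number": "1"}, {"report_number": "2"}]}))

print(read_results_from_zip(buffer.getvalue()))
```

With a real download, the argument would be `req.content`, and the returned list could be passed to `pd.json_normalize` in place of `data`.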

## Trimming Data/Export

In [None]:
if trim == "Yes":
    
    #Keep only relevant columns (copy so the edits below work on an independent dataframe)
    dfmain = dfmain[["_device_report_product_code","_brand_name","_generic_name","_manufacturer_d_name","type_of_report","report_number","report_source_code",
                     "date_received","event_type","mdr_text"]].copy()

    #Rename columns
    dfmain.columns = ["product_code","brand_name","generic_name","manufacturer_name","type_of_report","report_number",
                            "report_source_code","date_received","event_type","mdr_text"]

    #Update date column to date format
    dfmain["date_received"] = pd.to_datetime(dfmain["date_received"])

    #Remove brackets from type of report column
    dfmain['type_of_report'] = dfmain['type_of_report'].str.join(', ')

    #Update MDR Text to only show the text narrative items: pull each 'text' entry out of
    #the cell's string form, strip punctuation, and separate the entries with " - "
    strip_punctuation = str.maketrans('', '', string.punctuation)
    newmdr = []
    for mdr in dfmain["mdr_text"]:
        narrative = ''.join(re.findall("'text': .+?}", str(mdr)))
        newmdr.append(narrative.translate(strip_punctuation).replace("text", " - ")[4:])

    #Lowercase MDR Text
    dfmain["mdr_text"] = [x.lower() for x in newmdr]
    
#Export to CSV
dfmain.to_csv(r'fda_device_reports.csv', index = False)
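
The MDR text step above works on the string form of each cell, so replacing "text" also alters narrative words that contain those letters, such as "context". An alternative is sketched below. It assumes each `mdr_text` cell is a list of dictionaries with a `"text"` key, which is what the regular expression above implies; the helper name `narrative_from_mdr` and the sample entries are hypothetical. Unlike the notebook's version, this sketch keeps punctuation.

```python
def narrative_from_mdr(entries):
    """Join the "text" values of one mdr_text cell into a single lowercase string."""
    # Cells that are missing or not a list yield an empty narrative
    if not isinstance(entries, list):
        return ""
    texts = [str(entry.get("text", "")).strip()
             for entry in entries if isinstance(entry, dict)]
    # Separate the narratives with " - ", skipping empty ones
    return " - ".join(text.lower() for text in texts if text)

# Hypothetical cell contents, for illustration only
sample_cell = [
    {"text_type_code": "Description of Event or Problem",
     "text": "Patient context was reviewed."},
    {"text": "Device returned."},
]
print(narrative_from_mdr(sample_cell))
```

Applied to the dataframe, this would be `dfmain["mdr_text"].apply(narrative_from_mdr)`, run in place of the loop and before the export.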