## FAERS Data Aggregation Use Case 
## Proposed Walkthrough/"Solution"
#### Written: 12/23/2020
#### Updated: 12/23/2020

### Introduction
The document below includes code and descriptive text for the use case assessment focused on reading, aggregating, and exporting the FDA's Adverse Event Reporting System (FAERS) data set. Full materials for this use case can be found on the [Python OER GitHub Repository](https://github.com/domdisanto/Python_OER/tree/master/Use%20Cases/Rahim_FAERS). This walkthrough is tentatively titled a "solution" because, while it offers a specific way of cleaning and analyzing the available data, it certainly does not offer a unique (or even a uniquely best) solution to the given assessment. 

The writing in this document will do its best to outline the specific parameters that should be met to satisfactorily complete the assessment. These parameters, tasks, outputs, etc. exist solely to assess your developing skills as an analyst and Python programmer. That being said, if you take different steps or follow a different analytic method to reach the same results, that is perfectly acceptable! This walkthrough is intended only to be *a* demonstrated workflow, and it is possible (even likely) that you will write more efficient, simpler, or more scalable code than what is included here that achieves the same results! 

### "Benchmarks" of a "Good" Solution
The 

### (0) Importing modules

In [1]:
import pandas as pd 
import numpy as np
import requests, zipfile, io

### (1) Accessing FAERS Data

Before delving into the programmatic solution, we will outline the brief process to access the FAERS data and specifically the drug-related information of interest. The FAERS Data is located on the FDA’s website at https://www.fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-latest-quarterly-data-files:

![Fig1](R/Images/Fig1_FAERSHeader.JPG)

This page contains the following link:

![Fig2](R/Images/Fig2_DataAccessButton.JPG)

which leads to the quarterly FAERS data files, specifically at https://www.fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-latest-quarterly-data-files. On this page, the data can be accessed by calendar year in the interface toward the bottom of the page:

![Fig4](R/Images/Fig4_DataAnnuals.JPG)

Here you can download the data as ASCII files (that is, text-delimited files) or as XML files. The ASCII files are used in the solution below, and are likely the easier format to work with. I would encourage you to download at least one of the date ranges of data and examine the unzipped files. The folder will contain a PDF ReadMe document that details the files it contains.

The unzipped folder will also contain the data itself, stored in .txt files. Data in .txt files seems odd, right? Aren't these just small text files that hold written notes from our Notepad app? These text files actually contain text-delimited data. For example, .csv files are text-delimited files in which the different values in a row are separated by commas. Try opening a .csv file on your laptop in Notepad and examine the data. You’ll notice that data that would open in Excel as:

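| Column_1 | Column_2 | Column_3 | … | Column_N |
|----------|----------|----------|---|----------|
| Value_11 | Value_12 | Value_13 | … | Value_1N |
| Value_21 | Value_22 | Value_23 | … | Value_2N |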

would be represented in its text format as:

> Column_1,Column_2,Column_3,…,Column_N    
> Value_11,Value_12,Value_13,…,Value_1N  
> Value_21,Value_22,Value_23,…,Value_2N  

Try opening one of the smaller (in file-size) .txt files from the FAERS data in Notepad or a similar text editor. We see this file is not delimited by commas but dollar-signs (“$”). The next section will work with importing the .txt files from the ASCII folder’s unzipped files and show you how to work with this new delimiter. Specifically, this only requires the DRUGS file as noted below.
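
As a minimal sketch of the delimiter idea, the snippet below parses a tiny dollar-delimited string with pandas. The rows are made up for illustration (only the column names are borrowed from the DRUG file), and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical miniature of a "$"-delimited FAERS-style file (values are made up)
raw_text = (
    "primaryid$caseid$drugname\n"
    "1001$101$DRUG A\n"
    "1002$102$DRUG B\n"
)

# sep="$" tells pandas to split each line on the dollar sign instead of a comma
toy_df = pd.read_csv(io.StringIO(raw_text), sep="$")
print(toy_df)
```

Swapping `sep="$"` for `sep=","` is the only change needed to read an ordinary .csv the same way.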

### (2) Importing 2020 Q3 Data

The prompt asks us to aggregate the phenytoin events over the 2019 fiscal year (which we’ll call FY2019 for brevity). The definition of the fiscal year is the first wrinkle in this work! In the federal government, the fiscal year ends on 9/30 (so FY2019 ranges from 10/1/2018 to 9/30/2019). If you’ve worked in state agencies, private companies, or other organizations, you may have encountered other definitions of the fiscal year end. The important thing for this assignment is to use a justifiable definition of the fiscal year based on your previous experience, knowledge, or a quick Google search.

We’ll begin by “manually” importing and exploring the data for Q3 2020, specifically looking at the text file (.txt) of drug events. We can import the data file in two ways. The first involves simply importing the data from wherever we have saved it locally:

In [2]:
filepath = "C:/Users/Dominic DiSanto/Documents/Python_OER/Use Cases/FAERS_DataPull/ASCII/DRUG20Q3.txt" 
manual_q32020_df = pd.read_table(filepath, sep = "$")

manual_q32020_df.head()



Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,prod_ai,val_vbm,route,dose_vbm,cum_dose_chr,cum_dose_unit,dechal,rechal,lot_num,exp_dt,nda_num,dose_amt,dose_unit,dose_form,dose_freq
0,100046943,10004694,1,PS,LIPITOR,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
1,100046943,10004694,2,SS,LIPITOR,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
2,100046943,10004694,3,SS,ATORVASTATIN CALCIUM.,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
3,100046943,10004694,4,SS,ATORVASTATIN CALCIUM.,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
4,100046943,10004694,5,C,SIMVASTATIN.,SIMVASTATIN,1,,"40 MG, DAILY",,,,,,,,40,MG,,


Downloading and manually importing this data is certainly an appropriate solution for this assessment! However, we also know that downloading all of these files is a bit tedious. Similarly, if we want to share this code with a new user, each import of the data will include a path that exists only on our local machine. 

Since this data is publicly hosted, we can simply use the FDA’s hosted zip file to call the data using a stable URL! Below, we import the same Q3 2020 data using the `requests`, `zipfile`, and `io` modules:

In [3]:
# import requests, zipfile, io # Recalling we had to import these modules

# Supplying the URL of the zipped quarterly data
url = "https://fis.fda.gov/content/Exports/faers_ascii_2020Q3.zip"

# Pulling the zipped folder from the URL 
q320_zip = requests.get(url)
q320_unzipped = zipfile.ZipFile(io.BytesIO(q320_zip.content))
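
The `io.BytesIO` step is what lets `zipfile` read the download without saving it to disk. The sketch below shows the same pattern offline: it builds a tiny zip archive in memory as a stand-in for the downloaded bytes, then opens it the same way. The file name `DRUG_TOY.txt` and its contents are hypothetical.

```python
import io
import zipfile

# Build a small zip archive in memory to stand in for the bytes of a download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("ASCII/DRUG_TOY.txt", "primaryid$drugname\n1$DRUG A\n")
zip_bytes = buffer.getvalue()

# Same pattern as above: wrap raw bytes in BytesIO so ZipFile can treat them as a file
toy_zip = zipfile.ZipFile(io.BytesIO(zip_bytes))
toy_names = toy_zip.namelist()
toy_text = toy_zip.read("ASCII/DRUG_TOY.txt").decode("utf-8")
print(toy_names)
print(toy_text)
```

For the real download, calling `raise_for_status()` on the `requests` response before unzipping is one way to surface a failed request early, rather than a confusing zip error later.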

Now we can look at the contents of this zipped file. We know we are interested in the file containing the drug names for the adverse events, and specifically in the .txt file. Once we have the listing, we can call this file manually by its position:

In [4]:
file_dir = q320_unzipped.namelist()
file_dir

['ASCII/',
 'ASCII/ASC_NTS.pdf',
 'ASCII/demo20q3.pdf',
 'ASCII/DEMO20Q3.txt',
 'ASCII/drug20q3.pdf',
 'ASCII/DRUG20Q3.txt',
 'ASCII/indi20q3.pdf',
 'ASCII/INDI20Q3.txt',
 'ASCII/outc20q3.pdf',
 'ASCII/OUTC20Q3.txt',
 'ASCII/reac20q3.pdf',
 'ASCII/REAC20Q3.txt',
 'ASCII/rpsr20q3.pdf',
 'ASCII/RPSR20Q3.txt',
 'ASCII/ther20q3.pdf',
 'ASCII/THER20Q3.txt',
 'Deleted/',
 'Deleted/ADR20Q3DeletedCases.txt',
 'FAQs.pdf',
 'Readme.pdf']

In [5]:
file_dir[5]

'ASCII/DRUG20Q3.txt'

Or we could search through the different files in our object that contains the unzipped contents and isolate those file names with 'DRUG' and 'txt' in the name, as we know this will identify the `.txt` file of interest.

In [6]:
# Loop over the archive's file names and keep the DRUG .txt file
for i in file_dir:
    if 'DRUG' in i and 'txt' in i:
        drug_file = i
        print(i)

ASCII/DRUG20Q3.txt
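
A slightly stricter version of this search is sketched below. It ignores letter case and checks that exactly one file matched. The names `example_names` and `drug_txt` are introduced here for illustration, and the idea that other quarters might use different letter casing is an assumption rather than something verified in this walkthrough.

```python
# A few file names copied from the archive listing above
example_names = [
    "ASCII/",
    "ASCII/drug20q3.pdf",
    "ASCII/DRUG20Q3.txt",
    "ASCII/REAC20Q3.txt",
    "Readme.pdf",
]

# Keep names that contain "drug" in any letter case and end with ".txt"
drug_matches = [
    name for name in example_names
    if "drug" in name.lower() and name.lower().endswith(".txt")
]

# Exactly one match is expected; anything else suggests the archive layout differs
if len(drug_matches) != 1:
    raise ValueError(f"Expected one DRUG .txt file, found: {drug_matches}")
drug_txt = drug_matches[0]
print(drug_txt)
```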


So let’s supply the file name of interest to the `pd.read_csv()` function, and explore the imported data set.



In [7]:
q32020_df = pd.read_csv(q320_unzipped.open(drug_file), sep="$")

q32020_df.head()

Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,prod_ai,val_vbm,route,dose_vbm,cum_dose_chr,cum_dose_unit,dechal,rechal,lot_num,exp_dt,nda_num,dose_amt,dose_unit,dose_form,dose_freq
0,100046943,10004694,1,PS,LIPITOR,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
1,100046943,10004694,2,SS,LIPITOR,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
2,100046943,10004694,3,SS,ATORVASTATIN CALCIUM.,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
3,100046943,10004694,4,SS,ATORVASTATIN CALCIUM.,ATORVASTATIN CALCIUM,1,Oral,"20 MG, DAILY",,,U,,,,20702.0,20,MG,FILM-COATED TABLET,
4,100046943,10004694,5,C,SIMVASTATIN.,SIMVASTATIN,1,,"40 MG, DAILY",,,,,,,,40,MG,,


Now let’s see what medications exist in the data, and identify all instances of phenytoin among them, again using Python's `in` operator in an `if` conditional:

In [8]:
meds = set(q32020_df['prod_ai']) # Identifying our list of unique medications in the data set 
ptn_meds = [] # initializing empty list 

for i in meds:
    if 'phenytoin' in str(i) or 'PHENYTOIN' in str(i):
        ptn_meds.append(i)

ptn_meds = np.array(ptn_meds) # converting to numpy array, personal preference, not essential
ptn_meds


And lastly, let’s filter the full imported data set, using our list of phenytoin products to identify the related adverse events.

In [None]:
q32020_phenytoin = q32020_df[q32020_df['prod_ai'].isin(ptn_meds)]
q32020_phenytoin.shape

In [None]:
q32020_phenytoin.head()
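
The loop followed by `isin` can also be written as a single vectorized filter. The sketch below uses a small made-up data frame (only the column name `prod_ai` comes from the DRUG file). Note that `case=False` also matches mixed-case spellings such as "Phenytoin", which the loop above would skip, so the two approaches are not guaranteed to return identical rows on real data.

```python
import pandas as pd

# Hypothetical rows; only the column name `prod_ai` comes from the FAERS DRUG file
toy_drugs = pd.DataFrame({
    "prod_ai": ["PHENYTOIN", "Phenytoin Sodium", "ATORVASTATIN CALCIUM", None],
    "drug_seq": [1, 2, 3, 4],
})

# case=False matches any capitalization; na=False treats missing names as non-matches
is_phenytoin = toy_drugs["prod_ai"].str.contains("phenytoin", case=False, na=False)
toy_phenytoin = toy_drugs[is_phenytoin]
print(toy_phenytoin)
```

Both versions are substring matches, so a name that merely contains the word (FOSPHENYTOIN, for example) would be kept as well. Whether that is desirable depends on how the assessment defines a phenytoin event.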

### (3) Importing All FY2019 Data
We walked through the process of importing a single quarter of data. FY2019 will include four quarters of data that will follow an analogous process, so let’s create a function to reduce the redundancy of our code. The below function requires only the URL of the quarter’s zipped folder, and will complete the steps above of importing the full folder, unzipping the data contents, identifying the DRUG file, and specifically filtering on phenytoin. The cells that follow also compare the results of the function to our above manual import, to demonstrate that the function and the manual import produce the same data!

In [None]:
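def faers_phenytoin_extract(url):
    """Download one quarterly FAERS zip file and return its phenytoin DRUG rows."""
    # Pull the zipped folder from the URL and open it in memory
    zip_files = requests.get(url)
    unzip_files = zipfile.ZipFile(io.BytesIO(zip_files.content))

    # Identify the DRUG .txt file among the archive's contents
    file_dir = unzip_files.namelist()
    drug_file = None
    for i in file_dir:
        if 'DRUG' in i and 'txt' in i:
            drug_file = i
    if drug_file is None:
        raise ValueError("No DRUG .txt file found in " + url)

    # Import the dollar-delimited DRUG file
    drug_df = pd.read_csv(unzip_files.open(drug_file), sep="$")

    # Collect the unique active ingredients that mention phenytoin
    meds = set(drug_df['prod_ai'])
    ptn_meds = []
    for i in meds:
        if 'phenytoin' in str(i) or 'PHENYTOIN' in str(i):
            ptn_meds.append(i)
    ptn_meds = np.array(ptn_meds)

    # Keep only the rows for those phenytoin products
    out_df = drug_df[drug_df['prod_ai'].isin(ptn_meds)]

    return out_df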



In [None]:
# so let's compare to the Q32020 data we pulled manually previously 
q3_20_df =  faers_phenytoin_extract("https://fis.fda.gov/content/Exports/faers_ascii_2020Q3.zip")
q3_20_df.equals(q32020_phenytoin)
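
`DataFrame.equals` answers with a single `True` or `False`. When two frames unexpectedly differ, `pd.testing.assert_frame_equal` can be more helpful because its error message describes the mismatch. The sketch below uses two small made-up frames to show both.

```python
import pandas as pd

# Two hypothetical extracts that should be identical
left = pd.DataFrame({"caseid": [1, 2], "prod_ai": ["PHENYTOIN", "PHENYTOIN SODIUM"]})
right = left.copy()

# .equals returns a single True/False
print(left.equals(right))

# assert_frame_equal passes silently when the frames match
pd.testing.assert_frame_equal(left, right)

# Introduce one difference and capture the message that describes it
right.loc[1, "prod_ai"] = "PHENYTOIN"
try:
    pd.testing.assert_frame_equal(left, right)
    mismatch_message = ""
except AssertionError as err:
    mismatch_message = str(err)
print(mismatch_message)
```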

In [None]:
q2_20_df = faers_phenytoin_extract("https://fis.fda.gov/content/Exports/faers_ascii_2020Q2.zip")
q1_20_df = faers_phenytoin_extract("https://fis.fda.gov/content/Exports/faers_ascii_2020Q1.zip")
q4_19_df = faers_phenytoin_extract("https://fis.fda.gov/content/Exports/faers_ascii_2019Q4.zip")
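
The download links used in this walkthrough differ only in the year and quarter, so they can be generated rather than typed out. The sketch below builds them from a template. The template is inferred from the four links in this document and is not guaranteed to hold for other quarters, and `URL_TEMPLATE`, `quarters`, and `quarter_urls` are names introduced here for illustration.

```python
# URL pattern inferred from the download links used in this walkthrough
URL_TEMPLATE = "https://fis.fda.gov/content/Exports/faers_ascii_{year}Q{quarter}.zip"

# (year, quarter) pairs for the four files this walkthrough downloads
quarters = [(2019, 4), (2020, 1), (2020, 2), (2020, 3)]

quarter_urls = {
    f"{year}Q{quarter}": URL_TEMPLATE.format(year=year, quarter=quarter)
    for year, quarter in quarters
}

for label, quarter_url in quarter_urls.items():
    print(label, quarter_url)

# With network access, the extracts could then be gathered in one pass, e.g.:
# extracts = {label: faers_phenytoin_extract(u) for label, u in quarter_urls.items()}
```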

In [None]:
assert q3_20_df.columns.equals(q2_20_df.columns)
assert q3_20_df.columns.equals(q4_19_df.columns)
assert q3_20_df.columns.equals(q1_20_df.columns)

Having confirmed that all four quarters share the same columns, let’s append the data frames together and look at our final data:

In [None]:
display(q3_20_df.shape)
display(q2_20_df.shape)
display(q1_20_df.shape)
display(q4_19_df.shape)

In [None]:
fy19_phenytoin = pd.concat([q3_20_df,
                            q2_20_df,
                            q1_20_df,
                            q4_19_df])

fy19_phenytoin.shape

In [None]:
fy19_phenytoin.head()
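
Once the quarters are stacked, nothing in the combined data records which file a row came from, and the row labels repeat from each original frame. One optional variation, sketched below on made-up frames, tags each extract before concatenating and renumbers the rows. The `quarter` column is an addition for illustration and is not part of the FAERS files, so it would need to be left out if the exported columns must match the original layout.

```python
import pandas as pd

# Hypothetical stand-ins for two quarterly extracts
toy_q2 = pd.DataFrame({"caseid": [11, 12], "prod_ai": ["PHENYTOIN", "PHENYTOIN"]})
toy_q3 = pd.DataFrame({"caseid": [21], "prod_ai": ["PHENYTOIN SODIUM"]})

# Tag each frame with its source quarter, then stack them;
# ignore_index=True renumbers the rows instead of repeating each frame's own index
toy_combined = pd.concat(
    [toy_q2.assign(quarter="2020Q2"), toy_q3.assign(quarter="2020Q3")],
    ignore_index=True,
)
print(toy_combined)
print(toy_combined["quarter"].value_counts())
```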

### (4) Exporting aggregated data

In [None]:
fy19_phenytoin.to_csv('FY2019_PhenytoinAERS_Python.csv', index = False)
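
The `index = False` argument matters for anyone who reads the exported file back in. The sketch below writes a small made-up frame both ways to in-memory buffers (so nothing is left on disk) and compares the columns that come back.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the aggregated phenytoin data
toy_export = pd.DataFrame({"caseid": [1, 2], "prod_ai": ["PHENYTOIN", "PHENYTOIN SODIUM"]})

# Write to in-memory text buffers instead of files
without_index = io.StringIO()
with_index = io.StringIO()
toy_export.to_csv(without_index, index=False)
toy_export.to_csv(with_index, index=True)

# Read both back to compare the resulting columns
cols_without = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)
cols_with = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
print(cols_without)
print(cols_with)
```

With `index=True`, the row labels are written as an extra unnamed first column, which pandas reads back as `Unnamed: 0`.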

In [15]:
pd.DataFrame({'a':[1,2,3],
             'b':[4,5,6]}).to_excel('test.xlsx', index=False)
pd.DataFrame({'a':[1,2,3],
             'b':[4,5,6]}).to_excel('test_indx.xlsx', index=True)

In [17]:
exec(open(r'FAERS_SolutionCheck.py').read())

Solution script executing...
Python submission file found! Checking solution...
R submission file found! Checking solution...
Suubmission file found wthout "_Python" or "_R" suffix. Checking solution...
