An analysis of adverse drug event data using Hadoop, R, and Gephi

Adverse Drug Event Analysis with Hadoop, R, and Gephi


This project contains code for running an analysis of adverse drug events using the Multi-Item Gamma Poisson Shrinker (MGPS) model described in "Empirical Bayes Screening for Multi-Item Associations".
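For orientation, the two-item form of the model is commonly written as below. This is a summary from the general literature rather than from this project's code; the symbols $N_{ij}$ (observed count for drug $i$ and reaction $j$), $E_{ij}$ (expected count under independence), and the hyperparameters $\alpha_1, \beta_1, \alpha_2, \beta_2, P$ are notation assumed here.

```latex
\begin{align}
N_{ij} \mid \lambda_{ij} &\sim \mathrm{Poisson}(\lambda_{ij} E_{ij}) \\
\lambda_{ij} &\sim P \, \mathrm{Gamma}(\alpha_1, \beta_1) + (1 - P) \, \mathrm{Gamma}(\alpha_2, \beta_2) \\
\mathrm{EBGM}_{ij} &= \exp\!\left( \mathbb{E}\left[ \log \lambda_{ij} \mid N_{ij} \right] \right)
\end{align}
```

The EBGM score shrinks the raw ratio $N_{ij} / E_{ij}$ toward the prior, which matters most when the counts are small.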



This analysis is designed to be small enough that you can run it on a single machine if you do not have access to a Hadoop cluster. You will need CDH3 installed on your local machine, along with a version of Pig that is compatible with it.

You will need Maven to compile the Pig user-defined functions (UDFs), and you may also want R and Gephi for certain phases of the analysis.


The input data for this analysis may be downloaded from the FDA's AERS website. You'll need the ASCII version of the data files for each quarter you want to analyze. For my own analysis, I used the data from 2008 through 2010.
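If you want to inspect the downloaded files before loading them, the sketch below may help. It assumes the ASCII files are `$`-delimited with a header row; the column names and rows in `sample` are made up for illustration, and `read_aers_rows` is a hypothetical helper that is not part of this project.

```python
import csv
import io


def read_aers_rows(text_stream):
    """Yield one dict per record from a '$'-delimited stream with a header row."""
    # QUOTE_NONE: treat quote characters as ordinary data (assumption about the files).
    reader = csv.reader(text_stream, delimiter="$", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        # zip() drops any extra trailing field, e.g. from a trailing delimiter.
        yield dict(zip(header, fields))


# Illustrative data only; real files have more columns.
sample = io.StringIO("ISR$DRUG_SEQ$DRUGNAME\n1001$1$ASPIRIN\n1001$2$WARFARIN\n")
rows = list(read_aers_rows(sample))
print(rows[1]["DRUGNAME"])
```

For a real file, pass an open file handle instead of the `io.StringIO` sample.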

The Pig scripts below assume that the input data is stored in three HDFS directories under the user's home directory: aers/drugs, aers/demos, and aers/reactions. All of the DRUG*.TXT files from the AERS website should go into aers/drugs, all of the DEMO*.TXT files should go into aers/demos, and all of the REAC*.TXT files should go into aers/reactions.
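The file-to-directory rule above can be sketched as a small lookup. The helper names `hdfs_dir_for` and `plan_uploads` are hypothetical and only illustrate the mapping; the example file names are assumed to follow the `DRUG*.TXT`, `DEMO*.TXT`, and `REAC*.TXT` patterns.

```python
import re
from typing import Dict, Iterable, List, Optional

# Prefix-to-directory mapping taken from the layout described in the text.
PREFIX_TO_DIR = {
    "DRUG": "aers/drugs",
    "DEMO": "aers/demos",
    "REAC": "aers/reactions",
}


def hdfs_dir_for(filename: str) -> Optional[str]:
    """Return the target HDFS directory for a file name, or None to skip it."""
    match = re.match(r"^(DRUG|DEMO|REAC).*\.TXT$", filename)
    if match is None:
        return None
    return PREFIX_TO_DIR[match.group(1)]


def plan_uploads(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names by target directory, ignoring files the pipeline does not use."""
    plan: Dict[str, List[str]] = {}
    for name in sorted(filenames):
        target = hdfs_dir_for(name)
        if target is not None:
            plan.setdefault(target, []).append(name)
    return plan


print(plan_uploads(["REAC09Q2.TXT", "DEMO08Q1.TXT", "DRUG08Q1.TXT", "README.doc"]))
```

Files that match none of the three patterns are skipped, since the Pig scripts only read these three directories.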

Running the Pipeline

If you have not done so already, load the input data into the Hadoop cluster:

hadoop fs -mkdir aers
hadoop fs -mkdir aers/drugs
hadoop fs -put DRUG*.TXT aers/drugs
hadoop fs -mkdir aers/demos
hadoop fs -put DEMO*.TXT aers/demos
hadoop fs -mkdir aers/reactions
hadoop fs -put REAC*.TXT aers/reactions

Run each of the following commands from the project's top-level directory, i.e., the directory that contains this README file.

mvn package  # Builds the Pig UDFs
pig -f src/main/pig/step1_join_drugs_reactions.pig
pig -f src/main/pig/step2_generate_drug_reaction_counts.pig
pig -f src/main/pig/step3_generate_squashed_distribution.pig

At this point, you can optionally run the R code to solve the MGPS optimization problem. If you do not already have the BB package, install it in your local version of R with install.packages("BB").

hadoop fs -getmerge aers/drugs2_reacs_stats d2r_stats.csv
Rscript src/main/R/ebgm.R d2r_stats.csv
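To give a rough idea of what the optimization minimizes, here is a minimal sketch of the negative log-likelihood for a two-component gamma-Poisson mixture. It is not a port of `ebgm.R`: the input layout (tuples of observed count, expected count, and weight), the parameter order in `theta`, and the function names are all assumptions, and the sketch does not correct for pairs with zero observed count being absent from the data.

```python
import math


def log_negbin(n, e, alpha, beta):
    """Log of the gamma-Poisson marginal probability of count n given expected count e."""
    return (
        math.lgamma(alpha + n) - math.lgamma(alpha) - math.lgamma(n + 1)
        - n * math.log1p(beta / e)
        - alpha * math.log1p(e / beta)
    )


def neg_log_likelihood(theta, data):
    """Weighted negative log-likelihood; theta = (alpha1, beta1, alpha2, beta2, p)."""
    alpha1, beta1, alpha2, beta2, p = theta
    total = 0.0
    for n, e, weight in data:
        a = math.log(p) + log_negbin(n, e, alpha1, beta1)
        b = math.log1p(-p) + log_negbin(n, e, alpha2, beta2)
        hi = max(a, b)  # log-sum-exp for numerical stability
        total += weight * (hi + math.log(math.exp(a - hi) + math.exp(b - hi)))
    return -total


# Hypothetical (count, expected, weight) tuples and starting values.
data = [(3, 1.2, 10.0), (0, 0.4, 250.0), (12, 2.5, 1.0)]
print(neg_log_likelihood((0.2, 0.1, 2.0, 4.0, 0.1), data))
```

An optimizer would search over the five hyperparameters to minimize this value, subject to the first four being positive and `p` lying strictly between 0 and 1.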

The output from the optimization run may be plugged into the Pig script that scores the tuples, or you can use the default parameters already in the script:

pig -f src/main/pig/step4_apply_ebgm.pig
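The scoring step can be illustrated outside Pig with the sketch below, which computes an EBGM score for one (count, expected count) pair. It is an assumption-laden illustration, not the project's UDF: the function names are hypothetical, the hyperparameter values in the example are invented rather than the script's defaults, and `digamma` is a hand-rolled approximation so that the block needs only the standard library.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_negbin(n, e, alpha, beta):
    """Log of the gamma-Poisson marginal probability of count n given expected count e."""
    return (
        math.lgamma(alpha + n) - math.lgamma(alpha) - math.lgamma(n + 1)
        - n * math.log1p(beta / e)
        - alpha * math.log1p(e / beta)
    )


def ebgm(n, e, alpha1, beta1, alpha2, beta2, p):
    """Geometric mean of the posterior of lambda under a two-gamma mixture prior."""
    a = math.log(p) + log_negbin(n, e, alpha1, beta1)
    b = math.log1p(-p) + log_negbin(n, e, alpha2, beta2)
    diff = b - a
    # Posterior weight of the first component; guard against overflow in exp().
    q = 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))
    expected_log = (
        q * (digamma(alpha1 + n) - math.log(beta1 + e))
        + (1.0 - q) * (digamma(alpha2 + n) - math.log(beta2 + e))
    )
    return math.exp(expected_log)


# Invented hyperparameters, for illustration only.
print(ebgm(12, 2.5, 0.2, 0.1, 2.0, 4.0, 0.1))
```

With large counts the score approaches the raw ratio of observed to expected; with small counts it is pulled toward the prior.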

The final output will be in aers/scored_drugs2_reacs. To generate the GEXF file of drug-drug interactions to load into Gephi, run:

hadoop fs -getmerge aers/scored_drugs2_reacs scored_d2r.csv
./src/main/python/ scored_d2r.csv > drugs.gexf
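For readers unfamiliar with GEXF, the sketch below writes a minimal GEXF 1.2 graph from a list of weighted edges. It does not reproduce the project's Python script or its logic for deriving drug-drug interactions; the `(source, target, weight)` input shape, the function name `edges_to_gexf`, and the example drug names are assumptions.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def edges_to_gexf(edges):
    """Return a GEXF 1.2 document (as a string) for (source, target, weight) tuples."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static", "defaultedgetype": "undirected"})
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    ids = {}

    def node_id(label):
        # Assign sequential ids and emit each node once, on first sight.
        if label not in ids:
            ids[label] = str(len(ids))
            ET.SubElement(nodes_el, "node", {"id": ids[label], "label": label})
        return ids[label]

    for index, (source, target, weight) in enumerate(edges):
        ET.SubElement(edges_el, "edge", {
            "id": str(index),
            "source": node_id(source),
            "target": node_id(target),
            "weight": str(weight),
        })
    return ET.tostring(gexf, encoding="unicode")


# Illustrative edges only.
document = edges_to_gexf([("ASPIRIN", "WARFARIN", 2.5), ("ASPIRIN", "IBUPROFEN", 1.2)])
print(document)
```

Gephi can open a file with this structure directly, and the `weight` attribute can then drive edge thickness or layout.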