# Final Project: Admission Prediction from NHAMCS
## Data exploration notebook
### DS5559: Big Data Analysis
### Thomas Hartka, Alicia Doan, Michael Langmayr
Created: 6/21/20  
  
In this notebook we read the NHAMCS files for the years 2007-2017 into PySpark DataFrames, one per year, and concatenate them into a single DataFrame. We then determine which years contain data for which variables and use that information to select the variables to investigate in our prediction models. Finally, we visualize the data, focusing on the relationship between the predictors and the response variables.

## Configuration

In [8]:
# set data directory
data_dir = "../raw_data"

In [2]:
# import python libraries
import os
import pandas as pd
import numpy as np
from functools import reduce

In [5]:
# import the project helper module from the local lib package
from lib import combineDf


In [6]:
# set up pyspark
from pyspark.sql import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

## Read in data files and combine

In [9]:
# create list to hold one DataFrame per file
df = []

# loop through all files in the data directory
for i, filename in enumerate(os.listdir(data_dir)):
    print(i,":", filename)

    # read the CSV file, inferring column types from the data
    df.append(spark.read.csv(os.path.join(data_dir, filename), inferSchema=True, header=True))

    # extract the four-digit year from the file name (e.g. NHAMCS2007.csv -> 2007)
    year = filename.split(".")[0][-4:]

    # add a row ID (unique within this file only) and the survey year
    df[i] = df[i].withColumn("ID", monotonically_increasing_id())
    df[i] = df[i].withColumn("YEAR", lit(year))

0 : NHAMCS2007.csv
1 : NHAMCS2012.csv
2 : NHAMCS2008.csv
3 : NHAMCS2010.csv
4 : NHAMCS2017.csv
5 : NHAMCS2015.csv
6 : NHAMCS2014.csv
7 : NHAMCS2016.csv
8 : NHAMCS2009.csv
9 : NHAMCS2013.csv
10 : NHAMCS2011.csv
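
The listing above is not chronological because `os.listdir` returns entries in arbitrary order, and the year is taken from the last four characters of the file stem whatever the file is. A minimal sketch of a stricter alternative, assuming every data file follows the `NHAMCS<year>.csv` pattern seen in the listing; `extract_year` and `sorted_by_year` are hypothetical helper names, not part of the notebook:

```python
import re

# Assumed naming pattern, taken from the file listing: NHAMCS<4-digit year>.csv
_NAME_PATTERN = re.compile(r"^NHAMCS(\d{4})\.csv$")

def extract_year(filename):
    """Return the four-digit year as a string, or None if the name does not match."""
    match = _NAME_PATTERN.match(filename)
    return match.group(1) if match else None

def sorted_by_year(filenames):
    """Keep only files that match the pattern and order them chronologically."""
    matching = [f for f in filenames if extract_year(f) is not None]
    return sorted(matching, key=extract_year)

files = ["NHAMCS2012.csv", "NHAMCS2007.csv", "notes.txt", "NHAMCS2008.csv"]
print(sorted_by_year(files))  # ['NHAMCS2007.csv', 'NHAMCS2008.csv', 'NHAMCS2012.csv']
```

Iterating over `sorted_by_year(os.listdir(data_dir))` would make the list index follow the survey year and would skip stray files in the data directory.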


In [26]:
df[0].select(df[0].columns[:5]).show(5)

+------+-----+--------+---+-------+
|VMONTH|VYEAR|   VDAYR|AGE|ARRTIME|
+------+-----+--------+---+-------+
| April| 2007|Thursday| 49|   1325|
| April| 2007|  Friday| 24|    915|
| April| 2007| Tuesday| 30|    825|
| April| 2007|  Monday| 24|   1815|
| April| 2007|  Friday| 43|   2228|
+------+-----+--------+---+-------+
only showing top 5 rows



In [27]:
df[0].select("ID","YEAR").show(5)

+---+----+
| ID|YEAR|
+---+----+
|  0|2007|
|  1|2007|
|  2|2007|
|  3|2007|
|  4|2007|
+---+----+
only showing top 5 rows
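
The IDs above look sequential only because these rows come from the first partition. The PySpark documentation states that `monotonically_increasing_id` puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits, so the IDs are unique and increasing within one DataFrame but not necessarily consecutive. Each yearly DataFrame also starts again at 0, so, assuming the combining step keeps the `ID` column unchanged, a row in the combined data is identified by the pair (`YEAR`, `ID`) rather than by `ID` alone. The sketch below decodes an ID under that documented layout; `decode_id` is a hypothetical helper for illustration only:

```python
# Bit layout documented for monotonically_increasing_id:
# upper 31 bits = partition ID, lower 33 bits = record number within the partition.
RECORD_BITS = 33

def decode_id(row_id):
    """Split a generated ID into (partition_id, record_number)."""
    partition_id = row_id >> RECORD_BITS
    record_number = row_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number

print(decode_id(4))               # (0, 4)
print(decode_id(8589934592))      # (1, 0)  since 8589934592 == 2**33
print(decode_id(8589934592 + 7))  # (1, 7)
```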



In [10]:
# combine the first two DataFrames
NHAMCS_comb = combineDf.union_d_fs(df[0],df[1])

# add the remaining DataFrames one at a time
for i in range(2,len(df)):
    print("Concatenating: ", i)
    
    NHAMCS_comb = combineDf.union_d_fs(NHAMCS_comb,df[i])

Concatenating:  2
Concatenating:  3
Concatenating:  4
Concatenating:  5
Concatenating:  6
Concatenating:  7
Concatenating:  8
Concatenating:  9
Concatenating:  10
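
`combineDf.union_d_fs` comes from the project's `lib` package and its source is not shown in this notebook. Since the survey variables change from year to year, one plausible reading is that it unions two DataFrames whose column sets differ. The sketch below shows one way to do that. It is an assumption, not the project's implementation, and `merged_columns` and `union_with_missing` are hypothetical names. Columns missing from one side are filled with nulls so that both sides share the same column order before `union`.

```python
def merged_columns(cols_a, cols_b):
    """Columns of a, followed by the columns that appear only in b (order preserved)."""
    seen = set(cols_a)
    return list(cols_a) + [c for c in cols_b if c not in seen]

def union_with_missing(df_a, df_b):
    """Union two Spark DataFrames, padding columns absent from either side with nulls.

    Hypothetical stand-in for combineDf.union_d_fs. It does not reconcile
    columns whose type differs between the two DataFrames.
    """
    # imported lazily so the pure-Python helper above works without Spark installed
    from pyspark.sql.functions import col, lit

    all_cols = merged_columns(df_a.columns, df_b.columns)
    # look up each column's type string, preferring df_a when both sides have it
    types = {**dict(df_b.dtypes), **dict(df_a.dtypes)}

    def pad(df):
        present = set(df.columns)
        return df.select([
            col(c) if c in present else lit(None).cast(types[c]).alias(c)
            for c in all_cols
        ])

    return pad(df_a).union(pad(df_b))

# column alignment on its own, without Spark
print(merged_columns(["AGE", "SEX", "ID"], ["AGE", "ID", "PAINSCALE"]))
# ['AGE', 'SEX', 'ID', 'PAINSCALE']
```

On Spark 3.1 or later, `df_a.unionByName(df_b, allowMissingColumns=True)` provides the same null padding as a built-in.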


In [11]:
NHAMCS_comb.count()

305897

In [12]:
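# a Spark DataFrame has no to_csv method (that is the pandas API); use the DataFrameWriter instead
# note: Spark writes a directory of part files named combined_data.csv, not a single file
NHAMCS_comb.write.csv("combined_data.csv", header=True)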

In [10]:
#NHAMCS_comb.cache()

DataFrame[VMONTH: string, VYEAR: int, VDAYR: string, AGE: string, ARRTIME: int, WAITTIME: string, LOV: string, RESIDNCE: string, SEX: string, ETHUN: string, RACEUN: string, ARRIVE: string, PAYPRIV: string, PAYMCARE: string, PAYMCAID: string, PAYWKCMP: string, PAYSELF: string, PAYNOCHG: string, PAYOTH: string, PAYDK: string, PAYTYPE: string, TEMPF: string, PULSE: string, RESPR: string, BPSYS: string, BPDIAS: string, POPCT: string, ORIENTED: string, IMMED: string, PAIN: string, SEEN72: string, DISCH7DA: string, PASTVIS: string, RFV1: string, RFV2: string, RFV3: string, RFV13D: string, RFV23D: string, RFV33D: string, EPISODE: string, INJURY: string, INTENT: string, CAUSE1: string, CAUSE2: string, CAUSE3: string, CAUSE13D: int, CAUSE23D: int, CAUSE33D: int, VCAUSE: string, DIAG1: string, DIAG2: string, DIAG3: string, DIAG13D: string, DIAG23D: string, DIAG33D: string, PRDIAG1: string, PRDIAG2: string, PRDIAG3: string, DIAGSCRN: string, CBC: string, BUNCREAT: string, CARDENZ: string, ELECTRO

## Write data to parquet file

In [13]:
%%time
# write out data
NHAMCS_comb.write.parquet("../data/NHAMCS.2007-2017")

CPU times: user 41 ms, sys: 15 ms, total: 56 ms
Wall time: 7min 1s
