# Final Project: Admission Prediction from NHAMCS
## Progress report: Summary of dataset
### DS5559: Big Data Analysis
### Thomas Hartka, Alicia Doan, Michael Langmayr
Created: 6/21/20  

## Data Source

Our data is from the National Hospital Ambulatory Medical Care Survey (NHAMCS), a stratified sample of Emergency Department (ED) visits from around the United States, collected by the CDC.  Data from the years 2007-2017 are publicly available at:

https://www.cdc.gov/nchs/ahcd/datasets_documentation_related.htm

The data is provided in tabular format.  Each row represents a patient encounter and each column is a variable associated with the encounter.  We previously converted the Stata files (the original format) into CSVs.  We then used PySpark to combine the 11 years of data into one large table and stored it in Parquet format.
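
The combining step itself is not shown in this notebook. Because the variables collected differ by year, stacking the yearly tables means aligning columns by name and leaving gaps where a year lacks a variable. The following is a minimal plain-Python sketch of that idea, not the PySpark code we ran; the function name `combine_years` and the two toy CSV strings are hypothetical.

```python
import csv
import io


def combine_years(csv_texts):
    """Stack yearly CSV tables; columns a year lacks are filled with None."""
    columns, stacked = [], []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        rows = list(reader)
        # Record column names in first-seen order across all years
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        stacked.extend(rows)
    # Re-key every row on the full column list so all rows share one schema
    combined = [{name: row.get(name) for name in columns} for row in stacked]
    return columns, combined


# Toy inputs (hypothetical values): CKD is absent from the earlier year
year_a = "VYEAR,AGE,CHF\n2013,36,No\n"
year_b = "VYEAR,AGE,CHF,CKD\n2014,40,No,Yes\n"

columns, combined = combine_years([year_a, year_b])
print(columns)
print(combined)
```

In PySpark, `DataFrame.unionByName` with `allowMissingColumns=True` (Spark 3.1 and later) performs a comparable alignment.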

In [1]:
# set directories
data_dir = "../data"
results_dir = "../results"

In [2]:
# import python libraries
import pandas as pd
import numpy as np

# set up pyspark
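# import only the names used below; wildcard imports from
# pyspark.sql.functions shadow Python builtins such as sum and max
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType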

spark = SparkSession.builder.getOrCreate()

In [3]:
# load parquet data
NHAMCS = spark.read.parquet(data_dir + "/NHAMCS.2007-2017")

In [4]:
# example of data
NHAMCS.select(NHAMCS.columns[:10]).show(5)

+------+-----+--------+---+-------+--------+---+-----------------+------+--------------------+
|VMONTH|VYEAR|   VDAYR|AGE|ARRTIME|WAITTIME|LOV|         RESIDNCE|   SEX|               ETHUN|
+------+-----+--------+---+-------+--------+---+-----------------+------+--------------------+
|  July| 2009|Saturday| 36|   2125|       5|296|Private residence|Female|Not Hispanic or L...|
|  July| 2009|  Friday| 40|   1904|       5| 86|Private residence|Female|Not Hispanic or L...|
|  July| 2009|  Friday| 76|   1034|       0| 86|Private residence|  Male|Not Hispanic or L...|
|  July| 2009|Thursday| 27|     25|      63|190|Private residence|Female|Not Hispanic or L...|
|  July| 2009|Thursday| 71|   1940|      40|230|Private residence|Female|Not Hispanic or L...|
+------+-----+--------+---+-------+--------+---+-----------------+------+--------------------+
only showing top 5 rows



## Number of Records

In [5]:
# count number of records
NHAMCS.count()

305897

There are 305,897 records in the 11 years of data we obtained.  
  
We made a table to examine which variables were available in each year.  In the following table a '1' indicates there was at least one non-null value for the variable in that year; a '0' indicates all values were null (meaning the variable was not collected that year).  
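
The code that produced this table is not part of this notebook. A minimal sketch of the flagging rule described above (1 if any value is non-null, otherwise 0) could look like the following; the function name `availability_by_year` and the toy rows are hypothetical.

```python
def availability_by_year(rows, variables, year_key="YEAR"):
    """Return {year: {variable: 1 or 0}}; 1 means at least one non-null value."""
    table = {}
    for row in rows:
        # Start every variable at 0 the first time a year is seen
        flags = table.setdefault(row[year_key], {name: 0 for name in variables})
        for name in variables:
            if row.get(name) is not None:
                flags[name] = 1
    return table


# Toy rows (hypothetical values): CKD is never recorded in 2013
rows = [
    {"YEAR": 2013, "CHF": "No", "CKD": None},
    {"YEAR": 2013, "CHF": "Yes", "CKD": None},
    {"YEAR": 2014, "CHF": "No", "CKD": "Yes"},
]

availability = availability_by_year(rows, ["CHF", "CKD"])
print(availability)
```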

In [6]:
# load variable data
NHAMCS_vars = pd.read_csv(results_dir + "/NHAMCS_vars_by_year.csv")

# display table of variables by year
NHAMCS_vars[['YEAR','CHF','CKD','CAD','COPD']].set_index('YEAR').transpose()

YEAR,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
CHF,0,0,1,1,1,1,1,1,1,1,1
CKD,0,0,0,0,0,0,0,1,1,1,1
CAD,0,0,0,0,0,0,0,1,1,1,1
COPD,0,0,0,0,0,1,1,1,1,1,1


This shows that several of the potentially valuable predictors (CKD-Chronic Kidney Disease, CAD-Coronary Artery Disease, etc.) were not collected until 2014.  We therefore decided to limit our analysis to 2014-2017.
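
As a check on the cutoff, the sketch below finds the earliest year from which every listed variable is collected through 2017. It uses only the four variables displayed above, so it illustrates the reasoning rather than auditing all predictors; the function name `first_complete_year` is hypothetical.

```python
years = list(range(2007, 2018))

# Availability flags copied from the table above (one entry per year, 2007-2017)
collected = {
    "CHF":  [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "CKD":  [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "CAD":  [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "COPD": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
}


def first_complete_year(years, collected):
    """Earliest year from which every variable is collected through the last year."""
    for i, year in enumerate(years):
        if all(all(flags[i:]) for flags in collected.values()):
            return year
    return None


cutoff = first_complete_year(years, collected)
print(cutoff)
```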

In [7]:
# convert year to integer from string
NHAMCS = NHAMCS.withColumn("YEAR", NHAMCS["YEAR"].cast(IntegerType()))

# filter for years 2014-2017
NHAMCS = NHAMCS.filter(NHAMCS['YEAR']>=2014)

In [8]:
# count records 2014-2017
NHAMCS.count()

81081

We are therefore left with 81,081 records for analysis for the years 2014-2017.  

## Outcome

Our outcome of interest is hospital admission.  We would like to predict which patients will require admission to the hospital after ED evaluation and which patients can be discharged.  Being able to predict admission at the time a patient presents to the ED would help with mobilizing resources.  If a patient has a high likelihood of admission, a ward bed could be made ready and the admitting physicians could be informed.  

NHAMCS records disposition data in several variables.  A patient is considered positive for admission if they are admitted to the local hospital (ADMITHOS), admitted to an observation unit and then hospitalized (OBSHOS), or transferred to another hospital, either psychiatric (TRANPSYC) or other (TRANOTH):

ADM_OUTCOME = (ADMITHOS = "Yes") OR (TRANPSYC = "Yes") OR (TRANOTH = "Yes") OR (OBSHOS = "Yes")
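
The same rule can be written as a small plain-Python function, which is convenient for checking edge cases before applying it in Spark. This is a sketch, assuming each record is a dictionary keyed by variable name; `adm_outcome` and the sample records are hypothetical. A missing or null flag is treated as not "Yes".

```python
DISPOSITION_FLAGS = ("ADMITHOS", "TRANPSYC", "TRANOTH", "OBSHOS")


def adm_outcome(row):
    """Return 1 if any disposition flag equals "Yes", otherwise 0."""
    return int(any(row.get(flag) == "Yes" for flag in DISPOSITION_FLAGS))


# Hypothetical records covering the main cases
discharged = {"ADMITHOS": "No", "TRANPSYC": "No", "TRANOTH": "No", "OBSHOS": "No"}
transferred = {"ADMITHOS": "No", "TRANPSYC": "No", "TRANOTH": "Yes", "OBSHOS": "No"}
partly_missing = {"ADMITHOS": None, "TRANPSYC": "No"}

print(adm_outcome(discharged), adm_outcome(transferred), adm_outcome(partly_missing))
```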
 

In [9]:
# create outcome variable
NHAMCS = NHAMCS.withColumn("ADM_OUTCOME", when((col("ADMITHOS")=="Yes") | \
                                                (col("TRANPSYC")=="Yes") | \
                                                (col("TRANOTH")=="Yes") | \
                                                (col("OBSHOS")=="Yes"), 1).otherwise(0))

In [10]:
# stats on outcome
NHAMCS.filter(NHAMCS['ADM_OUTCOME']==1).count()

9308
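
The two counts reported so far (81,081 encounters, 9,308 positive outcomes) give the admission rate directly. The short calculation below also reports the share of non-admitted encounters, which is the accuracy a trivial "always discharge" rule would reach. That baseline is our own framing here, offered as a point of reference rather than a result of the analysis.

```python
total = 81081      # encounters, 2014-2017
admitted = 9308    # encounters with ADM_OUTCOME == 1

not_admitted = total - admitted
rate = admitted / total
majority_baseline = not_admitted / total

print(f"{admitted:,} of {total:,} encounters admitted ({rate:.1%})")
print(f"always-discharge baseline accuracy: {majority_baseline:.1%}")
```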

In [11]:
####################
# ADD OUTCOME STATS
####################

<***---Describe outcome stats----------***>


## Predictors

In [12]:
# count number of columns
len(NHAMCS.columns)

1220

From 2007-2017, 1,219 variables were collected (the column count above is 1,220 because it includes the ADM_OUTCOME column we added).  The exact variables collected varied by year, and in no year were all 1,219 variables collected for each encounter.  This data includes patient demographics, reason for visit, which tests were ordered, medications prescribed, and ED/hospital information. 
  
We want to use only data that is available when the patient presents, which includes demographics, comorbidities, vital signs, and reason for visit.  We will not use test orders, medications, or discharge diagnoses since these are not known on arrival. 
  
The following are our most likely predictors by category:

__Demographics__
* Age (AGE, AGER, AGEDAYS)
* Sex (SEX)
* Residence (RESIDNCE)
* Arrival time (ARRTIME)
* Year of visit (YEAR)

__Vital signs__
* Heart rate (PULSE)
* Temperature (TEMPF)
* Respiratory rate (RESPR)
* Blood pressure (BPSYS, BPDIAS)
* Oxygen saturation (POPCT)
* Pain scale (PAINSCALE)

__Comorbidities (known health problems)__
* Alzheimer's disease (ALZHD)
* Asthma (ASTHMA)
* Coronary artery disease (CAD)
* Cancer (CANCER)
* Cerebrovascular disease (CEBVD)
* Congestive heart failure (CHF)
* Chronic kidney disease (CKD)
* Chronic obstructive pulmonary disease (COPD)
* Depression (DEPRN)
* Diabetes-type unknown (DIABTYP0)
* Diabetes-type I (DIABTYP1)
* Diabetes-type II (DIABTYP2)
* HIV (EDHIV)
* End-stage renal disease/dialysis (ESRD)
* Alcohol abuse (ETOHAB)
* History of pulmonary embolism (HPE)
* Hypertension (HTN)
* Hyperlipidemia (HYPLIPID)
* Obesity (OBESITY)
* Obstructive sleep apnea (OSA)
* Osteoporosis (OSTPRSIS)
* Substance abuse (SUBSTAB)
* No chronic diseases (NOCHRON)
* Total number of chronic diseases (TOTCHRON)


__Reason for visit__
* Reason for visit-free text #1 (RFV1)
* Reason for visit-free text #2 (RFV2)
* Reason for visit-free text #3 (RFV3)
* Reason for visit-free text #4 (RFV4)
* Reason for visit-free text #5 (RFV5)
* Is this visit related to an injury/trauma, overdose/poisoning, or adverse effect of medical/surgical treatment? (INJURY)
* Did the injury/trauma, overdose/poisoning, or adverse effect of medical/surgical treatment occur within 72 hours prior to the date and time of this visit? (INJURY72)
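
Before modeling, it helps to confirm which candidate predictors actually exist as columns. The sketch below groups a subset of the variables listed above by category and splits them into present and missing against a list of column names. `PREDICTORS` and `select_predictors` are hypothetical names, and the sample column list is the ten-column preview shown earlier, not the full table.

```python
# A subset of the candidate predictors listed above, grouped by category
PREDICTORS = {
    "demographics": ["AGE", "SEX", "RESIDNCE", "ARRTIME", "YEAR"],
    "vital_signs": ["PULSE", "TEMPF", "RESPR", "BPSYS", "BPDIAS", "POPCT", "PAINSCALE"],
    "comorbidities": ["CHF", "CKD", "CAD", "COPD", "TOTCHRON"],
    "reason_for_visit": ["RFV1", "INJURY"],
}


def select_predictors(available_columns, predictors=PREDICTORS):
    """Split predictor names into those present in the table and those missing."""
    available = set(available_columns)
    names = [name for group in predictors.values() for name in group]
    present = [name for name in names if name in available]
    missing = [name for name in names if name not in available]
    return present, missing


# Column names from the ten-column preview near the top of this notebook
preview_columns = ["VMONTH", "VYEAR", "VDAYR", "AGE", "ARRTIME",
                   "WAITTIME", "LOV", "RESIDNCE", "SEX", "ETHUN"]

present, missing = select_predictors(preview_columns)
print(present)
print(missing)
```

With a Spark DataFrame, passing `NHAMCS.columns` as `available_columns` would run the same check against the full table.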


In [13]:
####################
# ADD PREDICTOR STATS
####################

<***---Describe predictor stats----------***>

## Visualizations


<***---Add at least 5 graphs----------***>

Ideas:
- Visits per year
- Admission percentage vs age (AGER is categorical)
- Number of patients with comorbidities (histogram of TOTCHRON)
- Admission percentage vs total number of comorbidities (TOTCHRON)
- Age vs total number of comorbidities
- Admission percentage vs pain score
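
Several of these ideas reduce to the same computation: the percentage of encounters with ADM_OUTCOME equal to 1 within each level of a grouping variable. A minimal plain-Python sketch of that aggregation follows; `admission_rate_by` is a hypothetical helper and the toy rows are invented for illustration, not drawn from NHAMCS.

```python
from collections import defaultdict


def admission_rate_by(rows, key, outcome="ADM_OUTCOME"):
    """Percentage of rows with outcome == 1 within each level of `key`."""
    totals = defaultdict(int)
    admitted = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        admitted[row[key]] += row[outcome]
    return {group: 100.0 * admitted[group] / totals[group] for group in totals}


# Toy rows (invented values) using two of the AGER categories
toy_rows = [
    {"AGER": "Under 15 years", "ADM_OUTCOME": 0},
    {"AGER": "Under 15 years", "ADM_OUTCOME": 0},
    {"AGER": "75 years and over", "ADM_OUTCOME": 1},
    {"AGER": "75 years and over", "ADM_OUTCOME": 0},
]

rates = admission_rate_by(toy_rows, "AGER")
print(rates)
```

The resulting dictionary maps each group to a percentage and can be passed to any plotting library as bar heights.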

In [16]:
NHAMCS.filter(NHAMCS['AGER']=="One year or more").count()

0

In [18]:
NHAMCS.select('AGER').distinct().show()

+-----------------+
|             AGER|
+-----------------+
|   Under 15 years|
|75 years and over|
|      65-74 years|
|      25-44 years|
|      15-24 years|
|      45-64 years|
+-----------------+

