# Final Project: Admission Prediction from NHAMCS
## Data preprocessing
### DS5559: Big Data Analysis
### Thomas Hartka, Alicia Doan, Michael Langmayr
Created: 8/2/2020 
  
This notebook creates a Bayesian classifier for the primary reason for visit (RFV1).  The classifier is based on the likelihood of hospital admission, which is calculated by:

$$ P(Admit|RFV)= \frac{\sum{PT_{RFV}.Admit==True}}{\sum{PT_{RFV}}}$$
where $PT_{RFV}$ is the set of patients with a given RFV and $PT_{RFV}.Admit==True$ is the subset of those patients who were admitted.  Only RFVs with five or more occurrences are considered.  
  
Note: These values are calculated using only data from 2007-2013 (not the training or test data) in order to prevent data leakage.
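
As a small worked illustration of the formula above, the following sketch computes the per-RFV admission rate in plain Python on a made-up list of visits. The visit records, the `min_count` parameter, and the helper `admit_rate_by_rfv` are assumptions for illustration only; the notebook itself performs the same computation with Spark in the cells that follow.

```python
from collections import Counter

def admit_rate_by_rfv(visits, min_count=5):
    """Return {rfv: admitted / total} for RFVs seen at least min_count times.

    visits is an iterable of (rfv, admitted) pairs, where admitted is a bool.
    """
    totals = Counter()
    admits = Counter()
    for rfv, admitted in visits:
        totals[rfv] += 1
        if admitted:
            admits[rfv] += 1
    # Keep only RFVs with enough occurrences, then divide admits by totals.
    return {
        rfv: admits[rfv] / n
        for rfv, n in totals.items()
        if n >= min_count
    }

# Hypothetical toy data: 5 "Fever" visits (1 admitted) and 2 "Cough" visits.
toy_visits = (
    [("Fever", True)] + [("Fever", False)] * 4
    + [("Cough", False)] * 2
)
rates = admit_rate_by_rfv(toy_visits)
print(rates)  # {'Fever': 0.2}
```

"Cough" is dropped because it occurs fewer than five times, so no rate is estimated for it.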

## Configure

In [33]:
# set data directory
data_dir = "../data"

In [34]:
# import python libraries
import os
import pandas as pd
import numpy as np
from functools import reduce

In [11]:
# set up pyspark
from pyspark.sql import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

## Read in data

In [36]:
nhamcs = spark.read.parquet(data_dir + "/NHAMCS_processed.2007-2013")

## Calculate baseline admit rate

In [37]:
baseline_admit_rate = nhamcs.filter(nhamcs.ADM_OUTCOME==1).count() / nhamcs.count()
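
In the notation of the formula in the introduction, the baseline is the same ratio taken over all patients rather than over those with a particular RFV. Here $PT$ (without a subscript) is notation introduced for this note to denote all patients in the 2007-2013 data:

$$ P(Admit)= \frac{\sum{PT.Admit==True}}{\sum{PT}}$$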

## Count RFVs and admissions

In [38]:
# count RFVs
RFV_count = nhamcs.groupby(nhamcs.RFV1) \
                    .count() \
                    .withColumnRenamed("count","n") 

# filter out RFVs with fewer than 5 occurrences (keep n >= 5)
RFV_count = RFV_count.filter(RFV_count.n >= 5)

RFV_count.show(10)

+--------------------+----+
|                RFV1|   n|
+--------------------+----+
|Adverse effect of...| 227|
|  Sepsis, septicemia|  20|
|          Nightmares|   9|
|Difficulty in swa...| 242|
|  Respiratory arrest|  11|
|"Entry of ""none"...|  12|
|        Hip symptoms|  18|
|   Extraneous vision|  31|
|Gastrointestinal ...|  27|
| Vertigo - dizziness|3231|
+--------------------+----+
only showing top 10 rows



In [39]:
# count admissions by RFV
RFV_admit = nhamcs.filter(nhamcs.ADM_OUTCOME==1) \
                    .groupby(nhamcs.RFV1) \
                    .count() \
                    .withColumnRenamed("count","admit")
RFV_admit.show(10)

+--------------------+-----+
|                RFV1|admit|
+--------------------+-----+
|  Sepsis, septicemia|   18|
|Adverse effect of...|   21|
|          Nightmares|    1|
|Difficulty in swa...|   38|
|  Respiratory arrest|    5|
|Weakness of foot ...|    1|
|"Entry of ""none"...|    2|
|        Hip symptoms|    7|
|Gastrointestinal ...|    4|
|EEG, electroencep...|    1|
+--------------------+-----+
only showing top 10 rows



In [40]:
# join tables
RFV_comb = RFV_count.join(RFV_admit,['RFV1'] ,how='left').na.fill(0)
RFV_comb.show(10)

+--------------------+----+-----+
|                RFV1|   n|admit|
+--------------------+----+-----+
|Adverse effect of...| 227|   21|
|  Sepsis, septicemia|  20|   18|
|          Nightmares|   9|    1|
|Difficulty in swa...| 242|   38|
|  Respiratory arrest|  11|    5|
|"Entry of ""none"...|  12|    2|
|        Hip symptoms|  18|    7|
|   Extraneous vision|  31|    1|
|Gastrointestinal ...|  27|    4|
| Vertigo - dizziness|3231|  570|
+--------------------+----+-----+
only showing top 10 rows
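
The left join followed by `na.fill(0)` matters because an RFV with no admissions at all never appears in `RFV_admit`, so after the join its `admit` value would be null rather than 0. A minimal plain-Python sketch of the same idea follows; the counts are hypothetical and the dictionaries merely stand in for the Spark DataFrames.

```python
# Hypothetical counts, standing in for RFV_count and RFV_admit.
rfv_count = {"Fever": 40, "Cough": 25, "Hiccough": 6}
rfv_admit = {"Fever": 4, "Cough": 1}  # "Hiccough" had no admissions

# Left join on RFV: keep every RFV in rfv_count and default a missing
# admission count to 0, mirroring join(..., how='left').na.fill(0).
rfv_comb = {
    rfv: {"n": n, "admit": rfv_admit.get(rfv, 0)}
    for rfv, n in rfv_count.items()
}

print(rfv_comb["Hiccough"])  # {'n': 6, 'admit': 0}
```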



## Calculate admission rate for RFV

In [41]:
RFV_comb = RFV_comb.withColumn('RFV1_admit_rate', RFV_comb.admit / RFV_comb.n)
RFV_comb.show(10)

+--------------------+----+-----+-------------------+
|                RFV1|   n|admit|    RFV1_admit_rate|
+--------------------+----+-----+-------------------+
|Adverse effect of...| 227|   21|0.09251101321585903|
|  Sepsis, septicemia|  20|   18|                0.9|
|          Nightmares|   9|    1| 0.1111111111111111|
|Difficulty in swa...| 242|   38|0.15702479338842976|
|  Respiratory arrest|  11|    5|0.45454545454545453|
|"Entry of ""none"...|  12|    2|0.16666666666666666|
|        Hip symptoms|  18|    7| 0.3888888888888889|
|   Extraneous vision|  31|    1|0.03225806451612903|
|Gastrointestinal ...|  27|    4|0.14814814814814814|
| Vertigo - dizziness|3231|  570| 0.1764159702878366|
+--------------------+----+-----+-------------------+
only showing top 10 rows



## Add to NHAMCS 2014-2017

In [42]:
nhamcs_2014 = spark.read.parquet(data_dir + "/NHAMCS_processed.2014-2017")

In [43]:
nhamcs_2014['RFV1','RFV2'].show(10)

+--------------------+---------------+
|                RFV1|           RFV2|
+--------------------+---------------+
|Foot and toe pain...|          Blank|
|            Epilepsy|          Blank|
|Injury, other and...|          Blank|
|               Fever|Kidney dialysis|
|   Pain, unspecified|          Blank|
|               Cough|          Blank|
|Carbuncle, furunc...|          Blank|
|Stomach and abdom...|          Blank|
| Foreign body in eye|          Blank|
|   Pain, unspecified|          Blank|
+--------------------+---------------+
only showing top 10 rows



In [44]:
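# add admit rate to data; RFVs without a known admit rate get the baseline rate
# (subset limits the fill to the new column so other numeric columns are left untouched)
nhamcs_2014 = nhamcs_2014.join(RFV_comb, ['RFV1'], how='left') \
                         .na.fill(baseline_admit_rate, subset=['RFV1_admit_rate'])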

In [45]:
# look at data
nhamcs_2014['RFV1','RFV1_admit_rate'].rdd.takeSample(False, 10, seed=0)

[Row(RFV1='Laceration/cut of upper extremity', RFV1_admit_rate=0.028664495114006514),
 Row(RFV1='Cough', RFV1_admit_rate=0.059416365824308065),
 Row(RFV1='Fever', RFV1_admit_rate=0.09668785547005687),
 Row(RFV1='Motor vehicle accident, type of injur...', RFV1_admit_rate=0.07839506172839507),
 Row(RFV1='Upper abdominal pain, cramps, spasms', RFV1_admit_rate=0.17684594348222424),
 Row(RFV1='Abdominal pain, cramps, spasms, NOS', RFV1_admit_rate=0.18641403276518287),
 Row(RFV1='Constipation', RFV1_admit_rate=0.08409785932721713),
 Row(RFV1='Swelling of knee', RFV1_admit_rate=0.06622516556291391),
 Row(RFV1='Hand and finger pain, ache, soreness,...', RFV1_admit_rate=0.023283582089552238),
 Row(RFV1='Shortness of breath', RFV1_admit_rate=0.4183621131817405)]
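
The baseline fallback can be pictured as a dictionary lookup with a default: RFVs that were too rare or absent in the 2007-2013 data have no entry in the rate table, so they receive the overall admission rate instead. The sketch below uses made-up rates and a made-up baseline, and `lookup_admit_rate` is a hypothetical helper, not part of the notebook.

```python
def lookup_admit_rate(rfv, rate_table, baseline):
    """Return the RFV-specific admit rate, or the baseline if the RFV is unknown."""
    return rate_table.get(rfv, baseline)

# Hypothetical values for illustration only.
rate_table = {"Fever": 0.10, "Shortness of breath": 0.40}
baseline = 0.12

known = lookup_admit_rate("Fever", rate_table, baseline)
unknown = lookup_admit_rate("Rare symptom", rate_table, baseline)
print(known)    # 0.1
print(unknown)  # 0.12
```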

## Store data

In [47]:
# write parquet data, overwriting any existing output
nhamcs_2014.write.mode('overwrite').parquet(data_dir + "/NHAMCS_processed_bc.2014-2017")