# Project Goals
## Overview
## The Problem: Finding a Needle in a Haystack

# The Data: First Look

Let's load our data into a dataframe to examine it:

In [1]:
import src.wrangle
df = src.wrangle.get_raw_data()

## Size of the Raw Data

In [2]:
print(f'''Number of Columns: {df.shape[1]}, Number of Rows: {df.shape[0]}''')

Number of Columns: 185, Number of Rows: 91713


> This means that we have *91,713* patients in our dataset, with *185* columns recorded for each of them within that 24-hour period. That's quite a bit of data.

## Imbalanced Data

Our goal for this project is to predict patient survivability, so what does that distribution look like in our data?

In [3]:
num_patients_died = len(df[df.hospital_death == 1])
print('Proportion of patients who did not survive: {: .2f}'.format(num_patients_died / len(df)))

Proportion of patients who did not survive:  0.09


>*91* percent of the patients survived their time in the ICU, while only *9* percent did not. 

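While this is great news in terms of ICU survival rates, it means we're dealing with an imbalanced dataset, which raises its own challenges as we move forward. For example, a model that simply predicts that every patient survives would be about 91 percent accurate while never identifying a single death.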
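
To make the imbalance problem concrete, here is a minimal sketch on hypothetical toy data (the 9 / 91 split mirrors the proportions above; the series itself is made up for illustration). It scores a "model" that always predicts survival:

```python
import pandas as pd

# Hypothetical toy target: 1 = died, 0 = survived, in a 9% / 91% split.
y = pd.Series([1] * 9 + [0] * 91, name="hospital_death")

# A "model" that always predicts the majority class (survival).
y_pred = pd.Series(0, index=y.index)

# Accuracy: share of predictions that match the true label.
accuracy = (y_pred == y).mean()

# Recall on the positive class: share of actual deaths the model catches.
recall = ((y_pred == 1) & (y == 1)).sum() / (y == 1).sum()

print(f"Accuracy: {accuracy:.2f}, recall on deaths: {recall:.2f}")
# Accuracy: 0.91, recall on deaths: 0.00
```

Accuracy looks strong even though the model is useless for the patients we care about most, which is why a metric such as recall on the minority class is usually a better guide on data like this.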

# Preparation: Challenges Faced and How We Handled Them

## Null Values: A Different Beast in an Imbalanced Dataset
Because the target we're trying to predict (patient survivability) is imbalanced, we can't handle nulls with blanket fixes such as dropping rows or imputing a single value. Instead, we need to check whether missingness itself is associated with the target: if patients with a missing value die at a different rate than patients with a recorded value, the missing values carry information about patient survivability.

In [6]:
import pandas as pd
from scipy import stats

# Significance level for the chi-squared tests
alpha = 0.05

# Test every column (the target included) for independence from hospital_death
for col in df.columns:
    # Contingency table of the column's values against the target
    observed = pd.crosstab(df[col], df["hospital_death"])
    chi2, p, degf, expected = stats.chi2_contingency(observed)

    if p < alpha:
        # Reject the null hypothesis of independence
        print("({} and hospital_death) are  dependent of each other. (p = {})".format(col, p))
    else:
        # Fail to reject the null hypothesis of independence
        print("({} and hospital_death) are  independent of each other. (p = {})".format(col, p))

(encounter_id and hospital_death) are  independent of each other. (p = 0.4984475106646044)
(hospital_id and hospital_death) are  dependent of each other. (p = 3.971188907548724e-148)
(hospital_death and hospital_death) are  dependent of each other. (p = 0.0)
(age and hospital_death) are  dependent of each other. (p = 4.3986226670688454e-204)
(bmi and hospital_death) are  independent of each other. (p = 0.10323036630688469)
(elective_surgery and hospital_death) are  dependent of each other. (p = 1.8111023373323387e-176)
(ethnicity and hospital_death) are  dependent of each other. (p = 0.0031164745025517304)
(gender and hospital_death) are  dependent of each other. (p = 0.03441709366041668)
(height and hospital_death) are  dependent of each other. (p = 0.0021710052871524873)
(hospital_admit_source and hospital_death) are  dependent of each other. (p = 1.2519350471326363e-197)
(icu_admit_source and hospital_death) are  dependent of each other. (p = 3.7031521455538844e-243)
(icu_id and hos
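
The loop above tests each column's recorded values against the target. To test missingness itself, the same chi-squared test can be run on a null indicator instead. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its `lab_value` column are made up for illustration), not the analysis used in this project:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: `lab_value` is missing more often for patients who died.
toy = pd.DataFrame({
    "hospital_death": [1] * 20 + [0] * 80,
    "lab_value": [np.nan] * 15 + [1.0] * 5 + [np.nan] * 10 + [1.0] * 70,
})

# Cross-tabulate the missingness indicator (True / False) against the target.
observed = pd.crosstab(toy["lab_value"].isna(), toy["hospital_death"])

# Chi-squared test of independence between "is missing" and the target.
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```

A small p-value here would suggest that whether the value is missing is related to the outcome, so dropping or blindly imputing that column could discard useful signal.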

In [7]:
# For each column with missing values, compare the death-to-survival ratio of
# rows where the value is missing against rows where it is present.
records = []
for c in df.columns:
    n_nulls = df[c].isna().sum()
    if n_nulls == 0:
        continue
    present = df.loc[df[c].notna(), 'hospital_death'].value_counts()
    missing = df.loc[df[c].isna(), 'hospital_death'].value_counts()
    val = present[1] / present[0]    # deaths per survivor when the value is present
    val1 = missing[1] / missing[0]   # deaths per survivor when the value is missing
    # Keep only columns where missingness shifts the ratio noticeably
    if abs(val1 - val) > 0.05:
        records.append([c, n_nulls, val1 / val, val1, val, missing[1]])

In [8]:
df2 = pd.DataFrame.from_records(
    records,
    columns=['column', 'nulls', 'ratio', 'null_death_ratio', 'non_null_death_ratio', 'null_deaths'],
)

In [9]:
df2.head()

|   | column | nulls | ratio | null_death_ratio | non_null_death_ratio | null_deaths |
|---|---|---|---|---|---|---|
| 0 | age | 4228 | 1.97447 | 0.17903 | 0.090672 | 642 |
| 1 | gender | 25 | 4.986259 | 0.470588 | 0.094377 | 8 |
| 2 | height | 1334 | 1.777023 | 0.166084 | 0.093462 | 190 |
| 3 | albumin_apache | 54379 | 0.592493 | 0.074238 | 0.125298 | 3758 |
| 4 | fio2_apache | 70868 | 0.290996 | 0.063047 | 0.216658 | 4203 |


Applying all our data transformations:

In [11]:
df = src.wrangle.get_training_data()

In [13]:
print(f'Number of Columns: {df.shape[1]}, Number of rows: {df.shape[0]}')

Number of Columns: 185, Number of rows: 91688


> After fixing missing values and min-max issues, we lose only 25 of the original 91,713 rows (91,688 remain), so there is no significant loss of data points.
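
The notebook does not show what `get_training_data` does internally. As an illustration only, one plausible kind of "min-max issue" is a row whose recorded minimum exceeds its recorded maximum. The sketch below uses a hypothetical toy frame with assumed `heartrate_min` / `heartrate_max` column names and shows one possible repair, which may differ from the one applied in `src.wrangle`:

```python
import pandas as pd

# Hypothetical toy frame; the column names are assumptions for illustration.
toy = pd.DataFrame({
    "heartrate_min": [60, 95, 70],
    "heartrate_max": [110, 80, 70],
})

# Flag rows whose recorded minimum is larger than the recorded maximum.
swapped = toy["heartrate_min"] > toy["heartrate_max"]

# One possible repair: swap the two values in the flagged rows only.
toy.loc[swapped, ["heartrate_min", "heartrate_max"]] = (
    toy.loc[swapped, ["heartrate_max", "heartrate_min"]].to_numpy()
)

print(toy)
```

Swapping keeps the row, whereas dropping it would cost data points; which choice is appropriate depends on how trustworthy the swapped values are.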

# Exploration: Examining Which Factors are Affecting Patient Survival

## Are certain hospitals better at data collection, and does that have an impact on patient outcomes?

## Is there a link between hospital and death rate?

## Does the age of the patient have a significant impact on patient survivability?

## Does the gender of the patient have a significant impact on patient outcome?

# Feature Engineering: 'Bringing a Magnet to the Needle in a Haystack Problem'

# Modeling: Bringing it All Together

# Conclusion: What We've Done and What's Next