#### Pair Problem

For today's pair, your goal is to predict whether or not a patient died during their ICU stay based on their lab values and vital sign measurements.

The dataset is from the MIMIC database. Our target variable will be the `hospital_expire_flag` column, which equals 1 if the patient died during the ICU stay. The other variables will be used as predictors.

Go through the following steps.

1. Load the dataset "MIMIC_Data_small.csv" into a dataframe.
2. Remove all NaNs and duplicates
3. Separate the data into X and y
4. What percentage of the patients in the dataset died (y == 1)?
5. Split the data into a training and testing set (80:20 with a random_state = 42)
6. Normalize your features
7. Use logistic regression to fit the model to the training set and predict on the testing set.
8. Calculate the accuracy and f1-score
9. Predict that no patient dies, and calculate the accuracy and f1-score
10. Which is the best metric to use (accuracy or f1-score)?

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [3]:
df = pd.read_csv('04-svm_psycopg/MIMIC_Data_small.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59726 entries, 0 to 59725
Data columns (total 7 columns):
resprate_mean           59726 non-null float64
bun_min                 59726 non-null float64
tempc_mean              59726 non-null float64
spo2_min                59726 non-null float64
diasbp_mean             59726 non-null float64
sodium_max              59726 non-null float64
hospital_expire_flag    59726 non-null int64
dtypes: float64(6), int64(1)
memory usage: 3.2 MB


In [6]:
df.dropna(axis=0, how='any', inplace=True)

In [8]:
df.drop_duplicates(inplace=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52385 entries, 0 to 59725
Data columns (total 7 columns):
resprate_mean           52385 non-null float64
bun_min                 52385 non-null float64
tempc_mean              52385 non-null float64
spo2_min                52385 non-null float64
diasbp_mean             52385 non-null float64
sodium_max              52385 non-null float64
hospital_expire_flag    52385 non-null int64
dtypes: float64(6), int64(1)
memory usage: 3.2 MB


In [11]:
df.columns

Index(['resprate_mean', 'bun_min', 'tempc_mean', 'spo2_min', 'diasbp_mean',
       'sodium_max', 'hospital_expire_flag'],
      dtype='object')

In [14]:
X = df[['resprate_mean', 'bun_min', 'tempc_mean', 'spo2_min', 'diasbp_mean',
       'sodium_max']]

y = df['hospital_expire_flag']

4. What percentage of the patients in the dataset died (y == 1)?

In [15]:
df['hospital_expire_flag'].value_counts()

0    45934
1     6451
Name: hospital_expire_flag, dtype: int64
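
The counts above answer step 4 once they are turned into a proportion. A minimal sketch using only the two counts printed by `value_counts()`; in the notebook itself, `y.mean()` or `y.value_counts(normalize=True)` should give the same figure directly.

```python
# Class counts copied from the value_counts() output above.
survived = 45934
died = 6451

total = survived + died          # 52385 rows after cleaning
pct_died = 100 * died / total    # share of patients with hospital_expire_flag == 1

print(f"{pct_died:.2f}% of patients died")   # prints "12.31% of patients died"
```

Roughly one patient in eight belongs to the positive class, so the two classes are clearly imbalanced.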

5. Split the data into a training and testing set (80:20 with a random_state = 42)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

6. Normalize your features

7. Use logistic regression to fit the model to the training set and predict on the testing set.

In [36]:
est = make_pipeline(StandardScaler(), LogisticRegression())

In [38]:
est.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
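
The pipeline covers steps 6 and 7 together: `StandardScaler` learns each feature's mean and standard deviation from the training set only, and those same statistics are reused to transform the test set before prediction. A rough pure-Python sketch of that idea on a single feature, assuming made-up values (the numbers are illustrative and not taken from the dataset):

```python
from statistics import fmean, pstdev

# Hypothetical values for a single feature; not taken from MIMIC_Data_small.csv.
train_values = [12.0, 18.0, 20.0, 22.0, 28.0]
test_values = [20.0, 30.0]

# "fit": learn the statistics from the training data only.
mu = fmean(train_values)
sigma = pstdev(train_values)  # population std (ddof=0), the convention StandardScaler uses


def standardize(values):
    """Apply the training-set mean and std to any batch of values."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_values)
test_scaled = standardize(test_values)   # the test set reuses mu and sigma

print(train_scaled)
print(test_scaled)
```

Scaling the test set with statistics learned on the training set keeps information about the test data out of the fitting step, which is what calling `est.fit(X_train, y_train)` and then `est.predict(X_test)` does for every feature.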

In [40]:
predictions = est.predict(X_test)

8. Calculate the accuracy and f1-score

In [51]:
accuracy_score(y_test, predictions)

0.8797365658108237

In [53]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.89      0.99      0.94      9197
          1       0.55      0.09      0.16      1280

avg / total       0.85      0.88      0.84     10477



In [54]:
len(predictions)

10477

In [76]:
predictions.shape

(10477,)

9. Predict that no patient dies, and calculate the accuracy and f1-score

In [68]:
no_deaths = np.zeros(len(y_test), dtype=int)  # one "survived" (0) prediction per test patient

In [78]:
no_deaths.shape

(10477,)

In [74]:
print(accuracy_score(y_test, no_deaths))

0.8778276224109955


In [75]:
print(classification_report(y_test, no_deaths))

             precision    recall  f1-score   support

          0       0.88      1.00      0.93      9197
          1       0.00      0.00      0.00      1280

avg / total       0.77      0.88      0.82     10477
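
The zero f1-score for class 1 follows directly from the definitions: with no positive predictions there are no true positives, so precision and recall for the positive class are both reported as 0. A small from-scratch sketch using the test-set class counts from the report above (9197 survivors, 1280 deaths); the helper `binary_metrics` is a name introduced here for illustration and is not part of scikit-learn:

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    accuracy = correct / len(y_true)
    # Treat undefined ratios (0/0) as 0.0, which is how the report above displays them.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Labels rebuilt from the support column: 9197 zeros and 1280 ones (order is irrelevant here).
y_true = [0] * 9197 + [1] * 1280
y_pred = [0] * len(y_true)   # the "no patient dies" baseline

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

The accuracy is simply the share of survivors in the test set (9197 / 10477), which is why a model that never predicts a death still scores about 0.88.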



10. Which is the best metric to use (accuracy or f1-score)?
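
One way to approach this question is to compare how much each metric separates the fitted model from the trivial baseline. A short sketch using only the figures reported above (the two accuracy values, the test-set size of 10477, and the class-1 f1-scores from the two classification reports):

```python
# Figures copied from the outputs above.
n_test = 10477
acc_model, acc_baseline = 0.8797365658108237, 0.8778276224109955
f1_model, f1_baseline = 0.16, 0.00   # class-1 f1-score from the two classification reports

# Accuracy gap expressed as a number of test patients.
extra_correct = round((acc_model - acc_baseline) * n_test)

print(f"accuracy gain: {acc_model - acc_baseline:.4f} ({extra_correct} more correct predictions)")
print(f"class-1 f1 gain: {f1_model - f1_baseline:.2f}")
```

On these numbers the logistic regression gets only about 20 more test patients right than predicting that nobody dies, so accuracy makes the two look nearly identical. The class-1 f1-score, 0.00 for the baseline and 0.16 for the model, shows that the baseline never identifies a death and that the model still misses most of them. With deaths making up about 12% of the test set (1280 of 10477), this suggests the f1-score, or precision and recall on the positive class, is the more informative metric here.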