# Introduction

We'll use the hospital data set discussed in Module 2 of the course. See the notebook `DAT158-Part2-6-Extra-RandomForests-Examples.ipynb` for details. 

The goal is to train a simple model that predicts hospital length of stay. We keep the model deliberately simple, since the focus here is on deploying it.

# Setup

In [1]:
%matplotlib inline
import numpy as np, pandas as pd
from pathlib import Path

In [2]:
NB_DIR = Path.cwd()
DATA_DIR = NB_DIR/'..'/'data'

# Load data

In [3]:
df = pd.read_pickle(DATA_DIR/'hospital')
data_dict = pd.read_excel(DATA_DIR/'Data_Dictionary.xlsx')
metadata = pd.read_csv(DATA_DIR/'MetaData_Facilities.csv')

In [4]:
df.head()

|   | rcount | gender | facid | eid | dialysisrenalendstage | asthma | irondef | pneum | substancedependence | psychologicaldisordermajor | ... | neutrophils | sodium | glucose | bloodureanitro | creatinine | bmi | pulse | respiration | secondarydiagnosisnonicd9 | lengthofstay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | False | False | False | False | False | False | ... | 14.2 | 140.36113 | 192.476913 | 12.0 | 1.390722 | 30.432417 | 96 | 6.5 | 4 | 3 |
| 1 | 5 | 0 | 0 | 2 | False | False | False | False | False | False | ... | 4.1 | 136.731689 | 94.078506 | 8.0 | 0.943164 | 28.460516 | 61 | 6.5 | 1 | 7 |
| 2 | 1 | 0 | 1 | 3 | False | False | False | False | False | False | ... | 8.9 | 133.058517 | 130.530518 | 12.0 | 1.06575 | 28.843811 | 64 | 6.5 | 2 | 3 |
| 3 | 0 | 0 | 0 | 4 | False | False | False | False | False | False | ... | 9.4 | 138.994019 | 163.377029 | 12.0 | 0.906862 | 27.959007 | 76 | 6.5 | 1 | 1 |
| 4 | 0 | 0 | 4 | 5 | False | False | False | True | False | True | ... | 9.05 | 138.634842 | 94.886658 | 11.5 | 1.242854 | 30.258926 | 67 | 5.6 | 2 | 4 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 26 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   rcount                      100000 non-null  int32  
 1   gender                      100000 non-null  int32  
 2   facid                       100000 non-null  int32  
 3   eid                         100000 non-null  int32  
 4   dialysisrenalendstage       100000 non-null  bool   
 5   asthma                      100000 non-null  bool   
 6   irondef                     100000 non-null  bool   
 7   pneum                       100000 non-null  bool   
 8   substancedependence         100000 non-null  bool   
 9   psychologicaldisordermajor  100000 non-null  bool   
 10  depress                     100000 non-null  bool   
 11  psychother                  100000 non-null  bool   
 12  fibrosisandother            100000 non-null  bool   
 13  malnutrition   

In [6]:
data_dict

|   | Index | Data fields | Type | Descriptions |
|---|---|---|---|---|
| 0 |  | LengthOfStay |  |  |
| 1 | 1.0 | eid | Integer | Unique Id of the hospital admission |
| 2 | 2.0 | vdate | String | Visit date |
| 3 | 3.0 | rcount | Integer | Number of readmissions within last 180 days |
| 4 | 4.0 | gender | String | Gender of the patient\nM or F |
| 5 | 5.0 | dialysisrenalendstage | String | Flag for renal disease during encounter |
| 6 | 6.0 | asthma | String | Flag for asthma during encounter |
| 7 | 7.0 | irondef | String | Flag for iron deficiency during encounter |
| 8 | 8.0 | pneum | String | Flag for pneumonia during encounter |
| 9 | 9.0 | substancedependence | String | Flag for substance dependence during encounter |


# Prepare data

In [6]:
# Features: every column except the target
X = df.drop('lengthofstay', axis=1)
# Target: the length-of-stay class
y = df['lengthofstay']

In [7]:
from sklearn.model_selection import train_test_split

## Split off a test set

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
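The call above relies on two behaviours of `train_test_split`: `test_size` defaults to 0.25 when neither size is given, and `stratify=y` keeps the class proportions the same in both splits. A minimal sketch on made-up labels (the arrays `X_demo` and `y_demo` are hypothetical, not the hospital data) illustrates both:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# A quarter of the rows end up in the test set (default test_size)
print(len(y_tr), len(y_te))

# With stratification, the share of class 1 is the same in both splits
print(y_tr.mean(), y_te.mean())
```

For the 100,000-row hospital data this default split leaves 25,000 rows for testing.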

## Select model

_Note: as discussed earlier, we treat this as a classification problem, i.e. is the expected length of stay 1, 2, 3, 4, 5, 6, 7, 8, or more than 8 days?_
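Nine classes means that all stays longer than 8 days share one label. The target was prepared in the Module 2 notebook, so the following is only a sketch of one way such a cap could be produced, assuming a raw integer length of stay and using made-up values (the series `raw_los` and the label 9 for "more than 8 days" are assumptions):

```python
import pandas as pd

# Hypothetical raw lengths of stay in days (illustrative values only)
raw_los = pd.Series([1, 3, 8, 12, 17], name="lengthofstay")

# Assumption: every stay longer than 8 days is mapped to one class, labelled 9
capped_los = raw_los.clip(upper=9)

print(capped_los.tolist())
```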

In [9]:
from sklearn.ensemble import RandomForestClassifier

In [10]:
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Train model

In [11]:
rf.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

## Evaluate model

In [12]:
from sklearn.metrics import accuracy_score

In [13]:
y_pred = rf.predict(X_test)

In [14]:
accuracy_score(y_test, y_pred)

0.66676

In [15]:
from sklearn.metrics import confusion_matrix

In [16]:
confusion_matrix(y_test, y_pred)

array([[3935,  519,   39,    2,    0,    0,    0,    0,    0],
       [ 407, 2516,  230,   52,    1,    0,    0,    0,    0],
       [  19,  713, 2952,  304,   25,    4,    0,    0,    0],
       [   1,   97,  974, 2357,  255,   21,    0,    0,    0],
       [   0,   10,  109,  995, 1716,  190,    8,    0,    1],
       [   0,    7,   65,  252,  758, 1338,  159,   10,    2],
       [   0,    0,    7,   33,  175,  570,  903,   93,   35],
       [   0,    1,    6,   17,   74,  166,  397,  371,  131],
       [   0,    0,    1,    5,   16,   85,  114,  176,  581]])
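The confusion matrix carries more information than the single accuracy number. As a rough sketch, the matrix printed above can be copied into an array to recover the overall accuracy, the recall per class, and the share of predictions that miss by at most one class (the names `recall` and `within_one` are ours, not scikit-learn's):

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([
    [3935,  519,   39,    2,    0,    0,    0,    0,    0],
    [ 407, 2516,  230,   52,    1,    0,    0,    0,    0],
    [  19,  713, 2952,  304,   25,    4,    0,    0,    0],
    [   1,   97,  974, 2357,  255,   21,    0,    0,    0],
    [   0,   10,  109,  995, 1716,  190,    8,    0,    1],
    [   0,    7,   65,  252,  758, 1338,  159,   10,    2],
    [   0,    0,    7,   33,  175,  570,  903,   93,   35],
    [   0,    1,    6,   17,   74,  166,  397,  371,  131],
    [   0,    0,    1,    5,   16,   85,  114,  176,  581],
])

# Overall accuracy: correct predictions (the diagonal) over all test samples
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal divided by the number of true samples per class
recall = np.diag(cm) / cm.sum(axis=1)

# Share of predictions that are at most one class away from the true class
idx = np.arange(cm.shape[0])
near_diagonal = np.abs(idx[:, None] - idx[None, :]) <= 1
within_one = cm[near_diagonal].sum() / cm.sum()

print(f"accuracy:   {accuracy:.5f}")
print("recall:    ", np.round(recall, 2))
print(f"within one: {within_one:.4f}")
```

Most of the mass sits on or right next to the diagonal, so the model's errors are mostly off by a single class rather than wildly wrong.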

# Export trained model 

You can use either `pickle` or `joblib` for this, among other options.

In [17]:
MODEL_DIR = NB_DIR/'..'/'models'

In [18]:
from joblib import dump

In [19]:
dump(rf, MODEL_DIR/'simple_rf_model.joblib', compress=6)

['/home/alex/Dropbox/Jobb/HIB/Kurs/DAT158/Forelesninger-labs/Flask/quick_flask_tutorial/ex3-ml_and_flask/nbs/../models/simple_rf_model.joblib']
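On the deployment side, the saved file is read back with `joblib.load`. The sketch below is self-contained so that it runs anywhere: it trains a small forest on synthetic data and writes it to a temporary directory. The data, the model size, and the file name `demo_rf_model.joblib` are stand-ins, not the hospital model.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_rf_model.joblib"  # hypothetical file name
    dump(model, model_path, compress=6)

    # In a deployed app, only this step is needed
    restored = load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Two practical caveats: load the file with the same scikit-learn version that was used to save it, and only load `joblib` or `pickle` files from sources you trust, since unpickling can execute arbitrary code.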

# Important note

We did absolutely no preprocessing of the data: no normalization, no feature engineering, nothing. In your own projects you will likely process the data substantially before feeding it to a model.

> ***Remember to export your preprocessing pipeline as well, not just the model***. 

Otherwise the model becomes harder to deploy, because new data fed to it must be preprocessed exactly as the training data was. In general, use scikit-learn pipelines to combine the preprocessing with the model; the entire pipeline can then be saved using `joblib` or `pickle`.
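A minimal sketch of that idea, assuming a single `StandardScaler` step as a placeholder for whatever preprocessing you actually need. The synthetic data, the step names `scale` and `rf`, and the file name are all illustrative:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Preprocessing and model bundled into one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),  # placeholder preprocessing step
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipe.fit(X_demo, y_demo)

with TemporaryDirectory() as tmp:
    pipe_path = Path(tmp) / "demo_pipeline.joblib"  # hypothetical file name
    dump(pipe, pipe_path)
    restored_pipe = load(pipe_path)

# New data goes in raw; the restored pipeline applies the fitted scaler itself
raw_new = X_demo[:5]
pipeline_matches = bool(
    (restored_pipe.predict(raw_new) == pipe.predict(raw_new)).all()
)
print(pipeline_matches)
```

Because the fitted scaler travels inside the saved object, the deployed code only calls `predict` on raw input and cannot forget or mismatch a preprocessing step.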