# Lab Three: Extending Logistic Regression
## Caleb Moore, Blake Gebhardt, Christian Gould
dataset: https://www.kaggle.com/datasets/vetrirah/av-healthcare2

In [67]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import datasets

In [68]:
# Notebook setup
from IPython.display import HTML
HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>
To toggle on/off output_stderr, click <a href="javascript:code_toggle_err()">here</a>.''')

# Preparation and Overview (3 points total)
[2 points] Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis? As in previous labs, also detail how good the classifier needs to perform in order to be useful.

Our target will be the "Stay" characteristic. This is the length of stay for the patient
Our use-case task will be to accurately predict the Length of Stay for each patient on case by case basis so that the Hospitals could use this information for the best possible resource allocation and better overall efficiency. 
The length of stay is divided into 11 different classes ranging from 0-10 days to more than 100 days.
This would be most useful deployed so that hospitals could use it anywhere to best estimate these stay durations.
We plan to shoot for around 60% to keep our expectations realistic but also significantly outperform flipping an 11-sided coin.

[.5 points] (mostly the same processes as from previous labs) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). Provide a breakdown of the variables after preprocessing (such as the mean, std, etc. for all variables, including numeric and categorical).

In [69]:
# lets look at the data
df = pd.read_csv('data/train.csv', nrows=10000)
print(df.shape)
df.head()

(10000, 18)


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50


In [70]:
# Lets look at some unique values that might be worth encoding
print(df['Hospital_type_code'].unique())

['c' 'e' 'b' 'a' 'f' 'd' 'g']


In [71]:
print(df['Department'].unique())

['radiotherapy' 'anesthesia' 'gynecology' 'TB & Chest disease' 'surgery']


In [72]:
print(df['Hospital_region_code'].unique())

['Z' 'X' 'Y']


In [73]:
print(df['Ward_Type'].unique())

['R' 'S' 'Q' 'P' 'T']


In [74]:
print(df['Ward_Facility_Code'].unique())

['F' 'E' 'D' 'B' 'A' 'C']


In [75]:
print(df['Bed Grade'].unique())
print('NaNs:', df['Bed Grade'].isna().sum())
print(df['Bed Grade'].value_counts())

[ 2.  3.  4.  1. nan]
NaNs: 2
2.0    3945
3.0    3548
4.0    1793
1.0     712
Name: Bed Grade, dtype: int64


In [76]:
print(df['Type of Admission'].unique())

['Emergency' 'Trauma' 'Urgent']


In [77]:
print(df['Severity of Illness'].unique())

['Extreme' 'Moderate' 'Minor']


In [78]:
print(df['Age'].unique())
print(df['Age'].value_counts())

['51-60' '71-80' '31-40' '41-50' '81-90' '61-70' '21-30' '11-20' '0-10'
 '91-100']
31-40     1879
41-50     1799
71-80     1663
51-60     1413
61-70     1242
21-30     1024
81-90      429
11-20      395
0-10        99
91-100      57
Name: Age, dtype: int64


In [79]:
# lets get rid of any na values since we have enough raw data 
total_with_na = df.shape[0]
df.dropna(inplace=True)
print('dropped', total_with_na - df.shape[0], 'values')

dropped 112 values


In [80]:
# separate the target column
train_stays = df['Stay']
df.drop('Stay', axis=1, inplace=True)

In [81]:
# lets make a map to replace string values
mapping = {     '0-10': 1,
                '11-20': 2,
                '21-30': 3,
                '31-40': 4,
                '41-50': 5,
                '51-60': 6,
                '61-70': 7,
                '71-80': 8,
                '81-90': 9,
                '91-100': 10,

                'Minor': 0,
                'Moderate': 1,
                'Extreme': 2,
                
                'Urgent': 0,
                'Trauma': 1,
                'Emergency': 2,

                'A': 0,
                'B': 1,
                'C': 2,
                'D': 3,
                'E': 4,
                'F': 5,

                'P': 0,
                'Q': 1,
                'R': 2,
                'S': 3,
                'T': 4,

                'X': 0,
                'Y': 1,
                'Z': 2,

                'radiotherapy': 0,
                'anesthesia': 1,
                'gynecology': 2,
                'TB & Chest disease': 3,
                'surgery': 4,

                'a': 0,
                'b': 1,
                'c': 2,
                'd': 3,
                'e': 4,
                'f': 5,
                'g': 6
        }

In [82]:
df = df.replace(mapping)
df

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,1,8,2,3,2,3,0,2,5,2.0,31397,7.0,2,2,2,6,4911.0
1,2,2,2,5,2,2,0,3,5,2.0,31397,7.0,1,2,2,6,5954.0
2,3,10,4,1,0,2,1,3,4,2.0,31397,7.0,1,2,2,6,4745.0
3,4,26,1,2,1,2,0,2,3,2.0,31397,7.0,1,2,2,6,7272.0
4,5,26,1,2,1,2,0,3,3,2.0,31397,7.0,1,2,2,6,5558.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,27,0,7,1,3,2,3,2,3.0,26396,8.0,1,0,2,2,5312.0
9996,9997,28,1,11,0,3,2,2,5,3.0,26396,8.0,1,0,2,2,4843.0
9997,9998,29,0,4,0,3,2,3,5,2.0,26396,8.0,1,0,2,2,5997.0
9998,9999,11,1,2,1,3,2,1,3,4.0,64838,34.0,2,0,4,6,4390.0


In [83]:
# Lets take a look at it now
print(df.shape)
df.head()

(9888, 17)


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,1,8,2,3,2,3,0,2,5,2.0,31397,7.0,2,2,2,6,4911.0
1,2,2,2,5,2,2,0,3,5,2.0,31397,7.0,1,2,2,6,5954.0
2,3,10,4,1,0,2,1,3,4,2.0,31397,7.0,1,2,2,6,4745.0
3,4,26,1,2,1,2,0,2,3,2.0,31397,7.0,1,2,2,6,7272.0
4,5,26,1,2,1,2,0,3,3,2.0,31397,7.0,1,2,2,6,5558.0


[.5 points] Divide your data into training and testing data using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?

In [84]:
# the data was already split for us, so we will practice splitting and validating with the train data since it's adequately large
print('original shapes')
print('X:', df.shape)
print('y:', train_stays.shape)
print()

X_train, X_test, y_train, y_test = train_test_split(df, train_stays, test_size=0.2, random_state=0)

print('train shapes')
print('X:', X_train.shape)
print('y', y_train.shape)
print()

print('test shapes')
print('X:', X_test.shape)
print('y:', y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train.head)
clf.score(X_test, y_test)

original shapes
X: (9888, 17)
y: (9888,)

train shapes
X: (7910, 17)
y (7910,)

test shapes
X: (1978, 17)
y: (1978,)


ValueError: y should be a 1d array, got an array of shape () instead.

In [None]:
X, y = datasets.load_iris(return_X_y=True)
print(X)
y

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# Modeling (5 points total)
The implementation of logistic regression must be written only from the examples given to you by the instructor. No credit will be assigned to teams that copy implementations from another source, regardless of if the code is properly cited. 

[2 points] Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template developed by the instructor in the course. You should add the following functionality to the logistic regression classifier:

* Ability to choose optimization technique when class is instantiated: either steepest ascent, stochastic gradient ascent, and {Newton's method/Quasi Newton methods}. 
* Update the gradient calculation to include a customizable regularization term (either using no regularization, L1 regularization, L2 regularization, or both L1 and L2 regularization). Associate a cost with the regularization term, "C", that can be adjusted when the class is instantiated.

[1.5 points] Train your classifier to achieve good generalization performance. That is, adjust the optimization technique and the value of the regularization term(s) "C" to achieve the best performance on your test set. Visualize the performance of the classifier versus the parameters you investigated. Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?

[1.5 points] Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time and classification performance. Discuss the results. 

# Deployment (1 points total)
* Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?

# Exceptional Work (1 points total)
* You have free reign to provide additional analyses. One idea: Update the code to use either "one-versus-all" or "one-versus-one" extensions of binary to multi-class classification. 