# Week 5 – Support Vector Machines

This notebook covers support vector machines (SVMs), the kernel trick, and regularization for SVMs.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Only the scikit-learn pieces used in this notebook.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
df = pd.read_csv('LengthOfStay.csv')

In [3]:
df.head()

Unnamed: 0,eid,vdate,rcount,gender,dialysisrenalendstage,asthma,irondef,pneum,substancedependence,psychologicaldisordermajor,...,glucose,bloodureanitro,creatinine,bmi,pulse,respiration,secondarydiagnosisnonicd9,discharged,facid,lengthofstay
0,1,8/29/2012,0,F,0,0,0,0,0,0,...,192.476918,12.0,1.390722,30.432418,96,6.5,4,9/1/2012,B,3
1,2,5/26/2012,5+,F,0,0,0,0,0,0,...,94.078507,8.0,0.943164,28.460516,61,6.5,1,6/2/2012,A,7
2,3,9/22/2012,1,F,0,0,0,0,0,0,...,130.530524,12.0,1.06575,28.843812,64,6.5,2,9/25/2012,B,3
3,4,8/9/2012,0,F,0,0,0,0,0,0,...,163.377028,12.0,0.906862,27.959007,76,6.5,1,8/10/2012,A,1
4,5,12/20/2012,0,F,0,0,0,1,0,1,...,94.886654,11.5,1.242854,30.258927,67,5.6,2,12/24/2012,E,4


Drop the columns that will not be used as features: the record ID, the visit and discharge dates, and the facility ID.

In [4]:
df_model = df.drop(columns=['eid', 'vdate', 'discharged', 'facid'])

In [5]:
df_model

Unnamed: 0,rcount,gender,dialysisrenalendstage,asthma,irondef,pneum,substancedependence,psychologicaldisordermajor,depress,psychother,...,neutrophils,sodium,glucose,bloodureanitro,creatinine,bmi,pulse,respiration,secondarydiagnosisnonicd9,lengthofstay
0,0,F,0,0,0,0,0,0,0,0,...,14.20,140.361132,192.476918,12.0,1.390722,30.432418,96,6.5,4,3
1,5+,F,0,0,0,0,0,0,0,0,...,4.10,136.731692,94.078507,8.0,0.943164,28.460516,61,6.5,1,7
2,1,F,0,0,0,0,0,0,0,0,...,8.90,133.058514,130.530524,12.0,1.065750,28.843812,64,6.5,2,3
3,0,F,0,0,0,0,0,0,0,0,...,9.40,138.994023,163.377028,12.0,0.906862,27.959007,76,6.5,1,1
4,0,F,0,0,0,1,0,1,0,0,...,9.05,138.634836,94.886654,11.5,1.242854,30.258927,67,5.6,2,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,3,M,0,0,0,0,0,0,0,0,...,9.30,132.614977,171.422555,12.0,0.650323,30.063069,80,6.5,1,6
99996,0,M,0,0,0,0,0,0,0,0,...,9.30,138.327320,122.342450,12.0,1.521424,28.969548,61,6.5,1,1
99997,1,M,0,0,1,0,0,0,0,0,...,7.70,136.695905,108.288106,12.0,1.025677,26.354919,61,6.9,1,4
99998,0,M,0,0,0,0,0,0,1,0,...,8.20,135.980516,111.750731,16.0,1.035400,29.193462,59,5.6,1,4


Change '5+' to 5 in the 'rcount' column and cast the column to integers.

In [6]:
df_model['rcount'] = df_model['rcount'].replace('5+', 5).astype(int)
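
The conversion above can be checked on a small example. This is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the dataset). Note that mapping '5+' to 5 treats "five or more" as exactly five, which is a simplifying assumption.

```python
import pandas as pd

# Toy stand-in for the 'rcount' column: strings, with '5+' as the capped top bucket.
rcount = pd.Series(['0', '5+', '1', '3'])

# Map the capped bucket to '5', then cast the whole column to integers.
rcount_int = rcount.replace('5+', '5').astype(int)

print(rcount_int.tolist())
```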

Make the 'gender' column binary (1 for 'M', 0 for 'F').

In [7]:
df_model['gender'] = (df_model['gender'] == 'M').astype(int)


In [8]:
df_model['gender']

0        0
1        0
2        0
3        0
4        0
        ..
99995    1
99996    1
99997    1
99998    1
99999    0
Name: gender, Length: 100000, dtype: int64

Assign the target and features, split into training and test sets, scale the features, and fit the model.

In [9]:
X = df_model.drop(columns='gender')
y = df_model['gender']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
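
The manual fit/transform steps above can also be bundled into a single scikit-learn `Pipeline`, which keeps the scaler fitted on training data only and is convenient when cross-validating later. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the LengthOfStay dataset), and the names `X_demo`, `pipe`, and so on are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the LengthOfStay dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() fits the scaler on the training data only, then fits the SVM on the scaled data.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear')),
])
pipe.fit(X_tr, y_tr)

# score() reuses the already-fitted scaler on the test data before predicting.
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```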

In [12]:
model = svm.SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

In [13]:
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6729


In [14]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[9888 1713]
 [4829 3570]]


In [15]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.85      0.75     11601
           1       0.68      0.43      0.52      8399

    accuracy                           0.67     20000
   macro avg       0.67      0.64      0.64     20000
weighted avg       0.67      0.67      0.66     20000
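
As a sanity check, the figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below only uses the four reported counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[9888, 1713],
               [4829, 3570]])
tn, fp, fn, tp = cm.ravel()

acc = (tn + tp) / cm.sum()     # share of all samples classified correctly
recall_0 = tn / (tn + fp)      # share of true class 0 predicted as 0
recall_1 = tp / (tp + fn)      # share of true class 1 predicted as 1
precision_1 = tp / (tp + fp)   # share of predicted class 1 that is truly 1

print(f"accuracy    = {acc:.4f}")
print(f"recall 0    = {recall_0:.2f}")
print(f"recall 1    = {recall_1:.2f}")
print(f"precision 1 = {precision_1:.2f}")
```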



The overall accuracy was 67%, meaning the model classified roughly 2 out of 3 samples correctly. The model had 9888 true negatives and 3570 true positives. The report shows that the model leans toward predicting class 0 (women): recall is 0.85 for class 0 but only 0.43 for class 1. I will change the kernel to 'rbf', add class_weight='balanced', and set the regularization parameter C to 1 to see whether the model improves.
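
Two numbers help put this in context: the accuracy of always predicting the majority class, and the weights that `class_weight='balanced'` assigns. The sketch below uses the test-set supports from the report (11601 and 8399) as illustrative label counts; the training-set counts are not shown in this notebook, so the actual weights used during fitting will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels built from the test-set supports in the report
# (an assumption: the training-set class counts are not shown).
y_demo = np.array([0] * 11601 + [1] * 8399)

# Majority-class baseline: accuracy of always predicting class 0.
baseline = np.mean(y_demo == 0)

# 'balanced' weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_demo)
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(f"Majority baseline accuracy: {baseline:.2f}")
print(f"Balanced weights: {np.round(weights, 3)}")
```

The minority class receives the larger weight, so misclassifying a class 1 sample costs more during training.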

In [16]:
model = svm.SVC(kernel='rbf', C=1, class_weight='balanced')
model.fit(X_train_scaled, y_train)

In [17]:
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6753


In [18]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[8342 3259]
 [3235 5164]]


In [19]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.72      0.72     11601
           1       0.61      0.61      0.61      8399

    accuracy                           0.68     20000
   macro avg       0.67      0.67      0.67     20000
weighted avg       0.68      0.68      0.68     20000
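
Because accuracy barely changes between the two models, per-class recall is a more informative comparison. The sketch below recomputes it from the two confusion matrices reported in this notebook; `per_class_recall` is a small helper written for illustration.

```python
import numpy as np

# Confusion matrices reported above (rows: true class, columns: predicted class).
cm_linear = np.array([[9888, 1713],
                      [4829, 3570]])
cm_rbf = np.array([[8342, 3259],
                   [3235, 5164]])

def per_class_recall(cm):
    """Recall for each true class: diagonal counts divided by row sums."""
    return np.diag(cm) / cm.sum(axis=1)

for name, cm in [('linear', cm_linear), ('rbf + balanced', cm_rbf)]:
    rec = per_class_recall(cm)
    # The mean of the per-class recalls is the balanced accuracy (macro-average recall).
    print(f"{name}: recall={np.round(rec, 2)}, balanced accuracy={rec.mean():.2f}")
```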



The RBF kernel may have helped capture some nonlinear patterns, and accuracy improved slightly over the previous model (0.6729 to 0.6753), but there is still noise or complexity that the model can't fully learn from these features. Balancing the class weights did help capture more true positives (recall for class 1 rose from 0.43 to 0.61), but it lowered the recall for women (class 0) from 0.85 to 0.72. Guessing a person's gender from hospital information alone is a difficult task; taking that into account, I'd say the model is doing pretty well.
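
A possible next step is to tune `C` and `gamma` with cross-validation rather than fixing them by hand. The following is a minimal sketch on synthetic data (an assumption; an RBF SVM on the full 80,000-row training set would be slow to grid-search, so a subsample would likely be needed). The grid values and the `balanced_accuracy` scoring choice are illustrative, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in data (not the LengthOfStay dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.6, 0.4], random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', class_weight='balanced')),
])

# Illustrative grid: smaller C means stronger regularization.
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, scoring='balanced_accuracy', cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV balanced accuracy: {search.best_score_:.3f}")
```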