# [9660] Exercise # 7 - Support Vector Machine
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/healthcare_stroke_2.csv

## Exercise # 7 Requirements
* Load data into dataframe
  * Do NOT use meaningless columns (e.g. 'id') as independent variables
* Examine data
* Use SimpleImputer to replace missing values
* Encode categorical variables
* Standardize independent variables
* Prepare data for model training and testing
  * Separate independent and dependent variables
  * Split train and test sets
* Train support vector classifier model with default hyperparameters
  * Predict using the test set
  * Calculate and display model accuracy
* Train support vector classifier model with at least 2 different hyperparameters
  * Predict using the test set
  * Calculate and display model accuracy

In [None]:
from datetime import datetime
# %D and %T are platform-specific shortcuts; the explicit format below is portable
print(f'Run time: {datetime.now().strftime("%m/%d/%y %H:%M:%S")}')

Run time: 11/14/24 23:16:00


### Import libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

### Load data

In [None]:
# Use column 'id' as the index column
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/healthcare_stroke_2.csv', index_col='id')

### Examine data

In [None]:
df.shape

(5110, 9)

In [None]:
df.head()

id,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67.0,0,1,Yes,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,202.21,,never smoked,1
31112,Male,80.0,0,1,Yes,105.92,32.5,never smoked,1
60182,Female,49.0,0,0,Yes,171.23,34.4,smokes,1
1665,Female,79.0,1,0,Yes,174.12,24.0,never smoked,1


In [None]:
# Review distribution of target (stroke) values
df['stroke'].value_counts()

stroke,count
0,4861
1,249
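
The counts above show a heavily imbalanced target: roughly 1 row in 20 is a stroke case. As a rough reference point for the accuracy figures reported later, the sketch below computes the accuracy of a trivial model that always predicts the majority class. The counts are copied from the output above; the variable names are illustrative and not part of the exercise.

```python
# Class counts copied from the value_counts() output above
n_no_stroke = 4861
n_stroke = 249

total = n_no_stroke + n_stroke

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = n_no_stroke / total

print(f"Stroke rate: {n_stroke / total:.3%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3%}")
```

On the full dataset this baseline is about 95.1%, so an accuracy near that value does not by itself show that a model detects stroke cases.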


In [None]:
# Check for missing values
df.isnull().sum()

column,missing_values
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
avg_glucose_level,0
bmi,201
smoking_status,134
stroke,0


### Prepare data

### Use the SimpleImputer to replace missing values

In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
df[['bmi']] = imp_mean.fit_transform(df[['bmi']])


In [None]:
imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

In [None]:
df[['smoking_status']] = imp_most_freq.fit_transform(df[['smoking_status']])

In [None]:
df.isnull().sum()

column,missing_values
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
avg_glucose_level,0
bmi,0
smoking_status,0
stroke,0
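
To make the two imputation strategies concrete, here is a minimal sketch on a four-row toy frame. The frame `toy` and its values are made up for illustration; only the `SimpleImputer` calls mirror the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "bmi": [20.0, np.nan, 30.0, 25.0],
    "smoking_status": pd.Series(
        ["never smoked", "smokes", np.nan, "never smoked"], dtype=object
    ),
})

# Numeric column: replace NaN with the mean of the observed values (25.0 here)
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[["bmi"]] = mean_imputer.fit_transform(toy[["bmi"]])

# Categorical column: replace NaN with the most frequent observed category
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy[["smoking_status"]] = mode_imputer.fit_transform(toy[["smoking_status"]])

print(toy)
print("Learned fill values:", mean_imputer.statistics_, mode_imputer.statistics_)
```

The fitted `statistics_` attribute holds the fill value each imputer learned, which is a quick way to check what was written into the missing cells.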


### Encode categorical variables 'gender', 'ever_married' and 'smoking_status'

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
df['smoking_status'].unique()

array(['formerly smoked', 'never smoked', 'smokes'], dtype=object)

In [None]:
# Explicit category order: smokes=0, formerly smoked=1, never smoked=2
oe = OrdinalEncoder(categories=[['smokes', 'formerly smoked', 'never smoked']])
df[['smoking_status']] = oe.fit_transform(df[['smoking_status']])

In [None]:
df[['gender']] = OrdinalEncoder().fit_transform(df[['gender']])

In [None]:
df[['ever_married']] = OrdinalEncoder().fit_transform(df[['ever_married']])

In [None]:
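# No-op: 'smoking_status' already holds the codes 0.0, 1.0, 2.0 from the explicit
# encoder above, and a default OrdinalEncoder maps those sorted values to themselves
df[['smoking_status']] = OrdinalEncoder().fit_transform(df[['smoking_status']])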

In [None]:
# Display first few rows of updated dataframe
df.head()

id,gender,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,stroke
9046,1.0,67.0,0,1,1.0,228.69,36.6,1.0,1
51676,0.0,61.0,0,0,1.0,202.21,28.893237,2.0,1
31112,1.0,80.0,0,1,1.0,105.92,32.5,2.0,1
60182,0.0,49.0,0,0,1.0,171.23,34.4,0.0,1
1665,0.0,79.0,1,0,1.0,174.12,24.0,2.0,1
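
The `smoking_status` codes in the table above (0.0 for 'smokes', 1.0 for 'formerly smoked', 2.0 for 'never smoked') follow the explicit `categories` order rather than alphabetical order. The sketch below contrasts the two behaviors on a made-up four-row column; `toy_smoke` and the new column names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column
toy_smoke = pd.DataFrame(
    {"smoking_status": ["never smoked", "smokes", "formerly smoked", "smokes"]}
)

# Explicit order: position in the list becomes the code
explicit_encoder = OrdinalEncoder(
    categories=[["smokes", "formerly smoked", "never smoked"]]
)
toy_smoke["explicit_code"] = explicit_encoder.fit_transform(
    toy_smoke[["smoking_status"]]
).ravel()

# Default: categories are sorted, so string categories are coded alphabetically
default_encoder = OrdinalEncoder()
toy_smoke["default_code"] = default_encoder.fit_transform(
    toy_smoke[["smoking_status"]]
).ravel()

print(toy_smoke)
print("Default category order:", default_encoder.categories_[0].tolist())
```

An explicit order matters when the integer codes are meant to carry a ranking, since the model treats the encoded column as numeric.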


### Separate independent and dependent variables
Dependent variable: 'stroke'

In [None]:
X = df.drop('stroke', axis=1)     # Independent variables
y = df['stroke']                  # Dependent variable

### Standardize independent variables

In [None]:
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

### Split data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y,
                                                    test_size=0.25,
                                                    random_state=42)
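
The cells above follow the order given in the requirements and standardize the full dataset before splitting, so the scaler's mean and variance are computed with the test rows included. A common alternative, sketched here on synthetic data from `make_classification` rather than the stroke dataset, is to split first and let a pipeline fit the scaler on the training rows only. All names ending in `_demo` or starting with `Xd_`/`yd_` are illustrative and do not touch the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (NOT the stroke dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=42
)

# Split first, so the test rows play no part in fitting the scaler
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=42
)

# The pipeline fits StandardScaler on the training rows only,
# then reuses those statistics when scoring the test rows
pipe_demo = make_pipeline(StandardScaler(), SVC(random_state=42))
pipe_demo.fit(Xd_train, yd_train)

demo_accuracy = pipe_demo.score(Xd_test, yd_test)
print(f"Pipeline accuracy on synthetic test set: {demo_accuracy:.3f}")
```

With a pipeline, the same object can also be passed to cross-validation utilities, which then refit the scaler inside each fold.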

### Train SVC model (with default hyperparameters)

In [None]:
classifier = SVC(random_state=42)

In [None]:
classifier.fit(X_train, y_train)

### Evaluate SVC model

In [None]:
model_preds = classifier.predict(X_test)
model_accuracy = accuracy_score(y_test, model_preds)
print(f"SVC (default hyperparameters, RBF kernel) score: {round((model_accuracy * 100), 3)}%")

SVC (default hyperparameters, RBF kernel) score: 95.149%


### Train SVC model (with at least 2 different hyperparameters)

In [None]:
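# Non-default settings: C=0.1, class_weight='balanced', probability=True.
# kernel='rbf' is the default; break_ties has no effect on a binary target.
classifier = SVC(C=0.1, class_weight='balanced', kernel='rbf', probability=True,
                 break_ties=True)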

In [None]:
classifier.fit(X_train, y_train)
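
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so errors on the rare class cost more during training. The sketch below works that formula through using the full-dataset counts shown earlier (4861 and 249). The model above is fit on the training split only, so its actual weights will differ slightly; `y_counts` is a reconstructed label vector used purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector rebuilt from the full-dataset counts (an illustration,
# not the actual training labels)
y_counts = np.array([0] * 4861 + [1] * 249)

# Weights as scikit-learn computes them for class_weight='balanced'
balanced_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

# The same formula written out by hand
manual_weights = len(y_counts) / (2 * np.bincount(y_counts))

print("Weight for class 0 (no stroke):", round(float(balanced_weights[0]), 3))
print("Weight for class 1 (stroke):   ", round(float(balanced_weights[1]), 3))
```

On these counts the stroke class receives a weight roughly 20 times that of the non-stroke class, which pushes the classifier toward predicting more positives.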

### Evaluate updated SVC model

In [None]:
model_preds = classifier.predict(X_test)
model_accuracy = accuracy_score(y_test, model_preds)
print(f"SVC (C=0.1, balanced class weights, RBF kernel) score: {round((model_accuracy * 100), 3)}%")

SVC (C=0.1, balanced class weights, RBF kernel) score: 68.153%
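
Accuracy alone can be hard to interpret on a target this imbalanced: the first model's 95.149% is close to the share of non-stroke rows in the data (4861 of 5110, about 95.1%), while the second model's 68.153% may reflect a trade of overall accuracy for more flagged stroke cases. A confusion matrix and recall make that trade-off visible. The sketch below shows the mechanics on synthetic data from `make_classification`, not on the stroke dataset, so its numbers say nothing about the results above. All names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with about 5% positives (NOT the stroke dataset)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.25, random_state=0
)

candidates = [
    ("default", SVC(random_state=0)),
    ("balanced", SVC(C=0.1, class_weight="balanced", random_state=0)),
]

results = {}
for label, svc in candidates:
    preds = svc.fit(Xs_train, ys_train).predict(Xs_test)
    results[label] = {
        "accuracy": accuracy_score(ys_test, preds),
        # Recall: share of true positives that the model actually flags
        "recall": recall_score(ys_test, preds, zero_division=0),
        # Rows are true classes, columns are predicted classes
        "confusion": confusion_matrix(ys_test, preds, labels=[0, 1]),
    }
    print(label, results[label]["accuracy"], results[label]["recall"])
    print(results[label]["confusion"])
```

Applying the same three metrics to `y_test` and `model_preds` from the cells above would show how many stroke cases each model actually identifies.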
