## Objective

* Demonstrate how a Support Vector Machine (SVM) model can be applied to heart disease prediction
* Evaluate the accuracy of the initial model and of an improved version for comparison

## Work Flow

1. Import libraries, read the heart disease dataset, and split it into training and testing sets at an 80:20 ratio
2. Import the Support Vector Classifier (SVC), which implements SVMs, then fit the model on the training set
3. Predict heart disease status for the test set and report the accuracy of the SVM
4. Improve the model by adding one feature: the maximum heart rate achieved by the individual
5. Check the accuracy of the improved model

## Notes
* An SVM is an algorithm that finds a boundary (a hyperplane; in two dimensions, a line) that separates the data into two classes, chosen to leave the widest possible margin between them
* The demo uses a heart disease dataset from Kaggle: [Heart Disease](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)
* Data Card:
1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
14. target: 0 = no disease; 1 = disease

### Step-1
* Load the dataset
* Split it into training (80%) and testing (20%) sets. The training set is used to find a boundary that separates people with heart disease from those without; the testing set shows how well the model works on people it has not seen before

In [30]:
import pandas as pd
import math

heart = pd.read_csv("heart.csv")

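# Number of rows in the training set (80% of the data, rounded down)
krows = math.floor(heart.shape[0] * 0.8)

# .iloc slices by position and excludes the end point, so no row lands in both sets
training = heart.iloc[:krows]
testing = heart.iloc[krows:]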
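
The split above takes the first 80% of rows in file order. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split`, which shuffles the rows before splitting and can keep the class ratio equal in both parts. The `demo` frame is a synthetic stand-in for `heart.csv` (made-up values, same column names), and `random_state=0` is an arbitrary choice, not something taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv (assumption: made-up values, real column names).
rng = np.random.default_rng(0)
n_rows = 100
demo = pd.DataFrame({
    "age": rng.integers(30, 80, size=n_rows),
    "chol": rng.integers(150, 350, size=n_rows),
    "target": np.tile([0, 1], n_rows // 2),  # 50 rows of each class
})

# Shuffle, hold out 20% for testing, and preserve the class ratio in both parts.
demo_train, demo_test = train_test_split(
    demo, test_size=0.2, random_state=0, stratify=demo["target"]
)

print(demo_train.shape, demo_test.shape)
```

Shuffling matters when the file happens to be ordered by some column; whether that is the case for this dataset is not checked here.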

In [31]:
heart.head()

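|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |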


### Step-2
* The initial model predicts whether or not a patient has heart disease from two features: age and cholesterol level. Older age and higher cholesterol are both associated with higher rates of heart disease
* Prepare the model to be fit to the data. "SVC" stands for "Support Vector Classifier"; it is the class in scikit-learn's svm module that implements SVMs for classification
* After importing the SVC class, we fit the model using the age and chol columns from the training set. The fit method builds the boundary that separates those with heart disease from those without. With the default settings used here, SVC uses an RBF kernel, so this boundary can be curved rather than a straight line

In [32]:
from sklearn.svm import SVC

model = SVC()
model.fit(training[["age", "chol"]], training["target"])
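
SVMs are generally sensitive to feature scale, and here `chol` spans a much wider numeric range than `age`. The sketch below shows one common way to handle that: standardising the features in a pipeline before the SVC. It runs on synthetic data with a made-up label rule (an assumption for illustration only), and it makes no claim about how much scaling would change the accuracy on the heart dataset. The name `scaled_model` is hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the age and chol columns (assumption: made-up values).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(30, 80, size=n)
chol = rng.uniform(150, 350, size=n)
X = np.column_stack([age, chol])
y = (age > 55).astype(int)  # made-up label rule, not a medical fact

# Standardise each feature to zero mean and unit variance, then fit the SVC.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X, y)

# Inspect the fitted classifier: its kernel and the support vectors per class.
svc = scaled_model.named_steps["svc"]
print(svc.kernel)
print(svc.n_support_)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting on the test set.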

### Step-3
* After the model has been fit, we use it to predict the heart disease condition of each person in the test group.
* To evaluate how well the SVM predicts heart disease in the testing set, we calculate the accuracy of the model: the proportion of observations that are predicted correctly, found by comparing the model's predictions to the actual observations in the testing set.

In [33]:
predictions = model.predict(testing[["age", "chol"]])

accuracy = sum(testing["target"] == predictions) / testing.shape[0]

accuracy

0.5609756097560976

* The model has an accuracy of about 56%. It is common for initial models to perform poorly.
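
To make the accuracy calculation concrete, the sketch below computes it by hand on a tiny made-up example, checks it against scikit-learn's `accuracy_score`, and compares it with a majority-class baseline (always predicting the most common label). The arrays `actual` and `predicted` are illustrative values, not results from the heart dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for eight patients (assumption: illustrative values only).
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy by hand: correct predictions divided by total predictions.
manual = (actual == predicted).sum() / len(actual)

# The same quantity from scikit-learn.
library = accuracy_score(actual, predicted)

# Baseline: always predict the most frequent class in `actual`.
majority_class = np.bincount(actual).argmax()
baseline = (actual == majority_class).mean()

print(manual, library, baseline)  # 0.75 0.75 0.5
```

Comparing against such a baseline helps judge whether an accuracy figure is meaningfully better than guessing the most common outcome.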

### Step-4
* We can improve the model by giving it more features, so that it has more information with which to separate those with heart disease from those without. In addition to the age and chol columns, we add the thalach column, which represents the maximum heart rate achieved by the patient.
* We then repeat step-2 and step-3 with the thalach column included.

In [34]:
model = SVC()
model.fit(training[["age", "chol", "thalach"]], training["target"])

predictions = model.predict(testing[["age", "chol", "thalach"]])

accuracy = sum(testing["target"] == predictions) / testing.shape[0]
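
A single train/test split gives one accuracy number per feature set, which can be noisy. As a hedged sketch, the code below compares the two feature sets with 5-fold cross-validation via `cross_val_score`. The `demo` frame is synthetic, and its label is driven by `thalach` purely by construction, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption: made-up values and label rule).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.uniform(30, 80, size=n),
    "chol": rng.uniform(150, 350, size=n),
    "thalach": rng.uniform(90, 200, size=n),
})
demo["target"] = (demo["thalach"] > 150).astype(int)

feature_sets = {
    "age + chol": ["age", "chol"],
    "age + chol + thalach": ["age", "chol", "thalach"],
}

# Mean accuracy over 5 folds for each feature set.
scores = {}
for name, columns in feature_sets.items():
    fold_scores = cross_val_score(SVC(), demo[columns], demo["target"], cv=5)
    scores[name] = fold_scores.mean()
    print(name, round(scores[name], 3))
```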

### Step-5
* Check the accuracy of this improved model to see if it performs better.

In [35]:
accuracy

0.6780487804878049

### Summary
* The improved model has an accuracy of about 68%, up from about 56%: a gain of 12 percentage points, or a relative increase of about 21% ((68 - 56) / 56 * 100).
* We can continue to iterate on the model by adding new features or removing those that do not help.
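
The arithmetic behind the comparison can be checked directly from the two accuracy values printed above. The variable names in this short sketch are illustrative.

```python
# Accuracy values reported by the two models above.
baseline_accuracy = 0.5609756097560976
improved_accuracy = 0.6780487804878049

# Absolute gain (in accuracy) and gain relative to the first model.
absolute_gain = improved_accuracy - baseline_accuracy
relative_gain = absolute_gain / baseline_accuracy

print(f"{absolute_gain * 100:.1f} percentage points")  # 11.7 percentage points
print(f"{relative_gain * 100:.1f}% relative increase")  # 20.9% relative increase
```

Using the unrounded accuracies gives roughly 11.7 percentage points and a 20.9% relative increase, which agrees with the rounded figures in the summary.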