Group 25 Project

Doris (33996984)

Verness Chin (52924784)

Mackenzie Dy (62709126)

Kunyue Liu (94258175)

Title: Heart Failure

In [None]:
%pip install -U scikit-learn
%pip install matplotlib
%pip install ucimlrepo

In [8]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [10]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_failure_clinical_records = fetch_ucirepo(id=519) 
  
# data (as pandas dataframes) 
X = heart_failure_clinical_records.data.features 
y = heart_failure_clinical_records.data.targets 
  
# metadata 
print(heart_failure_clinical_records.metadata) 
  
# variable information 
print(heart_failure_clinical_records.variables) 

{'uci_id': 519, 'name': 'Heart failure clinical records', 'repository_url': 'https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records', 'data_url': 'https://archive.ics.uci.edu/static/public/519/data.csv', 'abstract': 'This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features.', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate'], 'num_instances': 299, 'num_features': 12, 'feature_types': ['Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['death_event'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2020, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5Z89R', 'creators': [], 'intro_paper': {'title': 'Machine learning can predict survival of patients with heart failure from serum creatinine and ejec

In [198]:
url = 'https://archive.ics.uci.edu/static/public/519/data.csv'
data = pd.read_csv(url)
data

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,death_event
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


Method:

Data Preparation
Data Selection: Select only the relevant columns: "Serum creatinine," "Ejection fraction," and "Death event."
Data Splitting: Use the "train_test_split" function to split the data into a training set (75%) and a testing set (25%).

KNN Model Building:
Preprocessing: Create a preprocessor that standardizes the "Serum creatinine" and "Ejection fraction" columns to a mean of 0 and a standard deviation of 1.

Choosing K: To pick an appropriate value for K, the number of nearest neighbors to consider, we apply 5-fold cross-validation. A grid search over candidate K values selects the K with the highest mean cross-validation accuracy.
Training: Build the KNN model with the chosen K, using the standardized "Serum creatinine" and "Ejection fraction" columns as input features and "Death event" as the target variable, then fit it to the training data with the "fit" function.

Model Evaluation and Visualization:
Testing: Use the testing data to evaluate the KNN model's performance by predicting "Death event" from "Serum creatinine" and "Ejection fraction."
Evaluation Metrics: Create a confusion matrix with the crosstab function and calculate accuracy, precision, and recall from its counts. The confusion matrix shows the number of true positives, true negatives, false positives, and false negatives. "ConfusionMatrixDisplay" renders the confusion matrix as a heatmap for easier reading.
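
The evaluation metrics named in the method can be sketched without any libraries. The helper names `confusion_counts` and `summarize` are hypothetical, "Yes" is assumed to be the positive class, and the labels are made up for illustration; they are not results from this project.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, TN, FP, FN, treating `positive` as the class of interest."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def summarize(counts):
    """Derive accuracy, precision, and recall from the four counts."""
    tp, tn, fp, fn = (counts[k] for k in ("tp", "tn", "fp", "fn"))
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Made-up labels for illustration only (not results from this project)
actual    = ["Yes", "Yes", "No", "No",  "No", "Yes", "No", "No"]
predicted = ["Yes", "No",  "No", "Yes", "No", "Yes", "No", "No"]

counts = confusion_counts(actual, predicted)
metrics = summarize(counts)
print(dict(counts), metrics)
```

Precision answers "of the patients predicted to die, how many did?", while recall answers "of the patients who died, how many were flagged?". Accuracy alone does not separate the two.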


In this analysis, we focus on the quantitative variables rather than the boolean ones.

We have also converted the death event to a more readable format ("Yes" and "No").

In [199]:
data["death_event"] = data["death_event"].astype(str).replace({
    '1': 'Yes',
    '0': 'No'
})
data[['creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'death_event']]

Unnamed: 0,creatinine_phosphokinase,ejection_fraction,platelets,serum_creatinine,serum_sodium,death_event
0,582,20,265000.00,1.9,130,Yes
1,7861,38,263358.03,1.1,136,Yes
2,146,20,162000.00,1.3,129,Yes
3,111,20,210000.00,1.9,137,Yes
4,160,20,327000.00,2.7,116,Yes
...,...,...,...,...,...,...
294,61,38,155000.00,1.1,143,No
295,1820,38,270000.00,1.2,139,No
296,2060,60,742000.00,0.8,138,No
297,2413,38,140000.00,1.4,140,No


Next, we split the data into training (75%) and testing (25%) sets.

In [201]:
heart_disease_train, heart_disease_test = train_test_split(data, test_size=0.25, random_state=123)

For our analysis, we want the two variables that best predict a death event. We cannot simply pick a pair arbitrarily, since another pair might predict the death event more accurately. Instead, we compare all possible pairs of the five quantitative variables and determine which one has the highest accuracy.
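
As a rough sketch, the pairwise comparison can be written as one loop over `itertools.combinations`. The helper `score_pairs` is a hypothetical name, the data is random stand-in data rather than the heart failure records, and `StandardScaler` is placed directly in the pipeline as a simplification because every selected column is scaled.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def score_pairs(train, test, features, target, k=6):
    """Fit a scaled KNN on every pair of features; return accuracies, best first."""
    rows = []
    for pair in combinations(features, 2):
        cols = list(pair)
        pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        pipe.fit(train[cols], train[target])
        rows.append({
            "category_1": cols[0],
            "category_2": cols[1],
            "accuracy": pipe.score(test[cols], test[target]),
        })
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False, ignore_index=True)

# Synthetic stand-in data (random numbers, NOT the heart failure records)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
demo["label"] = np.where(demo["a"] + demo["b"] > 0, "Yes", "No")

demo_train, demo_test = train_test_split(demo, test_size=0.25, random_state=123)
pair_scores = score_pairs(demo_train, demo_test, list("abcde"), "label")
print(pair_scores)
```

With five candidate variables the loop visits ten pairs. Each pair gets a fresh pipeline, so no fitted state carries over from one comparison to the next.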

Below, we initialize a dataframe to store all the possible pairs and their accuracies.

In [303]:
compared_categories_dataframe = pd.DataFrame(columns=['category_1', 'category_2', 'accuracy'])

Below, we fit a KNN classifier on the two selected variables, standardizing both so that they are on the same scale. We then predict the death event on the testing set and compute the accuracy of the predictions against the true values.
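
To make the scaling step concrete, here is a small library-free sketch of standardization, z = (x - mean) / standard deviation. The function name `standardize` is hypothetical. The five values are the serum_creatinine entries of the first five rows in the data preview above, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population SD (ddof = 0), as StandardScaler uses
    return [(x - mu) / sigma for x in values]

# serum_creatinine for the first five rows shown in the data preview
serum_creatinine = [1.9, 1.1, 1.3, 1.9, 2.7]
z = standardize(serum_creatinine)

print([round(v, 3) for v in z])
# After standardization the mean is 0 and the standard deviation is 1
# (up to floating-point rounding)
print(round(fmean(z), 6), round(pstdev(z), 6))
```

Without this step, a variable measured in large units such as platelets would dominate the Euclidean distances that KNN relies on.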

### Comparison of Ejection Fraction and Serum Creatinine

In [304]:
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), ["ejection_fraction", "serum_creatinine"]),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)

knn_spec0 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[["ejection_fraction", "serum_creatinine"]]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec0)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[["ejection_fraction", "serum_creatinine"]])
)

X_test = heart_disease_test_predictions[["ejection_fraction", "serum_creatinine"]]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)
heart_disease_prediction_accuracy

0.6933333333333334

In [305]:
new_row = {'category_1' : 'ejection_fraction', 'category_2' : 'serum_creatinine', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333


### Comparison of Ejection Fraction and Platelets

In [306]:
compared_categories = ['ejection_fraction', 'platelets']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec1 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec1)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [307]:
new_row = {'category_1' : 'ejection_fraction', 'category_2' : 'platelets', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6


### Comparison of Ejection Fraction and Serum Sodium

In [308]:
compared_categories = ['ejection_fraction', 'serum_sodium']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec2 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec2)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [309]:
new_row = {'category_1' : 'ejection_fraction', 'category_2' : 'serum_sodium', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64


### Comparison of Creatine Phosphokinase and Serum Creatinine

In [310]:
compared_categories = ['creatinine_phosphokinase', 'serum_creatinine']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec3 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec3)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [311]:
new_row = {'category_1' : 'creatinine_phosphokinase', 'category_2' : 'serum_creatinine', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64


### Comparison of Ejection Fraction and Creatine Phosphokinase

In [312]:
compared_categories = ['creatinine_phosphokinase', 'ejection_fraction']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec4 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec4)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [313]:
new_row = {'category_1' : 'ejection_fraction', 'category_2' : 'creatinine_phosphokinase', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64
4,ejection_fraction,creatinine_phosphokinase,0.573333


### Comparison of Platelets and Serum Creatinine

In [314]:
compared_categories = ['platelets', 'serum_creatinine']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec5 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec5)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [315]:
new_row = {'category_1' : 'platelets', 'category_2' : 'serum_creatinine', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64
4,ejection_fraction,creatinine_phosphokinase,0.573333
5,platelets,serum_creatinine,0.653333


### Comparison of Serum Creatinine and Serum Sodium

In [316]:
compared_categories = ['serum_creatinine', 'serum_sodium']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec6 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)  

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec6)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [317]:
new_row = {'category_1' : 'serum_creatinine', 'category_2' : 'serum_sodium', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64
4,ejection_fraction,creatinine_phosphokinase,0.573333
5,platelets,serum_creatinine,0.653333
6,serum_creatinine,serum_sodium,0.693333


### Comparison of Creatine Phosphokinase and Platelets

In [318]:
compared_categories = ['creatinine_phosphokinase', 'platelets']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec7 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec7)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [319]:
new_row = {'category_1' : 'creatinine_phosphokinase', 'category_2' : 'platelets', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64
4,ejection_fraction,creatinine_phosphokinase,0.573333
5,platelets,serum_creatinine,0.653333
6,serum_creatinine,serum_sodium,0.693333
7,creatinine_phosphokinase,platelets,0.56


### Comparison of Creatinine Phosphokinase and Serum Sodium

In [320]:
compared_categories = ['creatinine_phosphokinase', 'serum_sodium']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec8 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec8)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [321]:
new_row = {'category_1' : 'creatinine_phosphokinase', 'category_2' : 'serum_sodium', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64
4,ejection_fraction,creatinine_phosphokinase,0.573333
5,platelets,serum_creatinine,0.653333
6,serum_creatinine,serum_sodium,0.693333
7,creatinine_phosphokinase,platelets,0.56
8,creatinine_phosphokinase,serum_sodium,0.586667


### Comparison of Platelets and Serum Sodium

In [322]:
compared_categories = ['platelets', 'serum_sodium']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
knn_spec9 = KNeighborsClassifier(n_neighbors=6)
heart_disease_train.columns = heart_disease_train.columns.str.strip()
X = heart_disease_train[compared_categories]
y = heart_disease_train["death_event"]
X = X.astype(float)

heart_disease_fit = make_pipeline(heart_disease_preprocessor, knn_spec9)
heart_disease_fit = heart_disease_fit.fit(X,y)

heart_disease_test_predictions = heart_disease_test.assign(
    predicted = heart_disease_fit.predict(heart_disease_test[compared_categories])
)
X_test = heart_disease_test_predictions[compared_categories]
y_test = heart_disease_test_predictions["death_event"]
heart_disease_prediction_accuracy = heart_disease_fit.score(X_test, y_test)

In [323]:
new_row = {'category_1' : 'platelets', 'category_2' : 'serum_sodium', 'accuracy' : heart_disease_prediction_accuracy }
compared_categories_dataframe.loc[len(compared_categories_dataframe)] = new_row
compared_categories_dataframe

Unnamed: 0,category_1,category_2,accuracy
0,ejection_fraction,serum_creatinine,0.693333
1,ejection_fraction,platelets,0.6
2,ejection_fraction,serum_sodium,0.64
3,creatinine_phosphokinase,serum_creatinine,0.64
4,ejection_fraction,creatinine_phosphokinase,0.573333
5,platelets,serum_creatinine,0.653333
6,serum_creatinine,serum_sodium,0.693333
7,creatinine_phosphokinase,platelets,0.56
8,creatinine_phosphokinase,serum_sodium,0.586667
9,platelets,serum_sodium,0.626667


Having gone through all the possible pairs, we see from the table that Ejection Fraction and Serum Creatinine achieves the highest accuracy (0.693), tied with Serum Creatinine and Serum Sodium. We will use Ejection Fraction and Serum Creatinine to further analyze how the two variables relate to the death event.
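
One way to check the top of the table is to look up the maximum accuracy and list every pair that reaches it. The sketch below copies the ten accuracies from the table above into a plain dictionary, using the dataset's column spellings; the variable names are illustrative.

```python
# Accuracies copied from the comparison table above
pair_accuracy = {
    ("ejection_fraction", "serum_creatinine"): 0.693333,
    ("ejection_fraction", "platelets"): 0.6,
    ("ejection_fraction", "serum_sodium"): 0.64,
    ("creatinine_phosphokinase", "serum_creatinine"): 0.64,
    ("ejection_fraction", "creatinine_phosphokinase"): 0.573333,
    ("platelets", "serum_creatinine"): 0.653333,
    ("serum_creatinine", "serum_sodium"): 0.693333,
    ("creatinine_phosphokinase", "platelets"): 0.56,
    ("creatinine_phosphokinase", "serum_sodium"): 0.586667,
    ("platelets", "serum_sodium"): 0.626667,
}

# Keep every pair that reaches the maximum, so ties are not silently dropped
best = max(pair_accuracy.values())
top_pairs = [pair for pair, acc in pair_accuracy.items() if acc == best]
print(best, top_pairs)
```

On the notebook's own dataframe, something like `compared_categories_dataframe.nlargest(1, "accuracy", keep="all")` should give the same view, since `keep="all"` retains tied rows.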

### Visualization of planned analysis

Below is a chart that visualizes the relationship between Ejection Fraction and Serum Creatinine and how they both relate to the death event.

In [324]:
plot = alt.Chart(data).mark_point(opacity = 0.5).encode(
    x = alt.X('ejection_fraction', title = "Ejection Fraction in percentage (%)"),
    y = alt.Y('serum_creatinine', title = "Serum Creatinine in mg/dL"),
    color = alt.Color('death_event', title = 'Death event')
)
plot

### Tuning the model

Below, we estimate the prediction accuracy of our classifier by performing 5-fold cross-validation.

In [325]:
np.random.seed(2020)

compared_categories = ['ejection_fraction', 'serum_creatinine']
heart_disease_preprocessor = make_column_transformer(
     (StandardScaler(), compared_categories),
     remainder= 'passthrough',
     verbose_feature_names_out=False,
)
X = heart_disease_train[compared_categories]
y = heart_disease_train['death_event']
X = X.astype(float)

heart_disease_pipe = make_pipeline(heart_disease_preprocessor, knn_spec0)
heart_disease_vfold_score = pd.DataFrame(
     cross_validate(
         estimator=heart_disease_pipe,
         cv=5,
         X=X,
         y=y,
         return_train_score=True,
     )
)

heart_disease_metrics = heart_disease_vfold_score.agg(["mean", "sem"])
heart_disease_metrics

Unnamed: 0,fit_time,score_time,test_score,train_score
mean,0.005752,0.007678,0.785556,0.8136
sem,0.00291,0.002237,0.017177,0.005594


As shown above, the mean cross-validation accuracy of our classifier at K = 6 is about 79%. Let us see whether a different number of neighbors, K, does better.
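
For intuition about what 5-fold cross-validation does with the training rows, here is a minimal sketch of a plain, unshuffled k-fold split. The function name `k_fold_indices` and the 12-row size are illustrative assumptions. For a classifier and an integer `cv`, scikit-learn uses a stratified variant that also balances the classes across folds, which this sketch does not attempt.

```python
def k_fold_indices(n_rows, n_folds=5):
    """Yield (train_idx, valid_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over early folds
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size

# 12 rows is an arbitrary illustrative size
folds = list(k_fold_indices(12, n_folds=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)
```

Each row is held out for validation exactly once. The reported `test_score` is the accuracy on the held-out fold, averaged over the five folds.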

Below, we create a parameter grid to search for the optimal value of K.

In [326]:
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 50, 1),
}
heart_disease_tune_pipe = make_pipeline(heart_disease_preprocessor, KNeighborsClassifier())

knn_tune_grid = GridSearchCV(
    estimator = heart_disease_tune_pipe,
    param_grid = param_grid,
    cv = 5,
    return_train_score=True,
    n_jobs=-1
)

In [327]:
knn_model_grid = knn_tune_grid.fit(X, y)
accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)

In [328]:
accuracy_versus_k_grid = alt.Chart(accuracies_grid).mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors", title = "N_neighbors"),
     y=alt.Y("mean_test_score", title = "Mean Test Score").scale(zero=False)
).properties(
    height=500,
    width=800
)
accuracy_versus_k_grid
# optimal is 36 and 6
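
Rather than reading the optimum off the chart, the fitted search object can report it directly through its `best_params_` and `best_score_` attributes. The sketch below runs on synthetic two-feature data from `make_classification`, not the heart failure records, and every name prefixed with `demo` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (NOT the heart failure records)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

demo_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
demo_search = GridSearchCV(
    estimator=demo_pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 50)},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Best K and its mean cross-validated accuracy, read from the fitted search
best_k = demo_search.best_params_["kneighborsclassifier__n_neighbors"]
best_cv_accuracy = demo_search.best_score_
print(best_k, round(best_cv_accuracy, 3))
```

In this notebook the equivalent lookups would be `knn_model_grid.best_params_` and `knn_model_grid.best_score_`.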

In [330]:
np.random.seed(2000)

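# Grid search over K = 1..29 using KNeighborsClassifier directly
# (the features are not standardized in this search)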
param_grid = {'n_neighbors': range(1, 30, 1)}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, return_train_score=True, n_jobs=-1)
grid_search.fit(X, y)
cv_results_df = pd.DataFrame(grid_search.cv_results_)

cross_val_plot = alt.Chart(cv_results_df).mark_line(point=True).encode(
    x = alt.X("param_n_neighbors", title="Number of Neighbors (k)"),
    y = alt.Y("mean_test_score", title ="Accuracy").scale(zero=False)
)
cross_val_plot

In [27]:
# Optimal K is 6

In [None]:
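from sklearn.metrics import ConfusionMatrixDisplay

# heart_disease_fit and X_test were last assigned in the platelets/serum_sodium
# comparison, so refit the chosen pair (ejection_fraction, serum_creatinine) at K = 6.
final_fit = make_pipeline(heart_disease_preprocessor, KNeighborsClassifier(n_neighbors=6)).fit(X, y)

ConfusionMatrixDisplay.from_estimator(
    final_fit,
    heart_disease_test[compared_categories],
    heart_disease_test["death_event"],
)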