# [9660] Homework 3 - KNN
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/cardiovascular_disease_adults_60K.csv

## Homework Submission Rules (for all homework assignments)
* Homework is due by 6:05 PM on the due date
  * No late submission will be accepted
* You must submit a cleanly executed notebook (*.ipynb)
  * Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework not complying with naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework 3 Requirements
* Load data
  * Do NOT use meaningless columns (e.g. 'id') as independent variables
* Identify missing values and use SimpleImputer to replace missing values
* Ordinal Encode independent variables: 'smoker', 'alcohol_drinker', 'physically_active', 'cholesterol' and 'glucose'
  * From a health perspective:
    * It is better to NOT BE a 'smoker', NOT BE an 'alcohol_drinker', and TO BE 'physically_active'
    * For 'cholesterol' and 'glucose', 'average' is better than 'above_average', which is better than 'high'
* Dummy (one-hot) encode independent variable: 'gender'
* Label encode dependent variable: 'cardiovascular_disease'
* Separate independent and dependent variables
* Standardize independent variables
* Split data into training and test sets
* Train KNeighborsClassifier (with default hyperparameters)
* Calculate accuracy for KNeighborsClassifier (with default hyperparameters)
* Re-train KNeighborsClassifier (change n_neighbors hyperparameter and at least one other hyperparameter)
  * NOTE: The objective of changing these hyperparameters is to improve model accuracy
    * If you used hyperparameter random_state in your initial model training, do NOT change this value during model retrainings
    * Do NOT re-split training and test sets during model retrainings
* Calculate accuracy for re-trained KNeighborsClassifier (with updated hyperparameters)

In [None]:
from datetime import datetime
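# %D and %T are platform-specific shortcuts, so the format is spelled out for portability
print(f'Run time: {datetime.now().strftime("%m/%d/%y %H:%M:%S")}')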

Run time: 11/06/24 14:27:09


### Import libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

### Load data

Risk Factors for Cardiovascular Heart Disease

age : Age of participant (integer)  
gender : Gender of participant (string - Male, Female)  
height : Height measured in centimeters (integer)  
weight : Weight measured in kilograms (integer)  
systolic_bp : Systolic blood pressure reading taken from patient (integer)  
diastolic_bp : Diastolic blood pressure reading taken from patient (integer)  
cholesterol : Total cholesterol level (string - average, above_average, high)  
glucose : Glucose level (string - average, above_average, high)  
smoker : Whether person smokes or not (string - N, Y)  
alcohol_drinker : Whether person drinks alcohol or not (string - NO, YES)  
physically_active : Whether person is physically active or not (string - no, yes)  
cardiovascular_disease : Whether person suffers from cardiovascular disease or not (string - No, Yes)

In [None]:
# Read cardiovascular_disease_adults_60K.csv into dataframe
#  NOTES:
#   Field separator is '|'
#   Use column 'id' as index_col
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/cardiovascular_disease_adults_60K.csv', sep='|', index_col='id')


### Examine data

In [None]:
df.shape

(60000, 12)

In [None]:
df.head()

| id | age | gender | height | weight | systolic_bp | diastolic_bp | cholesterol | glucose | smoker | alcohol_drinker | physically_active | cardiovascular_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | Male | 168.0 | 62.0 | 110 | 80 | average | average | N | NO | yes | No |
| 1 | 55.0 | Female | 156.0 | 85.0 | 140 | 90 | high | average | N | NO | yes | Yes |
| 2 | 51.0 | Female | 165.0 | 64.0 | 130 | 70 | high | average | N | NO | no | Yes |
| 3 | 48.0 | Male | 169.0 | 82.0 | 150 | 100 | average | average | N | NO | yes | Yes |
| 4 | 47.0 | Female | 156.0 | 56.0 | 100 | 60 | average | average | N | NO | no | No |


### Prepare data for model training

#### Use the SimpleImputer to replace missing values

In [None]:
# Check for missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| age | 139 |
| gender | 167 |
| height | 229 |
| weight | 74 |
| systolic_bp | 0 |
| diastolic_bp | 0 |
| cholesterol | 195 |
| glucose | 0 |
| smoker | 84 |
| alcohol_drinker | 0 |


In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
df[['age', 'height', 'weight']] = imp_mean.fit_transform(df[['age', 'height', 'weight']])

In [None]:
imp_most_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

In [None]:
df[['gender', 'cholesterol', 'smoker']] = imp_most_freq.fit_transform(df[['gender', 'cholesterol', 'smoker']])
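
The two strategies used above can be checked on a tiny frame. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are assumptions, not rows from the homework file): `strategy='mean'` fills a numeric gap with the column average, and `strategy='most_frequent'` fills a categorical gap with the mode.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from the homework file
toy = pd.DataFrame({
    "age": [50.0, np.nan, 70.0],
    "smoker": pd.Series(["N", "N", np.nan], dtype=object),
})

# Numeric column: replace NaN with the mean of the observed values (50 and 70)
toy[["age"]] = SimpleImputer(strategy="mean").fit_transform(toy[["age"]])

# Categorical column: replace NaN with the most frequent observed value ('N')
toy[["smoker"]] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["smoker"]])

print(toy)
```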

#### Check for missing values again

In [None]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| age | 0 |
| gender | 0 |
| height | 0 |
| weight | 0 |
| systolic_bp | 0 |
| diastolic_bp | 0 |
| cholesterol | 0 |
| glucose | 0 |
| smoker | 0 |
| alcohol_drinker | 0 |


#### Ordinal Encode 'smoker', 'alcohol_drinker', 'physically_active', 'cholesterol' and 'glucose'

In [None]:
df['smoker'].unique()

array(['N', 'Y'], dtype=object)

In [None]:
df['alcohol_drinker'].unique()

array(['NO', 'YES'], dtype=object)

In [None]:
df['physically_active'].unique()

array(['yes', 'no'], dtype=object)

In [None]:
df['cholesterol'].unique()

array(['average', 'high', 'above_average'], dtype=object)

In [None]:
df['glucose'].unique()

array(['average', 'above_average', 'high'], dtype=object)

In [None]:
df[['smoker']] = OrdinalEncoder().fit_transform(df[['smoker']])

In [None]:
df[['alcohol_drinker']] = OrdinalEncoder().fit_transform(df[['alcohol_drinker']])

In [None]:
df[['physically_active']] = OrdinalEncoder().fit_transform(df[['physically_active']])

In [None]:
df[['cholesterol']] = OrdinalEncoder().fit_transform(df[['cholesterol']])

In [None]:
df[['glucose']] = OrdinalEncoder().fit_transform(df[['glucose']])

In [None]:
df['smoker'].unique()

array([0., 1.])

In [None]:
df['alcohol_drinker'].unique()

array([0., 1.])

In [None]:
df['physically_active'].unique()

array([1., 0.])

In [None]:
df['cholesterol'].unique()

array([1., 2., 0.])

In [None]:
df['glucose'].unique()

array([1., 0., 2.])
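
With default settings, `OrdinalEncoder` sorts the categories alphabetically, which is why the outputs above map 'above_average' to 0, 'average' to 1 and 'high' to 2. The requirements describe a health ordering instead ('average' is better than 'above_average', which is better than 'high'). One way to encode that ordering is the `categories` parameter; the following is a minimal sketch on a hypothetical three-row frame, not the notebook's own code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with one row per level
toy = pd.DataFrame({"cholesterol": ["average", "high", "above_average"]})

# Default behavior: categories are sorted alphabetically
default_codes = OrdinalEncoder().fit_transform(toy[["cholesterol"]]).ravel()

# Explicit order, best to worst from a health perspective
health_order = [["average", "above_average", "high"]]
ordered_codes = (
    OrdinalEncoder(categories=health_order)
    .fit_transform(toy[["cholesterol"]])
    .ravel()
)

print(default_codes)  # average=1, high=2, above_average=0
print(ordered_codes)  # average=0, high=2, above_average=1
```

The same parameter accepts one list per column, so the binary columns could be ordered the same way (for example `['yes', 'no']` for 'physically_active', if "better" is meant to map to the lower code; that direction is an assumption).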

In [None]:
# Display first few rows of updated dataframe
df.head()

| id | age | gender | height | weight | systolic_bp | diastolic_bp | cholesterol | glucose | smoker | alcohol_drinker | physically_active | cardiovascular_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | Male | 168.0 | 62.0 | 110 | 80 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | No |
| 1 | 55.0 | Female | 156.0 | 85.0 | 140 | 90 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | Yes |
| 2 | 51.0 | Female | 165.0 | 64.0 | 130 | 70 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | Yes |
| 3 | 48.0 | Male | 169.0 | 82.0 | 150 | 100 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | Yes |
| 4 | 47.0 | Female | 156.0 | 56.0 | 100 | 60 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | No |


#### Dummy (one-hot) encode gender

In [None]:
df = pd.get_dummies(df, columns=['gender'], dtype=int)

In [None]:
# Display first few rows of updated dataframe
df.head()

| id | age | height | weight | systolic_bp | diastolic_bp | cholesterol | glucose | smoker | alcohol_drinker | physically_active | cardiovascular_disease | gender_Female | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | 168.0 | 62.0 | 110 | 80 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | No | 0 | 1 |
| 1 | 55.0 | 156.0 | 85.0 | 140 | 90 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | Yes | 1 | 0 |
| 2 | 51.0 | 165.0 | 64.0 | 130 | 70 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | Yes | 1 | 0 |
| 3 | 48.0 | 169.0 | 82.0 | 150 | 100 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | Yes | 0 | 1 |
| 4 | 47.0 | 156.0 | 56.0 | 100 | 60 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | No | 1 | 0 |


#### Label encode target variable 'cardiovascular_disease'

In [None]:
le = LabelEncoder()
df['cardiovascular_disease'] = le.fit_transform(df['cardiovascular_disease'])

In [None]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [None]:
le.transform(list(le.classes_))

array([0, 1])

In [None]:
# Display first few rows of updated dataframe
df.head()

| id | age | height | weight | systolic_bp | diastolic_bp | cholesterol | glucose | smoker | alcohol_drinker | physically_active | cardiovascular_disease | gender_Female | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | 168.0 | 62.0 | 110 | 80 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 1 |
| 1 | 55.0 | 156.0 | 85.0 | 140 | 90 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1 | 1 | 0 |
| 2 | 51.0 | 165.0 | 64.0 | 130 | 70 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 |
| 3 | 48.0 | 169.0 | 82.0 | 150 | 100 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1 | 0 | 1 |
| 4 | 47.0 | 156.0 | 56.0 | 100 | 60 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 |


### Separate independent and dependent variables
* Independent variables: All remaining variables except 'cardiovascular_disease'
* Dependent variable: 'cardiovascular_disease'

In [None]:
X = df.drop('cardiovascular_disease', axis=1)
X.head()

| id | age | height | weight | systolic_bp | diastolic_bp | cholesterol | glucose | smoker | alcohol_drinker | physically_active | gender_Female | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | 168.0 | 62.0 | 110 | 80 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 1 |
| 1 | 55.0 | 156.0 | 85.0 | 140 | 90 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1 | 0 |
| 2 | 51.0 | 165.0 | 64.0 | 130 | 70 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 0 |
| 3 | 48.0 | 169.0 | 82.0 | 150 | 100 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 1 |
| 4 | 47.0 | 156.0 | 56.0 | 100 | 60 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 0 |


In [None]:
y = df['cardiovascular_disease']
y.head()

| id | cardiovascular_disease |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |


### Standardize independent variables

In [None]:
cols_to_standardize = ['age', 'height', 'weight', 'systolic_bp', 'diastolic_bp', 'glucose', 'cholesterol', 'smoker', 'alcohol_drinker', 'physically_active', 'gender_Female', 'gender_Male']

scaler = StandardScaler()

X[cols_to_standardize] = scaler.fit_transform(X[cols_to_standardize])

### Split data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.2, random_state=42)
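
The notebook standardizes before splitting, following the order of the requirements list, so the scaler's mean and standard deviation are computed from rows that later land in the test set. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that variant on synthetic data; `make_classification` and all `*_demo` names are assumptions for illustration and are not part of the homework.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the homework features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Split first, using the same arguments as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler_demo = StandardScaler().fit(X_tr)
X_tr_std = scaler_demo.transform(X_tr)
X_te_std = scaler_demo.transform(X_te)

# Training columns now have mean ~0; test columns are close to 0 but not exact
print(X_tr_std.mean(axis=0).round(6))
print(X_te_std.mean(axis=0).round(3))
```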

### Train KNeighborsClassifier (with default hyperparameters)


In [None]:
knn = KNeighborsClassifier()
knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

In [None]:
knn.fit(X_train, y_train)

### Evaluate performance for KNeighborsClassifier (with default hyperparameters)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
# Print model accuracy score
accuracy_score_1 = accuracy_score(y_test, y_pred)
print(f"Accuracy = {round((accuracy_score_1 * 100), 4)}%")

Accuracy = 64.2583%
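
Accuracy here is the fraction of test rows whose predicted label equals the true label. A minimal pure-Python sketch of that calculation follows; the `accuracy` helper and the five demo labels are hypothetical and only illustrate what `accuracy_score` computes for this kind of input.

```python
def accuracy(y_true, y_pred):
    """Return the share of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: three of the five predictions match
demo_true = [0, 1, 1, 1, 0]
demo_pred = [0, 1, 0, 1, 1]
demo_accuracy = accuracy(demo_true, demo_pred)

print(f"Accuracy = {round(demo_accuracy * 100, 4)}%")
```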


### Train KNeighborsClassifier (change n_neighbors hyperparameter and at least one other hyperparameter)
NOTE: The objective of changing these hyperparameters is to improve model accuracy

In [None]:
knn = KNeighborsClassifier(n_neighbors=6, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)
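
Of the three arguments above, `n_neighbors` and `weights` differ from the defaults printed by `get_params()`. `metric='euclidean'` measures the same distance as the default `'minkowski'` with `p=2`. The effect of `weights` can be seen in a small voting sketch: with `'uniform'` every neighbor counts once, while with `'distance'` each neighbor counts as 1/distance. The `knn_vote` helper and the demo distances are hypothetical, and the sketch assumes strictly positive distances (scikit-learn handles zero distances separately).

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs of the k nearest points."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            totals[label] += 1.0             # every neighbor has equal say
        else:
            totals[label] += 1.0 / distance  # closer neighbors count more
    return max(totals, key=totals.get)

# Hypothetical neighborhood: one very close point of class 1,
# two more distant points of class 0
demo_neighbors = [(0.5, 1), (2.0, 0), (2.5, 0)]

uniform_label = knn_vote(demo_neighbors, "uniform")    # 2 votes vs 1
distance_label = knn_vote(demo_neighbors, "distance")  # 0.9 vs 2.0

print(uniform_label, distance_label)
```

Distance weighting also makes tied votes less likely when `n_neighbors` is even, as it is here.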

### Evaluate performance for KNeighborsClassifier (with updated hyperparameters)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
# Print model accuracy score for the re-trained model
accuracy_score_2 = accuracy_score(y_test, y_pred)
print(f"Accuracy = {round((accuracy_score_2 * 100), 4)}%")

Accuracy = 64.475%
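
Rather than trying hyperparameter values one at a time, candidate settings can be compared with cross-validation on the training set alone, so the test set is only used for the final score and the original split stays untouched. This is a minimal sketch on synthetic data; the grid values, `make_classification` and the `*_demo` names are assumptions, not settings known to work best on the homework data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the standardized homework data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Hypothetical grid: n_neighbors plus two other hyperparameters
param_grid = {
    "n_neighbors": [5, 15, 25],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# 5-fold cross-validation on the training rows only
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_accuracy = search.score(X_te, y_te)  # refit best model, scored once on test

print(best_params)
print(f"Accuracy = {round(test_accuracy * 100, 4)}%")
```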
