# Title #

## Introduction ##
provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
clearly state the question you tried to answer with your project
identify and describe the dataset that was used to answer the question

## Methods and Results ##
describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.
your report should include code which:
loads data from the original source on the web 
wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis
performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
performs the data analysis
creates a visualization of the analysis 
note: all tables and figures should have a figure/table number and a legend

## Discussion ##
summarize what you found
discuss whether this is what you expected to find
discuss what impact such findings could have
discuss what future questions this could lead to

## References ##
At least 2 citations of literature relevant to the project (format is your choice, just be consistent across the references).
Make sure to cite the source of your data as well.


## Methods and Results ##
We start by importing the packages we need, setting a random seed for reproducibility, and loading the data.

In [78]:
# Import the packages
import pandas as pd
import numpy as np
import altair as alt

# Set the seed
np.random.seed(1)

# Load the data
heart_data = pd.read_csv("heart.csv")
heart_data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


The dataset contains the following columns:
1. 'age': Age of the patient
2. 'sex': Sex of the patient (0 = female, 1 = male)
3. 'cp': Chest pain type (4 types, coded 0-3)
4. 'trestbps': Resting blood pressure (in mm Hg)
5. 'chol': Serum cholesterol in mg/dl
6. 'fbs': Whether fasting blood sugar is > 120 mg/dl (1 = true, 0 = false)
7. 'restecg': Resting electrocardiographic results (values 0, 1, 2)
8. 'thalach': Maximum heart rate achieved
9. 'exang': Exercise-induced angina (1 = yes, 0 = no)
10. 'oldpeak': ST depression induced by exercise relative to rest
11. 'slope': Slope of the peak exercise ST segment
12. 'ca': Number of major vessels (0-3) colored by fluoroscopy
13. 'thal': 0 = normal; 1 = fixed defect; 2 = reversible defect
14. 'target': 0 or 1
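
As a quick sanity check, one could confirm that a loaded file really has the 14 columns listed above. The sketch below is illustrative only: it reads two sample rows from an in-memory string instead of `heart.csv`, and `missing_columns` is a hypothetical helper, not part of pandas.

```python
import io
import pandas as pd

# The 14 column names listed above.
EXPECTED_COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# Two sample rows standing in for heart.csv (values taken from the preview above).
sample_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "52,1,0,125,212,0,1,168,0,1.0,2,2,3,0\n"
    "62,0,0,138,294,1,1,106,0,1.9,1,3,2,0\n"
)
sample = pd.read_csv(sample_csv)


def missing_columns(df, expected):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [name for name in expected if name not in df.columns]


missing = missing_columns(sample, EXPECTED_COLUMNS)
print(missing)
```

An empty list means every expected column is present; any name that appears in the list points to a column that would need attention before the analysis.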

In [63]:
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 312 entries, 4 to 1023
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       312 non-null    int64  
 1   sex       312 non-null    int64  
 2   cp        312 non-null    int64  
 3   trestbps  312 non-null    int64  
 4   chol      312 non-null    int64  
 5   fbs       312 non-null    int64  
 6   restecg   312 non-null    int64  
 7   thalach   312 non-null    int64  
 8   exang     312 non-null    int64  
 9   oldpeak   312 non-null    float64
 10  slope     312 non-null    int64  
 11  ca        312 non-null    int64  
 12  thal      312 non-null    int64  
 13  target    312 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 36.6 KB


To improve readability, we will replace the values 0 and 1 in the fasting blood sugar column, "fbs", with descriptive labels and convert the column to a categorical type.

In [69]:
heart_data['fbs'] = heart_data['fbs'].replace({
     0 : 'Fasting Blood Sugar < 120 mg/dl',
     1 : 'Fasting Blood Sugar > 120 mg/dl',
}).astype('category')
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 312 entries, 4 to 1023
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       312 non-null    int64   
 1   sex       312 non-null    int64   
 2   cp        312 non-null    int64   
 3   trestbps  312 non-null    int64   
 4   chol      312 non-null    int64   
 5   fbs       312 non-null    category
 6   restecg   312 non-null    int64   
 7   thalach   312 non-null    int64   
 8   exang     312 non-null    int64   
 9   oldpeak   312 non-null    float64 
 10  slope     312 non-null    int64   
 11  ca        312 non-null    int64   
 12  thal      312 non-null    int64   
 13  target    312 non-null    int64   
dtypes: category(1), float64(1), int64(12)
memory usage: 34.6 KB


In [70]:
heart_data['fbs'].unique()

['Fasting Blood Sugar > 120 mg/dl', 'Fasting Blood Sugar < 120 mg/dl']
Categories (2, object): ['Fasting Blood Sugar < 120 mg/dl', 'Fasting Blood Sugar > 120 mg/dl']
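
The relabeling step can be tried in isolation on a small example. The sketch below uses `Series.map` rather than `replace`, which is an alternative way to apply the same dictionary; the four input values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the "fbs" column (0 = false, 1 = true).
fbs = pd.Series([0, 1, 0, 0], name="fbs")

labels = {
    0: "Fasting Blood Sugar < 120 mg/dl",
    1: "Fasting Blood Sugar > 120 mg/dl",
}

# Map the integer codes to labels, then store them as a categorical column.
fbs_labeled = fbs.map(labels).astype("category")

print(list(fbs_labeled.cat.categories))
print(fbs_labeled.value_counts().to_dict())
```

One difference to keep in mind: `map` turns any value that is missing from the dictionary into a missing value, whereas `replace` leaves unmatched values unchanged.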

Next, we will explore our dataset. To do so, we will use 'value_counts' to count the occurrences of each fasting blood sugar category.

In [71]:
heart_data['fbs'].value_counts()

Fasting Blood Sugar < 120 mg/dl    270
Fasting Blood Sugar > 120 mg/dl     42
Name: fbs, dtype: int64

Now we start cleaning the data. Since we intend to study only female patients (coded as 0 in the "sex" column), we filter the dataset to keep only those rows.

In [72]:
heart_data = heart_data[heart_data["sex"] == 0]
heart_data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
4,62,0,0,138,294,Fasting Blood Sugar > 120 mg/dl,1,106,0,1.9,1,3,2,0
5,58,0,0,100,248,Fasting Blood Sugar < 120 mg/dl,0,122,0,1.0,1,0,2,1
10,71,0,0,112,149,Fasting Blood Sugar < 120 mg/dl,1,125,0,1.6,1,0,2,1
11,43,0,0,132,341,Fasting Blood Sugar > 120 mg/dl,0,136,1,3.0,1,0,3,0
12,34,0,1,118,210,Fasting Blood Sugar < 120 mg/dl,1,192,0,0.7,2,0,2,1


Next, we see that the data relevant to us is stored in the columns "age", "chol" and "fbs". Hence, we will create another dataset that contains only those columns.


In [74]:
heart_data_2 = heart_data.loc[:, ["age","chol","fbs"]]
heart_data_2.head()

Unnamed: 0,age,chol,fbs
4,62,294,Fasting Blood Sugar > 120 mg/dl
5,58,248,Fasting Blood Sugar < 120 mg/dl
10,71,149,Fasting Blood Sugar < 120 mg/dl
11,43,341,Fasting Blood Sugar > 120 mg/dl
12,34,210,Fasting Blood Sugar < 120 mg/dl


Then, we will drop any row that contains a missing value, just in case.

In [75]:
heart_data_3 = heart_data_2.dropna(axis=0)
heart_data_3.head()

Unnamed: 0,age,chol,fbs
4,62,294,Fasting Blood Sugar > 120 mg/dl
5,58,248,Fasting Blood Sugar < 120 mg/dl
10,71,149,Fasting Blood Sugar < 120 mg/dl
11,43,341,Fasting Blood Sugar > 120 mg/dl
12,34,210,Fasting Blood Sugar < 120 mg/dl
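
The three wrangling steps above (filter the rows, select the columns, drop missing values) can also be written as one chained expression. This is a minimal sketch on made-up data, with one missing cholesterol value added on purpose so that `dropna` has something to remove.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart data; one "chol" value is deliberately missing.
toy = pd.DataFrame({
    "age": [62, 58, 52, 43],
    "sex": [0, 0, 1, 0],
    "chol": [294.0, 248.0, 212.0, np.nan],
    "fbs": [1, 0, 0, 1],
})

females = (
    toy[toy["sex"] == 0]              # keep rows for female patients only
    .loc[:, ["age", "chol", "fbs"]]   # keep the columns used in the analysis
    .dropna(axis=0)                   # drop rows with any missing value
)

print(females)
```

Chaining avoids the intermediate names, at the cost of not being able to inspect each step separately.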


We can now plot the cleaned data, with cholesterol on the x-axis, age on the y-axis, and points colored by fasting blood sugar category.

In [77]:


heart_data_3_category=alt.Chart(heart_data_3).mark_circle().encode(
    x=alt.X(
        "chol",
        title=["Cholesterol","(in mg/dl)"],
        scale=alt.Scale(zero=False),
        axis=alt.Axis(tickCount=7)
    ),
    y=alt.Y(
        "age",
        title=["Age", "(Female Patient)"],
        scale=alt.Scale(zero=False),
        axis=alt.Axis(tickCount=7)
    ),
    color="fbs"
).configure_axis(titleFontSize=12)

heart_data_3_category



Next, we split the data into training and testing sets, using 75% of the rows for training.

In [89]:
from sklearn.model_selection import train_test_split

heart_data_3_train, heart_data_3_test = train_test_split(
    heart_data_3, train_size=0.75
)
heart_data_3_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 377 to 751
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   age     234 non-null    int64   
 1   chol    234 non-null    int64   
 2   fbs     234 non-null    category
dtypes: category(1), int64(2)
memory usage: 5.8 KB


In [82]:
heart_data_3_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 78 entries, 320 to 536
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   age     78 non-null     int64   
 1   chol    78 non-null     int64   
 2   fbs     78 non-null     category
dtypes: category(1), int64(2)
memory usage: 2.0 KB
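
Because one class is much rarer than the other, it can help to control how the split treats the classes. The sketch below is an illustration on made-up data: it passes `stratify` and `random_state` to `train_test_split`, two arguments that the split above does not use, so treat it as an optional variation rather than the method used in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% of which belong to the minority class.
toy = pd.DataFrame({
    "age": range(100),
    "fbs": ["high"] * 20 + ["low"] * 80,
})

# random_state fixes the shuffle; stratify keeps class proportions in both sets.
# Neither argument is used in the split above; both are illustrative choices here.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["fbs"], random_state=1
)

print(toy_train["fbs"].value_counts().to_dict())
print(toy_test["fbs"].value_counts().to_dict())
```

With stratification, the 20% minority share is preserved in both the training and the testing rows.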


In [83]:
heart_data_3_train["fbs"].value_counts(normalize=True)

Fasting Blood Sugar < 120 mg/dl    0.867521
Fasting Blood Sugar > 120 mg/dl    0.132479
Name: fbs, dtype: float64
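
These proportions give a useful reference point: a rule that always predicts the majority label would already be right about 87% of the time on the training set. The short calculation below makes that explicit. The class counts are inferred from the proportions above and the 234 training rows, so they are an assumption rather than printed output.

```python
# Class counts in the training set, inferred from the proportions above
# (203 / 234 = 0.867521 and 31 / 234 = 0.132479); treat them as an assumption.
train_counts = {
    "Fasting Blood Sugar < 120 mg/dl": 203,
    "Fasting Blood Sugar > 120 mg/dl": 31,
}

total = sum(train_counts.values())
majority_label = max(train_counts, key=train_counts.get)

# Accuracy on the training set of a rule that always predicts the majority label.
baseline_accuracy = train_counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 6))
```

Any accuracy reported for a classifier on this data is easier to interpret when compared against this baseline.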

In [84]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

heart_data_3_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "chol"]),
)
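
`StandardScaler` rescales each column so that it has mean 0 and standard deviation 1, which keeps a variable with a large range (such as cholesterol) from dominating the distance calculation. The arithmetic can be sketched by hand as follows, using only the five cholesterol values from the preview shown earlier; in the pipeline, the mean and standard deviation are learned from whatever data the pipeline is fitted on.

```python
from statistics import fmean, pstdev

# Cholesterol values (mg/dl) from the first five rows shown earlier.
chol = [294, 248, 149, 341, 210]

mean = fmean(chol)
std = pstdev(chol)  # population standard deviation (divides by n)

# Standardize: subtract the mean, then divide by the standard deviation.
chol_scaled = [(value - mean) / std for value in chol]

print([round(z, 3) for z in chol_scaled])
```

After scaling, values above the mean are positive, values below it are negative, and the units (mg/dl) no longer matter.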

Now we train a K-nearest neighbors classifier with K = 3 on the training set.

In [100]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

knn = KNeighborsClassifier(n_neighbors=3) 

X = heart_data_3_train.loc[:, ["age", "chol"]]
y = heart_data_3_train["fbs"]

knn_fit = make_pipeline(heart_data_3_preprocessor, knn).fit(X, y)

knn_fit
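
To make clear what the classifier does, here is a minimal from-scratch sketch of K-nearest neighbors prediction: find the K training points closest to a new point and take a majority vote over their labels. The points and labels are hypothetical, `knn_predict` is an illustrative helper, and this is not how scikit-learn implements the search internally.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical standardized (age, chol) points and labels, for illustration only.
points = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.7), (1.0, 1.2), (0.9, 0.8), (1.1, 1.0)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, query=(-1.0, -1.0)))
print(knn_predict(points, labels, query=(1.0, 1.0)))
```

Because the vote depends on distances, the predictions are sensitive to the scale of each variable, which is why the pipeline standardizes the predictors first.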

Next, we predict the labels in the test set.

In [101]:
heart_data_3_test_predictions = heart_data_3_test.assign(
    predicted = knn_fit.predict(heart_data_3_test.loc[:, ["age", "chol"]])
)
heart_data_3_test_predictions[['fbs', 'predicted']]

Unnamed: 0,fbs,predicted
329,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
286,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
560,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
964,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
759,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
...,...,...
481,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
722,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl
612,Fasting Blood Sugar > 120 mg/dl,Fasting Blood Sugar > 120 mg/dl
263,Fasting Blood Sugar < 120 mg/dl,Fasting Blood Sugar < 120 mg/dl


In [96]:
from sklearn.model_selection import cross_validate

# Estimate accuracy with 10-fold cross-validation on the training set
knn = KNeighborsClassifier(n_neighbors=3)
heart_data_3_pipe = make_pipeline(heart_data_3_preprocessor, knn)
X = heart_data_3_train.loc[:, ["age", "chol"]]
y = heart_data_3_train["fbs"]
cv_10_df = pd.DataFrame(
    cross_validate(
        estimator=heart_data_3_pipe,
        cv=10,
        X=X,
        y=y
    )
)

cv_10_df

Unnamed: 0,fit_time,score_time,test_score
0,0.007565,0.007608,0.875
1,0.01059,0.006804,0.916667
2,0.006803,0.007399,0.916667
3,0.007291,0.006236,0.875
4,0.006719,0.019594,0.913043
5,0.006207,0.006103,0.956522
6,0.006669,0.005947,0.956522
7,0.008796,0.005597,0.956522
8,0.006052,0.005256,1.0
9,0.006116,0.005357,0.913043
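
The ten fold scores are usually summarized by their mean, which serves as the cross-validation estimate of accuracy. A minimal sketch, with the values copied from the test_score column above:

```python
from statistics import fmean

# Fold accuracies copied from the test_score column above.
fold_scores = [
    0.875, 0.916667, 0.916667, 0.875, 0.913043,
    0.956522, 0.956522, 0.956522, 1.0, 0.913043,
]

mean_accuracy = fmean(fold_scores)
print(round(mean_accuracy, 4), min(fold_scores), max(fold_scores))
```

The spread between the lowest and highest fold is a reminder that a single fold, or a single train/test split, can give a noticeably different number from the average.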


Next, we compute the accuracy of the predictions on the test set.

In [102]:
correct_preds = heart_data_3_test_predictions[
    heart_data_3_test_predictions['fbs'] == heart_data_3_test_predictions['predicted']
]

correct_preds.shape[0] / heart_data_3_test_predictions.shape[0]

0.9743589743589743
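
With 78 test rows, this value corresponds to 76 correct predictions (76 / 78 ≈ 0.974). Since the classes are imbalanced, it can also be informative to count which kinds of mistakes occur. The sketch below shows both calculations on hypothetical labels; they are not the actual test-set predictions, and `accuracy` is an illustrative helper.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels for illustration; not the actual test-set predictions.
truth = ["low", "low", "high", "low", "high", "low"]
preds = ["low", "low", "high", "low", "low", "low"]

# Count each (true, predicted) pair to see which kind of error occurs.
confusion = Counter(zip(truth, preds))

print(round(accuracy(truth, preds), 3))
print(confusion)
```

In this toy example the only error is a "high" case predicted as "low", a detail that the accuracy figure alone would hide.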

### Parameter value selection ###

In [104]:
knn = KNeighborsClassifier()
heart_data_3_tune_pipe = make_pipeline(heart_data_3_preprocessor, knn)
heart_data_3_tune_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                    ['age', 'chol'])])),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                  ['age', 'chol'])]),
 'kneighborsclassifier': KNeighborsClassifier(),
 'columntransformer__n_jobs': None,
 'columntransformer__remainder': 'drop',
 'columntransformer__sparse_threshold': 0.3,
 'columntransformer__transformer_weights': None,
 'columntransformer__transformers': [('standardscaler',
   StandardScaler(),
   ['age', 'chol'])],
 'columntransformer__verbose': False,
 'columntransformer__verbose_feature_names_out': True,
 'columntransformer__standardscaler': StandardScaler(),
 'columntransformer__standardscaler__copy': True,
 'columntransformer__standardscaler__with_mean': True,
 'columntransformer__standardsc

In [108]:
from sklearn.model_selection import GridSearchCV

# Candidate values of K: 1, 6, 11, ..., 96
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 5),
}

heart_data_3_tune_grid = GridSearchCV(
    estimator=heart_data_3_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

# Fit on the training set and collect the cross-validation results
accuracies_grid = pd.DataFrame(
    heart_data_3_tune_grid.fit(
        heart_data_3_train.loc[:, ["age", "chol"]],
        heart_data_3_train["fbs"]
    ).cv_results_
)

accuracies_grid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 19 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   mean_fit_time                            20 non-null     float64
 1   std_fit_time                             20 non-null     float64
 2   mean_score_time                          20 non-null     float64
 3   std_score_time                           20 non-null     float64
 4   param_kneighborsclassifier__n_neighbors  20 non-null     object 
 5   params                                   20 non-null     object 
 6   split0_test_score                        20 non-null     float64
 7   split1_test_score                        20 non-null     float64
 8   split2_test_score                        20 non-null     float64
 9   split3_test_score                        20 non-null     float64
 10  split4_test_score                        20 non-null

In [109]:
# Keep K and the mean accuracy, and convert the standard deviation across the
# 10 folds into a standard error of the mean (std / sqrt(10))
accuracies_grid = (
    accuracies_grid[
        ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
    ]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)
accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.995652,0.004125
1,6,0.872645,0.021924
2,11,0.838043,0.018311
3,16,0.842029,0.015745
4,21,0.825,0.010773
5,26,0.837681,0.014253
6,31,0.837862,0.009699
7,36,0.850725,0.005993
8,41,0.850725,0.005993
9,46,0.850725,0.005993
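
The sem_test_score column is the standard deviation of the fold accuracies divided by the square root of the number of folds. The sketch below works through that arithmetic. The fold scores are hypothetical: they were chosen to be consistent with the n_neighbors = 1 row above, assuming the standard deviation divides by the number of folds, and are not taken from the actual output.

```python
from math import sqrt
from statistics import fmean, pstdev

# Hypothetical fold accuracies: nine folds at 1.0 and one at 22/23. They are chosen
# to be consistent with the n_neighbors = 1 row above, not taken from the output.
fold_scores = [1.0] * 9 + [22 / 23]

mean_score = fmean(fold_scores)
std_score = pstdev(fold_scores)                 # spread across folds (divides by n)
sem_score = std_score / sqrt(len(fold_scores))  # standard error of the mean

print(round(mean_score, 6), round(sem_score, 6))
```

A small standard error means the fold accuracies agree closely with each other; it does not by itself say how the model will behave on new data.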


Now we can plot the estimated accuracy against the number of neighbors, K, to pick the best value.

In [115]:
accuracy_vs_k = (
    alt.Chart(accuracies_grid)
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "n_neighbors",
            title="Neighbors",
            scale=alt.Scale(zero=False)
        ),
        y=alt.Y(
            "mean_test_score",
            title="Accuracy estimate",
            scale=alt.Scale(zero=False),
        ),
    )
)

accuracy_vs_k
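
Besides reading the best value off the plot, it can be picked programmatically. The sketch below does this for the ten rows of the results table shown earlier; the full grid has 20 rows, and the ones not displayed are not included here, so the result only reflects the displayed rows.

```python
# (n_neighbors, mean_test_score) pairs copied from the ten rows shown earlier.
# The full grid has 20 rows; the remaining ones are not shown and are not included.
results = [
    (1, 0.995652), (6, 0.872645), (11, 0.838043), (16, 0.842029), (21, 0.825),
    (26, 0.837681), (31, 0.837862), (36, 0.850725), (41, 0.850725), (46, 0.850725),
]

# Pick the pair with the highest estimated accuracy.
best_k, best_score = max(results, key=lambda pair: pair[1])

print(best_k, best_score)
```

A fitted `GridSearchCV` object also stores the best setting in its `best_params_` attribute, which avoids copying values by hand.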