### Your name:

<pre>Alikhan Bulatov</pre>




## Dataset
A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa is available [at this URL](https://raw.githubusercontent.com/tofighi/MachineLearning/master/datasets/heart.csv). These data are taken from a larger dataset, described in Rousseauw et al., 1983, South African Medical Journal.

Below is a description of the variables:
1. **sbp**: systolic blood pressure
2. **tobacco**: cumulative tobacco (kg)
3. **ldl**: low-density lipoprotein cholesterol
4. **adiposity**
5. **famhist**: family history of heart disease (Present, Absent)
6. **typea**: type-A behavior
7. **obesity**
8. **alcohol**: current alcohol consumption
9. **age**: age at onset
10. **chd**: coronary heart disease



In [0]:
import numpy as np
import pandas as pd

#load csv file
df = pd.read_csv('https://raw.githubusercontent.com/tofighi/MachineLearning/master/datasets/heart.csv')

## Classification

Our goal is to classify coronary heart disease (`chd`), which has two classes (0 and 1), using several classifiers trained on all nine features (1 to 9).



## EDA


* EDA (using numerical and visual exploratory data analysis, answer the following questions):
  * What percentage of the samples falls in each class (0 and 1)?
  * How many missing values are there?
  * How many of the features are categorical?


In [2]:
### Your code here
df.head(10)             # Explore the dataset
df.info()               # Check how many categorical variables are there
df.isnull().sum().sum() # Calculate the number of missing values
df['chd'].value_counts(normalize=True) * 100  # Calculate the percentage of each class



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 11 columns):
row.names    462 non-null int64
sbp          462 non-null int64
tobacco      462 non-null float64
ldl          462 non-null float64
adiposity    462 non-null float64
famhist      462 non-null object
typea        462 non-null int64
obesity      462 non-null float64
alcohol      462 non-null float64
age          462 non-null int64
chd          462 non-null int64
dtypes: float64(5), int64(5), object(1)
memory usage: 39.8+ KB


0    65.367965
1    34.632035
Name: chd, dtype: float64

<pre>Your answers to questions here </pre>
* About 65.4% of the people in the sample belong to class 0 and about 34.6% to class 1 of `chd`.
* The dataset has no missing values.
* There is one categorical variable, `famhist`.
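The three answers above can be computed directly rather than read off the printed output. The following is a minimal sketch on a hypothetical toy frame (the values are made up and are not taken from the dataset); the names `toy`, `class_pct`, `n_missing`, and `categorical_cols` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame standing in for the heart dataset (made-up values)
toy = pd.DataFrame({
    "sbp": [160, 144, 118, 170],
    "famhist": ["Present", "Absent", "Present", "Absent"],
    "chd": [1, 1, 0, 0],
})

# Percentage of each class in the target column
class_pct = toy["chd"].value_counts(normalize=True) * 100

# Total number of missing values across all columns
n_missing = int(toy.isnull().sum().sum())

# Columns whose dtype is not numeric are treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(class_pct.round(1).to_dict())
print(n_missing)
print(categorical_cols)
```

Applied to the real `df`, the same three expressions should give the class percentages, the missing-value count, and the list of categorical columns.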


## Preprocessing
* Encode the categorical variable
* Normalize all the other feature columns
 



In [0]:
from sklearn.preprocessing import StandardScaler

# convert categorical columns and drop the row names column

df_converted = pd.get_dummies(df, prefix=['famhist'], drop_first=True).drop('row.names', axis=1)

# normalize: scaling is applied by the StandardScaler step inside each pipeline,
# so the scaler is fit on training data only and never sees the target column

# split data into X (features) and y (target)
X = df_converted.drop(columns='chd')
y = df_converted[['chd']]
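A minimal sketch of the two preprocessing steps, encoding and scaling, on a hypothetical toy frame (made-up values, not the dataset). It assumes the scaler should be fit on the training split only, so that test rows do not influence the scaling statistics; the `dtype=int` argument is an illustrative choice to keep the dummy column numeric.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame (made-up values)
toy = pd.DataFrame({
    "sbp": [160, 144, 118, 170, 134, 132, 142, 114],
    "famhist": ["Present", "Absent", "Present", "Present",
                "Absent", "Present", "Absent", "Present"],
    "chd": [1, 1, 0, 1, 1, 0, 0, 1],
})

# One-hot encode famhist; drop_first keeps a single indicator column
encoded = pd.get_dummies(toy, columns=["famhist"], drop_first=True, dtype=int)
X_toy = encoded.drop(columns="chd")
y_toy = encoded["chd"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Putting `StandardScaler` inside a `Pipeline` achieves the same effect automatically within each cross-validation fold.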

# Grid Search

- Build a full data pipeline, as covered in the course, using `Pipeline()` from the scikit-learn API
- Set the random seed to 123 (for splitting and any other randomized algorithm)
- Split the data into training (80%) and testing (20%) sets
- Use F1 as the metric for comparing classifiers
- Use these classifiers:
    - Logistic Regression
        - All default parameters
    - Random Forest
        - tune only: n_estimators: {4, 5, 10, 20, 50}
    - KNN Classifier
        - tune only: n_neighbors: all odd numbers between 3 and 33
    - SVM
        - [A practical guide to SVM Classification](http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) gives a rule of thumb for tuning hyperparameters, particularly on page 5:

        We recommend a "grid-search" on 𝐶 and 𝛾 using cross-validation. Various pairs of (𝐶,𝛾) values are tried and the one with the best cross-validation accuracy is picked. We found that trying exponentially growing sequences of 𝐶 and 𝛾 is a practical method to identify good parameters. For example use the following values for GridSearchCV:

$$
C = 2^{-5}, 2^{-3}, \ldots, 2^{15}; \quad \gamma = 2^{-15}, 2^{-13}, \ldots, 2^{3}
$$

- Cross-validation with 5 folds
- Other hyperparameters: use the defaults
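The search spaces listed above can be written out programmatically instead of by hand. This is a sketch only: the `SVM__`, `KNN__`, and `RF__` prefixes assume pipeline steps with those names, and the variable names `svm_grid`, `knn_grid`, and `rf_grid` are illustrative.

```python
import numpy as np

# Exponentially growing sequences for C and gamma, in steps of 2 in the exponent
svm_grid = {
    "SVM__C": (2.0 ** np.arange(-5, 16, 2)).tolist(),       # 2^-5, 2^-3, ..., 2^15
    "SVM__gamma": (2.0 ** np.arange(-15, 4, 2)).tolist(),   # 2^-15, 2^-13, ..., 2^3
}

# All odd numbers between 3 and 33, inclusive
knn_grid = {"KNN__n_neighbors": list(range(3, 34, 2))}

# Fixed candidate list for the random forest
rf_grid = {"RF__n_estimators": [4, 5, 10, 20, 50]}

print(len(svm_grid["SVM__C"]), len(svm_grid["SVM__gamma"]))
print(knn_grid["KNN__n_neighbors"])
```

Each dictionary can be passed as the `param_grid` argument of `GridSearchCV`. The full SVM grid has 11 × 10 combinations, so it takes noticeably longer to search than a small hand-picked grid.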


### Which classifier, with which hyperparameters, performs best in cross-validation?



In [4]:
# to make this notebook's output stable across runs (reproducible results)
np.random.seed(123)
#Your code here
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=123)

# Setup the pipeline steps
steps_svm = [('scaler', StandardScaler()), ('SVM', SVC())]
steps_reg = [('scaler', StandardScaler()), ('logreg', LogisticRegression())]
steps_rf = [('scaler', StandardScaler()), ('RF',RandomForestClassifier())]
steps_knn = [('scaler', StandardScaler()), ('KNN',KNeighborsClassifier())]
  
# Create the pipelines
pipeline_svm = Pipeline(steps_svm)
pipeline_reg = Pipeline(steps_reg)
pipeline_rf = Pipeline(steps_rf)
pipeline_knn = Pipeline(steps_knn)

## SVM
# Specify the hyperparameter space for SVM
parameters_svm = {'SVM__C':[1, 10, 100,1000], 'SVM__gamma':[0.1, 0.01]}
# Instantiate the GridSearchCV object
cv_svm = GridSearchCV(pipeline_svm,parameters_svm,cv=5)
# Fit to the training set
cv_svm.fit(X_train,y_train.values.ravel())
# Predict the labels of the test set
y_pred_svm = cv_svm.predict(X_test)
# Compute and print metrics
print("SVM")
print("F1 Score: {}".format(f1_score(y_test, y_pred_svm)))
print("Tuned Model Parameters: {} \n".format(cv_svm.best_params_))

## Logistic Regression
# Fit it to the training set (default parameters, no grid search)
pipeline_reg.fit(X_train,y_train.values.ravel())
# Predict the labels of the test set
y_pred_reg = pipeline_reg.predict(X_test)
# Compute and print metrics
print("Logistic Regression")
print("F1 Score: {} \n".format(f1_score(y_test, y_pred_reg)))

## Random Forest
# Specify the hyperparameter space for rf
parameters_rf = {'RF__n_estimators':[4, 5, 10, 20, 50]}
# Instantiate the GridSearchCV object
cv_rf = GridSearchCV(pipeline_rf,parameters_rf,cv=5)
# Fit to the training set
cv_rf.fit(X_train,y_train.values.ravel())
# Predict the labels of the test set
y_pred_rf = cv_rf.predict(X_test)
# Compute and print metrics
print("Random Forest")
print("F1 Score: {}".format(f1_score(y_test, y_pred_rf)))
print("Tuned Model Parameters: {} \n".format(cv_rf.best_params_))


## KNN
# Specify the hyperparameter space for KNN
parameters_knn = {'KNN__n_neighbors':[3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33]}
# Instantiate the GridSearchCV object
cv_knn = GridSearchCV(pipeline_knn,parameters_knn,cv=5)
# Fit to the training set
cv_knn.fit(X_train,y_train.values.ravel())
# Predict the labels of the test set
y_pred_knn = cv_knn.predict(X_test)
# Compute and print metrics
print("KNN")
print("F1 Score: {}".format(f1_score(y_test, y_pred_knn)))
print("Tuned Model Parameters: {} \n".format(cv_knn.best_params_))


SVM
F1 Score: 0.6229508196721312
Tuned Model Parameters: {'SVM__C': 1, 'SVM__gamma': 0.1} 

Logistic Regression
F1 Score: 0.65625 

Random Forest
F1 Score: 0.45614035087719296
Tuned Model Parameters: {'RF__n_estimators': 20} 

KNN
F1 Score: 0.4912280701754386
Tuned Model Parameters: {'KNN__n_neighbors': 19} 



<pre>The Logistic Regression model with default parameters is the best model, with the highest test-set F1 score (about 0.656) </pre>
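To compare classifiers on cross-validated F1 rather than on the test set, one option is to pass `scoring="f1"` to `GridSearchCV` and read `best_score_` from each search. The sketch below is an assumption-laden illustration, not the notebook's run: it uses synthetic data from `make_classification`, only two candidate models, and a shortened KNN grid, and the names `candidates` and `cv_f1` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with nine features (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Each candidate: (estimator, parameter grid for the "clf" pipeline step)
candidates = {
    "logreg": (LogisticRegression(), {"clf__C": [1.0]}),  # default C only, no tuning
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [3, 5, 7]}),
}

cv_f1 = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", estimator)])
    # scoring="f1" makes the search rank parameter settings by F1 instead of accuracy
    search = GridSearchCV(pipe, grid, scoring="f1", cv=5)
    search.fit(X_tr, y_tr)
    cv_f1[name] = search.best_score_  # mean F1 across the 5 folds

best_name = max(cv_f1, key=cv_f1.get)
print(cv_f1, best_name)
```

Selecting the model on `best_score_` keeps the test set untouched until the final evaluation, so the reported test F1 is not biased by the model choice.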

# Classification Reports 


Show classification report and confusion matrix **just for the best model you found** during Grid Search. 


In [5]:
#Your code here
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print(classification_report(y_test, y_pred_reg))
print(confusion_matrix(y_test, y_pred_reg))

              precision    recall  f1-score   support

           0       0.81      0.83      0.82        60
           1       0.68      0.64      0.66        33

    accuracy                           0.76        93
   macro avg       0.74      0.73      0.74        93
weighted avg       0.76      0.76      0.76        93

[[50 10]
 [12 21]]


# Conclusions

<pre>Summarize and explain your results and choices here </pre>

The best model found is the logistic regression model. For this problem, predicting whether a person has coronary heart disease, correctly identifying people who have the disease matters more than correctly identifying people who do not. <br><br>
The confusion matrix above shows 12 false negatives and 21 true positives, which is the best result among all the models. Some models were better at predicting true negatives, but that is not the main purpose of this model.
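The counts quoted here can be tied back to the reported scores with a short calculation. This sketch takes the confusion matrix printed above as its only input and assumes scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the classification report section:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[50, 10],
               [12, 21]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), f1)
```

Recall is the quantity most directly affected by false negatives, which is why it is the natural number to watch when missing a diseased patient is the costlier error.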