# Question 1: Predicting Heart Disease
## Fabio Carrasco

#### Read the data file “Heart_s.csv” from GitHub using the following command, and assign it to a Pandas DataFrame:

In [227]:

import pandas as pd

In [228]:
heart_df = pd.read_csv("https://github.com/mpourhoma/CS4661/raw/master/Heart_s.csv")

#### Check out the dataset. As you see, the dataset contains a number of features including both contextual and biological factors (e.g. age, gender, vital signs, …). The last column “AHD” is the label with “Yes” meaning that a human subject has Heart Disease, and “No” meaning that the subject does not have Heart Disease.

In [229]:
heart_df[0::10]

,Age,Gender,ChestPain,RestBP,Chol,RestECG,MaxHR,Oldpeak,Thal,AHD
0,63,f,typical,145,233,2,150,2.3,fixed,No
10,57,f,asymptomatic,140,192,0,148,0.4,fixed,No
20,64,f,typical,110,211,2,144,1.8,normal,No
30,69,m,typical,140,239,0,151,1.8,normal,No
40,65,m,asymptomatic,150,225,2,114,1.0,reversable,Yes
50,41,m,nontypical,105,198,0,168,0.0,normal,No
60,51,m,asymptomatic,130,305,0,142,1.2,reversable,Yes
70,65,m,nonanginal,155,269,0,148,0.8,normal,No
80,45,f,asymptomatic,104,208,2,148,3.0,normal,No
90,62,m,asymptomatic,160,164,2,145,6.2,reversable,Yes


#### As you see, there are at least 3 categorical features in the dataset (Gender, ChestPain, Thal). Let’s ignore these categorical features for now: keep only the numerical features and build your feature matrix and label vector.

In [230]:
feature_cols = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak']

X = heart_df[feature_cols]

y = heart_df['AHD']

X.head()

,Age,RestBP,Chol,RestECG,MaxHR,Oldpeak
0,63,145,233,2,150,2.3
1,67,160,286,2,108,1.5
2,67,120,229,2,129,2.6
3,37,130,250,0,187,3.5
4,41,130,204,2,172,1.4
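
The numeric columns above were listed by hand. As a minimal sketch, assuming a small made-up frame with a similar column layout (the rows in `toy_df` are illustrative and not taken from the dataset), `DataFrame.select_dtypes` could pick them out automatically:

```python
import pandas as pd

# Hypothetical toy frame mirroring a few of the dataset's columns.
toy_df = pd.DataFrame({
    "Age": [63, 57, 64],
    "Gender": ["f", "f", "m"],
    "Chol": [233, 192, 211],
    "Oldpeak": [2.3, 0.4, 1.8],
    "AHD": ["No", "No", "Yes"],
})

# Keep only integer/float columns; string columns (including the label) are excluded.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
X_toy = toy_df[numeric_cols]
y_toy = toy_df["AHD"]
print(numeric_cols)
```

This only works as intended when categorical features are stored as strings; a category coded as integers (as RestECG appears to be) would still be treated as numeric.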


#### Split the dataset into testing and training sets with the following parameters: test_size=0.25, random_state=6.

In [231]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=6)
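
To illustrate what the two parameters do, here is a small sketch on made-up data (`X_demo` and `y_demo` are hypothetical, not the heart dataset). `test_size=0.25` holds out a quarter of the rows, and a fixed `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, binary labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Splitting twice with the same random_state yields the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=6)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=6)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 15 training rows and 5 test rows
```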

#### Use KNN (with k=3), Decision Tree (with random_state=5), and Logistic Regression classifiers to predict Heart Disease based on the training/testing datasets that you built in part (d). Note that this random_state is passed when you define the decision tree classifier; it is different from the random_state that you used to split the data in part (d). Then check, compare, and report the accuracy of these 3 classifiers. Which one is the best? Which one is the worst?

In [232]:
# Import the LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier classes
from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier

from sklearn.neighbors import KNeighborsClassifier

In [233]:
#find knn prediction
k = 3
knn = KNeighborsClassifier(n_neighbors=k) 
knn.fit(X_train,y_train)
y_predict_knn = knn.predict(X_test)

#find logreg prediction
my_logreg = LogisticRegression()
my_logreg.fit(X_train, y_train)
y_predict_lr = my_logreg.predict(X_test)

#find decisiontree prediction
my_decisiontree = DecisionTreeClassifier(random_state=5)
my_decisiontree.fit(X_train, y_train)
y_predict_dt = my_decisiontree.predict(X_test)

In [234]:
from sklearn.metrics import accuracy_score
#Find accuracy 
score_lr = accuracy_score(y_test, y_predict_lr)
score_dt = accuracy_score(y_test, y_predict_dt)
score_knn = accuracy_score(y_test, y_predict_knn)

print(score_lr)
print(score_dt)
print(score_knn)

0.6710526315789473
0.618421052631579
0.6447368421052632


#### Logistic Regression has the best accuracy (about 0.671) and Decision Tree has the worst (about 0.618); KNN falls in between (about 0.645).
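
For reference, the accuracy reported above is simply the fraction of test samples whose predicted label matches the true label. A minimal sketch on made-up labels (the arrays below are hypothetical, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five subjects.
y_true = np.array(["Yes", "No", "No", "Yes", "No"])
y_pred = np.array(["Yes", "No", "Yes", "Yes", "Yes"])

# Fraction of positions where prediction and truth agree (3 of 5 here).
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)
print(manual, library)
```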

#### Now, we want to use the categorical features as well! To this end, we have to perform a feature engineering process called one-hot encoding for the categorical features. Each categorical feature is replaced with dummy columns in the feature table (one column for each possible value of the feature), encoded in a binary manner so that exactly one of the dummy columns takes “1” at a time (and the rest take “0”). For example, “Gender” can take two values, “m” and “f”, so we replace this feature with 2 columns titled “m” and “f”. For a male subject, we put “1” in column “m” and “0” in column “f”; for a female subject, we put “0” in column “m” and “1” in column “f”. (Hint: you will need 4 columns to encode “ChestPain” and 3 columns to encode “Thal”.)
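
The encoding described above can be sketched on a tiny made-up frame (the frame `toy` is hypothetical). `pd.get_dummies` names the new columns with the original feature as a prefix, so the result has `Gender_f` and `Gender_m` rather than bare `f` and `m`; `dtype=int` is passed here so the dummies print as 0/1 regardless of pandas version:

```python
import pandas as pd

# Hypothetical three-row frame with one categorical feature.
toy = pd.DataFrame({"Age": [63, 69, 45], "Gender": ["f", "m", "f"]})

# Replace "Gender" with one 0/1 column per observed value.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded)
```

Exactly one of the two dummy columns is 1 in each row, which is the "one-hot" property.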

In [235]:
ohe_heart_df = pd.get_dummies(data=heart_df, columns=['Gender', 'ChestPain','Thal'])
ohe_X = ohe_heart_df.loc[:, ohe_heart_df.columns != 'AHD']
print(ohe_X[::10])

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  Gender_f  Gender_m  \
0     63     145   233        2    150      2.3         1         0   
10    57     140   192        0    148      0.4         1         0   
20    64     110   211        2    144      1.8         1         0   
30    69     140   239        0    151      1.8         0         1   
40    65     150   225        2    114      1.0         0         1   
50    41     105   198        0    168      0.0         0         1   
60    51     130   305        0    142      1.2         0         1   
70    65     155   269        0    148      0.8         0         1   
80    45     104   208        2    148      3.0         1         0   
90    62     160   164        2    145      6.2         0         1   
100   34     118   182        2    174      0.0         1         0   
110   56     125   249        2    144      1.2         1         0   
120   63     150   407        2    154      4.0         0         1   
130   

#### Repeat parts (d) and (e) with the new dataset that you built in part (f). How does the prediction accuracy change for each method?

In [236]:
X_train, X_test, y_train, y_test = train_test_split(ohe_X, y, test_size=0.25, random_state=6)

In [237]:
#find knn prediction
k = 3
knn = KNeighborsClassifier(n_neighbors=k) 
knn.fit(X_train,y_train)
y_predict_knn = knn.predict(X_test)

#find logreg prediction
my_logreg = LogisticRegression()
my_logreg.fit(X_train, y_train)
y_predict_lr = my_logreg.predict(X_test)

#find decisiontree prediction
my_decisiontree = DecisionTreeClassifier(random_state=5)
my_decisiontree.fit(X_train, y_train)
y_predict_dt = my_decisiontree.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
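
The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of that suggestion, assuming synthetic data in place of the heart dataset (`X_syn`, `y_syn`, and the column multipliers are made up for illustration), wraps the classifier in a pipeline with `StandardScaler`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the multipliers mimic features on very different scales.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)
X_syn = X_syn * [1, 10, 100, 1, 50, 1]

# Standardize each feature, then fit logistic regression with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_syn, y_syn)
print(scaled_logreg.score(X_syn, y_syn))
```

Applying this to the notebook would likely change the reported accuracies, so the numbers below still refer to the unscaled, default-settings model.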


In [238]:
#Find accuracy 
score_lr = accuracy_score(y_test, y_predict_lr)
score_dt = accuracy_score(y_test, y_predict_dt)
score_knn = accuracy_score(y_test, y_predict_knn)

print(score_lr)
print(score_dt)
print(score_knn)

0.7763157894736842
0.7368421052631579
0.6447368421052632


#### Accuracy for Logistic Regression (about 0.671 to 0.776) and Decision Tree (about 0.618 to 0.737) has increased, but KNN has stayed the same (about 0.645).
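
One plausible reason KNN is unchanged (an interpretation, not something this notebook tests) is that Euclidean distance on unscaled features is dominated by wide-range columns such as Chol, so the 0/1 dummy columns barely move it. The sketch below uses rows 0 and 10 from the sample printed earlier:

```python
import numpy as np

# Numeric features of rows 0 and 10 from the sample shown earlier:
# Age, RestBP, Chol, RestECG, MaxHR, Oldpeak
row_0 = np.array([63, 145, 233, 2, 150, 2.3])
row_10 = np.array([57, 140, 192, 0, 148, 0.4])

numeric_dist = np.linalg.norm(row_0 - row_10)

# The two rows differ only in ChestPain (typical vs asymptomatic), so exactly
# two dummy columns differ by 1 each; Gender and Thal match.
dummy_sq_diff = 2
full_dist = np.sqrt(numeric_dist**2 + dummy_sq_diff)

print(numeric_dist, full_dist)
```

The two distances differ by well under one percent, so the nearest neighbours of a point are unlikely to change much when the dummy columns are added without scaling.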

#### Now, repeat part (e) with the new dataset that you built in part (f), but this time using Cross-Validation. Thus, rather than splitting the dataset into testing and training, use 10-fold Cross-Validation (as we learned in Lab4) to evaluate the classification methods and report the final prediction accuracy. 

In [239]:
# importing the method:
from sklearn.model_selection import cross_val_score

In [240]:

accuracy_list_knn = cross_val_score(knn, ohe_X, y, cv=10, scoring='accuracy')
accuracy_list_dt = cross_val_score(my_decisiontree, ohe_X, y, cv=10, scoring='accuracy')
accuracy_list_lr = cross_val_score(my_logreg, ohe_X, y, cv=10, scoring='accuracy')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [241]:
# use average of accuracy values as final result
accuracy_cv_dt = accuracy_list_dt.mean()
accuracy_cv_knn = accuracy_list_knn.mean()
accuracy_cv_lr = accuracy_list_lr.mean()

print(accuracy_cv_dt)
print(accuracy_cv_knn)
print(accuracy_cv_lr)

0.750752688172043
0.6343010752688172
0.810752688172043


#### Logistic Regression has the best accuracy under 10-fold cross-validation (about 0.811), followed by Decision Tree (about 0.751) and KNN (about 0.634).
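
To show what `cross_val_score` does with `cv=10`, here is a sketch on synthetic data (`X_syn` and `y_syn` are stand-ins, not the heart dataset). For a classifier and an integer `cv`, scikit-learn uses stratified folds, so a manual loop over `StratifiedKFold` should reproduce the same per-fold accuracies:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and the same tree settings used in this notebook.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)
tree = DecisionTreeClassifier(random_state=5)

# Manual 10-fold loop: train on nine folds, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_syn, y_syn):
    model = clone(tree).fit(X_syn[train_idx], y_syn[train_idx])
    preds = model.predict(X_syn[test_idx])
    manual_scores.append(accuracy_score(y_syn[test_idx], preds))

library_scores = cross_val_score(tree, X_syn, y_syn, cv=10, scoring="accuracy")
print(np.mean(manual_scores), library_scores.mean())
```

Every sample is used for testing exactly once, which is why the averaged score is usually a steadier estimate than a single train/test split.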