## Dataset  

There are 14 columns in the dataset, where the patient_id column is a unique and random identifier. The remaining 13 features are described in the section below.  

slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart   

thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values normal, fixed_defect, reversible_defect   

resting_blood_pressure (type: int): resting blood pressure  

chest_pain_type (type: int): chest pain type (4 values)  

num_major_vessels (type: int): number of major vessels (0-3) colored by flourosopy  

fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl  

resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)  

serum_cholesterol_mg_per_dl (type: int): serum cholestoral in mg/dl  

oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms  

sex (type: binary): 0: female, 1: male  

age (type: int): age in years  

max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)  

exercise_induced_angina (type: binary): exercise-induced chest pain (0: False, 1: True)  

In [2]:
# import 
# algebra tools
import math
import numpy as np 

# processing tools
import pandas as pd 
import re
from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# visualization tools
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# algorithms
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import SGDRegressor

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR,SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

In [4]:
# upload train and test files
test_values = pd.read_csv("test_values.csv")
train_values = pd.read_csv("train_values.csv")
train_labels = pd.read_csv("train_labels.csv")

In [5]:
train_values.head()

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina
0,0z64un,1,normal,128,2,0,0,2,308,0.0,1,45,170,0
1,ryoo3j,2,normal,110,3,0,0,0,214,1.6,0,54,158,0
2,yt1s1x,1,normal,125,4,3,0,2,304,0.0,1,77,162,1
3,l2xjde,1,reversible_defect,152,4,0,0,0,223,0.0,1,40,181,0
4,oyt4ek,3,reversible_defect,178,1,0,0,2,270,4.2,1,59,145,0


In [6]:
train_labels.head()

Unnamed: 0,patient_id,heart_disease_present
0,0z64un,0
1,ryoo3j,0
2,yt1s1x,1
3,l2xjde,1
4,oyt4ek,0


In [7]:
train_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 14 columns):
patient_id                              180 non-null object
slope_of_peak_exercise_st_segment       180 non-null int64
thal                                    180 non-null object
resting_blood_pressure                  180 non-null int64
chest_pain_type                         180 non-null int64
num_major_vessels                       180 non-null int64
fasting_blood_sugar_gt_120_mg_per_dl    180 non-null int64
resting_ekg_results                     180 non-null int64
serum_cholesterol_mg_per_dl             180 non-null int64
oldpeak_eq_st_depression                180 non-null float64
sex                                     180 non-null int64
age                                     180 non-null int64
max_heart_rate_achieved                 180 non-null int64
exercise_induced_angina                 180 non-null int64
dtypes: float64(1), int64(11), object(2)
memory usage: 19.8+ KB


In [8]:
test_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 14 columns):
patient_id                              90 non-null object
slope_of_peak_exercise_st_segment       90 non-null int64
thal                                    90 non-null object
resting_blood_pressure                  90 non-null int64
chest_pain_type                         90 non-null int64
num_major_vessels                       90 non-null int64
fasting_blood_sugar_gt_120_mg_per_dl    90 non-null int64
resting_ekg_results                     90 non-null int64
serum_cholesterol_mg_per_dl             90 non-null int64
oldpeak_eq_st_depression                90 non-null float64
sex                                     90 non-null int64
age                                     90 non-null int64
max_heart_rate_achieved                 90 non-null int64
exercise_induced_angina                 90 non-null int64
dtypes: float64(1), int64(11), object(2)
memory usage: 9.9+ KB


## Nice, there are no missing data.

In [9]:
#lets create some features that might be interesting
dataset = [test_values, train_values]
for variable in dataset:
    
    variable['heart_rate/blood_pressure']= (variable['max_heart_rate_achieved']/variable['resting_blood_pressure']).round(2)
    variable['cholesterol/blood_pressure']= (variable['serum_cholesterol_mg_per_dl']/variable['resting_blood_pressure']).round(2)
    variable['blood_pressure/age']= (variable['resting_blood_pressure']/variable['age']).round(2)
    variable['cholesterol/age']= (variable['serum_cholesterol_mg_per_dl']/variable['age']).round(2)
    variable['heart_rate/age']= (variable['max_heart_rate_achieved']/variable['age']).round(2)
   
    variable['log10(heart_rate)']= np.log10(variable['max_heart_rate_achieved']).round(2)
    variable['log10(cholesterol)']= np.log10(variable['serum_cholesterol_mg_per_dl']).round(2)
    variable['log10(blood_pressure)']= np.log10(variable['resting_blood_pressure']).round(2)
    variable['log10(age)']= np.log10(variable['age']).round(2)
    
    #variable['heart_rate*blood_pressure']= (variable['max_heart_rate_achieved']*variable['resting_blood_pressure']).round(2)
    #variable['cholesterol*blood_pressure']= (variable['serum_cholesterol_mg_per_dl']*variable['resting_blood_pressure']).round(2)
    #variable['blood_pressure*age']= (variable['resting_blood_pressure']*variable['age']).round(2)
    #variable['cholesterol*age']= (variable['serum_cholesterol_mg_per_dl']*variable['age']).round(2)
    #variable['heart_rate*age']= (variable['max_heart_rate_achieved']*variable['age']).round(2)
    
    #variable['heart_rate*heart_rate']= (variable['max_heart_rate_achieved']*variable['max_heart_rate_achieved']).round(2)
    #variable['cholesterol*cholesterol']= (variable['serum_cholesterol_mg_per_dl']*variable['serum_cholesterol_mg_per_dl']).round(2)
    #variable['blood_pressure*blood_pressure']= (variable['resting_blood_pressure']*variable['resting_blood_pressure']).round(2)
    #variable['age*age']= (variable['age']*variable['age']).round(2)

In [10]:
train_values['thal'].value_counts()

normal               98
reversible_defect    74
fixed_defect          8
Name: thal, dtype: int64

In [11]:
train_values['chest_pain_type'].value_counts()

4    82
3    57
2    28
1    13
Name: chest_pain_type, dtype: int64

In [10]:
train_values['resting_ekg_results'].value_counts()

2    94
0    85
1     1
Name: resting_ekg_results, dtype: int64

In [12]:
train_values= pd.get_dummies(train_values, columns= ['thal','chest_pain_type','resting_ekg_results'], drop_first=False)
test_values= pd.get_dummies(test_values, columns= ['thal','chest_pain_type','resting_ekg_results'], drop_first=False)

In [13]:
train_values.head()

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,resting_blood_pressure,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,...,thal_fixed_defect,thal_normal,thal_reversible_defect,chest_pain_type_1,chest_pain_type_2,chest_pain_type_3,chest_pain_type_4,resting_ekg_results_0,resting_ekg_results_1,resting_ekg_results_2
0,0z64un,1,128,0,0,308,0.0,1,45,170,...,0,1,0,0,1,0,0,0,0,1
1,ryoo3j,2,110,0,0,214,1.6,0,54,158,...,0,1,0,0,0,1,0,1,0,0
2,yt1s1x,1,125,3,0,304,0.0,1,77,162,...,0,1,0,0,0,0,1,0,0,1
3,l2xjde,1,152,0,0,223,0.0,1,40,181,...,0,0,1,0,0,0,1,1,0,0
4,oyt4ek,3,178,0,0,270,4.2,1,59,145,...,0,0,1,1,0,0,0,0,0,1


In [14]:
#standardize features by removing the mean and scaling to unit variance
ss = StandardScaler()
features= train_values.columns.drop('patient_id').tolist()
ss.fit(train_values[features].values)

standardized_train_values = train_values.copy()
standardized_test_values = test_values.copy()

standardized_train_values[features] = ss.transform(train_values[features])
standardized_test_values[features] = ss.transform(test_values[features])

  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.


In [15]:
standardized_train_values.head()

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,resting_blood_pressure,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,...,thal_fixed_defect,thal_normal,thal_reversible_defect,chest_pain_type_1,chest_pain_type_2,chest_pain_type_3,chest_pain_type_4,resting_ekg_results_0,resting_ekg_results_1,resting_ekg_results_2
0,0z64un,-0.891241,-0.195195,-0.718403,-0.438238,1.118269,-0.903207,0.672022,-1.053964,0.932485,...,-0.215666,0.914732,-0.835532,-0.279006,2.329929,-0.680746,-0.914732,-0.945905,-0.074744,0.956501
1,ryoo3j,0.729197,-1.25632,-0.718403,-0.438238,-0.669778,0.527616,-1.488048,-0.087134,0.387084,...,-0.215666,0.914732,-0.835532,-0.279006,-0.429198,1.468977,-0.914732,1.057188,-0.074744,-1.045478
2,yt1s1x,-0.891241,-0.372049,2.385097,-0.438238,1.042182,-0.903207,0.672022,2.383654,0.568884,...,-0.215666,0.914732,-0.835532,-0.279006,-0.429198,-0.680746,1.093216,-0.945905,-0.074744,0.956501
3,l2xjde,-0.891241,1.219639,-0.718403,-0.438238,-0.498582,-0.903207,0.672022,-1.591092,1.432436,...,-0.215666,-1.093216,1.196843,-0.279006,-0.429198,-0.680746,1.093216,1.057188,-0.074744,-1.045478
4,oyt4ek,2.349636,2.752375,-0.718403,-0.438238,0.395442,2.852703,0.672022,0.449994,-0.203768,...,-0.215666,-1.093216,1.196843,3.584153,-0.429198,-0.680746,-0.914732,-0.945905,-0.074744,0.956501


## -
## basic models
## -


In [16]:
#define some shorthands for our models' inputs'
x_train = standardized_train_values.drop("patient_id", axis=1)
y_train = train_labels['heart_disease_present']
x_test  = standardized_test_values.drop("patient_id", axis=1)

In [17]:
sgdc = SGDClassifier(tol= 1e-3, random_state=23)
sgdc.fit(x_train,y_train)

sgdc_prediction = sgdc.predict(x_test)

sgdc_score = cross_val_score(sgdc, x_train, y_train, cv=5)

print((sgdc_score.mean()*100).round(2))
print((sgdc_score.std()*100).round(2))

73.33
5.15


In [18]:
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

knn_prediction = knn.predict(x_test)

knn_score = cross_val_score(knn, x_train, y_train, cv=5)

print((knn_score.mean()*100).round(2))
print((knn_score.std()*100).round(2))

78.33
7.74


In [19]:
gnb = GaussianNB()
gnb.fit(x_train, y_train)

gnb_prediction = gnb.predict(x_test)

gnb_score = cross_val_score(gnb, x_train, y_train, cv=5)

print((gnb_score.mean()*100).round(2))
print((gnb_score.std()*100).round(2))

71.67
6.67


In [20]:
mnb = MultinomialNB()
mnb.fit(x_train, y_train)

mnb_prediction = mnb.predict(x_test)

mnb_score = cross_val_score(mnb, x_train, y_train, cv=5)

print((mnb_score.mean()*100).round(2))
print((mnb_score.std()*100).round(2))

ValueError: Input X must be non-negative

In [21]:
perceptron = Perceptron(max_iter=100000, tol= 1e-3, random_state=23)
perceptron.fit(x_train, y_train)

perceptron_prediction = perceptron.predict(x_test)

perceptron_score = cross_val_score(perceptron, x_train, y_train, cv=5)

print((perceptron_score.mean()*100).round(2))
print((perceptron_score.std()*100).round(2))

72.78
5.67


In [22]:
dtc = DecisionTreeClassifier(random_state=23)
dtc.fit(x_train, y_train)

dtc_prediction = dtc.predict(x_test)

dtc_score = cross_val_score(dtc, x_train, y_train, cv=5)

print((dtc_score.mean()*100).round(2))
print((dtc_score.std()*100).round(2))

69.44
10.69


In [23]:
rfc = RandomForestClassifier(n_estimators=1000, random_state=23)
rfc.fit(x_train, y_train)

rfc_prediction = rfc.predict(x_test)

rfc_score = cross_val_score(rfc, x_train, y_train, cv=5)

print((rfc_score.mean()*100).round(2))
print((rfc_score.std()*100).round(2))

79.44
6.71


In [24]:
lsvc = LinearSVC(max_iter=100000)
lsvc.fit(x_train, y_train)

lsvc_prediction = lsvc.predict(x_test)

lsvc_score = cross_val_score(lsvc, x_train, y_train, cv=5)

print((lsvc_score.mean()*100).round(2))
print((lsvc_score.std()*100).round(2))

77.78
6.8


In [25]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(x_train, y_train)

lr_prediction = lr.predict(x_test)

lr_score = cross_val_score(lr, x_train, y_train, cv=5)

print((lr_score.mean()*100).round(2))
print((lr_score.std()*100).round(2))

81.11
6.43


In [26]:
linear = LinearRegression(normalize=True)
linear.fit(x_train, y_train)

linear_prediction = linear.predict(x_test)

linear_score = cross_val_score(linear, x_train, y_train, cv=5)

print((linear_score.mean()*100).round(2))
print((linear_score.std()*100).round(2))

34.72
16.57


In [27]:
br = BayesianRidge()
br.fit(x_train, y_train)

br_prediction = br.predict(x_test)

br_score = cross_val_score(br, x_train, y_train, cv=5)

print((br_score.mean()*100).round(2))
print((br_score.std()*100).round(2))

42.99
14.37


In [28]:
sgdr = SGDRegressor(tol= 1e-3, random_state=23)
sgdr.fit(x_train, y_train)

sgdr_prediction = sgdr.predict(x_test)

sgdr_score = cross_val_score(sgdr, x_train, y_train, cv=5)

print((sgdr_score.mean()*100).round(2))
print((sgdr_score.std()*100).round(2))

39.62
19.7


In [29]:
svr = SVR(gamma='scale', tol= 1e-3)
svr.fit(x_train, y_train)

svr_prediction = svr.predict(x_test)

svr_score = cross_val_score(svr, x_train, y_train, cv=5)

print((svr_score.mean()*100).round(2))
print((svr_score.std()*100).round(2))

29.87
14.1


In [30]:
svc = SVC(gamma='scale', tol= 1e-3)
svc.fit(x_train, y_train)

svc_prediction = svc.predict(x_test)

svc_score = cross_val_score(svc, x_train, y_train, cv=5)

print((svc_score.mean()*100).round(2))
print((svc_score.std()*100).round(2))

77.22
10.0


In [31]:
rfr = RandomForestRegressor(n_estimators=1000, random_state=23)
rfr.fit(x_train, y_train)

rfr_prediction = rfr.predict(x_test)

rfr_score = cross_val_score(rfr, x_train, y_train, cv=5)

print((rfr_score.mean()*100).round(2))
print((rfr_score.std()*100).round(2))

35.18
13.24


## Hyperparameter Tuning

In [32]:
lr = LogisticRegression(random_state=23, n_jobs=-1)


penalty = ['l2', 'l1', 'elasticnet', 'none']

dual= [True, False]

solver=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

C = [0.001, 0.01, 0.1, 1, 5, 10, 100]

fit_intercept = [True, False]

intercept_scaling= np.arange(0.05 , 2.05, 0.05)

max_iter = [1000,10000]
    
multi_class= ['ovr', 'multinomial', 'auto']

lr_param_grid= dict(penalty=penalty, dual=dual, solver=solver, C=C, fit_intercept=fit_intercept, 
                    intercept_scaling=intercept_scaling, max_iter=max_iter, multi_class=multi_class)

lr_grid= RandomizedSearchCV(lr, lr_param_grid, cv=5, scoring='accuracy',
                            n_iter=400, n_jobs=-1, random_state=23, error_score=0, iid=False)

lr_grid.fit(x_train, y_train)

print("Best estimator that was chosen by the search:")
print(lr_grid.best_estimator_)
f'Mean cross-validated score of the best estimator is: {(lr_grid.best_score_*100).round(2)}'

Best estimator that was chosen by the search:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=0.6500000000000001, max_iter=1000,
          multi_class='ovr', n_jobs=-1, penalty='l2', random_state=23,
          solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)


'Mean cross-validated score of the best estimator is: 82.22'

In [33]:
lr2 = lr_grid.best_estimator_
lr2.fit(x_train, y_train)

lr2_prediction = lr2.predict_proba(x_test)

lr2_score = cross_val_score(lr2, x_train, y_train, cv=5)

print((lr2_score.mean()*100).round(2))
print((lr2_score.std()*100).round(2))

82.22
7.58


In [34]:
lr2_prediction = pd.DataFrame(lr2_prediction)

In [35]:
cleveland_heart_disease_lr_prediction = test_values.patient_id.to_frame('patient_id')
cleveland_heart_disease_lr_prediction['heart_disease_present'] = lr2_prediction[1]
print(cleveland_heart_disease_lr_prediction.info())
cleveland_heart_disease_lr_prediction.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 2 columns):
patient_id               90 non-null object
heart_disease_present    90 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
None


Unnamed: 0,patient_id,heart_disease_present
0,olalu7,0.45158
1,z9n6mx,0.127255
2,5k4413,0.936419
3,mrg7q5,0.067847
4,uki4do,0.891971


In [36]:
cleveland_heart_disease_lr_prediction.to_csv('cleveland_heart_disease_lr2_prediction.csv', index = False) #DD submission loss of 0.36441 Rank:530/3654

In [37]:
svc = SVC(probability=True, max_iter=10000, random_state=23)


C= [0.00001, 0.0001, 0.001, 0.01] + np.arange(0.1, 1.05, 0.05).tolist() + np.arange(1, 11 , 1).tolist()

kernel= ['linear', 'poly', 'rbf', 'sigmoid']

degree= np.arange(0.1, 1.00, 0.05).tolist() + np.arange(1, 10 , 1).tolist() + np.arange(10, 110 , 10).tolist()

gamma= ['auto_deprecated', 'auto', 'scale']

decision_function_shape = ['ovr', 'ovo']

coef0= [0.00001, 0.0001, 0.001, 0.01] + np.arange(0.1, 1.05, 0.05).tolist() + np.arange(1, 11 , 1).tolist()

shrinking = [True, False]

svc_param_grid= dict(kernel=kernel, degree=degree, C=C, coef0=coef0, gamma=gamma,
                      shrinking = shrinking, decision_function_shape=decision_function_shape) 

svc_grid= RandomizedSearchCV(svc, svc_param_grid, cv=5, scoring='accuracy',
                              n_iter=2000, n_jobs=-1, random_state=23, error_score=0, iid=False)

svc_grid.fit(x_train, y_train)


print("Best estimator that was chosen by the search:")
print(svc_grid.best_estimator_)
f'Mean cross-validated score of the best estimator is: {(svc_grid.best_score_*100).round(2)}'

Best estimator that was chosen by the search:
SVC(C=0.20000000000000004, cache_size=200, class_weight=None, coef0=0.001,
  decision_function_shape='ovo', degree=70, gamma='auto_deprecated',
  kernel='rbf', max_iter=10000, probability=True, random_state=23,
  shrinking=False, tol=0.001, verbose=False)


'Mean cross-validated score of the best estimator is: 82.78'

In [38]:
svc2 = lr_grid.best_estimator_
svc2.fit(x_train, y_train)

svc2_prediction = svc2.predict_proba(x_test)

svc2_score = cross_val_score(svc2, x_train, y_train, cv=5)
              
print((svc2_score.mean()*100).round(2))
print((svc2_score.std()*100).round(2))

82.22
7.58


In [39]:
svc2_prediction = pd.DataFrame(svc2_prediction)

In [40]:
cleveland_heart_disease_svc2_prediction = test_values.patient_id.to_frame('patient_id')
cleveland_heart_disease_svc2_prediction['heart_disease_present'] = svc2_prediction[1].round(6)
print(cleveland_heart_disease_svc2_prediction.info())
cleveland_heart_disease_svc2_prediction.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 2 columns):
patient_id               90 non-null object
heart_disease_present    90 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
None


Unnamed: 0,patient_id,heart_disease_present
0,olalu7,0.45158
1,z9n6mx,0.127255
2,5k4413,0.936419
3,mrg7q5,0.067847
4,uki4do,0.891971


In [41]:
#cleveland_heart_disease_svc2_prediction.to_csv('cleveland_heart_disease_svc2_prediction.csv', index = False) #DD submission log loss = 0.36441

In [42]:
gnb = GaussianNB()

var_smoothing=[ 0, 0.00000001, 0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01] + np.arange(0.1, 1.05, 0.05).tolist() + np.arange(1, 11 , 1).tolist()

gnb_param_grid= dict(var_smoothing=var_smoothing) 

gnb_grid= RandomizedSearchCV(gnb, gnb_param_grid, cv=5, scoring='accuracy',
                              n_iter=37, n_jobs=-1, random_state=23, error_score=0, iid=False)

gnb_grid.fit(x_train, y_train)


print("Best estimator that was chosen by the search:")
print(gnb_grid.best_estimator_)
f'Mean cross-validated score of the best estimator is: {(gnb_grid.best_score_*100).round(2)}'

Best estimator that was chosen by the search:
GaussianNB(priors=None, var_smoothing=0.01)


'Mean cross-validated score of the best estimator is: 80.56'

In [43]:
gnb2 = gnb_grid.best_estimator_
gnb2.fit(x_train, y_train)

gnb2_prediction = gnb2.predict_proba(x_test)

gnb2_score = cross_val_score(gnb, x_train, y_train, cv=5)

print((gnb2_score.mean()*100).round(2))
print((gnb2_score.std()*100).round(2))

71.67
6.67


In [44]:
gnb2_prediction = pd.DataFrame(gnb2_prediction.round(6))
gnb2_prediction.head()

Unnamed: 0,0,1
0,0.954546,0.045454
1,1.0,0.0
2,3.6e-05,0.999964
3,1.0,0.0
4,9.5e-05,0.999905


In [45]:
cleveland_heart_disease_gnb2_prediction = test_values.patient_id.to_frame('patient_id')
cleveland_heart_disease_gnb2_prediction['heart_disease_present'] = gnb2_prediction[1].round(6)
print(cleveland_heart_disease_gnb2_prediction.info())
cleveland_heart_disease_gnb2_prediction.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 2 columns):
patient_id               90 non-null object
heart_disease_present    90 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
None


Unnamed: 0,patient_id,heart_disease_present
0,olalu7,0.045454
1,z9n6mx,0.0
2,5k4413,0.999964
3,mrg7q5,0.0
4,uki4do,0.999905


In [46]:
#cleveland_heart_disease_gnb2_prediction.to_csv('cleveland_heart_disease_gnb2_prediction.csv', index = False) #DD submission log loss = 0.52932

In [47]:
rfc = RandomForestClassifier(random_state=23, n_jobs=-1)


criterion = ['gini', 'entropy'] #The function to measure the quality of a split

n_estimators = [1, 10, 100, 1000, 10000] #The number of trees in the forest

min_samples_split = [2,4,8] #The minimum number of samples required to split an internal node

min_samples_leaf = [1,2,4] #The minimum number of samples required to be at a leaf node

max_features = ['auto', 'sqrt', 'log2'] #The number of features to consider when looking for the best split

rfc_param_grid= dict(criterion=criterion, n_estimators=n_estimators, 
                     min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split, max_features=max_features)

rfc_grid= RandomizedSearchCV(rfc, rfc_param_grid, cv=5, scoring='accuracy',
                             n_jobs=-1, n_iter=270, random_state=23, error_score=0, iid=False)

rfc_grid.fit(x_train, y_train)

print("Best estimator that was chosen by the search:")
print(rfc_grid.best_estimator_)
f'Mean cross-validated score of the best estimator is: {(rfc_grid.best_score_*100).round(2)}'

Best estimator that was chosen by the search:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=8,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
            oob_score=False, random_state=23, verbose=0, warm_start=False)


'Mean cross-validated score of the best estimator is: 81.11'

In [48]:
rfc2 = rfc_grid.best_estimator_
rfc2.fit(x_train, y_train)

rfc2_prediction = rfc2.predict_proba(x_test)

rfc2_score = cross_val_score(rfc2, x_train, y_train, cv=5)

print((rfc2_score.mean()*100).round(2))
print((rfc2_score.std()*100).round(2))

81.11
7.54


In [49]:
rfc_features= pd.DataFrame()
rfc_features['feature']= x_train.columns
rfc_features['feature_importances_']= rfc2.feature_importances_*100
rfc_features.sort_values(by='feature_importances_', ascending=False).round(2)
#zip(x_train.columns, rfc2.feature_importances_)

Unnamed: 0,feature,feature_importances_
21,thal_reversible_defect,9.71
20,thal_normal,9.44
25,chest_pain_type_4,8.47
5,oldpeak_eq_st_depression,7.01
2,num_major_vessels,6.09
9,exercise_induced_angina,5.08
14,heart_rate/age,4.86
10,heart_rate/blood_pressure,4.83
8,max_heart_rate_achieved,4.56
15,log10(heart_rate),3.68


In [50]:
rfc2_prediction = pd.DataFrame(rfc2_prediction.round(6))
rfc2_prediction.head()

Unnamed: 0,0,1
0,0.563488,0.436512
1,0.807111,0.192889
2,0.105916,0.894084
3,0.624649,0.375351
4,0.243495,0.756505


In [51]:
cleveland_heart_disease_rfc2_prediction = test_values.patient_id.to_frame('patient_id')
cleveland_heart_disease_rfc2_prediction['heart_disease_present'] = rfc2_prediction[1].round(6)
print(cleveland_heart_disease_rfc2_prediction.info())
cleveland_heart_disease_rfc2_prediction.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 2 columns):
patient_id               90 non-null object
heart_disease_present    90 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
None


Unnamed: 0,patient_id,heart_disease_present
0,olalu7,0.436512
1,z9n6mx,0.192889
2,5k4413,0.894084
3,mrg7q5,0.375351
4,uki4do,0.756505


In [52]:
#cleveland_heart_disease_rfc2_prediction.to_csv('cleveland_heart_disease_rfc3_prediction.csv', index = False) #DD submission log loss = 0.39718

In [53]:
vc= VotingClassifier(estimators=[('gnb2', gnb2), ('rfc2', rfc2), ('lr2', lr2), ('svc2', svc2)],
                     voting='soft',
                     weights = [1,1,1,1],
                     n_jobs=-1)

vc.fit(x_train, y_train)

vc_prediction = vc.predict_proba(x_test)

vc_score = cross_val_score(vc, x_train, y_train, cv=5)

print((vc_score.mean()*100).round(2))
print((vc_score.std()*100).round(2))

81.11
7.11


In [54]:
vc_prediction = pd.DataFrame(vc_prediction.round(6))
vc_prediction.head()

Unnamed: 0,0,1
0,0.653718,0.346282
1,0.88815,0.11185
2,0.058279,0.941721
3,0.872239,0.127761
4,0.114912,0.885088


In [55]:
cleveland_heart_disease_vc_prediction = test_values.patient_id.to_frame('patient_id')
cleveland_heart_disease_vc_prediction['heart_disease_present'] = vc_prediction[1].round(6)
print(cleveland_heart_disease_vc_prediction.info())
cleveland_heart_disease_vc_prediction.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 2 columns):
patient_id               90 non-null object
heart_disease_present    90 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
None


Unnamed: 0,patient_id,heart_disease_present
0,olalu7,0.346282
1,z9n6mx,0.11185
2,5k4413,0.941721
3,mrg7q5,0.127761
4,uki4do,0.885088


In [56]:
#cleveland_heart_disease_vc_prediction.to_csv('cleveland_heart_disease_vc3_prediction.csv', index = False) #DD submission log loss = 0.34338 Rank:328/3704