# **Using machine learning to understand, predict, and prevent cardiovascular disease**
    Nathan S. Robinson
    Computing Science Department
    Sam Houston State University
    nsr004@shsu.edu
    August 5th, 2018

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
$.getScript('https://ipython-contrib.github.io/jupyter_contrib_nbextensions/blob/master/src/jupyter_contrib_nbextensions/nbextensions/toc2/main.js')

<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

# Introduction

## Motivation
The American Heart Association Statistics 2016 Report indicates that heart disease is the leading cause of death for both men and women, responsible for 1 in every 4 deaths. Even modest improvements in prognostic models of heart events and complications could save hundreds of lives and help to significantly reduce the cost of health care services, medications, and lost productivity.


## Methods 

There are hundreds of machine learning models, but often only a few yield high-accuracy results for a given dataset. To narrow the search for a useful and accurate method for this dataset, the results of several algorithms will be compared. The front-runners will be analyzed and used to develop a unique, higher-accuracy method. The end goal is to produce a machine learning application that could be approved for use in healthcare; to be approved, it must pass tests showing that its results are at least as accurate as those humans can currently produce. Recently, ML models of this kind were used to detect, with cardiologist-level accuracy, 14 types of arrhythmia (sometimes life-threatening irregular heartbeats) from electrocardiogram (ECG) signals generated by wearable monitors.

## Objectives

1. Understand the problem: look at each variable and do a philosophical analysis about their meaning and importance for this problem.  
1. Univariate study: focus on the dependent variable ('pred_attribute') and learn a little more about it.  
1. Multivariate study: understand how the dependent variable and independent variables relate.  
1. Basic cleaning: clean the dataset and handle the missing data, outliers and categorical variables  
1. Test assumptions: check if our data meets the assumptions required by most multivariate techniques.  

A further objective is to establish the performance of deep learning models, such as deep belief networks and convolutional neural networks, and of ensembles relative to classical machine learning algorithms (including logistic regression), using case studies built from well-known heart disease data sets such as the Cleveland set available from the UCI repository.  

Research questions of interest are:  
At what sample size in heart disease studies would the more complex but potentially more effective deep learning models be recommended?  
Would ensembles of machine learning models provide more robust predictions, as has been the case in other knowledge domains?  
Should the ACC/AHA list of eight risk factors be updated with other genetic or lifestyle factors?  

The deep learning models will be implemented in TensorFlow (originally from Google, now open source) and healthcare.ai, an open-source package that facilitates the development of machine learning in healthcare, with the provision that they can handle so-called big data by using the Hadoop/Spark platform.   

## Review Data Source

The authors of the databases have requested:  
  ...that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution.  They would be:  

   1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
   2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
   3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
   4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:  Robert Detrano, M.D., Ph.D.

#### Past Usage

1. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). 
    1. *International application of a new probability algorithm for the diagnosis of coronary artery disease.* *American Journal of Cardiology*, *64*, 304-310.
    1. International Probability Analysis
    1. Address: Robert Detrano, M.D. Cardiology 111-C V.A. Medical Center 5901 E. 7th Street Long Beach, CA 90028
    1. Results in percent accuracy (for 0.5 probability threshold), by data set, for the CDF and CADENZA algorithms:
    1. Hungarian: CDF 77, CADENZA 74; Long Beach: CDF 79, CADENZA 77; Swiss: CDF 81, CADENZA 81
    1. Approximately a 77% correct classification accuracy with a logistic-regression-derived discriminant function
1. David W. Aha & Dennis Kibler 
    1. Instance-based prediction of heart-disease presence with the Cleveland database 
    1. NTgrowth: 77.0% accuracy 
    1. C4: 74.8% accuracy
1. John Gennari 
    1. Gennari, J. H., Langley, P., & Fisher, D. (1989). Models of incremental concept formation. *Artificial Intelligence, 40*, 11-61. 
    1. Results: The CLASSIT conceptual clustering system achieved a 78.9% accuracy on the Cleveland database.

#### Relevant Information

1. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.
1. In particular, the Cleveland Clinic Foundation (cleveland.data) database is the only one that has been used by ML researchers to this date. 
1. The "pred_attribute" field refers to the presence of heart disease in the patient. 
1. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

#### Complete attribute documentation

1. age: age in years
2. sex: (1 = male; 0 = female) 
3. cp: chest pain type
    1. Value 1: typical angina
    1. Value 2: atypical angina
    1. Value 3: non-anginal pain
    1. Value 4: asymptomatic  
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital) 
5. chol: serum cholesterol in mg/dl  
6. fbs: (fasting blood sugar > 120 mg/dl)
    1. 1 = true
    1. 0 = false  
7. restecg: (resting electrocardiographic results)
    1. Value 0: normal    
    1. Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)    
    1. Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved  
9. exang: exercise induced angina
    1. 1 = yes
    1. 0 = no 
10. oldpeak: ST depression induced by exercise relative to rest 
11. slope: the slope of the peak exercise ST segment     
    1. Value 1: upsloping
    1. Value 2: flat    
    1. Value 3: downsloping  
12. ca: number of major vessels (0-3) colored by fluoroscopy  
13. thal: thallium heart scan    
    1. 3 = normal (no cold spots)
    1. 6 = fixed defect (cold spots during rest and exercise)   
    1. 7 = reversible defect (when cold spots only appear during exercise)
14. pred_attribute: (the predicted attribute) diagnosis of heart disease (angiographic disease status)    
    1. Value 0: < 50% diameter narrowing     
    1. Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 are vessels)

# Data Exploration

Data exploration will be accomplished using exploratory data analysis (EDA). EDA should provide some insights for subsequent processing and modeling. This analysis will include visualization of the data and statistical examinations.

## Data and Environment Setup

### Import data processing and analysis tools
This Python 3 environment comes with many helpful analytics libraries installed  
It is defined by the [kaggle/python docker image](https://github.com/kaggle/docker-python)

In [None]:
# Import environment tools
import re
import itertools
import warnings
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
import pandas as pd
import numpy as np
import keras

# Import plotly tools
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls

# Import keras tools
from keras import regularizers
from keras.callbacks import History 
from keras.layers import Dense, Input, Dropout
from keras.models import Sequential
from keras.utils import np_utils
from keras.utils.np_utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier

# Import other tools
from pandas import read_excel
from IPython.display import Image
from collections import Counter
from itertools import cycle
from scipy import stats, integrate, interp
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

### Import and store ML models
17 different models from 8 different categories of machine learning algorithms

In [None]:
# Import relevant machine learning models

import sklearn
# Gradient Boosters
import xgboost as xgb # Accuracy
import lightgbm as lgb # Speed

from sklearn import decomposition, preprocessing, svm
# Dimensionality Reduction
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
# Ensemble
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier, ExtraTreesClassifier
# Gaussian Process
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
# Regression
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
# Bayesian
from sklearn.naive_bayes import GaussianNB
# Instance Based
from sklearn.neighbors import KNeighborsClassifier
# Neural Network
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

# Import relevant machine learning analysis tools
from sklearn import metrics
#from sklearn.cross_validation import KFold, train_test_split
# Imputation
from sklearn.impute import SimpleImputer 
from sklearn.metrics import mean_absolute_error,roc_curve,accuracy_score,auc,roc_auc_score,confusion_matrix,precision_score,recall_score,f1_score, classification_report
from sklearn.metrics.cluster import fowlkes_mallows_score
from sklearn.model_selection import BaseCrossValidator, GridSearchCV, train_test_split,cross_val_score,cross_validate,cross_val_predict, KFold, StratifiedKFold, learning_curve
from sklearn.pipeline import Pipeline
# Standardization
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize


# Initial tool settings
%matplotlib inline
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
plt.style.use('fivethirtyeight')
sns.set(style='white', context='notebook', palette='deep')
py.init_notebook_mode(connected=True)

history = History()
random_state = 43

names = ["k-Nearest Neighbors",
         "Support Vector Machine",
         "Linear SVM",
         "RBF SVM",
         "Gaussian Process",
         "Decision Tree",
         "Extra Trees",
         "Random Forest",
         "Gradient Boosting",
         "AdaBoost",
         "Gaussian Naive Bayes",
         "LDA",
         "QDA",
         "Logistic Regression",
         "SGD Classifier",
         "Multilayer Perceptron",
         "Voting Classifier"
        ]

algorithms = [ KNeighborsClassifier(n_neighbors=3),
               SVC(random_state=random_state),
               SVC(kernel="linear",random_state=random_state),
               SVC(kernel="rbf",random_state=random_state),
               GaussianProcessClassifier(),
               DecisionTreeClassifier(random_state=random_state),
               ExtraTreesClassifier(random_state=random_state),
               RandomForestClassifier(random_state=random_state),
               GradientBoostingClassifier(random_state=random_state),
               AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),n_estimators=10,learning_rate=0.1,random_state=random_state),
               GaussianNB(),
               LinearDiscriminantAnalysis(),
               QuadraticDiscriminantAnalysis(),
               LogisticRegression(random_state=random_state),
               SGDClassifier(),               
               MLPClassifier(hidden_layer_sizes=(100,),momentum=0.9,solver='sgd',random_state=random_state),
               VotingClassifier(estimators=[('log', LogisticRegression()), ('SVM',SVC(C=1000)), ('MLP', MLPClassifier(hidden_layer_sizes=(100,)))], voting='hard')
              ]
#algorithms.append(SVC(random_state=random_state))

classifiers = {  "k-Nearest Neighbors" : KNeighborsClassifier(n_neighbors=3),
                 "Support Vector Machine" :  SVC(random_state=random_state),
                 "Linear SVM" :  SVC(kernel="linear",random_state=random_state),
                 "RBF SVM" :  SVC(kernel="rbf",random_state=random_state),
                 "Gaussian Process" : GaussianProcessClassifier(),
                 "Decision Tree" : DecisionTreeClassifier(random_state=random_state),
                 "Extra Trees" : ExtraTreesClassifier(random_state=random_state),
                 "Random Forest" : RandomForestClassifier(random_state=random_state),
                 "Gradient Boosting" : GradientBoostingClassifier(random_state=random_state),
                 "AdaBoost" : AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),n_estimators=10,random_state=random_state,learning_rate=0.1),
                 "Gaussian Naive Bayes" : GaussianNB(),
                 "LDA" : LinearDiscriminantAnalysis(),
                 "QDA" :  QuadraticDiscriminantAnalysis(),
                 "Logistic Regression" : LogisticRegression(random_state=random_state),
                 "SGD Classifier" : SGDClassifier(),
                 "Multilayer Perceptron" :  MLPClassifier(hidden_layer_sizes=(100,),momentum=0.9,solver='sgd',random_state=random_state),
                 "Voting Classifier" : VotingClassifier(estimators=[('log', LogisticRegression()), ('SVM',SVC(C=1000)), ('MLP', MLPClassifier(hidden_layer_sizes=(100,)))], voting='hard')
              }

### Import Dataset

In [None]:
# Input data files are available in the "../input/" directory
dataset = pd.read_csv("../input/Heart_Disease_Data.csv", na_values="?", low_memory = False)

#dataset=pd.read_csv("../input/Heart_Disease_Data.csv", delimiter=',', header = None, )
#dataset = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleve.mod', skiprows=20, header=None, sep='\s+', names=features, index_col=False, na_values="?")
# Collapse disease severities 1-4 into a single "sick" class (1); 0 stays "healthy"
dataset["pred_attribute"] = dataset["pred_attribute"].replace(to_replace=[1, 2, 3, 4], value=1)

np_dataset = np.asarray(dataset)  
#np_dataset = open("../input/Heart_Disease_Data.csv")  
#np_dataset.readline()  # Skip the feature names row
#np_data = np.loadtxt(np_dataset, delimiter = ',')  
#mx_dataset = dataset.as_matrix()

# 13 dataset features
feature13 = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slop','ca','thal']

### Review heart disease dataset samples

In [None]:
dataset.head()

In [None]:
dataset.tail()

## Visualization

1. Inspect the distribution of the target variable (an imbalanced distribution of target variable might harm the performance of some models) 
1. Use box plot and scatter plot to inspect their distributions and check for outliers
1. Plot the data with points colored according to their class labels; this helps with feature engineering
1. Make pairwise distribution plots and examine their correlations

**Statistical Tests**  
We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in real world. So we have to be careful about which test to use and how we interpret the findings.

In many competitions, public leaderboard (LB) scores are not very consistent with local cross-validation (CV) scores because of noise or non-i.i.d. distributions. You can use test results to roughly set a threshold for deciding whether an increase in score reflects genuine improvement or randomness.
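
As a rough illustration of the kind of test meant here, the sketch below compares two sets of per-fold scores with a paired t-test from SciPy. The score values are invented for the example and are not results from this notebook; because cross-validation folds overlap in their training data, the p-value should be read as a loose guide rather than an exact significance level.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models evaluated on the same 10 folds
scores_a = np.array([0.80, 0.83, 0.79, 0.85, 0.81, 0.78, 0.84, 0.82, 0.80, 0.83])
scores_b = np.array([0.82, 0.84, 0.80, 0.86, 0.83, 0.80, 0.85, 0.84, 0.81, 0.85])

# Paired t-test: the folds are shared, so the scores are compared fold by fold
t_stat, p_value = stats.ttest_rel(scores_b, scores_a)

# Average improvement of model B over model A across the folds
mean_gain = (scores_b - scores_a).mean()
print(f"mean gain = {mean_gain:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value together with a consistent per-fold gain suggests the improvement is less likely to be noise.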

### Review all features

#### All participants

In [None]:
columns=dataset.columns[:13]
plt.figure(figsize=(20,15))
length=len(columns)
n_cols=3
n_rows=(length+n_cols-1)//n_cols  # whole number of rows needed to fit every feature
for j,i in enumerate(columns):
    plt.subplot(n_rows,n_cols,j+1)
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    dataset[i].hist(bins=20,edgecolor='black')
    plt.title(i)
plt.show()

#### Only heart disease participants

In [None]:
dataset_copy=dataset[dataset['pred_attribute']==1]
columns=dataset.columns[:13]
plt.figure(figsize=(20,15))
length=len(columns)
n_cols=3
n_rows=(length+n_cols-1)//n_cols  # whole number of rows needed to fit every feature
for j,i in enumerate(columns):
    plt.subplot(n_rows,n_cols,j+1)
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    dataset_copy[i].hist(bins=20,edgecolor='black')
    plt.title(i)
plt.show()

### Review continuous features
Pair plot, box plot, basic statistics

In [None]:
features_continuous=["age", "trestbps", "chol", "thalach", "oldpeak", "pred_attribute"]
#features_continuous=["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]

#### Pair plot
   
   1. The diagonal shows the distribution of each feature as a kernel density plot.  
   1. The scatter plots show the relationship between each pair of features.  
   1. Looking at the scatter plots, no pair of features clearly separates the two outcome classes.  

In [None]:
sns.pairplot(data=dataset[features_continuous],hue='pred_attribute',diag_kind='kde')
#plt.gcf().set_size_inches(20,15)
plt.show()

#### Box plot

In [None]:
sns.boxplot(data=dataset[features_continuous])
#plt.gcf().set_size_inches(20,15)
plt.show()

#### Statistics

In [None]:
dataset[features_continuous].describe()

### Review categorical features  
Histograms, basic statistics

#### Diagnosis of heart disease (angiographic disease status): "pred_attribute"  
Value 0 = diameter narrowing <= 50%  (Healthy)  
Value 1 =  diameter narrowing > 50% (Sick)  

In [None]:
sns.countplot(x='pred_attribute',data=dataset)
plt.show()

#### Sex: "sex"
Value 0 = female  
Value 1 = male  

In [None]:
sns.countplot(x='sex',data=dataset)
plt.show()

#### Fasting blood sugar: "fbs"
Value 0 = fbs <= 120 mg/dl  
Value 1 = fbs > 120 mg/dl

In [None]:
sns.countplot(x='fbs',data=dataset)
plt.show()

#### Slope of the peak exercise ST segment: "slop"    
Value 1 = upsloping   
Value 2 = flat   
Value 3 = downsloping  

In [None]:
sns.countplot(x='slop',data=dataset)
plt.show()

#### Chest pain type: "cp"  
Value 1 = typical angina   
Value 2 = atypical angina   
Value 3 = non-anginal pain   
Value 4 = asymptomatic   

In [None]:
sns.countplot(x='cp',data=dataset)
plt.show()

#### Exercise induced angina (chest pain): "exang"
Value 0 = no    
Value 1 = yes  

In [None]:
sns.countplot(x='exang',data=dataset)
plt.show()

#### Number of major vessels colored by fluoroscopy: "ca"
Value 0 = 0 major blood vessels colored  
Value 1 = 1 major blood vessel colored  
Value 2 = 2 major blood vessels colored  
Value 3 = 3 major blood vessels colored  

In [None]:
sns.countplot(x='ca',data=dataset)
plt.show()

#### Thallium heart test: "thal"  
Value 3 = normal  
Value 6 = fixed defect  
Value 7 = reversible defect   

In [None]:
sns.countplot(x='thal',data=dataset)
plt.show()

#### Resting electrocardiographic results: "restecg"   
Value 0 = normal   
Value 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)   
Value 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria  

In [None]:
sns.countplot(x='restecg',data=dataset)
plt.show()

# Data Preprocessing

Raw data files must often be reviewed and then cleaned before they can be used effectively for analysis and machine learning. This includes:

1. Dealing with missing data.
1. Dealing with outliers.
1. Encoding categorical variables if necessary.
1. Dealing with noise.
    1. For example: floats derived from raw figures may be truncated. The loss of precision during floating-point arithmetic can bring noise into the data: two seemingly different values might have been the same before conversion. Noise can harm model performance.
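
To make the categorical-encoding step concrete, here is a minimal sketch of one-hot encoding with `pandas.get_dummies`. The toy frame and its values are illustrative only; the notebook itself feeds the integer codes directly to the models, so this is an optional alternative rather than what is done below.

```python
import pandas as pd

# Toy frame using two of the categorical columns described earlier (values are illustrative)
toy = pd.DataFrame({
    "age": [63, 67, 41, 56, 62],
    "cp": [1, 2, 3, 4, 4],
    "thal": [3, 6, 7, 3, 7],
})

# One-hot encode the nominal columns; the numeric "age" column passes through unchanged
encoded = pd.get_dummies(toy, columns=["cp", "thal"], prefix=["cp", "thal"])
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between codes such as chest pain types 1 to 4, at the cost of more columns.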

## Imputation
**Check for invalid tuples in dataset**  
Any NaN or invalid entries will have to be addressed via imputation or feature removal.

In [None]:
dataset.isnull().sum()

In [None]:
#dataset = dataset.convert_objects(convert_numeric=True)

# Load data
X = dataset.iloc[:, :-1].values  
y = dataset.iloc[:, -1].values # = dataset.iloc[:, 13].values

#dataset.dropna(inplace=True, axis=0, how="any")
#X=dataset.loc[:, "age":"thal" ]
#y=dataset["pred_attribute"]
#np_X = np_dataset[:, :-1]  
#np_y = np_dataset[:, -1]  

my_imputer = SimpleImputer()
my_imputer = my_imputer.fit(X[:,0:13])   
X[:, 0:13] = my_imputer.transform(X[:, 0:13])
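
For intuition about what the imputer does, the toy sketch below fills missing entries with the column mean, which is the default strategy of `SimpleImputer`. The small array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value in each column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(toy)
print(filled)
```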

## Standardization
The features in this dataset have very different ranges and variances, so they have to be standardized. Standardization transforms each attribute, ideally one with a roughly Gaussian distribution, so that attributes with differing means and standard deviations all end up with a mean of 0 and a standard deviation of 1.
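
The transformation itself is z = (x - mean) / std, applied column by column. The short sketch below, on made-up blood pressure values, checks that a manual computation agrees with `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative resting blood pressure values (not taken from the dataset)
x = np.array([[120.0], [130.0], [140.0], [150.0], [160.0]])

# Manual standardization with the population standard deviation (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(x)
print(z_sklearn.ravel())
```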

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

#features_all = dataset[dataset.columns[:13]]
#features_standard = scaler.fit_transform(dataset[dataset.columns[:13]]) # Gaussian Standardisation
#X = pd.DataFrame(features_standard,columns=[feature13])
#X['pred_attribute'] = dataset['pred_attribute']
#outcome = X['pred_attribute']

## Stratification
When we split the dataset into training and test sets, the split is completely random, so the number of instances of each class label in either set is also random. We may end up with many instances of one class and few instances of the other in the training data, in which case predictions may be accurate for the first class but not for the second. Stratifying the split keeps the class proportions the same in both the training and test data.

**Define training and test samples**  
The Cleveland data set available from the UCI repository has 303 samples; the training and test data sets were randomly selected with 30% of the original data set corresponding to the test data set.  The relative proportions of the classes of interest (disease/no disease) in both sets were checked to be similar.
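
One way to enforce matching class proportions, rather than only checking them afterwards, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (70 healthy, 30 sick) as a stand-in for the real data; the names `X_demo` and `y_demo` are placeholders introduced for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 70 healthy (0) and 30 sick (1); the features are placeholders
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 70/30 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=43, stratify=y_demo)

print("sick share, train:", y_tr.mean(), "test:", y_te.mean())
```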

In [None]:
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)

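# Count healthy (0) and sick (1) samples in each split
def class_counts(labels):
    return [int((labels == 0).sum()), int((labels == 1).sum())]

freqs = pd.DataFrame({"Training dataset": class_counts(y_train),
                      "Test dataset": class_counts(y_test),
                      "Total": class_counts(y)},
                     index=["Healthy", "Sick"])
freqs[["Training dataset", "Test dataset", "Total"]]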

## Error from imputation
Get mean absolute error score from imputation with extra columns showing what was imputed

In [None]:
def err_score(X_train, X_test, y_train, y_test):
    model =  RandomForestRegressor(random_state=random_state)
    model.fit(X_train, y_train)
    
    # predict class labels for the train set
    pred_train = model.predict(X_train)
    # predict class labels for the test set
    pred_test = model.predict(X_test)
    # check the mean absolute error on test set
    print("Mean Absolute Error from imputation: ", mean_absolute_error(y_test, pred_test))
    
#X = my_imputer.fit_transform(X)
#X_train = my_imputer.fit_transform(X_train)
#X_test = my_imputer.transform(X_test)
err_score(X_train, X_test, y_train, y_test)

## Accuracy of training and test sets

In [None]:
# instantiate a logistic regression model, and fit with X and y (with training data in X,y)
model = LogisticRegression(random_state = random_state)
model.fit(X_train, y_train)

# check the accuracy on the training set
print("Accuracy on training set: ", model.score(X_train, y_train))
# check the accuracy on the test set
print("Accuracy on test set: ", model.score(X_test, y_test))

## Confusion matrices of training and test sets

**Predict class labels for the training set**  
0 = Healthy  
1 = Sick

In [None]:
pred_train = model.predict(X_train)
#confusion_matrix(y_train,pred_train)
pd.crosstab(y_train, pred_train, rownames=['Reality'], colnames=['Predicted'], margins=True)

**Predict class labels for the test set**  
0 = Healthy  
1 = Sick

In [None]:
pred_test = model.predict(X_test)
#confusion_matrix(y_test,pred_test)
pd.crosstab(y_test, pred_test, rownames=['Reality'], colnames=['Predicted'], margins=True)

## Accuracy of models with all features
Now, without any feature engineering, let's check how the models perform on the test data.

In [None]:
data = dataset[dataset.columns[:13]]
outcome = dataset['pred_attribute']

#train,test=train_test_split(dataset,test_size=0.25,random_state=0,stratify=dataset['pred_attribute'])# stratify the outcome
#train_X=train[train.columns[:13]]
#test_X=test[test.columns[:13]]
#train_Y=train['pred_attribute']
#test_Y=test['pred_attribute']

data_copy=[]
for model in algorithms:
    model.fit(X_train,y_train)
    pred_test = model.predict(X_test)
    data_copy.append(metrics.accuracy_score(pred_test, y_test))
    
models_df = pd.DataFrame(data_copy, index=names)   
models_df.columns=['Accuracy']
models_df

# Feature Engineering

1. Too many features can hurt the accuracy of the algorithm.  
1. Fewer features mean faster training.
1. Some features are linearly related to others. This redundancy might put a strain on the model.
1. By picking out the most important features, we can use interactions between them as new features. Sometimes this gives a surprising improvement.
1. Feature selection means selecting only the important features in order to improve the accuracy of the algorithm.  
1. It reduces training time and reduces overfitting.    
1. We can choose important features in 3 ways:  
    1. Correlation matrix: selecting only the uncorrelated features.  
    1. Extra Trees Classifier: gives the importance of the features.
    1. RandomForestClassifier: gives the importance of the features.
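
Importance-based selection can be sketched as follows. The data come from `make_classification` rather than the heart disease file, and keeping the top six features is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 13 features, only some of which carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, n_redundant=2, random_state=43)

# Fit a forest and read off the impurity-based importances (they sum to 1)
forest = RandomForestClassifier(n_estimators=100, random_state=43).fit(X_demo, y_demo)
importances = forest.feature_importances_

# Keep the indices of the six most important features, highest first
top_k = np.argsort(importances)[::-1][:6]
X_selected = X_demo[:, top_k]
print("selected feature indices:", top_k)
```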

## Feature Selection

### Pearson Correlation Heatmap

In [None]:
sns.heatmap(dataset[dataset.columns[:14]].corr(),annot=True,cmap='RdYlGn')
fig=plt.gcf()
fig.set_size_inches(25,20)
#plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
plt.show()

The Pearson correlation plot indicates that no features are strongly correlated with one another.  
This is good from the point of view of feeding these features into the learning model, because it means there isn't much redundant or superfluous data in our training set.  
Consequently, this plot alone does not single out any features to keep or drop.  
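
The same check can be done numerically instead of by eye. The sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold; the toy columns `a`, `b`, `c` and the 0.8 cut-off are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(43)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                        # independent noise
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the threshold are candidates for dropping one of the two features
pairs = upper.stack()
strong = pairs[pairs > 0.8]
print(strong)
```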

### Extra Trees Classification

In [None]:
model = ExtraTreesClassifier(n_estimators=100,random_state=random_state)
model.fit(X, y)
pd.Series(model.feature_importances_, index=data.columns).sort_values(ascending=False)

### Random Forest Classification

In [None]:
model = RandomForestClassifier(n_estimators=100,random_state=random_state) # , max_features=(13 ** 0.5))
model.fit(X, y)
pd.Series(model.feature_importances_, index=data.columns).sort_values(ascending=False)

**Selected Features Preprocessing:**  
Data must be standardized, stratified, and imputed again before the ML models can be compared.  
Using the overlapping features from the two tests above, the new set of features will only include: _cp, thalach, oldpeak, ca, thal, exang_

In [None]:
# Dataset of selected features
dataset_select = dataset[['ca','thal','thalach','cp','oldpeak','exang','pred_attribute']]

X = dataset_select.iloc[:, :-1].values  
y = dataset_select.iloc[:, -1].values
#X = np_dataset[:, :-1]  
#y = np_dataset[:, -1]  

## Imputation

In [None]:
my_imputer = my_imputer.fit(X[:,0:6])   
X[:, 0:6] = my_imputer.transform(X[:, 0:6])

## Standardization

In [None]:
X = scaler.fit_transform(X)

## Stratification

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)

## Error from imputation

In [None]:
err_score(X_train, X_test, y_train, y_test)

## Accuracy of models with selected features

In [None]:
data_copy=[]
for model in algorithms:
    model.fit(X_train,y_train)
    pred_test = model.predict(X_test)
    data_copy.append(metrics.accuracy_score(pred_test, y_test))
    
models_df2 = pd.DataFrame(data_copy, index=names)  
models_df2.columns = ['New Accuracy']  

models_df2 = models_df2.merge(models_df, left_index=True, right_index=True, how='left')
models_df2['Increase'] = models_df2['New Accuracy'] - models_df2['Accuracy']
models_df2

# Model Engineering

## Model Selection
Selecting accurate models, tuning them, and training them is essential to robust results. Just as some features were selected based on their importance or correlation, selecting a subgroup of base models will inform a better-performing ensemble model.

### k-Fold Cross Validation

Cross validation is an essential step in model training. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores in such situations. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score becomes more reliable, but the training takes longer to finish as well. However, we shouldn't use too many folds if our training data is limited. Otherwise we would have too few samples in each fold for the per-fold scores to be statistically meaningful.

Many times the data is imbalanced, i.e., there may be many instances of one class but few of the others. A single train/test split can then give a misleading picture, so we train and test the algorithm on every instance of the dataset in turn and average the resulting accuracies.

1. k-fold cross validation works by first dividing the dataset into k subsets.
1. Let's say we divide the dataset into k=10 parts. We reserve 1 part for testing and train the algorithm on the other 9 parts.
1. We continue the process by changing the testing part in each iteration and training the algorithm on the other parts. The accuracies and errors are then averaged to get an average accuracy for the algorithm.
1. An algorithm may underfit for some training subsets and overfit for others. Averaging over folds therefore gives a better estimate of how well the model generalizes.
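
The steps above can be written out as an explicit loop. This is a minimal sketch on synthetic data with a logistic regression as the example model; in practice `cross_validate` does the same bookkeeping in one call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the heart disease features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=43)

# k=10: each sample lands in the test part exactly once
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=43)

fold_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    demo_model = LogisticRegression(max_iter=1000)
    demo_model.fit(X_demo[train_idx], y_demo[train_idx])                    # train on 9 parts
    fold_scores.append(demo_model.score(X_demo[test_idx], y_demo[test_idx]))  # test on 1 part

fold_scores = np.array(fold_scores)
print(f"mean accuracy = {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```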

In [None]:
#Testing and comparing various classifications for: Accuracy | ROC | Area under Curve | Gmean | Precision | Recall 

kfold = KFold(n_splits=10, shuffle=True, random_state=random_state) # k=10, split the data into 10 parts; random_state only takes effect when shuffle=True
CV_results = []
# iterate over classifiers
Acc= {}
Acc_Train = {}
Acc_Test = {}
Std_Train={}
Std_Test={}
Predictions = {}
ROC = {}
AUC = {}
Gmean = {}
Precision = {}
Recall = {}
F1_score = {}
Confusion_Matrix = {}
mats = pd.DataFrame(Confusion_Matrix)

for clf in classifiers:
    #Acc[clf] = cross_validate(classifiers[clf],X=X,y=y,cv=10,n_jobs=-1,scoring='accuracy',return_train_score=True)
    Acc[clf] = cross_validate(classifiers[clf],X_train,y_train,cv=kfold,n_jobs=-1,scoring='accuracy',return_train_score=True)
    Acc_Train[clf] =  Acc[clf]['train_score'].mean()
    Acc_Test[clf] = Acc[clf]['test_score'].mean()
    Std_Train[clf] = Acc[clf]['train_score'].std()
    Std_Test[clf] = Acc[clf]['test_score'].std()
    CV_results.append(Acc[clf])
    
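    # X_train and X_test are already standardized, so they are not scaled a second time
    classifiers[clf].fit(X_train,y_train)
    pred =  classifiers[clf].predict(X_test)
    ROC[clf] = roc_auc_score(y_test,pred)
    # auc() expects curve coordinates, so build the ROC curve from the hard predictions first
    fpr, tpr, _ = roc_curve(y_test,pred)
    AUC[clf] = auc(fpr,tpr)
    # Note: fowlkes_mallows_score is a clustering (pair-counting) metric, used here as a stand-in for G-mean
    Gmean[clf] = fowlkes_mallows_score(y_test,pred)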
    Precision[clf] = precision_score(y_test,pred)
    Recall[clf] = recall_score(y_test,pred)    
    F1_score[clf] = f1_score(y_test,pred)
    Confusion_Matrix[clf] = confusion_matrix(y_test,pred)

Accuracy_train = pd.DataFrame([Acc_Train[vals]*100 for vals in Acc_Train],columns=['Accuracy Train'],index=[vals for vals in Acc_Train])
Std_train = pd.DataFrame([Std_Train[vals]*100 for vals in Std_Train],columns=['S. Deviation Train'],index=[vals for vals in Std_Train])
Accuracy_pred = pd.DataFrame([Acc_Test[vals]*100 for vals in Acc_Test],columns=['Accuracy Test'],index=[vals for vals in Acc_Test])
Std_test = pd.DataFrame([Std_Test[vals]*100 for vals in Std_Test],columns=['S. Deviation Test'], index=[vals for vals in Std_Test])
ROC_Area = pd.DataFrame([ROC[vals] for vals in ROC],columns=['ROC(area)'],index=[vals for vals in ROC])
AUC_Area = pd.DataFrame([AUC[vals] for vals in AUC],columns=['AUC(area)'],index=[vals for vals in AUC])
Gm = pd.DataFrame([Gmean[vals] for vals in Gmean],columns=['Gmean'],index=[vals for vals in Gmean])
Prec = pd.DataFrame([Precision[vals] for vals in Precision],columns=['Precision'],index=[vals for vals in Precision])
Rec = pd.DataFrame([Recall[vals] for vals in Recall],columns=['Recall'],index=[vals for vals in Recall])
F1 =  pd.DataFrame([F1_score[vals] for vals in F1_score],columns=['F1_score'],index=[vals for vals in F1_score])

table = pd.concat([Accuracy_train,Std_train,Accuracy_pred,Std_test,ROC_Area,AUC_Area,Gm,Prec,Rec,F1], axis=1)
table.loc['MEAN VALUE'] = table.mean()
table

### Candidates for selection
Gaussian Naive Bayes (GNB), Multilayer Perceptron, and Logistic Regression are the top three performers. We will use these and a couple of other base models to create a stacked ensemble. Almost all of the decision-tree-based models overfit the training data.

In [None]:
#box=pd.DataFrame(Acc_Test,index=[names])
#sns.boxplot(Accuracy_pred.T)
#plt.show()

## Parameter Tuning
Selecting the best possible parameters for our top three: Gaussian Naive Bayes (GNB), Multilayer Perceptron (MP), and Logistic Regression (LR).
Improve a model's performance by tuning its parameters. A model usually has many parameters, but only a few of them are significant to its performance. For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features considered when building each tree. We need to understand how models work and what impact each parameter has on the model's performance, whether in accuracy, robustness, or speed.

### Example of Tuning
k-NN accuracies for different values of `n_neighbors`

In [None]:
k_range = list(range(1, 50))
k_scores = []      # mean cross-validated accuracy for each k
test_scores = []   # accuracy on the held-out test split for each k
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    k_scores.append(scores.mean())
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    test_scores.append(metrics.accuracy_score(y_test, pred_test))

#plt.plot(k_range, k_scores)
plt.plot(k_range, test_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Test Accuracy')
plt.show()
print('Accuracies for different values of n are:\n', np.array(test_scores))

### Tuning Linear and Radial SVC
Normally the best set of parameters is found by a process called grid search, which evaluates every combination of the candidate values supplied for each parameter and keeps the best-scoring one.
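
To make "every combination" concrete, here is a small sketch that only enumerates the candidate settings without fitting anything. The helper `expand_grid` is a hypothetical name introduced for illustration; the two sub-grids mirror the RBF and linear grids used for the SVC in the next cell.

```python
from itertools import product

def expand_grid(grid):
    """Yield every parameter combination in a dict of name -> candidate list."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Two sub-grids: gamma is only meaningful for the RBF kernel.
param_grids = [
    {'kernel': ['rbf'], 'gamma': [1e-1, 1e-2, 1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]

candidates = [params for grid in param_grids for params in expand_grid(grid)]
print(len(candidates), "candidate settings")
print(candidates[0])
```

With 10-fold cross-validation each candidate is fitted 10 times, so the cost of a grid search grows quickly as more values or parameters are added.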

In [None]:
svc = SVC(probability=True, random_state=random_state)
# Set the parameters by cross-validation
svc_pg = [{'kernel': ['rbf'], 'gamma': [1e-1, 1e-2, 1e-3, 1e-4],'C': [1, 10, 100, 1000]},
          {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    svc.fit(X_train, y_train)
    y_eval = svc.predict(X_test)
    acc = sum(y_eval == y_test) / float(len(y_test))
    print("Accuracy of SVC: %.2f%%" % (100*acc))
    
    print("Tuning hyper-parameters for %s\n" % score)
    svc_gscv = GridSearchCV(svc, svc_pg, cv=kfold, scoring='%s_macro' % score)
    svc_gscv.fit(X_train, y_train)
    print("Best parameters set found on training set:")
    print(svc_gscv.best_params_,"\n")
    print("Grid scores on training set:")
    means = svc_gscv.cv_results_['mean_test_score']
    stds = svc_gscv.cv_results_['std_test_score']
    
    for mean, std, params in zip(means, stds, svc_gscv.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print("Detailed classification report:")
    print("The model is trained on the full training set.")
    print("The scores are computed on the full test set.")
    y_true, y_pred = y_test, svc_gscv.predict(X_test)
    print(classification_report(y_true, y_pred))
    
    svc_est = svc_gscv.best_estimator_
    svc_score = svc_gscv.best_score_
    print("Best estimator for parameter C: %f\n" % (svc_est.C))
    print("Best score: %0.2f%%\n" % (100*svc_score))

    #print(clf)

### Tuning Gaussian Naive Bayes  
GNB has essentially no hyperparameters for the cross-validated grid search to test (apart from class priors and, in recent scikit-learn versions, a `var_smoothing` stability term). Instead, GNB benefits largely from the data preprocessing and feature selection performed earlier.

In [None]:
gnb = GaussianNB()
#gnb_gscv = GridSearchCV(gnb,param_grid = ,cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
#gnb.fit(X_train, y_train)

#gnb_est = gnb.best_estimator_
#gnb_score = gnb.best_score_
#print("Best estimator:", gnb_est, "\nBest Score:", gnb_score)

### Tuning Multilayer Perceptron

In [None]:
mlp = MLPClassifier(momentum=0.15,solver='sgd',learning_rate_init=1.0, early_stopping=True, shuffle=True,random_state=random_state)

mlp_pg={
'learning_rate': ["constant", "invscaling", "adaptive"],
'hidden_layer_sizes': [(10,10),(100,),(100,10)],
#'hidden_layer_sizes': [x for x in itertools.product((10,20,30,40,50,100),repeat=3)],
#'tol': [1e-2, 1e-3, 1e-4],
#'epsilon': [1e-3, 1e-7, 1e-8],
'alpha': [1e-2, 1e-3, 1e-4],
#'activation': ["logistic", "relu", "Tanh"]
}

mlp_gscv = GridSearchCV(mlp,param_grid=mlp_pg, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
mlp_gscv.fit(X_train, y_train)

mlp_est = mlp_gscv.best_estimator_
mlp_score = mlp_gscv.best_score_
print("Best estimator:", mlp_est, "\nBest Score:", mlp_score)

### Tuning Logistic Regression

In [None]:
lr = LogisticRegression(
    C=0.1,
    penalty='l2',
    solver='liblinear',  # the dual formulation is only implemented for the liblinear solver
    dual=True,
    tol=0.0001, 
    fit_intercept=True,
    intercept_scaling=1.0, 
    class_weight=None,
    random_state=random_state)

lr_pg = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

lr_gscv = GridSearchCV(lr,param_grid=lr_pg, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
lr_gscv.fit(X_train, y_train)

lr_est = lr_gscv.best_estimator_
lr_score = lr_gscv.best_score_
print("Best estimator:", lr_est, "\nBest Score:", lr_score)

### Tuning XGBoost
These XGBoost parameters are generally considered to have a real impact on performance:

1. eta: Step size used when updating weights in each boosting step, to prevent overfitting. A lower eta means slower training but usually better convergence.
1. num_round: Total number of boosting iterations.
1. subsample: The ratio of training data used in each iteration. This is to combat overfitting.
1. colsample_bytree: The ratio of features used in each iteration. This is like max_features in RandomForestClassifier.
1. gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
1. max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting will eventually overfit if depth is not limited, so setting this too high risks overfitting.
1. early_stopping_rounds: If the validation score does not improve for the given number of iterations, the algorithm stops early. This also combats overfitting.

Usual tuning steps:

1. Reserve a portion of training set as the validation set.
1. Set eta to a relatively high value (e.g. 0.05 ~ 0.1), num_round to 300 ~ 500.
1. Use grid search to find the best combination of other parameters.
1. Gradually lower eta until we reach the optimum.
1. Use the validation set as watch_list to re-train the model with the best parameters. 
1. Observe how score changes on validation set in each iteration. Find the optimal value for early_stopping_rounds.
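
The early-stopping rule in the last step can be illustrated without XGBoost. The sketch below is a simplified stand-in, not XGBoost's internal implementation: `best_round_with_early_stopping` is a hypothetical helper, the validation scores are made up, and higher scores are assumed to be better.

```python
def best_round_with_early_stopping(val_scores, patience):
    """Return (best_round, stopped_at) for a sequence of validation scores.

    Training stops once `patience` consecutive rounds pass without
    beating the best score seen so far.
    """
    best_round, best_score = 0, val_scores[0]
    for rnd, score in enumerate(val_scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    # Never ran out of patience: training used every round.
    return best_round, len(val_scores) - 1

# Made-up validation accuracies, one per boosting round.
scores = [0.70, 0.74, 0.78, 0.77, 0.79, 0.78, 0.78, 0.77, 0.76]
print(best_round_with_early_stopping(scores, patience=3))
```

A small patience stops sooner but may quit before a late improvement; a large one wastes rounds. Watching the validation curve, as described above, is how a sensible value is usually chosen.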

In [None]:
xgbc = xgb.XGBClassifier(
    #learning_rate = 0.02,
     n_estimators= 2000,
     max_depth= 4,
     min_child_weight= 2,
     gamma=0.9,                        
     subsample=0.8,
     colsample_bytree=0.8,
     objective= 'binary:logistic',
     nthread= -1,
     scale_pos_weight=1)

xgbc.fit(X_train, y_train)


X_train2, X_test2, y_train2, y_test2 = train_test_split(X_train, y_train, test_size=0.3, random_state=random_state)
train2 = xgb.DMatrix(X_train2, y_train2)
test2 = xgb.DMatrix(X_test2, y_test2)
watchlist = [(test2, 'test')]
xgb_pg = {
    'booster': 'gbtree',
    'objective': 'reg:linear',#'binary:logistic'
    'subsample': 0.8,
    'colsample_bytree': 0.85,
    'eta': 0.05,
    'max_depth': 7,
    'seed': 43,
    'silent': 0,
    'eval_metric': 'rmse' # root mean squared error
}

bst = xgb.train(xgb_pg, train2, 500, watchlist, early_stopping_rounds=50)
#bst.save_model("model_new")
#bst = xgb.Booster(xgb_pg)
#bst.load_model("model_new")

# make prediction
#pred = bst.predict(xgb.DMatrix(X_test))
prediction = xgbc.predict(X_test)
pred = bst.predict(test2,ntree_limit=bst.best_ntree_limit)
xgb.to_graphviz(bst, num_trees=2)

# Ensemble Engineering

Creating ensemble models from base models is a common way to boost accuracy. Ensemble methods create multiple models and then combine them, which usually produces more accurate results than any single model would. The models used to build the ensemble are called '**base models**'. Combining different models can reduce both the bias and the variance of the final model, increasing the score and reducing the risk of overfitting.

In theory, for the ensemble to perform well, two factors matter:  
1. Base models should be as unrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don't perform as well. The math says that the greater the diversity, the less bias in the final ensemble.  
1. Performance of base models shouldn't differ too much.
    1. Trade-off: In practice we may end up with highly related models of comparable performance. Yet we ensemble them anyway because it usually increases the overall performance.
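
A short worked calculation shows why unrelated base models help. It rests on an idealised assumption that rarely holds exactly in practice: every classifier is correct with the same probability and their errors are fully independent. The function name `majority_vote_accuracy` is introduced here for illustration only.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, give the right answer (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Accuracy of a majority vote over 1, 3 and 5 classifiers that are each 70% accurate.
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

Under these assumptions the vote over three models is right about 78% of the time and over five models about 84% of the time. When the models make the same mistakes, the gain shrinks toward zero, which is the trade-off noted above.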

## Stacking

It's much like cross-validation. Take 5-fold stacking as an example.  
1. First we split the training data into 5 folds.  
1. Next we do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. Keep each model's predictions on the testing data as well. This way, in each iteration every base model makes predictions on 1 fold of the training data and on all of the testing data.  
1. After 5 iterations we obtain a matrix of shape #(samples in training data) x #(base models). This matrix is then fed to the stacker (it's just another model) in the second level.  
1. After the stacker is fitted, use the base models' predictions on the testing data as its input to obtain the final predictions. Each base model is trained 5 times, so its 5 sets of test predictions are averaged to obtain a matrix with one column per base model.

In [None]:
class Ensemble(object):
    def __init__(self, n_folds, stacker, base_models):
        self.n_folds = n_folds
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        # model_selection.KFold takes n_splits and yields index pairs from .split()
        folds = list(KFold(n_splits=self.n_folds, shuffle=True, random_state=2016).split(X))

        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))

        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], len(folds)))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                # y_holdout = y[test_idx]
                clf.fit(X_train, y_train)
                y_pred = clf.predict(X_holdout)[:]
                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict(T)[:]

            S_test[:, i] = S_test_i.mean(1)

        self.stacker.fit(S_train, y)
        y_pred = self.stacker.predict(S_test)[:]
        return y_pred

## Voting Ensemble
Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data.

A weighted Voting Classifier will be used to combine the top three base models: Gaussian Naive Bayes, Multilayer Perceptron, and Logistic Regression. Classifiers are assigned weights according to their accuracies, so the classifier with the highest accuracy is assigned the highest weight, and so on.
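
The arithmetic behind weighted soft voting is a weighted average of each model's class probabilities. The sketch below uses made-up probabilities for a single sample and a hypothetical helper `soft_vote`; it illustrates the calculation and is not the scikit-learn implementation.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class-probability rows for one sample."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [sum(w * p[c] for p, w in zip(probas, weights)) / total
            for c in range(n_classes)]

# Made-up probabilities [P(class 0), P(class 1)] from three models for one sample.
probas = [[0.40, 0.60],   # model A, weight 3
          [0.70, 0.30],   # model B, weight 2
          [0.55, 0.45]]   # model C, weight 1
weights = [3, 2, 1]

avg = soft_vote(probas, weights)
# The predicted class is the one with the largest averaged probability.
predicted_class = max(range(len(avg)), key=avg.__getitem__)
print(avg, predicted_class)
```

In this toy case the highest-weighted model favours class 1, yet the ensemble picks class 0 because the other two models are confident enough to outweigh it. Weights tilt the vote; they do not let one model dictate it.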

In [None]:
gnb_mlp=VotingClassifier(estimators=[('Gaussian Naive Bayes', gnb),('Multilayer Perceptron',mlp)], voting='soft', weights=[2,1]).fit(X_train,y_train)
print('The accuracy for Gaussian Naive Bayes and Multilayer Perceptron:',gnb_mlp.score(X_test,y_test))

In [None]:
gnb_lr=VotingClassifier(estimators=[('Gaussian Naive Bayes', gnb),('Logistic Regression', lr)], voting='soft', weights=[2,1]).fit(X_train,y_train)
print('The accuracy for Gaussian Naive Bayes and Logistic Regression:',gnb_lr.score(X_test,y_test))

In [None]:
mlp_lr=VotingClassifier(estimators=[('Multilayer Perceptron',mlp), ('Logistic Regression', lr)], voting='soft', weights=[2,1]).fit(X_train,y_train)
print('The accuracy for Multilayer Perceptron and Logistic Regression:',mlp_lr.score(X_test,y_test))

In [None]:
gnb_mlp_lr=VotingClassifier(estimators=[('Gaussian Naive Bayes', gnb),('Multilayer Perceptron',mlp), ('Logistic Regression', lr)], voting='soft', weights=[3,2,1]).fit(X_train,y_train)
print('The ensembled model with Gaussian Naive Bayes, Multilayer Perceptron, and Logistic Regression:',gnb_mlp_lr.score(X_test,y_test))

# Conclusion  

Given a raw dataset, there are many ways to refine it, analyze it, and make use of it; doing so was the goal of this notebook. I have demonstrated nearly all the key steps needed to take a dataset and turn it into a tool that could potentially make the world a better place.

## Review

A quick outline of what was done in this notebook:

1. Introduction: Premise for this work and import of hundreds of machine learning algorithms, analysis tools, and environment tools
1. Data Exploration: Discuss the heart data to be used and inspect it visually
1. Data Preprocessing: Clean up the messy parts of the dataset and prepare it to be used by the ML models
1. Feature Engineering: Compare importance of features and make a new dataset with only the most important ones
1. Model Engineering: Compare the accuracy of models and tune only the most accurate ones
1. Ensemble Generation: Combine the now tuned models into one model in hopes of boosting the accuracy of the predictions
1. Conclusion: Where we are now: discussing the results

All of these steps contribute to a more useful, accurate model, with a final accuracy of 89.01%.

## Ensemble models

In [None]:
# Generate a simple plot of the test and training learning curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)): 
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",label="Cross-validation score")
    plt.legend(loc="best")
    return plt

#gnb_plot = plot_learning_curve(gnb_est,"GNB learning curves",X_train,Y_train,cv=kfold)
mlp_plot = plot_learning_curve(mlp_est,"MLP learning curves",X_train,y_train,cv=kfold)
lr_plot = plot_learning_curve(lr_est,"LR learning curves",X_train,y_train,cv=kfold)
svc_plot = plot_learning_curve(svc_est,"SVC learning curves",X_train,y_train,cv=kfold)
#xgb_plot = plot_learning_curve(xgb_gscv.best_estimator_,"XBG learning curves",X_train,Y_train,cv=kfold)

### Heatmap of Ensemble

In [None]:
#test_survived_gnb = pd.Series(gnb_est.predict(test), name="GNBC")
test_Survived_mlp = pd.Series(mlp_est.predict(X_test), name="MLP")
test_Survived_lr = pd.Series(lr_est.predict(X_test), name="LR")
test_Survived_svc = pd.Series(svc_est.predict(X_test), name="SVC")

# Concatenate all classifier results
ensemble_results = pd.concat([test_Survived_mlp,test_Survived_lr,test_Survived_svc],axis=1)
g= sns.heatmap(ensemble_results.corr(),annot=True)

### Voting Classifier of all tuned models for submission

In [None]:
finalVC = VotingClassifier(estimators=[('Gaussian Naive Bayes', gnb),('Multilayer Perceptron',mlp), ('Logistic Regression', lr),('SVM',svc)], voting='soft', weights=[4,3,2,1], n_jobs = -1).fit(X_train,y_train)
print('The ensembled model with all classifiers: ',finalVC.score(X_test,y_test))

# Generate Submission File 
test_healthy = pd.Series(finalVC.predict(X_test), name="Healthy")
#results = pd.concat([X_test['index'],test_healthy],axis=1)
#results.to_csv("Results_series_NSR.csv", index=False)
#submission1 = pd.DataFrame({'Healthy Probabilty': pred })
#submission1.to_csv("Results_df1_NSR.csv", index=True)
#print(submission1)



In [None]:
# Use the voting ensemble's predictions (test_healthy) rather than the standalone XGBoost output
submission = pd.DataFrame({'Healthy': test_healthy})
submission.to_csv("Results_df_NSR.csv", index=True)
print(submission)