Alzheimer's disease is a type of dementia that affects a person's memory, thinking and behavior. It begins mildly and progressively damages parts of the brain, causing difficulty remembering newly learned information, frequent mood changes, and confusion about events, times and places.
 
Alzheimer's usually starts after age 60, and the risk increases with age. The risk is also greater for people with a family history of the disease.
 
No existing treatment can stop the progression of this disease. So far, treatments can only help alleviate some symptoms, reducing their intensity and contributing to a higher quality of life for patients and their families.

<img src="https://gx0ri2vwi9eyht1e3iyzyc17-wpengine.netdna-ssl.com/wp-content/uploads/2017/01/dementia2-804x369.jpg" alt="Dementia" border="0">

## Objective

Implement classification algorithms for the analysis of the medical dataset, in order to provide a prediction tool for the early diagnosis of the disease.


# Table of Contents

* **1. [Declaration of functions](#ch1)**
* **2. [Analysis of data](#ch2)**
     * 2.1 [Read dataset](#ch3)
     * 2.2 [Correlation Analysis](#ch4)
     * 2.3 [Correlation matrix](#ch5)
     * 2.4 [Dispersion matrix](#ch6)
     * 2.5 [Graphs of all these correlations](#ch7)
     * 2.6 [Miscellaneous Graphics](#ch8)
* **3. [Preprocessing](#ch9)**
     * 3.1 [Remove Useless Columns](#ch10)
     * 3.2 [LabelEncoder](#ch11)
     * 3.3 [Imputation of lost values](#ch12)
     * 3.4 [Standardization](#ch13)
     * 3.5 [Export them to then select the features](#ch14)
* **4. [Modeling](#ch15)**
     * 4.1 [Tuning Hyperparameters for better models](#ch16)
     * 4.2 [Generating our models](#ch17)
     * 4.3 [Cross Validation](#ch18)
* **5. [Importance of characteristics](#ch19)**
* **6. [Predictions](#ch20)**
* **7. [Performance Metric for each model](#ch21)**
     * 7.1 [Report](#ch22)
     * 7.2 [Results](#ch23)

## Required libraries

In [None]:
import pandas as pd
from scipy.io import arff
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sklearn import preprocessing
import numpy as np

from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import mean_squared_error

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls


FOLDS =10
%matplotlib inline

<a id="ch1"></a>
# 1. Declaration of functions

## Graphing functions 

In [None]:
# Function to graph number of people by age
def cont_age(field):
    plt.figure()
    if field == "Age":
        # Keep only rows with a valid (positive) age
        df_query_mri = df[df["Age"] > 0]
        g = sns.countplot(x=df_query_mri["Age"])
    else:
        g = sns.countplot(x=df[field])
    g.figure.set_size_inches(18.5, 10.5)
    # Remove the top and right spines of the plot just drawn
    sns.despine()

In [None]:
# Function to graph number of people per state [Demented, Nondemented]
# Note: the ">= 0" filter expects 'Group' to be numerically encoded already (see section 3.2)
def cont_Dementes(field):
    plt.figure()
    if field == "Group":
        df_query_mri = df[df["Group"] >= 0]
        g = sns.countplot(x=df_query_mri["Group"])
    else:
        g = sns.countplot(x=df[field])
    g.figure.set_size_inches(18.5, 10.5)
    # Remove the top and right spines of the plot just drawn
    sns.despine()

In [None]:
# 0 = F and 1 = M
def bar_chart(feature):
    Demented = df[df['Group']==1][feature].value_counts()
    Nondemented = df[df['Group']==0][feature].value_counts()
    df_bar = pd.DataFrame([Demented,Nondemented])
    df_bar.index = ['Demented','Nondemented']
    df_bar.plot(kind='bar',stacked=True, figsize=(8,5))

In [None]:
def report_performance(model):

    model_test = model.predict(X_test)

    print("Confusion Matrix")
    print("{0}".format(metrics.confusion_matrix(y_test, model_test)))
    print("")
    print("Classification Report")
    print(metrics.classification_report(y_test, model_test))

<a id="ch2"></a>
# 2. Analysis of data

<a id="ch3"></a>
## 2.1 Read dataset

In [None]:
data = '../input/oasis_longitudinal.csv'
df = pd.read_csv (data)
df.head()

In [None]:
df.describe()

In [None]:
nu = pd.DataFrame(df['Group']=='Nondemented')
nu["Group"].value_counts() 

<a id="ch4"></a>
## 2.2 Correlation Analysis

In [None]:
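f, ax = plt.subplots(figsize=(10, 8))
# Correlate numeric columns only; string columns such as 'Group' cannot be correlated
corr = df.select_dtypes(include='number').corr(method = 'pearson')
# np.bool was removed from NumPy; the built-in bool is equivalent here
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)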

<a id="ch5"></a>
## 2.3 Correlation matrix

In [None]:
# Correlate numeric columns only
df.select_dtypes(include='number').corr(method = 'pearson')

<a id="ch6"></a>
## 2.4 Dispersion matrix

In [None]:
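# scatter_matrix lives in pandas.plotting; the top-level pd.scatter_matrix alias was removed
pd.plotting.scatter_matrix(df, alpha = 0.3, figsize = (14,8), diagonal = 'kde');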

<a id="ch7"></a>
## 2.5 Graphs of all these correlations

In [None]:
g = sns.PairGrid(df, vars=['Visit','MR Delay','M/F', 'Age', 'EDUC', 'SES', 'MMSE', 'eTIV', 'nWBV', 'ASF'],
                 hue='Group', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();

<a id="ch8"></a>
## 2.6 Miscellaneous Graphics

**Number of Demented, Nondemented and Converted depending on the sex of the patient**

In [None]:
# catplot replaces the deprecated factorplot; seaborn is already imported as sns
sns.catplot(x='M/F', data=df, hue='Group', kind="count")

**Variation of the dementia according to the MMSE depending on the scores of each patient**

In [None]:
facet= sns.FacetGrid(df,hue="Group", aspect=3)
# fill= replaces the deprecated shade= argument of kdeplot
facet.map(sns.kdeplot,'MMSE',fill= True)
facet.set(xlim=(0, df['MMSE'].max()))
facet.add_legend()
plt.xlim(12.5)

**Number of patients of each age**

In [None]:
cont_age("Age")

 <a id="ch9"></a>
# 3. Preprocessing

**Replace 'Converted' with 'Demented'**

In [None]:
df['Group'] = df['Group'].replace(['Converted'], ['Demented'])
df.head(3)

 <a id="ch10"></a>
## 3.1 Remove Useless Columns

In [None]:
df.drop(['Subject ID'], axis = 1, inplace = True, errors = 'ignore')
df.drop(['MRI ID'], axis = 1, inplace = True, errors = 'ignore')
df.drop(['Visit'], axis = 1, inplace = True, errors = 'ignore')
# For this study we also drop the CDR column
df.drop(['CDR'], axis = 1, inplace = True, errors = 'ignore')
df.head(3)

 <a id="ch11"></a>
## 3.2 LabelEncoder

**We are going to binary-encode our binary attributes, which are sex and our class.**

In [None]:
# 1 = Demented, 0 = Nondemented
df['Group'] = df['Group'].replace(['Demented', 'Nondemented'], [1,0])    
df.head(3)

In [None]:
# 1= M, 0 = F
df['M/F'] = df['M/F'].replace(['M', 'F'], [1,0])  
df.head(3)

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(df.Hand.values)
list(encoder.classes_)
# Transform the labels into integer codes
df['Hand'] = encoder.transform(df.Hand.values)
# Check which classes remain after encoding
encoder2 = LabelEncoder()
encoder2.fit(df.Hand.values)
list(encoder2.classes_)
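As a small illustration of the two encoding approaches used in this section, the sketch below applies them to a hypothetical three-row frame (the `toy` data and the `hand_encoder` name are assumptions made for the example, not part of the dataset). An explicit mapping lets us choose which class becomes 1, while `LabelEncoder` assigns integer codes in the sorted order of the labels it sees.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame that mimics the three categorical columns
toy = pd.DataFrame({
    "Group": ["Demented", "Nondemented", "Demented"],
    "M/F": ["M", "F", "F"],
    "Hand": ["R", "R", "R"],
})

# Explicit mapping: we decide which label becomes 1 and which becomes 0
toy["Group"] = toy["Group"].map({"Demented": 1, "Nondemented": 0})
toy["M/F"] = toy["M/F"].map({"M": 1, "F": 0})

# LabelEncoder: integer codes follow the sorted order of the observed labels
hand_encoder = LabelEncoder()
toy["Hand"] = hand_encoder.fit_transform(toy["Hand"])

print(toy)
print(list(hand_encoder.classes_))
```

A column with a single class, like `Hand` in this toy frame, is encoded as a constant and carries no information for a classifier.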

 <a id="ch12"></a>
## 3.3 Imputation of lost values

For various reasons, many real-world data sets contain missing values, often encoded as blanks, NaNs, or other placeholders. Such data sets are incompatible with scikit-learn estimators, which assume that all values in an array are numeric and that all of them carry meaning. A basic strategy for using incomplete data sets is to discard the rows and/or columns that contain missing values. However, this comes at the price of losing data that may be valuable (even though incomplete). A better strategy is to impute the missing values, that is, to infer them from the known part of the data.

The SimpleImputer class provides basic strategies for imputing missing values, using either the mean, the median or the most frequent value of the column in which the missing values are found. This class also allows different encodings of missing values.
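To make the imputation strategies concrete, here is a minimal sketch on a hypothetical five-row frame (the `toy` values and the imputer variable names are assumptions chosen for illustration). It fills one column with its most frequent value and the other with its median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; NaN marks the missing entries
toy = pd.DataFrame({
    "SES": [2.0, 2.0, np.nan, 3.0, 1.0],
    "MMSE": [30.0, np.nan, 27.0, 20.0, 29.0],
})

# Most frequent value for the ordinal column, median for the score column
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")

toy[["SES"]] = mode_imputer.fit_transform(toy[["SES"]])
toy[["MMSE"]] = median_imputer.fit_transform(toy[["MMSE"]])

print(toy)
```

In this toy frame the missing SES becomes 2.0 (the mode) and the missing MMSE becomes 28.0 (the median of 20, 27, 29 and 30).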

**Lost data**

In [None]:
data_na = (df.isnull().sum() / len(df)) * 100
data_na = data_na.drop(data_na[data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Lost proportion (%)' :round(data_na,2)})
missing_data.head(20)

In [None]:
from sklearn.impute import SimpleImputer
# Impute SES with its most frequent value
imputer = SimpleImputer(missing_values = np.nan, strategy='most_frequent')

# fit_transform learns the fill value and applies it in one step
df[['SES']] = imputer.fit_transform(df[['SES']])

In [None]:
# Impute MMSE with its median (SimpleImputer is already imported above)
imputer = SimpleImputer(missing_values = np.nan, strategy='median')

# fit_transform learns the fill value and applies it in one step
df[['MMSE']] = imputer.fit_transform(df[['MMSE']])

 <a id="ch13"></a>
## 3.4 Standardization

In [None]:
from sklearn.preprocessing import StandardScaler
# Work on a copy so that the unscaled df is preserved
df_norm = df.copy()
scaler = StandardScaler()
# Every feature column is scaled; the 'Group' target is left untouched
cols_to_scale = ['Age','MR Delay','M/F','Hand','EDUC','SES','MMSE','eTIV','nWBV','ASF']
df_norm[cols_to_scale] = scaler.fit_transform(df_norm[cols_to_scale])

In [None]:
df_norm.head(3)
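As a quick check of what standardization does, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on a hypothetical 3x2 array (the `X_toy` values are assumptions for illustration). Each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: one column of ages, one of years of education
X_toy = np.array([[60.0, 12.0],
                  [70.0, 16.0],
                  [80.0, 20.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Same computation by hand, per column, with the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, manual))
```

Because the scaler stores the column means and standard deviations it learned, the same fitted object can later be applied to new data with `scaler.transform`.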

 <a id="ch14"></a>
## 3.5 Export them to then select the features

df_norm.to_csv('DatasetSelectionAttributes.csv', sep=',',index=False)

For the selection of attributes we use the Boruta package for R.

**Commands (R) :**

library(readr)

library(Boruta)

covertype <- read_csv('DatasetSelectionAttributes.csv')

set.seed(111)

boruta.trainer <- Boruta(Group~., data = covertype , doTrace = 2, maxRuns=500)

print(boruta.trainer)

plot(boruta.trainer, las = 2)


## Result:

<a href="https://ibb.co/QMGP76c"><img src="https://i.ibb.co/cQd6KNv/AzBoruta.png" alt="AzBoruta" border="0"></a>

## Remove the columns rejected by Boruta

In [None]:
df_norm.drop(['Hand'], axis = 1, inplace = True, errors = 'ignore')
df_norm.drop(['MR Delay'], axis = 1, inplace = True, errors = 'ignore')

In [None]:
df_norm.head()

 <a id="ch15"></a>
# 4. Modeling

In [None]:
X = df_norm.drop(["Group"],axis=1)
y = df_norm["Group"].values
X.head(3)

In [None]:
# We divide our data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)

In [None]:
print("{0:0.2f}% Train".format((len(X_train)/len(df_norm.index)) * 100))
print("{0:0.2f}% Test".format((len(X_test)/len(df_norm.index)) * 100))

In [None]:
print("Original Demented : {0} ({1:0.2f}%)".format(len(df_norm.loc[df_norm['Group'] == 1]), 100 * (len(df_norm.loc[df_norm['Group'] == 1]) / len(df_norm))))
print("Original Nondemented : {0} ({1:0.2f}%)".format(len(df_norm.loc[df_norm['Group'] == 0]), 100 * (len(df_norm.loc[df_norm['Group'] == 0]) / len(df_norm))))
print("")
print("Training Demented : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), 100 * (len(y_train[y_train[:] == 1]) / len(y_train))))
print("Training Nondemented : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), 100 * (len(y_train[y_train[:] == 0]) / len(y_train))))
print("")
print("Test Demented : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), 100 * (len(y_test[y_test[:] == 1]) / len(y_test))))
print("Test Nondemented : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), 100 * (len(y_test[y_test[:] == 0]) / len(y_test))))

 <a id="ch16"></a>
## 4.1 Tuning Hyperparameters for better models

Before fitting our models, we search for the hyperparameters that give the highest cross-validated AUC.
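The searches in this section all follow the same pattern, which can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `X_demo`, `y_demo`, the reduced parameter ranges, `n_iter=5` and `cv=3` are chosen to keep the example fast and are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values; the search samples n_iter random combinations from them
param_distributions = {
    "n_estimators": range(10, 50),
    "max_depth": range(1, 10),
    "min_samples_split": range(2, 20),
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations
    cv=3,                # folds used to score each combination
    scoring="roc_auc",   # combinations are ranked by mean cross-validated AUC
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

`best_params_` holds the sampled combination with the highest mean AUC across folds, and `best_score_` is that mean AUC.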

**1°  Random Forest**

In [None]:
# Number of trees in random forest
n_estimators = range(10,250)
# Number of features to consider at every split
# ('auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn versions)
max_features = ['sqrt']
# Maximum number of levels in tree
max_depth = range(1,40)
# Minimum number of samples required to split a node
min_samples_split = range(3,60)

In [None]:
# Create the random grid
parametro_rf = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}

In [None]:
model_forest = RandomForestClassifier(n_jobs=-1)
forest_random = RandomizedSearchCV(estimator = model_forest, param_distributions = parametro_rf, n_iter = 100, cv = FOLDS, 
                               verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
forest_random.fit(X_train, y_train)

In [None]:
forest_random.best_params_

**2° Extra Trees**

In [None]:
# Number of trees in the extra-trees ensemble
n_estimators = range(50,280)
# Maximum number of levels in tree
max_depth =  range(1,40)
# Minimum number of samples required at a leaf node
min_samples_leaf = [3,4,5,6,7,8,9,10,15,20,30,40,50,60]

In [None]:
# Create the random grid
parametro_Et = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_leaf': min_samples_leaf}

In [None]:
model_et = ExtraTreesClassifier(n_jobs=-1)
# Search over the Extra Trees grid defined above (the original passed parametro_rf here)
et_random = RandomizedSearchCV(estimator = model_et, param_distributions = parametro_Et, n_iter = 100, cv = FOLDS, 
                               verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
et_random.fit(X_train, y_train)

In [None]:
et_random.best_params_

**3° AdaBoost**

In [None]:
n_estimators = range(10,200)

learning_rate = [0.0001, 0.001, 0.01, 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,1]

In [None]:
# Create the random grid
parametros_ada = {'n_estimators': n_estimators,
               'learning_rate': learning_rate}

In [None]:
model_ada = AdaBoostClassifier()

ada_random = RandomizedSearchCV(estimator = model_ada, param_distributions = parametros_ada, n_iter = 100, cv = FOLDS, 
                               verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
ada_random.fit(X_train, y_train)

In [None]:
ada_random.best_params_

**4° Gradient Boosting**

In [None]:
# 'loss' is left at its default (the binomial deviance / log loss), whose name changed across
# scikit-learn versions; the 'mae' criterion is no longer accepted, so only 'friedman_mse' remains
parametros_gb = {
    "learning_rate": [0.01, 0.025, 0.005,0.5, 0.075, 0.1, 0.15, 0.2,0.3,0.8,0.9],
    "min_samples_split": [0.01, 0.025, 0.005,0.4,0.5, 0.075, 0.1, 0.15, 0.2,0.3,0.8,0.9],
    "min_samples_leaf": [1,2,3,5,8,10,15,20,40,50,55,60,65,70,80,85,90,100],
    "max_depth":[3,5,8,10,15,20,25,30,40,50],
    "max_features":["log2","sqrt"],
    "criterion": ["friedman_mse"],
    "subsample":[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
    "n_estimators":range(1,100)
    }

In [None]:
model_gb= GradientBoostingClassifier()


gb_random = RandomizedSearchCV(estimator = model_gb, param_distributions = parametros_gb, n_iter = 100, cv = FOLDS, 
                               verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
gb_random.fit(X_train, y_train)

In [None]:
gb_random.best_params_

**5° Support Vector**

In [None]:
C = [0.001, 0.10, 0.1, 10, 25, 50,65,70,80,90, 100, 1000]

kernel =  ['linear', 'poly', 'rbf', 'sigmoid']
    
gamma =[1e-2, 1e-3, 1e-4, 1e-5,1e-6,1]

In [None]:
# Create the random grid
parametros_svm = {'C': C,
            'gamma': gamma,
             'kernel': kernel}

In [None]:
model_svm = SVC()
from sklearn.model_selection import GridSearchCV
svm_random = GridSearchCV(model_svm, parametros_svm,  cv = FOLDS, 
                               verbose=2, n_jobs = -1, scoring='roc_auc')
svm_random.fit(X_train.values, y_train)

In [None]:
svm_random.best_params_

**6° XGBoost**

In [None]:
param_xgb = {
        'silent': [False],
        'max_depth': [6, 10, 15, 20],
        'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
        'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
        'gamma': [0, 0.25, 0.5, 1.0],
        'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
        'n_estimators': [50,100,120]}

In [None]:
from sklearn.model_selection import GridSearchCV

model_xgb = xgb.XGBClassifier()
xgb_random = RandomizedSearchCV(estimator = model_xgb, param_distributions = param_xgb, n_iter = 100, cv = FOLDS, 
                               verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
xgb_random.fit(X_train.values, y_train)

In [None]:
xgb_random.best_params_

# Selected Parameters

After running RandomizedSearchCV several times, we found the most acceptable parameters for each of our models.
We will save these parameters to then make the adjustment of our models.

In [None]:
parametro_rf = {'n_estimators': 133,
 'min_samples_split': 3,
 'max_features': 'sqrt',  # 'auto' was an alias of 'sqrt' for classifiers
 'max_depth': 39}

parametro_et = {'n_estimators': 46,
 'min_samples_split': 3,
 'max_features': 'sqrt',
 'max_depth': 20}

parametro_ada = {'n_estimators': 40, 'learning_rate': 0.9}  

parametro_gb = {'subsample': 0.95,
 'n_estimators': 96,
 'min_samples_split': 0.15,
 'min_samples_leaf': 5,
 'max_features': 'log2',
 'max_depth': 50,
 'loss': 'deviance',
 'learning_rate': 0.15,
 'criterion': 'friedman_mse'}

parametro_svm = {'C': 25, 'gamma': 1, 'kernel': 'rbf'}

parametro_xgb= {'subsample': 0.6,
 'silent': False,
 'reg_lambda': 1.0,
 'n_estimators': 120,
 'min_child_weight': 0.5,
 'max_depth': 15,
 'learning_rate': 0.2,
 'gamma': 0.5,
 'colsample_bytree': 0.4,
 'colsample_bylevel': 1.0}


<a id="ch17"></a>
## 4.2 Generating our models

So now let's prepare six learning models for our classification task. The first five can be invoked conveniently through the Sklearn library, and the sixth comes from the xgboost package:

1. Random Forest classifier
2. AdaBoost classifier
3. Gradient Boosting classifier
4. Support vector machine
5. Extra Trees classifier
6. XGBoost classifier


In [None]:
# Base models with hyperparameters already tuned
# ('sqrt' replaces 'auto', which was an alias of 'sqrt' for classifiers)
model_rf =  RandomForestClassifier(n_estimators=133,min_samples_split=3,max_features='sqrt',max_depth= 39)
# n_estimators follows parametro_et above (the original call reused 133 from the random forest)
model_et = ExtraTreesClassifier(n_estimators=46,min_samples_split=3,max_features='sqrt',max_depth= 20)
model_ada = AdaBoostClassifier(n_estimators=40,learning_rate=0.9)
# 'loss' is left at its default, which is the deviance / log loss selected above
model_gb = GradientBoostingClassifier(subsample = 0.95,n_estimators= 96,
                 min_samples_split = 0.15,
                 min_samples_leaf = 5,
                 max_features = 'log2',
                 max_depth = 50,
                 learning_rate = 0.15,
                 criterion= 'friedman_mse')
model_svc = SVC(C = 25, gamma= 1, kernel ='rbf')
# 'subsample' was misspelled as 'psubsample' in the original call
model_xgb = xgb.XGBClassifier(subsample= 0.6,
 silent= False,
 reg_lambda =1.0,
 n_estimators= 120,
 min_child_weight= 0.5,
 max_depth = 15,
 learning_rate= 0.2,
 gamma= 0.5,
 colsample_bytree=0.4,
 colsample_bylevel= 1.0)

<a id="ch18"></a>
## 4.3 Cross Validation

In [None]:
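kf = KFold(n_splits=FOLDS, random_state = 0, shuffle = True)
# Every iteration refits each model from scratch on the training part of the fold,
# so after the loop the models hold the fit from the last fold only.
# Xval / yval are the held-out part of each fold and are not scored here.
for i, (train_index, val_index) in enumerate(kf.split(X_train, y_train)):
    Xtrain, Xval = X_train.values[train_index], X_train.values[val_index]
    ytrain, yval = y_train[train_index], y_train[val_index]
    
    model_rf.fit(Xtrain, ytrain)
    model_et.fit(Xtrain, ytrain)
    model_ada.fit(Xtrain, ytrain)
    model_gb.fit(Xtrain, ytrain)
    model_svc.fit(Xtrain, ytrain)
    model_xgb.fit(Xtrain, ytrain)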
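For comparison, a K-fold loop that also scores each held-out fold could look like the sketch below. It is an illustration on synthetic data, not the procedure used in this notebook: `X_demo`, `y_demo`, `base_model`, `fold_scores` and `final_model` are hypothetical names, and a single random forest stands in for the six models.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
base_model = RandomForestClassifier(n_estimators=50, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_index, val_index in kf.split(X_demo):
    # Fresh, unfitted copy for every fold so that no fit leaks between folds
    fold_model = clone(base_model)
    fold_model.fit(X_demo[train_index], y_demo[train_index])
    # Score on the part of the data the model has not seen
    predictions = fold_model.predict(X_demo[val_index])
    fold_scores.append(accuracy_score(y_demo[val_index], predictions))

print(np.mean(fold_scores), np.std(fold_scores))

# Once validation looks acceptable, refit on all the training data
final_model = clone(base_model).fit(X_demo, y_demo)
```

Scoring every fold gives an estimate of the spread of the performance, and refitting at the end lets the final model use all the training rows instead of only those of the last fold.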
    

<a id="ch19"></a>
# 5. Importance of characteristics 

According to the Sklearn documentation, tree-based ensemble classifiers expose an attribute that returns the importance of each feature: `.feature_importances_`. We therefore read this attribute from each fitted model and plot the importance of every feature. The support vector machine with an RBF kernel does not provide it, so it is left out.

In [None]:
rf_feature = model_rf.feature_importances_
ada_feature = model_ada.feature_importances_
gb_feature = model_gb.feature_importances_
et_feature = model_et.feature_importances_
xbg_feature = model_xgb.feature_importances_

In [None]:
cols = X.columns.tolist()
# Create a dataframe with features
feature_dataframe = pd.DataFrame( {'features': cols,
     'Random Forest feature importances': rf_feature,
      'AdaBoost feature importances': ada_feature,
    'Gradient Boost feature importances': gb_feature,
    'Extra Trees  feature importances': et_feature,
    'Xgboost feature importances': xbg_feature,
    })

In [None]:
xbg_feature
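To show what `feature_importances_` contains, here is a minimal sketch on synthetic data (the `X_demo`, `y_demo`, `forest` and `feature_names` names are assumptions for the example). For a scikit-learn forest the attribute is an array with one non-negative value per feature, normalized to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the two informative features come first
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical column names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_

# Print the features from most to least important
for index in np.argsort(importances)[::-1]:
    print(feature_names[index], round(float(importances[index]), 3))
```

Because the values are relative shares, importances from different models can be placed side by side in one table, as done with `feature_dataframe`.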

## Graphics:

In [None]:
# Scatter plot of the feature importances of one model
def plot_feature_importance(column, title):
    trace = go.Scatter(
        y = feature_dataframe[column].values,
        x = feature_dataframe['features'].values,
        mode='markers',
        marker=dict(
            sizemode = 'diameter',
            sizeref = 1,
            size = 25,
            color = feature_dataframe[column].values,
            colorscale='Portland',
            showscale=True
        ),
        text = feature_dataframe['features'].values
    )
    layout= go.Layout(
        autosize= True,
        title= title,
        hovermode= 'closest',
        yaxis=dict(
            title= 'Feature Importance',
            ticklen= 5,
            gridwidth= 2
        ),
        showlegend= False
    )
    fig = go.Figure(data=[trace], layout=layout)
    py.iplot(fig,filename='scatter2010')

# One plot per model; the column names match those of feature_dataframe
for column, title in [
    ('Random Forest feature importances', 'Random Forest Feature Importance'),
    ('Extra Trees  feature importances', 'Extra Trees Feature Importance'),
    ('AdaBoost feature importances', 'AdaBoost Feature Importance'),
    ('Gradient Boost feature importances', 'Gradient Boosting Feature Importance'),
    ('Xgboost feature importances', 'Xgboost Feature Importance'),
]:
    plot_feature_importance(column, title)


In [None]:
# Create the new column that contains the average of the values.
# numeric_only=True skips the string 'features' column when averaging
feature_dataframe['mean'] = feature_dataframe.mean(axis= 1, numeric_only=True) # axis = 1 computes the mean row-wise
feature_dataframe.head(3)

In [None]:
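# Dedicated names avoid overwriting the target vector y defined in section 4
mean_importance = feature_dataframe['mean'].values
feature_names = feature_dataframe['features'].values
data = [go.Bar(
            x= feature_names,
             y= mean_importance,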
            width = 0.5,
            marker=dict(
               color = feature_dataframe['mean'].values,
            colorscale='Portland',
            showscale=True,
            reversescale = False
            ),
            opacity=0.6
        )]

layout= go.Layout(
    autosize= True,
    title= 'Barplots of Mean Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')

<a id="ch20"></a>
# 6. Predictions

In [None]:
# The models were fitted on NumPy arrays, so we predict on X_test.values for consistency
Predicted_rf= model_rf.predict(X_test.values)
Predicted_ada = model_ada.predict(X_test.values)
Predicted_gb = model_gb.predict(X_test.values)
Predicted_et = model_et.predict(X_test.values)
Predicted_svm= model_svc.predict(X_test.values)
Predicted_xgb= model_xgb.predict(X_test.values)

In [None]:
base_predictions_train = pd.DataFrame( {'RandomForest': Predicted_rf.ravel(),
      'AdaBoost': Predicted_ada.ravel(),
      'GradientBoost': Predicted_gb.ravel(),
      'ExtraTrees': Predicted_et.ravel(),
      'SVM': Predicted_svm.ravel(),
      'XGB': Predicted_xgb.ravel(),
     'Real value': y_test                                
                                        
    })
base_predictions_train.head(10)

<a id="ch21"></a>
# 7. Performance Metric for each model

In [None]:
acc = [] # list to store all performance metric

In [None]:
# (name, estimator, test predictions) for every model
evaluated_models = [
    ('Random Forest', model_rf, Predicted_rf),
    ('AdaBoost', model_ada, Predicted_ada),
    ('Gradient Boosting', model_gb, Predicted_gb),
    ('ExtraTrees', model_et, Predicted_et),
    ('SVM', model_svc, Predicted_svm),
    ('Xgboost', model_xgb, Predicted_xgb),
]

for model, estimator, predicted in evaluated_models:
    # Mean cross-validated accuracy on the training split
    test_score = cross_val_score(estimator, X_train, y_train, cv=FOLDS, scoring='accuracy').mean()
    # Recall and AUC on the test split, computed from the predicted class labels
    test_recall = recall_score(y_test, predicted, pos_label=1)
    fpr, tpr, thresholds = roc_curve(y_test, predicted, pos_label=1)
    test_auc = auc(fpr, tpr)
    acc.append([model, test_score, test_recall, test_auc, fpr, tpr, thresholds])
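The AUC values above are computed from predicted class labels, which yields a ROC curve with a single operating point. The sketch below, on synthetic data, contrasts that with an AUC computed from predicted probabilities; it is an illustration under assumptions (`X_demo`, `y_demo`, `clf` and the other names are hypothetical), not a change to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data and a single classifier standing in for the tuned models
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: the curve has at most three points
fpr_labels, tpr_labels, _ = roc_curve(y_te, clf.predict(X_te), pos_label=1)
auc_labels = auc(fpr_labels, tpr_labels)

# AUC from the probability of the positive class: one point per threshold
scores = clf.predict_proba(X_te)[:, 1]
fpr_scores, tpr_scores, _ = roc_curve(y_te, scores, pos_label=1)
auc_scores = auc(fpr_scores, tpr_scores)

print(auc_labels, auc_scores)
```

The probability-based value matches `roc_auc_score(y_te, scores)` and reflects how well the model ranks patients across all thresholds, whereas the label-based value depends on the single default cut-off.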


<a id="ch22"></a>
## 7.1 Report 

Confusion matrix and classification report for the Extra Trees model:


In [None]:
report_performance(model_et)

<a id="ch23"></a>
## 7.2 Results

In [None]:
result = pd.DataFrame(acc, columns=['Model', 'Accuracy', 'Recall', 'AUC', 'FPR', 'TPR', 'TH'])
result[['Model', 'Accuracy', 'Recall', 'AUC']]