# AdaBoost - Diabetes dataset

- The dataset is obtained from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
- You can find all the machine learning practice notebooks at [my GitHub page](https://github.com/elakiricoder)

### Imports

In [113]:
import pandas as pd
import numpy as np

### Load the Data

In [114]:
df = pd.read_csv("diabetes.csv")

In [115]:
df.head()

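|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |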


In [116]:
df.shape

(768, 9)

 - With only 768 rows, the dataset is relatively small, so we keep the test fraction modest to leave enough rows for training.

### Split the Data

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

In [119]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
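
To make the effect of `test_size=0.20` concrete, here is a minimal sketch on a synthetic stand-in with the same number of rows (768). The arrays `X_demo` and `y_demo` are placeholders, not the real data, and `stratify` is an optional argument that the split above does not use; it is shown only as one way to keep the class balance similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the dataset (768 rows, 8 features).
# The values are random placeholders, not the diabetes data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

# Same split settings as above, plus optional stratification on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```

A 20% test fraction of 768 rows leaves 614 rows for training and 154 for testing, which matches the support of 154 reported in the evaluation.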

### Create the model and Predict

- Try AdaBoost with default parameters

In [120]:
from sklearn.ensemble import AdaBoostClassifier

In [121]:
abc = AdaBoostClassifier()

In [122]:
pred_abc = abc.fit(X_train, y_train).predict(X_test)

### Evaluate the model

In [123]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [124]:
print(confusion_matrix(y_test,pred_abc))

[[78 21]
 [20 35]]


In [125]:
print(classification_report(y_test,pred_abc))

              precision    recall  f1-score   support

           0       0.80      0.79      0.79        99
           1       0.62      0.64      0.63        55

    accuracy                           0.73       154
   macro avg       0.71      0.71      0.71       154
weighted avg       0.73      0.73      0.73       154
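
The class 1 row of the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix and assumes scikit-learn's convention (rows are actual classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1).
cm = np.array([[78, 21],
               [20, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions / all predictions
precision_1 = tp / (tp + fp)             # of predicted diabetics, how many are right
recall_1 = tp / (tp + fn)                # of actual diabetics, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

Recall for class 1 is 35 / 55, so roughly one in three diabetic patients in the test set is missed by the default model.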



### Test a Sample

In [126]:
sample_index = 0

# Map the numeric class labels to readable names
outcome_labels = {0: 'No Diabetes', 1: 'Diabetes Exists'}
y_test_np = np.array(y_test)

print(f'Actual --> {outcome_labels[y_test_np[sample_index]]}  --  Prediction --> {outcome_labels[pred_abc[sample_index]]}')

Actual --> No Diabetes  --  Prediction --> No Diabetes


 - Performance is relatively poor: overall accuracy is 0.73, and recall for the diabetic class (1) is only 0.64.
 - Let's see whether tuning the hyperparameters improves the performance.

### Tune the model with random search

In [127]:
params = {
    'n_estimators':[1,10,20,30,40,50,60,80,100,110,120,130,140,150,160,170,180,190,200,210,220,250,300,350],
    'learning_rate':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,2,10]
    }

In [128]:
from sklearn.model_selection import RandomizedSearchCV
search_model = AdaBoostClassifier()

In [129]:
random_search = RandomizedSearchCV(search_model, param_distributions=params, 
    n_iter=10,scoring='roc_auc', n_jobs=-1,cv=5,verbose=3)

In [130]:
random_search.fit(X,y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    2.2s finished


RandomizedSearchCV(cv=5, estimator=AdaBoostClassifier(), n_jobs=-1,
                   param_distributions={'learning_rate': [0.1, 0.2, 0.3, 0.4,
                                                          0.5, 0.6, 0.7, 0.8,
                                                          0.9, 1, 2, 10],
                                        'n_estimators': [1, 10, 20, 30, 40, 50,
                                                         60, 80, 100, 110, 120,
                                                         130, 140, 150, 160,
                                                         170, 180, 190, 200,
                                                         210, 220, 250, 300,
                                                         350]},
                   scoring='roc_auc', verbose=3)

In [131]:
random_search.best_estimator_

AdaBoostClassifier(learning_rate=0.2, n_estimators=120)

In [132]:
random_search.best_params_

{'n_estimators': 120, 'learning_rate': 0.2}
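
Two caveats about the search above. It samples only 10 of the 24 × 12 = 288 parameter combinations and sets no `random_state`, so the selected parameters can differ between runs. It is also fit on the full `X, y`, so rows that later serve as the test set influence the parameter choice. A minimal sketch of a variant that searches on the training split only is shown below; it runs on synthetic data from `make_classification` as a placeholder, and the reduced grid (`demo_params`), `n_iter=4` and `cv=3` are assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Placeholder data with the same shape as the feature matrix (768 x 8).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Reduced, illustrative grid; not the grid used in the notebook.
demo_params = {"n_estimators": [10, 50, 100], "learning_rate": [0.1, 0.5, 1.0]}

demo_search = RandomizedSearchCV(
    AdaBoostClassifier(),
    param_distributions=demo_params,
    n_iter=4,
    scoring="roc_auc",
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)

# Fit on the training split only, so the test rows stay unseen during tuning.
demo_search.fit(X_tr, y_tr)
print(demo_search.best_params_)

# The refit best estimator is then scored once on the held-out rows.
test_auc = roc_auc_score(y_te, demo_search.predict_proba(X_te)[:, 1])
print(f"held-out ROC AUC: {test_auc:.3f}")
```

With the real data, the same pattern would amount to calling `random_search.fit(X_train, y_train)` instead of `random_search.fit(X, y)`; the selected parameters may then differ from the ones reported here.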

### Create the model with the suggested parameters

In [136]:
model = AdaBoostClassifier(n_estimators=120, learning_rate=0.2)

In [137]:
model.fit(X_train, y_train)
pred = model.predict(X_test)
train_pred = model.predict(X_train)

### Evaluate the tuned model

In [138]:
print('Train Accuracy score is:')
print(accuracy_score(y_train, train_pred))
print('---------------------------------')
print('Test Accuracy score is:')
print(accuracy_score(y_test, pred))
print('---------------------------------')
print('Confusion matrix:')
print(confusion_matrix(y_test, pred))
print('---------------------------------')
print('Classification Report:')
print(classification_report(y_test, pred))

Train Accuracy score is:
0.8175895765472313
---------------------------------
Test Accuracy score is:
0.7727272727272727
---------------------------------
Confusion matrix:
[[83 16]
 [19 36]]
---------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.84      0.83        99
           1       0.69      0.65      0.67        55

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154



 - Although it's not perfect, tuning has helped on this split: test accuracy rose from 0.73 to 0.77, and the F1-score for class 1 rose from 0.63 to 0.67.
 - We may have to try different algorithms to see which one performs better.
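
One possible way to compare algorithms is to score several classifiers with the same cross-validation setup. The sketch below is only an illustration: it uses synthetic placeholder data from `make_classification`, and the choice of candidate models and of ROC AUC as the metric are assumptions, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data with the same shape as the feature matrix (768 x 8).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)

# Illustrative set of candidate models, all with default hyperparameters.
candidates = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Score every model with the same 5-fold cross-validation and metric.
cv_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```

Replacing `X_demo, y_demo` with the training split would give a like-for-like comparison on the diabetes data while keeping the test set untouched.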