**Problem Statement**:
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem.  There are 768 observations with 8 input variables and 1 output variable.  The variable names are as follows:
1.	Number of times pregnant.
2.	Plasma glucose concentration 2 hours in an oral glucose tolerance test.
3.	Diastolic blood pressure (mm Hg).
4.	Triceps skinfold thickness (mm).
5.	2-Hour serum insulin (mu U/ml).
6.	Body mass index (weight in kg/(height in m)^2).
7.	Diabetes pedigree function.
8.	Age (years).
9.	Is Diabetic (0 or 1).

### On this data we apply AdaBoost, Gradient Boosting and XGBoost and compare their performance

#### Importing required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

#### Loading Data Set

In [2]:
df=pd.read_csv("D:\\python_datascience\\data sets\\diabetes.csv")
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


### Getting shape of data 

In [3]:
df.shape

(768, 9)

#### Checking missing values 

In [4]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

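The data has no null values. Note that zeros in columns such as Insulin and SkinThickness (visible in the first rows shown earlier) may stand in for missing measurements, which `isnull()` does not detect.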
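`isnull()` only counts NaN entries. A quick way to look for placeholder zeros is to count them per column. The sketch below is a minimal illustration that uses the five rows printed by `head()` in this notebook as a stand-in for `df`; treating zero as "not measured" in these columns is an assumption, not something the data file states.

```python
import pandas as pd

# First five rows of the dataset, copied from the head() output in this notebook.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full data the same expression can be applied to `df[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]]`.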

#### checking class imbalance

In [5]:
df["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

The classes are moderately imbalanced (500 vs. 268, roughly 65% to 35%), so we may not need oversampling.
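The class proportions, and a split that preserves them, can be sketched as follows. The target here is a stand-in built from the counts reported above, and `stratify` is an optional extra that this notebook does not use in its own split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in target with the class counts reported above (500 zeros, 268 ones).
y_demo = pd.Series([0] * 500 + [1] * 268, name="Outcome")
X_demo = pd.DataFrame({"row_id": range(len(y_demo))})  # placeholder feature

# Share of each class in the whole data.
print(y_demo.value_counts(normalize=True).round(3))

# stratify keeps the class ratio (almost) identical in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```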

#### Dividing features as x and target as y

In [6]:
x=df.drop(columns=["Outcome"])
x.head()  ## features 

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [7]:
y=df["Outcome"]
y.head()  ## target

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

#### Dividing the data into training and test data 

In [8]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42) # test_size defaults to 0.25

In [9]:
#### Shape of train and test data 
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(576, 8)
(576,)
(192, 8)
(192,)


#### Model Fitting  using AdaBoostClassifier

In [10]:
adaboost=AdaBoostClassifier(n_estimators=50) 


Important parameters:

n_estimators: the number of weak learners to train iteratively (default 50).

learning_rate: scales the weight given to each weak learner's contribution (default 1).
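To make the role of the learning rate concrete, here is a single boosting step written out by hand. This is a minimal sketch of the discrete two-class AdaBoost update, not scikit-learn's internal code, which may differ in detail by version and `algorithm` setting. The labels and predictions are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1])
y_hat = np.array([1, 0, 0, 0, 1])  # hypothetical weak-learner predictions
weights = np.full(len(y_true), 1 / len(y_true))  # start with uniform weights

def boost_step(weights, y_true, y_hat, learning_rate=1.0):
    """One discrete AdaBoost round: learner weight and updated sample weights."""
    miss = y_hat != y_true
    err = np.sum(weights[miss]) / np.sum(weights)     # weighted error rate
    alpha = learning_rate * np.log((1 - err) / err)   # weight of this learner
    new_weights = weights * np.exp(alpha * miss)      # up-weight the mistakes
    return alpha, new_weights / new_weights.sum()

alpha_full, w_full = boost_step(weights, y_true, y_hat, learning_rate=1.0)
alpha_half, w_half = boost_step(weights, y_true, y_hat, learning_rate=0.5)
print(alpha_full, w_full)
print(alpha_half, w_half)
```

With a smaller learning rate each learner gets a smaller weight and the misclassified sample is up-weighted less aggressively, which is why a lower `learning_rate` usually needs a larger `n_estimators`.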

In [11]:
#### Train AdaBoost Classifier 
model=adaboost.fit(x_train,y_train)

In [12]:
#### Predictions 
y_pred=model.predict(x_test)

In [13]:
### Model Evaluation
accuracy_score(y_test,y_pred)

0.7239583333333334

#### Without any feature engineering, AdaBoost with default parameters reaches about 72% accuracy on this data.

#### Let's look at a few aspects of AdaBoost and hyperparameter tuning

In [14]:
## import logistic Regression Model
from sklearn.linear_model import LogisticRegression

In [15]:
lm=LogisticRegression()

In [16]:
## Train AdaBoost with logistic regression as the base estimator (the default is a decision tree), just to experiment with its parameters
adab=AdaBoostClassifier(n_estimators=50,base_estimator=lm,learning_rate=1)
## base_estimator is the estimator from which the ensemble is grown (renamed to `estimator` in scikit-learn 1.2 and later).

In [17]:
## Train the model
mo=adab.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [18]:
#### Predictions 
y_pred=mo.predict(x_test)

In [19]:
### Model Evaluation
accuracy_score(y_test,y_pred)

0.703125

### With logistic regression as the base estimator, accuracy is lower (70.3%) than AdaBoost with default parameters (72.4%)
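The convergence warning printed during training suggests scaling the data or raising `max_iter`. One way to do both is sketched below on synthetic data, since the CSV is not bundled here. The shift and scale applied to the features are made up to mimic unscaled inputs, and the resulting score says nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_demo = X_demo * 50 + 100  # mimic unscaled, medical-style magnitudes

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The base estimator is passed positionally because its keyword name differs
# between scikit-learn releases (base_estimator vs. estimator).
boosted_lr = make_pipeline(
    StandardScaler(),  # scale first so the logistic regressions converge
    AdaBoostClassifier(LogisticRegression(max_iter=1000), n_estimators=50),
)
boosted_lr.fit(X_tr, y_tr)
acc = boosted_lr.score(X_te, y_te)
print(acc)
```

The scaler sits outside the ensemble because AdaBoost passes sample weights to its base estimator, which a plain `Pipeline` does not accept directly in `fit`.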

In [20]:
### Use n_estimators=100 to see whether performance improves
adab=AdaBoostClassifier(n_estimators=100,learning_rate=1)
## Train the model
mo=adab.fit(x_train,y_train)
#### Predictions 
y_pred=mo.predict(x_test)
### Model Evaluation
accuracy_score(y_test,y_pred)

0.7395833333333334

#### With 100 estimators accuracy rises to about 74%, which shows how hyperparameters can be tuned to change performance
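A difference of a point or two on a single 192-row test set can be noise. Comparing settings with cross-validation gives a mean and a spread for each. The sketch below uses synthetic data as a stand-in, so its numbers are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (8 features, binary target).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

cv_results = {}
for n in (50, 100):
    clf = AdaBoostClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validated accuracy for this setting.
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[n] = (scores.mean(), scores.std())
    print(n, round(scores.mean(), 3), round(scores.std(), 3))
```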

#### Importing gradient boost classifier 

In [21]:
from sklearn.ensemble import GradientBoostingClassifier


In [22]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(576, 8)
(576,)
(192, 8)
(192,)


In [23]:
gbc=GradientBoostingClassifier() ## with default parameters
gbcm=gbc.fit(x_train,y_train)
#### Predictions 
y_pred=gbcm.predict(x_test)
### Model Evaluation
accuracy_score(y_test,y_pred)

0.7395833333333334

#### With default parameters, Gradient Boosting performs better (about 74% accuracy) than AdaBoost with default parameters (about 72%)

####  Hyper parameter Tuning 
Important parameters :

- n_estimators

- learning_rate

In [24]:
### Let's try different values of max_depth and n_estimators and see the results
gbc=GradientBoostingClassifier(max_depth=2,n_estimators=3)
gbcm=gbc.fit(x_train,y_train)
#### Predictions 
y_pred=gbcm.predict(x_test)
### Model Evaluation
accuracy_score(y_test,y_pred)

0.640625

Accuracy drops to 64.1% with max_depth=2 and n_estimators=3: three shallow trees are too few to fit the data well.
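How accuracy changes with the number of trees can be read from one fitted model via `staged_predict`, instead of refitting for each value of `n_estimators`. A minimal sketch on synthetic stand-in data (the printed values are not results for the diabetes data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

gb = GradientBoostingClassifier(max_depth=2, n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., n_estimators trees.
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
for n_trees in (3, 10, 50, 100):
    print(n_trees, round(staged_acc[n_trees - 1], 3))
```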

### Getting optimal parameters for GradientBoostingClassifier using gridsearchcv

In [25]:
from sklearn.model_selection import GridSearchCV
## Note: 0.1 and 0.10 are the same value, so 4 of the 16 candidates below are duplicates.
param_grid_gbc={'learning_rate':[0.15,0.1,0.10,0.05],"n_estimators":[100,150,200,250]}
grid=GridSearchCV(estimator=GradientBoostingClassifier(),param_grid=param_grid_gbc,verbose=3)


#grid= GridSearchCV(XGBClassifier(objective='binary:logistic'),param_grid,verbose=3)
grid.fit(x_train,y_train)


Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV 1/5] END learning_rate=0.15, n_estimators=100;, score=0.784 total time=   0.0s
[CV 2/5] END learning_rate=0.15, n_estimators=100;, score=0.791 total time=   0.0s
[CV 3/5] END learning_rate=0.15, n_estimators=100;, score=0.800 total time=   0.0s
[CV 4/5] END learning_rate=0.15, n_estimators=100;, score=0.713 total time=   0.0s
[CV 5/5] END learning_rate=0.15, n_estimators=100;, score=0.722 total time=   0.0s
[CV 1/5] END learning_rate=0.15, n_estimators=150;, score=0.767 total time=   0.1s
[CV 2/5] END learning_rate=0.15, n_estimators=150;, score=0.757 total time=   0.1s
[CV 3/5] END learning_rate=0.15, n_estimators=150;, score=0.817 total time=   0.1s
[CV 4/5] END learning_rate=0.15, n_estimators=150;, score=0.730 total time=   0.1s
[CV 5/5] END learning_rate=0.15, n_estimators=150;, score=0.739 total time=   0.1s
[CV 1/5] END learning_rate=0.15, n_estimators=200;, score=0.759 total time=   0.1s
[CV 2/5] END learning_rate

GridSearchCV(estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [0.15, 0.1, 0.1, 0.05],
                         'n_estimators': [100, 150, 200, 250]},
             verbose=3)

In [26]:
# To  find the parameters giving maximum accuracy
grid.best_params_

{'learning_rate': 0.1, 'n_estimators': 100}

In [27]:
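# Note: this fits an XGBClassifier with learning_rate=0.05, not the tuned GradientBoostingClassifier (best learning_rate=0.1, n_estimators=100) found above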
new_model=XGBClassifier(learning_rate=0.05,n_estimators= 100)
new_model.fit(x_train, y_train)
y_pred_new = new_model.predict(x_test)
accuracy_new = accuracy_score(y_test,y_pred_new)
accuracy_new

0.7239583333333334
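With the default `refit=True`, `GridSearchCV` already refits the best configuration on the full training set, so the tuned Gradient Boosting model can be scored directly through `best_estimator_`. A minimal sketch on synthetic stand-in data, with a deliberately small grid (`small_grid` and `search` are illustrative names) to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

small_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is the winning configuration, refit on all of X_tr.
best_gbc = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
print(round(best_gbc.score(X_te, y_te), 3))
```

Here `best_score_` is the mean cross-validated accuracy on the training folds, while the last line is accuracy on the held-out test set; the two usually differ.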

#### Model fitting using xgboost

In [28]:
#import xgboost as xgb
from xgboost import XGBClassifier

In [29]:
model = XGBClassifier(objective='binary:logistic')
model.fit(x_train,y_train)
# checking test accuracy
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test,y_pred)
accuracy

0.75

#### With default parameters, XGBoost gives the best accuracy so far (75%) on the same data

### Hyper parameter tuning :

In [31]:
param_grid={
    'learning_rate':[1,0.5,0.1,0.01,0.001],
    'max_depth': [3,5,10,20],
    'n_estimators':[10,50,100,200]
}
## GridSearchCV tries every combination of learning_rate, max_depth and n_estimators,
## e.g. (1,3,10), (1,5,50) and so on: 5 * 4 * 4 = 80 candidates in total.
grid= GridSearchCV(XGBClassifier(objective='binary:logistic'),param_grid, verbose=3)
grid.fit(x_train,y_train)

Fitting 5 folds for each of 80 candidates, totalling 400 fits
[CV 1/5] END learning_rate=1, max_depth=3, n_estimators=10;, score=0.802 total time=   0.0s
[CV 2/5] END learning_rate=1, max_depth=3, n_estimators=10;, score=0.826 total time=   0.0s
[CV 3/5] END learning_rate=1, max_depth=3, n_estimators=10;, score=0.783 total time=   0.0s
[CV 4/5] END learning_rate=1, max_depth=3, n_estimators=10;, score=0.687 total time=   0.0s
[CV 5/5] END learning_rate=1, max_depth=3, n_estimators=10;, score=0.757 total time=   0.0s
[CV 1/5] END learning_rate=1, max_depth=3, n_estimators=50;, score=0.767 total time=   0.0s
[CV 2/5] END learning_rate=1, max_depth=3, n_estimators=50;, score=0.739 total time=   0.0s
[CV 3/5] END learning_rate=1, max_depth=3, n_estimators=50;, score=0.757 total time=   0.0s
[CV 4/5] END learning_rate=1, max_depth=3, n_estimators=50;, score=0.730 total time=   0.0s
[CV 5/5] END learning_rate=1, max_depth=3, n_estimators=50;, score=0.730 total time=   0.0s
[CV 1/5] END learn

[CV 1/5] END learning_rate=0.5, max_depth=3, n_estimators=100;, score=0.759 total time=   0.0s
[CV 2/5] END learning_rate=0.5, max_depth=3, n_estimators=100;, score=0.748 total time=   0.0s
[CV 3/5] END learning_rate=0.5, max_depth=3, n_estimators=100;, score=0.765 total time=   0.0s
[CV 4/5] END learning_rate=0.5, max_depth=3, n_estimators=100;, score=0.696 total time=   0.0s
[CV 5/5] END learning_rate=0.5, max_depth=3, n_estimators=100;, score=0.739 total time=   0.0s
[CV 1/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.767 total time=   0.0s
[CV 2/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.722 total time=   0.0s
[CV 3/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.730 total time=   0.0s
[CV 4/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.696 total time=   0.0s
[CV 5/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.748 total time=   0.0s
[CV 1/5] END learning_rate=0.5, max_depth=5, n_est

[CV 4/5] END learning_rate=0.1, max_depth=3, n_estimators=200;, score=0.696 total time=   0.1s
[CV 5/5] END learning_rate=0.1, max_depth=3, n_estimators=200;, score=0.774 total time=   0.0s
[CV 1/5] END learning_rate=0.1, max_depth=5, n_estimators=10;, score=0.784 total time=   0.0s
[CV 2/5] END learning_rate=0.1, max_depth=5, n_estimators=10;, score=0.748 total time=   0.0s
[CV 3/5] END learning_rate=0.1, max_depth=5, n_estimators=10;, score=0.774 total time=   0.0s
[CV 4/5] END learning_rate=0.1, max_depth=5, n_estimators=10;, score=0.687 total time=   0.0s
[CV 5/5] END learning_rate=0.1, max_depth=5, n_estimators=10;, score=0.713 total time=   0.0s
[CV 1/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.802 total time=   0.0s
[CV 2/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.730 total time=   0.0s
[CV 3/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.774 total time=   0.0s
[CV 4/5] END learning_rate=0.1, max_depth=5, n_estimators=

[CV 2/5] END learning_rate=0.01, max_depth=5, n_estimators=50;, score=0.765 total time=   0.0s
[CV 3/5] END learning_rate=0.01, max_depth=5, n_estimators=50;, score=0.783 total time=   0.0s
[CV 4/5] END learning_rate=0.01, max_depth=5, n_estimators=50;, score=0.696 total time=   0.0s
[CV 5/5] END learning_rate=0.01, max_depth=5, n_estimators=50;, score=0.722 total time=   0.0s
[CV 1/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.767 total time=   0.0s
[CV 2/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.739 total time=   0.0s
[CV 3/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.783 total time=   0.0s
[CV 4/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.704 total time=   0.0s
[CV 5/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.748 total time=   0.0s
[CV 1/5] END learning_rate=0.01, max_depth=5, n_estimators=200;, score=0.784 total time=   0.1s
[CV 2/5] END learning_rate=0.01, max_depth=5

[CV 5/5] END learning_rate=0.001, max_depth=5, n_estimators=100;, score=0.713 total time=   0.0s
[CV 1/5] END learning_rate=0.001, max_depth=5, n_estimators=200;, score=0.741 total time=   0.1s
[CV 2/5] END learning_rate=0.001, max_depth=5, n_estimators=200;, score=0.757 total time=   0.1s
[CV 3/5] END learning_rate=0.001, max_depth=5, n_estimators=200;, score=0.783 total time=   0.1s
[CV 4/5] END learning_rate=0.001, max_depth=5, n_estimators=200;, score=0.696 total time=   0.1s
[CV 5/5] END learning_rate=0.001, max_depth=5, n_estimators=200;, score=0.739 total time=   0.1s
[CV 1/5] END learning_rate=0.001, max_depth=10, n_estimators=10;, score=0.759 total time=   0.0s
[CV 2/5] END learning_rate=0.001, max_depth=10, n_estimators=10;, score=0.748 total time=   0.0s
[CV 3/5] END learning_rate=0.001, max_depth=10, n_estimators=10;, score=0.713 total time=   0.0s
[CV 4/5] END learning_rate=0.001, max_depth=10, n_estimators=10;, score=0.670 total time=   0.0s
[CV 5/5] END learning_rate=0.0

GridSearchCV(estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_cat_to_...,
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                               

In [32]:
# To  find the parameters giving maximum accuracy
grid.best_params_

{'learning_rate': 1, 'max_depth': 3, 'n_estimators': 10}
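The 80 candidates and 400 fits reported by the search follow directly from the grid: every value of each parameter is combined with every value of the others, and each combination is fitted once per fold. This can be checked without training anything using scikit-learn's `ParameterGrid` (`param_grid_demo` is just a copy of the grid for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "learning_rate": [1, 0.5, 0.1, 0.01, 0.001],
    "max_depth": [3, 5, 10, 20],
    "n_estimators": [10, 50, 100, 200],
}

# Enumerate every combination the grid search would try.
candidates = list(ParameterGrid(param_grid_demo))
n_folds = 5  # GridSearchCV's default number of folds, as reported in the log
print(len(candidates), len(candidates) * n_folds)  # 80 400
print(candidates[0])
```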

In [33]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=1) # new split with a different seed; test_size defaults to 0.25

In [34]:
# Create a new model using the best parameters found by the grid search
new_model=XGBClassifier(learning_rate= 1, max_depth= 3, n_estimators= 10)
new_model.fit(x_train, y_train)
y_pred_new = new_model.predict(x_test)
accuracy_new = accuracy_score(y_test,y_pred_new)
accuracy_new

0.78125

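### Tuning raised the reported accuracy to 78.1%, but the split was also redrawn with random_state=1, so this figure is not directly comparable with the earlier results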
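How much a score can move with the split alone can be checked by fitting one fixed model on several different splits. The sketch below uses synthetic stand-in data and swaps in `GradientBoostingClassifier` for XGBoost to stay within scikit-learn; the seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (flip_y).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.1, random_state=0
)

seed_acc = {}
for seed in (1, 42, 7):
    # Same model settings every time; only the train/test split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=seed)
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    seed_acc[seed] = clf.score(X_te, y_te)
print(seed_acc)
```

If the spread across seeds is of the same order as the gain attributed to tuning, the comparison is better made on one fixed split or with cross-validation.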