### Python Implementation

**Problem Statement**:
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:
1.	Number of times pregnant.
2.	Plasma glucose concentration 2 hours in an oral glucose tolerance test.
3.	Diastolic blood pressure (mm Hg).
4.	Triceps skinfold thickness (mm).
5.	2-Hour serum insulin (mu U/ml).
6.	Body mass index (weight in kg/(height in m)^2).
7.	Diabetes pedigree function.
8.	Age (years).
9.	Is Diabetic (0 or 1).

In [2]:
import pandas as pd
import numpy as np
import xgboost as xgb
import pickle
from sklearn import datasets
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [3]:
# reading the features and the labels
data= pd.read_csv('pima-indians-diabetes.csv')

In [4]:
data.head()

| | Number of times pregnant | Plasma glucose concentration | Diastolic blood pressure (mm Hg) | Triceps skinfold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age | Is Diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |


In [5]:
data.columns

Index(['Number of times pregnant', 'Plasma glucose concentration',
       'Diastolic blood pressure (mm Hg)', 'Triceps skinfold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age', 'Is Diabetic'],
      dtype='object')

In [6]:
cols = ['Plasma glucose concentration',
       'Diastolic blood pressure (mm Hg)', 'Triceps skinfold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age']

In [7]:
# as mentioned in the data description, missing values are encoded as zeroes, so we replace zeroes with NaN
# ('Number of times pregnant' is left out because zero is a legitimate value there)
for col in cols:
    data[col]=data[col].replace(0, np.nan)
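
The loop above can also be written as one vectorized assignment over the selected columns. A minimal sketch on a toy frame; the short column names and values are made up for illustration and are not the dataset's real columns:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; 0 stands in for "not recorded".
toy = pd.DataFrame({
    "glucose": [148, 0, 183],
    "insulin": [0, 0, 94],
    "pregnancies": [6, 1, 0],  # a genuine zero that must be kept
})

# Only these columns use zero as a missing-value marker.
zero_as_missing = ["glucose", "insulin"]

# Replace zeros with NaN in the selected columns in a single step.
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

missing_counts = toy.isna().sum()
print(missing_counts)
```

Columns where zero is a real measurement stay out of the list, so their values are left untouched.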

In [8]:
# checking for missing values
data.isna().sum()

Number of times pregnant                            0
Plasma glucose concentration                        5
Diastolic blood pressure (mm Hg)                   35
Triceps skinfold thickness (mm)                   227
2-Hour serum insulin (mu U/ml)                    374
Body mass index (weight in kg/(height in m)^2)     11
Diabetes pedigree function                          0
Age                                                 0
Is Diabetic                                         0
dtype: int64

In [9]:
# imputing the missing values
data['Plasma glucose concentration']=data['Plasma glucose concentration'].fillna(data['Plasma glucose concentration'].mode()[0])
data['Diastolic blood pressure (mm Hg)']=data['Diastolic blood pressure (mm Hg)'].fillna(data['Diastolic blood pressure (mm Hg)'].mode()[0])
data['Triceps skinfold thickness (mm)']=data['Triceps skinfold thickness (mm)'].fillna(data['Triceps skinfold thickness (mm)'].mean())
data['2-Hour serum insulin (mu U/ml)']=data['2-Hour serum insulin (mu U/ml)'].fillna(data['2-Hour serum insulin (mu U/ml)'].mean())
data['Body mass index (weight in kg/(height in m)^2)']=data['Body mass index (weight in kg/(height in m)^2)'].fillna(data['Body mass index (weight in kg/(height in m)^2)'].mean())
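
The five assignments above follow one pattern: pick a statistic (mode or mean) per column and fill the gaps with it. One way to make that choice explicit is a small mapping from column to statistic. This is only a sketch on toy data; the `strategy` dictionary and the short column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns and a few missing readings.
toy = pd.DataFrame({
    "glucose": [100.0, 100.0, 120.0, np.nan],
    "bmi": [30.0, np.nan, 20.0, 40.0],
})

# Which statistic fills which column (mode or mean, as in the cell above).
strategy = {"glucose": "mode", "bmi": "mean"}

for col, how in strategy.items():
    if how == "mode":
        fill_value = toy[col].mode()[0]  # most frequent value
    else:
        fill_value = toy[col].mean()     # NaN is skipped by default
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```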


In [10]:
# checking for missing values after imputation
data.isna().sum()

Number of times pregnant                          0
Plasma glucose concentration                      0
Diastolic blood pressure (mm Hg)                  0
Triceps skinfold thickness (mm)                   0
2-Hour serum insulin (mu U/ml)                    0
Body mass index (weight in kg/(height in m)^2)    0
Diabetes pedigree function                        0
Age                                               0
Is Diabetic                                       0
dtype: int64

In [11]:
# separating the feature and the label columns
x=data.drop(labels='Is Diabetic', axis=1)
y= data['Is Diabetic']

In [12]:
x.head()

| | Number of times pregnant | Plasma glucose concentration | Diastolic blood pressure (mm Hg) | Triceps skinfold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 155.548223 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 155.548223 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64.0 | 29.15342 | 155.548223 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |


In [13]:
# as the features differ a lot in magnitude, we'll standardize them
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaled_data=scaler.fit_transform(x)

In [14]:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y=train_test_split(scaled_data,y,test_size=0.3,random_state=42)
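
In the cells above, the imputation statistics and the scaler are computed on the whole dataset before the split, so the test rows influence the preprocessing. A common alternative is to split first and fit the preprocessing on the training rows only. The sketch below shows that order on synthetic data from `make_classification`, which is an assumption standing in for the diabetes table. `stratify=y` keeps the class proportions similar in both splits, which may help because the classes are unbalanced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes table: 8 features, unbalanced classes.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.65, 0.35], random_state=42
)
X[::10, 3] = np.nan  # punch a few holes to mimic missing readings

# Split first, so the test rows never influence the preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Mean imputation followed by standardization, fitted on training rows only.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_scaled = preprocess.fit_transform(X_train)
X_test_scaled = preprocess.transform(X_test)  # reuses the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```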

In [15]:
# fit model on training data
model = XGBClassifier(objective='binary:logistic')
model.fit(train_x, train_y)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [16]:
# checking training accuracy
y_pred = model.predict(train_x)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(train_y,predictions)
accuracy

1.0

In [17]:
# checking initial test accuracy
y_pred = model.predict(test_x)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(test_y,predictions)
accuracy

0.7272727272727273
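
A training accuracy of 1.0 against roughly 0.73 on the test set suggests the default model overfits. Because the classes are unbalanced, accuracy alone can also hide weak recall on the minority class. The sketch below shows how a confusion matrix and recall expose that, using made-up labels rather than the notebook's predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration: four negatives, two positives.
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
rec = recall_score(y_true, y_hat)     # share of true positives that were found

print(acc)
print(cm)
print(rec)
```

Here two thirds of the predictions are correct, yet only half of the positive cases are detected.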

In [18]:
test_x[0]

array([ 0.63994726, -0.77251205, -1.18156252,  0.43784695,  0.40547846,
        0.22451019, -0.1264714 ,  0.83038113])

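The model fits the training data perfectly (accuracy 1.0) but reaches only about 0.73 on the test data, so it is overfitting. To improve the test accuracy, we'll tune the hyperparameters using grid search.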

In [19]:
from sklearn.model_selection import GridSearchCV

In [20]:
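param_grid={
    # the key must be exactly 'learning_rate'; the run recorded below used ' learning_rate'
    # (with a leading space), which is why every fit failed
    'learning_rate':[1,0.5,0.1,0.01,0.001],
    'max_depth': [3,5,10,20],
    'n_estimators':[10,50,100,200]
}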

In [21]:
grid= GridSearchCV(XGBClassifier(objective='binary:logistic'),param_grid, verbose=3)

In [22]:
grid.fit(train_x,train_y)

Fitting 5 folds for each of 80 candidates, totalling 400 fits
[CV 1/5] END  learning_rate=1, max_depth=3, n_estimators=10;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=1, max_depth=3, n_estimators=10;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=1, max_depth=3, n_estimators=10;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=1, max_depth=3, n_estimators=10;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=1, max_depth=3, n_estimators=10;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=1, max_depth=3, n_estimators=50;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=1, max_depth=3, n_estimators=50;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=1, max_depth=3, n_estimators=50;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=1, max_depth=3, n_estimators=50;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=1, max_depth=3, n_estimators=50;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=

[CV 3/5] END  learning_rate=0.5, max_depth=3, n_estimators=100;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.5, max_depth=3, n_estimators=100;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.5, max_depth=3, n_estimators=100;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.5, max_depth=3, n_estimators=200;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.5, max_depth=3, n_estimators=200;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.5, max_depth=3, n_estimators=200;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.5, max_depth=3, n_estimators=200;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.5, max_depth=3, n_estimators=200;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.5, max_depth=5, n_estimators=10;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.5, max_depth=5, n_estimators=10;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.5, max_depth=5, n_estimators=10;

[CV 1/5] END  learning_rate=0.1, max_depth=5, n_estimators=50;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.1, max_depth=5, n_estimators=50;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.1, max_depth=5, n_estimators=50;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.1, max_depth=5, n_estimators=50;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.1, max_depth=5, n_estimators=50;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.1, max_depth=5, n_estimators=100;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.1, max_depth=5, n_estimators=100;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.1, max_depth=5, n_estimators=100;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.1, max_depth=5, n_estimators=100;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.1, max_depth=5, n_estimators=100;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.1, max_depth=5, n_estimators=200;, 

[CV 4/5] END  learning_rate=0.01, max_depth=10, n_estimators=50;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.01, max_depth=10, n_estimators=50;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.01, max_depth=10, n_estimators=100;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.01, max_depth=10, n_estimators=100;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.01, max_depth=10, n_estimators=100;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.01, max_depth=10, n_estimators=100;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.01, max_depth=10, n_estimators=100;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.01, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.01, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.01, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.01, max_dept

[CV 5/5] END  learning_rate=0.001, max_depth=10, n_estimators=100;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.001, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.001, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.001, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.001, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.001, max_depth=10, n_estimators=200;, score=nan total time=   0.0s
[CV 1/5] END  learning_rate=0.001, max_depth=20, n_estimators=10;, score=nan total time=   0.0s
[CV 2/5] END  learning_rate=0.001, max_depth=20, n_estimators=10;, score=nan total time=   0.0s
[CV 3/5] END  learning_rate=0.001, max_depth=20, n_estimators=10;, score=nan total time=   0.0s
[CV 4/5] END  learning_rate=0.001, max_depth=20, n_estimators=10;, score=nan total time=   0.0s
[CV 5/5] END  learning_rate=0.001,

400 fits failed out of a total of 400.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
53 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\argha\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\argha\anaconda3\lib\site-packages\xgboost\core.py", line 620, in inner_f
    return func(**kwargs)
  File "C:\Users\argha\anaconda3\lib\site-packages\xgboost\sklearn.py", line 1490, in fit
    self._Booster = train(
  File "C:\Users\argha\anaconda3\lib\site-packages\xgboost\core.py", line 620, in inner_f
    return func(**kwargs)
  File "C:\Users\argha\anaconda3\lib\site-packages\xgboost\traini

XGBoostError: [17:16:38] C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-0fc7796c793e6356f-1/xgboost/xgboost-ci-windows/src/learner.cc:749: Invalid parameter " learning_rate" contains whitespace.
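
The error message above points at a grid key with a leading space. Mistakes like this can be caught before launching the search by comparing the grid keys with the estimator's own parameter names. The helper `invalid_grid_keys` below is a hypothetical name, not part of scikit-learn or XGBoost, and the fallback import is only there so the sketch runs where xgboost is not installed:

```python
try:
    from xgboost import XGBClassifier as Classifier
except ImportError:
    # Fallback so the sketch runs without xgboost; it offers the same get_params() interface.
    from sklearn.ensemble import GradientBoostingClassifier as Classifier


def invalid_grid_keys(estimator, param_grid):
    """Return the grid keys that are not parameters of the estimator."""
    valid = set(estimator.get_params().keys())
    return [key for key in param_grid if key not in valid]


# Shortened version of the grid that produced the error (note the leading space).
bad_grid = {" learning_rate": [1, 0.5], "max_depth": [3, 5], "n_estimators": [10, 50]}

# Stripping whitespace from every key repairs the grid.
good_grid = {key.strip(): values for key, values in bad_grid.items()}

print(invalid_grid_keys(Classifier(), bad_grid))
print(invalid_grid_keys(Classifier(), good_grid))
```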

In [23]:
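# To find the parameters giving maximum accuracy
# (caution: every fit above failed and all scores are nan, so this appears to be simply the first candidate in the grid)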
grid.best_params_

{' learning_rate': 1, 'max_depth': 3, 'n_estimators': 10}

In [24]:
# Create new model using the same parameters
new_model=XGBClassifier(learning_rate= 1, max_depth= 3, n_estimators= 10)
new_model.fit(train_x, train_y)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=10, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [25]:
y_pred_new = new_model.predict(test_x)
predictions_new = [round(value) for value in y_pred_new]
accuracy_new = accuracy_score(test_y,predictions_new)
accuracy_new

0.7532467532467533
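
The grid-search log suggests `error_score='raise'` for debugging failed fits. The sketch below uses that option with a correctly spelled grid, and takes the refit model from `best_estimator_` instead of copying the parameters by hand. It is a sketch under assumptions: `GradientBoostingClassifier` and synthetic data stand in for `XGBClassifier` and the diabetes data, and the grid is shortened to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data in place of the diabetes table.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {
    "learning_rate": [0.5, 0.1],  # no stray whitespace in the key
    "max_depth": [3, 5],
    "n_estimators": [10, 50],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    error_score="raise",  # fail loudly instead of recording nan scores
)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # already refit on the full training split
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, search.best_score_, test_accuracy)
```

With `error_score="raise"`, an invalid parameter name stops the search at the first fit rather than producing a full log of `nan` scores.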

With these settings the test accuracy on this split rises from about 0.727 to about 0.753, so we'll save this model.

In [26]:
filename = 'xgboost_model.pickle'
# context managers make sure the file handles are closed
with open(filename, 'wb') as f:
    pickle.dump(new_model, f)

with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [27]:
# we'll save the scaler object as well for prediction
filename_scaler = 'scaler_model.pickle'
with open(filename_scaler, 'wb') as f:
    pickle.dump(scaler, f)

with open(filename_scaler, 'rb') as f:
    scaler_model = pickle.load(f)
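
Since the model is only useful together with the scaler it was trained with, one option is to store both in a single file so they cannot drift apart. The following is a sketch under assumptions: the bundle layout and file name are hypothetical, and `GradientBoostingClassifier` on random data stands in for the fitted `XGBClassifier`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for the fitted XGBClassifier
from sklearn.preprocessing import StandardScaler

# Random data in place of the diabetes table.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = (X[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X)
model = GradientBoostingClassifier(n_estimators=10, random_state=0)
model.fit(scaler.transform(X), y)

path = os.path.join(tempfile.mkdtemp(), "bundle.pickle")  # hypothetical file name

# Save the scaler and the model together in one dictionary.
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler, "model": model}, f)

with open(path, "rb") as f:
    bundle = pickle.load(f)

# The reloaded pair should reproduce the original predictions.
same = np.array_equal(
    model.predict(scaler.transform(X)),
    bundle["model"].predict(bundle["scaler"].transform(X)),
)
print(same)
```

Pickle files can execute code when loaded, so they should only be loaded from trusted sources.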

In [28]:
# Trying a prediction on a single observation (values in the same order as the feature columns)
d=scaler_model.transform([[6,148,72,35,80,33.6,0.627,50]])
pred=loaded_model.predict(d)
print('This data belongs to class :',pred[0])

This data belongs to class : 1




**The main advantages of XGBoost:**
- a good bias-variance trade-off out of the box, helped by built-in regularization,
- high computation speed, since it uses parallel computing and cache optimization,
- efficient use of the available hardware,
- works well even if the features are correlated,
- robust to noise in classification problems,
- supports early stopping,
- the package is actively developed, i.e., new features are still being added.