# Extreme Gradient Boosting (XGBoost)
***

Welcome to the Extreme Gradient Boosting Project!

In this project, you will demonstrate what you have learned in this course by conducting an experiment dealing with Loan Prediction.

We have seen in the lectures how Extreme Gradient Boosting works.

## What we have learned so far:
*
*
*

## What are we going to do?
* You have a clean dataset. We will use an approach similar to the previous grid search, but will divide the parameters into two parts.
* Choose default values for the XGBoost classifier.
* Tune the tree-specific parameters (`max_depth`, `min_child_weight`, `gamma`, `subsample`, `colsample_bytree`) for a fixed learning rate and number of trees. Note that we can choose different parameters to define a tree.

## What will you learn by doing this assignment?

* You will learn to build an XGBoost model.

### Dataset
To perform this classification task we will use the `Loan Prediction` dataset.

This dataset contains the following features:
- ApplicantIncome
- CoapplicantIncome
- Loan Amount
- Loan Amount term
- Credit History
- Property_Area
- Self_Employed
- Education
- Dependents
- Married
- Gender
- Loan_ID

**Target Variable:**
- Loan Status

**Detailed information is given in each task.**



In [1]:
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score



In [2]:
# load data
dataset = pd.read_csv('../Data/loan_clean_data.csv')
# split data into X and y
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=9)

# Let's start tuning the Extreme Gradient Boosting parameters

* Extreme Gradient Boosting has a lot of parameters to tune, but we will only touch some of them.
* We have divided the parameters into two groups for ease of tuning.

## Write a function `myXGBoost` that:
* Takes the following `param_grid` along with the model, the dataset, and `KFold`, fits the model, and returns the accuracy and `best_params`.
* Uses `GridSearchCV`.
* Uses `**kwargs` (to set parameters on the base classifier).
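
The `**kwargs` mechanism can be illustrated without any machine-learning library. The following is a minimal sketch, assuming a hypothetical `ToyModel` class whose `set_params` method only mimics the scikit-learn style interface; `tune_with_overrides` is likewise a made-up name used for illustration.

```python
class ToyModel:
    """Hypothetical stand-in for a classifier; it only mimics set_params."""

    def __init__(self, max_depth=3, gamma=0.0):
        self.max_depth = max_depth
        self.gamma = gamma

    def set_params(self, **params):
        # Overwrite only attributes that already exist on the model.
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self


def tune_with_overrides(model, **kwargs):
    # Extra keyword arguments are collected into the dict `kwargs`
    # and forwarded unchanged to the model.
    if kwargs:
        model.set_params(**kwargs)
    return model


toy = tune_with_overrides(ToyModel(), max_depth=2, gamma=0.9)
print(toy.max_depth, toy.gamma)
```

Calling `tune_with_overrides(ToyModel())` with no extra arguments leaves the defaults untouched, which is the behaviour the `if kwargs:` guard is meant to express.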

### Parameters:

| Parameter | dtype | argument type | default value | description |
| --- | --- | --- | --- | --- |
| X_train | DataFrame | compulsory | | Dataframe containing feature variables for training|
| X_test | DataFrame | compulsory | | Dataframe containing feature variables for testing|
| y_train | Series/DataFrame | compulsory | | Training dataset target Variable |
| y_test | Series/DataFrame | compulsory | | Testing dataset target Variable |
| model | estimator | compulsory | | The model to build (e.g. an `XGBClassifier` instance) |
| param_grid | dict | compulsory | | Dictionary of parameters to search |
| KFold | int | optional | 3 | Number of folds for cross-validation |
| `**kwargs` |  | optional | | Additional parameters to set on the model |

### Return :

| Return | dtype | description |
| --- | --- | --- |
| accuracy | float | accuracy of model using those params |
| best_params | dict | Dictionary of the best-fit parameters for the model |
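
To make the `KFold` argument concrete, here is a simplified sketch of how k-fold cross-validation partitions row indices. It is an illustration only: `kfold_indices` is a hypothetical helper, and it ignores the stratification by class label that `GridSearchCV` applies when it is given an integer `cv` and a classifier.

```python
def kfold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Every row outside the validation fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each row lands in exactly one validation fold, so every candidate parameter setting is scored on all of the training data once across the `k` fits.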

In [14]:
param_grid1 = {"max_depth": [2, 3, 4, 5, 6, 7, 9, 11],
             "min_child_weight": [4, 6, 7, 8],
             "subsample": [0.6, .7, .8, .9, 1],
             "colsample_bytree": [0.6, .7, .8, .9, 1]
             }

param_grid2 = {"gamma": [0, 0.05, 0.1, 0.3, 0.7, 0.9, 1],
             "reg_alpha": [0, 0.001, 0.005, 0.01, 0.05, 0.1],
             "reg_lambda": [0.05, 0.1, 0.5, 1.0]
              }
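
Splitting the parameters into two grids keeps the search small. The sketch below counts the candidate settings for the two grids above when they are searched one after the other versus as a single joint grid; `n_candidates` is a hypothetical helper, and the count assumes an exhaustive grid search. With 3-fold cross-validation each candidate is fitted three times, so the number of fits is three times these counts.

```python
from math import prod

param_grid1 = {"max_depth": [2, 3, 4, 5, 6, 7, 9, 11],
               "min_child_weight": [4, 6, 7, 8],
               "subsample": [0.6, .7, .8, .9, 1],
               "colsample_bytree": [0.6, .7, .8, .9, 1]}

param_grid2 = {"gamma": [0, 0.05, 0.1, 0.3, 0.7, 0.9, 1],
               "reg_alpha": [0, 0.001, 0.005, 0.01, 0.05, 0.1],
               "reg_lambda": [0.05, 0.1, 0.5, 1.0]}


def n_candidates(grid):
    # An exhaustive grid search tries every combination of listed values.
    return prod(len(values) for values in grid.values())


separate = n_candidates(param_grid1) + n_candidates(param_grid2)
joint = n_candidates({**param_grid1, **param_grid2})

print(n_candidates(param_grid1), n_candidates(param_grid2))  # 800 168
print(separate, joint)  # 968 134400
```

The trade-off is that a two-stage search can miss interactions between parameters in different groups, since the second stage only explores around the first stage's best setting.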

In [4]:
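def myXGBoost(X_train, X_test, y_train, model, param_grid, KFold=3, **kwargs):
    # Note: y_test is not a parameter here; it is read from the notebook's global scope.
    # Apply any extra keyword arguments to the base classifier before searching.
    if kwargs:
        model.set_params(**kwargs)
    # Exhaustive grid search with KFold-fold cross-validation on the training set.
    gs_cv = GridSearchCV(model, param_grid=param_grid, cv=KFold, verbose=0)
    gs_cv.fit(X_train, y_train)
    best_params = gs_cv.best_params_
    # By default GridSearchCV refits the best estimator on the full training set.
    y_pred = gs_cv.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy, best_params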

In [5]:
xgb = XGBClassifier(seed=9)
accuracy, best_params = myXGBoost(X_train, X_test, y_train, xgb, param_grid1, KFold=3)

In [6]:
accuracy, best_params

(0.79670329670329665,
 {'colsample_bytree': 0.7,
  'gamma': 0.9,
  'max_depth': 2,
  'min_child_weight': 6,
  'subsample': 1})

# Let's continue with Extreme Gradient Boosting parameter tuning

* Now that we have tuned the first few parameters, we will use them and tune the rest.

## Write a function `param2` that:
* Takes the following `param_grid` along with the model and the dataset, uses **myXGBoost**, and returns the accuracy and `best_params`.

### Parameters:

| Parameter | dtype | argument type | default value | description |
| --- | --- | --- | --- | --- |
| X_train | DataFrame | compulsory | | Dataframe containing feature variables for training|
| X_test | DataFrame | compulsory | | Dataframe containing feature variables for testing|
| y_train | Series/DataFrame | compulsory | | Training dataset target Variable |
| y_test | Series/DataFrame | compulsory | | Testing dataset target Variable |
| model | estimator | compulsory | | The model to build (e.g. an `XGBClassifier` instance) |
| param_grid | dict | compulsory | | Dictionary of parameters to search |

### Return :

| Return | dtype | description |
| --- | --- | --- |
| accuracy | float | accuracy of model using those params |
| best_params | dict | Dictionary of the best-fit parameters for the model |

In [11]:
# xgb1 = XGBClassifier(seed=9, colsample_bytree=0.7, gamma=0.9, max_depth=2, min_child_weight=6, subsample=1)
def param2(X_train, X_test, y_train, model, param_grid2):
    # Fix the values found in the first search, then search the second grid.
    return myXGBoost(X_train, X_test, y_train, model, param_grid2,
                     colsample_bytree=0.7, gamma=0.9, max_depth=2,
                     min_child_weight=6, subsample=1)

In [15]:
accuracy1, best_params1 = param2(X_train, X_test, y_train, xgb, param_grid2)

In [17]:
accuracy1, best_params1

(0.79670329670329665, {'n_estimators': 20, 'reg_alpha': 0, 'reg_lambda': 0.05})

# Build an XGBoost model using the best parameters

* Now that we have tuned the parameters, we will use them in the model.
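
One way to carry the tuned values forward is to merge the two `best_params` dictionaries and unpack the result as keyword arguments. The sketch below copies the values from the outputs shown earlier in this notebook; `describe` is a hypothetical stand-in for any function that accepts `**kwargs`.

```python
# Values copied from the two grid-search outputs shown earlier.
stage1_best = {"colsample_bytree": 0.7, "gamma": 0.9, "max_depth": 2,
               "min_child_weight": 6, "subsample": 1}
stage2_best = {"n_estimators": 20, "reg_alpha": 0, "reg_lambda": 0.05}

# Later dictionaries win if the same key appears twice.
final_params = {**stage1_best, **stage2_best}


def describe(**params):
    # Hypothetical consumer: formats whatever keyword arguments it receives.
    return ", ".join(f"{name}={value}" for name, value in sorted(params.items()))


print(describe(**final_params))
```

Keeping the settings in one dictionary avoids retyping them at every call site and makes it harder for the stages to drift apart.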

## Write a function `xgboost` that:
* Takes the following dataset and returns the accuracy.


### Parameters:

| Parameter | dtype | argument type | default value | description |
| --- | --- | --- | --- | --- |
| X_train | DataFrame | compulsory | | Dataframe containing feature variables for training|
| X_test | DataFrame | compulsory | | Dataframe containing feature variables for testing|
| y_train | Series/DataFrame | compulsory | | Training dataset target Variable |
| y_test | Series/DataFrame | compulsory | | Testing dataset target Variable |
| `**kwargs` |  | optional | | Additional parameters to set on the model |

### Return :

| Return | dtype | description |
| --- | --- | --- |
| accuracy | float | accuracy of model using those params |

To-do list:

Try different values of `n_estimators` and `learning_rate`, check whether the score varies, and find the best score.
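
The to-do item can be approached with a small sweep over both parameters. This is a sketch under stated assumptions: `sweep` is a hypothetical helper, and `toy_score` is a placeholder with no real meaning that stands in for a function that trains a model and returns its accuracy.

```python
from itertools import product


def sweep(score_fn, n_estimators_values, learning_rates):
    """Score every combination and return (best_score, best_settings, results)."""
    results = []
    for n_estimators, learning_rate in product(n_estimators_values, learning_rates):
        score = score_fn(n_estimators=n_estimators, learning_rate=learning_rate)
        settings = {"n_estimators": n_estimators, "learning_rate": learning_rate}
        results.append((score, settings))
    # Pick the combination with the highest score.
    best_score, best_settings = max(results, key=lambda item: item[0])
    return best_score, best_settings, results


def toy_score(n_estimators, learning_rate):
    # Placeholder only: peaks at n_estimators=50, learning_rate=0.1.
    return 1.0 - abs(n_estimators - 50) / 100 - abs(learning_rate - 0.1)


best_score, best_settings, results = sweep(toy_score, [10, 50, 100], [0.01, 0.1, 0.3])
print(best_settings)
```

In the notebook, `score_fn` could be a thin wrapper around the `xgboost` function from this section that passes the already tuned parameters along with each `n_estimators` and `learning_rate` pair.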

In [73]:
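def xgboost(X_train, X_test, y_train, **kwargs):
    # Note: y_test is not a parameter here; it is read from the notebook's global scope.
    xgb1 = XGBClassifier(seed=9)
    # Apply the tuned parameters, if any were passed.
    if kwargs:
        xgb1.set_params(**kwargs)
    xgb1.fit(X_train, y_train)

    y_pred = xgb1.predict(X_test)
    return accuracy_score(y_test, y_pred)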
    

In [80]:
# Use a new name so the imported accuracy_score function is not overwritten.
final_accuracy = xgboost(X_train, X_test, y_train, colsample_bytree=0.7, gamma=0.9, max_depth=2,
                         min_child_weight=6, subsample=1, n_estimators=10,
                         reg_alpha=0, reg_lambda=0.05, learning_rate=0.1)
final_accuracy

0.79670329670329665