<h1 align="center">Gradient Boosting and XGBoost</h1>

In [7]:
!pip install xgboost

Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/51/c1/198915b13e98b62a98f48309c41012638464651da755d941f4abe384c012/xgboost-0.82-py2.py3-none-win_amd64.whl (7.7MB)
Installing collected packages: xgboost
Successfully installed xgboost-0.82


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../data/white-wine.csv')
df.head()

| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 1
***
### Instructions
* Split the white-wine dataset into training and test sets with `test_size = 0.3`

In [3]:
from sklearn.model_selection import train_test_split as tts

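# Features are every column except the last; the target is the last column (quality).
# Note: X still includes the 'Unnamed: 0' index column read from the CSV.
X, y = df.iloc[:, :-1], df.iloc[:, -1]

x_train, x_test, y_train, y_test = tts(X, y, test_size=0.3, random_state=0)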
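To make the hold-out split concrete, here is a minimal sketch of a shuffled split on row indices. The helper name `shuffle_split_indices` and the choice to round the test size up are assumptions made for illustration; it is not meant to reproduce the exact rows that `train_test_split` selects.

```python
import math
import numpy as np

def shuffle_split_indices(n_samples, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) for a shuffled hold-out split (illustrative helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)           # random ordering of the row indices
    n_test = math.ceil(test_size * n_samples)    # this sketch rounds the test size up
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split_indices(10, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, and fixing the seed makes the split repeatable, which is the role `random_state` plays above.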

<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 2
***

### Instructions
* Initialise a decision tree model (the weak classifier) with `DecisionTreeClassifier()` using `max_depth=1` and `random_state=0`, and save it to a variable called `dt_clf`.

* Fit the model on the training data `x_train` and `y_train` using the `fit()` method.

* Compute the accuracy on `x_test` and `y_test` using the `score()` method and save it in a variable called `dt_score`.

* Initialise an AdaBoost model with `AdaBoostClassifier()` using `base_estimator=dt_clf` and `random_state=0`, and save it to a variable called `ada_clf`.

* Fit the model on the training data `x_train` and `y_train` using the `fit()` method.

* Compute the accuracy on `x_test` and `y_test` using the `score()` method and save it in a variable called `ada_score`.

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Fit the weak classifier (a depth-1 decision tree, i.e. a stump)
dt_clf = DecisionTreeClassifier(max_depth=1, random_state=0)
dt_clf.fit(x_train, y_train)
dt_score = dt_clf.score(x_test, y_test)
print("Weak Learner Score: ", dt_score)

# Boost the weak classifier with AdaBoost.
# Note: scikit-learn 1.2+ renames `base_estimator` to `estimator`.
ada_clf = AdaBoostClassifier(base_estimator=dt_clf, random_state=0)
ada_clf.fit(x_train, y_train)
ada_score = ada_clf.score(x_test, y_test)
print("Adaboost Score: ", ada_score)

Weak Learner Score:  0.42857142857142855
Adaboost Score:  0.4326530612244898
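AdaBoost works by re-weighting the training points after each weak learner so that the next learner concentrates on the mistakes. The sketch below shows one round of the textbook binary version on made-up labels. It is a simplification: `AdaBoostClassifier` uses a multi-class variant on the wine data, and all arrays here are hypothetical.

```python
import numpy as np

# Toy binary labels in {-1, +1} and the predictions of one weak learner
# (both made up for illustration).
y_true = np.array([1, 1, -1, -1, 1])
y_weak = np.array([1, -1, -1, -1, 1])        # one mistake, at index 1

w = np.full(len(y_true), 1 / len(y_true))    # start with uniform weights
err = np.sum(w[y_weak != y_true])            # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)        # the learner's vote strength

# Up-weight the mistake, down-weight the correct points, then renormalise.
w = w * np.exp(-alpha * y_true * y_weak)
w = w / w.sum()
print(w)
```

After the update, the single misclassified point carries half of the total weight, so the next stump is pushed towards getting it right.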


<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 3
***

### Instructions

* Initialise a gradient boosting model with `GradientBoostingClassifier()` using `random_state=0` and save it to a variable called `gb_clf`.

* Fit the model on the training data `x_train` and `y_train` using the `fit()` method.

* Compute the accuracy on `x_test` and `y_test` using the `score()` method and save it in a variable called `gb_score`.

In [5]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(random_state =0)
gb_clf.fit(x_train,y_train)
gb_score = gb_clf.score(x_test,y_test)
print("Gradient Boost Score: ",gb_score)

Gradient Boost Score:  0.5727891156462585
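The core idea of gradient boosting is that each new tree is fitted to the errors the current ensemble still makes. The sketch below illustrates this for the simplest case, regression with squared error on a one-dimensional toy signal, using hand-written stumps. It is an assumption-laden simplification: `GradientBoostingClassifier` optimises a classification loss with deeper trees, and the helper names `fit_stump` and `boost` are invented for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits residual in squared error."""
    best = None
    for t in np.unique(x)[:-1]:                  # candidate thresholds (both sides non-empty)
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Return training-set predictions after n_rounds of boosting with stumps."""
    pred = np.full(len(y), y.mean())             # start from the mean prediction
    for _ in range(n_rounds):
        residual = y - pred                      # negative gradient of half squared error
        t, left_val, right_val = fit_stump(x, residual)
        pred = pred + learning_rate * np.where(x <= t, left_val, right_val)
    return pred

x = np.linspace(0, 6, 60)
y = np.sin(x)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - boost(x, y)) ** 2)
print(mse_start, mse_end)
```

The `learning_rate` shrinks each stump's contribution, which is the same role the `learning_rate` parameter plays in the library models.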


<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 4
***
### Instructions
* For this challenge, you will need to install the `xgboost` library. Import `xgboost`, instantiate `XGBClassifier()` with `random_state=0`, and fit it on the training set.

In [6]:
import xgboost as xgb

model = xgb.XGBClassifier(random_state=0)

model.fit(x_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
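The output above comes from xgboost 0.82. Newer xgboost releases may reject class labels that are not consecutive integers starting at 0, and wine quality scores start higher than that (the preview rows all show 6). If `fit` complains about the labels, one possible workaround is to encode them first and decode the predictions afterwards. The sketch below uses made-up scores in place of `y_train`.

```python
import numpy as np

# Hypothetical quality scores standing in for y_train.
quality = np.array([6, 5, 7, 6, 3, 9, 5])

# classes holds the sorted distinct scores; encoded maps each score to 0..k-1.
classes, encoded = np.unique(quality, return_inverse=True)
print(classes)
print(encoded)

# Predictions made on encoded labels can be mapped back to quality scores.
decoded = classes[encoded]
```

The same `classes` array must be reused for the test labels, since a score that never appears in the training set has no encoded value.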

<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 5
***
### Instructions
* Predict on the test set with the fitted model and check the accuracy score

In [7]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(x_test)

accuracy_score(y_test,y_pred)

0.5619047619047619
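Accuracy is simply the fraction of predictions that match the true label exactly. A minimal sketch with made-up labels and predictions (not taken from the wine data):

```python
import numpy as np

y_true_demo = np.array([6, 5, 6, 7, 5])    # made-up true labels
y_pred_demo = np.array([6, 6, 6, 7, 4])    # made-up predictions

# Element-wise comparison gives booleans; their mean is the share of exact matches.
accuracy = np.mean(y_true_demo == y_pred_demo)
print(accuracy)
```

Because only exact matches count, predicting 5 for a wine rated 6 is penalised as heavily as predicting 3.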

<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 6
***
### Instructions
* Now let's tune some hyperparameters. Define the grid `'colsample_bytree': np.linspace(0.5, 0.9, 5)`, `'n_estimators': [5, 10]`, `'max_depth': [10, 15, 20, 25]` and pass it to a `GridSearchCV` model with `scoring = 'accuracy'` and the cross-validation parameter `cv = 5`.<br/><br/>Feel free to experiment with the hyperparameters.

In [8]:
from sklearn.model_selection import GridSearchCV

gbm_param_grid = {
     'colsample_bytree': np.linspace(0.5, 0.9, 5),
     'n_estimators':[5, 10],
     'max_depth': [10, 15, 20, 25]
}

grid_mse = GridSearchCV(estimator = model, param_grid = gbm_param_grid, scoring = 'accuracy', cv = 5)
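Before fitting, it helps to know how much work the grid implies. A grid search tries every combination of the listed values, and each combination is trained once per fold. The sketch below counts them with `itertools.product`; the name `param_grid_demo` is just a copy of the grid made for this example.

```python
from itertools import product
import numpy as np

param_grid_demo = {
    'colsample_bytree': np.linspace(0.5, 0.9, 5),
    'n_estimators': [5, 10],
    'max_depth': [10, 15, 20, 25],
}

# Every combination of one value per parameter is a candidate model.
combos = list(product(*param_grid_demo.values()))
n_candidates = len(combos)       # 5 * 2 * 4
n_fits = n_candidates * 5        # one fit per candidate per fold (cv = 5)
print(n_candidates, n_fits)
```

With `refit` left at its default, `GridSearchCV` adds one more fit of the best candidate on the whole training set.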

<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 7
***
### Instructions
* Fit the GridSearchCV model on the training set and print the best parameters and the highest cross-validated accuracy found

In [9]:
grid_mse.fit(x_train, y_train)
print("Best parameters found: ",grid_mse.best_params_)
print("Highest Accuracy found: ", grid_mse.best_score_)

Best parameters found:  {'colsample_bytree': 0.6, 'max_depth': 25, 'n_estimators': 10}
Highest Accuracy found:  0.6490665110851809


<img src="../images/icon/ppt-icons.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />

### Mini Challenge - 8
***
### Instructions
* Make predictions on the test dataset and print the accuracy

In [10]:
pred = grid_mse.predict(x_test)
accuracy_score(y_test,pred)

0.6482993197278911

<img src="../images/icon/quiz.png" alt="Technical-Stuff" style="width: 100px;float:left; margin-right:15px"/>
<br />


# Thank You
***
### Next Session: Challenges in ML
For more queries - Reach out to academics@greyatom.com 