# Gradient Boosting (Classification)

Data Source: [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality)

**Data Set Information**

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. It is also unclear whether all input variables are relevant, so feature selection methods may be worth testing.

**Attribute Information**

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

- 1 - fixed acidity
- 2 - volatile acidity
- 3 - citric acid
- 4 - residual sugar
- 5 - chlorides
- 6 - free sulfur dioxide
- 7 - total sulfur dioxide
- 8 - density
- 9 - pH
- 10 - sulphates
- 11 - alcohol

Output variable (based on sensory data):

- 12 - quality (score between 0 and 10)

In [1]:
# Importing the necessary packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Read the dataset
wine = pd.read_csv("./wine_quality/winequality-red.csv", delimiter = ";")
wine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [3]:
# Display the characteristics of dataset
print("Dimensions of dataset are: ", wine.shape)
print("The variables present in dataset are: \n", wine.columns)

Dimensions of dataset are:  (1599, 12)
The variables present in dataset are: 
 Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')


In [4]:
# Seed NumPy's global random generator so that the train-test split below is reproducible
np.random.seed(3000)

In [5]:
# Train-Test Split
# Dependent variable - quality
training, test = train_test_split(wine, test_size = 0.3)

x_trg = training.drop("quality", axis = 1)
y_trg = training["quality"]

x_test = test.drop("quality", axis = 1)
y_test = test["quality"]
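
The split above depends on the global NumPy seed set in the previous cell. As a minimal sketch of an alternative, the seed can be passed to `train_test_split` directly, and `stratify` can keep the imbalanced class proportions similar in both parts. The toy table below is a hypothetical stand-in with made-up values, not the wine data, and the column choice is only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the wine table (hypothetical values, not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=200),
    "pH": rng.normal(3.3, 0.15, size=200),
    # Imbalanced, ordered labels that loosely mimic the quality column
    "quality": np.repeat([4, 5, 6, 7], [10, 90, 80, 20]),
})

X = toy.drop("quality", axis=1)
y = toy["quality"]

# random_state pins the shuffle; stratify keeps class proportions in both parts
X_trg, X_test, y_trg, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3000, stratify=y
)

print(y_trg.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

Stratifying requires at least two samples per class, which matters for the rare quality levels in this dataset.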

### Creating a Gradient Boosting model

In [6]:
# Model building - Gradient Boosting
wine_grad = GradientBoostingClassifier(random_state = 0)

# Fit the model
wine_grad.fit(x_trg, y_trg)
print("Accuracy of Gradient Boosting model on training set is: ", wine_grad.score(x_trg, y_trg))
print("Accuracy of Gradient Boosting model on test set is: ", wine_grad.score(x_test, y_test))

Accuracy of Gradient Boosting model on training set is:  0.9115281501340483
Accuracy of Gradient Boosting model on test set is:  0.6729166666666667


In [7]:
# Determine the accuracy of Gradient Boosting model via Confusion matrix
# Prediction via Gradient Boosting model
wine_grad_pred = wine_grad.predict(x_test)

# Compute the accuracy of prediction
wine_grad_acc_score = accuracy_score(y_test, wine_grad_pred)
print("Accuracy of Gradient Boosting model on prediction: ", wine_grad_acc_score)

# Confusion Matrix
wine_grad_results = confusion_matrix(y_test, wine_grad_pred)
print("Confusion Matrix of Gradient Boosting model is: \n", wine_grad_results)

Accuracy of Gradient Boosting model on prediction:  0.6729166666666667
Confusion Matrix of Gradient Boosting model is: 
 [[  0   0   1   1   0   0]
 [  0   0   9   4   0   0]
 [  0   1 176  32   1   0]
 [  1   2  63 119  12   4]
 [  0   0   4  17  27   1]
 [  0   0   0   3   1   1]]
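
Because the quality classes are ordered, a prediction that misses by one level is a milder error than one that misses by several. The sketch below recomputes the accuracy from the confusion matrix printed above and also measures how many predictions fall within one quality level of the truth. It assumes the rows and columns correspond to consecutive quality levels (3 to 8 in this test split), so that index distance equals quality distance.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [0, 0,   1,   1,  0, 0],
    [0, 0,   9,   4,  0, 0],
    [0, 1, 176,  32,  1, 0],
    [1, 2,  63, 119, 12, 4],
    [0, 0,   4,  17, 27, 1],
    [0, 0,   0,   3,  1, 1],
])

total = cm.sum()

# Exact matches sit on the diagonal
accuracy = np.trace(cm) / total

# Row and column index of every cell; |i - j| is the size of the miss
i, j = np.indices(cm.shape)
within_one_count = cm[np.abs(i - j) <= 1].sum()
within_one = within_one_count / total

print(f"accuracy: {accuracy:.4f}")
print(f"within one quality level: {within_one:.4f}")
```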


#### Create a new Gradient Boosting model with Grid Search

In [8]:
# Import the necessary package
from sklearn.model_selection import GridSearchCV

In [9]:
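# Setting the parameters
# Note: for GradientBoostingClassifier, "auto" behaves like "sqrt"; it was deprecated in
# scikit-learn 1.1 and removed in 1.3, so drop it from the list on newer versions
param_grid = {"max_features" : ["auto", "sqrt"],
              "max_depth" : [3, 4],
              "n_estimators" : [50, 100, 200],
              "learning_rate" : [0.05, 0.1]}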

In [10]:
# Model building - new Gradient Boosting model
wine_grad_grid = GradientBoostingClassifier()
wine_grad_CV = GridSearchCV(estimator = wine_grad_grid, param_grid = param_grid, cv = 5)

# Fit the model
wine_grad_result = wine_grad_CV.fit(x_trg, y_trg)
print("Best Parameters are: \n", wine_grad_CV.best_params_)

Best Parameters are: 
 {'learning_rate': 0.1, 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
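
To get a feel for the cost of this search, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that repeats the grid from the cell above; `cv_folds` is a name introduced here for illustration and mirrors `cv = 5` in the `GridSearchCV` call.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above
param_grid = {
    "max_features": ["auto", "sqrt"],
    "max_depth": [3, 4],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1],
}
cv_folds = 5  # mirrors cv = 5 in the GridSearchCV call

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, one more fit on the full training set follows the cross-validation fits.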


In [11]:
# Creating a model with the best parameters found by the grid search
wine_grad_best = GradientBoostingClassifier(max_depth = wine_grad_result.best_params_["max_depth"],
                        max_features = wine_grad_result.best_params_["max_features"],
                        n_estimators = wine_grad_result.best_params_["n_estimators"],
                        learning_rate = wine_grad_result.best_params_["learning_rate"])
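
Rebuilding the model by hand works, but `GridSearchCV` also exposes the winning model directly. The sketch below illustrates this on synthetic data from `make_classification` (a stand-in, not the wine set) with a deliberately small grid; the names `small_grid`, `search` and `best_model` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the wine set)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

small_grid = {"max_depth": [2, 3], "n_estimators": [20, 40]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
search.fit(X, y)

# With the default refit=True, the best candidate is already refit on all of X, y
best_model = search.best_estimator_

print(search.best_params_)
print(best_model.score(X, y))
```

Using `best_estimator_` avoids copying each parameter by hand and keeps any fixed settings, such as `random_state`, that were passed to the base estimator.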

#### Evaluating the model with best parameters

In [12]:
# Fit the model
wine_grad_best.fit(x_trg, y_trg)
print("Accuracy of GB model of best score on training set is: ", wine_grad_best.score(x_trg, y_trg))

Accuracy of GB model of best score on training set is:  1.0


In [13]:
# Prediction via GB model with best scores
wine_grad_pred_2 = wine_grad_best.predict(x_test)
print("Classification Report of GB model with best scores: \n",
      classification_report(y_test, wine_grad_pred_2))

Classification Report of GB model with best scores: 
               precision    recall  f1-score   support

           3       0.00      0.00      0.00         2
           4       0.00      0.00      0.00        13
           5       0.70      0.81      0.75       210
           6       0.69      0.64      0.66       201
           7       0.63      0.55      0.59        49
           8       0.33      0.20      0.25         5

    accuracy                           0.68       480
   macro avg       0.39      0.37      0.38       480
weighted avg       0.66      0.68      0.67       480



In [14]:
# Determine the accuracy of prediction via Confusion Matrix
# Compute the accuracy score of prediction
wine_grad_acc_score_2 = accuracy_score(y_test, wine_grad_pred_2)
print("Accuracy of GB model with best score is: ", wine_grad_acc_score_2)

# Confusion Matrix
wine_grad_results_2 = confusion_matrix(y_test, wine_grad_pred_2)
print("Confusion Matrix of GB model with best score is: \n", wine_grad_results_2)

Accuracy of GB model with best score is:  0.6833333333333333
Confusion Matrix of GB model with best score is: 
 [[  0   0   1   0   1   0]
 [  0   0   8   4   1   0]
 [  0   3 171  34   2   0]
 [  0   0  61 129   9   2]
 [  0   0   3  19  27   0]
 [  0   0   0   1   3   1]]
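
The classification report and this confusion matrix describe the same predictions, so the per-class figures can be derived from the matrix by hand. The following sketch does that with NumPy, assuming the rows and columns are ordered as the labels 3 to 8 shown in the report. Classes that were never predicted get a precision of 0 here.

```python
import numpy as np

labels = [3, 4, 5, 6, 7, 8]

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [0, 0,   1,   0,  1, 0],
    [0, 0,   8,   4,  1, 0],
    [0, 3, 171,  34,  2, 0],
    [0, 0,  61, 129,  9, 2],
    [0, 0,   3,  19, 27, 0],
    [0, 0,   0,   1,  3, 1],
])

tp = np.diag(cm).astype(float)
support = cm.sum(axis=1)    # true samples per class (row sums)
predicted = cm.sum(axis=0)  # predictions per class (column sums)

# Guard against 0/0 for classes that were never predicted
precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

for lab, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{lab}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")

# Macro average treats every class equally; weighted average follows class frequency
macro_f1 = f1.mean()
weighted_f1 = np.average(f1, weights=support)
print(f"macro f1:    {macro_f1:.2f}")
print(f"weighted f1: {weighted_f1:.2f}")
```

The gap between the macro and weighted averages reflects the class imbalance: the two rarest low-quality classes score 0 and pull the macro average down, while the weighted average is dominated by classes 5 and 6.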


#### Creating AdaBoost model

In [15]:
# Model building - AdaBoost
wine_ada = AdaBoostClassifier(random_state = 0)

# Fit the model
wine_ada.fit(x_trg, y_trg)
print("Accuracy of AdaBoost model on training set is: ", wine_ada.score(x_trg, y_trg))
print("Accuracy of AdaBoost model on test set is: ", wine_ada.score(x_test, y_test))

# Prediction via AdaBoost
wine_ada_pred = wine_ada.predict(x_test)

# Compute the accuracy of prediction
wine_ada_acc_score = accuracy_score(y_test, wine_ada_pred)
print("Accuracy of AdaBoost model on prediction is: ", wine_ada_acc_score)

# Confusion Matrix
wine_ada_results_2 = confusion_matrix(y_test, wine_ada_pred)
print("Confusion Matrix of AdaBoost model: \n", wine_ada_results_2)

Accuracy of AdaBoost model on training set is:  0.5442359249329759
Accuracy of AdaBoost model on test set is:  0.56875
Accuracy of AdaBoost model on prediction is:  0.56875
Confusion Matrix of AdaBoost model: 
 [[  0   0   2   0   0   0]
 [  0   0   8   4   0   1]
 [  0   0 181  27   0   2]
 [  0   0 107  92   0   2]
 [  0   0  13  36   0   0]
 [  0   0   0   5   0   0]]


#### Creating Extra Tree model

In [16]:
# Model building - Extra Tree
wine_extratree = ExtraTreesClassifier()

# Fit the model
wine_extratree.fit(x_trg, y_trg)
print("Accuracy of Extra Tree model on training set is: ", wine_extratree.score(x_trg, y_trg))
print("Accuracy of Extra Tree model on test set is: ", wine_extratree.score(x_test, y_test))

# Prediction via Extra Tree
wine_extratree_pred = wine_extratree.predict(x_test)

# Compute the accuracy score on prediction
wine_extratree_acc_score = accuracy_score(y_test, wine_extratree_pred)
print("Accuracy of Extra Tree model on prediction is: ", wine_extratree_acc_score)

# Confusion Matrix
wine_extratree_results = confusion_matrix(y_test, wine_extratree_pred)
print("Confusion Matrix of Extra Tree model: \n", wine_extratree_results)

Accuracy of Extra Tree model on training set is:  1.0
Accuracy of Extra Tree model on test set is:  0.6916666666666667
Accuracy of Extra Tree model on prediction is:  0.6916666666666667
Confusion Matrix of Extra Tree model: 
 [[  0   0   1   1   0   0]
 [  0   0   9   4   0   0]
 [  0   1 176  33   0   0]
 [  0   0  64 128   9   0]
 [  0   0   4  18  27   0]
 [  0   0   0   1   3   1]]


#### Creating Random Forest model

In [17]:
# Model building - Random Forest
wine_forest = RandomForestClassifier(random_state = 0)

# Fit the model
wine_forest.fit(x_trg, y_trg)
print("Accuracy of Random Forest model on training set is: ", wine_forest.score(x_trg, y_trg))
print("Accuracy of Random Forest model on test set is: ", wine_forest.score(x_test, y_test))

# Prediction via Random Forest
wine_forest_pred = wine_forest.predict(x_test)

# Compute the accuracy of prediction
wine_forest_acc_score = accuracy_score(y_test, wine_forest_pred)
print("Accuracy score of Random Forest on prediction is: ", wine_forest_acc_score)

# Confusion Matrix
wine_forest_results = confusion_matrix(y_test, wine_forest_pred)
print("Confusion Matrix of Random Forest is: \n", wine_forest_results)

Accuracy of Random Forest model on training set is:  1.0
Accuracy of Random Forest model on test set is:  0.69375
Accuracy score of Random Forest on prediction is:  0.69375
Confusion Matrix of Random Forest is: 
 [[  0   0   1   1   0   0]
 [  0   0   9   4   0   0]
 [  0   0 175  35   0   0]
 [  0   1  64 127   9   0]
 [  0   0   4  15  30   0]
 [  0   0   0   2   2   1]]


#### Creating Bagging model

In [18]:
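# Model building - Bagging
# The base estimator is left at its default (a decision tree); the explicit
# base_estimator argument was renamed to estimator in newer scikit-learn releases
wine_bag = BaggingClassifier(n_estimators = 10, max_samples = 1.0, max_features = 1.0,
                            bootstrap = True)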

# Fit the model
wine_bag.fit(x_trg, y_trg)
print("Accuracy of Bagging model on training set is: ", wine_bag.score(x_trg, y_trg))
print("Accuracy of Bagging model on test set is: ", wine_bag.score(x_test, y_test))

# Predict via Bagging model
wine_bag_pred = wine_bag.predict(x_test)

# Compute the accuracy score of prediction
wine_bag_acc_score = accuracy_score(y_test, wine_bag_pred)
print("Accuracy score of Bagging model on prediction is: ", wine_bag_acc_score)

# Confusion Matrix
wine_bag_results = confusion_matrix(y_test, wine_bag_pred)
print("Confusion Matrix of Bagging model is: \n", wine_bag_results)

Accuracy of Bagging model on training set is:  0.9830205540661304
Accuracy of Bagging model on test set is:  0.66875
Accuracy score of Bagging model on prediction is:  0.66875
Confusion Matrix of Bagging model is: 
 [[  0   0   1   1   0   0]
 [  0   2   6   5   0   0]
 [  0   2 171  37   0   0]
 [  0   2  72 119   8   0]
 [  0   0   5  15  29   0]
 [  0   0   0   4   1   0]]


#### Creating Decision Tree model

In [19]:
# Model building - Decision Tree
wine_tree = DecisionTreeClassifier(random_state = 0)

# Fit the model
wine_tree.fit(x_trg, y_trg)
print("Accuracy of Decision Tree model on training set is: ", wine_tree.score(x_trg, y_trg))
print("Accuracy of Decision Tree model on test set is: ", wine_tree.score(x_test, y_test))

# Prediction via Decision Tree model
wine_tree_pred = wine_tree.predict(x_test)

# Compute the accuracy score of prediction
wine_tree_acc_score = accuracy_score(y_test, wine_tree_pred)
print("Accuracy score of Decision Tree model on prediction is: ", wine_tree_acc_score)

# Confusion Matrix
wine_tree_results = confusion_matrix(y_test, wine_tree_pred)
print("Confusion Matrix of Decision Tree model is: \n", wine_tree_results)

Accuracy of Decision Tree model on training set is:  1.0
Accuracy of Decision Tree model on test set is:  0.6041666666666666
Accuracy score of Decision Tree model on prediction is:  0.6041666666666666
Confusion Matrix of Decision Tree model is: 
 [[  0   0   1   1   0   0]
 [  0   3   7   1   2   0]
 [  0  18 141  45   6   0]
 [  0  10  42 120  27   2]
 [  0   0   5  19  25   0]
 [  0   0   0   3   1   1]]
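
To compare the models side by side, the test-set accuracies printed in the cells above can be collected into one sorted table. This is only a convenience sketch: the numbers are copied from the outputs, and the dictionary keys are labels chosen here for readability.

```python
import pandas as pd

# Test-set accuracies copied from the cell outputs above
test_accuracy = {
    "Gradient Boosting (default)": 0.6729166666666667,
    "Gradient Boosting (grid search)": 0.6833333333333333,
    "AdaBoost": 0.56875,
    "Extra Trees": 0.6916666666666667,
    "Random Forest": 0.69375,
    "Bagging": 0.66875,
    "Decision Tree": 0.6041666666666666,
}

# Sort from best to worst and round for display
summary = (
    pd.Series(test_accuracy, name="test accuracy")
    .sort_values(ascending=False)
    .round(4)
)
print(summary)
```

With 480 test samples, one prediction is worth about 0.2 percentage points, so the gap between Random Forest and Extra Trees amounts to a single sample. Since several of the models were also fit without a fixed `random_state`, small differences in this ranking should be read with caution.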
