# Gradient Boosting Machine (GBM) model. 

GBM is a powerful and versatile machine learning algorithm that has proven to be effective in various prediction tasks. Some advantages of this model include:

- High Predictive Accuracy: GBM excels in capturing complex patterns and interactions in data, making it well-suited for predicting student performance. It combines multiple weak prediction models (decision trees) and iteratively improves them to minimize prediction errors, resulting in a highly accurate ensemble model.

- Handles Different Types of Variables: GBM can handle a mix of categorical and numerical variables, which is often the case in student performance prediction. It can automatically handle feature interactions, non-linear relationships, and missing data, reducing the need for extensive data preprocessing.

- Robust to Outliers and Noise: GBM is less affected by outliers and noisy data compared to some other algorithms. It achieves this robustness by using decision trees and ensemble methods, which reduce the impact of individual instances.

- Flexible Hyperparameter Tuning: GBM models offer various hyperparameters that can be tuned to optimize performance. This flexibility allows you to fine-tune the model based on the specific characteristics of your student performance prediction task.

# Portuguese

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 60)

por = pd.read_csv('output/encoded_por.csv')
por.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,score,school_GP,school_MS,sex_F,sex_M,address_R,address_U,famsize_GT3,famsize_LE3,Pstatus_A,Pstatus_T,Mjob_at_home,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_at_home,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,18,4,4,2,2,0,1,0,0,0,1,1,0,0,4,3,4,1,1,3,4,7.33,1,0,1,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0
1,17,1,1,1,2,0,0,1,0,0,0,1,1,0,5,3,3,1,1,3,2,10.33,1,0,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0
2,15,1,1,1,2,0,1,0,0,0,1,1,1,0,4,3,2,2,3,3,6,12.33,1,0,1,0,0,1,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0
3,15,4,2,1,3,0,0,1,0,1,1,1,1,1,3,2,2,1,1,5,0,14.0,1,0,1,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0
4,16,3,3,1,2,0,0,1,0,0,1,1,0,0,4,3,2,1,2,5,0,12.33,1,0,1,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0


## Train-test split

In [230]:
from sklearn.model_selection import train_test_split

y = por['score']
X = por.drop(['score'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.1, 
                                                    random_state = 30)

In [3]:
print('Sample sizes\n')
print('Training set')
print('   predicted (y) \t', y_train.shape)
print('   predictors (X) \t', X_train.shape, '\n')
print('Validation set')
print('   predicted (y) \t', y_test.shape)
print('   predictors (X) \t', X_test.shape, '\n')

Sample sizes

Training set
   predicted (y) 	 (519,)
   predictors (X) 	 (519, 48) 

Validation set
   predicted (y) 	 (130,)
   predictors (X) 	 (130, 48) 



## Models

### Basic Gradient Boosting and Random Forest

In [231]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate = 0.05)
model.fit(X_train, y_train)
print('Gradient Boosting Score  %0f' % model.score(X_test, y_test))


from sklearn.ensemble import RandomForestRegressor
RFmodel = RandomForestRegressor()
RFmodel.fit(X_train, y_train)
print('Random Forest Score \t %0f' % RFmodel.score(X_test, y_test))

Gradient Boosting Score  0.380318
Random Forest Score 	 0.389118


### Retrieving statistically significant variables:

In [232]:
import statsmodels.api as sm

linear = sm.OLS(endog = por['score'], exog = por.drop('score', axis = 1))
results = linear.fit()
summary = results.summary()

results_as_html = summary.tables[1].as_html()
summary = pd.read_html(results_as_html, header=0, index_col=0)[0]
significant_vars = list(summary[summary['P>|t|'] < 0.05].index)
print(significant_vars)

['studytime', 'failures', 'schoolsup', 'higher', 'health', 'absences', 'school_GP', 'school_MS', 'sex_F', 'sex_M', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Mjob_health', 'Mjob_services', 'Mjob_teacher', 'Fjob_other', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other']


### Basic Gradient Boosting and Random Forest with selected significant variables

In [233]:
from sklearn.model_selection import train_test_split

y = por['score']
X = por[significant_vars]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.1, 
                                                    random_state = 30)

In [234]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

model = GradientBoostingRegressor(max_depth=2, learning_rate = 0.05, n_estimators=1000)
model.fit(X_train, y_train)
print('Gradient Boosting Score  %0f' % model.score(X_test, y_test))
print('Mean Squared Error: \t', round(mean_squared_error(y_test, model.predict(X_test)), 5))

Gradient Boosting Score  0.451227
Mean Squared Error: 	 4.75049


For comparison, let's create a Random Forest model to see how accurate it is. 

In [235]:
from sklearn.ensemble import RandomForestRegressor
RFmodel = RandomForestRegressor()
RFmodel.fit(X_train, y_train)
print('Random Forest Score \t %0f' % RFmodel.score(X_test, y_test))
print('Mean Squared Error: \t', round(mean_squared_error(y_test, RFmodel.predict(X_test)), 5))

Random Forest Score 	 0.386007
Mean Squared Error: 	 5.31507


As the accuracy of these models only considers exact predictions, let's try to compute it with an error margin. 

In [236]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score

model1 = GradientBoostingRegressor(max_depth=2, learning_rate = 0.05, n_estimators=1000)
model1.fit(X_train, y_train)
model2 = GradientBoostingRegressor(max_depth=3, learning_rate = 0.05, n_estimators=1000)
model2.fit(X_train, y_train)

y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)
error_margin = 1 # error margin of 5% 
within_margin1 = (abs(y_pred1 - y_test) <= error_margin).mean()
within_margin2 = (abs(y_pred2 - y_test) <= error_margin).mean()

score1 = within_margin1 * 100
score2 = within_margin2 * 100

print('(depth = 2, error margin = {}) Score: \t'.format(error_margin), score1)
print('(depth = 3, error margin = {}) Score: \t'.format(error_margin), score2)

(depth = 2, error margin = 1) Score: 	 32.30769230769231
(depth = 3, error margin = 1) Score: 	 38.46153846153847


In [237]:
def score_within_margin(fitted_model, X_test, y_test, margin):
    y_pred = fitted_model.predict(X_test)
    within_margin = (abs(y_pred - y_test) <= margin).mean()
    score = within_margin * 100
    print('(Error margin = {}) Score: \t'.format(margin), '{}%'.format(score))

In [238]:
score_within_margin(model, X_test, y_test, margin = 1)
score_within_margin(model, X_test, y_test, margin = 1.5)
score_within_margin(model, X_test, y_test, margin = 2)

(Error margin = 1) Score: 	 32.30769230769231%
(Error margin = 1.5) Score: 	 49.23076923076923%
(Error margin = 2) Score: 	 66.15384615384615%


### Improving Gradient Boosting

In Gradient Boosting models, the two main hyperparameters are the number of estimators, which defines how many branches used to train the model and how deep it is, and the learning rate, which sets the importance of each node. 

In [192]:
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.express as px

branches = []
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

for learning_rate in learning_rates:
    model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=learning_rate)
    model.fit(X_train, y_train)
    errors = [mean_squared_error(y_test, y_pred) for y_pred in model.staged_predict(X_test)]
    best_number_of_estimators = np.argmin(errors)
    branches.append(best_number_of_estimators + 1)

fig = px.line(learning_rates, branches, 
        height = 500, width = 700)
fig.update_layout(xaxis_title = 'Learning rate', 
                  yaxis_title = 'Number of estimators')

As we can see in the graph, there is a trade-off between the number of estimators and the learning rate.

In [193]:
model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.5)
model.fit(X_train, y_train)
errors = [mean_squared_error(y_test, y_pred) for y_pred in model.staged_predict(X_test)]

fig = px.line(errors[:200], 
        height = 500, width = 900)
fig.update_layout(xaxis_title = 'Number of estimators', 
                  yaxis_title = 'Mean Squared Error', 
                  showlegend = False)

The mean squared error first decreases as the number of estimators is increased; then the error begins rising again after a optimum level. 

In [194]:
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

scores = []
learning_rates = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1]

for learning_rate in learning_rates:
    model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=learning_rate)
    model.fit(X_train, y_train)
    scores.append([learning_rate, model.score(X_test, y_test)])

print(pd.DataFrame(scores, columns = ['learning rate', 'score']).to_string(index = False))

 learning rate     score
          0.01  0.390535
          0.05  0.401486
          0.10  0.336018
          0.25  0.235441
          0.50  0.021986
          0.75 -0.170089
          1.00 -0.340070


In [195]:
for leaning_rate in list(np.linspace(0.01, 0.1, num=100)):
    model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=learning_rate)
    model.fit(X_train, y_train)
    scores.append([learning_rate, model.score(X_test, y_test)])

scr = pd.DataFrame(scores)
scr_max = scr[1].idxmax()
print('best learning rate: \t', scr.loc[scr_max, 0])

best learning rate: 	 0.05


In [196]:
model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.06)
model.fit(X_train, y_train)

print('Gradient Boosting Score %0f' % model.score(X_test, y_test))

Gradient Boosting Score 0.411592


### Grid Search 

Instead of tuning the model's hyperparemeters manually, one can use Grid Search methods to find the best combination of parameters. 

In [259]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

model = GradientBoostingRegressor()

param_grid = {
    'max_depth': [1, 2,],
    'n_estimators': [500, 1000],
    'learning_rate': [0.01, 0.05, 0.06, 0.1, 0.5]
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Retrieve the best parameter combination
best_params = grid_search.best_params_
print(best_params)

{'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 500}


In [258]:
model = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.06)
model.fit(X_train, y_train)

print('Gradient Boosting Score \t %0f' % model.score(X_test, y_test))
print('Mean squared error: \t\t', mean_squared_error(y_test, model.predict(X_test)))
score_within_margin(model, X_test, y_test, margin = 1)
score_within_margin(model, X_test, y_test, margin = 1.5)
score_within_margin(model, X_test, y_test, margin = 2)

Gradient Boosting Score 	 0.441768
Mean squared error: 		 4.83237141823588
(Error margin = 1) Score: 	 30.76923076923077%
(Error margin = 1.5) Score: 	 47.69230769230769%
(Error margin = 2) Score: 	 67.6923076923077%
