# XGBoost Regressor Model

### Imports

In [14]:
import sys
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor 
from sklearn.feature_selection import SelectPercentile
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, train_test_split, KFold, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd

features = pd.read_csv("./data/prepped/modeling_features.csv").drop(["Unnamed: 0"], axis=1)
labels = pd.read_csv("./data/prepped/modeling_outcome.csv").drop(["Unnamed: 0"], axis=1)

train_features, test_features, train_outcome, test_outcome = train_test_split(
    features,
    labels,
    test_size=0.25,
    random_state=42
)

### Initial Model

The model used is XGBoost's Extreme Gradient Boosted Regressor, `XGBRegressor`, documented [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html).

In [15]:
xgb_base = XGBRegressor(nthread=-1, random_state=42)
xgb_base.fit(train_features, train_outcome)

# score() returns R^2 on the held-out test set, not the predictions themselves
base_score = xgb_base.score(test_features, test_outcome)
print(base_score)

0.07748200692578133
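
For a regressor, `score` returns the coefficient of determination R², so the value above means the baseline explains only about 8% of the variance in the test outcome. As a minimal sketch of what that number measures, the snippet below computes R² and RMSE by hand with NumPy. The arrays are made-up illustration values, not the modeling data.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2)) # root mean squared error

print(round(float(r2), 3), round(float(rmse), 3))
```

An R² near 0 means the model does little better than always predicting the mean of the outcome, which is the baseline every later model in this notebook has to beat.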


### Tree-Based Feature Selection

Through domain research and exploratory data analysis, we determined that some features are less important than others. Here we attempt to reduce the number of features used in our model by keeping only the features that are more significant than the rest, using Tree-Based Feature Selection.

In [16]:
from sklearn.feature_selection import SelectFromModel

feature_model = SelectFromModel(xgb_base, prefit=True)

# Transform into new training and testing feature sets
refined_train_features = feature_model.transform(train_features)
refined_test_features = feature_model.transform(test_features)

# Retrain with refined features
xgb_refined = XGBRegressor(nthread=-1, random_state=42)
xgb_refined.fit(refined_train_features, train_outcome)

print("Score: " + str(xgb_refined.score(refined_test_features, test_outcome)))
print("Feature Importance")
print(list(xgb_refined.feature_importances_))
# NOTE: zip() pairs importances with the leading columns of `features` by position;
# feature_model.get_support() gives the mask of the columns actually kept.
feature_tuples = [
    (feature, round(importance, 2))
    for feature, importance in zip(features.columns, xgb_refined.feature_importances_)
]
refined_features = list(dict(feature_tuples))
print("Refined Feature Lists")
print(refined_features)

Score: 0.05672566781242483
Feature Importance
[0.19009584, 0.1086262, 0.052715655, 0.0686901, 0.13258786, 0.08306709, 0.052715655, 0.041533545, 0.07827476, 0.047923323, 0.095846646, 0.047923323]
Refined Feature Lists
['abv', 'ibu', 'diff_g', 'boil_time', 'efficiency', 'ferm_total_weight', 'ferm_type_base_malt', 'ferm_type_crystal_malt', 'ferm_type_roasted_malt', 'ferm_type_other', 'ferm_type_extract', 'ferm_type_sugar']
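
One caveat with the listing above: `zip` pairs the importances with the columns of `features` by position, so the names are only right if the selector happened to keep the leading columns. A safer mapping uses the selector's boolean mask, which in this notebook would come from `feature_model.get_support()`. The sketch below imitates that mask with NumPy, assuming `SelectFromModel`'s default behaviour of keeping features whose importance is at or above the mean importance. The column names and importance values are hypothetical.

```python
import numpy as np

# Hypothetical column names and importances, for illustration only
columns = np.array(["abv", "ibu", "boil_time", "efficiency", "hypothetical_noise"])
importances = np.array([0.40, 0.30, 0.05, 0.15, 0.10])

# Assumed default rule: keep features with importance >= mean importance (0.2 here)
mask = importances >= importances.mean()

# Index the column names with the mask so names and kept features stay aligned
selected_columns = columns[mask].tolist()
print(selected_columns)
```

In the notebook itself the equivalent line would be `features.columns[feature_model.get_support()]`, which returns the names of the columns that `transform` keeps.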


### Cross Validation

Cross validation was run over a range of hyperparameters for our model. After reading more about XGBRegressor, I settled on seven parameters to tune in order to build a better model: n_estimators, max_depth, learning_rate, gamma, subsample, colsample_bytree and min_child_weight. These matter because they control the number and depth of the trees, the step size of each boosting round, and how many rows and columns are sampled for each tree. The candidate values start from my initial model and vary around it in an attempt to capture a better fit within the range of parameter values.

Parameters for the model can be found [here](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#parameters-for-linear-booster-booster-gblinear).

In [39]:
# Number of trees
n_estimators = [int(x) for x in np.linspace(start = 300, stop = 800, num = 10)]

# Maximum number of levels in tree [0, inf]
max_depth = [int(x) for x in np.linspace(80, 120, num = 10)]

# Step size shrinkage applied after each boosting round [0, 1]
learning_rate = [.009, .01, .09] #[i/10.0 for i in range(1,11)]

# Minimum loss reduction required to make a further partition on a leaf node of the tree [0, inf]
gamma = [.27, .3, .34] #[i/10.0 for i in range(3,6)]

# Subsample ratio of the training instances. (0,1]
subsample = [i/10.0 for i in range(6,11)]

# Subsample ratio of columns when constructing each tree. (0,1]
colsample_bytree = [.35, .4, .45] #[i/10.0 for i in range(1,8)]
    
# Minimum sum of instance weight (hessian) needed in a child. [0,∞]
min_child_weight = [.1, .15, .2, .25, 1]


# Create the random grid
random_grid = {
        'objective':['reg:linear'],
        'learning_rate': learning_rate,
        'min_child_weight': min_child_weight,
        'gamma': gamma,
        'subsample': subsample,
        'silent': [1],
        'colsample_bytree': colsample_bytree,
        'max_depth': max_depth,
        'n_estimators': n_estimators
        }

cv_model = RandomizedSearchCV(
    estimator=XGBRegressor(nthread=-1),
    param_distributions=random_grid,
    n_iter=100, scoring='r2', cv=5,
    verbose=True, random_state=42, n_jobs=-1
)

cv_model.fit(refined_train_features, train_outcome)

print(cv_model.score(refined_test_features, test_outcome))

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  3.2min finished


0.10205588301716784
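
To put `n_iter = 100` in context, it helps to count how many distinct combinations `random_grid` contains. The sketch below does that from the list lengths alone (copied from the cell above), so it runs without fitting anything. The names `grid_sizes`, `n_combinations` and `coverage` are illustrative.

```python
from math import prod

# Number of candidate values per hyperparameter, as defined in random_grid
grid_sizes = {
    "objective": 1,
    "learning_rate": 3,
    "min_child_weight": 5,
    "gamma": 3,
    "subsample": 5,
    "silent": 1,
    "colsample_bytree": 3,
    "max_depth": 10,
    "n_estimators": 10,
}

n_combinations = prod(grid_sizes.values())  # size of the full grid
n_iter = 100                                # candidates sampled by the random search
coverage = n_iter / n_combinations          # fraction of the grid actually tried

print(n_combinations, f"{coverage:.2%}")
```

The random search samples well under 1% of the possible combinations, so the score above reflects a small sample of the grid rather than the best it can offer.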


In [None]:
cv_model.best_params_

# cv_model.best_score_

### GridSearch Cross Validation

In [44]:
folds = KFold(n_splits=10, shuffle=True, random_state=42)

pipeline = make_pipeline(
    MinMaxScaler(),
    SelectPercentile(),
    XGBRegressor(nthread=-1)
)


# Number of trees
n_estimators = [500, 530, 560]

# Maximum number of levels in tree [0, inf]
max_depth = [85, 93, 100]

# Step size shrinkage applied after each boosting round [0, 1]
learning_rate = [.009, .01, .09]

# Minimum loss reduction required to make a further partition on a leaf node of the tree [0, inf]
gamma = [.3, .35, .4]

# Subsample ratio of the training instances. (0,1]
subsample = [.9, 1.0]

# Subsample ratio of columns when constructing each tree. (0,1]
colsample_bytree = [.35, .4, .45]
    
# Minimum sum of instance weight (hessian) needed in a child. [0,∞]
min_child_weight = [.05 ,.1, .15, .2]


pipeline_params = {
    "selectpercentile__percentile": [70, 80, 90],
    "xgbregressor__n_estimators": n_estimators,
    "xgbregressor__subsample": subsample,
    "xgbregressor__gamma": gamma,
    "xgbregressor__learning_rate": learning_rate,
    "xgbregressor__max_depth": max_depth,
    "xgbregressor__colsample_bytree": colsample_bytree,
    "xgbregressor__min_child_weight": min_child_weight,
}

model = GridSearchCV(pipeline, pipeline_params, cv=folds, n_jobs=-1, verbose=True)
model.fit(train_features, train_outcome)
score = model.score(test_features, test_outcome)

print("model score:", score)
print("best params:", model.best_params_)

Fitting 10 folds for each of 5832 candidates, totalling 58320 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   17.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed: 22.9min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed: 29.0min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed: 37.8min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed: 47.0min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed: 57.4min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed: 69.3min


KeyboardInterrupt: 
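
The interrupted run shows how quickly an exhaustive grid grows, and a quick count before calling `fit` makes the cost visible up front. The sketch below reproduces the 5832 candidates and 58320 fits reported in the log from the list lengths in `pipeline_params`, then extrapolates the total runtime linearly from the last progress line (7184 tasks in 69.3 minutes). The linear extrapolation is a rough assumption, since fit time varies with settings such as `n_estimators` and `max_depth`.

```python
from math import prod

# Number of candidate values per entry of pipeline_params
param_counts = {
    "selectpercentile__percentile": 3,
    "xgbregressor__n_estimators": 3,
    "xgbregressor__subsample": 2,
    "xgbregressor__gamma": 3,
    "xgbregressor__learning_rate": 3,
    "xgbregressor__max_depth": 3,
    "xgbregressor__colsample_bytree": 3,
    "xgbregressor__min_child_weight": 4,
}
n_folds = 10  # KFold(n_splits=10)

n_candidates = prod(param_counts.values())
total_fits = n_candidates * n_folds

# Last progress line before the interrupt: 7184 tasks in 69.3 minutes
tasks_done, minutes_elapsed = 7184, 69.3
estimated_hours = total_fits / tasks_done * minutes_elapsed / 60

print(n_candidates, total_fits, round(estimated_hours, 1))
```

Under that assumption the full search would take roughly nine hours on the same machine, so trimming one or two of the lists, or going back to `RandomizedSearchCV` with a fixed `n_iter`, may be worth considering before rerunning.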