This is the baseline GradientBoostingRegressor with default parameters and its 5-fold cross-validated MAE.

In [None]:
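from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
base = GradientBoostingRegressor()
# scikit-learn reports negated MAE (higher is better), so the sign is flipped back below
scores = cross_val_score(base, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
-scores.mean()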

36775.600316829295

For efficiency, I used RandomizedSearchCV with 100,000 iterations (500,000 fits with 5-fold cross-validation); an exhaustive GridSearchCV over the same search space would take 9,878,400 fits.
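The fit counts can be checked with a few lines of arithmetic. This is a minimal sketch: the per-parameter counts are copied by hand from the search space defined in the next cell, and the names `grid_sizes`, `n_candidates`, `grid_fits`, and `random_fits` are illustrative only.

```python
from math import prod

# Number of candidate values per hyperparameter in the search space
# (counted by hand from the gbr_params dictionary in this section).
grid_sizes = {
    'n_estimators': 7,
    'max_depth': 6,
    'learning_rate': 7,
    'subsample': 5,
    'min_samples_split': 4,
    'min_samples_leaf': 6,
    'min_weight_fraction_leaf': 7,
    'criterion': 2,
    'loss': 4,
}

cv_folds = 5
n_candidates = prod(grid_sizes.values())   # every combination in the full grid
grid_fits = n_candidates * cv_folds        # exhaustive grid search
random_fits = 100_000 * cv_folds           # randomized search with n_iter=100000

print(n_candidates, grid_fits, random_fits)
```

Under these counts the randomized search samples about 5% of the full grid.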

In [None]:
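from sklearn.model_selection import RandomizedSearchCV

gbr = GradientBoostingRegressor()
# Note: some values below are invalid for GradientBoostingRegressor. Candidates that use
# them fail to fit and, with the default error_score=np.nan, are recorded with a NaN score.
gbr_params = {'n_estimators': [100, 120, 140, 160, 170, 180, 200],
              'max_depth': [3, 4, 5, 6, 7, 8],
              'learning_rate': [0, 0.01, 0.001, 0.1, 0.095, 0.4, 0.2],
              'subsample': [0, 0.1, 0.3, 0.8, 1.0],  # must be in (0, 1]; 0 is invalid
              'min_samples_split': [2, 3, 4, 5],
              'min_samples_leaf': [0.1, 0.3, 0.5, 0.7, 0.9, 1],  # floats are fractions of the samples
              'min_weight_fraction_leaf': [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1],  # must be in [0, 0.5]
              'criterion': ['friedman_mse', 'squared_error'],
              # only 'huber' is a regressor loss here; the others belong to GradientBoostingClassifier
              'loss': ['huber', 'log_loss', 'deviance', 'exponential']}
gbr_gridsearch = RandomizedSearchCV(gbr, gbr_params, n_iter=100000, cv=5, verbose=1,
                                    n_jobs=-1, scoring='neg_mean_absolute_error')
gbr_gridsearch.fit(X_train, y_train)
print('Best gbr params:', gbr_gridsearch.best_params_)
print('Best gbr score:', gbr_gridsearch.best_score_)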

Best gbr params: {'subsample': 1.0, 'n_estimators': 160, 'min_weight_fraction_leaf': 0, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 3, 'loss': 'huber', 'learning_rate': 0.095, 'criterion': 'squared_error'}
Best gbr score: -36322.8451418078

After getting the initial best parameters, I used GridSearchCV to refine them over a narrower range. Over a couple of rounds of tuning, n_estimators moved to a higher value than the first round had suggested. Because the best min_samples_split (5) sat at the upper edge of its range, I extended that range.
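The "best value on the edge of the range" check can also be automated. The helper below is a hypothetical sketch, not part of scikit-learn: `params_on_edge` is a name introduced here for illustration, and the example reuses a subset of the first-round grid and its reported best parameters.

```python
def params_on_edge(best_params, grid):
    """Return {name: 'lower' | 'upper'} for numeric parameters whose
    best value sits on the boundary of the searched range."""
    flagged = {}
    for name, best in best_params.items():
        values = list(grid.get(name, []))
        # Skip single-value grids and non-numeric (categorical) parameters.
        if len(values) < 2 or not all(isinstance(v, (int, float)) for v in values):
            continue
        if best == min(values):
            flagged[name] = 'lower'
        elif best == max(values):
            flagged[name] = 'upper'
    return flagged


# Subset of the first-round search space and the best values it reported.
first_round_grid = {
    'n_estimators': [100, 120, 140, 160, 170, 180, 200],
    'learning_rate': [0, 0.01, 0.001, 0.1, 0.095, 0.4, 0.2],
    'min_samples_split': [2, 3, 4, 5],
}
first_round_best = {'n_estimators': 160, 'learning_rate': 0.095, 'min_samples_split': 5}

edge_params = params_on_edge(first_round_best, first_round_grid)
print(edge_params)
```

A flagged parameter only suggests that the optimum may lie outside the searched range. Parameters with a hard bound, such as subsample at 1.0, will be flagged even though the range cannot be extended.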

In [None]:
from sklearn.model_selection import GridSearchCV

gbr = GradientBoostingRegressor()
# criterion is not searched in this round, so it stays at its default ('friedman_mse')
gbr_params = {'n_estimators': range(172, 190),
              'max_depth': [2, 3, 4, 5],
              'learning_rate': [0.094, 0.095, 0.096],
              'subsample': [1.0],
              'min_samples_split': range(4, 11),
              'min_samples_leaf': [1],
              'min_weight_fraction_leaf': [0],
              'loss': ['huber']}
gbr_gridsearch = GridSearchCV(gbr, gbr_params, cv=5, verbose=1, n_jobs=-1,
                              scoring='neg_mean_absolute_error')
gbr_gridsearch.fit(X_train, y_train)
print('Best gbr params:', gbr_gridsearch.best_params_)
print('Best gbr score:', gbr_gridsearch.best_score_)

Best gbr params: {'learning_rate': 0.095, 'loss': 'huber', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 7, 'min_weight_fraction_leaf': 0, 'n_estimators': 189, 'subsample': 1.0}
Best gbr score: -36216.57778386479


Tuning beyond this round only made the cross-validated MAE on the training data worse, so I decided to stop there.
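To put the tuning gains in perspective, the three cross-validated MAE values printed in this section can be compared directly. This is a small sketch; `pct_improvement` is a helper name introduced here for illustration, and the numbers are copied from the outputs above.

```python
# Cross-validated MAE values copied from the outputs in this section.
baseline_mae = 36775.600316829295       # default GradientBoostingRegressor
random_search_mae = 36322.8451418078    # best RandomizedSearchCV candidate
grid_search_mae = 36216.57778386479     # best GridSearchCV candidate


def pct_improvement(before, after):
    """Relative reduction in MAE, in percent."""
    return 100 * (before - after) / before


random_gain = pct_improvement(baseline_mae, random_search_mae)
total_gain = pct_improvement(baseline_mae, grid_search_mae)
print(f"randomized search: {random_gain:.2f}%  after grid search: {total_gain:.2f}%")
```

The randomized search lowers the MAE by roughly 1.2% relative to the baseline, and the follow-up grid search brings the total reduction to roughly 1.5%.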