Final K-Fold Score #60

Closed
JoshuaC3 opened this issue Oct 9, 2017 · 10 comments

Comments

@JoshuaC3

JoshuaC3 commented Oct 9, 2017

One thing I have been unable to figure out is how to get a k-fold cross validation score for the whole ensemble.

I have used sklearn's built-in cross_val_score, but this is very slow (I think because it ends up running a cross-validation loop inside the ensemble's own cv loop!).

How can I get a final k-fold cross validation score for the final ensemble please? (great package btw :) )
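
Roughly, this is the kind of thing I mean (a minimal sketch with placeholder models and data, not my actual setup):

from mlens.ensemble import SuperLearner
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Placeholder data and models; the real setup uses my own estimators and scorer.
X, y = make_friedman1(n_samples=1000, random_state=0)
mae = make_scorer(mean_absolute_error, greater_is_better=False)

ensemble = SuperLearner(folds=2, random_state=25)
ensemble.add([Lasso(), RandomForestRegressor()])
ensemble.add_meta(Lasso())

# Each outer fold refits the whole ensemble, which nests the ensemble's own
# internal k-fold loop inside sklearn's cv loop -- hence the slowdown.
scores = cross_val_score(ensemble, X, y, scoring=mae, cv=5)
print(-scores.mean())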

@flennerhag
Owner

If you want a cv score for the final predictions of the ensemble, then you simply need to run a cross_val_score on the ensemble, as you've done. I'd imagine Scikit-learn is slow here since there would be several nested cv loops, which it doesn't handle well.

I haven't implemented a similar function in mlens, but you should be able to use the Evaluator: just don't pass any parameters in the evaluate call. The Evaluator might complain and give you a warning, but it will run just fine and effectively do a cross_val_score for you.

You can also pass other estimators in the evaluate call if you want to benchmark your ensemble in one go.

I'm not sure how much of a speedup you're likely to see, could be anything from just a little to quite a bit. If you give it a try, please report back. Would be interesting to know : )
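
Something along these lines should do it (a rough sketch with placeholder models; the exact Evaluator call signature varies a bit between versions, so treat the argument names as approximate):

from mlens.ensemble import SuperLearner
from mlens.model_selection import Evaluator
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

X, y = make_friedman1(n_samples=1000, random_state=0)

ensemble = SuperLearner(folds=2, random_state=25)
ensemble.add([Lasso(), RandomForestRegressor()])
ensemble.add_meta(Lasso())

# Depending on the mlens version the scorer may need wrapping with
# mlens.metrics.make_scorer; a plain callable is used here for brevity.
evaluator = Evaluator(mean_absolute_error, cv=2, random_state=25, verbose=True)

# No parameter distributions are passed, so the Evaluator may warn, but it
# effectively runs a plain cross-validated benchmark of each estimator it is
# given -- the ensemble itself plus any reference models.
evaluator.evaluate(X, y, [ensemble, RandomForestRegressor()], param_dicts={})

# The cv results are stored on the evaluator instance (attribute name varies
# by version, e.g. results).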

@JoshuaC3
Author

@flennerhag - Yes, using sklearn's cross_val_score seemed to cause exactly that issue. Training time was around 14 minutes (compared to 1 min for LightGBM and up to 3 mins for XGBoost).

IMHO an ensemble CV score (and maybe OOF prediction) would be one of the most useful features to add to the package. Ultimately, it would tell you if your ensemble was working well or not :)

The Evaluator works for getting a CV score (though it feels like an unnatural way to do it). It gave a good speed-up: the whole model ran in under 4 mins vs the previous 14 mins.

@flennerhag
Owner

@JoshuaC3 thanks for the report and glad it sped things up. I agree with you that the Evaluator is not ideal for standard benchmarking. It should actually be pretty easy to pull the code out of the Evaluator and create a benchmark function that accepts a list of estimators and benchmarks them. However, this should be done in the 0.2.0 code base (FYI #57).

You can get cv scores for the base learners by passing a scoring function to the ensemble when you create it or in the add call.

It's actually possible to get cv-scores for the final ensemble as well during fitting if you don't declare the final layer as a meta layer (e.g. just use the standard add method with meta=False) but instead declare it as a stacking layer and pass a scorer.

This would fit a copy of the meta learner on each fold and once on all data. The fold jobs are unnecessary from a training point of view, but they do allow you to get CV scores for the ensemble.
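
In code, roughly (a sketch with placeholder estimators; the attribute holding the fold scores differs between versions):

from mlens.ensemble import SuperLearner
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

X, y = make_friedman1(n_samples=2000, random_state=0)

# Passing a scorer records fold-wise cv scores for every layer.
ensemble = SuperLearner(scorer=mean_absolute_error, folds=2, random_state=25)

# Base layer.
ensemble.add([Lasso(), RandomForestRegressor()])

# Final layer added with a standard add (not add_meta), so it is also fitted
# on each fold and gets a cv score of its own.
ensemble.add([Lasso()])

ensemble.fit(X, y)

# Fold scores are stored on the fitted ensemble; the attribute name differs
# between versions (scores_ in 0.1.x, data in later releases).
print(getattr(ensemble, "data", getattr(ensemble, "scores_", None)))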

@JoshuaC3
Author

JoshuaC3 commented Oct 13, 2017

@flennerhag - This sounds like a much more intuitive way to get that as an output.

I gave the non-meta layer a go and it returned a cv score for the last layer as expected. It ran in 1.5 mins (about half the time!), so my gut feeling is that it doesn't run the cv=2 default within the ensemble object (SuperLearner, Blender, Subsemble)? Is this correct?

In addition, I have started doing some benchmarking of mlens against some other stacking/ensembling packages. The functionality included is far superior, but on my relatively simple models, as shown below, it performs somewhat poorly.

import xgboost as xgb
import lightgbm as lgb
from mlens.ensemble import SuperLearner

# mae is my MAE scoring function and fair_obj my custom objective; X and y are
# a pandas DataFrame/Series of training data.
ensemble = SuperLearner(scorer=mae, folds=2, random_state=25, array_check=0, verbose=True)

ensemble.add([xgb.XGBRegressor(objective=fair_obj, colsample_bytree=0.9, subsample=0.6,
                               learning_rate=0.075, max_depth=9, n_estimators=200),
              lgb.LGBMRegressor(objective='fair', n_estimators=125)])

# Final layer
ensemble.add([xgb.XGBRegressor(objective=fair_obj)])

ensemble.fit(X.values, y.values)

cv results: 238.91 MAE

With MLXtend StackingRegressor

from mlxtend.regressor import StackingRegressor

...

xgbr = xgb.XGBRegressor(objective=fair_obj, colsample_bytree=0.9, subsample=0.6,
                        learning_rate=0.075, max_depth=9, n_estimators=200)
lgbr = lgb.LGBMRegressor(objective='fair', n_estimators=125)
xgbrm = xgb.XGBRegressor(objective=fair_obj)

stack = StackingRegressor(regressors=(xgbr, lgbr),
                          meta_regressor=xgbrm)

cv results: 209.04 MAE

The MLXtend results on the hold-out set are also much better.

Am I using mlens correctly, i.e. am I fitting it on the whole set correctly? Or is mlens doing something rather different here? Many thanks again.

@flennerhag
Owner

Great that it helped! I'm not sure I quite follow what you mean by the default cv. When you use a stacked layer as meta, the default number of folds is 2, so you get 2-fold cv scores based on the input prediction array to the meta learner. If you use the Evaluator, however, the entire ensemble (not just the meta learner) is fitted on each fold, so that's going to be massively slower.

@flennerhag
Owner

@JoshuaC3 really appreciate the benchmarking! I've been meaning to do that for some time but well, there's only 24hrs in a day.

I'm surprised mlens fails to outperform mlxtend. mlxtend doesn't actually do stacking (despite the name) - it merely fits the base learners on all data, and then the meta learner on those predictions. Hence the meta learner is trained on base learner training errors, but at test time faces test errors.

The two reasons for your results that I can think of are (a) if you have very little data the folds will be too noisy, or (b) if the data is not i.i.d. the folds will be biased. I hit upon (b) when doing the MNIST benchmark, since creating folds without shuffling the data won't cover all classes.

As a code integrity check, I spun up a simple mlxtend-vs-mlens benchmark with your models (but the default objective function) on the sklearn.datasets.make_friedman1 synthetic dataset. The SuperLearner does indeed outperform mlxtend (on a hold-out validation set):

MAE:

No. obs |   mlens |  mlxtend |
   5000 |    0.38 |     0.48 |
  10000 |    0.31 |     0.38 |
  15000 |    0.28 |     0.33 |
  20000 |    0.26 |     0.28 |
  25000 |    0.24 |     0.26 |

As a sanity check, if you use the SequentialEnsemble with 'full' as the layer class (e.g. add('full', [xgbr, lgbr])), you should get the exact same results as mlxtend. I get:

No. obs |   mlens |  mlxtend |
   5000 |    0.48 |     0.48 |
  10000 |    0.38 |     0.38 |
  15000 |    0.33 |     0.33 |
  20000 |    0.28 |     0.28 |
  25000 |    0.26 |     0.26 |

So to me it looks like it works as it should. For your benchmark, would you mind trying:

  1. use SequentialEnsemble with 'full' layers (see the sketch below) to see if you get the same scores
  2. set shuffle=True (try both in the constructor and in the add method)
  3. use more data (if your dataset < 1000 samples)
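
For point 1, a rough sketch with placeholder estimators (swap in your xgb/lgb models and your data):

from mlens.ensemble import SequentialEnsemble
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

X, y = make_friedman1(n_samples=5000, random_state=0)

# Point 2: shuffle when building folds.
ens = SequentialEnsemble(shuffle=True, random_state=25)

# 'full' layers fit every estimator on all of the data (mlxtend-style) instead
# of producing fold-wise predictions for the next layer.
ens.add('full', [Lasso(), RandomForestRegressor()])   # base layer
ens.add('full', [Lasso()])                            # final layer

ens.fit(X, y)
print(mean_absolute_error(y, ens.predict(X)))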

@JoshuaC3
Author

JoshuaC3 commented Oct 13, 2017

In reply to your first post: yes, I meant exactly that. The 2 folds were taken into account, so it took roughly 2x as long.

I will reply to your second comment in due course :)

@JoshuaC3
Author

Sorry, I have been busy the last two weeks (finishing an ML competition, which I won with mlens' help!! :) ), but I have finally done some investigating into this.

In the end the problem was very simple: it was due to mlens reacting sensitively to some np.inf and np.nan values. MLXtend accepts these values by default, passes them to the first-level learners and lets them throw an error if they cannot handle them.

mlens checks for these before level 1, but this can be turned off with array_check. I initially tried array_check=0 but my models still seemed to fail, so I removed the rows with np.nan and np.inf. It seems there were enough of these instances, or their distribution was informative enough to the model, that dropping them gave mlxtend an advantage. I have not been able to replicate the failure, so I imagine it was down to user error. Rerunning the models with array_check=0 now works as expected and outperforms mlxtend.
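
For anyone hitting the same thing, the clean-up itself was just something like this (a minimal sketch with illustrative arrays standing in for the real data):

import numpy as np

# Toy arrays: the second and third rows contain the np.nan / np.inf values
# that tripped things up.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.inf], [5.0, 6.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Keep only rows where every feature and the target are finite.
mask = np.isfinite(X).all(axis=1) & np.isfinite(y)
X_clean, y_clean = X[mask], y[mask]

# ensemble.fit(X_clean, y_clean)  # then fit as before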

I have started an ensemble comparison/benchmarking notebook and expect to have the first version finished today. I am comparing several Python-based ensembling packages, including mlxtend and mlens, and their different methods of ensembling. Hopefully this will be of some use.

@flennerhag
Owner

@JoshuaC3 congratulations! Happy to hear : )

If the competition was public, it might be interesting to make a use case of it. Would that be of interest to you?

Also, please do share your benchmark results!

Btw closing this issue, see here instead.

@JoshuaC3
Author

JoshuaC3 commented Nov 6, 2017

@flennerhag Thank you! Yes, however, the competition was not public; it was run through the company I work for. I will ask if it is OK to release the data, or at the least, my results and method. I can say that it was a regression task for predicting wind generation for each turbine on a wind farm, given >24h forecast weather data.

I have made a Jupyter Notebook and directory for this (hopefully this is OK; I think they are much more helpful for analyses!). You can see it on my fork, here. It started off as a benchmark against other packages, but they all scored very similarly, so in the end it became more of a quick comparison of packages. Is my use of the Evaluator to get cv scores correct? For any other comments, improvements or suggestions, should I raise an issue on my branch? I have not contributed/collaborated on git/GitHub before, so I am not sure of the best method.

I have started a second Jupyter Notebook to compare preprocessing functionality, deeper layer stacking and scores. This is to come shortly.
