Preprocessing not working? #26
Thanks for your detailed report, and sorry for not getting back to you sooner. Somehow I did not get a notification.
Hello and no worries.
The "full" preprocessing option extends the
This depends on your dataset. The differences between "full" and the default "stability" preprocessing are:
Some more background: empirically, we saw that the "full" preprocessing pipeline performed slightly worse on our benchmark datasets (mostly because PCA hurt performance on some of the included data sets). Since all the above options are subject to hyperparameter tuning, you might find the same pipeline in the end with both options.
If you include the scaling in the pipeline, the scaling factors estimated on the training set are applied unchanged to the test set (which would not happen in the manually scaled scenario).
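To make the point about scaling factors concrete, here is a minimal base R sketch. It does not use mlr3automl or reproduce its internals; the toy data, the 70/30 split and all variable names are assumptions made for illustration, and "manual scaling" is assumed to mean scaling the whole data set once before any split.

```r
# Toy data (hypothetical): one numeric feature, split into training and test rows.
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
train_idx <- 1:70
x_train <- x[train_idx]
x_test <- x[-train_idx]

# Scaling as a pipeline step: centre and spread are estimated on the training rows only...
center <- mean(x_train)
spread <- sd(x_train)
x_train_scaled <- (x_train - center) / spread
# ...and the same two numbers are reused, unchanged, for the test rows.
x_test_pipeline <- (x_test - center) / spread

# Manual scaling up front (assumed scenario): statistics come from all rows, test rows included.
x_all_scaled <- as.numeric(scale(x))
x_test_manual <- x_all_scaled[-train_idx]

# Largest difference between the two versions of the test feature.
max(abs(x_test_pipeline - x_test_manual))
```

In this sketch the training rows end up with mean 0 and standard deviation 1, while the test rows are only approximately standardised, because they are transformed with the training statistics.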
OK, so I've gone back over my old Rmds from the time, and I think this is partly me being stupid, as I was experimenting with tidymodels, h2o and mlr3 (automl) all at the same time, and partly the vignette for automl maybe being a bit too concise at the moment, meaning I got my wires a bit crossed as a result. Now I've read your explanation and the actual
Just to expand: with mlr3 there's a commit where you can keep NAs in the target for imputation (but only impute the non-target variables and then drop the target NAs afterwards). These samples can help the imputation of non-target NAs. That is, if you have variables A:E (with E the target) and a sample with E missing, you don't necessarily want to drop this sample before imputation, because it can still be useful for imputing missing A:D values in other samples. If that makes sense? So you want to keep this sample for imputation and then drop it afterwards.
Anyway, I accidentally used the data set where I had left in target NAs with automl and obviously did not read the error message properly, which led me to misinterpret the preprocessing options totally. I probably shouldn't have been experimenting with so many packages at once, as I obviously had too many things going through my head to think this through properly. Sorry about that, but talking this through has helped a lot!
I have now run the full option on a dataset with target NAs removed (but not non-target NAs removed) and it worked beautifully. Thank you.
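The "impute first, drop target NAs afterwards" idea can be sketched in base R as follows. This is only an illustration: it uses neither mlr3 nor the commit mentioned above, median imputation is a stand-in for whatever imputation method is actually used, and the toy data frame and column names are hypothetical.

```r
# Toy data (hypothetical): features A and B, target E, with NAs in all three columns.
d <- data.frame(
  A = c(1, 2, NA, 4, 5),
  B = c(10, NA, 30, 40, 50),
  E = c(0.1, 0.2, 0.3, NA, 0.5)
)
feature_cols <- c("A", "B")

# Step 1: impute the features using all rows, including row 4, whose target is missing.
# Median imputation is an assumption chosen for simplicity.
for (col in feature_cols) {
  med <- median(d[[col]], na.rm = TRUE)
  d[[col]][is.na(d[[col]])] <- med
}

# Step 2: only now drop the rows with a missing target; the target itself is never imputed.
d_model <- d[!is.na(d$E), ]
d_model
```

Row 4 never reaches the model, but its A and B values still contribute to the medians used to fill the gaps in rows 2 and 3.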
Hello,
Thanks for this fantastic package. One strange thing I've noticed is that I don't seem to be able to activate the preprocessing stage. I can tell for a couple of reasons (I think)! I have some numeric data that contains NAs and that I want to do regression on. If I run my data as is using
AutoML(task)
with no options set, then I get an error:
OK, so if I remove the NAs first then, after running predict on the full data set (minus NAs), I get something like:
The predictions are very good, except that they lie at an angle to the x-y line. I think this is a result of a lack of scaling and centering because, if I do this manually first, then I get:
I mean, obviously the plot looks different now as mlr3 doesn't know about the scaling, but now the predictions lie nicely around the x-y line.
So it seems the default is to do no preprocessing (it's not quite clear from the help pages). But, when I set the option
AutoML(task, preprocessing = "full")
, I get no difference in the outcome with the original data or manually scaled data. Plus, if I leave in the NAs then I still get the error:
The help pages suggest NAs can be handled, as they mention imputation, but I still get the error. And, as I mentioned above, the predictions on data after removing NAs look the same as when not setting the
preprocessing
option. Am I missing something?
EDIT: but setting
preprocessing = po("scale")
does work:
So it seems like it's the "full", "stability" and "none" options that aren't being respected. Or I'm being stupid!