
Preprocessing not working? #26

Closed
py9mrg opened this issue Apr 16, 2021 · 4 comments

Comments

@py9mrg

py9mrg commented Apr 16, 2021

Hello,

Thanks for this fantastic package. One strange thing I've noticed is that I don't seem to be able to activate the preprocessing stage. I can tell for a couple of reasons (I think)! I have some numeric data that contains NAs and that I want to run a regression on. If I run my data as-is using AutoML(task) with no options set, I get an error:

Error in check_prediction_data.PredictionDataRegr(pdata) : 
  Assertion on 'pdata$response' failed: Contains missing values (element 1).
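
For reference, this is roughly what I'm running (a sketch; df and y stand in for my actual data frame and target column):

library(mlr3)
library(mlr3automl)

task  <- TaskRegr$new(id = "mydata", backend = df, target = "y")
model <- AutoML(task)          # no preprocessing options set
model$train()
preds <- model$predict(task)   # then predicting back on the full data, as described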

OK, so if I remove the NAs first and then run predict on the full data set (minus the NAs), I get something like:

[screenshot: predicted vs. actual values - a very close fit, but lying at an angle to the x-y line]

Very good predictions, except that they lie at an angle to the x-y line, which I think is a result of a lack of scaling and centering, because if I scale and centre manually first I get:

[screenshot: predicted vs. actual values after manual scaling and centering - predictions lying along the x-y line]

I mean, obviously the plot looks different now as mlr3 doesn't know about the scaling, but now the predictions lie nicely around the x-y line.

So it seems the default is to do no preprocessing (it's not quite clear from the help pages). But when I set the option AutoML(task, preprocessing = "full"), I see no difference in the outcome with either the original data or the manually scaled data. Plus, if I leave in the NAs I still get the error:

Error in check_prediction_data.PredictionDataRegr(pdata) : 
  Assertion on 'pdata$response' failed: Contains missing values (element 1).

The help pages suggest NAs can be handled, as they mention imputation, but I still get the error. And, as I mentioned above, the predictions on the data with NAs removed look the same as when the preprocessing option is not set. Am I missing something?

EDIT:

But setting preprocessing = po("scale") does work:

[screenshot: predicted vs. actual values with preprocessing = po("scale") - predictions lying along the x-y line]

So it seems like it's the "full", "stability", and "none" options that aren't being respected. Or I'm being stupid!
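
For reference, the working call is simply (a rough sketch, with the same task object as above):

library(mlr3pipelines)

model_scaled <- AutoML(task, preprocessing = po("scale"))  # passing a PipeOp instead of a keyword
model_scaled$train()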

@a-hanf
Owner

a-hanf commented Jun 6, 2021

Thanks for your detailed report, and sorry for not getting back to you sooner; somehow I did not get a notification.
If I understand it correctly, you are raising two separate points in your issue here.

  1. It looks like your data set contains missing values in the response. The mlr3 regression learners do not handle this, and I would consider this expected behavior for supervised learning algorithms. What would you expect to happen here? We should at least give a more explicit warning.

  2. Regarding the scaling of variables: you mentioned no differences in outcome based on scaling. Depending on how you were scaling the features, this is normal. mlr3automl has a few different algorithms for regression: ranger and xgboost are tree-based methods and indifferent to scaling. The regression SVM learner is from the e1071 package and will scale inputs itself. The other regression algorithms perform regularization, so scaling might have an effect there.
    If you can provide some more details on the data set, which learners were selected and how scaling changed things, I can have a look to see if there's anything going wrong there. I'll update the docs to make the default preprocessing behavior more explicit.

@py9mrg
Author

py9mrg commented Jun 11, 2021

Hello and no worries.

  1. Yes, I realised that in the end, and have raised an issue there; they are considering introducing this functionality, and it is in a commit now.
  2. My data set is purely numeric, and I've generally been playing around with ranger::ranger and kernlab::ksvm (and e1071::svm). IIRC ranger gave very similar results to running AutoML with default settings. I appreciate this might be me thinking in circles, but it seems to me that AutoML(task, preprocessing = "full") and AutoML(task, preprocessing = po("scale")) ought to give identical results - but they don't. Yet AutoML(task, preprocessing = "full") and AutoML(task) do, and so do AutoML(task, preprocessing = po("scale")) and AutoML(task) if, in the latter case, I manually scale first. I could be completely wrong in my expectations though!
    Hope that's clearer; if not, let me know and I'll try to make a reprex.

@a-hanf
Owner

a-hanf commented Jun 12, 2021

it seems to me that AutoML(task, preprocessing = "full") and AutoML(task, preprocessing = po("scale")) ought to give identical results - but they don't

The "full" preprocessing option extends the pipeline_robustify function from mlr3pipelines by adding tunable preprocessing options for imputation methods (if your data has missing values) and PCA. The pipeline po(scale) only has the scaling operator, so these will be very different pipelines. Not sure why you expect them to return identical results, can you elaborate? Supplying a Graph object for the preprocessing does not extend the existing pipeline, but it replaces it. Maybe that was not clear from the docs?

Yet AutoML(task, preprocessing = "full") and AutoML(task) do

This depends on your dataset. The differences between "full" and the default "stability" preprocessing are:

  • different methods for imputation of missing data (not sure if this is relevant for you)
  • different methods for encoding categorical covariates (irrelevant for your case)
  • PCA for dimensionality reduction

Some more background: empirically, we saw that the "full" preprocessing pipeline performed slightly worse on our benchmark datasets (mostly because PCA hurt performance on some of the included data sets). Since all the above options are subject to hyperparameter tuning, you might find the same pipeline in the end with both options.
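
For comparison, the two built-in options are selected like this (sketch):

model_default <- AutoML(task, preprocessing = "stability")  # same as AutoML(task)
model_full    <- AutoML(task, preprocessing = "full")       # additionally tunes the imputation method and PCA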

and so do AutoML(task, preprocessing = po("scale")) and AutoML(task) if in the latter case I manually scale first.

If you include the scaling in the pipeline, the same scaling factors learned on the training set are applied to the test set (which would not happen in the manually scaled scenario).
I am a bit surprised that scaling makes a difference here, as ranger should be indifferent to scaling and both the SVMs you mention perform scaling internally.
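
To illustrate the scaling point (a sketch, not how mlr3automl wires it up internally; train_task and test_task stand in for your splits):

library(mlr3pipelines)

po_scale <- po("scale")
po_scale$train(list(train_task))                       # learns centering/scaling factors from the training data
scaled_test <- po_scale$predict(list(test_task))[[1]]  # re-applies those same factors to the test data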

@py9mrg
Author

py9mrg commented Jun 14, 2021

OK, so I've gone back over my old Rmds from the time, and I think this is partly me being stupid, as I was experimenting with tidymodels, h2o, and mlr3 (automl) all at the same time, and partly the vignette for automl maybe being a bit too concise at the moment - meaning I got my wires crossed as a result. Now that I've read your explanation and the actual ?AutoML doc rather than only the vignette, it's much clearer.

Just to expand: with mlr3 there's a commit where you can keep NAs in the target during imputation (only imputing the non-target variables and then dropping the target NAs afterwards). These samples can help the imputation of non-target NAs - as in, if you have variables A:E (with E the target) and a sample with E missing, you don't necessarily want to drop that sample before imputation, because it can still be useful for imputing missing A:D values in other samples. So you want to keep the sample for imputation and then drop it afterwards.
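
In plain data.frame terms, the idea is something like this (a sketch with made-up names, and median imputation standing in for whatever method is actually used):

# impute the feature columns A:D using all rows, including rows where the target E is NA,
# then drop the rows with a missing target before modelling
feature_cols <- setdiff(names(df), "E")
df_imputed <- df
for (col in feature_cols) {
  med <- median(df_imputed[[col]], na.rm = TRUE)
  df_imputed[[col]][is.na(df_imputed[[col]])] <- med
}
df_model <- df_imputed[!is.na(df_imputed$E), ]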

Anyway, I accidentally used the data set where I had left the target NAs in with automl, and obviously did not read the error message properly - leading to me totally misinterpreting the preprocessing options. I probably shouldn't have been experimenting with so many packages at once, as I clearly had too many things going through my head to think this through properly. Sorry about that, but talking this through has helped a lot!

I have now run the full option on a dataset with target NAs removed (but not non-target NAs removed) and it worked beautifully. Thank you.

@py9mrg py9mrg closed this as completed Jun 14, 2021