
Handling input to Auto-PyTorch #89

Merged

Conversation

franchuterivera
Contributor

  • Enable any input type to Auto-PyTorch (if the input is one of pd.DataFrame, List, sparse, or numpy); see the sketch below

  • Better errors for the user and plenty more testing (including moving the dataset tests to pytest)
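
A minimal sketch of what that input normalization could look like; the function name and exact conversions here are assumptions for illustration, not the PR's actual implementation:

import numpy as np
import pandas as pd
from scipy import sparse

def coerce_input(X):
    """Illustrative only: accept the input types the PR mentions and
    return something a single downstream code path can consume."""
    if isinstance(X, pd.DataFrame):
        return X.to_numpy()
    if sparse.issparse(X):
        return X  # keep sparse data sparse to avoid blowing up memory
    if isinstance(X, (list, np.ndarray)):
        return np.asarray(X)
    raise ValueError(f"Unsupported input type: {type(X)}")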

@@ -19,7 +19,9 @@ def fit(self, X: Dict[str, Any], y: Any = None) -> BaseEncoder:

self.check_requirements(X, y)

-        self.preprocessor['categorical'] = OE(categories=X['dataset_properties']['categories'])
+        self.preprocessor['categorical'] = OE(handle_unknown='use_encoded_value',
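
For context, this is the scikit-learn OrdinalEncoder behaviour the added (truncated) line opts into: handle_unknown='use_encoded_value' maps categories unseen at fit time to a fixed unknown_value instead of raising an error. A minimal standalone example; the -1 sentinel is an assumption, since the diff is cut off before the unknown_value argument:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(np.array([['red'], ['green'], ['blue']]))
# 'purple' was never seen during fit, so it becomes the sentinel -1
print(enc.transform(np.array([['green'], ['purple']])))  # [[ 1.], [-1.]]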
Contributor

Maybe we don't need an ordinal encoder in the search space, as the data has already gone through this phase. If, for a given dataset, one-hot encoding is not preferred, NoEncoder will have the same effect as ordinal encoding.

Contributor Author

Oh no, wait: we do not touch the numerical columns. The objective here is to make the data numerical.

So if you have a numerical column with values like 0.0000000001 and so on, maybe mapping those to 1, 2, 3, 4 and so on with an ordinal encoder improves performance.
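
A toy illustration of the effect described above, assuming scikit-learn's OrdinalEncoder: each distinct value of the column becomes a small (0-based) integer code:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# tiny, hard-to-scale float values mapped to rank-like integer codes
X = np.array([[1e-10], [2e-10], [3e-10], [4e-10]])
print(OrdinalEncoder().fit_transform(X).ravel())  # [0. 1. 2. 3.]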

Contributor

But in its current form, it is only working on the categorical columns. Also, that case could be handled by the scaler component. Sebastian and I also discussed today not having an encoding component at all: once we merge the embedding PR, which I'll finish soon, PyTorch's nn.Embedding layer effectively performs the one-hot encoding and learns the embedding all by itself.
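
A minimal sketch of the nn.Embedding behaviour being referred to: integer category codes index directly into a learned embedding table, so no explicit one-hot step is needed (the dimensions here are arbitrary):

import torch
import torch.nn as nn

num_categories, embedding_dim = 5, 3
emb = nn.Embedding(num_categories, embedding_dim)

# integer-coded categorical column for a batch of 4 rows
codes = torch.tensor([0, 2, 2, 4])
print(emb(codes).shape)  # torch.Size([4, 3]); vectors are learned in training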

    A holdout set of labels
    """
    if not self.is_classification or self.type_of_target == 'multilabel-indicator':
        # Only fit an encoder for classification tasks
Contributor

Should we also raise an error if, in regression, the data type of the target is not integer? Currently there seems to be no check for that.

Contributor Author

I think #85 can handle this. I am not sure what you mean exactly, as we have different regression targets. Usually we expect regression targets to have a float type, which is something the code handles right now.

Contributor

Oh, I meant the case where someone mistakenly passes y values as strings that contain words rather than numbers. I think this validator module should take care of that as well.
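
A minimal sketch of such a check; the function name and its placement in the validator module are assumptions for illustration:

import numpy as np

def check_regression_target(y):
    """Illustrative only: reject regression targets that are words
    rather than numbers, e.g. y passed as strings by mistake."""
    y = np.asarray(y)
    if not np.issubdtype(y.dtype, np.number):
        raise ValueError(
            f"Regression targets must be numeric, got dtype {y.dtype}"
        )

# check_regression_target(['high', 'low', 'high'])  # raises ValueError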

Contributor

@ravinkohli ravinkohli left a comment

This PR looks really good. I have left a few minor comments to be addressed.

@ravinkohli ravinkohli mentioned this pull request Feb 10, 2021
@ravinkohli ravinkohli merged commit b48b952 into automl:refactor_development Feb 15, 2021