
Raise informative error on duplicated column names #686

Merged 8 commits into main on Sep 12, 2023

Conversation

david-cortes (Contributor):
This PR adds an informative error message for cases in which the user supplies inputs with duplicated column names, which would otherwise manifest as hard-to-track errors (e.g. #681).

solegalli (Collaborator) left a comment:

Hey @david-cortes

Thank you so much for taking care of this issue.

I wonder whether, instead of modifying all the variable-check functions, it would be better to modify check_X, which validates the input dataframe. For example here:

if isinstance(X, pd.DataFrame):

In this case, if the dataframe contains duplicated variables, it will raise an error and stop all computations right at the top of the fit method.

Do you know if / how pandas warns users about duplicated column names?

    if isinstance(df_check, pd.Series):
        return
    if df_check.columns.duplicated().any():
        raise ValueError("Input data contains duplicated variable names.")
solegalli (Collaborator):

Quick questions:

Does df_check.columns.duplicated().any() raise an error on a series?

Is this implementation better practice, or would it be better to say:

    if not isinstance(df_check, pd.Series):
        if df_check.columns.duplicated().any():
            raise ValueError("Input data contains duplicated variable names.")

david-cortes (Contributor, Author):

a. It does raise an error, since Series objects do not have a columns attribute. Nevertheless, if slicing a DataFrame by a column name returns a Series, it means there was only one column with that name, so the early return is safe.
b. No idea. The linter didn't complain about the version in my initial commit.
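The point in (a) can be demonstrated directly: selecting a duplicated column name from a DataFrame returns a DataFrame, while a unique name returns a Series, and Series objects have no columns attribute. A small sketch (the column names here are arbitrary):

```python
# Selecting a duplicated column name yields a DataFrame; a unique name
# yields a Series, which lacks the `.columns` attribute entirely.
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

dup = df["a"]   # two columns share this name -> returns a DataFrame
uniq = df["b"]  # only one column with this name -> returns a Series

print(type(dup).__name__)        # DataFrame
print(type(uniq).__name__)       # Series
print(hasattr(uniq, "columns"))  # False: accessing .columns would raise
```

This is why calling `.columns.duplicated()` on a Series raises an AttributeError, and why the Series case can simply return early.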

david-cortes (Contributor, Author) commented Jul 24, 2023:

Moved the check to check_X instead as suggested above.

Regarding errors from pandas: it does raise errors when there are duplicates, but only in some particular situations:
https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html
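To illustrate the behaviour described in that guide: by default pandas silently allows duplicate labels, and only raises if the user opts in via the duplicate-labels flag (available in pandas >= 1.2). A hedged sketch:

```python
# By default pandas accepts duplicated column names without complaint.
# Opting in with allows_duplicate_labels=False makes it raise
# DuplicateLabelError instead (pandas >= 1.2).
import pandas as pd
from pandas.errors import DuplicateLabelError

df = pd.DataFrame([[1, 2]], columns=["same", "same"])
print(df.columns.is_unique)  # False, but no error by default

raised = False
try:
    df.set_flags(allows_duplicate_labels=False)
except DuplicateLabelError:
    raised = True
print("raised:", raised)
```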

Also changed the mechanism to use the is_unique attribute, as that seems to be what the pandas guide recommends.
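A minimal sketch of how an is_unique-based check could look inside a check_X-style validator (this mirrors the error message quoted earlier in the thread; the actual feature_engine implementation may differ in details):

```python
# Sketch of a duplicate-name guard using Index.is_unique, as recommended
# by the pandas duplicate-labels guide. Not the exact feature_engine code.
import pandas as pd


def check_X(X):
    if isinstance(X, pd.DataFrame) and not X.columns.is_unique:
        raise ValueError("Input data contains duplicated variable names.")
    return X


check_X(pd.DataFrame({"a": [1], "b": [2]}))  # passes silently

bad = pd.DataFrame([[1, 2]], columns=["same", "same"])
try:
    check_X(bad)
except ValueError as err:
    print(err)  # Input data contains duplicated variable names.
```

Using `is_unique` avoids building the intermediate boolean array that `duplicated().any()` creates, and reads closer to the intent.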

solegalli (Collaborator) left a comment:

Hey @david-cortes

Thank you so much for the link to pandas documentation on duplicate labels. That was very useful. And thanks for the code changes in check_X. That looks very good.

After reading the documentation, I think this addition you are proposing is important and very useful.

I think it'd be important to add a test in the dataframe-checks test file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_dataframe_checks.py

Our test files mirror our .py files, so we should not have a test file under variable_handling unless we add a function in variable handling. So I think we need to remove the file you added, test_error_on_duplicated, and move its tests into the other files.

This test is a generic test that all transformers should pass. We have generic tests here: https://github.com/feature-engine/feature_engine/tree/main/tests/estimator_checks

But I need to refurbish that file a lot, so I wouldn't want to waste your time there. If you could just add a test to the test file I mentioned previously, and then open an issue saying that we need a generic test checking that all transformers raise an error on duplicates, that would be great and we could finish this PR.

Thank you!

def test_error_on_autodetected_numerical():
    with pytest.raises(ValueError):
        est = BaseNumericalTransformer()
        est.variables = None
solegalli (Collaborator):

Could you take these two lines, which should not raise an error, out of the with pytest.raises block?

To be absolutely sure that we are triggering the desired error, we should also check that the error raised matches the text of the error in check_X.

solegalli (Collaborator):

We should remove this file anyway, so feel free to ignore my comments here.


def test_error_on_autodetected_datetime():
    with pytest.raises(ValueError):
        DatetimeFeatures().fit(df)
solegalli (Collaborator):

To ensure that the fit method is what's failing, we should initialise the transformer outside the with pytest.raises block.

with pytest.raises(ValueError):
    est = BaseNumericalTransformer()
    est.variables = ["same"]
    est.fit(df)
solegalli (Collaborator):

We could put this test together with the one on line 26 and use parametrize to set variables first to None and then to ["same"]. Then it would be test_base_numerical_transformer.
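The parametrize suggestion could look roughly like the sketch below. DummyTransformer here is a self-contained stand-in, not feature_engine's BaseNumericalTransformer, so the snippet runs on its own; the real test would use the actual transformer and dataframe fixture:

```python
# Sketch of merging the two tests via pytest.mark.parametrize, so the same
# test body runs once with variables=None (auto-detection) and once with
# variables=["same"] (manual selection). DummyTransformer is a stand-in.
import pandas as pd
import pytest


class DummyTransformer:
    """Stand-in that rejects DataFrames with duplicated column names."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X):
        if not X.columns.is_unique:
            raise ValueError("Input data contains duplicated variable names.")
        return self


df = pd.DataFrame([[1, 2]], columns=["same", "same"])


@pytest.mark.parametrize("variables", [None, ["same"]])
def test_base_numerical_transformer(variables):
    transformer = DummyTransformer(variables=variables)
    with pytest.raises(ValueError):
        transformer.fit(df)
```

A parametrized test function remains an ordinary callable, so each case can also be exercised directly, e.g. test_base_numerical_transformer(None).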

est = MockedCategoricalTransformer()
est.variables = ["same"]
est.ignore_format = True
est.transform(df)
solegalli (Collaborator):

Same for this test: we should put it together with the one on line 33 and use parametrize.

Also, I don't understand why you need a mock class here, and why you are overwriting the fit method.


def test_error_on_manually_specified_datetime():
    with pytest.raises(ValueError):
        DatetimeFeatures(variables=["same"]).fit(df)
solegalli (Collaborator):

Same here: put them together in one test with parametrize.

david-cortes (Contributor, Author):
Moved the tests to test_dataframe_checks.py.

)
df.columns = ["same", "unique", "same"]

with pytest.raises(ValueError):
solegalli (Collaborator):

Nitpick: could we test the error message as well? An example of how to do so:

with pytest.raises(ValueError) as record:
    encoder.fit(df[["var_A", "var_B"]], df["target"])
msg = (
    "During the WoE calculation, some of the categories in the "
    "following features contained 0 in the denominator or numerator, "
    "and hence the WoE can't be calculated: var_A."
)
assert str(record.value) == msg

david-cortes (Contributor, Author):

Added a check on the error message.
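Such a message check could follow the record-based pattern quoted above, asserting against the error text discussed earlier in the thread. A hedged, self-contained sketch (the _check helper stands in for the real check_X; the real test imports it from feature_engine.dataframe_checks):

```python
# Sketch of asserting on the exact error message via pytest.raises(...) as
# record, mirroring the example above. `_check` is a stand-in for check_X.
import pandas as pd
import pytest


def _check(df):
    # Raises on duplicated column names, like the check added in this PR.
    if not df.columns.is_unique:
        raise ValueError("Input data contains duplicated variable names.")


df = pd.DataFrame([[1, 2, 3]], columns=["same", "unique", "same"])

with pytest.raises(ValueError) as record:
    _check(df)

msg = "Input data contains duplicated variable names."
assert str(record.value) == msg
```

Comparing str(record.value) to the full message (rather than a substring) guarantees the test fails if the wording ever drifts from the documented error.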

solegalli (Collaborator):

Hey @david-cortes

I made a PR to your repo: david-cortes#4

Where I rebase main and add this contribution to the changelog.

Would you have time to merge over there, so it updates here and I can merge and close?

Thanks a lot!

Commit: rebase main to fix failing tests and add change to changelog
david-cortes (Contributor, Author):


Thanks, although I think you should also be able to push changes to the branch directly.

codecov bot commented Sep 12, 2023:

Codecov Report

Merging #686 (a53a7bd) into main (3343305) will increase coverage by 0.00%.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   97.99%   97.99%           
=======================================
  Files         100      100           
  Lines        3843     3849    +6     
  Branches      754      752    -2     
=======================================
+ Hits         3766     3772    +6     
  Misses         28       28           
  Partials       49       49           
Files Changed Coverage Δ
feature_engine/creation/math_features.py 97.77% <ø> (ø)
feature_engine/dataframe_checks.py 97.05% <100.00%> (+0.08%) ⬆️
feature_engine/datetime/datetime.py 100.00% <100.00%> (ø)
feature_engine/datetime/datetime_subtraction.py 94.73% <100.00%> (+0.07%) ⬆️
feature_engine/encoding/base_encoder.py 100.00% <100.00%> (ø)
feature_engine/encoding/one_hot.py 100.00% <100.00%> (ø)
feature_engine/encoding/rare_label.py 100.00% <100.00%> (ø)
feature_engine/imputation/categorical.py 95.31% <100.00%> (ø)
feature_engine/selection/shuffle_features.py 100.00% <100.00%> (ø)
feature_engine/transformation/yeojohnson.py 100.00% <100.00%> (ø)
... and 1 more


@solegalli solegalli merged commit bd4c8fc into feature-engine:main Sep 12, 2023
9 checks passed