
Raise informative error on duplicated column names #686

Merged 8 commits into main on Sep 12, 2023

Conversation

david-cortes (Contributor):
This PR adds an informative error message for cases in which the user supplies inputs with duplicated column names, which would otherwise manifest as hard-to-track errors (e.g. #681).

solegalli (Collaborator) left a comment:

Hey @david-cortes

Thank you so much for taking care of this issue.

I wonder whether, instead of modifying all the variable-check functions, it would be better to modify check_X, which validates the input dataframe. For example here:

if isinstance(X, pd.DataFrame):

In this case, if the dataframe contains duplicated variables, it will raise an error and stop all computations right at the top of the fit method.

Do you know if / how pandas warns users about duplicated column names?

    if isinstance(df_check, pd.Series):
        return
    if df_check.columns.duplicated().any():
        raise ValueError("Input data contains duplicated variable names.")
solegalli (Collaborator):

Quick questions:

Does df_check.columns.duplicated().any() raise an error on a series?

Is this implementation better practice, or would it be better to say:

    if not isinstance(df_check, pd.Series):
        if df_check.columns.duplicated().any():
            raise ValueError("Input data contains duplicated variable names.")

david-cortes (Contributor, Author):

a. It does raise an error, since Series objects do not have a columns attribute. Nevertheless, if slicing a DataFrame by a column name returns a Series, it means there was only one column with that name, so the early return is safe.
b. No idea. The linter didn't complain about the version in my initial commit.
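The point in (a) can be demonstrated directly: selecting a duplicated column name from a DataFrame returns a DataFrame, while a unique name returns a Series, and Series objects have no columns attribute. A small sketch (the column names here are arbitrary):

```python
# Selecting a duplicated column name yields a DataFrame; a unique name
# yields a Series, which lacks the `.columns` attribute entirely.
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

dup = df["a"]   # two columns share this name -> returns a DataFrame
uniq = df["b"]  # only one column with this name -> returns a Series

print(type(dup).__name__)        # DataFrame
print(type(uniq).__name__)       # Series
print(hasattr(uniq, "columns"))  # False: accessing .columns would raise
```

This is why calling `.columns.duplicated()` on a Series raises an AttributeError, and why the Series case can simply return early.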

david-cortes (Contributor, Author) commented Jul 24, 2023:

Moved the check to check_X instead as suggested above.

Regarding errors from pandas: it does raise errors when there are duplicates, but only in some particular situations:
https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html
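To illustrate the behaviour described in that guide: by default pandas silently allows duplicate labels, and only raises if the user opts in via the duplicate-labels flag (available in pandas >= 1.2). A hedged sketch:

```python
# By default pandas accepts duplicated column names without complaint.
# Opting in with allows_duplicate_labels=False makes it raise
# DuplicateLabelError instead (pandas >= 1.2).
import pandas as pd
from pandas.errors import DuplicateLabelError

df = pd.DataFrame([[1, 2]], columns=["same", "same"])
print(df.columns.is_unique)  # False, but no error by default

raised = False
try:
    df.set_flags(allows_duplicate_labels=False)
except DuplicateLabelError:
    raised = True
print("raised:", raised)
```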

Also changed the mechanism to use the is_unique attribute, as that seems to be what the pandas guide recommends.
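A minimal sketch of how an is_unique-based check could look inside a check_X-style validator (this mirrors the error message quoted earlier in the thread; the actual feature_engine implementation may differ in details):

```python
# Sketch of a duplicate-name guard using Index.is_unique, as recommended
# by the pandas duplicate-labels guide. Not the exact feature_engine code.
import pandas as pd


def check_X(X):
    if isinstance(X, pd.DataFrame) and not X.columns.is_unique:
        raise ValueError("Input data contains duplicated variable names.")
    return X


check_X(pd.DataFrame({"a": [1], "b": [2]}))  # passes silently

bad = pd.DataFrame([[1, 2]], columns=["same", "same"])
try:
    check_X(bad)
except ValueError as err:
    print(err)  # Input data contains duplicated variable names.
```

Using `is_unique` avoids building the intermediate boolean array that `duplicated().any()` creates, and reads closer to the intent.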

solegalli (Collaborator) left a comment:

Hey @david-cortes

Thank you so much for the link to pandas documentation on duplicate labels. That was very useful. And thanks for the code changes in check_X. That looks very good.

After reading the documentation, I think this addition you are proposing is important and very useful.

I think it'd be important to add a test in the dataframe-checks test file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_dataframe_checks.py

Our test files mirror our .py files, so we should not have a test file under variable_handling unless we add a function in variable handling. So I think we need to remove the file you added, test_error_on_duplicated, and move its tests into the other files.

This test is a generic test that all transformers should pass. We have generic tests here: https://github.com/feature-engine/feature_engine/tree/main/tests/estimator_checks

But I need to refurbish that file a lot, so I wouldn't want to waste your time there. If you could just add a test to the test file I mentioned previously, and then open an issue saying that we need a generic test checking that all transformers raise an error on duplicates, that would be great and we could finish this PR.

Thank you!

def test_error_on_autodetected_numerical():
    with pytest.raises(ValueError):
        est = BaseNumericalTransformer()
        est.variables = None
solegalli (Collaborator):

Could you take these two lines, which should not raise an error, out of the with pytest.raises block?

To be absolutely sure that we are triggering the desired error, we should also check that the error raised matches the text of the error in check_X.

solegalli (Collaborator):

We should remove this file anyway, so feel free to ignore my comments here.


def test_error_on_autodetected_datetime():
    with pytest.raises(ValueError):
        DatetimeFeatures().fit(df)
solegalli (Collaborator):

To ensure that the fit method is what's failing, we should initialise the transformer outside the with pytest.raises block.

with pytest.raises(ValueError):
    est = BaseNumericalTransformer()
    est.variables = ["same"]
    est.fit(df)
solegalli (Collaborator):

We could put this test together with the one on line 26 and use parametrize to set variables first to None and then to ["same"]. Then it would be test_base_numerical_transformer.
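The parametrize suggestion could look roughly like the sketch below. DummyTransformer here is a self-contained stand-in, not feature_engine's BaseNumericalTransformer, so the snippet runs on its own; the real test would use the actual transformer and dataframe fixture:

```python
# Sketch of merging the two tests via pytest.mark.parametrize, so the same
# test body runs once with variables=None (auto-detection) and once with
# variables=["same"] (manual selection). DummyTransformer is a stand-in.
import pandas as pd
import pytest


class DummyTransformer:
    """Stand-in that rejects DataFrames with duplicated column names."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X):
        if not X.columns.is_unique:
            raise ValueError("Input data contains duplicated variable names.")
        return self


df = pd.DataFrame([[1, 2]], columns=["same", "same"])


@pytest.mark.parametrize("variables", [None, ["same"]])
def test_base_numerical_transformer(variables):
    transformer = DummyTransformer(variables=variables)
    with pytest.raises(ValueError):
        transformer.fit(df)
```

A parametrized test function remains an ordinary callable, so each case can also be exercised directly, e.g. test_base_numerical_transformer(None).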

est = MockedCategoricalTransformer()
est.variables = ["same"]
est.ignore_format = True
est.transform(df)
solegalli (Collaborator):

Same for this test: we should put it together with the one on line 33 and use parametrize.

Also, I don't understand why you need a mock class here, and why you are overwriting the fit method.


def test_error_on_manually_specified_datetime():
    with pytest.raises(ValueError):
        DatetimeFeatures(variables=["same"]).fit(df)
solegalli (Collaborator):

Same here: put them together in one test with parametrize.

david-cortes (Contributor, Author):
Moved the tests to test_dataframe_checks.py.

)
df.columns = ["same", "unique", "same"]

with pytest.raises(ValueError):
solegalli (Collaborator):

Nitpick: could we test the error message as well? An example of how to do so:

with pytest.raises(ValueError) as record:
    encoder.fit(df[["var_A", "var_B"]], df["target"])
msg = (
    "During the WoE calculation, some of the categories in the "
    "following features contained 0 in the denominator or numerator, "
    "and hence the WoE can't be calculated: var_A."
)
assert str(record.value) == msg

david-cortes (Contributor, Author):

Added a check on the error message.
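Such a message check could follow the record-based pattern quoted above, asserting against the error text discussed earlier in the thread. A hedged, self-contained sketch (the _check helper stands in for the real check_X; the real test imports it from feature_engine.dataframe_checks):

```python
# Sketch of asserting on the exact error message via pytest.raises(...) as
# record, mirroring the example above. `_check` is a stand-in for check_X.
import pandas as pd
import pytest


def _check(df):
    # Raises on duplicated column names, like the check added in this PR.
    if not df.columns.is_unique:
        raise ValueError("Input data contains duplicated variable names.")


df = pd.DataFrame([[1, 2, 3]], columns=["same", "unique", "same"])

with pytest.raises(ValueError) as record:
    _check(df)

msg = "Input data contains duplicated variable names."
assert str(record.value) == msg
```

Comparing str(record.value) to the full message (rather than a substring) guarantees the test fails if the wording ever drifts from the documented error.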

solegalli (Collaborator):

Hey @david-cortes

I made a PR to your repo: david-cortes#4

Where I rebase main and add this contribution to the changelog.

Would you have time to merge over there, so it updates here and I can merge and close?

Thanks a lot!

Commit: rebase main to fix failing tests and add change to changelog
david-cortes (Contributor, Author):


Thanks, although I think you should also be able to push changes to the branch directly.

codecov bot commented Sep 12, 2023:

Codecov Report

Merging #686 (a53a7bd) into main (3343305) will increase coverage by 0.00%.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   97.99%   97.99%           
=======================================
  Files         100      100           
  Lines        3843     3849    +6     
  Branches      754      752    -2     
=======================================
+ Hits         3766     3772    +6     
  Misses         28       28           
  Partials       49       49           
Files Changed Coverage Δ
feature_engine/creation/math_features.py 97.77% <ø> (ø)
feature_engine/dataframe_checks.py 97.05% <100.00%> (+0.08%) ⬆️
feature_engine/datetime/datetime.py 100.00% <100.00%> (ø)
feature_engine/datetime/datetime_subtraction.py 94.73% <100.00%> (+0.07%) ⬆️
feature_engine/encoding/base_encoder.py 100.00% <100.00%> (ø)
feature_engine/encoding/one_hot.py 100.00% <100.00%> (ø)
feature_engine/encoding/rare_label.py 100.00% <100.00%> (ø)
feature_engine/imputation/categorical.py 95.31% <100.00%> (ø)
feature_engine/selection/shuffle_features.py 100.00% <100.00%> (ø)
feature_engine/transformation/yeojohnson.py 100.00% <100.00%> (ø)
... and 1 more


@solegalli solegalli merged commit bd4c8fc into feature-engine:main Sep 12, 2023
9 checks passed