-
-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise informative error on duplicated column names #686
Raise informative error on duplicated column names #686
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @david-cortes
Thank you so much for taking care of this issue.
I wonder if instead of modifying all the check variable functions, it would not be better to modify the check_X that checks the input dataframe. For example here:
if isinstance(X, pd.DataFrame): |
In this case, if the dataframe contains duplicated variables, it will raise an error and stop all computations right at the top of the fit method.
Do you know if / how pandas warns users of duplicated column names somehow?
if isinstance(df_check, pd.Series): | ||
return | ||
if df_check.columns.duplicated().any(): | ||
raise ValueError("Input data contains duplicated variable names.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick questions:
Does df_check.columns.duplicated().any() raise an error on a series?
Is this implementation better practice? or would it be better to say:
if not isinstance(df_check, pd.Series):
if df_check.columns.duplicated().any():
raise ValueError("Input data contains duplicated variable names.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a. It does raise an error, since Series
objects do not have columns
. Nevertheless, if slicing a DataFrame
with a column name returns a Series
, it means there was only one column with that name.
b. No idea. The linter didn't complain about the one in my initial commit.
3a4f516
to
427aaa2
Compare
Moved the check to Regarding errors from Also changed the mechanism towards the attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @david-cortes
Thank you so much for the link to pandas documentation on duplicate labels. That was very useful. And thanks for the code changes in check_X. That looks very good.
After reading the documentation, I think this addition you are proposing is important and very useful.
I think, it'd be important to add a test in the test_check_x file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_dataframe_checks.py
Our test_files mirror our .py files. So we would not have a file in variable_handling if we do not add a function in variable handling. So I think we need to remove the file you added called test_error_on_duplicated
, and pass the test to other files.
This test is a generic test that all transformers should pass. We have generic tests here: https://github.com/feature-engine/feature_engine/tree/main/tests/estimator_checks
But I need to refurbish that file a lot. So I wouldn't want to waste your time there. If you could just add a test to the test_check_x file I mention previously, and then create an issue saying that we need to add a generic test for all transformers to test that they raise error on duplicates, that would be great and we could finish this PR.
Thank you!
def test_error_on_autodetected_numerical(): | ||
with pytest.raises(ValueError): | ||
est = BaseNumericalTransformer() | ||
est.variables = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you take these 2 lines, which should not raise an error, out of the with pytest...
?
To be absolutely sure that we are triggering the desired error, we should also test that the error raised matches the text of the error in check_X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this file anyways, so feel free to ignore my comments here.
|
||
def test_error_on_autodetected_datetime(): | ||
with pytest.raises(ValueError): | ||
DatetimeFeatures().fit(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to ensure that the fit method is what's failing, we should initialise the transformer outside the with pytest...
with pytest.raises(ValueError): | ||
est = BaseNumericalTransformer() | ||
est.variables = ["same"] | ||
est.fit(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could put together this test with that on line 26 and use parametrize to set variables first to None and then to same. Then it would be test_base_numerical_transformer
est = MockedCategoricalTransformer() | ||
est.variables = ["same"] | ||
est.ignore_format = True | ||
est.transform(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for this test, we should put it together with that on line 33 and use parametrize.
I don't understand why you need a mock class here, and why you are overwriting the fit method?
|
||
def test_error_on_manually_specified_datetime(): | ||
with pytest.raises(ValueError): | ||
DatetimeFeatures(variables=["same"]).fit(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, put together in 1 test with parametrize.
Moved the tests to |
tests/test_dataframe_checks.py
Outdated
) | ||
df.columns = ["same", "unique", "same"] | ||
|
||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: could we add that it tests the error message as well? And example on how to do so here:
feature_engine/tests/test_encoding/test_woe/test_woe_encoder.py
Lines 189 to 197 in 3343305
with pytest.raises(ValueError) as record: | |
encoder.fit(df[["var_A", "var_B"]], df["target"]) | |
msg = ( | |
"During the WoE calculation, some of the categories in the " | |
"following features contained 0 in the denominator or numerator, " | |
"and hence the WoE can't be calculated: var_A." | |
) | |
assert str(record.value) == msg |
Added a check on the error message. |
Hey @david-cortes I made a PR to your repo: david-cortes#4 Where I rebase main and add this contribution to the changelog. Would you have time to merge over there, so it updates here and I can merge and close? Thanks a lot! |
rebase main to fix failing tests and add change to changelog
Thanks, although I think you should also be able to push changes to the branch directly. |
Codecov Report
@@ Coverage Diff @@
## main #686 +/- ##
=======================================
Coverage 97.99% 97.99%
=======================================
Files 100 100
Lines 3843 3849 +6
Branches 754 752 -2
=======================================
+ Hits 3766 3772 +6
Misses 28 28
Partials 49 49
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
This PR adds an informative error message in cases in which the user supplies inputs having duplicated column names, which otherwise manifest in hard-to-track errors (e.g. #681).