-
-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor tree encoder: add encoding dict, param unseen, inverse transform #757
Conversation
Hey @VascoSch92 @glevv @93lorenzo If you have a minute to spare, a quick look to see that there is nothing big that I am missing will be much appreciated :) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #757 +/- ##
=======================================
Coverage 98.18% 98.19%
=======================================
Files 105 105
Lines 4074 4093 +19
Branches 795 802 +7
=======================================
+ Hits 4000 4019 +19
Misses 29 29
Partials 45 45 ☔ View full report in Codecov by Sentry. |
Thank you guys! @93lorenzo @glevv @VascoSch92 Really appreciate it :) |
"Parameter `unseen` takes only values ignore, raise, encode. " | ||
f"Got {unseen} instead." | ||
) | ||
with pytest.raises(ValueError) as record: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one should also be fixed, but I'm not sure why there would be an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment new? I don't think I left any record in the file I committed last. Just doubled checked and didnt find any left.
The error comes from the values within parametrize, when I pass a list or a tuple, say ["raise", "ignore"] and then that tuple is passed as {unseen}, it doesn't like it. But it resolved with re.escape(message here)
f"{var_ls}." | ||
) | ||
# new categories will raise an error | ||
with pytest.raises(ValueError) as record: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also look at this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I didn't miss them. I changed them to match=msg, but then these 2 returned an error which I didn't understand and I was lazy to resolve. Although the messages were identical, the test failed, and it said something about did you want to escape something? Did you come across this before? Probably something to do with regex. which is not my forte :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both lines of the msg should be f-strings in both cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure I follow: This is the error
E AssertionError: Regex pattern did not match.
E Regex: 'During the encoding, NaN values were introduced in the feature(s) var_A, var_B.'
E Input: 'During the encoding, NaN values were introduced in the feature(s) var_A, var_B.'
E Did you mean to `re.escape()` the regex?
And this is the code:
def test_fit_errors_if_new_cat_values_and_unseen_is_raise_param(df_enc):
encoder = DecisionTreeEncoder(unseen="raise", regression=False)
encoder.fit(df_enc[["var_A", "var_B"]], df_enc["target"])
X = pd.DataFrame(
{
"var_A": ["A", "ZZZ", "YYY"],
"var_B": ["C", "YYY", "ZZZ"],
}
)
var_ls = "var_A, var_B"
msg = (
f"During the encoding, NaN values were introduced in the feature(s) {var_ls}."
)
# new categories will raise an error
with pytest.raises(ValueError, match=msg):
encoder.transform(X)
# assert str(record.value) == msg
What am I doing wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. Match is a regex pattern search, so it gets confused by (s)
. Put an escape character (should be \
) like this:
rf"During the encoding, NaN values were introduced in the feature\(s\) {var_ls}."
Other options would be calling re.escape() on msg:
re.escape(f"During the encoding, NaN values were introduced in the feature(s) {var_ls}.")
or shortening the msg (since its not an equality but a regex match it would work):
f"{var_ls}."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Life saver @glevv thank you so much! Now it works :)
"unseen", ["string", False, ("raise", "ignore"), ["ignore"], np.nan] | ||
) | ||
def test_error_if_unseen_gets_not_permitted_value(unseen): | ||
msg = re.escape( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glevv fixed here
closes #729
closes #728
closes #588