Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor tree encoder: add encoding dict, param unseen, inverse transform #757

Merged
merged 7 commits into from May 14, 2024

Conversation

solegalli
Copy link
Collaborator

closes #729
closes #728
closes #588

@solegalli
Copy link
Collaborator Author

Hey @VascoSch92 @glevv @93lorenzo

If you have a minute to spare, a quick look to see that there is nothing big that I am missing will be much appreciated :)

Copy link

codecov bot commented May 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.19%. Comparing base (46155a0) to head (c0c1407).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #757   +/-   ##
=======================================
  Coverage   98.18%   98.19%           
=======================================
  Files         105      105           
  Lines        4074     4093   +19     
  Branches      795      802    +7     
=======================================
+ Hits         4000     4019   +19     
  Misses         29       29           
  Partials       45       45           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@solegalli solegalli changed the title refactor tree encoder: add encoding dict, param unseen, inverse transform [MRG] refactor tree encoder: add encoding dict, param unseen, inverse transform May 6, 2024
@solegalli
Copy link
Collaborator Author

Thank you guys! @93lorenzo @glevv @VascoSch92

Really appreciate it :)

"Parameter `unseen` takes only values ignore, raise, encode. "
f"Got {unseen} instead."
)
with pytest.raises(ValueError) as record:
Copy link
Contributor

@glevv glevv May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed one

Copy link
Contributor

@glevv glevv May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should also be fixed, but I'm not sure why there would be an error

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment new? I don't think I left any record in the file I committed last. Just doubled checked and didnt find any left.

The error comes from the values within parametrize, when I pass a list or a tuple, say ["raise", "ignore"] and then that tuple is passed as {unseen}, it doesn't like it. But it resolved with re.escape(message here)

f"{var_ls}."
)
# new categories will raise an error
with pytest.raises(ValueError) as record:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also look at this one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I didn't miss them. I changed them to match=msg, but then these 2 returned an error which I didn't understand and I was lazy to resolve. Although the messages were identical, the test failed, and it said something about did you want to escape something? Did you come across this before? Probably something to do with regex. which is not my forte :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both lines of the msg should be f-strings in both cases

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure I follow: This is the error

E       AssertionError: Regex pattern did not match.
E        Regex: 'During the encoding, NaN values were introduced in the feature(s) var_A, var_B.'
E        Input: 'During the encoding, NaN values were introduced in the feature(s) var_A, var_B.'
E        Did you mean to `re.escape()` the regex?

And this is the code:

def test_fit_errors_if_new_cat_values_and_unseen_is_raise_param(df_enc):
    encoder = DecisionTreeEncoder(unseen="raise", regression=False)
    encoder.fit(df_enc[["var_A", "var_B"]], df_enc["target"])
    X = pd.DataFrame(
        {
            "var_A": ["A", "ZZZ", "YYY"],
            "var_B": ["C", "YYY", "ZZZ"],
        }
    )
    var_ls = "var_A, var_B"
    msg = (
        f"During the encoding, NaN values were introduced in the feature(s) {var_ls}."
    )
    # new categories will raise an error
    with pytest.raises(ValueError, match=msg):
        encoder.transform(X)
    # assert str(record.value) == msg

What am I doing wrong?

Copy link
Contributor

@glevv glevv May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Match is a regex pattern search, so it gets confused by (s). Put an escape character (should be \) like this:

rf"During the encoding, NaN values were introduced in the feature\(s\) {var_ls}."

Other options would be calling re.escape() on msg:

re.escape(f"During the encoding, NaN values were introduced in the feature(s) {var_ls}.")

or shortening the msg (since its not an equality but a regex match it would work):

f"{var_ls}."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Life saver @glevv thank you so much! Now it works :)

"unseen", ["string", False, ("raise", "ignore"), ["ignore"], np.nan]
)
def test_error_if_unseen_gets_not_permitted_value(unseen):
msg = re.escape(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glevv fixed here

@solegalli solegalli changed the title [MRG] refactor tree encoder: add encoding dict, param unseen, inverse transform refactor tree encoder: add encoding dict, param unseen, inverse transform May 14, 2024
@solegalli solegalli merged commit 389b515 into main May 14, 2024
10 checks passed
@solegalli solegalli deleted the reformat_tree_encoder branch May 14, 2024 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants