-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add __str__
and __repr__
for components and pipelines
#1218
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1218 +/- ##
=======================================
Coverage 99.92% 99.92%
=======================================
Files 200 200
Lines 12369 12468 +99
=======================================
+ Hits 12360 12459 +99
Misses 9 9
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great but can you add unit tests that test str and repr (in the same fashion you have for the mock components) for all existing components with @pytest.mark.parametrize("component_class", all_components())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great! Theres just one extra print before merging. I do have one design question though. Would users enjoy more information out of __str__
as well? We specced it out to do self.name
but maybe we could include parameters as well. We can think a little more about this and file another issue or just see if we get a feature request for it but not blocking on this issue. Good work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left two nit-picky comments but that's all 😁
if type(value) == str: | ||
rpr = rpr + f"{parameter}='{value}'," | ||
elif value == float('inf') or value == float('-inf'): | ||
rpr = rpr + f"{parameter}=float('{value}')," | ||
else: | ||
rpr = rpr + f"{parameter}={value}," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe overkill but could be useful to make a helper function that takes in parameters
and returns that portion of the repr, so that it can be shared with the pipeline_base
implementation? :o
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love that idea, but unfortunately pipelines need the parameters as a dict, while components need them as parameters (basically just having :
or =
in there). If you have any ideas for how to work around that I'm open to them!
assert enc.describe(return_dict=True) == {'name': 'One Hot Encoder', 'parameters': {'top_n': 10, | ||
'categories': None, | ||
'drop': None, | ||
'handle_unknown': 'ignore', | ||
'handle_missing': 'error'}} | ||
drop_col_transformer = DropColumns(columns=['col_one', 'col_two']) | ||
assert imputer.describe(return_dict=True) == {'name': 'Simple Imputer', 'parameters': {'impute_strategy': 'mean', 'fill_value': None}} | ||
assert imputer.describe(return_dict=True) == {'name': 'Imputer', 'parameters': {'categorical_impute_strategy': "most_frequent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol thanks for this! I wonder if there's a way to automate this since it's clear that we missed quite a few of our new components (maybe using all_components()
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay great work on this! There's some seriously tricky stuff going on here for a feature which looks so simple from the outside. It's really cool we're doing this :)
I left a suggested change on the impl along with some discussion. I also left a couple items on testing. Should be all set after that IMO.
else: | ||
rpr = rpr + f"{parameter}={value}," | ||
rpr = rpr + ')' | ||
return rpr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay good call looking into float('inf')
!! Another one is float('nan')
. You've hit on what appears to be a funny bug and/or design flaw with python's float
type: repr(float('inf'))
comes back as 'inf'
, which... is just messed up! 🤔 Same for float('nan')
. I think they did this so that float(repr(float('inf')))
is equivalent to float('inf')
, and same for nan.
The same is true for np.inf
and np.nan
, by the way:
In [6]: parameters = {'int': 42, 'string': 'string', 'float': 3.14159,
'inf': float('inf'), 'np inf': np.inf, 'np nan': np.nan}
In [7]: repr(parameters)
Out[7]: "{'int': 42, 'string': 'string', 'float': 3.14159, 'inf': inf, 'np inf': inf, 'np nan': nan}"
So, what do we do? It would be really nice if we could have what comes out of repr(component)
be something we could paste into a terminal in order to get an identical instance.
I like the approach you're taking. I played with it for a bit and here's what I got:
# define this in gen_utils so we can use it for the pipeline repr too
def safe_repr(value):
if isinstance(value, float):
if pd.isna(value):
return 'np.nan'
if np.isinf(value):
return f'float({repr(value)})'
return repr(value)
# then the component repr:
def __repr__(self):
parameters_repr = ', '.join([f'{key}={safe_repr(value)}' for key, value in self.parameters.items()])
return f'{self.name}({parameters_repr})'
Here's a test I did of the safe_repr
, which shows that it outputs the string we need in order to avoid some ugly bugs:
In [1]: [safe_repr(el) for el in ['string', float('nan'), float('-inf'), float('inf'), np.nan, -np.inf, np.inf]]
Out[1]:
["'string'", 'np.nan', 'float(-inf)', 'float(inf)', 'np.nan', 'float(-inf)', 'float(inf)']
What do you think of that? Weird stuff!
evalml/pipelines/pipeline_base.py
Outdated
rpr = rpr + f"'{parameter}': {value}, " | ||
rpr = rpr + "}, " | ||
rpr = rpr + '})' | ||
return rpr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for component repr, let's flatten this out, and great thinking trying nan/inf types!
assert eval(repr(pipeline_with_parameters)) == pipeline_with_parameters | ||
|
||
pipeline_with_inf_parameters = MockPipeline(parameters={'Imputer': {'numeric_fill_value': float('inf')}}) | ||
assert eval(repr(pipeline_with_inf_parameters)) == pipeline_with_inf_parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is pretty cool! Great thinking enforcing this invariant on repr
Two things:
-
RE our discussion of nan types at the top, it would be ideal if we had a similar test to this one which checks all those edge cases. Example: define a mock component which has a few input parameters, and pass in
float('nan')
/np.nan
and the native/np infs. -
There are security concerns with using python's
eval
. It turns out if you can compromise what goes intoeval
you can do a lot of bad stuff. To avoid this entirely, my recommendation is to avoid usingeval
and just check the string output matches what we expect. I know, this is test code so it feels kinda silly, lol--related reading if you're interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet!! Approved pending a couple small things:
- Remove extra curly braces from pipeline repr
- Add docstring for
safe_repr
return ', '.join([f"'{key}': {safe_repr(value)}" for key, value in parameters.items()]) | ||
|
||
parameters_repr = ' '.join([f"'{component}':{{{repr_component(parameters)}}}," for component, parameters in self.parameters.items()]) | ||
return f'{(type(self).__name__)}(parameters={{{parameters_repr}}})' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you want double curly braces instead of triple here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the parameters is a dictionary and contains an expression, I do need all three! The first two are for literal curly braces and the third is for the formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm interesting, but then the output (taken from the unit test) ends up as
MockPipeline(parameters={{'Imputer':{{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}}, '{final_estimator}':{{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}},}})
But ideally I think it should be
MockPipeline(parameters={'Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'final_estimator': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}})
Do you agree? The second output could be evaluated in the python repl and turned into an object. I think the first would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay but, I think its fine to merge this and we can circle back. It adds value even if there's a couple details we may wanna discuss further :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The double curly braces in the unit tests are once again for f-string formatting! If you actually print the repr, you get the second code block you pasted there, and calling eval on the expected_repr produces the correct object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay hmm could you paste an example? I don't follow yet. I thought that __repr__
returns a string, which requires no further formatting, and that's the end of it 😂 In other words, I thought we should output the second snippet, because that can be pasted into the python REPL and evaluated, but we're currently outputting the first.
Again, not blocking merge, I'm just trying to make sure we all agree on desired behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay ok, I think I understand now. So the string I copied was a format string, which is why you need the double curlys there.
I just did a test locally and everything looks great
Realizing the unit tests were using a format string was the key point I was missing, lol. Thanks! 😆
component_graph = ['Imputer', final_estimator] | ||
|
||
pipeline = MockPipeline(parameters={}) | ||
expected_repr = f"MockPipeline(parameters={{'Imputer':{{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}}, '{final_estimator}':{{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}},}})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay I just noticed: can we replace '{final_estimator}':
with 'final_estimator':
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't block merge though IMO! Just a detail we should probably chase down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the way these tests are written, we can't. The string __repr__
prints out uses the name of the estimator, so checking for string equality will fail. If I replace {final_estimator}
with final_estimator
and try assert eval(repr(pipeline)) == eval(expected_repr)
, that will pass, but that will reintroduce eval
to the code.
The alternative of having __repr__
output final_estimator
instead of the name of said estimator feels unnecessarily clunky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! My bad, I didn't notice this was a format string until just now. Got it, makes sense. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonderful, glad we're on the same page 😅. Does this also clear up our other discussion?
Closes #474