Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add __str__ and __repr__ for components and pipelines #1218

Merged
merged 16 commits into from
Sep 29, 2020
Merged

Conversation

eccabay
Copy link
Contributor

@eccabay eccabay commented Sep 23, 2020

Closes #474

@codecov
Copy link

codecov bot commented Sep 23, 2020

Codecov Report

Merging #1218 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1218   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         200      200           
  Lines       12369    12468   +99     
=======================================
+ Hits        12360    12459   +99     
  Misses          9        9           
Impacted Files Coverage Δ
evalml/pipelines/components/component_base.py 100.00% <100.00%> (ø)
evalml/pipelines/pipeline_base.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <100.00%> (ø)
evalml/utils/__init__.py 100.00% <100.00%> (ø)
evalml/utils/gen_utils.py 99.09% <100.00%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9cba15...57040d2. Read the comment docs.

@eccabay eccabay marked this pull request as ready for review September 23, 2020 20:57
@eccabay eccabay self-assigned this Sep 23, 2020
Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great but can you add unit tests that test str and repr (in the same fashion you have for the mock components) for all existing components with @pytest.mark.parametrize("component_class", all_components())?

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great! Theres just one extra print before merging. I do have one design question though. Would users enjoy more information out of __str__ as well? We specced it out to do self.name but maybe we could include parameters as well. We can think a little more about this and file another issue or just see if we get a feature request for it but not blocking on this issue. Good work!

evalml/tests/component_tests/test_components.py Outdated Show resolved Hide resolved
Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left two nit-picky comments but that's all 😁

Comment on lines 161 to 166
if type(value) == str:
rpr = rpr + f"{parameter}='{value}',"
elif value == float('inf') or value == float('-inf'):
rpr = rpr + f"{parameter}=float('{value}'),"
else:
rpr = rpr + f"{parameter}={value},"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe overkill but could be useful to make a helper function that takes in parameters and returns that portion of the repr, so that it can be shared with the pipeline_base implementation? :o

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that idea, but unfortunately pipelines need the parameters as a dict, while components need them as parameters (basically just having : or = in there). If you have any ideas for how to work around that I'm open to them!

assert enc.describe(return_dict=True) == {'name': 'One Hot Encoder', 'parameters': {'top_n': 10,
'categories': None,
'drop': None,
'handle_unknown': 'ignore',
'handle_missing': 'error'}}
drop_col_transformer = DropColumns(columns=['col_one', 'col_two'])
assert imputer.describe(return_dict=True) == {'name': 'Simple Imputer', 'parameters': {'impute_strategy': 'mean', 'fill_value': None}}
assert imputer.describe(return_dict=True) == {'name': 'Imputer', 'parameters': {'categorical_impute_strategy': "most_frequent",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol thanks for this! I wonder if there's a way to automate this since it's clear that we missed quite a few of our new components (maybe using all_components()?)

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay great work on this! There's some seriously tricky stuff going on here for a feature which looks so simple from the outside. It's really cool we're doing this :)

I left a suggested change on the impl along with some discussion. I also left a couple items on testing. Should be all set after that IMO.

evalml/pipelines/components/component_base.py Show resolved Hide resolved
else:
rpr = rpr + f"{parameter}={value},"
rpr = rpr + ')'
return rpr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay good call looking into float('inf')!! Another one is float('nan'). You've hit on what appears to be a funny bug and/or design flaw with python's float type: repr(float('inf')) comes back as 'inf', which... is just messed up! 🤔 Same for float('nan'). I think they did this so that float(repr(float('inf'))) is equivalent to float('inf'), and same for nan.

The same is true for np.inf and np.nan, by the way:

In [6]: parameters = {'int': 42, 'string': 'string', 'float': 3.14159,
                      'inf': float('inf'), 'np inf': np.inf, 'np nan': np.nan}

In [7]: repr(parameters)
Out[7]: "{'int': 42, 'string': 'string', 'float': 3.14159, 'inf': inf, 'np inf': inf, 'np nan': nan}"

So, what do we do? It would be really nice if we could have what comes out of repr(component) be something we could paste into a terminal in order to get an identical instance.

I like the approach you're taking. I played with it for a bit and here's what I got:

# define this in gen_utils so we can use it for the pipeline repr too
def safe_repr(value):
    if isinstance(value, float):
        if pd.isna(value):
            return 'np.nan'
        if np.isinf(value):
            return f'float({repr(value)})'
    return repr(value)

# then the component repr:
def __repr__(self):
    parameters_repr = ', '.join([f'{key}={safe_repr(value)}' for key, value in self.parameters.items()])
    return f'{self.name}({parameters_repr})'

Here's a test I did of the safe_repr, which shows that it outputs the string we need in order to avoid some ugly bugs:

In [1]: [safe_repr(el) for el in ['string', float('nan'), float('-inf'), float('inf'), np.nan, -np.inf, np.inf]]
Out[1]:
["'string'", 'np.nan', 'float(-inf)', 'float(inf)', 'np.nan', 'float(-inf)', 'float(inf)']

What do you think of that? Weird stuff!

rpr = rpr + f"'{parameter}': {value}, "
rpr = rpr + "}, "
rpr = rpr + '})'
return rpr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for component repr, let's flatten this out, and great thinking trying nan/inf types!

assert eval(repr(pipeline_with_parameters)) == pipeline_with_parameters

pipeline_with_inf_parameters = MockPipeline(parameters={'Imputer': {'numeric_fill_value': float('inf')}})
assert eval(repr(pipeline_with_inf_parameters)) == pipeline_with_inf_parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is pretty cool! Great thinking enforcing this invariant on repr

Two things:

  1. RE our discussion of nan types at the top, it would be ideal if we had a similar test to this one which checks all those edge cases. Example: define a mock component which has a few input parameters, and pass in float('nan')/np.nan and the native/np infs.

  2. There are security concerns with using python's eval. It turns out if you can compromise what goes into eval you can do a lot of bad stuff. To avoid this entirely, my recommendation is to avoid using eval and just check the string output matches what we expect. I know, this is test code so it feels kinda silly, lol--related reading if you're interested.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet!! Approved pending a couple small things:

  • Remove extra curly braces from pipeline repr
  • Add docstring for safe_repr

evalml/pipelines/components/component_base.py Outdated Show resolved Hide resolved
evalml/utils/gen_utils.py Show resolved Hide resolved
evalml/tests/pipeline_tests/test_pipelines.py Show resolved Hide resolved
return ', '.join([f"'{key}': {safe_repr(value)}" for key, value in parameters.items()])

parameters_repr = ' '.join([f"'{component}':{{{repr_component(parameters)}}}," for component, parameters in self.parameters.items()])
return f'{(type(self).__name__)}(parameters={{{parameters_repr}}})'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the parameters is a dictionary and contains an expression, I do need all three! The first two are for literal curly braces and the third is for the formatting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm interesting, but then the output (taken from the unit test) ends up as

MockPipeline(parameters={{'Imputer':{{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}}, '{final_estimator}':{{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}},}})

But ideally I think it should be

MockPipeline(parameters={'Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'final_estimator': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}})

Do you agree? The second output could be evaluated in the python repl and turned into an object. I think the first would fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay but, I think its fine to merge this and we can circle back. It adds value even if there's a couple details we may wanna discuss further :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double curly braces in the unit tests are once again for f-string formatting! If you actually print the repr, you get the second code block you pasted there, and calling eval on the expected_repr produces the correct object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay hmm could you paste an example? I don't follow yet. I thought that __repr__ returns a string, which requires no further formatting, and that's the end of it 😂 In other words, I thought we should output the second snippet, because that can be pasted into the python REPL and evaluated, but we're currently outputting the first.

Again, not blocking merge, I'm just trying to make sure we all agree on desired behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay ok, I think I understand now. So the string I copied was a format string, which is why you need the double curlys there.

I just did a test locally and everything looks great

Screen Shot 2020-09-29 at 10 40 39 AM

Realizing the unit tests were using a format string was the key point I was missing, lol. Thanks! 😆

component_graph = ['Imputer', final_estimator]

pipeline = MockPipeline(parameters={})
expected_repr = f"MockPipeline(parameters={{'Imputer':{{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}}, '{final_estimator}':{{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}},}})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay I just noticed: can we replace '{final_estimator}': with 'final_estimator':?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't block merge though IMO! Just a detail we should probably chase down

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the way these tests are written, we can't. The string __repr__ prints out uses the name of the estimator, so checking for string equality will fail. If I replace {final_estimator} with final_estimator and try assert eval(repr(pipeline)) == eval(expected_repr), that will pass, but that will reintroduce eval to the code.

The alternative of having __repr__ output final_estimator instead of the name of said estimator feels unnecessarily clunky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! My bad, I didn't notice this was a format string until just now. Got it, makes sense. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful, glad we're on the same page 😅. Does this also clear up our other discussion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add __str__+__repr__ for ComponentBase+PipelineBase
4 participants