
Update generate_pipeline_code to return pipeline instances without generating custom pipeline class #2227

Merged
merged 12 commits into main from 2160_generate_code on May 5, 2021

Conversation

@angela97lin (Contributor) commented May 4, 2021

Closes #2160.

This is done by updating our pipelines' repr (since it's a bit outdated and only contained parameters) and then using that updated repr in generate_pipeline_code :)

Also fixes a mini bug with baseline pipelines in AutoMLSearch.
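For illustration only (a rough sketch; the component names and parameters below are hypothetical, not taken from this PR): instead of emitting a custom pipeline class definition plus a JSON parameter blob, generate_pipeline_code now emits code built from the pipeline's repr, which evaluates directly to a pipeline instance, roughly like:

    from evalml.pipelines import BinaryClassificationPipeline

    pipeline = BinaryClassificationPipeline(
        component_graph=["Imputer", "Random Forest Classifier"],
        parameters={"Imputer": {"numeric_impute_strategy": "mean"}},
        custom_name="My Pipeline",
    )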

@angela97lin self-assigned this May 4, 2021
codecov bot commented May 5, 2021

Codecov Report

Merging #2227 (b5947d2) into main (96fed90) will decrease coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2227     +/-   ##
=======================================
- Coverage   99.9%   99.9%   -0.0%     
=======================================
  Files        280     280             
  Lines      24152   24089     -63     
=======================================
- Hits       24127   24064     -63     
  Misses        25      25             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.0% <ø> (ø)
evalml/pipelines/pipeline_base.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/pipelines/utils.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -720,15 +720,15 @@ def _get_baseline_pipeline(self):
if self.problem_type == ProblemTypes.BINARY:
baseline = BinaryClassificationPipeline(component_graph=["Baseline Classifier"],
custom_name="Mode Baseline Binary Classification Pipeline",
custom_hyperparameters={"strategy": ["mode"]})
@angela97lin (Contributor, author) commented on this diff:
This was the odd dask issue that came up--it wasn't happy with this line. This is incorrect anyway (it doesn't specify which component), but funny that this didn't come up in other tests... I think it's because the default is mode/mean anyway, so AutoML just did its thing and ignored this, but now that I'm adding custom hyperparameters to repr, it bugged out 😂

I think it makes sense to change this.
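(For context, a sketch of the two shapes involved; the nested form is presumably what the fix moves to. The updated __repr__ walks custom_hyperparameters as a dict of per-component dicts, so the flat form above trips it up:)

    # Flat form used above: the value is a list, so repr_component ends up
    # calling .items() on a list and raises AttributeError (see the traceback below).
    custom_hyperparameters = {"strategy": ["mode"]}

    # Nested, per-component form that the updated __repr__ can walk:
    custom_hyperparameters = {"Baseline Classifier": {"strategy": ["mode"]}}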

Contributor:

Do you have the dask stacktrace?

@angela97lin (Contributor, author):

Here ya go!


    def test_automl_max_iterations(self):
        """ Making sure that the max_iterations parameter limits the number of pipelines run. """
        X, y = self.X_y_binary
        max_iterations = 4
        par_automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary", engine=self.parallel_engine,
                                  max_iterations=max_iterations)
>       par_automl.search()

evalml\tests\automl_tests\test_automl_dask.py:57: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
evalml\automl\automl_search.py:580: in search
    self._add_baseline_pipelines()
evalml\automl\automl_search.py:753: in _add_baseline_pipelines
    computation = self._engine.submit_evaluation_job(self.automl_config, baseline, self.X_train, self.y_train)
evalml\automl\engine\dask_engine.py:89: in submit_evaluation_job
    logger=logger)
c:\users\runneradmin\miniconda3\envs\curr_py\lib\site-packages\distributed\client.py:1565: in submit
    key = funcname(func) + "-" + tokenize(func, kwargs, *args)
c:\users\runneradmin\miniconda3\envs\curr_py\lib\site-packages\dask\base.py:795: in tokenize
    return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
c:\users\runneradmin\miniconda3\envs\curr_py\lib\site-packages\dask\utils.py:512: in __call__
    return meth(arg, *args, **kwargs)
c:\users\runneradmin\miniconda3\envs\curr_py\lib\site-packages\dask\base.py:806: in normalize_dict
    return normalize_token(sorted(d.items(), key=str))
evalml\pipelines\pipeline_base.py:528: in __repr__
    custom_hyperparameters_repr = ', '.join([f"'{component}':{{{repr_component(hyperparameters)}}}," for component, hyperparameters in self.custom_hyperparameters.items()]) if self.custom_hyperparameters else None
evalml\pipelines\pipeline_base.py:528: in <listcomp>
    custom_hyperparameters_repr = ', '.join([f"'{component}':{{{repr_component(hyperparameters)}}}," for component, hyperparameters in self.custom_hyperparameters.items()]) if self.custom_hyperparameters else None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

parameters = ['mode']

    def repr_component(parameters):
>       return ', '.join([f"'{key}': {safe_repr(value)}" for key, value in parameters.items()])
E       AttributeError: 'list' object has no attribute 'items'

Contributor:

Thanks! I wanted to know how dask ended up calling repr

'\n\nparameters = json.loads("""{\n\t"My Custom Estimator": {\n\t\t"random_arg": false\n\t}\n}""")' \
'\npipeline = MockAllCustom(parameters)'
pipeline = generate_pipeline_code(mockAllCustom)
mock_pipeline_with_custom_components = BinaryClassificationPipeline([CustomTransformer, CustomEstimator])
@angela97lin (Contributor, author):
Deleted a lot here; I don't think we need to test one with a transformer, one with an estimator, and one with both--just testing one with both.

@@ -591,7 +591,7 @@ class MockRegressionPipeline(RegressionPipeline):
generate_pipeline_code([Imputer(), LogisticRegressionClassifier()])


def test_generate_code_pipeline_json_errors():
def test_generate_code_pipeline_json_with_objects():
@angela97lin (Contributor, author):
This no longer errors :)

With that, I just kept one or two tests covering different types of objects :)
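(A hypothetical illustration of what "different types of objects" means here; the exact objects used in the tests may differ. Because the generated code is now built from repr rather than a JSON dump, parameters whose values are objects rather than plain JSON data no longer break code generation:)

    import numpy as np

    from evalml.pipelines import BinaryClassificationPipeline
    from evalml.pipelines.utils import generate_pipeline_code

    # An object-valued parameter that a JSON-based approach would choke on.
    pipeline = BinaryClassificationPipeline(
        component_graph=["Imputer", "Random Forest Classifier"],
        parameters={"Imputer": {"numeric_fill_value": np.float32(1.5)}},
    )
    code = generate_pipeline_code(pipeline)  # returns a repr-based code string instead of raising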

@freddyaboulton (Contributor) left a comment:

@angela97lin I think there's a bug somewhere? I get a syntax error when I specify a pipeline with more than one key in custom_hyperparameters:

[screenshot of the SyntaxError]

Just specifying the random forest classifier seems to work:

[screenshot of the working output]

I also noticed that the skopt imports are not added to the code string which I believe would cause a stacktrace if you ran the code in a different environment.

Did we ever decide what the long-term plan for custom_hyperparameters is? I thought we had talked about deleting it but I don't see it in the epic (that's been closed?).
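(To make the import point concrete, a hypothetical example, not taken from the PR: custom_hyperparameters defined with skopt search spaces repr to code that references those classes, so the generated snippet only runs in another environment if the user adds the imports themselves.)

    # NOT emitted by generate_pipeline_code; the user has to add this import.
    from skopt.space import Integer

    from evalml.pipelines import BinaryClassificationPipeline

    pipeline = BinaryClassificationPipeline(
        component_graph=["Imputer", "Random Forest Classifier"],
        custom_hyperparameters={
            "Random Forest Classifier": {"n_estimators": Integer(50, 200)},
        },
    )
    # repr(pipeline), and therefore the generated code, contains "Integer(50, 200)",
    # which raises NameError in a fresh environment unless skopt is imported first.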

@angela97lin (Contributor, author):
@freddyaboulton Aiyah, thank you--I'll take a look at this, and add a test. Good catch! 😅

I think it's okay that the imports aren't added and we require the user to add the external imports themselves. What do you think?

Also, the long-term plan for custom_hyperparameters is tracked here: #2130

We closed the epic since we figured it was just one-off improvement issues and didn't require an overarching epic :)

@freddyaboulton (Contributor):
Sounds good @angela97lin ! Thanks for pointing me to #2130 !

@bchen1116 (Contributor) left a comment:

@angela97lin I like this fix! I do think users would prefer to have formatted code returned, as an issue was filed about this here. Functionality-wise, this looks good to me (beyond the bug @freddyaboulton found).

Let me know what you think!

@angela97lin (Contributor, author) commented May 5, 2021:

@freddyaboulton @bchen1116 Fixed! lol turns out I just had an extra floating comma lying around, thanks for bringing it up!

Also updated the repr test to make sure, and added a statement in our notebook that external imports are necessary to execute the code :))

Also, formatted code is nice, but functionality-wise it's not necessary. Since there's already an issue for it, I'm happy to leave that as is and not address it here 😂
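(For the curious, the stray comma is visible in the repr code shown in the traceback earlier: each per-component chunk already ends with a comma and the chunks are then joined with ", ", so two or more keys in custom_hyperparameters produce a double comma, which is a SyntaxError when the generated code is run, while a single key only leaves a harmless trailing comma. A minimal sketch of the failure mode, not the actual evalml code:)

    chunks = [
        "'Imputer':{'numeric_impute_strategy': ['mean']},",
        "'Random Forest Classifier':{'n_estimators': [100, 200]},",
    ]
    bad = "{" + ", ".join(chunks) + "}"
    # -> "{'Imputer':{...},, 'Random Forest Classifier':{...},}"  (double comma: SyntaxError)

    good = "{" + ", ".join(c.rstrip(",") for c in chunks) + "}"
    # -> a valid dict literal once each chunk's trailing comma is dropped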

@bchen1116 (Contributor) left a comment:

Filed the reformatting issue here.

Solid tests and changes! LGTM!

@freddyaboulton (Contributor) left a comment:

Looks good to me @angela97lin ! Thanks for the fix!

@angela97lin merged commit 3fc568d into main May 5, 2021
@angela97lin deleted the 2160_generate_code branch May 5, 2021 21:29
@chukarsten mentioned this pull request May 17, 2021
Development

Successfully merging this pull request may close these issues.

Update generate_pipeline_code to create instances instead of pipeline subclasses