Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up pipeline graphing code #423

Merged
merged 15 commits into from
Mar 6, 2020
Merged

Clean up pipeline graphing code #423

merged 15 commits into from
Mar 6, 2020

Conversation

dsherry
Copy link
Contributor

@dsherry dsherry commented Feb 27, 2020

Cleaning up the API for 1) generating images of pipeline graphs and 2) generating bar graphs of feature importances. It felt like these should be direct members of PipelineBase rather than having indirection through another class.

I haven't modified the logic for generating the graphs, just how that functionality is invoked.

@dsherry dsherry added enhancement An improvement to an existing feature. refactor Work being done to refactor code. labels Feb 27, 2020
@codecov
Copy link

codecov bot commented Feb 27, 2020

Codecov Report

Merging #423 into master will increase coverage by 0.06%.
The diff coverage is 97.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #423      +/-   ##
==========================================
+ Coverage   98.19%   98.25%   +0.06%     
==========================================
  Files         104      104              
  Lines        3260     3265       +5     
==========================================
+ Hits         3201     3208       +7     
+ Misses         59       57       -2     
Impacted Files Coverage Δ
evalml/pipelines/graphs.py 95.83% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 080f509...4941a4e. Read the comment docs.

@angela97lin
Copy link
Contributor

Hmmm, this is something we've previously discussed here. I think we decided to move everything into a separate PipelinePlot class because it would keep the implementation separate from the plots, making it cleaner to understand what was what. I believe we implemented this similar to how pandas did it too (https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html). What are the benefits you see in adding them to the base class directly?

@kmax12, do you have any thoughts on this?

Also, could be important to keep this issue in mind?

@kmax12
Copy link
Contributor

kmax12 commented Feb 27, 2020

seems like there are two concerns here

  1. what should the API be for users

pipeline.plot.feature_importances vs pipeline.feature_importance_plot

I don't have a strong opinion on this. just that we are consistent. I can see that the .plot accessor might not be the best.

  1. how to organize code
    I'm a proponent of separating behind the scene
  • Helps us separate business logic / mach learning code from visualization code
  • makes the easier to one day separate the visualize into it's own library
  • make it easier to test (i could be wrong about this)
  • you can use the visualization code as long as you have the required data rather than needing an actual pipeline instance. this can make it lighter weight to generate visualization because we don't need to load in the actual model into memory in order to visualize

@dsherry
Copy link
Contributor Author

dsherry commented Feb 27, 2020

Thanks for the context @angela97lin. I like that summary @max.

My rationale here was to do the simplest thing, both in terms of the user API and the code organization.

For the user API: I think its more clear to have the graph methods be direct attributes of the pipeline, without indirection through an additional .plot namespace. I see what you mean about pandas plot: they've created a plot namespace attached to the dataframe object. I don't feel strongly about this. I guess one thing to consider is: do we anticipate adding more pipeline-related graphs? If not, this approach could be better. Pandas defines like 15 or something lol, hence the namespace.

For the code organization: this PR keeps the graphing code in a separate file, with flat methods, and then calls those methods in PipelineBase. I think this organization makes it easy to follow the code. I'm not opposed to keeping this in a class, this just felt more appropriate in conjunction with removing the .plot namespace above.

Also I changed the naming from "plot" to "graph", because both the pipeline graph and feature importance bar graph are graphs :)

@kmax12
Copy link
Contributor

kmax12 commented Mar 2, 2020

For the user API: I think its more clear to have the graph methods be direct attributes of the pipeline, without indirection through an additional .plot namespace. I see what you mean about pandas plot: they've created a plot namespace attached to the dataframe object. I don't feel strongly about this. I guess one thing to consider is: do we anticipate adding more pipeline-related graphs? If not, this approach could be better. Pandas defines like 15 or something lol, hence the namespace.

I wouldn't be surprised if we do add more, but i don't think we need to make an API decision in anticipation of that.

I think the only other ones to be aware are the confusion_matrix and roc plot ones. These are currently defined on the Auto search classes, but I know we been discussing moving them to the pipeline object, which I 100% agree with.

For the code organization: this PR keeps the graphing code in a separate file, with flat methods, and then calls those methods in PipelineBase. I think this organization makes it easy to follow the code. I'm not opposed to keeping this in a class, this just felt more appropriate in conjunction with removing the .plot namespace above.

Yep, I actually like this organization of the code more than the plotting classes we currently have.

My thought would be to actually push those methods you've defined to be more general purpose. e.g the make_feature_importance_graph function take in the dataframe of feature importances rather than the pipeline object. similarly make_pipeline_graph should take in a schema representation of the pipeline rather than the object. this would also those methods to be used with feature importances and pipelines defined with other libraries or when you only have that data, but not the full objects

@@ -8,6 +8,7 @@ Changelog
* Add CatBoost (gradient-boosted trees) classification and regression components and pipelines :pr:`247`
* Added Tuner abstract base class :pr:`351`
* Added n_jobs as parameter for AutoClassificationSearch and AutoRegressionSearch :pr:`403`
* Added PipelineBase graph and feature_importance_graph methods, moved from previous location :pr:`423`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also add this as a breaking change

Returns:
plotly.Figure, a bar graph showing features and their importances
"""
feat_imp = pipeline.feature_importances
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the input to this method should be a dataframe i.e whatever pipeline.feature_importances is. that would make it more general purpose

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, will do

@dsherry dsherry self-assigned this Mar 6, 2020
@dsherry dsherry force-pushed the ds_rename_pipeline_plots branch from 1deb2d8 to 97a3466 Compare March 6, 2020 20:29
f = open(filepath, 'w')
f.close()
except IOError:
raise ValueError(('Specified parent directory does not exist: {}'.format(filepath)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did end up adding one thing: make sure the file location is writeable. Added unit test

@dsherry
Copy link
Contributor Author

dsherry commented Mar 6, 2020

Ugh, the codecov/patch job is failing. I think it's because I added a new file for the graphing, but the unit tests cover the callsite of the functions in the file, not the functions themselves. I stand by this as a testing practice; the code is covered where it is called. Therefore I'm going to ignore the codecov failure and merge anyways. It's annoying that the failed job doesn't explain why it failed...

@dsherry dsherry force-pushed the ds_rename_pipeline_plots branch from 3a59ddf to 4941a4e Compare March 6, 2020 20:56
@dsherry dsherry merged commit 91d62ec into master Mar 6, 2020
@dsherry dsherry deleted the ds_rename_pipeline_plots branch March 6, 2020 21:12
@angela97lin angela97lin mentioned this pull request Mar 9, 2020
@dsherry dsherry mentioned this pull request Mar 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement to an existing feature. refactor Work being done to refactor code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants