New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache feature names #536
Cache feature names #536
Conversation
Computing the name of a feature can be expensive because primitives use reflection to generate their names, and when features are stacked the number of calls to get_name() grows. This commit caches the name when it is computed the first time. With max_depth=2 this sped up ft.dfs (just building the features) by about 3x, and the improvement is greater for higher values of max_depth.
Codecov Report
@@ Coverage Diff @@
## master #536 +/- ##
=======================================
Coverage 96.26% 96.26%
=======================================
Files 114 114
Lines 9258 9258
=======================================
Hits 8912 8912
Misses 346 346
Continue to review full report at Codecov.
|
I can't think of any reason the name would have to be invalidated since there are no public methods on features that allow you to change the underlying pieces that are used to generate a name. @rwedge can you think of anything? |
I can't think of anything |
if self._name is not None: | ||
return self._name | ||
return self.generate_name() | ||
if not self._name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rwedge I realized that _name
is defined as a class variable, but when we generate the name we then create an instance variable shadowing the class variable. This seems more confusing to me than putting self._name = None
in the initializer, or is it a python idiom that I'm not familiar with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. we should move to self._name = None
in the init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Computing the name of a feature can be expensive because primitives use
reflection to generate their names, and when features are stacked the
number of calls to
get_name()
grows. This commit caches the name when itis computed the first time. Is it reasonable to assume that features are
immutable, or are there places where we should invalidate the cache?
With
max_depth=2
this sped upft.dfs
(just building the features) byabout 3x, and the improvement is greater for higher values of
max_depth
.