-
Notifications
You must be signed in to change notification settings - Fork 898
Add Rolling Transform primitives #1770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f7c3afa
to
82056ca
Compare
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'}) | ||
compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS] | ||
|
||
def __init__(self, window_length=3, gap=0, min_periods=1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously the default window-length was 1d
, which makes sense and will be a valid window with multiple instances present if the data's sampling frequency is less than 1 day. At this point in the implementation, though, we do not have access to date offsets, and setting the window length to 1
doesn't make much sense, as it would never end up with more than one instance present in the window.
I think I might prefer using window_length=7
or something that has some time-related meaning. But chose 3 for the moment just to show the arbitrary nature of having to set a default for this parameter.
Also, when we implement #1757, we can go back to setting the default to be 1d
, but for now, I don't think there's a way to ensure this won't be a breaking change when that's implemented.
""" | ||
# Workaround for pandas' fixed but unreleased bug: https://github.com/pandas-dev/pandas/issues/43016 | ||
# Can remove when upgraded to pandas 1.4.0 | ||
if str(series.dtype) == 'Int64': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue to remove this eventually #1771
Codecov Report
@@ Coverage Diff @@
## main #1770 +/- ##
==========================================
+ Coverage 98.70% 98.72% +0.01%
==========================================
Files 139 141 +2
Lines 15419 15644 +225
==========================================
+ Hits 15220 15445 +225
Misses 199 199
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamargrey The behavior of these primitives makes sense to me! The only thing that stood out to me was the difference in parameter definitions between EvalML and Featuretools that we chatted about. When we get to datestrings we will need your help in transforming the EvalML definitions to featuretools, but we agreed the "add 1" operation is possible.
Don't think EvalML will ever set min_periods to be different than window_length but that may be useful for other kinds of problems.
If users use RollingSTD
with min_periods=1, there will be one extra nan compared to the other primitives because the standard deviation of a single observation is not defined. That makes sense to me.
featuretools/primitives/utils.py
Outdated
window_length (int): The number of rows to be included in each window. For data | ||
with a uniform sampling frequency, for example of one day, the window_length will | ||
correspond to a period of time, in this case, 7 days for a window_length of 7. | ||
gap (int, optional): The number of rows prior to each instance to be skipped before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit-pick: The use of "skipped" confused me at first. If gap=1, and we're saying that 1 row prior to each instance will be skipped, that makes me think it's doing what gap=2 actually does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is looking decent, but just have a couple things:
- For the tests in
test_rolling_primitive.py
andtest_deep_feature_synthesis.py
we are only using the pandas version of the test fixture series. Should/could we also test with Dask/Koalas? - Do we have any concerns with changing the init arguments for these primitives from their previous definitions? I think this means any existing applications where these were used as premium primitives will likely break, but I'm not sure that is an issue. Just wanted to mention in case it is important.
efbd45b
to
98afebc
Compare
Using all three was originally my plan, which was why I created all three fixtures, but when I started running into various issues (koalas' implementation of rolling sometimes has different behavior than pandas, and dask was having other unexplained failutes), I ended up decided to just allow Pandas, which was how they previously were. I still left the dask and koalas fixtures in place, since Happy to remove the extra fixtures if it doesn't make sense to keep them in if the primitives aren't yet compatible. Or, if we're okay with this taking a little longer to get in, I could investigate the dask and koalas issues further to try and get the primitives compatible. @gsheni may have thoughts on whether it'd be worth it.
We'd need to get the date offset implementation in at the same time to have the primitives not break if they are currently in use anywhere. That would be a bigger ask for this PR, that I think we didn't want to have to wait for. Also ran this by @gsheni a while back to confirm this was okay. Can revisit if it's going to be problematic to break them. |
@tamargrey is correct. We are punting on the Koalas/Dask implementation of these primitives. Additionally, we are okay with this being a breaking change for the RollingX primitives (compared to premium primitives) |
I think my vote would be to remove the Dask and Koalas fixtures if we are not supporting those types of inputs now, and we can add them back in when they are needed. Otherwise they could end up hanging around for a long time without really being needed. Not a big deal to me one way or the other though. |
window_length (int): The number of rows to be included in each frame. For data | ||
with a uniform sampling frequency, for example of one day, the window_length will | ||
correspond to a period of time, in this case, 7 days for a window_length of 7. | ||
gap (int, optional): The number of rows prior to each instance to be skipped before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How's this sound
The number of rows backward from the target instance to be skipped before the window of usable data begins. Defaults to 0, which will include the target instance in the window
If include_cutoff_time
is False does gap=0 function the same as gap=1 would normally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your suggestion. The one thing I might change is just completely getting rid of the word "skipped" based off of @freddyaboulton's comment above. How about
The number of rows backward from the target instance before the window of usable data begins. Defaults to 0, which will include the target instance in the window
I think just talking about "n rows back" better encapsulates the idea that one row back is immediately above and zero rows back is just that instance. Instead of also including the idea of "n rows skipped" which would lead me to think there would be n
entire rows that are ignored before starting the window.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To your second point: unless I'm missing something, setting a cutoff time and having include_cutoff_time=False
shouldn't have an impact on the values included in any window.
In the example below where we have primitives set with gap=0 and gap=1 and it just impacts the final row.
If I set cutoff_times for each row to be its own time index value (and now we have to have include_cutoff_time=True or else we'll have no data even though windows technically would be before the cutoff time--may need to be a separate issue if we wanted to handle that case specifically), it has no impact at all. (If we also add uses_full_dataframe = True
to the primitive because of that separate bug)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, I think I was confusing using the cutoff time to create a window with using the target instance for a window
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh fair. Anything I can add to the docstrings to help clarify what happens in these primitives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's pretty clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could put this information in our documentation. Add it as a section or a new page on Time Series. Use it as a chance to explain rolling primitives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have an issue to do that (#1758 - could also just be adding an entirely new Time series page). Was imagining it could happen once we have a lot of the time series stuff in, but we could always start it now and update it as we go along
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamargrey your call. Happy with either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gonna leave the documentation for later. Should be a bit easier to create a more cohesive narrative when everything is implemented.
@tamargrey can you pull in the latest changes? |
1c0c5e4
to
85d5ffb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM assuming test pass
3b166ad
to
326de65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Seems like the test coverage CI failure is due to some flaky Dask behavior and nothing in this MR.
featuretools/utils/plot_utils.py
Outdated
try: | ||
valid_formats = graphviz.backend.FORMATS | ||
except AttributeError: | ||
from graphviz.parameters import FORMATS | ||
valid_formats = FORMATS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try using graphviz.FORMATS
9e7b07d
to
1389e2f
Compare
Adds five rolling transform primitives that all takeinteger
window_length
,gap
andmin_periods
parameters.closes #1756