Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Rolling Transform primitives #1770

Merged
merged 43 commits into from Nov 15, 2021
Merged

Add Rolling Transform primitives #1770

merged 43 commits into from Nov 15, 2021

Conversation

tamargrey
Copy link
Contributor

Adds five rolling transform primitives that all takeinteger window_length, gap and min_periods parameters.
closes #1756

return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})
compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS]

def __init__(self, window_length=3, gap=0, min_periods=1):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the default window-length was 1d, which makes sense and will be a valid window with multiple instances present if the data's sampling frequency is less than 1 day. At this point in the implementation, though, we do not have access to date offsets, and setting the window length to 1 doesn't make much sense, as it would never end up with more than one instance present in the window.

I think I might prefer using window_length=7 or something that has some time-related meaning. But chose 3 for the moment just to show the arbitrary nature of having to set a default for this parameter.

Also, when we implement #1757, we can go back to setting the default to be 1d, but for now, I don't think there's a way to ensure this won't be a breaking change when that's implemented.

"""
# Workaround for pandas' fixed but unreleased bug: https://github.com/pandas-dev/pandas/issues/43016
# Can remove when upgraded to pandas 1.4.0
if str(series.dtype) == 'Int64':
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue to remove this eventually #1771

@codecov
Copy link

codecov bot commented Nov 4, 2021

Codecov Report

Merging #1770 (1389e2f) into main (98988da) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1770      +/-   ##
==========================================
+ Coverage   98.70%   98.72%   +0.01%     
==========================================
  Files         139      141       +2     
  Lines       15419    15644     +225     
==========================================
+ Hits        15220    15445     +225     
  Misses        199      199              
Impacted Files Coverage Δ
featuretools/primitives/standard/api.py 100.00% <100.00%> (ø)
...primitives/standard/rolling_transform_primitive.py 100.00% <100.00%> (ø)
featuretools/primitives/utils.py 99.24% <100.00%> (+0.04%) ⬆️
featuretools/tests/conftest.py 100.00% <100.00%> (ø)
...ools/tests/primitive_tests/test_primitive_utils.py 100.00% <100.00%> (ø)
...ls/tests/primitive_tests/test_rolling_primitive.py 100.00% <100.00%> (ø)
...ols/tests/synthesis/test_deep_feature_synthesis.py 99.33% <100.00%> (+0.01%) ⬆️
featuretools/utils/plot_utils.py 92.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 98988da...1389e2f. Read the comment docs.

Copy link

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tamargrey The behavior of these primitives makes sense to me! The only thing that stood out to me was the difference in parameter definitions between EvalML and Featuretools that we chatted about. When we get to datestrings we will need your help in transforming the EvalML definitions to featuretools, but we agreed the "add 1" operation is possible.

Don't think EvalML will ever set min_periods to be different than window_length but that may be useful for other kinds of problems.

If users use RollingSTD with min_periods=1, there will be one extra nan compared to the other primitives because the standard deviation of a single observation is not defined. That makes sense to me.

window_length (int): The number of rows to be included in each window. For data
with a uniform sampling frequency, for example of one day, the window_length will
correspond to a period of time, in this case, 7 days for a window_length of 7.
gap (int, optional): The number of rows prior to each instance to be skipped before

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit-pick: The use of "skipped" confused me at first. If gap=1, and we're saying that 1 row prior to each instance will be skipped, that makes me think it's doing what gap=2 actually does.

Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking decent, but just have a couple things:

  • For the tests in test_rolling_primitive.py and test_deep_feature_synthesis.py we are only using the pandas version of the test fixture series. Should/could we also test with Dask/Koalas?
  • Do we have any concerns with changing the init arguments for these primitives from their previous definitions? I think this means any existing applications where these were used as premium primitives will likely break, but I'm not sure that is an issue. Just wanted to mention in case it is important.

@tamargrey
Copy link
Contributor Author

@thehomebrewnerd

For the tests in test_rolling_primitive.py and test_deep_feature_synthesis.py we are only using the pandas version of the test fixture series. Should/could we also test with Dask/Koalas?

Using all three was originally my plan, which was why I created all three fixtures, but when I started running into various issues (koalas' implementation of rolling sometimes has different behavior than pandas, and dask was having other unexplained failutes), I ended up decided to just allow Pandas, which was how they previously were. I still left the dask and koalas fixtures in place, since _roll_series_with_gap is still tested with them, so in theory we could add the dask and koalas compatibility if it's requested a bit easier.

Happy to remove the extra fixtures if it doesn't make sense to keep them in if the primitives aren't yet compatible. Or, if we're okay with this taking a little longer to get in, I could investigate the dask and koalas issues further to try and get the primitives compatible. @gsheni may have thoughts on whether it'd be worth it.

Do we have any concerns with changing the init arguments for these primitives from their previous definitions? I think this means any existing applications where these were used as premium primitives will likely break, but I'm not sure that is an issue. Just wanted to mention in case it is important.

We'd need to get the date offset implementation in at the same time to have the primitives not break if they are currently in use anywhere. That would be a bigger ask for this PR, that I think we didn't want to have to wait for. Also ran this by @gsheni a while back to confirm this was okay. Can revisit if it's going to be problematic to break them.

@gsheni
Copy link
Contributor

gsheni commented Nov 8, 2021

@tamargrey is correct. We are punting on the Koalas/Dask implementation of these primitives. Additionally, we are okay with this being a breaking change for the RollingX primitives (compared to premium primitives)

@thehomebrewnerd
Copy link
Contributor

thehomebrewnerd commented Nov 8, 2021

I think my vote would be to remove the Dask and Koalas fixtures if we are not supporting those types of inputs now, and we can add them back in when they are needed. Otherwise they could end up hanging around for a long time without really being needed.

Not a big deal to me one way or the other though.

featuretools/tests/primitive_tests/test_primitive_utils.py Outdated Show resolved Hide resolved
window_length (int): The number of rows to be included in each frame. For data
with a uniform sampling frequency, for example of one day, the window_length will
correspond to a period of time, in this case, 7 days for a window_length of 7.
gap (int, optional): The number of rows prior to each instance to be skipped before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How's this sound

The number of rows backward from the target instance to be skipped before the window of usable data begins. Defaults to 0, which will include the target instance in the window

If include_cutoff_time is False does gap=0 function the same as gap=1 would normally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your suggestion. The one thing I might change is just completely getting rid of the word "skipped" based off of @freddyaboulton's comment above. How about

The number of rows backward from the target instance before the window of usable data begins. Defaults to 0, which will include the target instance in the window

I think just talking about "n rows back" better encapsulates the idea that one row back is immediately above and zero rows back is just that instance. Instead of also including the idea of "n rows skipped" which would lead me to think there would be n entire rows that are ignored before starting the window.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your second point: unless I'm missing something, setting a cutoff time and having include_cutoff_time=False shouldn't have an impact on the values included in any window.

In the example below where we have primitives set with gap=0 and gap=1 and it just impacts the final row.
image

If I set cutoff_times for each row to be its own time index value (and now we have to have include_cutoff_time=True or else we'll have no data even though windows technically would be before the cutoff time--may need to be a separate issue if we wanted to handle that case specifically), it has no impact at all. (If we also add uses_full_dataframe = True to the primitive because of that separate bug)

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I think I was confusing using the cutoff time to create a window with using the target instance for a window

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh fair. Anything I can add to the docstrings to help clarify what happens in these primitives?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty clear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could put this information in our documentation. Add it as a section or a new page on Time Series. Use it as a chance to explain rolling primitives

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have an issue to do that (#1758 - could also just be adding an entirely new Time series page). Was imagining it could happen once we have a lot of the time series stuff in, but we could always start it now and update it as we go along

Copy link
Contributor

@gsheni gsheni Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tamargrey your call. Happy with either.

Copy link
Contributor Author

@tamargrey tamargrey Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gonna leave the documentation for later. Should be a bit easier to create a more cohesive narrative when everything is implemented.

@rwedge
Copy link
Contributor

rwedge commented Nov 11, 2021

@tamargrey can you pull in the latest changes?

rwedge
rwedge previously approved these changes Nov 11, 2021
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming test pass

gsheni
gsheni previously approved these changes Nov 11, 2021
Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Seems like the test coverage CI failure is due to some flaky Dask behavior and nothing in this MR.

Comment on lines 37 to 41
try:
valid_formats = graphviz.backend.FORMATS
except AttributeError:
from graphviz.parameters import FORMATS
valid_formats = FORMATS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try using graphviz.FORMATS

@tamargrey tamargrey merged commit e368071 into main Nov 15, 2021
@tamargrey tamargrey deleted the add-rolling-primitives branch November 15, 2021 20:03
@tamargrey tamargrey mentioned this pull request Nov 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add rolling gap primitives to Transform Primitives with integer parameters
5 participants