-
Notifications
You must be signed in to change notification settings - Fork 870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to use rolling primitives with pandas' offset aliases #1809
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1809 +/- ##
==========================================
+ Coverage 98.80% 98.82% +0.02%
==========================================
Files 144 145 +1
Lines 15984 16268 +284
==========================================
+ Hits 15793 16077 +284
Misses 191 191
Continue to review full report at Codecov.
|
featuretools/primitives/standard/rolling_transform_primitive.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot to digest here, but here are a couple things I've noticed so far.
@@ -0,0 +1,5 @@ | |||
def get_number_of_days(offset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not a big deal since we are only using this in tests, but I think this would return the wrong value if we ever used this in a test with a number above 9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can either clarify the limitations in a docstring/what the intended use is or update it to return any numeric characters before a string. I think my preference for now would be to keep it simple and add a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to add the docstring here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, yeah forgot to push
def _offset_max(self, series): | ||
return _apply_roll_with_offset_gap(series, self.gap, max, self.min_periods) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it matters too much, but do we need these _offset*
functions as methods on the primitive? I haven't tested this to confirm, but I think you can just call _apply_roll_with_offset_gap
directly in the rolled_series.apply
call below, passing in the additional required arguments. Something like this:
if isinstance(self.gap, str):
additional_args = (self.gap, max, self.min_periods)
return rolled_series.apply(_apply_roll_with_offset_gap, args=additional_args)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! Updated to do this
@@ -251,13 +386,20 @@ def __init__(self, window_length=3, gap=0, min_periods=1): | |||
self.gap = gap | |||
self.min_periods = min_periods | |||
|
|||
def _pandas_std(self, series): | |||
return series.std() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried using np.std
, but it seems like there's a slightly different handling of nans that produces different results from the pandas implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, you can define this as a function inside the scope of rolling_std
since that is the only place it's used, rather than adding it as a class method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense! changing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more minor things - I'd still like to go through this again to solidify my understanding of what is happening exactly, but nothing major is jumping out to me so far.
@@ -251,13 +386,20 @@ def __init__(self, window_length=3, gap=0, min_periods=1): | |||
self.gap = gap | |||
self.min_periods = min_periods | |||
|
|||
def _pandas_std(self, series): | |||
return series.std() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, you can define this as a function inside the scope of rolling_std
since that is the only place it's used, rather than adding it as a class method.
@@ -109,14 +192,13 @@ def test_rolling_count(rolling_series_pd): | |||
[ | |||
(0, 2), # 0 and 1 get treated the same | |||
(1, 2), | |||
(1, 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidental duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, removed!
@@ -0,0 +1,5 @@ | |||
def get_number_of_days(offset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to add the docstring here?
cf12c7d
to
ce7348d
Compare
else: | ||
num_nans = gap_num + num_nans_from_min_periods - 1 | ||
|
||
# The extra 1 at the beinning is because the std pandas function returns NaN if there's only one value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beinning
-> beginning
ce7348d
to
2a2be98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked over the recent commit that includes the value checking stuff I suggested. That part looks good! The rest is quite a lot so I'll have to spend some more time on it.
# Add the window_size and gap so that the rolling operation correctly takes gap into account. | ||
# That way, we can later remove the gap rows in order to apply the primitive function | ||
# to the correct window | ||
functional_window_length = to_offset(window_size) + to_offset(gap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation makes sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamargrey Thank you for the work on this! I agree with the implementation you chose where we compute window_end - gap to account for the gap. I left a comment on adding coverage but what you have here (and some local tests in a jupyter notebook) matches what I'd expect.
|
||
|
||
@pytest.mark.parametrize("primitive", [RollingCount, RollingMax, RollingMin, RollingMean]) | ||
def test_rolling_primitives_non_uniform_data(primitive): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth it to actually pre-compute the expected series and check with assert_series_equal
? My reasoning is that in two weeks when we ask ourselves how this works, it'll be easier/faster to just look at the expected answer in the unit tests.
That being said, I stepped through this manually and the resulting series matches what I'd expect. Right now, the window size in this test only includes one observation so all the primitives return the same result. I think it'd be good if we test on some situations where there is more than one observation per window.
Maybe something like this could work:
import pandas as pd
from featuretools.primitives import RollingMean, RollingCount
# When the data isn't uniform, this impacts the number of values in each rolling window
datetimes = (list(pd.date_range(start='2017-01-01', freq='1d', periods=3)) +
list(pd.date_range(start='2017-01-10', freq='2d', periods=4)) +
list(pd.date_range(start='2017-01-22', freq='1d', periods=7)))
no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton both good ideas! I added two tests in 4f90dd2 that test the primitives on a series that has windows with multiple observations present and use assert_series_equal
. If the tests from this commit were along the lines of what you were thinking, I'll go ahead and update the other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also updated a test in test_primitive_utils
to use the same approach for non uniform series: b0e5e25
@pytest.mark.parametrize( | ||
"window_length", | ||
[ | ||
3, | ||
"3d" | ||
] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a style thing but this seems suitable to collapse to one or a a couple lines, given it's only 2 settings for a single parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collapsed to one line!
test-requirements.txt
Outdated
@@ -11,3 +11,4 @@ boto3>=1.17.46 | |||
composeml>=0.3.0 | |||
urllib3>=1.26.5 | |||
pyarrow>=3.0.0 | |||
mock==4.0.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the builtin version of mock instead?
https://docs.python.org/dev/library/unittest.mock.html#unittest.mock.patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Allows users to pass strings like
1D
or1H
asgap
andwindow_length
parameters to the rolling transform primitives. This allows data that does not have a uniform sampling frequency to be used accurately with the rolling transform primitives.closes #1757
To Do: