Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to use rolling primitives with pandas' offset aliases #1809

Merged
merged 45 commits into from
Jan 25, 2022

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Dec 14, 2021

Allows users to pass strings like 1D or 1H as gap and window_length parameters to the rolling transform primitives. This allows data that does not have a uniform sampling frequency to be used accurately with the rolling transform primitives.

closes #1757

To Do:

  • Handle input exceptions
    • Non offset strings?
    • Combination of parameter types?
    • No datetime present on the input series to _roll_series_with_gap?
    • Min periods fine for offset string but too big for integer input?
  • Update docstrings
    • Parameters
    • Note limitations
      • Min periods can only be as big as window length for integers but no limitation for offset strings so be careful
      • Window length can only be a fixed frequency (day or hour; not month or week) offset due to pandas' limitation; gap can only be fixed frequency due to introduced ft limitaion (we add gap to window length so they need to be fixed)
    • Examples in primitive docstrings
  • Add edge case tests
    • large min periods because data's frequency is much higher than the observation window (hourly data with a window of 7 days)
    • Empty series?
    • Mismatched parameters
  • Clean up
    • remove unnecessary commnts
    • Code quality/clarity
    • Test clarity/repetition

@codecov
Copy link

codecov bot commented Dec 15, 2021

Codecov Report

Merging #1809 (81eb9af) into main (35f65d7) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1809      +/-   ##
==========================================
+ Coverage   98.80%   98.82%   +0.02%     
==========================================
  Files         144      145       +1     
  Lines       15984    16268     +284     
==========================================
+ Hits        15793    16077     +284     
  Misses        191      191              
Impacted Files Coverage Δ
...primitives/standard/rolling_transform_primitive.py 100.00% <100.00%> (ø)
featuretools/primitives/utils.py 99.51% <100.00%> (+0.12%) ⬆️
...ools/tests/primitive_tests/test_primitive_utils.py 100.00% <100.00%> (ø)
...ls/tests/primitive_tests/test_rolling_primitive.py 100.00% <100.00%> (ø)
featuretools/tests/primitive_tests/utils.py 100.00% <100.00%> (ø)
...ols/tests/synthesis/test_deep_feature_synthesis.py 99.33% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35f65d7...81eb9af. Read the comment docs.

@tamargrey tamargrey marked this pull request as ready for review December 16, 2021 21:14
Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot to digest here, but here are a couple things I've noticed so far.

@@ -0,0 +1,5 @@
def get_number_of_days(offset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a big deal since we are only using this in tests, but I think this would return the wrong value if we ever used this in a test with a number above 9.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can either clarify the limitations in a docstring/what the intended use is or update it to return any numeric characters before a string. I think my preference for now would be to keep it simple and add a docstring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to add the docstring here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, yeah forgot to push

Comment on lines 100 to 102
def _offset_max(self, series):
return _apply_roll_with_offset_gap(series, self.gap, max, self.min_periods)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it matters too much, but do we need these _offset* functions as methods on the primitive? I haven't tested this to confirm, but I think you can just call _apply_roll_with_offset_gap directly in the rolled_series.apply call below, passing in the additional required arguments. Something like this:

if isinstance(self.gap, str):
    additional_args = (self.gap, max, self.min_periods)
    return rolled_series.apply(_apply_roll_with_offset_gap, args=additional_args)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Updated to do this

@@ -251,13 +386,20 @@ def __init__(self, window_length=3, gap=0, min_periods=1):
self.gap = gap
self.min_periods = min_periods

def _pandas_std(self, series):
return series.std()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried using np.std, but it seems like there's a slightly different handling of nans that produces different results from the pandas implementation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you can define this as a function inside the scope of rolling_std since that is the only place it's used, rather than adding it as a class method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense! changing

Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more minor things - I'd still like to go through this again to solidify my understanding of what is happening exactly, but nothing major is jumping out to me so far.

@@ -251,13 +386,20 @@ def __init__(self, window_length=3, gap=0, min_periods=1):
self.gap = gap
self.min_periods = min_periods

def _pandas_std(self, series):
return series.std()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you can define this as a function inside the scope of rolling_std since that is the only place it's used, rather than adding it as a class method.

@@ -109,14 +192,13 @@ def test_rolling_count(rolling_series_pd):
[
(0, 2), # 0 and 1 get treated the same
(1, 2),
(1, 2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental duplication?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, removed!

@@ -0,0 +1,5 @@
def get_number_of_days(offset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to add the docstring here?

else:
num_nans = gap_num + num_nans_from_min_periods - 1

# The extra 1 at the beinning is because the std pandas function returns NaN if there's only one value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beinning -> beginning

Copy link
Contributor

@davesque davesque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked over the recent commit that includes the value checking stuff I suggested. That part looks good! The rest is quite a lot so I'll have to spend some more time on it.

# Add the window_size and gap so that the rolling operation correctly takes gap into account.
# That way, we can later remove the gap rows in order to apply the primitive function
# to the correct window
functional_window_length = to_offset(window_size) + to_offset(gap)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation makes sense to me

Copy link

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tamargrey Thank you for the work on this! I agree with the implementation you chose where we compute window_end - gap to account for the gap. I left a comment on adding coverage but what you have here (and some local tests in a jupyter notebook) matches what I'd expect.



@pytest.mark.parametrize("primitive", [RollingCount, RollingMax, RollingMin, RollingMean])
def test_rolling_primitives_non_uniform_data(primitive):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth it to actually pre-compute the expected series and check with assert_series_equal? My reasoning is that in two weeks when we ask ourselves how this works, it'll be easier/faster to just look at the expected answer in the unit tests.

That being said, I stepped through this manually and the resulting series matches what I'd expect. Right now, the window size in this test only includes one observation so all the primitives return the same result. I think it'd be good if we test on some situations where there is more than one observation per window.

Maybe something like this could work:

import pandas as pd
from featuretools.primitives import RollingMean, RollingCount

# When the data isn't uniform, this impacts the number of values in each rolling window
datetimes = (list(pd.date_range(start='2017-01-01', freq='1d', periods=3)) +
             list(pd.date_range(start='2017-01-10', freq='2d', periods=4)) +
             list(pd.date_range(start='2017-01-22', freq='1d', periods=7)))

no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton both good ideas! I added two tests in 4f90dd2 that test the primitives on a series that has windows with multiple observations present and use assert_series_equal. If the tests from this commit were along the lines of what you were thinking, I'll go ahead and update the other tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also updated a test in test_primitive_utils to use the same approach for non uniform series: b0e5e25

Comment on lines 221 to 227
@pytest.mark.parametrize(
"window_length",
[
3,
"3d"
]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a style thing but this seems suitable to collapse to one or a a couple lines, given it's only 2 settings for a single parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsed to one line!

@@ -11,3 +11,4 @@ boto3>=1.17.46
composeml>=0.3.0
urllib3>=1.26.5
pyarrow>=3.0.0
mock==4.0.3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the builtin version of mock instead?

https://docs.python.org/dev/library/unittest.mock.html#unittest.mock.patch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

@gsheni gsheni requested a review from rwedge January 20, 2022 21:37
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tamargrey tamargrey merged commit dfdd2b4 into main Jan 25, 2022
@tamargrey tamargrey deleted the rolling-offset branch January 25, 2022 21:58
@dvreed77 dvreed77 mentioned this pull request Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow date offsets as parameters to rolling Transform primitives
5 participants