Skip to content

Set uses_full_dataframe=True for rolling and exponential primitives#2354

Merged
thehomebrewnerd merged 6 commits into
mainfrom
rolling-uses-full-dataframe
Nov 3, 2022
Merged

Set uses_full_dataframe=True for rolling and exponential primitives#2354
thehomebrewnerd merged 6 commits into
mainfrom
rolling-uses-full-dataframe

Conversation

@thehomebrewnerd
Copy link
Copy Markdown
Contributor

@thehomebrewnerd thehomebrewnerd commented Nov 3, 2022

Set uses_full_dataframe=True for rolling primitives

Sets uses_full_dataframe = True for the Rolling* and Exponential* primitives. This is necessary to ensure proper feature value calculation if n_jobs is greater than 1.

@sbadithe
Copy link
Copy Markdown
Contributor

sbadithe commented Nov 3, 2022

Should we do this for Exponential primitives too, as they are window primitives as well?

@thehomebrewnerd
Copy link
Copy Markdown
Contributor Author

thehomebrewnerd commented Nov 3, 2022

Should we do this for Exponential primitives too, as they are window primitives as well?

@sbadithe Yes. Any time we need to access values from a previous row for feature calculation, we need to set this to True. This makes sure all the data is available when calculating features, and is particularly important if you run DFS with n_jobs greater than 1. If this isn't set, the data can get split apart and chunked by rows, so without this set additional nan values will get introduced into the columns.

@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 3, 2022

Codecov Report

Merging #2354 (7035b0b) into main (5f61790) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2354   +/-   ##
=======================================
  Coverage   99.48%   99.48%           
=======================================
  Files         309      309           
  Lines       19771    19781   +10     
=======================================
+ Hits        19670    19680   +10     
  Misses        101      101           
Impacted Files Coverage Δ
...nsform/exponential/exponential_weighted_average.py 100.00% <100.00%> (ø)
.../transform/exponential/exponential_weighted_std.py 100.00% <100.00%> (ø)
...sform/exponential/exponential_weighted_variance.py 100.00% <100.00%> (ø)
...es/standard/transform/time_series/rolling_count.py 100.00% <100.00%> (ø)
...ives/standard/transform/time_series/rolling_max.py 100.00% <100.00%> (ø)
...ves/standard/transform/time_series/rolling_mean.py 100.00% <100.00%> (ø)
...ives/standard/transform/time_series/rolling_min.py 100.00% <100.00%> (ø)
...ard/transform/time_series/rolling_outlier_count.py 100.00% <100.00%> (ø)
...ives/standard/transform/time_series/rolling_std.py 100.00% <100.00%> (ø)
...es/standard/transform/time_series/rolling_trend.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@sbadithe
Copy link
Copy Markdown
Contributor

sbadithe commented Nov 3, 2022

Should we do this for Exponential primitives too, as they are window primitives as well?

@sbadithe Yes. Any time we need to access values from a previous row for feature calculation, we need to set this to True. This makes sure all the data is available when calculating features, and is particularly important if you run DFS with n_jobs greater than 1. If this isn't set, the data can get split apart and chunked by rows, so without this set additional nan values will get introduced into the columns.

Thanks for the explanation! That sounds good. I think it would make sense to add the Exponential primitives to this PR. Alternatively, we can update them as part of the addition of the Expanding primitives, or in a separate PR altogether.

@gsheni
Copy link
Copy Markdown
Contributor

gsheni commented Nov 3, 2022

@thehomebrewnerd I wonder if we should add an attribute ordering_is_important to each primitive?

We could default this to False and make it True for AbsoluteDiff, CumMax, GreaterThanPrevious, IsMaxSoFar, RollingMean, Lag, NumericBin.
Or does this accomplish he same thing?

@thehomebrewnerd
Copy link
Copy Markdown
Contributor Author

Or does this accomplish he same thing?

@gsheni No, this doesn't accomplish the same thing. This basically says that we need all the data during feature value calculation, but doesn't necessarily mean the order is important. Take Percentile for example. To get the right value for an individual row, we need all the data (we can't chunk it and do each row separately), but the row ordering isn't important to get the right value for a particular row.

We might need to consider adding an attribute to indicate that ordering is important separately.

@thehomebrewnerd
Copy link
Copy Markdown
Contributor Author

@sbadithe we should update them as part of the expanding primitive PR (not on this PR).

@gsheni I think @sbadithe was talking about the existing exponential primitives (ExponentialWeightedAverage, STD and Variance) not the new ones.

@thehomebrewnerd thehomebrewnerd changed the title Set uses_full_dataframe=True for rolling primitives Set uses_full_dataframe=True for rolling and exponential primitives Nov 3, 2022
Comment thread docs/source/release_notes.rst Outdated
@thehomebrewnerd thehomebrewnerd enabled auto-merge (squash) November 3, 2022 20:57
@thehomebrewnerd thehomebrewnerd merged commit 8ca3236 into main Nov 3, 2022
@thehomebrewnerd thehomebrewnerd deleted the rolling-uses-full-dataframe branch November 3, 2022 20:57
@gsheni gsheni mentioned this pull request Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants