Set index after adding ancestor relationship variables #668

Merged
merged 5 commits into from Jul 19, 2019

@kmax12
Member

commented Jul 14, 2019

This fixes a bug where we effectively reset the dataframe index after adding ancestor relationship variables. That breaks the merge performed later when creating aggregation features, because that merge is done on the index:

https://github.com/Featuretools/featuretools/blob/master/featuretools/computational_backends/feature_set_calculator.py#L611

This only occurs when features are stacked to a certain depth, because you have to be creating features for an entity whose dataframe has had ancestor relationship variables added to it.

The test case uses a string index. With a default integer index, the reset index can be identical to the existing index, which masks the bug.
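A minimal pandas sketch of the behaviour described above, not the Featuretools code itself. The dataframes and column names (`id`, `parent_id`, `grandparent_id`) are made up for illustration; the only assumption is that the ancestor variables are added with a column-on-column `merge`, as in the diff for this PR.

```python
import pandas as pd

# Hypothetical child entity, indexed by a string id that is also kept as a column.
child = pd.DataFrame(
    {"id": ["a", "b", "c"], "parent_id": [1, 1, 2]}
).set_index("id", drop=False)

# Hypothetical parent entity carrying the ancestor variable to add.
parent = pd.DataFrame({"parent_id": [1, 2], "grandparent_id": [10, 20]})

# Merging on columns ignores the existing index and returns a fresh integer index.
merged = child.merge(parent, left_on="parent_id", right_on="parent_id")
print(merged.index.tolist())  # [0, 1, 2]

# Re-setting the index from the id column restores it (the approach taken in this PR).
restored = merged.set_index("id", drop=False)
print(restored.index.tolist())  # ['a', 'b', 'c']

# With a default integer index the reset goes unnoticed, which is why the
# test needs a string index.
int_child = pd.DataFrame({"id": [0, 1, 2], "parent_id": [1, 1, 2]})
int_merged = int_child.merge(parent, left_on="parent_id", right_on="parent_id")
print(int_merged.index.equals(int_child.index))  # True
```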

Fixes #643

@codecov

commented Jul 14, 2019

Codecov Report

Merging #668 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #668      +/-   ##
==========================================
+ Coverage   97.44%   97.44%   +<.01%     
==========================================
  Files         118      118              
  Lines        9618     9634      +16     
==========================================
+ Hits         9372     9388      +16     
  Misses        246      246
Impacted Files Coverage Δ
...mputational_backend/test_feature_set_calculator.py 100% <100%> (ø) ⬆️
...s/computational_backends/feature_set_calculator.py 98.1% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d191c64...e452c62.

@CJStadler
Contributor

left a comment

LGTM!

@@ -337,6 +337,9 @@ def _add_ancestor_relationship_variables(self, child_df, parent_df,
left_on=relationship.child_variable.id,
right_on=relationship.child_variable.id)

# ensure index is maintained
df = df.set_index(relationship.child_entity.index, drop=False)

@CJStadler

CJStadler Jul 18, 2019

Contributor

Probably not a big deal, but `inplace=True` looks like it might be faster.

In [1]: import pandas as pd

In [2]: df10k = pd.DataFrame({'a': range(10000)}, index=range(10000))

In [3]: df100k = pd.DataFrame({'a': range(100000)}, index=range(100000))

In [4]: %timeit df10k.set_index('a', drop=False)
312 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df100k.set_index('a', drop=False)
912 µs ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit df10k.set_index('a', drop=False, inplace=True)
101 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit df100k.set_index('a', drop=False, inplace=True)
113 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

kmax12 added 2 commits Jul 18, 2019

@kmax12 kmax12 merged commit 278c0c4 into master Jul 19, 2019

4 checks passed

codecov/patch 100% of diff hit (target 97.44%)
codecov/project 97.44% (+<.01%) compared to d191c64
license/cla Contributor License Agreement is signed.
test_all_python_versions Workflow: test_all_python_versions
@rwedge rwedge referenced this pull request Aug 19, 2019

@kmax12 kmax12 deleted the set-index-featureset-calculator branch Sep 11, 2019
