New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding non-feature columns to calculated feature matrix #78

Merged
merged 12 commits into from Feb 5, 2018

Conversation

Projects
None yet
4 participants
@rwedge
Contributor

rwedge commented Jan 24, 2018

Closes #74

Extra columns in the cutoff_time argument to calculate_feature_matrix are included as additional columns in the resulting feature matrix. When taking advantage of this feature the order of columns is important: the first two columns of cutoff_time must be the instance ids and and time columns (the order of these first two columns doesn't matter).

This PR also fixes a bug that would occur when using the approximate option for calculate_feature_matrix and every aggregation feature could be approximated. The time index values for the calculated feature matrix (see the cutoff_time_in_index option) were the times the aggregation features were approximated at, not the times specified in cutoff_time.

@codecov-io

This comment has been minimized.

codecov-io commented Jan 24, 2018

Codecov Report

Merging #78 into master will increase coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master      #78     +/-   ##
=========================================
+ Coverage   87.51%   87.62%   +0.1%     
=========================================
  Files          75       73      -2     
  Lines        7147     7191     +44     
=========================================
+ Hits         6255     6301     +46     
+ Misses        892      890      -2
Impacted Files Coverage Δ
featuretools/synthesis/dfs.py 100% <ø> (ø)
...utational_backend/test_calculate_feature_matrix.py 98.61% <100%> (ø)
...computational_backends/calculate_feature_matrix.py 96.84% <100%> (ø)
primitives/cum_transform_feature.py
tests/computational_backend/test_pandas_backend.py
entityset/print.py
tests/entityset_tests/test_pandas_es.py
entityset/entityset.py
entityset/serialization.py
tests/demo_tests/test_demo_data.py
... and 141 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9eac716...e58a646. Read the comment docs.

rwedge and others added some commits Jan 24, 2018

.. ipython:: python
cutoff_times['labels'] = pd.Series([0, 0, 1, 0, 1])

This comment has been minimized.

@kmax12

kmax12 Jan 30, 2018

Member

it's more common to see this called just label with no s

@@ -71,8 +71,18 @@ There is one row in the feature matrix corresponding to a row in ``cutoff_times`
cutoff_time_in_index=True)
feature_matrix
It is often the case that we want our labels in our calculated feature matrix so that the ordering is consistent between the labels and the rows of the feature matrix. However, adding labels to the initial dataframe means that you would have to explicitly prohibit ``dfs`` from building features with that column. To bypass this, we can provide additional columns to cutoff times which will be added directly the feature matrix. While the two columns will be used as an index and cutoff time, any additional columns will appear as features in the resulting feature matrix.

This comment has been minimized.

@kmax12

kmax12 Jan 30, 2018

Member

do you think it is explicit enough that it is the first two columns that get used and that order matters

This comment has been minimized.

@Seth-Rothschild

Seth-Rothschild Jan 30, 2018

Contributor

My understanding is that the first two columns get used but the order of those columns doesn't matter. How about the phrasing:
"While the first two columns will be used as index and cutoff time regardless of their order in the dataframe, any additional columns will appear as features in the resulting matrix."

@@ -71,18 +71,18 @@ There is one row in the feature matrix corresponding to a row in ``cutoff_times`
cutoff_time_in_index=True)
feature_matrix
It is often the case that we want our labels in our calculated feature matrix so that the ordering is consistent between the labels and the rows of the feature matrix. However, adding labels to the initial dataframe means that you would have to explicitly prohibit ``dfs`` from building features with that column. To bypass this, we can provide additional columns to cutoff times which will be added directly the feature matrix. While the two columns will be used as an index and cutoff time, any additional columns will appear as features in the resulting feature matrix.
It is often the case that we want our labels in our calculated feature matrix so that the ordering is consistent between the labels and the rows of the feature matrix. However, adding labels to the initial dataframe means that you would have to explicitly prohibit ``dfs`` from building features with that column. To bypass this, we can provide additional columns to cutoff times which will be added directly the feature matrix. While the first two columns will be used as an index and cutoff time regardless of their order in the dataframe, any additional columns will appear as features in the resulting feature matrix.

This comment has been minimized.

@kmax12

kmax12 Jan 30, 2018

Member

we may support mixing the order up if you name things correctly, but let’s document instance id first, followed by cutoff time. just to make sure people are consistent

Seth-Rothschild
@kmax12

This comment has been minimized.

Member

kmax12 commented Feb 5, 2018

my only thought is that we should have a way to provide labels in the case where you using a single cutoff time and the instance_ids parameter. right now, the work around for that is to simply create a dataframe cutoff times where you repeat the cutoff time.

because there is a work around for that situation, this is good to merge.

@kmax12 kmax12 merged commit 506ffaf into master Feb 5, 2018

2 checks passed

ci/circleci Your tests passed on CircleCI!
Details
license/cla Contributor License Agreement is signed.
Details

@Seth-Rothschild Seth-Rothschild deleted the additional_columns branch Feb 8, 2018

@rwedge rwedge referenced this pull request Feb 27, 2018

Merged

Release v0.1.18 #102

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment