Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop_first attribute added in encode features #647

Merged
merged 10 commits into from Jul 14, 2019

Conversation

ayushpatidar
Copy link
Contributor

@ayushpatidar ayushpatidar commented Jul 8, 2019

Pull Request Description

drop_first attribute is added as param in encode_feature() function.
This feature will help user to control on redundant feature in dataframe

Fixes #635

After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request.

@CLAassistant
Copy link

CLAassistant commented Jul 8, 2019

CLA assistant check
All committers have signed the CLA.

@codecov
Copy link

codecov bot commented Jul 8, 2019

Codecov Report

No coverage uploaded for pull request base (master@6da8c8b). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #647   +/-   ##
=========================================
  Coverage          ?   97.42%           
=========================================
  Files             ?      118           
  Lines             ?     9560           
  Branches          ?        0           
=========================================
  Hits              ?     9314           
  Misses            ?      246           
  Partials          ?        0
Impacted Files Coverage Δ
featuretools/synthesis/encode_features.py 98.43% <100%> (ø)
...aturetools/tests/synthesis/test_encode_features.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da8c8b...c57b7fa. Read the comment docs.

@kmax12
Copy link
Contributor

kmax12 commented Jul 8, 2019

Thank you for the contribution, @ayushpatidar.

A couple things

  1. can you sign the CLA? (you may have to do it a second time for this PR).
  2. Please add a test case for this new parameter to https://github.com/Featuretools/featuretools/blob/master/featuretools/tests/synthesis/test_encode_features.py

Once you are passing all the github checks, we can review.

kmax12
kmax12 previously approved these changes Jul 9, 2019
Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Once addressed, i believe this is ready to merge!

@@ -23,6 +23,9 @@ def encode_features(feature_matrix, features, top_n=10, include_unknown=True,
defaults to encode all necessary features.
inplace (bool): Encode feature_matrix in place. Defaults to False.
verbose (str): Print progress info.
drop_first (bool): Whether to get l-1 dummies out of l categorical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use k instead of l

unique = val_counts.head(top_n).index.tolist()
select_n = top_n
if drop_first:
if len(val_counts) != 1 or top_n != 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we remove the conditional and just do these two lines?

select_n = min(len(val_counts), top_n) 
select_n = max(select_n - 1, 1)  # make sure at least 1 category is selected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmax12
But we have to keep outer conditional statement ( if drop_first ).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, sorry, I was just referring to inner conditional

es.entity_from_dataframe(entity_id='a', dataframe=df, index='index', make_index=True)
features, feature_defs = dfs(entityset=es, target_entity='a')
features_enc, feature_defs_enc = encode_features(features, feature_defs,
drop_first=True, top_n=10, include_unknown=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to include top_n in this test case

Copy link
Contributor

@kmax12 kmax12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the contribution!

@kmax12 kmax12 merged commit eb86522 into alteryx:master Jul 14, 2019
@rwedge rwedge mentioned this pull request Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Ability to drop redundant variable in ft.encode_features()
4 participants