
Feature selection #1126

Merged: 23 commits merged into main from feature-selection on Sep 4, 2020

Conversation

tamargrey (Contributor)

Feature Selection

To provide some insight into the quality of the features produced by dfs, we're adding functions that can notify users which features might not be of much importance to a machine learning model. The feature selection tools we're adding at this stage are as follows:

  • find_highly_null_features - flags features that have many null values
  • find_single_value_features - flags features that don't have much variance in their values
  • find_highly_correlated_features - flags features that are highly correlated with one another

We're leaving open the possibility of integrating these into dfs, or of building them up into a larger, stricter API like that of EvalML's data checks. For now, though, these will just be standalone functions that can be called on a feature matrix produced by dfs.
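A minimal usage sketch (the toy feature matrix and the exact return format are assumptions; later review comments also refer to remove_* variants of these names):

    import pandas as pd
    from featuretools.selection import (
        find_highly_null_features,
        find_single_value_features,
        find_highly_correlated_features,
    )

    # Toy stand-in for a feature matrix produced by dfs
    feature_matrix = pd.DataFrame({
        "mostly_null": [None, None, None, 1.0],
        "constant": [1, 1, 1, 1],
        "age": [20, 30, 40, 50],
        "age_times_2": [40, 60, 80, 100],  # perfectly correlated with "age"
    })

    find_highly_null_features(feature_matrix)        # expect: flags "mostly_null"
    find_single_value_features(feature_matrix)       # expect: flags "constant"
    find_highly_correlated_features(feature_matrix)  # expect: flags "age_times_2" (or "age")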

codecov bot commented Aug 27, 2020

Codecov Report

Merging #1126 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1126      +/-   ##
==========================================
+ Coverage   98.35%   98.37%   +0.01%     
==========================================
  Files         126      126              
  Lines       13308    13466     +158     
==========================================
+ Hits        13089    13247     +158     
  Misses        219      219              
Impacted Files                                     Coverage Δ
featuretools/__init__.py                           82.85% <100.00%> (+0.50%) ⬆️
featuretools/selection/selection.py                100.00% <100.00%> (ø)
featuretools/tests/selection/test_selection.py     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 81f8362...058a459.

tamargrey (Contributor, Author)

A couple of things are still up for discussion:

  • What should these functions return - feature names from the feature matrix columns, or the actual Feature objects?
  • Which features should find_highly_correlated_features be checking? If we're looking at all pairs, it can blow up quickly (see the note after this list).
    • The format of the return value can inform how useful it is
    • I think @tuethan1999 and @dsherry talked about this a bit during Ethan's Feed Your Mind talk
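(Note on scale: checking all pairs of n features means n(n-1)/2 correlation computations, so 1,000 features already implies roughly 500,000 pairwise correlations, each computed over every row of the feature matrix.)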

candalfigomoro commented Aug 28, 2020

I'm not sure featuretools should contain a feature selection module: feature selection is an extremely complex matter, and it also depends on which model you will use afterwards (e.g. a gradient boosting machine might take advantage of information and interactions that your simple correlation-based method couldn't capture).

However, if you really want to add a selection module, maybe you could take a look at tsfresh's selection module: https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html

To do so, for every feature the influence on the target is evaluated by a univariate test and the p-value is calculated.

Obviously this is possible only if you have a target.

AFAIK it doesn't compute correlations between features since, as you already said, that can blow up quickly.
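A rough sketch of that approach (not tsfresh's actual code; a hypothetical helper for a binary target using a Mann-Whitney U test, one of the univariate tests tsfresh relies on; tsfresh additionally controls the false discovery rate across features):

    import pandas as pd
    from scipy import stats

    def univariate_filter(feature_matrix, target, alpha=0.05):
        # Keep features whose univariate association with a binary target
        # is significant at level alpha (simplified sketch).
        keep = []
        for name, col in feature_matrix.items():
            group0 = col[target == 0].dropna()
            group1 = col[target == 1].dropna()
            _, p_value = stats.mannwhitneyu(group0, group1)
            if p_value < alpha:
                keep.append(name)
        return feature_matrix[keep]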

rwedge (Contributor) left a comment

These functions should be added to the API reference. The Returns: block of the docstrings will probably need to be reformatted to display well in the docs. The docs expect a different form:

Returns:
    return type: return values description

see #1125 for discussion on the formatting
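For example, a Returns: block in that form might look like this (hypothetical wording):

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]: The feature matrix and the list of
            generated feature definitions, with the highly null features removed.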

rwedge (Contributor) left a comment

the functions should be usable like this:

import featuretools as ft
...
ft.selection.remove_highly_null_features(feature_matrix, features)

accessible after the regular featuretools import, without needing to import the selection module separately.
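One way to wire that up (a sketch; the real import layout may differ, and remove_single_value_features is assumed by analogy with the other two names):

    # featuretools/__init__.py
    from featuretools import selection

    # featuretools/selection/__init__.py
    from featuretools.selection.selection import (
        remove_highly_null_features,
        remove_single_value_features,
        remove_highly_correlated_features,
    )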

featuretools/selection/selection.py (outdated)
            continue

        if abs(col1.corr(col2)) >= pct_corr_threshold:
            dropped.add(f_name1)
rwedge (Contributor):

We don't want to drop both features -- we want to keep one and drop the other.

tamargrey (Contributor, Author):

Changed to keep the less complex of the two features, which we'll treat as the feature that comes earlier in the list

    # Get all pairs of columns and calculate their correlation,
    # dropping any columns that are highly correlated
    dropped = set()
    for f_name1, col1 in fm_to_check.iteritems():
        for f_name2, col2 in fm_to_check.iteritems():
rwedge (Contributor):

Checking if f_name1 is in dropped here would let us skip iterating over the other columns

tamargrey (Contributor, Author):

The new implementation iterates from most complex to least complex, so when we run into a pair of columns that are highly correlated, we can drop the more complex one and don't need to keep checking its correlation with the rest of the features. This also keeps us from dropping any columns that we'll run into later in the iteration.
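A sketch of that iteration order (simplified; fm_to_check and pct_corr_threshold are the names from the diff, and column position is used as the complexity proxy):

    # Walk columns from most complex (right-most) to least complex.
    dropped = set()
    columns = list(fm_to_check.columns)
    for i in range(len(columns) - 1, 0, -1):
        more_complex = columns[i]
        for j in range(i - 1, -1, -1):
            less_complex = columns[j]
            if abs(fm_to_check[more_complex].corr(fm_to_check[less_complex])) >= pct_corr_threshold:
                # Drop the more complex member of the pair and stop checking it
                # against the remaining columns.
                dropped.add(more_complex)
                break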

    feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.
    features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.
    count_nan_as_value (bool): If True, missing values will be counted as their own unique value.
        If set to True, a feature that has one unique value and all other
rwedge (Contributor):

Shouldn't this be:

    If set to False, a feature that has one unique value and all other data missing will be removed from the feature matrix.

Unless I'm misunderstanding, one unique value and the rest of the data missing should count as two unique values when NaNs are being counted as a unique value, so the feature shouldn't get dropped.

tamargrey (Contributor, Author):

No yeah, you're understanding correctly. It should be False here--thanks for pointing it out!
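In pandas terms, the distinction looks like this:

    import pandas as pd

    s = pd.Series([1.0, None, None, None])
    s.nunique()              # 1: NaNs ignored, so the feature looks single-valued
    s.nunique(dropna=False)  # 2: NaN counts as its own value, so the feature is kept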

    new_feature_names = set(new_matrix.columns)

    if features is not None:
        features = [f for f in features if f.get_name() in new_feature_names]
rwedge (Contributor):

Have you tested this with a multi-output feature? This might not handle the case where a feature has multiple columns associated with it.

tamargrey (Contributor, Author):

I added a test_multi_output_selection and it's performing like I'd expect it to--not sure if the extra test is necessary, though. We could maybe just view it as a multiple-entity test as well.

rwedge (Contributor):

test_multioutput_selection would need to pass in the feature list as well

tamargrey (Contributor, Author):

Oh right. Also, I had the wrong target entity, so I wasn't seeing the issue.

So if we have a feature <Feature: N_MOST_COMMON(second.quarter)> that outputs 3 features, with names of the format N_MOST_COMMON(second.all_nulls)[i] and we only keep one of them in the resulting feature matrix, how do we present that feature?

I feel like just keeping the feature in the feature list as-is is misleading, because it would still show all the possible output features.

rwedge (Contributor) commented Sep 4, 2020:

We can divide up the feature into FeatureOutputSlice sub-features if some but not all feature columns should be kept

tamargrey (Contributor, Author):

Using FeatureOutputSlice to do this
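For reference, a sketch of how the slices behave (the feature name is taken from the example above; exact repr details may differ):

    f = features[0]           # e.g. a 3-output N_MOST_COMMON feature
    f.number_output_features  # 3
    f[0]                      # a FeatureOutputSlice for the first output column
    f[0].get_name()           # "N_MOST_COMMON(second.quarter)[0]"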



    def remove_highly_correlated_features(feature_matrix, features=None, pct_corr_threshold=0.95,
                                          features_to_check=None, features_to_keep=None):
        """Removes columns in feature matrix that are highly correlated with another column.
rwedge (Contributor):

The assumption that the right-most features are more complex is based on the user not re-ordering the feature list returned by DFS. We might want to make note of that here.

tamargrey (Contributor, Author):

Ok, adding this section

    Note:
        We make the assumption that, for a pair of features, the feature that is further right in the feature matrix
        produced by ``dfs`` is the more complex one. The assumption does not hold if the order of columns
        in the feature matrix has changed from what ``dfs`` produces.
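Hypothetical usage of the finished function (assuming it returns the pruned feature matrix and feature list like the other helpers; the column names are made up):

    fm_pruned, features_pruned = remove_highly_correlated_features(
        feature_matrix,
        features=features,
        pct_corr_threshold=0.9,
        features_to_check=["age", "MEAN(transactions.amount)"],  # only these columns are examined
        features_to_keep=["age"],                                # never dropped
    )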

            sliced_features.append(f)
        else:
            sliced_features.extend([f[i] for i in range(f.number_output_features)])
    new_features = [f for f in sliced_features if f.get_name() in new_feature_names]
rwedge (Contributor):

If new_feature_names is a set of column names, wouldn't f.get_name() fail for a multi-output feature?

frances-h (Contributor) previously approved these changes Sep 4, 2020

lgtm 👍

rwedge (Contributor) left a comment

looks good!

tamargrey merged commit 804df26 into main Sep 4, 2020
tamargrey mentioned this pull request Sep 8, 2020
rwedge deleted the feature-selection branch September 21, 2020 19:20