Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature selection guide #1184

Merged
merged 11 commits into from
Oct 20, 2020
Merged

Feature selection guide #1184

merged 11 commits into from
Oct 20, 2020

Conversation

tamargrey
Copy link
Contributor

closes #1167

Adds a guide for the feature selection functions:

  • ft.selection.remove_highly_null_features
  • ft.selection.remove_single_value_features
  • ft.selection.remove_highly_correlated_features

@tamargrey tamargrey requested a review from rwedge October 9, 2020 21:22
" \"second\": (df2, 'id', None, {'words': NaturalLanguage}),\n",
" }\n",
"\n",
"es = ft.EntitySet(\"data\", entities, )\n",
Copy link
Contributor

@kmax12 kmax12 Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if we should just use one of the demo datasets? perhaps we need to add a demo dataset with nulls.

i think that it's better for a examples to use standardize datasets. mostly for the user to have consistency as they read the docs, but also to avoid potential mistakes like the example here missing a relationship between the entities

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point.

The reason I hadn't used the demo dataset here was because it seemed like the best way to show the behavior was to have small, contrived examples. It's also why there's no relationship between the two entities in the example; it just added more columns to the results that weren't proving the point of the guide.

Would it be helpful to add a section with a full demo dataset to show feature selection on more plausible data? Kind of fits along with the comment below about showing how it can be used with EvalML

@@ -0,0 +1,304 @@
{
Copy link
Contributor

@kmax12 kmax12 Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this guide, it would make sense to add a section about using evalml for feature selection. we can do it in the PR, or do that as a follow up.

cc @dsherry @tyler3991

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even maybe we add a guide to the evalml docs about using evalml for feature selection and link to it from here. that is probably a better solution

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create an issue in EvalML to add a guide, and we can link the guide in here later. @kmax12

@tamargrey tamargrey requested a review from rwedge October 14, 2020 18:43
@tamargrey
Copy link
Contributor Author

Switched to using the flight demo dataset.

Used the cutoff_time parameter to dfs to create some null values in the feature matrix.

Also added sections that list the features removed to hopefully make the results of running the functions more clear

@rwedge
Copy link
Contributor

rwedge commented Oct 14, 2020

@kmax12 thoughts on the dataset change / showing all dropped features?

@codecov
Copy link

codecov bot commented Oct 16, 2020

Codecov Report

Merging #1184 into main will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1184   +/-   ##
=======================================
  Coverage   98.60%   98.60%           
=======================================
  Files         130      130           
  Lines       13932    13932           
=======================================
  Hits        13738    13738           
  Misses        194      194           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7492925...962b384. Read the comment docs.

"metadata": {},
"outputs": [],
"source": [
"ft.selection.remove_highly_null_features(fm)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the other examples, it looks like we pass in features=features, would it make sense to do the same here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely can! The feature list is optional, and it just impacts whether or not we get an updated feature list back, which I'd added in for the others so that we could highlight which features were removed.

If it's better to be consistent here, happy to pass the feature list in here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, i dont know if consistency is what is most important. rather i would make sure it reads as a clear narrative. so it's fine to start without using the parameter, but then add it in later, just make sure it is explained

so, i'd make sure to clearly call out and explain the usage of the features parameter in the guide. perhaps the best way to do that would be a note section like we use elsewhere

image

Copy link
Contributor Author

@tamargrey tamargrey Oct 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmax12 Got it--added a note when we first use features as a parameter and an extra line highlighting how we use the results

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. thanks!

kmax12
kmax12 previously approved these changes Oct 16, 2020
kmax12
kmax12 previously approved these changes Oct 19, 2020
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@tamargrey tamargrey merged commit bbd754a into main Oct 20, 2020
@jeff-hernandez jeff-hernandez mentioned this pull request Oct 30, 2020
@rwedge rwedge deleted the feature-selection-docs branch February 19, 2021 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add docs page for the feature selection functions
4 participants