Update feature description guide to use Woodwork #1603
Merged
8 commits
dd580af
create feature description notebook and move contents from rst file
31b0297
superficial updating of code and language
5a12fb8
make sure outputs are as expected and that language makes more sense …
128acb0
format links and headers
edc5fb1
remove comments and hide cell
e63a72a
remove rst file
cd55220
PR comments
2da44a8
reword warning about getitem usage
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,384 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "54fcae4f", | ||
"metadata": {}, | ||
"source": [ | ||
"# Generating Feature Descriptions\n", | ||
"\n", | ||
"As features become more complicated, their names can become harder to understand. Both the [describe_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.describe_feature.html) function and the [graph_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.graph_feature.html) function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the ``describe_feature`` function can be augmented by providing custom definitions and templates to improve the resulting descriptions." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9f45803d", | ||
"metadata": { | ||
"nbsphinx": "hidden" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import featuretools as ft\n", | ||
"es = ft.demo.load_mock_customer(return_entityset=True)\n", | ||
"\n", | ||
"feature_defs = ft.dfs(entityset=es,\n", | ||
" target_dataframe_name=\"customers\",\n", | ||
" agg_primitives=[\"mean\", \"sum\", \"mode\", \"n_most_common\"],\n", | ||
" trans_primitives=[\"month\", \"hour\"],\n", | ||
" max_depth=2,\n", | ||
" features_only=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7b789d45", | ||
"metadata": {}, | ||
"source": [ | ||
"By default, ``describe_feature`` uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "07d48d9b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature_defs[9]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2e5e2490", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature_defs[9])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "0d07cc88", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature_defs[14]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "8941fcc1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature_defs[14])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "655399b9", | ||
"metadata": {}, | ||
"source": [ | ||
"## Improving Descriptions\n", | ||
"\n", | ||
"While the default descriptions can be helpful, they can be improved further by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions.\n", | ||
"\n", | ||
"#### Feature Descriptions\n", | ||
"Custom feature definitions are used in the description in place of the automatically generated description. This can be used to better explain what a `ColumnSchema` or feature is, or to provide descriptions that take advantage of a user's existing knowledge about the data or domain." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "16317515", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature_descriptions = {'customers: join_date': 'the date the customer joined'}\n", | ||
"\n", | ||
"ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9f9f3cc3", | ||
"metadata": {}, | ||
"source": [ | ||
"For example, the above replaces the column name, ``\"join_date\"``, with a more descriptive definition of what that column represents in the dataset. Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the ``description`` attribute present on each `ColumnSchema`:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "72f17801", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"join_date_column_schema = es['customers'].ww.columns['join_date']\n", | ||
"join_date_column_schema.description = 'the date the customer joined'\n", | ||
"\n", | ||
"es['customers'].ww.columns['join_date'].description" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a6d5bf37", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature = ft.TransformFeature(ft.IdentityFeature(es, 'customers', 'join_date'), ft.primitives.Hour)\n", | ||
"feature" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "31be0b92", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "raw", | ||
"id": "27013806", | ||
"metadata": {}, | ||
"source": [ | ||
".. note::\n", | ||
"\n", | ||
" When setting a description on a column in a DataFrame as described above, avoid setting it via ``df.ww[col_name].ww.description``. Using ``df.ww[col_name]`` creates an entirely new Series object that is not related to the EntitySet from which feature descriptions are built. Therefore, a description set in any way other than through the ``columns`` attribute will not be propagated to the feature description." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e9cb7f93", | ||
"metadata": {}, | ||
"source": [ | ||
"Descriptions must be set for a column in a DataFrame before the feature is created in order for descriptions to propagate. Note that if a description is both set directly on a column and passed to ``describe_feature`` with ``feature_descriptions``, the description in the ``feature_descriptions`` parameter will take precedence.\n", | ||
"\n", | ||
"Feature descriptions can also be provided for generated features." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "69e42e65", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature_descriptions = {\n", | ||
" 'sessions: SUM(transactions.amount)': 'the total transaction amount for a session'}\n", | ||
"\n", | ||
"feature_defs[14]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6a6e5484", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c56d7bf7", | ||
"metadata": {}, | ||
"source": [ | ||
"Here, we create and pass in a custom description of the intermediate feature ``SUM(transactions.amount)``. The description for ``MEAN(sessions.SUM(transactions.amount))``, which is built on top of ``SUM(transactions.amount)``, uses the custom description in place of the automatically generated one. Feature descriptions can be passed in as a dictionary that maps the custom descriptions to either the feature object itself or the unique feature name in the form ``\"[dataframe_name]: [feature_name]\"``, as shown above.\n", | ||
"\n", | ||
"#### Primitive Templates\n", | ||
"Primitive descriptions are generated using primitive templates. By default, these are defined using the ``description_template`` attribute on the primitive. Primitives without a template default to using the ``name`` attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overridden by mapping primitive instances or primitive names to custom templates and passing them into ``describe_feature`` through the ``primitive_templates`` argument." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "506d7db3", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"primitive_templates = {'sum': 'the total of {}'}\n", | ||
"\n", | ||
"feature_defs[6]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "ebb086bc", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e64d2553", | ||
"metadata": {}, | ||
"source": [ | ||
"In this example, we override the default template of ``'the sum of {}'`` with our custom template ``'the total of {}'``. The description uses our custom template instead of the default.\n", | ||
"\n", | ||
"Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the \"nth\" form is available through the ``nth_slice`` keyword." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "32346acf", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature = feature_defs[5]\n", | ||
"feature" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "514e213b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"primitive_templates = {\n", | ||
" 'n_most_common': [\n", | ||
" 'the 3 most common elements of {}', # generic multi-output feature\n", | ||
" 'the {nth_slice} most common element of {}']} # template for each slice \n", | ||
"\n", | ||
"ft.describe_feature(feature, primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1f2834c5", | ||
"metadata": {}, | ||
"source": [ | ||
"Notice how the multi-output feature uses the first template for its description. Each slice of this feature uses the second template, the slice template:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "73ad767e", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature[0], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "d23312f4", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature[1], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "028814bb", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature[2], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7fc4c168", | ||
"metadata": {}, | ||
"source": [ | ||
"Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e52d0309", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"primitive_templates = {\n", | ||
" 'n_most_common': [\n", | ||
" 'the 3 most common elements of {}',\n", | ||
" 'the most common element of {}',\n", | ||
" 'the second most common element of {}',\n", | ||
" 'the third most common element of {}']}\n", | ||
"\n", | ||
"ft.describe_feature(feature, primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2d51c451", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature[0], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "8031ff76", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature[1], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "fd495e9f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ft.describe_feature(feature[2], primitive_templates=primitive_templates)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "513b0ca0", | ||
"metadata": {}, | ||
"source": [ | ||
"Custom feature descriptions and primitive templates can also be separately defined in a JSON file and passed to the ``describe_feature`` function using the ``metadata_file`` keyword argument. Descriptions passed in directly through the ``feature_descriptions`` and ``primitive_templates`` keyword arguments will take precedence over any descriptions provided in the JSON metadata file." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"celltoolbar": "Raw Cell Format", | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
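As a reading aid for the primitive template cells in the notebook above, here is a minimal sketch of how a template list with one generic entry and one or more slice entries could be resolved using plain `str.format`. It assumes nothing about Featuretools internals: `ordinal` and `render_description` are hypothetical helpers written only for illustration, the input description string is made up, and only the template strings are taken from the notebook.

```python
def ordinal(n):
    """Return a 1-based position in "nth" form, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def render_description(templates, input_description, slice_index=None):
    """Pick and fill a template (hypothetical helper, not Featuretools code).

    templates[0] describes the whole multi-output feature. If exactly one
    more template follows, it is shared by every slice; otherwise each
    slice is expected to have its own template.
    """
    if slice_index is None:
        return templates[0].format(input_description)
    slice_templates = templates[1:]
    if len(slice_templates) == 1:
        template = slice_templates[0]
    else:
        template = slice_templates[slice_index]
    # Unused keyword arguments are ignored by str.format, so templates
    # that do not mention {nth_slice} still work.
    return template.format(input_description, nth_slice=ordinal(slice_index + 1))


templates = [
    "the 3 most common elements of {}",           # generic multi-output feature
    "the {nth_slice} most common element of {}",  # shared by every slice
]
print(render_description(templates, 'the "product_id"'))
print(render_description(templates, 'the "product_id"', slice_index=1))
```

The positional `{}` receives the input feature description, while `nth_slice` is passed by keyword, which mirrors the distinction the notebook draws between positional arguments and the ``nth_slice`` keyword.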
I like this a lot.
We could also add on at the end "This also applies when updating column metadata." Not necessary for the note but might be useful to share.
I'm thinking this gotcha might be an important thing to reference in the Woodwork typing guide. Since, like you said, it'll be relevant to updating information stored in the metadata as well.
There isn't really a section in the guide on accessing descriptions and metadata, but it might be worth it to add a section on "Updating a Column's Typing Information" at the bottom.
I'm thinking that adding the situation here to the FAQs in #1578 will mean that we shouldn't need to reference metadata here. I think it's still useful to have a note specific to the situation at hand in these guides when this situation comes up but that we don't need to reference the other situations.
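For whoever writes up the FAQ entry, the gotcha in the note can be illustrated without Woodwork at all. The sketch below is only a model of the behavior the note describes: `FakeColumnSchema` and `FakeAccessor` are hypothetical stand-ins, not Woodwork classes, and the deep copy in `__getitem__` is an assumption used to mimic "creates an entirely new object", not a claim about how Woodwork implements it.

```python
import copy


class FakeColumnSchema:
    """Hypothetical stand-in for a column's typing information."""

    def __init__(self, description=None):
        self.description = description


class FakeAccessor:
    """Hypothetical stand-in for a DataFrame-level accessor."""

    def __init__(self, columns):
        # Schemas stored here are the ones a downstream consumer would read.
        self.columns = columns

    def __getitem__(self, name):
        # Models the behavior described in the note: item access hands back
        # a new, unrelated object rather than the stored schema.
        return copy.deepcopy(self.columns[name])


accessor = FakeAccessor({"join_date": FakeColumnSchema()})

# Setting the description on the copy leaves the stored schema untouched.
accessor["join_date"].description = "the date the customer joined"
print(accessor.columns["join_date"].description)

# Going through `columns` updates the stored schema itself.
accessor.columns["join_date"].description = "the date the customer joined"
print(accessor.columns["join_date"].description)
```

The first assignment is silently lost because it lands on a throwaway copy, which is the same shape of problem as setting a description through item access instead of through the ``columns`` attribute.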