Allow dask dataframe during entity creation (#783)
* allow dask dataframe for entity creation

* update create entity from dask df test

* update test to specify variable types

* add dask entityset relationships and dfs

* multiple dask updates and add simple dask test

* multiple updates for dask dataframes

* update dask requirements

* update dask test for test_hackathon

* fixes after merging in changes from master

* initial updates for aggregation with dask dataframes

* update dask tests

* updated dask tests

* dask multipartition tests

* update requirements to fix circleci featuretools installation

* generalize path to hackathon dataset

* add hackathon data to manifest

* bump circleci resource size for unit tests

* fix test for py35, try bumping circleci resources again

* remove hackathon test from circleci, reset resources

* use pd.testing.assert_frame_equal with check_like=True for different column orderings

* additional test fixes

* performance testing improvements

* add profiling script

* add checking for df types when creating entityset

* add test for training_window and fix text for cutoff time df

* update dask tests for consistency

* add test for approximate (doesn't pass currently)

* add test for adding last_time_index to dask entityset

* add dask test for secondary_time_index

* fix issue with TimeSince primitive with dask entityset

* update hackathon test

* remove some easy to remove computes

* fix Pandas 1.0 issues

* various updates for dask entities

* lint fix plus missing_ids change

* fix hackathon test

* update requirements.txt

* fixes for windows tests

* dask dfs fixes

* update aggregation primitives to use dask aggregation

* add temp tests directory

* update temporary tests

* update agg test file

* update encode_features for dask

* featuretools/dask-tests-tmp/test_instacart.py

* update dask tests

* update entity creation code

* lint and test updates

* instacart test updates

* lint fix

* remove leftover head() call

* fix encode features inplace test

* fix some issues with dask aggregations

* various dask updates

* update instacart test files

* instacart test updates

* instacart test updates

* cutoff time updates in cfm

* entity updates for _handle_time

* instacart test update

* update add_last_time_index to use dask

* instacart test updates

* add dask test for time_window

* update add_last_time_indeices

* improve Entity.query_by_values() implementation for Dask

* update dask tests

* lint fix

* revert entityset __repr__ code back to master code

* Fix issue with make_index and Dask entities (#895)

* start test for dask with make_index

* update dask test for make_index

* remove unnecessary code in _create_index()

* update dask make index test

* Update set_time_index code path for Dask dataframes and impacted test (#914)

* set_time_index converts variable type in dask

* remove uses_full_entity primitives

* use groupby_trans_primitives for uses_full_entity primitives

* remove groupby features (currently unsupported in dask)

* simplify logic for getting time_type

* Update dask tests (#920)

* remove unsupported primitives, update tests

* update test_aggregation to remove ambiguity

* don't run windows tests in parallel to find failing

* skip dask-tmp-tests

* revert circleci config

* skip tmp dask tests on circleci for windows

* lint

* move ignore dask-temp-tests to setup.cfg

* Compose compatibility for Dask (#909)

* convert pass_through df to pandas for dask

* add compose to test_instacart, lint

* add test to check compose label_times accepted

* lint and add composeml to test requirements

* remove force to pandas, add dask compose test

* use >=0.2.0 for composeml

* test fixes

* add tests

* Refactor update_feature_columns (#924)

* refactor update_feature_columns

* update primitives to work with new update_feature_columns code

* update dask tests to skip unsupported primitives

* lint fix

* fix test_aggregation

* lint-fix

* remove check for list input in NumWords and NumCharacters

* remove check for list input from binary transform primitives

* Dask DFS errors with unsupported primitives (#925)

* add dask_compatible flag to primitives

* add tests

* remove unsupported from default, fix tests

* lint

* percentile needs full entity

* update error message

* Error if dask dataframe used for cutoff_time (#931)

* error if dask dataframe used for cutoff_time

* dfs compute with warning

* split out test

Co-authored-by: Roy Wedge <rwedge@featurelabs.com>

* Error if no vtypes given for Dask entity (#929)

* error if no vtypes supplied

* lint

Co-authored-by: Roy Wedge <rwedge@featurelabs.com>

* Restore len() call for Pandas in EntitySets.add_relationships (#943)

* restore len check for pandas and add test

* add dtype check to test

* error if feature_matrix is not Pandas df (#955)

* error if approximate or training window used with dask (#954)

* Revert changes in infer_variable_types (#957)

* update infer variable types

* remove unnecessary change

* Updates for running home credit example with Dask (#953)

* home credit tests

* update home credit test

* Improve column assignment for trans features

* update home credit test

* testing updates

* home credit test updates

* home credit notebook update

* home credit test updates

* home credit test updates

* home credit test updates

* home credit test updates

* home credit updates

* update feature_set_calculator.py

* remove unnecessary repartitioning from notebook

* update notebook text

* update notebook to use os.path.join

* Update list_primitives to indicate Dask compatibility (#963)

* update ft.list_primitives to include dask_compatible column

* fix merge mess

* Add Dask support to EqualScalar and NotEqualScalar primitives (#967)

* add dask support to EqualScalar and NotEqualScalar primitives

* remove pd.Series cast

* Add demo notebook for using Dask with Instacart dataset (#956)

* add instacart with dask notebook example

* update notebook text

* remove %%time from notebook cells

* Dask Test Updates (#973)

* remove dask hackathon test

* reorganize and remove unnecessary tests

* remove dask worker files

* remove dask-tests-tmp directory

* remove dask_profiling.py

* update Makefile and MANIFEST.in

* Dask entityset serialization/deserialization (#981)

* add deserialize support for dask entities

* add tests

* error if to_pickle, fix tests

* add to_pickle errors test

* bump schema version

* fix merge issue

* Support numeric time index for Dask entityset (#992)

* support numeric time index for Dask entityset

* remove unused test fixture

* refactor pass through cols merging

* fix dask test with cutoff times

* Update docs for using Dask entitysets (#965)

* initial doc updates for dask

* update parallel computation guide

* finish initial docs for using dask entitysets

* update EntitySet styling in docs

* various doc additions and improvements

* wording updates

* label dask entityset support as beta and add link for reporting issues

* clear faq notebook output

* remove dask_profiling.py

* fix spelling errors

* update Dask guide

* Dask cleanup (#964)

* initial clean + revert unused

* lint

* more reverts

* remove unused check

* Run unit tests on pandas and dask entitysets (#999)

* rename es to pd_es and parameterize es

* initial work on synthesis tests

* more synthesis test updates

* update utils_test

* update mock_customer_es fixture

* start primitive_tests updates

* update primitive_tests

* parameterize diamond_es

* parameterize games_es fixture

* update primitive_tests

* fix dfs tests with compose

* synthesis test updates

* first pass, failing tests need investigation

* lint and add missing dask tests to test_es.py

* update tests in test_last_time_index.py

* fix test_dask_primitives.py

* use dd.to_numeric for dask type conversions

* update test_es to xfail

* xfail synthesis tests

* update entityset_tests with xfail

* xfail primitive_tests

* xfail computational backend tests

* fix failing dask test

* lint

* fix tests and clean up

* cleanup and xfail plotting

* small fixes

Co-authored-by: Nate Parsons <nate.parsons@alteryx.com>

* changelog

* changelog

* revert changes

* Dask reverts for performance (#1008)

* reverts for performance

* update compose tests

* remove unused code and update docs (#1012)

* Uncomment Future Release section

* fix docs build

* Dask documentation improvements (#1015)

* improve Dask docs

* combine parallel computation and performance guides

* more doc updates

* fix note text

* Update setup.cfg

Co-authored-by: Frances Hartwell <frances.hartwell@alteryx.com>
Co-authored-by: Roy Wedge <rwedge@featurelabs.com>
3 people committed Jun 4, 2020
1 parent 5f8adb6 commit 996c189
Showing 56 changed files with 3,479 additions and 1,027 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -16,7 +16,7 @@ featuretools/tests/integration_data/products.gzip
featuretools/tests/integration_data/regions.gzip
featuretools/tests/integration_data/sessions.gzip
featuretools/tests/integration_data/stores.gzip
-dask-worker-space/*
+**/dask-worker-space/*
*.dirlock
*.~lock*

3 changes: 2 additions & 1 deletion docs/source/changelog.rst
@@ -4,6 +4,7 @@ Changelog
---------
**Future Release**
* Enhancements
+* Support use of Dask DataFrames in entitysets (:pr:`783`)
* Add ``make_index`` when initializing an EntitySet by passing in an ``entities`` dictionary (:pr:`1010`)
* Fixes
* Changes
@@ -12,7 +13,7 @@ Changelog
* Update tests for numpy v1.19.0 compatibility (:pr:`1016`)

Thanks to the following people for contributing to this release:
-:user:`gsheni`, :user:`frances-h`
+:user:`gsheni`, :user:`frances-h`, :user:`rwedge`, :user:`thehomebrewnerd`

**v0.15.0 May 29, 2020**
* Enhancements
30 changes: 29 additions & 1 deletion docs/source/frequently_asked_questions.ipynb
@@ -419,6 +419,21 @@
"feature_matrix[[\"COUNT(sessions WHERE product_id_device = 5 and tablet)\"]]"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Can I create an `EntitySet` using Dask dataframes? (BETA)\n",
+"\n",
+"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
+"\n",
+"Yes! Featuretools supports creating an `EntitySet` from Dask dataframes. You can simply follow the same process you would when creating an `EntitySet` from pandas dataframes.\n",
+"\n",
+"There are some limitations to be aware of when using Dask dataframes. When creating an `Entity` from a Dask dataframe, variable type inference is not performed as it is for pandas entities, so the user must supply a list of variable types during creation. Also, other quality checks are not performed, such as checking for unique index values. An `EntitySet` must be created entirely of Dask entities or pandas entities - you cannot mix pandas entities with Dask entities in the same `EntitySet`.\n",
+"\n",
+"For more information on creating an `EntitySet` from Dask dataframes, see the [Using Dask EntitySets](https://docs.featuretools.com/en/stable/guides/using_dask_entitysets.html) guide."
+]
+},
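The limitations described in the FAQ entry above can be sketched as a small stand-alone check. This is not Featuretools code: `validate_new_entity`, `try_add`, and the `"pandas"`/`"dask"` backend labels are hypothetical names chosen for illustration, and the library's own checks and error messages may differ.

```python
# Hypothetical stand-in for the rules described in the FAQ text. None of
# these names come from Featuretools; they are illustrative assumptions.

def validate_new_entity(backend, variable_types, existing_backends):
    """Raise ValueError if a new entity would break the described rules.

    backend: "pandas" or "dask" (the kind of dataframe behind the entity)
    variable_types: dict mapping column name -> type name, or None
    existing_backends: backends of entities already in the entityset
    """
    if backend not in ("pandas", "dask"):
        raise ValueError(f"unknown backend: {backend!r}")
    # Types are not inferred for Dask entities, so the caller must supply them.
    if backend == "dask" and not variable_types:
        raise ValueError("variable types must be supplied for a Dask entity")
    # An entityset is built entirely from pandas or entirely from Dask entities.
    if any(b != backend for b in existing_backends):
        raise ValueError("cannot mix pandas and Dask entities in one entityset")


def try_add(backend, variable_types, existing_backends):
    """Return 'ok' or the error message, for easy demonstration."""
    try:
        validate_new_entity(backend, variable_types, existing_backends)
    except ValueError as err:
        return str(err)
    return "ok"


print(try_add("pandas", None, []))                    # pandas may infer types
print(try_add("dask", None, []))                      # rejected: no types given
print(try_add("dask", {"id": "Index"}, ["pandas"]))   # rejected: mixed backends
```

The sketch only models the two rules the text states explicitly; the skipped quality checks (such as unique index values) are not represented.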
{
"cell_type": "markdown",
"metadata": {},
@@ -1509,7 +1524,7 @@
"source": [
"### How do I get a list of all Aggregation and Transform primitives?\n",
"\n",
-"You can do `featuretools.list_primitives()` to get all the primitives in Featuretools. It will return a Dataframe with the names, type, and description of the primitives. You can also visit [primitives.featurelabs.com](https://primitives.featurelabs.com/) to obtain a list of all available primitives."
+"You can do `featuretools.list_primitives()` to get all the primitives in Featuretools. It will return a Dataframe with the names, type, and description of the primitives, and if the primitive can be used with entitysets created from Dask dataframes. You can also visit [primitives.featurelabs.com](https://primitives.featurelabs.com/) to obtain a list of all available primitives."
]
},
{
@@ -1531,6 +1546,19 @@
"df_primitives.tail()"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### What primitives can I use when creating a feature matrix from a Dask `EntitySet`? (BETA)\n",
+"\n",
+"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
+"\n",
+"When creating a feature matrix from a Dask `EntitySet`, only certain primitives can be used. Computation of certain features is quite expensive in a distributed environment, and as a result only a subset of Featuretools primitives are currently supported when using a Dask `EntitySet`.\n",
+"\n",
+"The table returned by `featuretools.list_primitives()` will contain a column labeled `dask_compatible`. Any primitive that has a value of `True` in this column can be used safely when computing a feature matrix from a Dask `EntitySet`."
+]
+},
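Selecting usable primitives from the table described above amounts to filtering on the `dask_compatible` column. The sketch below shows that selection on stand-in rows in plain Python; the primitive names and flag values are invented for illustration and are not real `featuretools.list_primitives()` output, and `dask_compatible_names` is a hypothetical helper.

```python
# Stand-in rows shaped like the table described above. The names and flag
# values are made up for illustration; they are not real Featuretools output.
rows = [
    {"name": "example_agg_a", "type": "aggregation", "dask_compatible": True},
    {"name": "example_agg_b", "type": "aggregation", "dask_compatible": False},
    {"name": "example_trans_a", "type": "transform", "dask_compatible": True},
]


def dask_compatible_names(table, primitive_type):
    """Names of primitives of one type whose dask_compatible flag is True."""
    return [
        row["name"]
        for row in table
        if row["type"] == primitive_type and row["dask_compatible"]
    ]


agg_primitives = dask_compatible_names(rows, "aggregation")
trans_primitives = dask_compatible_names(rows, "transform")
print(agg_primitives, trans_primitives)
```

With the real table, the same selection would be a boolean filter on the `dask_compatible` column, and the resulting names could then be passed to `featuretools.dfs` through its `agg_primitives` and `trans_primitives` arguments.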
{
"cell_type": "markdown",
"metadata": {},
64 changes: 0 additions & 64 deletions docs/source/guides/parallel.rst

This file was deleted.
