Updates for running home credit example with Dask #953

Merged: thehomebrewnerd merged 19 commits into dask-entity from dask-home-credit on May 12, 2020

thehomebrewnerd (Contributor)

This PR includes the changes made to run benchmark tests with the Kaggle home credit default risk dataset. It also adds a notebook that shows how to generate a feature matrix from this dataset using Dask entities.

Fixes #884

@thehomebrewnerd changed the title from "Updated for running home credit example with Dask" to "Updates for running home credit example with Dask" on May 7, 2020

codecov bot commented May 7, 2020

Codecov Report

Merging #953 into dask-entity will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           dask-entity     #953      +/-   ##
===============================================
- Coverage        97.15%   97.14%   -0.02%     
===============================================
  Files              120      120              
  Lines            11617    11616       -1     
===============================================
- Hits             11287    11284       -3     
- Misses             330      332       +2     

Impacted Files                                           Coverage Δ
...s/computational_backends/feature_set_calculator.py   89.90% <100.00%> (-0.03%) ⬇️
featuretools/synthesis/deep_feature_synthesis.py        97.34% <0.00%> (-0.49%) ⬇️
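
The summary line reports a 0.01% decrease while the diff table shows -0.02%. A minimal sketch of the arithmetic, using only the Hits and Lines counts from the coverage diff, shows one way both figures can come from the same data. It assumes (this is an inference from the numbers, not documented Codecov behavior) that the displayed percentages are truncated to two decimals while the delta is rounded. The helper names `coverage_pct`, `truncate2`, and `round2` are made up for illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def coverage_pct(hits: int, lines: int) -> Decimal:
    """Coverage as a percentage: covered lines divided by total lines."""
    return Decimal(hits) * 100 / Decimal(lines)


def truncate2(value: Decimal) -> Decimal:
    """Drop everything past two decimals (assumed display behavior)."""
    return value.quantize(TWO_PLACES, rounding=ROUND_DOWN)


def round2(value: Decimal) -> Decimal:
    """Round to two decimals (assumed behavior for the delta column)."""
    return value.quantize(TWO_PLACES, rounding=ROUND_HALF_UP)


# Hits and Lines taken from the coverage diff table.
base = coverage_pct(hits=11287, lines=11617)  # dask-entity
head = coverage_pct(hits=11284, lines=11616)  # #953

print("base:", truncate2(base))                      # truncated base coverage
print("head:", truncate2(head))                      # truncated head coverage
print("delta of exact values:", round2(head - base))
print("delta of displayed values:", truncate2(head) - truncate2(base))
```

Under these assumptions the exact delta rounds to -0.02, while subtracting the two displayed percentages gives -0.01, which matches both lines of the report.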

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e413378...85ab893.

@rwedge (Contributor) left a comment

Looks good

@thehomebrewnerd merged commit 8a30486 into dask-entity on May 12, 2020
@thehomebrewnerd deleted the dask-home-credit branch on May 12, 2020 21:12
thehomebrewnerd added a commit that referenced this pull request on Jun 4, 2020
* allow dask dataframe for entity creation

* update create entity from dask df test

* update test to specify variable types

* add dask entityset relationships and dfs

* multiple dask updates and add simple dask test

* multiple updates for dask dataframes

* update dask requirements

* update dask test for test_hackathon

* fixes after merging in changes from master

* initial updates for aggregation with dask dataframes

* update dask tests

* updated dask tests

* dask multipartition tests

* update requirements to fix circleci featuretools installation

* generalize path to hackathon dataset

* add hackathon data to manifest

* bump circleci resource size for unit tests

* fix test for py35, try bumping circleci resources again

* remove hackathon test from circleci, reset resources

* use pd.testing.assert_frame_equal with check_like=True for different column orderings

* additional test fixes

* performance testing improvements

* add profiling script

* add checking for df types when creating entityset

* add test for training_window and fix text for cutoff time df

* update dask tests for consistency

* add test for approximate (doesn't pass currently)

* add test for adding last_time_index to dask entityset

* add dask test for secondary_time_index

* fix issue with TimeSince primitive with dask entityset

* update hackathon test

* remove some easy to remove computes

* fix Pandas 1.0 issues

* various updates for dask entities

* lint fix plus missing_ids change

* fix hackathon test

* update requirements.txt

* fixes for windows tests

* dask dfs fixes

* update aggregation primitives to use dask aggregation

* add temp tests directory

* update temporary tests

* update agg test file

* update encode_features for dask

* featuretools/dask-tests-tmp/test_instacart.py

* update dask tests

* update entity creation code

* lint and test updates

* instacart test updates

* lint fix

* remove leftover head() call

* fix encode features inplace test

* fix some issues with dask aggregations

* various dask updates

* update instacart test files

* instacart test updates

* instacart test updates

* cutoff time updates in cfm

* entity updates for _handle_time

* instacart test update

* update add_last_time_index to use dask

* instacart test updates

* add dask test for time_window

* update add_last_time_indeices

* improve Entity.query_by_values() implementation for Dask

* update dask tests

* lint fix

* revert entityset __repr__ code back to master code

* Fix issue with make_index and Dask entities (#895)

* start test for dask with make_index

* update dask test for make_index

* remove unnecessary code in _create_index()

* update dask make index test

* Update set_time_index code path for Dask dataframes and impacted test (#914)

* set_time_index converts variable type in dask

* remove uses_full_entity primitives

* use groupby_trans_primitives for uses_full_entity primitives

* remove groupby features (currently unsupported in dask)

* simplify logic for getting time_type

* Update dask tests (#920)

* remove unsupported primitives, update tests

* update test_aggregation to remove ambiguity

* don't run windows tests in parallel to find failing

* skip dask-tmp-tests

* revert circleci config

* skip tmp dask tests on circleci for windows

* lint

* move ignore dask-temp-tests to setup.cfg

* Compose compatability for Dask (#909)

* convert pass_through df to pandas for dask

* add compose to test_instacart, lint

* add test to check compose label_times accepted

* lint and add composeml to test requirements

* remove force to pandas, add dask compose test

* use >=0.2.0 for composeml

* test fixes

* add tests

* Refactor update_feature_columns (#924)

* refactor update_feature_columns

* update primitives to work with new update_feature_columns code

* update dask tests to skip unsupported primitives

* lint fix

* fix test_aggregation

* lint-fix

* remove check for list input in NumWords and NumCharacters

* remove check for list input from binary transform primitives

* Dask DFS errors with unsupported primitives (#925)

* add dask_compatible flag to primitives

* add tests

* remove unsupported from default, fix tests

* lint

* percentile needs full entity

* update error message

* Error if dask dataframe used for cutoff_time (#931)

* error if dask dataframe used for cutoff_time

* dfs compute with warning

* split out test

Co-authored-by: Roy Wedge <rwedge@featurelabs.com>

* Error if no vtypes given for Dask entity (#929)

* error if no vtypes supplied

* lint

Co-authored-by: Roy Wedge <rwedge@featurelabs.com>

* Restore len() call for Pandas in EntitySets.add_relationships (#943)

* restore len check for pandas and add test

* add dtype check to test

* error if feature_matrix is not Pandas df (#955)

* error if approximate or training window used with dask (#954)

* Revert changes in infer_variable_types (#957)

* update infer variable types

* remove unnecessary change

* Updates for running home credit example with Dask (#953)

* home credit tests

* update home credit test

* Improve column assignment for trans features

* update home credit test

* testing updates

* home credit test updates

* home credit notebook update

* home credit test updates

* home credit test updates

* home credit test updates

* home credit test updates

* home credit updates

* update feature_set_calculator.py

* remove unnecessary repartitioning from notebook

* update notebook text

* update notebook to use os.path.join

* Update list_primitives to indicate Dask compatibility (#963)

* update ft.list_primitives to include dask_compatible column

* fix merge mess

* Add Dask support to EqualScalar and NotEqualScalar primitives (#967)

* add dask support to EqualScalar and NotEqualScalar primitives

* remove pd.Series cast

* Add demo notebook for using Dask with Instacart dataset (#956)

* add instacart with dask notebook example

* update notebook text

* remove %%time from notebook cells

* Dask Test Updates (#973)

* remove dask hackathon test

* reorganize and remove unnecessary tests

* remove dask worker files

* remove dask-tests-tmp directory

* remove dask_profiling.py

* update Makefile and MANIFEST.in

* Dask entityset serialization/deserialization (#981)

* add deserialize support for dask entities

* add tests

* error if to_pickle, fix tests

* add to_pickle errors test

* bump schema version

* fix merge issue

* Support numeric time index for Dask entityset (#992)

* support numeric time index for Dask entityset

* remove unused test fixture

* refactor pass through cols merging

* fix dask test with cutoff times

* Update docs for using Dask entitysets (#965)

* initial doc updates for dask

* update parallel computation guide

* finish initial docs for using dask entitysets

* update EntitySet styling in docs

* various doc additions and improvements

* wording updates

* label dask entityset support as beta and add link for reporting issues

* clear faq notebook output

* remove dask_profiling.py

* fix spelling errors

* update Dask guide

* Dask cleanup (#964)

* initial clean + revert unused

* lint

* more reverts

* remove unused check

* Run unit tests on pandas and dask entitysets (#999)

* rename es to pd_es and parameterize es

* initial work on synthesis tests

* more synthesis test updates

* update utils_test

* update mock_customer_es fixture

* start primitive_tests updates

* update primitive_tests

* parameterize diamond_es

* parameterize games_es fixture

* update primitive_tests

* fix dfs tests with compose

* synthesis test updates

* first pass, failing tests need investigation

* lint and add missing dask tests to test_es.py

* update tests in test_last_time_index.py

* fix test_dask_primitives.py

* use dd.to_numeric for dask type conversions

* update test_es to xfail

* xfail synthesis tests

* update entityset_tests with xfail

* xfail primitive_tests

* xfail computational backend tests

* fix failing dask test

* lint

* fix tests and clean up

* cleanup and xfail plotting

* small fixes

Co-authored-by: Nate Parsons <nate.parsons@alteryx.com>

* changelog

* changelog

* revert changes

* Dask reverts for performance (#1008)

* reverts for performance

* update compose tests

* remove unused code and update docs (#1012)

* Uncomment Future Release section

* fix docs build

* Dask documentation improvements (#1015)

* improve Dask docs

* combine parallel computation and performance guides

* more doc updates

* fix note text

* Update setup.cfg

Co-authored-by: Frances Hartwell <frances.hartwell@alteryx.com>
Co-authored-by: Roy Wedge <rwedge@featurelabs.com>