Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update set_time_index code path for Dask dataframes and impacted test #914

Merged
merged 5 commits into from
Apr 20, 2020

Conversation

rwedge
Copy link
Contributor

@rwedge rwedge commented Apr 17, 2020

Fixes #913

This PR updates Entity.set_time_index so that Dask dataframes do not exit the function early. Specific points in the code where costly validation checks occur have different logic for Dask frames.

This is necessary to fix the test_build_es_from_scratch_and_run_dfs test.
For that test, I also changed the logic to get the two dataframe indexes to match and removed two primitives that have the uses_full_entity attribute. Features using these primitives were returning different values, which I think is happening because the dask entity data is not sorted

@rwedge rwedge requested a review from frances-h April 17, 2020 20:08
else:
time_to_check = self.df[variable_id].head(1).iloc[0]
if len(self.df) == 0:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may be able to simplify further by changing this line to

if isinstance(self.df, dd.core.DataFrame) or len(self.df) == 0:

and removing the conditional logic about old_vtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do that, won't you always force a compute() to get the length if self.df is a Dask dataframe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the first statement of the or is true, the second is not evaluated

https://docs.python.org/3/reference/expressions.html#boolean-operations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I'm understanding correctly, but we could get into trouble calling .head on a dask dataframe since it by default grabs the first partition which could be empty. Since we're requiring vtypes when creating an entity, using the vtype provided to figure out the index type seems like the safer route.

Copy link
Contributor Author

@rwedge rwedge Apr 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm proposing the logic look like this

        if isinstance(self.df, dd.core.DataFrame) or len(self.df) == 0:
            time_to_check = vtypes.DEFAULT_DTYPE_VALUES[self[variable_id]._default_pandas_dtype]
        else:
            time_to_check = self.df[variable_id].head(1).iloc[0]

Which, if the dataframe is a dask df, will use the vtype to figure out the index type

Copy link
Contributor

@frances-h frances-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

@rwedge rwedge merged commit 3111a1d into dask-entity Apr 20, 2020
@rwedge rwedge deleted the dask-set-time-index branch April 20, 2020 14:59
thehomebrewnerd added a commit that referenced this pull request Jun 4, 2020
* allow dask dataframe for entity creation

* update create entity from dask df test

* update test to specify variable types

* add dask entityset relationships and dfs

* multiple dask updates and add simple dask test

* multiple updates for dask dataframes

* update dask requirements

* update dask test for test_hackathon

* fixes after merging in changes from master

* initial updates for aggregation with dask dataframes

* update dask tests

* updated dask tests

* dask multipartition tests

* update requirements to fix circleci featuretools installation

* generalize path to hackathon dataset

* add hackathon data to manifest

* bump circleci resource size for unit tests

* fix test for py35, try bumping circleci resources again

* remove hackathon test from circleci, reset resources

* use pd.testing.assert_frame_equal with check_like=True for different column orderings

* additional test fixes

* performance testing improvements

* add profiling script

* add checking for df types when creating entityset

* add test for training_window and fix text for cutoff time df

* update dask tests for consistency

* add test for approximate (doesn't pass currently)

* add test for adding last_time_index to dask entityset

* add dask test for secondary_time_index

* fix issue with TimeSince primitive with dask entityset

* update hackathon test

* remove some easy to remove computes

* fix Pandas 1.0 issues

* various updates for dask entities

* lint fix plus missing_ids change

* fix hackathon test

* update requirements.txt

* fixes for windows tests

* dask dfs fixes

* update aggregation primitives to use dask aggregation

* add temp tests directory

* update temporary tests

* update agg test file

* update encode_features for dask

* featuretools/dask-tests-tmp/test_instacart.py

* update dask tests

* update entity creation code

* lint and test updates

* instacart test updates

* lint fix

* remove leftover head() call

* fix encode features inplace test

* fix some issues with dask aggregations

* various dask updates

* update instacart test files

* instacart test updates

* instacart test updates

* cutoff time updates in cfm

* entity updates for _handle_time

* instacart test update

* update add_last_time_index to use dask

* instacart test updates

* add dask test for time_window

* update add_last_time_indeices

* improve Entity.query_by_values() implementation for Dask

* update dask tests

* lint fix

* revert entityset __repr__ code back to master code

* Fix issue with make_index and Dask entities (#895)

* start test for dask with make_index

* update dask test for make_index

* remove unnecessary code in _create_index()

* update dask make index test

* Update set_time_index code path for Dask dataframes and impacted test (#914)

* set_time_index converts variable type in dask

* remove uses_full_entity primitives

* use groupby_trans_primitives for uses_full_entity primitives

* remove groupby features (currently unsupported in dask)

* simplify logic for getting time_type

* Update dask tests (#920)

* remove unsupported primitives, update tests

* update test_aggregation to remove ambiguity

* don't run windows tests in parallel to find failing

* skip dask-tmp-tests

* revert circleci config

* skip tmp dask tests on circleci for windows

* lint

* move ignore dask-temp-tests to setup.cfg

* Compose compatability for Dask (#909)

* convert pass_through df to pandas for dask

* add compose to test_instacart, lint

* add test to check compose label_times accepted

* lint and add composeml to test requirements

* remove force to pandas, add dask compose test

* use >=0.2.0 for composeml

* test fixes

* add tests

* Refactor update_feature_columns (#924)

* refactor update_feature_columns

* update primitives to work with new update_feature_columns code

* update dask tests to skip unsupported primitives

* lint fix

* fix test_aggregation

* lint-fix

* remove check for list input in NumWords and NumCharacters

* remove check for list input from binary transform primitives

* Dask DFS errors with unsupported primitives (#925)

* add dask_compatible flag to primitives

* add tests

* remove unsupported from default, fix tests

* lint

* percentile needs full entity

* update error message

* Error if dask dataframe used for cutoff_time (#931)

* error if dask dataframe used for cutoff_time

* dfs compute with warning

* split out test

Co-authored-by: Roy Wedge <rwedge@featurelabs.com>

* Error if no vtypes given for Dask entity (#929)

* error if no vtypes supplied

* lint

Co-authored-by: Roy Wedge <rwedge@featurelabs.com>

* Restore len() call for Pandas in EntitySets.add_relationships (#943)

* restore len check for pandas and add test

* add dtype check to test

* error if feature_matrix is not Pandas df (#955)

* error if approximate or training window used with dask (#954)

* Revert changes in infer_variable_types (#957)

* update infer variable types

* remove unnecessary change

* Updates for running home credit example with Dask (#953)

* home credit tests

* update home credit test

* Improve column assignment for trans features

* update home credit test

* testing updates

* home credit test updates

* home credit notebook update

* home credit test updates

* home credit test updates

* home credit test updates

* home credit test updates

* home credit updates

* update feature_set_calculator.py

* remove unnecessary repartitioning from notebook

* update notebook text

* update notebook to use os.path.join

* Update list_primitives to indicate Dask compatibility (#963)

* update ft.list_primitives to include dask_compatible column

* fix merge mess

* Add Dask support to EqualScalar and NotEqualScalar primitives (#967)

* add dask support to EqualScalar and NotEqualScalar primitives

* remove pd.Series cast

* Add demo notebook for using Dask with Instacart dataset (#956)

* add instacart with dask notebook example

* update notebook text

* remove %%time from notebook cells

* Dask Test Updates (#973)

* remove dask hackathon test

* reorganize and remove unnecessary tests

* remove dask worker files

* remove dask-tests-tmp directory

* remove dask_profiling.py

* update Makefile and MANIFEST.in

* Dask entityset serialization/deserialization (#981)

* add deserialize support for dask entities

* add tests

* error if to_pickle, fix tests

* add to_pickle errors test

* bump schema version

* fix merge issue

* Support numeric time index for Dask entityset (#992)

* support numeric time index for Dask entityset

* remove unused test fixture

* refactor pass through cols merging

* fix dask test with cutoff times

* Update docs for using Dask entitysets (#965)

* initial doc updates for dask

* update parallel computation guide

* finish initial docs for using dask entitysets

* update EntitySet styling in docs

* various doc additions and improvements

* wording updates

* label dask entityset support as beta and add link for reporting issues

* clear faq notebook output

* remove dask_profiling.py

* fix spelling errors

* update Dask guide

* Dask cleanup (#964)

* initial clean + revert unused

* lint

* more reverts

* remove unused check

* Run unit tests on pandas and dask entitysets (#999)

* rename es to pd_es and parameterize es

* initial work on synthesis tests

* more synthesis test updates

* update utils_test

* update mock_customer_es fixture

* start primitive_tests updates

* update primitive_tests

* parameterize diamond_es

* parameterize games_es fixture

* update primitive_tests

* fix dfs tests with compose

* synthesis test updates

* first pass, failing tests need investigation

* lint and add missing dask tests to test_es.py

* update tests in test_last_time_index.py

* fix test_dask_primitives.py

* use dd.to_numeric for dask type conversions

* update test_es to xfail

* xfail synthesis tests

* update entityset_tests with xfail

* xfail primitive_tests

* xfail computational backend tests

* fix failing dask test

* lint

* fix tests and clean up

* cleanup and xfail plotting

* small fixes

Co-authored-by: Nate Parsons <nate.parsons@alteryx.com>

* changelog

* changelog

* revert changes

* Dask reverts for performance (#1008)

* reverts for performance

* update compose tests

* remove unused code and update docs (#1012)

* Uncomment Future Release section

* fix docs build

* Dask documentation improvements (#1015)

* improve Dask docs

* combine parallel computation and performance guides

* more doc updates

* fix note text

* Update setup.cfg

Co-authored-by: Frances Hartwell <frances.hartwell@alteryx.com>
Co-authored-by: Roy Wedge <rwedge@featurelabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants