Allow dask dataframe during entity creation #783
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #783      +/-   ##
==========================================
- Coverage   98.27%   98.27%   -0.01%
==========================================
  Files         119      121       +2
  Lines       11078    12338    +1260
==========================================
+ Hits        10887    12125    +1238
- Misses        191      213      +22
Continue to review full report at Codecov.
* reverts for performance
* update compose tests
Here are some comments on the docs. Overall it's looking good, and well done documenting all the details. I wanted to get these comments over ASAP to get the release out, so let me know if anything is confusing and I can clarify or talk it through live.
docs/source/guides/parallel.rst
@@ -62,3 +62,11 @@ The dashboard requires an additional python package, bokeh, to work. Once bokeh
 Parallel Computation by Partitioning Data
 -----------------------------------------
 As an alternative to Featuretools' parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large ``EntitySet`` because the current parallel implementation sends the entire ``EntitySet`` to each worker, which may exhaust the worker memory. For more information on partitioning the data and using Dask or Spark, see :doc:`/guides/performance`. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.
+
+Computation with a Dask EntitySet (BETA)
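For readers skimming this thread, here is a hedged sketch of the partitioning approach this doc section describes; `partitions` (a list of per-group pandas dataframes) and `feature_defs` (a list of feature definitions, e.g. from `ft.dfs`) are illustrative assumptions, not part of the diff:

```python
import dask
import pandas as pd
import featuretools as ft

def features_for_partition(df_partition, feature_defs):
    # Build a small EntitySet for just this partition and compute its features,
    # so each worker only ever holds one partition in memory.
    es = ft.EntitySet(id="partition")
    es = es.entity_from_dataframe(entity_id="transactions",
                                  dataframe=df_partition,
                                  index="transaction_id")
    es = es.normalize_entity(base_entity_id="transactions",
                             new_entity_id="customers",
                             index="customer_id")
    return ft.calculate_feature_matrix(feature_defs, entityset=es)

# Run one delayed task per partition, then stitch the results together.
delayed_results = [dask.delayed(features_for_partition)(part, feature_defs)
                   for part in partitions]
feature_matrix = pd.concat(dask.compute(*delayed_results))
```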
I'd move this section up to the top, and we also need to edit the "Running Featuretools with Spark and Dask" section to align with this.
Want to make sure I fully understand this comment...
So, you want this new Dask section to be the very first section in the document instead of the last?
Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?
> So, you want this new Dask section to be the very first section in the document instead of the last?

Yep

> Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?

Right now that section says "If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing this simple request form."

However, if they want to use Dask, they don't need to fill out the form since Dask support is now released. They should still fill out the form for Spark, though.
@kmax12 If we remove the reference to Dask in this section (and the corresponding section in Improving Computational Performance), we also need to update the linked request form, which still mentions Dask. Can you update that?
Yep, I'll do that right after we release! Thanks for the reminder.
* improve Dask docs
* combine parallel computation and performance guides
* more doc updates
* fix note text
setup.cfg
@@ -1,7 +1,7 @@
 [metadata]
 description-file = README.md

 [tool:pytest]
-addopts = --doctest-modules --ignore=featuretools/tests/plugin_tests/featuretools_plugin
+addopts = --doctest-modules --ignore=featuretools/tests/plugin_tests/featuretools_plugin --ignore=featuretools/dask-tests-tmp/
This ignore flag (the dask-tests-tmp one) is no longer needed.
Amazing work. Excited to get this merged and released. Nice job!
Allows users to supply either a pandas dataframe or a dask dataframe when creating an entity with `es.entity_from_dataframe`.
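As a quick illustration, here is a minimal sketch of entity creation from a dask dataframe; the entityset id, column names, and variable types below are assumptions for the example, not taken from this PR:

```python
import dask.dataframe as dd
import pandas as pd
import featuretools as ft
from featuretools import variable_types as vtypes

# Build a small dask dataframe from pandas data.
df = pd.DataFrame({"id": [0, 1, 2, 3], "value": [10.0, 20.0, 30.0, 40.0]})
ddf = dd.from_pandas(df, npartitions=2)

es = ft.EntitySet(id="demo")
# entity_from_dataframe now accepts a dask dataframe as well as a pandas one.
# With dask input, variable types generally need to be supplied explicitly,
# since inferring them would require computing the data.
es = es.entity_from_dataframe(
    entity_id="items",
    dataframe=ddf,
    index="id",
    variable_types={"id": vtypes.Index, "value": vtypes.Numeric},
)
```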
Summary of Changes (a sketch illustrating several of these substitutions follows the list):

* Changed `df[col].tolist()` syntax to `list(df[col])`, since dask dataframes do not support `.tolist()`.
* Changed `frame.shape[0]` syntax to `len(frame)`, as calling `.shape` on a dask dataframe returns a delayed object for the number of rows; `len(frame)` must be used to get the total number of rows.
* Changed `if df.empty` syntax to `if len(df) == 0`, as dask dataframes do not have an `.empty` attribute.
* Dask dataframes do not support the `inplace` parameter for dropping variables or renaming columns.
* Dask requires calling `.compute()` before `.is_unique`, so added logic to include this step for dask dataframes.
* Dask dataframes do not support `.iloc` to select rows, so to get the first row of data you must first call `.head()` or `.compute()` on the dataframe. `.head()` is better if you only want the first row, as `.compute()` will compute the full dataframe, which isn't needed.
* Calling `.set_index()` on a dask dataframe generates an error if the index column passed is of type `categorical`. This can be fixed by calling `.cat.as_ordered()` on the index column passed in, or by first casting the column to type `object` with `.astype(object)`. On a simple test there was only a slight difference in performance, but `.astype(object)` was 4% faster.
* Updated sampling in `utils/entity_utils.py`, as dask dataframes do not support sampling by specifying the number of samples.
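To make the substitutions above concrete, here is a hedged sketch of the pandas-to-dask idiom changes on a toy dataframe (the column names are illustrative, not from the PR):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({"id": [0, 1, 2], "value": [1.0, 2.0, 3.0]}),
    npartitions=2,
)

values = list(ddf["value"])        # instead of ddf["value"].tolist()
n_rows = len(ddf)                  # instead of ddf.shape[0]; .shape is lazy on dask
is_empty = len(ddf) == 0           # instead of ddf.empty
unique = ddf["id"].compute().is_unique  # .compute() needed before .is_unique
first_row = ddf.head(1)            # instead of ddf.iloc[0]; head() avoids a full compute
ddf = ddf.rename(columns={"value": "amount"})  # dask has no inplace=True

# If the index column were categorical, cast it first to avoid the
# set_index error mentioned above:
# ddf = ddf.assign(id=ddf["id"].astype(object)).set_index("id")
```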