Allow dask dataframe during entity creation #783
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #783      +/-   ##
==========================================
- Coverage   98.27%   98.27%   -0.01%
==========================================
  Files         119      121       +2
  Lines       11078    12338    +1260
==========================================
+ Hits        10887    12125    +1238
- Misses        191      213      +22
Continue to review full report at Codecov.
* reverts for performance
* update compose tests
Here are some comments on the docs. Overall it's looking good, and well done documenting all the details. I wanted to get these comments over ASAP to get the release out, so let me know if anything is confusing and I can clarify or talk it through live.
docs/source/guides/parallel.rst
@@ -62,3 +62,11 @@ The dashboard requires an additional python package, bokeh, to work. Once bokeh
 Parallel Computation by Partitioning Data
 -----------------------------------------
 As an alternative to Featuretools' parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large ``EntitySet`` because the current parallel implementation sends the entire ``EntitySet`` to each worker, which may exhaust the worker memory. For more information on partitioning the data and using Dask or Spark, see :doc:`/guides/performance`. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.
+
+Computation with a Dask EntitySet (BETA)
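For readers skimming this thread, here is a hedged sketch of the partitioning approach this doc section describes; `partitions` (a list of per-group pandas dataframes) and `feature_defs` (a list of feature definitions, e.g. from `ft.dfs`) are illustrative assumptions, not part of the diff:

```python
import dask
import pandas as pd
import featuretools as ft

def features_for_partition(df_partition, feature_defs):
    # Build a small EntitySet for just this partition and compute its features,
    # so each worker only ever holds one partition in memory.
    es = ft.EntitySet(id="partition")
    es = es.entity_from_dataframe(entity_id="transactions",
                                  dataframe=df_partition,
                                  index="transaction_id")
    es = es.normalize_entity(base_entity_id="transactions",
                             new_entity_id="customers",
                             index="customer_id")
    return ft.calculate_feature_matrix(feature_defs, entityset=es)

# Run one delayed task per partition, then stitch the results together.
delayed_results = [dask.delayed(features_for_partition)(part, feature_defs)
                   for part in partitions]
feature_matrix = pd.concat(dask.compute(*delayed_results))
```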
I'd move this section up to the top, and we also need to edit the "Running Featuretools with Spark and Dask" section to align with this.
Want to make sure I fully understand this comment...
So, you want this new Dask section to be the very first section in the document instead of the last?
Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?
> So, you want this new Dask section to be the very first section in the document instead of the last?

Yep

> Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?

Right now that section says "If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing this simple request form."

However, if they want to use Dask, they don't need to fill out the form since Dask support is now released. They should still fill out the form for Spark, though.
@kmax12 If we remove the reference to Dask in this section (and the corresponding section in Improving Computational Performance), we also need to update the linked request form, which still mentions Dask. Can you update that?
Yep, I'll do that right after we release! Thanks for the reminder.
* improve Dask docs
* combine parallel computation and performance guides
* more doc updates
* fix note text
setup.cfg
@@ -1,7 +1,7 @@
 [metadata]
 description-file = README.md

 [tool:pytest]
-addopts = --doctest-modules --ignore=featuretools/tests/plugin_tests/featuretools_plugin
+addopts = --doctest-modules --ignore=featuretools/tests/plugin_tests/featuretools_plugin --ignore=featuretools/dask-tests-tmp/
This ignore flag (the dask-tests-tmp one) is no longer needed.
Amazing work. Excited to get this merged and released. Nice job!
Allows users to supply either a pandas dataframe or a dask dataframe when creating an entity with `es.entity_from_dataframe`.
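As a quick illustration, here is a minimal sketch of entity creation from a dask dataframe; the entityset id, column names, and variable types below are assumptions for the example, not taken from this PR:

```python
import dask.dataframe as dd
import pandas as pd
import featuretools as ft
from featuretools import variable_types as vtypes

# Build a small dask dataframe from pandas data.
df = pd.DataFrame({"id": [0, 1, 2, 3], "value": [10.0, 20.0, 30.0, 40.0]})
ddf = dd.from_pandas(df, npartitions=2)

es = ft.EntitySet(id="demo")
# entity_from_dataframe now accepts a dask dataframe as well as a pandas one.
# With dask input, variable types generally need to be supplied explicitly,
# since inferring them would require computing the data.
es = es.entity_from_dataframe(
    entity_id="items",
    dataframe=ddf,
    index="id",
    variable_types={"id": vtypes.Index, "value": vtypes.Numeric},
)
```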
Summary of Changes (a sketch illustrating several of these substitutions follows the list):

* Changed `df[col].tolist()` syntax to `list(df[col])`, since dask dataframes do not support `.tolist()`.
* Changed `frame.shape[0]` syntax to `len(frame)`, as calling `.shape` on a dask dataframe returns a delayed object for the number of rows; `len(frame)` must be used to get the total number of rows.
* Changed `if df.empty` syntax to `if len(df) == 0`, as dask dataframes do not have an `.empty` attribute.
* Dask dataframes do not support the `inplace` parameter for dropping variables or renaming columns.
* Dask requires calling `.compute()` before `.is_unique`, so added logic to include this step for dask dataframes.
* Dask dataframes do not support `.iloc` to select rows, so to get the first row of data you must first call `.head()` or `.compute()` on the dataframe. `.head()` is better if you only want the first row, as `.compute()` will compute the full dataframe, which isn't needed.
* Calling `.set_index()` on a dask dataframe generates an error if the index column passed is of type `categorical`. This can be fixed by calling `.cat.as_ordered()` on the index column passed in, or by first casting the column to type `object` with `.astype(object)`. On a simple test there was only a slight difference in performance, but `.astype(object)` was 4% faster.
* Updated sampling in `utils/entity_utils.py`, as dask dataframes do not support sampling by specifying the number of samples.
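To make the substitutions above concrete, here is a hedged sketch of the pandas-to-dask idiom changes on a toy dataframe (the column names are illustrative, not from the PR):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({"id": [0, 1, 2], "value": [1.0, 2.0, 3.0]}),
    npartitions=2,
)

values = list(ddf["value"])        # instead of ddf["value"].tolist()
n_rows = len(ddf)                  # instead of ddf.shape[0]; .shape is lazy on dask
is_empty = len(ddf) == 0           # instead of ddf.empty
unique = ddf["id"].compute().is_unique  # .compute() needed before .is_unique
first_row = ddf.head(1)            # instead of ddf.iloc[0]; head() avoids a full compute
ddf = ddf.rename(columns={"value": "amount"})  # dask has no inplace=True

# If the index column were categorical, cast it first to avoid the
# set_index error mentioned above:
# ddf = ddf.assign(id=ddf["id"].astype(object)).set_index("id")
```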