
Allow dask dataframe during entity creation #783

Merged: 116 commits, Jun 4, 2020
Changes from 113 commits
Commits
0e3ba63
allow dask dataframe for entity creation
Oct 24, 2019
adbc1e8
update create entity from dask df test
Oct 24, 2019
e91c564
update test to specify variable types
Oct 24, 2019
805abed
add dask entityset relationships and dfs
Oct 24, 2019
789551f
multiple dask updates and add simple dask test
Nov 6, 2019
280bb6c
multiple updates for dask dataframes
Dec 19, 2019
d722058
update dask requirements
Dec 19, 2019
1e0f3a4
update dask test for test_hackathon
Dec 19, 2019
c35c839
Merge branch 'master' into dask-entity
thehomebrewnerd Dec 20, 2019
f567379
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Dec 20, 2019
d74a83d
fixes after merging in changes from master
Dec 20, 2019
93eacb7
initial updates for aggregation with dask dataframes
Jan 7, 2020
db0f63e
update dask tests
Jan 8, 2020
b2ce33f
updated dask tests
Jan 8, 2020
6ec74ff
dask multipartition tests
Jan 8, 2020
b9aa184
update requirements to fix circleci featuretools installation
frances-h Jan 9, 2020
84bce89
generalize path to hackathon dataset
frances-h Jan 9, 2020
b056bd7
add hackathon data to manifest
frances-h Jan 9, 2020
bf19155
bump circleci resource size for unit tests
frances-h Jan 9, 2020
21632d5
fix test for py35, try bumping circleci resources again
frances-h Jan 10, 2020
1507f81
remove hackathon test from circleci, reset resources
frances-h Jan 10, 2020
ee280fa
use pd.testing.assert_frame_equal with check_like=True for different …
frances-h Jan 10, 2020
36a6ad8
additional test fixes
frances-h Jan 10, 2020
bc4103a
performance testing improvements
Jan 10, 2020
245d5eb
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Jan 10, 2020
7e896e0
add profiling script
Jan 10, 2020
859e5ac
add checking for df types when creating entityset
Jan 24, 2020
4b477c8
add test for training_window and fix text for cutoff time df
Jan 27, 2020
021423f
update dask tests for consistency
Jan 28, 2020
5dbe6d3
add test for approximate (doesn't pass currently)
Jan 30, 2020
815cc47
add test for adding last_time_index to dask entityset
Jan 30, 2020
0472448
add dask test for secondary_time_index
Jan 30, 2020
f868120
fix issue with TimeSince primitive with dask entityset
Jan 31, 2020
ffc3dd3
update hackathon test
Feb 13, 2020
2bd87a9
remove some easy to remove computes
frances-h Feb 13, 2020
a4cdff3
Merge branch 'master' into dask-entity
frances-h Feb 21, 2020
c9a6c65
fix Pandas 1.0 issues
frances-h Feb 21, 2020
f4ee1cd
various updates for dask entities
Feb 24, 2020
2970842
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Feb 24, 2020
806c563
lint fix plus missing_ids change
Feb 24, 2020
673db93
fix hackathon test
Feb 24, 2020
e23ab59
Merge branch 'master' into dask-entity
thehomebrewnerd Feb 24, 2020
fea42c6
update requirements.txt
Feb 24, 2020
33cad04
fixes for windows tests
Feb 25, 2020
c98526a
dask dfs fixes
Feb 25, 2020
b3bbb34
update aggregation primitives to use dask aggregation
frances-h Mar 3, 2020
07d9125
add temp tests directory
Mar 16, 2020
1ec303f
update temporary tests
Mar 18, 2020
7ea1a9f
update agg test file
Mar 19, 2020
f304abc
update encode_features for dask
Mar 24, 2020
2c06b74
featuretools/dask-tests-tmp/test_instacart.py
Mar 24, 2020
756d761
update dask tests
Mar 24, 2020
ce82e19
update entity creation code
Mar 24, 2020
970c228
lint and test updates
Mar 24, 2020
132e432
instacart test updates
Mar 25, 2020
d3ba08b
lint fix
Mar 25, 2020
ad0a1c3
remove leftover head() call
Mar 25, 2020
1c270fe
fix encode features inplace test
frances-h Mar 31, 2020
ac3a1b4
fix some issues with dask aggregations
frances-h Mar 31, 2020
5ebd81e
various dask updates
Apr 1, 2020
05a25a1
update instacart test files
Apr 3, 2020
c7edc0b
instacart test updates
Apr 6, 2020
9501916
instacart test updates
Apr 9, 2020
5403297
cutoff time updates in cfm
Apr 9, 2020
11e9696
entity updates for _handle_time
Apr 9, 2020
f55133c
instacart test update
Apr 9, 2020
c129cff
update add_last_time_index to use dask
frances-h Apr 9, 2020
c581bc2
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Apr 9, 2020
ff6fce3
instacart test updates
Apr 10, 2020
fc62201
add dask test for time_window
Apr 13, 2020
38150d8
update add_last_time_indeices
frances-h Apr 13, 2020
35159b2
improve Entity.query_by_values() implementation for Dask
Apr 13, 2020
32edf90
update dask tests
Apr 13, 2020
277a7bb
lint fix
Apr 13, 2020
d1aa368
revert entityset __repr__ code back to master code
Apr 14, 2020
e6d2418
Fix issue with make_index and Dask entities (#895)
thehomebrewnerd Apr 15, 2020
3111a1d
Update set_time_index code path for Dask dataframes and impacted test…
rwedge Apr 20, 2020
2fd10cf
Merge branch 'master' into dask-entity
thehomebrewnerd Apr 22, 2020
77904c3
Update dask tests (#920)
frances-h Apr 23, 2020
43ec29c
Compose compatability for Dask (#909)
frances-h Apr 24, 2020
c7fc449
Refactor update_feature_columns (#924)
thehomebrewnerd Apr 29, 2020
01d2ecd
Dask DFS errors with unsupported primitives (#925)
frances-h Apr 29, 2020
a1c57e5
Merge branch 'master' into dask-entity
frances-h May 1, 2020
07cc538
Error if dask dataframe used for cutoff_time (#931)
frances-h May 4, 2020
d941445
Error if no vtypes given for Dask entity (#929)
frances-h May 4, 2020
e413378
Restore len() call for Pandas in EntitySets.add_relationships (#943)
frances-h May 5, 2020
308c41e
error if feature_matrix is not Pandas df (#955)
frances-h May 8, 2020
8bc603a
error if approximate or training window used with dask (#954)
frances-h May 8, 2020
e7de404
Revert changes in infer_variable_types (#957)
thehomebrewnerd May 11, 2020
8a30486
Updates for running home credit example with Dask (#953)
thehomebrewnerd May 12, 2020
9350cdd
Merge branch 'master' of https://github.com/FeatureLabs/featuretools …
May 12, 2020
d0cbd2c
Update list_primitives to indicate Dask compatibility (#963)
thehomebrewnerd May 13, 2020
2d3230c
Add Dask support to EqualScalar and NotEqualScalar primitives (#967)
thehomebrewnerd May 15, 2020
b8570b0
Add demo notebook for using Dask with Instacart dataset (#956)
thehomebrewnerd May 18, 2020
572acc8
Dask Test Updates (#973)
thehomebrewnerd May 20, 2020
f901a75
Dask entityset serialization/deserialization (#981)
frances-h May 20, 2020
17f7dc3
Merge branch 'master' of https://github.com/FeatureLabs/featuretools …
May 22, 2020
a742656
Merge branch 'master' of https://github.com/FeatureLabs/featuretools …
May 26, 2020
af0a552
fix merge issue
May 26, 2020
223d154
Support numeric time index for Dask entityset (#992)
thehomebrewnerd May 27, 2020
e809e11
Update docs for using Dask entitysets (#965)
thehomebrewnerd May 27, 2020
cd1d58c
Dask cleanup (#964)
frances-h May 29, 2020
f5b38c1
Run unit tests on pandas and dask entitysets (#999)
frances-h May 29, 2020
a22dd75
Merge branch 'master' into dask-entity
frances-h May 29, 2020
9ef79e2
changelog
frances-h May 29, 2020
71d45fb
changelog
frances-h May 29, 2020
b07d0de
revert changes
May 29, 2020
87d91a6
Dask reverts for performance (#1008)
frances-h Jun 2, 2020
f030ed8
Merge branch 'master' into dask-entity
rwedge Jun 2, 2020
1338439
remove unused code and update docs (#1012)
thehomebrewnerd Jun 3, 2020
200eb10
Merge branch 'master' into dask-entity
rwedge Jun 3, 2020
9fa22bb
Uncomment Future Release section
rwedge Jun 3, 2020
2ab61a9
fix docs build
rwedge Jun 3, 2020
2663622
Dask documentation improvements (#1015)
thehomebrewnerd Jun 4, 2020
cd275a2
Merge branch 'master' into dask-entity
thehomebrewnerd Jun 4, 2020
1f4fe5a
Update setup.cfg
rwedge Jun 4, 2020
2 changes: 1 addition & 1 deletion .gitignore
@@ -16,7 +16,7 @@ featuretools/tests/integration_data/products.gzip
featuretools/tests/integration_data/regions.gzip
featuretools/tests/integration_data/sessions.gzip
featuretools/tests/integration_data/stores.gzip
dask-worker-space/*
**/dask-worker-space/*
*.dirlock
*.~lock*

6 changes: 4 additions & 2 deletions docs/source/changelog.rst
@@ -2,15 +2,17 @@

Changelog
---------
.. **Future Release**
**Future Release**
* Enhancements
* Support use of Dask DataFrames in entitysets (:pr:`783`)
* Add ``make_index`` when initializing an EntitySet by passing in an ``entities`` dictionary (:pr:`1010`)
* Fixes
* Changes
* Documentation Changes
* Testing Changes

Thanks to the following people for contributing to this release:
:user:`gsheni`
:user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

**v0.15.0 May 29, 2020**
* Enhancements
30 changes: 29 additions & 1 deletion docs/source/frequently_asked_questions.ipynb
@@ -419,6 +419,21 @@
"feature_matrix[[\"COUNT(sessions WHERE product_id_device = 5 and tablet)\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Can I create an `EntitySet` using Dask dataframes? (BETA)\n",
"\n",
"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
"\n",
"Yes! Featuretools supports creating an `EntitySet` from Dask dataframes. You can simply follow the same process you would when creating an `EntitySet` from pandas dataframes.\n",
"\n",
"There are some limitations to be aware of when using Dask dataframes. When creating an `Entity` from a Dask dataframe, variable type inference is not performed as it is for pandas entities, so the user must supply a list of variable types during creation. Also, other quality checks are not performed, such as checking for unique index values. An `EntitySet` must be created entirely of Dask entities or pandas entities - you cannot mix pandas entities with Dask entities in the same `EntitySet`.\n",
"\n",
"For more information on creating an `EntitySet` from Dask dataframes, see the [DFS with Dask EntitySets](https://docs.featuretools.com/en/stable/guides/dfs_with_dask_entitysets.html) guide."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -1509,7 +1524,7 @@
"source": [
"### How do I get a list of all Aggregation and Transform primitives?\n",
"\n",
"You can do `featuretools.list_primitives()` to get all the primitive in Featuretools. It will return a Dataframe with the names, type, and description of the primitives. You can also visit [primitives.featurelabs.com](https://primitives.featurelabs.com/) to obtain a list of all available primitives."
"You can do `featuretools.list_primitives()` to get all the primitive in Featuretools. It will return a Dataframe with the names, type, and description of the primitives, and if the primitive can be used with entitysets created from Dask dataframes. You can also visit [primitives.featurelabs.com](https://primitives.featurelabs.com/) to obtain a list of all available primitives."
]
},
{
@@ -1531,6 +1546,19 @@
"df_primitives.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What primitives can I use when creating a feature matrix from a Dask `EntitySet`? (BETA)\n",
"\n",
"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
"\n",
"When creating a feature matrix from a Dask `EntitySet`, only certain primitives can be used. Computation of certain features is quite expensive in a distributed environment, and as a result only a subset of Featuretools primitives are currently supported when using a Dask `EntitySet`.\n",
"\n",
"The table returned by `featuretools.list_primitives()` will contain a column labeled `dask_compatible`. Any primitive that has a value of `True` in this column can be used safely when computing a feature matrix from a Dask `EntitySet`."
]
},
{
"cell_type": "markdown",
"metadata": {},
121 changes: 121 additions & 0 deletions docs/source/guides/dfs_with_dask_entitysets.rst
@@ -0,0 +1,121 @@
DFS with Dask EntitySets (BETA)
===============================
Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a `new issue on Github <https://github.com/FeatureLabs/featuretools/issues>`_.

Creating a feature matrix from a very large dataset can be problematic if the underlying pandas dataframes that make up the entities cannot easily fit in memory. To help get around this issue, Featuretools supports creating ``Entity`` and ``EntitySet`` objects from Dask dataframes. A Dask ``EntitySet`` can then be passed to ``featuretools.dfs`` or ``featuretools.calculate_feature_matrix`` to create a feature matrix, which will be returned as a Dask dataframe. In addition to working on larger than memory datasets, this approach also allows users to take advantage of the parallel processing capabilities offered by Dask.

This guide will provide an overview of how to create a Dask ``EntitySet`` and then generate a feature matrix from it. If you are already familiar with creating a feature matrix starting from pandas dataframes, this process will seem quite familiar, as the workflow is identical. There are, however, some limitations when using Dask dataframes, and those limitations are reviewed in more detail below.

Creating Entities and EntitySets
--------------------------------
For this example, we will create a very small pandas dataframe and then convert this into a Dask dataframe to use in the remainder of the process. Normally when using Dask, you would just read your data directly into a Dask dataframe without the intermediate step of using pandas.

.. ipython:: python

import featuretools as ft
import pandas as pd
import dask.dataframe as dd
id = [0, 1, 2, 3, 4]
values = [12, -35, 14, 103, -51]
df = pd.DataFrame({"id": id, "values": values})
dask_df = dd.from_pandas(df, npartitions=2)
dask_df


Now that we have our Dask dataframe, we can start to create the ``EntitySet``. The current implementation does not support variable type inference for Dask entities, so we must pass a dictionary of variable types using the ``variable_types`` parameter when calling ``es.entity_from_dataframe()``. Aside from needing to supply the variable types, the rest of the process of creating an ``EntitySet`` is the same as if we were using pandas dataframes.

.. ipython:: python

es = ft.EntitySet(id="dask_es")
es = es.entity_from_dataframe(entity_id="dask_entity",
dataframe=dask_df,
index="id",
variable_types={"id": ft.variable_types.Id,
"values": ft.variable_types.Numeric})
es


Notice that when we print our ``EntitySet``, the number of rows for the ``dask_entity`` entity is returned as a Dask ``Delayed`` object. This is because obtaining the length of a Dask dataframe requires an expensive compute operation to sum up the lengths of all the individual partitions that make up the dataframe and that operation is not performed by default.


Running DFS
-----------
We can pass the ``EntitySet`` we created above to ``featuretools.dfs`` in order to create a feature matrix. If the ``EntitySet`` we pass to ``dfs`` is made of Dask entities, the feature matrix we get back will be a Dask dataframe.

.. ipython:: python

feature_matrix, features = ft.dfs(entityset=es,
target_entity="dask_entity",
trans_primitives=["negate"])
feature_matrix


This feature matrix can be saved to disk or computed and brought into memory, using the appropriate Dask dataframe methods.

.. ipython:: python

fm_computed = feature_matrix.compute()
fm_computed


While this is a simple example to illustrate the process of using Dask dataframes with Featuretools, this process will also work with an ``EntitySet`` containing multiple entities, as well as with aggregation primitives.

Limitations
-----------
There are many parts of Featuretools that are difficult to implement in a distributed environment and several primitives that are not well suited to operate on distributed dataframes. As a result, there are some limitations when creating a Dask ``EntitySet`` and then using it to generate a feature matrix. The most significant limitations are reviewed in more detail in this section.

Supported Primitives
********************
When creating a feature matrix from a Dask ``EntitySet``, only certain primitives can be used. Primitives that rely on the order of the entire dataframe or require an entire column to be computed are inefficient and difficult to support in a distributed environment. As a result, only a subset of Featuretools primitives are currently supported when using a Dask ``EntitySet``.

To obtain a list of the primitives that can be used with a Dask ``EntitySet``, you can call ``featuretools.list_primitives()``. This will return a table of all primitives. Any primitive that can be used with a Dask ``EntitySet`` will have a value of ``True`` in the ``dask_compatible`` column.


.. ipython:: python

primitives_df = ft.list_primitives()
dask_compatible_df = primitives_df[primitives_df["dask_compatible"] == True]
dask_compatible_df.head()
dask_compatible_df.tail()

Primitive Limitations
*********************
At this time, custom primitives created with ``featuretools.primitives.make_trans_primitive()`` or ``featuretools.primitives.make_agg_primitive()`` cannot be used for running deep feature synthesis on a Dask ``EntitySet``. Additionally, multivariable and time-dependent aggregation primitives are not currently supported. While it is possible to create custom primitives for use with a Dask ``EntitySet`` by extending the proper primitive class, there are several potential problems in doing so, and those issues are beyond the scope of this guide.

Entity Limitations
******************
When creating a Featuretools ``Entity`` from Dask dataframes, variable type inference is not performed as it is when creating entities from pandas dataframes. This is done to improve speed as sampling the data to infer the variable types would require an expensive compute operation on the underlying Dask dataframe. As a consequence, users must define the variable types for each column in the supplied Dataframe. This step is needed so that the deep feature synthesis process can build the proper features based on the column types. A list of available variable types can be obtained by running ``featuretools.variable_types.find_variable_types()``.

By default, Featuretools checks that entities created from pandas dataframes have unique index values. Because performing this same check with Dask would require an expensive compute operation, this check is not performed when creating an entity from a Dask dataframe. When using Dask dataframes, users must ensure that the supplied index values are unique.

When an ``Entity`` is created from a pandas dataframe, the ordering of the underlying dataframe rows is maintained. For a Dask ``Entity``, the ordering of the dataframe rows is not guaranteed, and Featuretools does not attempt to maintain row order in a Dask ``Entity``. If ordering is important, close attention must be paid to any output to avoid issues.

The ``Entity.add_interesting_values()`` method is not supported when using a Dask ``Entity``. If needed, users can manually set ``interesting_values`` on entities by assigning them directly with syntax similar to this: ``es["entity_name"]["variable_name"].interesting_values = ["Value 1", "Value 2"]``.

EntitySet Limitations
*********************
When creating a Featuretools ``EntitySet`` that will be made of Dask entities, all of the entities used to create the ``EntitySet`` must be of the same type, either all Dask entities or all pandas entities. Featuretools does not support creating an ``EntitySet`` containing a mix of Dask and pandas entities.

Additionally, the ``EntitySet.add_interesting_values()`` method is not supported when using a Dask ``EntitySet``. Users can manually set ``interesting_values`` on entities, as described above.

DFS Limitations
***************
There are a few key limitations when generating a feature matrix from a Dask ``EntitySet``.

If a ``cutoff_time`` parameter is passed to ``featuretools.dfs()``, it must either be a single cutoff time value or a pandas dataframe. The current implementation does not support the use of a Dask dataframe for cutoff time values.
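A pandas cutoff time dataframe can be built directly. This sketch assumes the conventional instance id / time column pair; the column names are illustrative and should be adjusted to match the target entity:

```python
import pandas as pd

# Hypothetical cutoff times: one row per instance, built as a plain pandas frame
cutoff_time = pd.DataFrame({
    "instance_id": [0, 1, 2],
    "time": pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-19"]),
})
print(cutoff_time.dtypes["time"])  # datetime64[ns]
```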

Additionally, Featuretools does not currently support the use of the ``approximate`` or ``training_window`` parameters when working with Dask entitysets, but should in future releases.

Finally, if the output feature matrix contains a boolean column that includes ``NaN`` values, that column may have a different dtype than the same feature matrix generated from a pandas ``EntitySet``. If feature matrix column data types are critical, inspect the feature matrix to confirm the types are as expected, and recast as necessary.
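The pandas behavior behind this caveat can be seen without Featuretools at all: a bool column that picks up missing values is no longer plain ``bool`` and may need recasting. The ``"boolean"`` target dtype below is just one possible choice:

```python
import pandas as pd

# Missing values push a bool column to object dtype in pandas
s = pd.Series([True, False, None])
print(s.dtype)  # object

# Recast to whatever dtype downstream code expects; nullable "boolean" is one option
s = s.astype("boolean")
print(s.dtype)  # boolean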

Other Limitations
*****************
In some instances, generating a feature matrix with a large number of features has resulted in memory issues on Dask workers. The underlying reason for this is that the partition size of the feature matrix grows too large for Dask to handle as the number of feature columns grows large. This issue is most prevalent when the feature matrix contains a large number of columns compared to the dataframes that make up the entities. Possible solutions to this problem include reducing the partition size used when creating the entity dataframes or increasing the memory available on Dask workers.

Currently ``featuretools.encode_features()`` does not work with a Dask dataframe as input. This will hopefully be resolved in a future release of Featuretools.

The utility function ``featuretools.make_temporal_cutoffs()`` will not work properly with Dask inputs for ``instance_ids`` or ``cutoffs``. However, as noted above, if a ``cutoff_time`` dataframe is supplied to ``dfs``, the supplied dataframe must be a pandas dataframe, and this can be generated by supplying pandas inputs to ``make_temporal_cutoffs()``.

``featuretools.remove_low_information_features()`` cannot currently be used with a Dask feature matrix.

When manually defining a ``Feature``, the ``use_previous`` parameter cannot be used if this feature will be applied to calculate a feature matrix from a Dask ``EntitySet``.
8 changes: 8 additions & 0 deletions docs/source/guides/parallel.rst
@@ -62,3 +62,11 @@ The dashboard requires an additional python package, bokeh, to work. Once bokeh
Parallel Computation by Partitioning Data
-----------------------------------------
As an alternative to Featuretool's parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large ``EntitySet`` because the current parallel implementation sends the entire ``EntitySet`` to each worker which may exhaust the worker memory. For more information on partitioning the data and using Dask or Spark, see :doc:`/guides/performance`. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.

Review thread on this section:

kmax12 (Contributor): i'd move this section up to the top and we need to also edit the "Running Featuretools with Spark and Dask" section to align with this

thehomebrewnerd (Author): Want to make sure I fully understand this comment... So, you want this new Dask section to be the very first section in the document instead of the last? Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?

kmax12 (Jun 4, 2020):
> So, you want this new Dask section to be the very first section in the document instead of the last?
Yep
> Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?
Right now that section says "If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing this simple request form." However, if they want to use dask, they dont need to fill out the form since dask is now released. They should still fill out the form for spark though

thehomebrewnerd (Author): @kmax12 If we remove the reference to Dask in this section (and the corresponding section in Improving Computational Performance), we also need to update the linked request form which still mentions Dask. Can you update that?

kmax12: yep, ill do that right after we release! thanks for the reminder

Computation with a Dask EntitySet (BETA)
----------------------------------------
Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a `new issue on Github <https://github.com/FeatureLabs/featuretools/issues>`_.

A final approach that can be used is to create a Featuretools ``EntitySet`` directly from Dask dataframes instead of using pandas dataframes. The other methods discussed above may not work with very large datasets because of the memory required to load or partition the pandas dataframes. By creating an ``EntitySet`` directly from Dask dataframes, Featuretools can be used to generate a larger-than-memory feature matrix in a parallel manner. When computing a feature matrix from an ``EntitySet`` created from Dask dataframes, the resulting feature matrix will be returned as a Dask dataframe.

This method does have some limitations in terms of the primitives that are available and the optional parameters that can be used when calculating the feature matrix. For more information on generating a feature matrix with this approach, refer to the guide :doc:`/guides/dfs_with_dask_entitysets`.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -165,6 +165,7 @@ Table of contents
guides/specifying_primitive_options
guides/performance
guides/parallel
guides/dfs_with_dask_entitysets
guides/deployment
guides/advanced_custom_primitives
