
Allow dask dataframe during entity creation #783

Merged: 116 commits, Jun 4, 2020
Changes from 113 commits
Commits
0e3ba63
allow dask dataframe for entity creation
Oct 24, 2019
adbc1e8
update create entity from dask df test
Oct 24, 2019
e91c564
update test to specify variable types
Oct 24, 2019
805abed
add dask entityset relationships and dfs
Oct 24, 2019
789551f
multiple dask updates and add simple dask test
Nov 6, 2019
280bb6c
multiple updates for dask dataframes
Dec 19, 2019
d722058
update dask requirements
Dec 19, 2019
1e0f3a4
update dask test for test_hackathon
Dec 19, 2019
c35c839
Merge branch 'master' into dask-entity
thehomebrewnerd Dec 20, 2019
f567379
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Dec 20, 2019
d74a83d
fixes after merging in changes from master
Dec 20, 2019
93eacb7
initial updates for aggregation with dask dataframes
Jan 7, 2020
db0f63e
update dask tests
Jan 8, 2020
b2ce33f
updated dask tests
Jan 8, 2020
6ec74ff
dask multipartition tests
Jan 8, 2020
b9aa184
update requirements to fix circleci featuretools installation
frances-h Jan 9, 2020
84bce89
generalize path to hackathon dataset
frances-h Jan 9, 2020
b056bd7
add hackathon data to manifest
frances-h Jan 9, 2020
bf19155
bump circleci resource size for unit tests
frances-h Jan 9, 2020
21632d5
fix test for py35, try bumping circleci resources again
frances-h Jan 10, 2020
1507f81
remove hackathon test from circleci, reset resources
frances-h Jan 10, 2020
ee280fa
use pd.testing.assert_frame_equal with check_like=True for different …
frances-h Jan 10, 2020
36a6ad8
additional test fixes
frances-h Jan 10, 2020
bc4103a
performance testing improvements
Jan 10, 2020
245d5eb
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Jan 10, 2020
7e896e0
add profiling script
Jan 10, 2020
859e5ac
add checking for df types when creating entityset
Jan 24, 2020
4b477c8
add test for training_window and fix text for cutoff time df
Jan 27, 2020
021423f
update dask tests for consistency
Jan 28, 2020
5dbe6d3
add test for approximate (doesn't pass currently)
Jan 30, 2020
815cc47
add test for adding last_time_index to dask entityset
Jan 30, 2020
0472448
add dask test for secondary_time_index
Jan 30, 2020
f868120
fix issue with TimeSince primitive with dask entityset
Jan 31, 2020
ffc3dd3
update hackathon test
Feb 13, 2020
2bd87a9
remove some easy to remove computes
frances-h Feb 13, 2020
a4cdff3
Merge branch 'master' into dask-entity
frances-h Feb 21, 2020
c9a6c65
fix Pandas 1.0 issues
frances-h Feb 21, 2020
f4ee1cd
various updates for dask entities
Feb 24, 2020
2970842
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Feb 24, 2020
806c563
lint fix plus missing_ids change
Feb 24, 2020
673db93
fix hackathon test
Feb 24, 2020
e23ab59
Merge branch 'master' into dask-entity
thehomebrewnerd Feb 24, 2020
fea42c6
update requirements.txt
Feb 24, 2020
33cad04
fixes for windows tests
Feb 25, 2020
c98526a
dask dfs fixes
Feb 25, 2020
b3bbb34
update aggregation primitives to use dask aggregation
frances-h Mar 3, 2020
07d9125
add temp tests directory
Mar 16, 2020
1ec303f
update temporary tests
Mar 18, 2020
7ea1a9f
update agg test file
Mar 19, 2020
f304abc
update encode_features for dask
Mar 24, 2020
2c06b74
featuretools/dask-tests-tmp/test_instacart.py
Mar 24, 2020
756d761
update dask tests
Mar 24, 2020
ce82e19
update entity creation code
Mar 24, 2020
970c228
lint and test updates
Mar 24, 2020
132e432
instacart test updates
Mar 25, 2020
d3ba08b
lint fix
Mar 25, 2020
ad0a1c3
remove leftover head() call
Mar 25, 2020
1c270fe
fix encode features inplace test
frances-h Mar 31, 2020
ac3a1b4
fix some issues with dask aggregations
frances-h Mar 31, 2020
5ebd81e
various dask updates
Apr 1, 2020
05a25a1
update instacart test files
Apr 3, 2020
c7edc0b
instacart test updates
Apr 6, 2020
9501916
instacart test updates
Apr 9, 2020
5403297
cutoff time updates in cfm
Apr 9, 2020
11e9696
entity updates for _handle_time
Apr 9, 2020
f55133c
instacart test update
Apr 9, 2020
c129cff
update add_last_time_index to use dask
frances-h Apr 9, 2020
c581bc2
Merge branch 'dask-entity' of https://github.com/FeatureLabs/featuret…
Apr 9, 2020
ff6fce3
instacart test updates
Apr 10, 2020
fc62201
add dask test for time_window
Apr 13, 2020
38150d8
update add_last_time_indeices
frances-h Apr 13, 2020
35159b2
improve Entity.query_by_values() implementation for Dask
Apr 13, 2020
32edf90
update dask tests
Apr 13, 2020
277a7bb
lint fix
Apr 13, 2020
d1aa368
revert entityset __repr__ code back to master code
Apr 14, 2020
e6d2418
Fix issue with make_index and Dask entities (#895)
thehomebrewnerd Apr 15, 2020
3111a1d
Update set_time_index code path for Dask dataframes and impacted test…
rwedge Apr 20, 2020
2fd10cf
Merge branch 'master' into dask-entity
thehomebrewnerd Apr 22, 2020
77904c3
Update dask tests (#920)
frances-h Apr 23, 2020
43ec29c
Compose compatability for Dask (#909)
frances-h Apr 24, 2020
c7fc449
Refactor update_feature_columns (#924)
thehomebrewnerd Apr 29, 2020
01d2ecd
Dask DFS errors with unsupported primitives (#925)
frances-h Apr 29, 2020
a1c57e5
Merge branch 'master' into dask-entity
frances-h May 1, 2020
07cc538
Error if dask dataframe used for cutoff_time (#931)
frances-h May 4, 2020
d941445
Error if no vtypes given for Dask entity (#929)
frances-h May 4, 2020
e413378
Restore len() call for Pandas in EntitySets.add_relationships (#943)
frances-h May 5, 2020
308c41e
error if feature_matrix is not Pandas df (#955)
frances-h May 8, 2020
8bc603a
error if approximate or training window used with dask (#954)
frances-h May 8, 2020
e7de404
Revert changes in infer_variable_types (#957)
thehomebrewnerd May 11, 2020
8a30486
Updates for running home credit example with Dask (#953)
thehomebrewnerd May 12, 2020
9350cdd
Merge branch 'master' of https://github.com/FeatureLabs/featuretools …
May 12, 2020
d0cbd2c
Update list_primitives to indicate Dask compatibility (#963)
thehomebrewnerd May 13, 2020
2d3230c
Add Dask support to EqualScalar and NotEqualScalar primitives (#967)
thehomebrewnerd May 15, 2020
b8570b0
Add demo notebook for using Dask with Instacart dataset (#956)
thehomebrewnerd May 18, 2020
572acc8
Dask Test Updates (#973)
thehomebrewnerd May 20, 2020
f901a75
Dask entityset serialization/deserialization (#981)
frances-h May 20, 2020
17f7dc3
Merge branch 'master' of https://github.com/FeatureLabs/featuretools …
May 22, 2020
a742656
Merge branch 'master' of https://github.com/FeatureLabs/featuretools …
May 26, 2020
af0a552
fix merge issue
May 26, 2020
223d154
Support numeric time index for Dask entityset (#992)
thehomebrewnerd May 27, 2020
e809e11
Update docs for using Dask entitysets (#965)
thehomebrewnerd May 27, 2020
cd1d58c
Dask cleanup (#964)
frances-h May 29, 2020
f5b38c1
Run unit tests on pandas and dask entitysets (#999)
frances-h May 29, 2020
a22dd75
Merge branch 'master' into dask-entity
frances-h May 29, 2020
9ef79e2
changelog
frances-h May 29, 2020
71d45fb
changelog
frances-h May 29, 2020
b07d0de
revert changes
May 29, 2020
87d91a6
Dask reverts for performance (#1008)
frances-h Jun 2, 2020
f030ed8
Merge branch 'master' into dask-entity
rwedge Jun 2, 2020
1338439
remove unused code and update docs (#1012)
thehomebrewnerd Jun 3, 2020
200eb10
Merge branch 'master' into dask-entity
rwedge Jun 3, 2020
9fa22bb
Uncomment Future Release section
rwedge Jun 3, 2020
2ab61a9
fix docs build
rwedge Jun 3, 2020
2663622
Dask documentation improvements (#1015)
thehomebrewnerd Jun 4, 2020
cd275a2
Merge branch 'master' into dask-entity
thehomebrewnerd Jun 4, 2020
1f4fe5a
Update setup.cfg
rwedge Jun 4, 2020
2 changes: 1 addition & 1 deletion .gitignore
@@ -16,7 +16,7 @@ featuretools/tests/integration_data/products.gzip
featuretools/tests/integration_data/regions.gzip
featuretools/tests/integration_data/sessions.gzip
featuretools/tests/integration_data/stores.gzip
dask-worker-space/*
**/dask-worker-space/*
*.dirlock
*.~lock*

6 changes: 4 additions & 2 deletions docs/source/changelog.rst
@@ -2,15 +2,17 @@

Changelog
---------
.. **Future Release**
**Future Release**
* Enhancements
* Support use of Dask DataFrames in entitysets (:pr:`783`)
* Add ``make_index`` when initializing an EntitySet by passing in an ``entities`` dictionary (:pr:`1010`)
* Fixes
* Changes
* Documentation Changes
* Testing Changes

Thanks to the following people for contributing to this release:
:user:`gsheni`
:user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

**v0.15.0 May 29, 2020**
* Enhancements
30 changes: 29 additions & 1 deletion docs/source/frequently_asked_questions.ipynb
@@ -419,6 +419,21 @@
"feature_matrix[[\"COUNT(sessions WHERE product_id_device = 5 and tablet)\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Can I create an `EntitySet` using Dask dataframes? (BETA)\n",
"\n",
"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
"\n",
"Yes! Featuretools supports creating an `EntitySet` from Dask dataframes. You can simply follow the same process you would when creating an `EntitySet` from pandas dataframes.\n",
"\n",
"There are some limitations to be aware of when using Dask dataframes. When creating an `Entity` from a Dask dataframe, variable type inference is not performed as it is for pandas entities, so the user must supply a list of variable types during creation. Also, other quality checks are not performed, such as checking for unique index values. An `EntitySet` must be created entirely of Dask entities or pandas entities - you cannot mix pandas entities with Dask entities in the same `EntitySet`.\n",
"\n",
"For more information on creating an `EntitySet` from Dask dataframes, see the [DFS with Dask EntitySets](https://docs.featuretools.com/en/stable/guides/dfs_with_dask_entitysets.html) guide."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -1509,7 +1524,7 @@
"source": [
"### How do I get a list of all Aggregation and Transform primitives?\n",
"\n",
"You can do `featuretools.list_primitives()` to get all the primitive in Featuretools. It will return a Dataframe with the names, type, and description of the primitives. You can also visit [primitives.featurelabs.com](https://primitives.featurelabs.com/) to obtain a list of all available primitives."
"You can do `featuretools.list_primitives()` to get all the primitive in Featuretools. It will return a Dataframe with the names, type, and description of the primitives, and if the primitive can be used with entitysets created from Dask dataframes. You can also visit [primitives.featurelabs.com](https://primitives.featurelabs.com/) to obtain a list of all available primitives."
]
},
{
@@ -1531,6 +1546,19 @@
"df_primitives.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What primitives can I use when creating a feature matrix from a Dask `EntitySet`? (BETA)\n",
"\n",
"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
"\n",
"When creating a feature matrix from a Dask `EntitySet`, only certain primitives can be used. Computation of certain features is quite expensive in a distributed environment, and as a result only a subset of Featuretools primitives are currently supported when using a Dask `EntitySet`.\n",
"\n",
"The table returned by `featuretools.list_primitives()` will contain a column labeled `dask_compatible`. Any primitive that has a value of `True` in this column can be used safely when computing a feature matrix from a Dask `EntitySet`."
]
},
{
"cell_type": "markdown",
"metadata": {},
121 changes: 121 additions & 0 deletions docs/source/guides/dfs_with_dask_entitysets.rst
@@ -0,0 +1,121 @@
DFS with Dask EntitySets (BETA)
===============================
Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a `new issue on Github <https://github.com/FeatureLabs/featuretools/issues>`_.

Creating a feature matrix from a very large dataset can be problematic if the underlying pandas dataframes that make up the entities cannot easily fit in memory. To help get around this issue, Featuretools supports creating ``Entity`` and ``EntitySet`` objects from Dask dataframes. A Dask ``EntitySet`` can then be passed to ``featuretools.dfs`` or ``featuretools.calculate_feature_matrix`` to create a feature matrix, which will be returned as a Dask dataframe. In addition to working on larger than memory datasets, this approach also allows users to take advantage of the parallel processing capabilities offered by Dask.

This guide will provide an overview of how to create a Dask ``EntitySet`` and then generate a feature matrix from it. If you are already familiar with creating a feature matrix starting from pandas dataframes, this process will seem quite familiar, as the workflow is identical. There are, however, some limitations when using Dask dataframes, and those limitations are reviewed in more detail below.

Creating Entities and EntitySets
--------------------------------
For this example, we will create a very small pandas dataframe and then convert this into a Dask dataframe to use in the remainder of the process. Normally when using Dask, you would just read your data directly into a Dask dataframe without the intermediate step of using pandas.

.. ipython:: python

import featuretools as ft
import pandas as pd
import dask.dataframe as dd
id = [0, 1, 2, 3, 4]
values = [12, -35, 14, 103, -51]
df = pd.DataFrame({"id": id, "values": values})
dask_df = dd.from_pandas(df, npartitions=2)
dask_df


Now that we have our Dask dataframe, we can start to create the ``EntitySet``. The current implementation does not support variable type inference for Dask entities, so we must pass a dictionary of variable types using the ``variable_types`` parameter when calling ``es.entity_from_dataframe()``. Aside from needing to supply the variable types, the rest of the process of creating an ``EntitySet`` is the same as if we were using pandas dataframes.

.. ipython:: python

es = ft.EntitySet(id="dask_es")
es = es.entity_from_dataframe(entity_id="dask_entity",
dataframe=dask_df,
index="id",
variable_types={"id": ft.variable_types.Id,
"values": ft.variable_types.Numeric})
es


Notice that when we print our ``EntitySet``, the number of rows for the ``dask_entity`` entity is returned as a Dask ``Delayed`` object. This is because obtaining the length of a Dask dataframe requires an expensive compute operation to sum up the lengths of all the individual partitions that make up the dataframe and that operation is not performed by default.


Running DFS
-----------
We can pass the ``EntitySet`` we created above to ``featuretools.dfs`` in order to create a feature matrix. If the ``EntitySet`` we pass to ``dfs`` is made of Dask entities, the feature matrix we get back will be a Dask dataframe.

.. ipython:: python

feature_matrix, features = ft.dfs(entityset=es,
target_entity="dask_entity",
trans_primitives=["negate"])
feature_matrix


This feature matrix can be saved to disk or computed and brought into memory, using the appropriate Dask dataframe methods.

.. ipython:: python

fm_computed = feature_matrix.compute()
fm_computed


While this is a simple example to illustrate the process of using Dask dataframes with Featuretools, this process will also work with an ``EntitySet`` containing multiple entities, as well as with aggregation primitives.

Limitations
-----------
There are many parts of Featuretools that are difficult to implement in a distributed environment and several primitives that are not well suited to operate on distributed dataframes. As a result, there are some limitations when creating a Dask ``EntitySet`` and then using it to generate a feature matrix. The most significant limitations are reviewed in more detail in this section.

Supported Primitives
********************
When creating a feature matrix from a Dask ``EntitySet``, only certain primitives can be used. Primitives that rely on the order of the entire dataframe or require an entire column to be computed are inefficient and difficult to support in a distributed environment. As a result, only a subset of Featuretools primitives are currently supported when using a Dask ``EntitySet``.

To obtain a list of the primitives that can be used with a Dask ``EntitySet``, you can call ``featuretools.list_primitives()``. This will return a table of all primitives. Any primitive that can be used with a Dask ``EntitySet`` will have a value of ``True`` in the ``dask_compatible`` column.


.. ipython:: python

primitives_df = ft.list_primitives()
dask_compatible_df = primitives_df[primitives_df["dask_compatible"] == True]
dask_compatible_df.head()
dask_compatible_df.tail()

Primitive Limitations
*********************
At this time, custom primitives created with ``featuretools.primitives.make_trans_primitive()`` or ``featuretools.primitives.make_agg_primitive()`` cannot be used for running deep feature synthesis on a Dask ``EntitySet``. Additionally, multivariable and time-dependent aggregation primitives are not currently supported. While it is possible to create custom primitives for use with a Dask ``EntitySet`` by extending the proper primitive class, there are several potential problems in doing so, and those issues are beyond the scope of this guide.

Entity Limitations
******************
When creating a Featuretools ``Entity`` from Dask dataframes, variable type inference is not performed as it is when creating entities from pandas dataframes. This is done to improve speed as sampling the data to infer the variable types would require an expensive compute operation on the underlying Dask dataframe. As a consequence, users must define the variable types for each column in the supplied Dataframe. This step is needed so that the deep feature synthesis process can build the proper features based on the column types. A list of available variable types can be obtained by running ``featuretools.variable_types.find_variable_types()``.

By default, Featuretools checks that entities created from pandas dataframes have unique index values. Because performing this same check with Dask would require an expensive compute operation, this check is not performed when creating an entity from a Dask dataframe. When using Dask dataframes, users must ensure that the supplied index values are unique.

When an ``Entity`` is created from a pandas dataframe, the ordering of the underlying dataframe rows is maintained. For a Dask ``Entity``, the ordering of the dataframe rows is not guaranteed, and Featuretools does not attempt to maintain row order in a Dask ``Entity``. If ordering is important, close attention must be paid to any output to avoid issues.

The ``Entity.add_interesting_values()`` method is not supported when using a Dask ``Entity``. If needed, users can manually set ``interesting_values`` on entities by assigning them directly with syntax similar to this: ``es["entity_name"]["variable_name"].interesting_values = ["Value 1", "Value 2"]``.

EntitySet Limitations
*********************
When creating a Featuretools ``EntitySet`` that will be made of Dask entities, all of the entities used to create the ``EntitySet`` must be of the same type, either all Dask entities or all pandas entities. Featuretools does not support creating an ``EntitySet`` containing a mix of Dask and pandas entities.

Additionally, the ``EntitySet.add_interesting_values()`` method is not supported when using a Dask ``EntitySet``. Users can manually set ``interesting_values`` on entities, as described above.

DFS Limitations
***************
There are a few key limitations when generating a feature matrix from a Dask ``EntitySet``.

If a ``cutoff_time`` parameter is passed to ``featuretools.dfs()``, it must either be a single cutoff time value or a pandas dataframe. The current implementation does not support the use of a Dask dataframe for cutoff time values.
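A pandas cutoff time dataframe can be built directly. This sketch assumes the conventional instance id / time column pair; the column names are illustrative and should be adjusted to match the target entity:

```python
import pandas as pd

# Hypothetical cutoff times: one row per instance, built as a plain pandas frame
cutoff_time = pd.DataFrame({
    "instance_id": [0, 1, 2],
    "time": pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-19"]),
})
print(cutoff_time.dtypes["time"])  # datetime64[ns]
```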

Additionally, Featuretools does not currently support the use of the ``approximate`` or ``training_window`` parameters when working with Dask entitysets, but should in future releases.

Finally, if the output feature matrix contains a boolean column that includes ``NaN`` values, that column may have a different dtype than the same feature matrix generated from a pandas ``EntitySet``. If feature matrix column data types are critical, inspect the feature matrix to confirm the types are as expected, and recast as necessary.
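The pandas behavior behind this caveat can be seen without Featuretools at all: a bool column that picks up missing values is no longer plain ``bool`` and may need recasting. The ``"boolean"`` target dtype below is just one possible choice:

```python
import pandas as pd

# Missing values push a bool column to object dtype in pandas
s = pd.Series([True, False, None])
print(s.dtype)  # object

# Recast to whatever dtype downstream code expects; nullable "boolean" is one option
s = s.astype("boolean")
print(s.dtype)  # boolean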

Other Limitations
*****************
In some instances, generating a feature matrix with a large number of features has resulted in memory issues on Dask workers. The underlying reason for this is that the partition size of the feature matrix grows too large for Dask to handle as the number of feature columns grows large. This issue is most prevalent when the feature matrix contains a large number of columns compared to the dataframes that make up the entities. Possible solutions to this problem include reducing the partition size used when creating the entity dataframes or increasing the memory available on Dask workers.

Currently ``featuretools.encode_features()`` does not work with a Dask dataframe as input. This will hopefully be resolved in a future release of Featuretools.

The utility function ``featuretools.make_temporal_cutoffs()`` will not work properly with Dask inputs for ``instance_ids`` or ``cutoffs``. However, as noted above, if a ``cutoff_time`` dataframe is supplied to ``dfs``, the supplied dataframe must be a pandas dataframe, and this can be generated by supplying pandas inputs to ``make_temporal_cutoffs()``.

``featuretools.remove_low_information_features()`` cannot currently be used with a Dask feature matrix.

When manually defining a ``Feature``, the ``use_previous`` parameter cannot be used if this feature will be applied to calculate a feature matrix from a Dask ``EntitySet``.
8 changes: 8 additions & 0 deletions docs/source/guides/parallel.rst
@@ -62,3 +62,11 @@ The dashboard requires an additional python package, bokeh, to work. Once bokeh
Parallel Computation by Partitioning Data
-----------------------------------------
As an alternative to Featuretool's parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large ``EntitySet`` because the current parallel implementation sends the entire ``EntitySet`` to each worker which may exhaust the worker memory. For more information on partitioning the data and using Dask or Spark, see :doc:`/guides/performance`. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.

Review thread on this section:

kmax12 (Contributor): i'd move this section up to the top and we need to also edit the "Running Featuretools with Spark and Dask" section to align with this

thehomebrewnerd (Author): Want to make sure I fully understand this comment... So, you want this new Dask section to be the very first section in the document instead of the last? Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?

kmax12 (Jun 4, 2020):
> So, you want this new Dask section to be the very first section in the document instead of the last?
Yep
> Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?
Right now that section says "If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing this simple request form." However, if they want to use dask, they dont need to fill out the form since dask is now released. They should still fill out the form for spark though

thehomebrewnerd (Author): @kmax12 If we remove the reference to Dask in this section (and the corresponding section in Improving Computational Performance), we also need to update the linked request form which still mentions Dask. Can you update that?

kmax12: yep, ill do that right after we release! thanks for the reminder

Computation with a Dask EntitySet (BETA)
----------------------------------------
Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a `new issue on Github <https://github.com/FeatureLabs/featuretools/issues>`_.

A final approach that can be used is to create a Featuretools ``EntitySet`` directly from Dask dataframes instead of using pandas dataframes. The other methods discussed above may not work with very large datasets because of the memory required to load or partition the pandas dataframes. By creating an ``EntitySet`` directly from Dask dataframes, Featuretools can be used to generate a larger-than-memory feature matrix in a parallel manner. When computing a feature matrix from an ``EntitySet`` created from Dask dataframes, the resulting feature matrix will be returned as a Dask dataframe.

This method does have some limitations in terms of the primitives that are available and the optional parameters that can be used when calculating the feature matrix. For more information on generating a feature matrix with this approach, refer to the guide :doc:`/guides/dfs_with_dask_entitysets`.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -165,6 +165,7 @@ Table of contents
guides/specifying_primitive_options
guides/performance
guides/parallel
guides/dfs_with_dask_entitysets
guides/deployment
guides/advanced_custom_primitives
