Skip to content

Commit

Permalink
Add example of using Dask to parallelize to docs (#221)
Browse files Browse the repository at this point in the history
* Update parallel.rst

* Update performance.rst

* Update parallel.rst

* Update performance.rst

Added link to Towards Data Science blog post.

* Update parallel.rst

Removed link to notebook and added in sentence about when this partitioning may be necessary.

* Update performance.rst

Reference is now to Feature Labs engineering blog version of article instead of Towards Data Science.
  • Loading branch information
WillKoehrsen authored and kmax12 committed Aug 21, 2018
1 parent 440bac8 commit 7b5ddf1
Show file tree
Hide file tree
Showing 2 changed files with 7 additions and 2 deletions.
7 changes: 5 additions & 2 deletions docs/source/guides/parallel.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,7 @@ Featuretools can optionally compute features on multiple cores. The simplest way
n_jobs=2,
verbose=True)

The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entity set, so memory use will be proportional to the number of parallel processes. Because the entity set has to be copied to each process, there is overhead to perform this operation before calculation can begin. To avoid this overhead on successive calls to ``calculate_feature_matrix``, read the section below on using a persistent cluster.

The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entity set, so memory use will be proportional to the number of parallel processes. Because the entity set has to be copied to each process, there is overhead to perform this operation before calculation can begin. To avoid this overhead on successive calls to ``calculate_feature_matrix``, read the section below on using a persistent cluster.

Using persistent cluster
------------------------
Expand Down Expand Up @@ -50,3 +49,7 @@ The dashboard requires an additional python package, bokeh, to work. Once bokeh
n_jobs=2,
dask_kwargs={'diagnostics_port': 8787}
verbose=True)
Parallel Computation by Partioning Data
-------------------------------
As an alternative to Featuretool's parallelization, the data can be partitioned and run on multiple cores or a cluster using Dask or PySpark. This approach may be necessary with a large `Entityset` because the current parallel implementation sends the entire `EntitySet` to each worker which may exhaust the worker memory. For more information on partitioning the data and using Dask, see :doc:`/guides/performance`. Dask allows Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.
2 changes: 2 additions & 0 deletions docs/source/guides/performance.rst
Original file line number Diff line number Diff line change
Expand Up @@ -84,6 +84,8 @@ When an entire dataset is not required to calculate the features for a given set

An example of this approach can be seen in the `Predict Next Purchase demo notebook <https://github.com/featuretools/predict_next_purchase>`_. In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using `Dask <https://dask.pydata.org/>`_, which could also be used to scale the computation to a cluster of computers. A framework like `Spark <https://spark.apache.org/>`_ could be used similarly.

An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. This approach is detailed in the `Parallelizing Feature Engineering with Dask article <https://medium.com/feature-labs-engineering/scaling-featuretools-with-dask-ce46f9774c7d>`_ on the Feature Labs engineering blog. Dask allows us to easily scale to multiple cores on a single computer or multiple machines on a cluster.

Feature Labs
------------
`Feature Labs <http://featurelabs.com>`_ provides tools and support to organizations that want to scale their usage of Featuretools.

0 comments on commit 7b5ddf1

Please sign in to comment.