New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example of using Dask to parallelize to docs #221

Merged
merged 6 commits into from Aug 21, 2018

Conversation

Projects
None yet
3 participants
@WillKoehrsen
Contributor

WillKoehrsen commented Aug 20, 2018

Update of the documentation to point to the Featuretools on Dask notebook. Also added in section to parallel about partitioning data and running on multiple cores.

WillKoehrsen added some commits Aug 20, 2018

Parallel Computation by Partioning Data
-------------------------------
As an alternative to Featuretool's parallelization, the data can be partitioned and run on multiple cores or a cluster using Dask or PySpark. For more information on partitioning the data and using Dask, see :doc:`/guides/performance`. An example of this approach can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. Dask allows Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.

This comment has been minimized.

@kmax12

kmax12 Aug 20, 2018

Member

just link to the other place in docs, no need to have link twice. that also means we once have to keep once place up to date.

@@ -84,6 +84,8 @@ When an entire dataset is not required to calculate the features for a given set
An example of this approach can be seen in the `Predict Next Purchase demo notebook <https://github.com/featuretools/predict_next_purchase>`_. In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using `Dask <https://dask.pydata.org/>`_, which could also be used to scale the computation to a cluster of computers. A framework like `Spark <https://spark.apache.org/>`_ could be used similarly.
An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. Dask allows us to easily scale to multiple cores on a single computer or multiple machines on a cluster.

This comment has been minimized.

@kmax12

kmax12 Aug 20, 2018

Member

let's also mention there is a blog post about it and link to the TDS blog as well

This comment has been minimized.

@WillKoehrsen

WillKoehrsen Aug 21, 2018

Contributor

Link to the TDS blog or the Feature Labs engineering blog?

This comment has been minimized.

@kmax12

kmax12 Aug 21, 2018

Member

actually, let's do Feature Labs engineering blog

@codecov-io

This comment has been minimized.

codecov-io commented Aug 20, 2018

Codecov Report

Merging #221 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #221   +/-   ##
=======================================
  Coverage   93.52%   93.52%           
=======================================
  Files          71       71           
  Lines        7749     7749           
=======================================
  Hits         7247     7247           
  Misses        502      502

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f6b4e3...4a8fffc. Read the comment docs.

WillKoehrsen added some commits Aug 21, 2018

Update performance.rst
Added link to Towards Data Science blog post.
Update parallel.rst
Removed link to notebook and added in sentence about when this partitioning may be necessary.
Update performance.rst
Reference is now to Feature Labs engineering blog version of article instead of Towards Data Science.

@kmax12 kmax12 changed the title from Docs dask update to Add example of using Dask to parallelize to docs Aug 21, 2018

@kmax12

This comment has been minimized.

Member

kmax12 commented Aug 21, 2018

Looks good. Merging

@kmax12 kmax12 merged commit 7b5ddf1 into master Aug 21, 2018

1 of 2 checks passed

ci/circleci CircleCI is running your tests
Details
license/cla Contributor License Agreement is signed.
Details

@rwedge rwedge referenced this pull request Aug 28, 2018

Merged

v0.3.0 #235

@WillKoehrsen WillKoehrsen deleted the docs-dask-update branch Oct 2, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment