Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example of using Dask to parallelize to docs #221

Merged
merged 6 commits into from
Aug 21, 2018
Merged

Conversation

WillKoehrsen
Copy link
Contributor

Update of the documentation to point to the Featuretools on Dask notebook. Also added in section to parallel about partitioning data and running on multiple cores.


Parallel Computation by Partioning Data
-------------------------------
As an alternative to Featuretool's parallelization, the data can be partitioned and run on multiple cores or a cluster using Dask or PySpark. For more information on partitioning the data and using Dask, see :doc:`/guides/performance`. An example of this approach can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. Dask allows Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just link to the other place in docs, no need to have link twice. that also means we once have to keep once place up to date.

@@ -84,6 +84,8 @@ When an entire dataset is not required to calculate the features for a given set

An example of this approach can be seen in the `Predict Next Purchase demo notebook <https://github.com/featuretools/predict_next_purchase>`_. In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using `Dask <https://dask.pydata.org/>`_, which could also be used to scale the computation to a cluster of computers. A framework like `Spark <https://spark.apache.org/>`_ could be used similarly.

An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. Dask allows us to easily scale to multiple cores on a single computer or multiple machines on a cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also mention there is a blog post about it and link to the TDS blog as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the TDS blog or the Feature Labs engineering blog?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, let's do Feature Labs engineering blog

@codecov-io
Copy link

codecov-io commented Aug 20, 2018

Codecov Report

Merging #221 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #221   +/-   ##
=======================================
  Coverage   93.52%   93.52%           
=======================================
  Files          71       71           
  Lines        7749     7749           
=======================================
  Hits         7247     7247           
  Misses        502      502

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f6b4e3...4a8fffc. Read the comment docs.

Added link to Towards Data Science blog post.
Removed link to notebook and added in sentence about when this partitioning may be necessary.
Reference is now to Feature Labs engineering blog version of article instead of Towards Data Science.
@kmax12 kmax12 changed the title Docs dask update Add example of using Dask to parallelize to docs Aug 21, 2018
@kmax12
Copy link
Contributor

kmax12 commented Aug 21, 2018

Looks good. Merging

@kmax12 kmax12 merged commit 7b5ddf1 into master Aug 21, 2018
@rwedge rwedge mentioned this pull request Aug 28, 2018
@WillKoehrsen WillKoehrsen deleted the docs-dask-update branch October 2, 2018 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants