Add example of using Dask to parallelize to docs #221
Conversation
docs/source/guides/parallel.rst
Parallel Computation by Partitioning Data
-----------------------------------------
As an alternative to Featuretools' parallelization, the data can be partitioned and run on multiple cores or a cluster using Dask or PySpark. For more information on partitioning the data and using Dask, see :doc:`/guides/performance`. An example of this approach can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. Dask allows Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.
Just link to the other place in the docs; no need to have the link twice. That also means we only have to keep one place up to date.
docs/source/guides/performance.rst
@@ -84,6 +84,8 @@ When an entire dataset is not required to calculate the features for a given set
An example of this approach can be seen in the `Predict Next Purchase demo notebook <https://github.com/featuretools/predict_next_purchase>`_. In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using `Dask <https://dask.pydata.org/>`_, which could also be used to scale the computation to a cluster of computers. A framework like `Spark <https://spark.apache.org/>`_ could be used similarly.
An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. Dask allows us to easily scale to multiple cores on a single computer or multiple machines on a cluster.
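The partition-by-customer approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the linked notebook: the data, the per-partition feature function, and the use of the standard library's `ThreadPoolExecutor` are all stand-ins (in the notebook, `dask.delayed` plays the scheduler's role and `featuretools.dfs` computes the features for each partition).

```python
from concurrent.futures import ThreadPoolExecutor

def partition_by_customer(rows):
    """Group (customer_id, amount) rows into one partition per customer."""
    partitions = {}
    for customer_id, amount in rows:
        partitions.setdefault(customer_id, []).append(amount)
    return list(partitions.items())

def compute_features(partition):
    """Hypothetical per-partition feature computation.

    In the real workflow this would build an EntitySet from the
    partition and call featuretools.dfs on it; here a simple sum
    stands in for the feature matrix.
    """
    customer_id, amounts = partition
    return customer_id, sum(amounts)

rows = [("a", 10), ("b", 5), ("a", 3), ("c", 7), ("b", 1)]
partitions = partition_by_customer(rows)

# Each partition is independent, so they can be computed in parallel;
# a Dask scheduler would distribute these tasks across cores or machines.
with ThreadPoolExecutor() as pool:
    features = dict(pool.map(compute_features, partitions))
```

Because only one partition needs to be in memory per worker at a time, this pattern keeps memory use bounded regardless of the total dataset size.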
let's also mention there is a blog post about it and link to the TDS blog as well
Link to the TDS blog or the Feature Labs engineering blog?
actually, let's do Feature Labs engineering blog
Codecov Report
@@ Coverage Diff @@
## master #221 +/- ##
=======================================
Coverage 93.52% 93.52%
=======================================
Files 71 71
Lines 7749 7749
=======================================
Hits 7247 7247
Misses 502 502

Continue to review full report at Codecov.
Added link to Towards Data Science blog post.
Removed link to notebook and added in sentence about when this partitioning may be necessary.
Reference is now to Feature Labs engineering blog version of article instead of Towards Data Science.
Looks good. Merging
Update of the documentation to point to the Featuretools on Dask notebook. Also added a section to the parallel guide about partitioning data and running on multiple cores.