Create EntitySet with Koalas #1031

Merged
39 commits merged into main from koalas-entity on Sep 4, 2020
Changes from all commits

Commits (39)
586dcca
created an entityset with koalas
Jun 15, 2020
a179054
Merge branch 'master' into koalas-entity
tuethan1999 Jun 15, 2020
984c5fb
added preprocessing pandas to koalas into its own file in testing utils
Jun 15, 2020
4cc2dc6
Merge branch 'koalas-entity' of https://github.com/FeatureLabs/featur…
Jun 15, 2020
0a7bd35
renamed koalas named variables to use ks abbreviation. Added another …
Jun 17, 2020
8fa7a07
make changes to accommodate for test_add_last_time_indexes, and test_c…
Jun 18, 2020
9503150
added support to pass test_single_table_ks_entityset_with_instance_ids
Jun 18, 2020
14eb489
changed test cases to use to_pandas instead of compute and to ignore …
Jun 18, 2020
9f8d73f
add support for all the test_koalas_es and test_calculate_feature_mat…
Jun 23, 2020
31210dc
Merge branch 'master' into koalas-entity
tuethan1999 Jun 23, 2020
20eb904
fix lint errors
Jun 23, 2020
89b7d60
Merge branch 'koalas-entity' of https://github.com/FeatureLabs/featur…
Jun 23, 2020
4753316
add support for test_es
Jun 24, 2020
e224b7c
add support for test_entity
Jun 24, 2020
ed8afa4
update most tests
frances-h Jul 10, 2020
52c7bac
updates for serializing/deserializing entitysets
frances-h Jul 17, 2020
bb2eb04
update make_index, add to_pandas test util, update more tests
frances-h Jul 24, 2020
5ff18d8
Merge branch 'main' into koalas-entity
frances-h Jul 27, 2020
88b84c0
fix for numeric time index
frances-h Jul 29, 2020
4c8ec61
update circleci config
frances-h Jul 29, 2020
1ca787b
Merge branch 'main' into koalas-entity
frances-h Jul 30, 2020
b8a1d67
update circleci config
frances-h Jul 30, 2020
fec2da0
circleci windows updates
frances-h Aug 4, 2020
2b12d4e
updates for circleci windows tests
frances-h Aug 10, 2020
6141607
skip tests for windows
frances-h Aug 10, 2020
0ed10f8
Merge branch 'main' into koalas-entity
frances-h Aug 11, 2020
af852ea
skip test for koalas
frances-h Aug 11, 2020
394698e
Fix Koalas entity creation performance issue (#1114)
thehomebrewnerd Aug 18, 2020
8eb49ab
[Koalas] Make Koalas optional dependency (#1111)
frances-h Aug 26, 2020
a48d796
Updates to improve Entity.query_by_values performance (#1121)
thehomebrewnerd Aug 26, 2020
ddd3000
Various updates to improve test coverage (#1137)
thehomebrewnerd Aug 31, 2020
29da1b0
Update docs for using Koalas entitysets (#1138)
frances-h Sep 2, 2020
5cc611a
Update primitive compatibility flags and get_function (#1136)
frances-h Sep 4, 2020
969cd3f
Merge branch 'main' into koalas-entity
frances-h Sep 4, 2020
e4249c8
changelog and small updates
frances-h Sep 4, 2020
252995e
add tuethan1999 and thehomebrewnerd to changelog
frances-h Sep 4, 2020
f9e199f
Merge branch 'main' into koalas-entity
frances-h Sep 4, 2020
1660b0a
Merge branch 'main' into koalas-entity
frances-h Sep 4, 2020
94009a5
Merge branch 'main' into koalas-entity
frances-h Sep 4, 2020
53 changes: 44 additions & 9 deletions .circleci/config.yml
@@ -83,6 +83,8 @@ jobs:
parameters:
image_tag:
type: string
optional_libraries:
type: string
executor:
name: python
image_tag: << parameters.image_tag >>
@@ -96,8 +98,22 @@ jobs:
source venv/bin/activate
python -m pip config --site set global.progress_bar off
python -m pip install --upgrade pip
python -m pip install -e unpacked_sdist/
python -m pip install -r unpacked_sdist/test-requirements.txt
- when:
condition:
equal: [ "optional", << parameters.optional_libraries >> ]
steps:
- run: |
source venv/bin/activate
python -m pip install -e unpacked_sdist/[koalas]
python -m pip install -r unpacked_sdist/test-requirements.txt
- unless:
condition:
equal: ["optional", << parameters.optional_libraries >> ]
steps:
- run: |
source venv/bin/activate
python -m pip install -e unpacked_sdist/
python -m pip install -r unpacked_sdist/test-requirements.txt
- persist_to_workspace:
root: ~/featuretools
paths:
@@ -119,21 +135,33 @@ jobs:
command: "curl -u ${PP_K}: -d build_parameters[CIRCLE_JOB]=python-36-ft-release https://circleci.com/api/v1.1/project/github/FeatureLabs/premium-primitives/tree/main"

unit_tests:
resource_class: large
working_directory: ~/featuretools
parameters:
image_tag:
type: string
optional_libraries:
type: string
executor:
name: python
image_tag: << parameters.image_tag >>
steps:
- run: sudo apt update && sudo apt install -y graphviz
- when:
condition:
equal: ["optional", << parameters.optional_libraries >>]
steps:
- run: |
sudo apt install -y openjdk-11-jre-headless
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
- checkout
- attach_workspace:
at: ~/featuretools
- when:
condition:
equal: [ "3.6", << parameters.image_tag >> ]
and:
- equal: [ "3.6", << parameters.image_tag >> ]
- equal: ["optional", << parameters.optional_libraries >>]
steps:
- run: |
source venv/bin/activate
@@ -147,7 +175,9 @@ jobs:
codecov --required
- unless:
condition:
equal: [ "3.6", << parameters.image_tag >> ]
and:
- equal: [ "3.6", << parameters.image_tag >> ]
- equal: ["optional", << parameters.optional_libraries >>]
steps:
- run: |
source venv/bin/activate
@@ -182,6 +212,9 @@ jobs:
steps:
- run: sudo apt update && sudo apt install -y pandoc
- run: sudo apt install -y graphviz
- run: |
sudo apt install -y openjdk-11-jre-headless
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
- checkout
- attach_workspace:
at: ~/featuretools
@@ -241,28 +274,30 @@ workflows:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: << matrix.image_tag >> install featuretools
optional_libraries: ["minimal", "optional"]
name: << matrix.image_tag >> install featuretools << matrix.optional_libraries >>
- unit_tests:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: << matrix.image_tag >> unit tests
optional_libraries: ["minimal", "optional"]
name: << matrix.image_tag >> unit tests << matrix.optional_libraries >>
requires:
- << matrix.image_tag >> install featuretools
- << matrix.image_tag >> install featuretools << matrix.optional_libraries >>
- lint_test:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: << matrix.image_tag >> lint test
requires:
- << matrix.image_tag >> install featuretools
- << matrix.image_tag >> install featuretools optional
- build_docs:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: build docs << matrix.image_tag >>
requires:
- << matrix.image_tag >> install featuretools
- << matrix.image_tag >> install featuretools optional
- install_ft_complete:
matrix:
parameters:
1 change: 1 addition & 0 deletions .coveragerc
@@ -12,3 +12,4 @@ exclude_lines =
if self._verbose:
if verbose:
if profile:
pytest.skip
2 changes: 2 additions & 0 deletions dev-requirements.txt
@@ -13,3 +13,5 @@ Sphinx==3.0.3
sphinx_rtd_theme==0.4.3
nlp_primitives>=0.3.0
autonormalize >= 1.0.1
pyspark >= 3.0.0
koalas >= 1.1.0
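
Note: pyspark and koalas are added above only as development/test requirements; for users, Koalas remains an optional extra (see "[Koalas] Make Koalas optional dependency (#1111)" in the commit list), installed with `pip install featuretools[koalas]`. A hypothetical sketch of how such an optional import can be guarded follows; it is not the exact helper added by this PR.

# Hypothetical sketch of guarding the optional Koalas dependency; assumes the
# extra was installed with `pip install featuretools[koalas]`.
try:
    import databricks.koalas as ks
except ImportError:
    ks = None


def is_koalas_dataframe(df):
    """Return True only when Koalas is installed and df is a Koalas DataFrame."""
    return ks is not None and isinstance(df, ks.DataFrame)
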
4 changes: 3 additions & 1 deletion docs/source/changelog.rst
@@ -4,6 +4,7 @@ Changelog
---------
**Future Release**
* Enhancements
* Support use of Koalas DataFrames in entitysets (:pr:`1031`)
* Add feature selection functions for null, correlated, and single value features (:pr:`1126`)
* Fixes
* Fix ``encode_features`` converting excluded feature columns to a numeric dtype (:pr:`1123`)
@@ -21,7 +22,8 @@ Changelog
to ``dfs`` to ensure a consistent ordering of features.

Thanks to the following people for contributing to this release:
:user:`tamargrey`, :user:`gsheni`, :user:`rwedge`
:user:`tamargrey`, :user:`gsheni`, :user:`rwedge`, :user:`frances-h`, :user:`tuethan1999`, :user:`thehomebrewnerd`


**Breaking Changes**

10 changes: 5 additions & 5 deletions docs/source/frequently_asked_questions.ipynb
@@ -423,15 +423,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Can I create an `EntitySet` using Dask dataframes? (BETA)\n",
"### Can I create an `EntitySet` using Dask or Koalas dataframes? (BETA)\n",
"\n",
"Support for Dask EntitySets is still in Beta - if you encounter any errors using this approach, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
"Support for Dask EntitySets and Koalas EntitySets is still in Beta - if you encounter any errors using either of these approaches, please let us know by creating a [new issue on Github](https://github.com/FeatureLabs/featuretools/issues).\n",
"\n",
"Yes! Featuretools supports creating an `EntitySet` from Dask dataframes. You can simply follow the same process you would when creating an `EntitySet` from pandas dataframes.\n",
"Yes! Featuretools supports creating an `EntitySet` from Dask dataframes or from Koalas dataframes. You can simply follow the same process you would when creating an `EntitySet` from pandas dataframes.\n",
"\n",
"There are some limitations to be aware of when using Dask dataframes. When creating an `Entity` from a Dask dataframe, variable type inference is not performed as it is for pandas entities, so the user must supply a list of variable types during creation. Also, other quality checks are not performed, such as checking for unique index values. An `EntitySet` must be created entirely of Dask entities or pandas entities - you cannot mix pandas entities with Dask entitites in the same `EntitySet`.\n",
"There are some limitations to be aware of when using Dask or Koalas dataframes. When creating an `Entity`, variable type inference is not performed as it is for pandas entities, so the user must supply a list of variable types during creation. Also, other quality checks are not performed, such as checking for unique index values. An `EntitySet` must be created entirely of one type of entity (Dask, Koalas, or pandas) - you cannot mix pandas entities, Dask entities, and Koalas entitites with each other in the same `EntitySet`.\n",
"\n",
"For more information on creating an `EntitySet` from Dask dataframes, see the [Using Dask EntitySets](https://docs.featuretools.com/en/stable/guides/using_dask_entitysets.html) guide."
"For more information on creating an `EntitySet` from Dask dataframes or from Koalas dataframes, see the [Using Dask EntitySets](./guides/using_dask_entitysets.rst) and the [Using Koalas EntitySets](./guides/using_koalas_entitysets.rst) guides."
]
},
{
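
For reference, a minimal sketch of the workflow the updated FAQ describes: building an EntitySet from a Koalas dataframe. The entity and column names below are illustrative, a local Spark session is assumed to be available, and variable types are passed explicitly because type inference is only performed for pandas entities.

# Minimal sketch, assuming featuretools is installed with the koalas extra
# (pip install featuretools[koalas]) and Spark/Java are available locally.
import featuretools as ft
import databricks.koalas as ks
from featuretools import variable_types as vtypes

# Illustrative data standing in for a real table.
df = ks.DataFrame({
    "id": [0, 1, 2],
    "amount": [10.50, 20.00, 7.25],
    "category": ["a", "b", "a"],
})

es = ft.EntitySet(id="koalas_es")
# Variable types must be supplied for Koalas (and Dask) entities because
# type inference is skipped for non-pandas dataframes.
es = es.entity_from_dataframe(
    entity_id="transactions",
    dataframe=df,
    index="id",
    variable_types={
        "id": vtypes.Index,
        "amount": vtypes.Numeric,
        "category": vtypes.Categorical,
    },
)
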
16 changes: 6 additions & 10 deletions docs/source/guides/performance.rst
@@ -19,16 +19,16 @@ Parallel Feature Computation
----------------------------
Computational performance can often be improved by parallelizing the feature calculation process. There are several different approaches that can be used to perform parallel feature computation with Featuretools. An overview of the most commonly used approaches is provided below.

Computation with Dask EntitySets (BETA)
***************************************
Computation with Dask and Koalas EntitySets (BETA)
**************************************************
.. note::
Support for Dask EntitySets is still in Beta. While the key functionality has been implemented, development is ongoing to add the remaining functionality.
Support for Dask EntitySets and Koalas EntitySets is still in Beta. While the key functionality has been implemented, development is ongoing to add the remaining functionality.

All planned improvements to the Featuretools/Dask integration are `documented on Github <https://github.com/FeatureLabs/featuretools/issues?q=is%3Aopen+is%3Aissue+label%3ADask>`_. If you see an open issue that is important for your application, please let us know by upvoting or commenting on the issue. If you encounter any errors using Dask entities, or find missing functionality that does not yet have an open issue, please create a `new issue on Github <https://github.com/FeatureLabs/featuretools/issues>`_.
All planned improvements to the Featuretools/Dask and Featuretools/Koalas integration are documented on Github (`Dask issues <https://github.com/FeatureLabs/featuretools/issues?q=is%3Aopen+is%3Aissue+label%3ADask>`_, `Koalas issues <https://github.com/FeatureLabs/featuretools/issues?q=is%3Aopen+is%3Aissue+label%3AKoalas>`_). If you see an open issue that is important for your application, please let us know by upvoting or commenting on the issue. If you encounter any errors using Dask or Koalas entities, or find missing functionality that does not yet have an open issue, please create a `new issue on Github <https://github.com/FeatureLabs/featuretools/issues>`_.

Dask can be used with Featuretools to perform parallel feature computation with virtually no changes to the workflow required. Featuretools supports creating an ``EntitySet`` directly from Dask dataframes instead of using pandas dataframes, enabling the parallel and distributed computation capabilities of Dask to be used. By creating an ``EntitySet`` directly from Dask dataframes, Featuretools can be used to generate a larger-than-memory feature matrix, something that may be difficult with other approaches. When computing a feature matrix from an ``EntitySet`` created from Dask dataframes, the resulting feature matrix will be returned as a Dask dataframe.
Dask or Koalas can be used with Featuretools to perform parallel feature computation with virtually no changes to the workflow required. Featuretools supports creating an ``EntitySet`` directly from Dask or Koalas dataframes instead of using pandas dataframes, enabling the parallel and distributed computation capabilities of Dask or Spark to be used. By creating an ``EntitySet`` directly from Dask or Koalas dataframes, Featuretools can be used to generate a larger-than-memory feature matrix, something that may be difficult with other approaches. When computing a feature matrix from an ``EntitySet`` created from Dask or Koalas dataframes, the resulting feature matrix will be returned as a Dask or Koalas dataframe depending on which type was used.

This method does have some limitations in terms of the primitives that are available and the optional parameters that can be used when calculating the feature matrix. For more information on generating a feature matrix with this approach, refer to the guide :doc:`/guides/using_dask_entitysets`.
These methods do have some limitations in terms of the primitives that are available and the optional parameters that can be used when calculating the feature matrix. For more information on generating a feature matrix with this approach, refer to the guides :doc:`/guides/using_dask_entitysets` and :doc:`/guides/using_koalas_entitysets`.

Simple Parallel Feature Computation
***********************************
@@ -116,7 +116,3 @@ An example of this approach can be seen in the `Predict Next Purchase demo noteb
An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the `Featuretools on Dask notebook <https://github.com/Featuretools/Automated-Manual-Comparison/blob/main/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb>`_. This approach is detailed in the `Parallelizing Feature Engineering with Dask article <https://medium.com/feature-labs-engineering/scaling-featuretools-with-dask-ce46f9774c7d>`_ on the Feature Labs engineering blog. Dask allows for simple scaling to multiple cores on a single computer or multiple machines on a cluster.

For a similar partition and distribute implementation using Apache Spark with PySpark, refer to the `Feature Engineering on Spark notebook <https://github.com/Featuretools/predict-customer-churn/blob/main/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb>`_. This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework. A write-up of this approach is described in the `Featuretools on Spark article <https://blog.featurelabs.com/featuretools-on-spark-2/>`_ on the Feature Labs engineering blog.

Running Featuretools with Spark
********************************
The Featuretools development team is continually working to improve integration with Spark for performing feature engineering at scale. If you have a big data problem and are interested in testing our latest integrations for free, please let us know by completing `this simple request form <https://forms.office.com/Pages/ResponsePage.aspx?id=2TkvUj0wj0id66bXfx6v2ASd4JAap6pFigRj7EKGsuBUNDI4WDlGSzI1VVRHTUdMS0gyR1EyMkdJVi4u>`__.
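
Continuing the sketch above, computing a feature matrix from a Koalas-backed EntitySet looks the same as the pandas workflow; the primitive list below is illustrative and would need to be restricted to primitives with Koalas support. The result comes back as a Koalas dataframe.

# Hedged sketch, reusing the EntitySet `es` from the previous example.
import featuretools as ft

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="transactions",
    trans_primitives=["is_null"],  # illustrative; must be Koalas-compatible
    max_depth=1,
)

# The feature matrix is a Koalas dataframe; bring it into memory only if it fits.
fm_local = feature_matrix.to_pandas()
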
4 changes: 2 additions & 2 deletions docs/source/guides/using_dask_entitysets.rst
@@ -96,13 +96,13 @@ By default, Featuretools checks that entities created from pandas dataframes hav

When an ``Entity`` is created from a pandas dataframe, the ordering of the underlying dataframe rows is maintained. For a Dask ``Entity``, the ordering of the dataframe rows is not guaranteed, and Featuretools does not attempt to maintain row order in a Dask ``Entity``. If ordering is important, close attention must be paid to any output to avoid issues.

The ``Entity.add_interesting_values()`` method is not supported when using a Dask ``Entity``. If needed, users can manually set ``interesing_values`` on entities by assigning them directly with syntax similar to this: ``es["entity_name"]["variable_name"].interesting_values = ["Value 1", "Value 2"]``.
The ``Entity.add_interesting_values()`` method is not supported when using a Dask ``Entity``. If needed, users can manually set ``interesting_values`` on entities by assigning them directly with syntax similar to this: ``es["entity_name"]["variable_name"].interesting_values = ["Value 1", "Value 2"]``.

EntitySet Limitations
*********************
When creating a Featuretools ``EntitySet`` that will be made of Dask entities, all of the entities used to create the ``EntitySet`` must be of the same type, either all Dask entities or all pandas entities. Featuretools does not support creating an ``EntitySet`` containing a mix of Dask and pandas entities.

Additionally, the ``EntitySet.add_interesting_values()`` method is not supported when using a Dask ``EntitySet``. Users can manually set ``interesing_values`` on entities, as described above.
Additionally, the ``EntitySet.add_interesting_values()`` method is not supported when using a Dask ``EntitySet``. Users can manually set ``interesting_values`` on entities, as described above.

DFS Limitations
***************
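
The manual workaround described above amounts to a single assignment; the entity and variable names here are placeholders.

# Sketch of manually setting interesting values when add_interesting_values()
# is not supported; "sessions" and "device" are placeholder names.
es["sessions"]["device"].interesting_values = ["desktop", "mobile"]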