Create EntitySet with Koalas #1031

Merged
merged 39 commits into from Sep 4, 2020
Contributor

@tuethan1999 tuethan1999 commented Jun 15, 2020

Pull Request Description

Add support for creating EntitySets from Koalas DataFrames and running dfs on them.


After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request.

@rwedge rwedge changed the base branch from master to main July 2, 2020 18:31
@frances-h frances-h changed the title from "Create EntitySet with Koalas" to "[WIP] Create EntitySet with Koalas" on Jul 13, 2020
if any(isinstance(entity.df, dd.DataFrame) for entity in mock_customer.entities):
    pytest.xfail('Dask version fails, returned feature matrix is empty')
if any(isinstance(entity.df, (dd.DataFrame, ks.DataFrame)) for entity in mock_customer.entities):
    pytest.xfail("Dask fails because returned feature matrix is empty; Koalas doesn't support custom agg functions")

Well, actually Koalas supports custom aggregate functions through PySpark; it's way more flexible than Dask in this regard.
https://spark.apache.org/docs/3.0.0/sql-ref-functions-udf-aggregate.html

requirements.txt (outdated, resolved)
featuretools/entityset/entity.py (outdated, resolved)
lti_df['last_time'] = ks.to_datetime(lti_df['last_time'])
lti_df['last_time_old'] = ks.to_datetime(lti_df['last_time_old'])
# TODO: Figure out a workaround for fillna and replace
lti_df = lti_df.max(axis=1)
Collaborator

In testing on the instacart dataset for the predict next purchase problem, this line appears to be causing a performance issue.

if isinstance(instance_vals, dd.Series) or isinstance(instance_vals, ks.Series):
    df = self.df.merge(instance_vals.to_frame(), how="inner", on=variable_id)
else:
    df = self.df[self.df[variable_id].isin(instance_vals)]
Collaborator

In testing on the instacart dataset for the predict next purchase problem, this code block appears to be causing a performance issue.
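
For readers comparing the two branches in the block under review, the following is a minimal pure-Python sketch of how a membership filter (the isin branch) differs from an inner join (the merge branch). It deliberately avoids pandas, Dask, and Koalas, so it illustrates semantics only and says nothing about the performance issue itself. The sample rows and the helper names filter_by_membership and filter_by_inner_join are hypothetical and are not part of featuretools.

```python
def filter_by_membership(rows, key, wanted):
    """Keep rows whose key is in `wanted` (analogous to the isin branch)."""
    wanted_set = set(wanted)
    return [row for row in rows if row[key] in wanted_set]


def filter_by_inner_join(rows, key, wanted):
    """Inner-join rows against `wanted` (analogous to the merge branch).

    Each occurrence of a value in `wanted` matches independently, so
    duplicate values in `wanted` produce duplicate output rows.
    """
    counts = {}
    for value in wanted:
        counts[value] = counts.get(value, 0) + 1
    joined = []
    for row in rows:
        joined.extend([row] * counts.get(row[key], 0))
    return joined


# Hypothetical sample data standing in for an entity's dataframe.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]

# With unique instance values the two strategies select the same rows.
unique_result_a = filter_by_membership(rows, "id", [1, 3])
unique_result_b = filter_by_inner_join(rows, "id", [1, 3])

# With a repeated instance value the join repeats the matching row.
dup_result = filter_by_inner_join(rows, "id", [1, 1, 3])
```

When the instance values are unique, both strategies return the same rows. If they can contain duplicates, the join-based path returns repeated rows, so a caller relying on it may want to deduplicate first. Whether that applies here depends on how instance_vals is built, which this thread does not show.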

thehomebrewnerd and others added 5 commits August 18, 2020 16:40
* fix entity creation performance issue

* fix test
* initial pass of making koalas optional

* add optional for make_ecommerce_entityset

* add koalas to extra_require, add error, revert accidental change

* bump koalas version

* make koalas optional for tests, add minimal circleci test runs

* lint and update circleci config

* update circleci config

* fix spark defaults in conftest

* update circleci config

* use pytest.importorskip

* update missing koalas error message

* update koalas import error message

* rename circleci tests
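
The commits above make Koalas an optional dependency and improve the error shown when it is missing. A common way to do this, sketched below under assumptions, is to defer the import and raise an ImportError that tells the user what to install. The helper name import_optional, the wording of the message, and the module names used in the demonstration are hypothetical; this is not the code from this pull request.

```python
import importlib


def import_optional(module_name, install_hint):
    """Import `module_name`, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            "{} is required for this feature. {}".format(module_name, install_hint)
        ) from err


# A standard-library module stands in for an installed optional dependency.
json_module = import_optional("json", "It ships with Python.")

# A made-up module name stands in for a missing optional dependency.
try:
    import_optional("hypothetical_missing_pkg_xyz", "Install it with pip.")
    error_message = None
except ImportError as err:
    error_message = str(err)
```

In a test suite, pytest.importorskip(modname) plays the complementary role: it returns the module when it is importable and skips the test otherwise, so runs without the optional package do not fail.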
* updates to improve Entity.query_by_values perf

* update for optional koalas install
* test coverage updates

* test coverage updates

* use set for explored to speed up test

* misc test cleanup
* update docs

* fix typo

* add java to circleci docs build

* fix typos

* remove whitespace and add java requirement info
* update flags and get_function for agg primitives

* update list-primitives

* use enum instead of string keywords

* update enum name to Library

* fix typo

* revert accidental changes

* add pandas to compatibility

* add comment

* update bad_primitives check

* split line
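
The last group of commits replaces string keywords with an enum named Library for declaring which dataframe libraries a primitive works with. The sketch below shows the general idea only: the member names, the compatibility attribute, the ExamplePrimitive class, and the supports helper are assumptions made for illustration, not code taken from this pull request.

```python
from enum import Enum


class Library(Enum):
    """Dataframe libraries a primitive may declare support for (assumed members)."""
    PANDAS = "pandas"
    DASK = "dask"
    KOALAS = "koalas"


class ExamplePrimitive:
    """Hypothetical primitive that declares the libraries it supports."""
    name = "example"
    compatibility = [Library.PANDAS, Library.DASK]


def supports(primitive, library):
    """Return True if `primitive` lists `library` as compatible."""
    return library in primitive.compatibility


dask_ok = supports(ExamplePrimitive, Library.DASK)
koalas_ok = supports(ExamplePrimitive, Library.KOALAS)

# Unlike a free-form string keyword, an unknown value fails loudly.
try:
    Library("koala")
    typo_error = None
except ValueError as err:
    typo_error = err
```

Compared with plain strings, enum members make a misspelled library name an immediate error rather than a silent mismatch, which is one plausible motivation for the change.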
docs/source/frequently_asked_questions.ipynb (outdated, resolved)
featuretools/entityset/serialize.py (outdated, resolved)
@rwedge rwedge changed the title from "[WIP] Create EntitySet with Koalas" to "Create EntitySet with Koalas" on Sep 4, 2020
Collaborator

@rwedge rwedge left a comment

I think it's time we add a changelog entry 😄

rwedge approved these changes Sep 4, 2020
Collaborator

@rwedge rwedge left a comment

:shipit:

@frances-h frances-h merged commit ea1f50d into main Sep 4, 2020
1 check passed
@frances-h frances-h deleted the koalas-entity branch September 4, 2020 23:38
@tamargrey tamargrey mentioned this pull request Sep 8, 2020
7 participants