Create EntitySet with Koalas #1031
- if any(isinstance(entity.df, dd.DataFrame) for entity in mock_customer.entities):
-     pytest.xfail('Dask version fails, returned feature matrix is empty')
+ if any(isinstance(entity.df, (dd.DataFrame, ks.DataFrame)) for entity in mock_customer.entities):
+     pytest.xfail("Dask fails because returned feature matrix is empty; Koalas doesn't support custom agg functions")
Well, actually Koalas supports custom aggregate functions through PySpark; it's far more flexible than Dask in this regard:
https://spark.apache.org/docs/3.0.0/sql-ref-functions-udf-aggregate.html
I think we can use .groupby(...).apply(...), see https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.apply.html#databricks.koalas.groupby.GroupBy.apply
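For illustration, here is a minimal sketch of the groupby-then-apply pattern suggested above. It is written against pandas so it runs without a Spark session; the assumption is that the same call shape carries over to Koalas, whose `GroupBy.apply` is documented as following the pandas API. The column names and the `value_range` function are hypothetical and not taken from this PR.

```python
import pandas as pd


def value_range(amounts: pd.Series) -> float:
    """Custom aggregate: spread between the largest and smallest value in a group."""
    return amounts.max() - amounts.min()


# Hypothetical toy data, not from the featuretools test suite.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 25.0, 5.0, 5.0, 8.0],
})

# One aggregated value per customer_id, computed by an arbitrary Python function.
ranges = transactions.groupby("customer_id")["amount"].apply(value_range)
print(ranges)
```

Under Koalas the linked documentation also describes return type hints on the applied function, which let it skip schema inference; that detail is not exercised here.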
4270e04 to fec2da0
5f27a0c to 6141607
lti_df['last_time'] = ks.to_datetime(lti_df['last_time'])
lti_df['last_time_old'] = ks.to_datetime(lti_df['last_time_old'])
# TODO: Figure out a workaround for fillna and replace
lti_df = lti_df.max(axis=1)
In testing on the instacart dataset for the predict next purchase problem, this line appears to be causing a performance issue.
featuretools/entityset/entity.py
Outdated
if isinstance(instance_vals, dd.Series) or isinstance(instance_vals, ks.Series):
    df = self.df.merge(instance_vals.to_frame(), how="inner", on=variable_id)
else:
    df = self.df[self.df[variable_id].isin(instance_vals)]
In testing on the instacart dataset for the predict next purchase problem, this code block appears to be causing a performance issue.
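To make the two branches of that block concrete, here is a small pandas-only sketch comparing the `isin` filter with the inner-merge filter. It does not reproduce the Koalas performance problem; it only shows that the two approaches select the same rows when the filter values are unique, and that they diverge when the values contain duplicates. The frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for self.df, variable_id and instance_vals.
df = pd.DataFrame({"customer_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]})
variable_id = "customer_id"
instance_vals = pd.Series([2, 4], name=variable_id)

# pandas branch: boolean mask built with isin.
by_isin = df[df[variable_id].isin(instance_vals)]

# Dask/Koalas branch: inner merge against the values as a one-column frame.
by_merge = df.merge(instance_vals.to_frame(), how="inner", on=variable_id)

# With duplicated filter values, the merge repeats matching rows; isin does not.
dup_vals = pd.Series([2, 2], name=variable_id)
dup_isin = df[df[variable_id].isin(dup_vals)]
dup_merge = df.merge(dup_vals.to_frame(), how="inner", on=variable_id)
print(len(dup_isin), len(dup_merge))
```

If the merge path is kept, dropping duplicates from the values first (for example with `drop_duplicates()`) would be one way to keep the two branches consistent; whether that helps or hurts performance on Koalas is not something this sketch measures.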
* fix entity creation performance issue
* fix test

* initial pass of making koalas optional
* add optional for make_ecommerce_entityset
* add koalas to extra_require, add error, revert accidental change
* bump koalas version
* make koalas optional for tests, add minimal circleci test runs
* lint and update circleci config
* update circleci config
* fix spark defaults in conftest
* update circleci config
* use pytest.importorskip
* update missing koalas error message
* update koalas import error message
* rename circleci tests

* updates to improve Entity.query_by_values perf
* update for optional koalas install

* test coverage updates
* test coverage updates
* use set for explored to speed up test
* misc test cleanup

* update docs
* fix typo
* add java to circleci docs build
* fix typos
* remove whitespace and add java requirement info

* update flags and get_function for agg primitives
* update list-primitives
* use enum instead of string keywords
* update enum name to Library
* fix typo
* revert accidental changes
* add pandas to compatibility
* add comment
* update bad_primitives check
* split line
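The commits above that make Koalas optional and add an import error message follow a common optional-dependency pattern. A rough sketch of that pattern, using only the standard library, might look like the following. The helper name `import_or_raise`, the `get_koalas` wrapper and the message text are assumptions chosen for illustration; they are not copied from this PR.

```python
import importlib


def import_or_raise(library: str, error_msg: str):
    """Return the imported module, or raise ImportError carrying a helpful message."""
    try:
        return importlib.import_module(library)
    except ImportError:
        # Replace the default "No module named ..." text with actionable guidance.
        raise ImportError(error_msg) from None


# Hypothetical message; the real wording in the PR may differ.
KOALAS_ERR_MSG = (
    "The 'databricks.koalas' package is required to use Koalas EntitySets. "
    "Install it through the project's optional extras."
)


def get_koalas():
    """Import Koalas lazily, only when a Koalas code path is actually used."""
    return import_or_raise("databricks.koalas", KOALAS_ERR_MSG)
```

On the test side, `pytest.importorskip("databricks.koalas")` plays the matching role: it returns the module when it is installed and skips the test otherwise, so the suite can run in an environment without Spark.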
b65a00b to 969cd3f
featuretools/computational_backends/calculate_feature_matrix.py
Outdated
I think it's time we add a changelog entry 😄
c30f132 to 252995e
Pull Request Description
Add support for creating EntitySets with Koalas DataFrames and running dfs on them.
After creating the pull request: in order to pass the changelog_updated check, you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request.