Refactor infer_variable_types for Dask entities #952

Closed
thehomebrewnerd opened this issue May 7, 2020 · 1 comment
@thehomebrewnerd
Contributor

In infer_variable_types() in entity_utils.py there is a len() call to get the dataframe length. On a Dask dataframe this call triggers a computation, which makes the entity creation process slow. The function also contains a .compute() call on the sample dataframe, but the computed sample is never used for Dask, because the user must specify the datatypes for Dask entities.

This code could be refactored with these changes:

  • Only perform the len() call if the input dataframe is a Pandas dataframe.
  • Remove the code block that computes sample_df when the input df is a Dask dataframe, since this sample is never used for a Dask entity.
  • Revert the sample-selection code to match the code on master. It was updated to work with Dask dataframes, but since sample_df is no longer needed for Dask, it can return to its original form.
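
The changes above could be sketched roughly as follows. This is a minimal illustration and not the actual code in entity_utils.py: the function name `infer_types_sketch`, the `sample_size` parameter, and the "numeric"/"other" labels are hypothetical placeholders, and any non-Pandas input is assumed to be a lazy (Dask-like) dataframe whose types must be supplied by the caller.

```python
import pandas as pd


def infer_types_sketch(df, variable_types=None, sample_size=10000):
    """Hypothetical sketch: skip len() and sampling for non-Pandas input."""
    variable_types = dict(variable_types or {})

    if not isinstance(df, pd.DataFrame):
        # Assumed Dask path: never call len() or compute a sample.
        # Every column must already have a user-specified type.
        missing = [c for c in df.columns if c not in variable_types]
        if missing:
            raise ValueError(f"Types must be specified for: {missing}")
        return variable_types

    # Pandas path: len() is cheap here, so sampling by length is fine.
    if len(df) > sample_size:
        sample_df = df.sample(sample_size)
    else:
        sample_df = df

    inferred = {}
    for col in df.columns:
        if col in variable_types:
            # A user-specified type always wins over inference.
            inferred[col] = variable_types[col]
        elif pd.api.types.is_numeric_dtype(sample_df[col]):
            inferred[col] = "numeric"  # placeholder label
        else:
            inferred[col] = "other"  # placeholder label
    return inferred
```

With this shape, the Pandas branch keeps its original sampling logic, while the lazy branch returns without touching the data at all.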
@thehomebrewnerd
Contributor Author

Closed by #957
