-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(perf): Distribute data types inference #1692
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
for object_ref in block_object_refs | ||
] | ||
) | ||
return {k: v for d in result for k, v in d.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Risk: The last block(s) are likely to overwrite data types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make use of pyarrow.unify_schemas function. If the data types are incompatible, it will just fail. But catching the exception will at least give us a chance to log the error along with details.
The log can then be something like:
Data schemas across data blocks are incompatible. Sampling to one of the schemas
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
# Dictionaries in list_col_types might not be equal (i.e. different col types in different blocks) | ||
# In which case we return the most frequent value for each key | ||
# More details here: https://github.com/aws/aws-sdk-pandas/pull/1692 | ||
keys = set().union(*(d.keys() for d in list_col_types)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to still use the pyarrow.unify_schemas function? It would be good just to check that the schemas are different, and to log a warning.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Awesome! Real nice to see how easy it is to hook up alternative distributed functions |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Feature or Bugfix
Current data types inference is done on the entire dataframe causing modin to pull the data into the driver and then loop over each column which is extremely inefficient.
Two options are available:
The current PR implements option 2
Detail
Relates
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.