-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: Optimize distributed s3.read_text to load data in chunks #1607
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM left a couple of comments
path=path, | ||
version_id=version_id, | ||
mode=mode, | ||
s3_block_size=10 * 1024 * 1024, # 10 MB (10 * 2**20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So was the READER_ROW_BATCH_SIZE
chosen to be 10Mb in order to match this s3_block_size or is it just a coincidence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was mostly about matching the batch size used for the S3 Parquet reader, which also uses 10MiB. And I saw that the S3 block size was also 10 MiB, so it seemed like a good number to use. I'm not sure if there's any more specific guidance for how to choose this number.
def _get_version_id_for(version_id: Optional[Union[str, Dict[str, str]]], path: str) -> Optional[str]: | ||
if isinstance(version_id, dict): | ||
return version_id.get(path, None) | ||
class _ReadingStrategy(abc.ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff, I like this refactoring, thanks
@@ -171,25 +140,40 @@ def _read_text( | |||
version_id_dict = {path: _get_version_id_for(version_id, path) for path in paths} | |||
|
|||
if chunksize is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I have been having this debate on whether using the chunksize
argument makes sense in the distributed
scenario or not. In the current implementation of read_parquet, chunksize is never reached because it would always hit the first condition first (i.e. config.distributed == True).
My rationale was that if you are trying to read distributed, then you want to leverage ray datasets and reading in chunk does not make sense. But based on the debate we've had with Anton on making decisions on behalf of the user, it's probably not a fair assumption. So I will follow the same order of conditions in read_parquet too, thanks
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
|
||
@property | ||
@abc.abstractmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL abc
nice. This seems great.
Feature or Bugfix
Detail
lines=True
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.