Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: Optimize distributed s3.read_text to load data in chunks #1607

Merged
merged 14 commits into from
Sep 21, 2022

Conversation

LeonLuttenberger
Copy link
Contributor

@LeonLuttenberger LeonLuttenberger commented Sep 13, 2022

Feature or Bugfix

  • Feature

Detail

  • When using the distributed option, rather than loading every single file using Pandas in one chunk, the data will be loaded in approx. 10 MiB chunks.
  • The data chunks will then be reorganized into Ray blocks by Ray itself
  • For JSON data, this chunked loading of data will only be supported when lines=True

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@aws aws deleted a comment from malachi-constant Sep 13, 2022
@aws aws deleted a comment from malachi-constant Sep 13, 2022
@aws aws deleted a comment from malachi-constant Sep 13, 2022
@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 03b790a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 03b790a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: 03b790a
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: 03b790a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@LeonLuttenberger LeonLuttenberger marked this pull request as ready for review September 13, 2022 19:21
Copy link
Contributor

@jaidisido jaidisido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM left a couple of comments

path=path,
version_id=version_id,
mode=mode,
s3_block_size=10 * 1024 * 1024, # 10 MB (10 * 2**20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So was the READER_ROW_BATCH_SIZE chosen to be 10Mb in order to match this s3_block_size or is it just a coincidence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was mostly about matching the batch size used for the S3 Parquet reader, which also uses 10MiB. And I saw that the S3 block size was also 10 MiB, so it seemed like a good number to use. I'm not sure if there's any more specific guidance for how to choose this number.

def _get_version_id_for(version_id: Optional[Union[str, Dict[str, str]]], path: str) -> Optional[str]:
if isinstance(version_id, dict):
return version_id.get(path, None)
class _ReadingStrategy(abc.ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff, I like this refactoring, thanks

@@ -171,25 +140,40 @@ def _read_text(
version_id_dict = {path: _get_version_id_for(version_id, path) for path in paths}

if chunksize is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I have been having this debate on whether using the chunksize argument makes sense in the distributed scenario or not. In the current implementation of read_parquet, chunksize is never reached because it would always hit the first condition first (i.e. config.distributed == True).

My rationale was that if you are trying to read distributed, then you want to leverage ray datasets and reading in chunk does not make sense. But based on the debate we've had with Anton on making decisions on behalf of the user, it's probably not a fair assumption. So I will follow the same order of conditions in read_parquet too, thanks

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: e3e5125
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: e3e5125
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: e3e5125
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: bd31c75
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: bd31c75
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: bd31c75
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository


@property
@abc.abstractmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL abc nice. This seems great.

@jaidisido jaidisido merged commit 90cbf10 into release-3.0.0 Sep 21, 2022
@jaidisido jaidisido deleted the optimize-s3-read-text branch September 21, 2022 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants