feature: Optimize distributed s3.read_text to load data in chunks #1607

LeonLuttenberger · 2022-09-13T17:51:08Z

Feature or Bugfix

Feature

Detail

When using the distributed option, each file is no longer loaded by Pandas in a single chunk; instead, the data is loaded in chunks of approximately 10 MiB.
Ray itself then reorganizes the data chunks into Ray blocks.
For JSON data, chunked loading is only supported when lines=True.
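
As a rough illustration of why the JSON case is restricted to lines=True, here is a minimal sketch using plain Pandas on in-memory data. It is not the code from this PR: the PR sizes its chunks at roughly 10 MiB, while the Pandas `chunksize` argument used here counts rows, and the sample data is made up for the example.

```python
import io

import pandas as pd

# CSV can always be read incrementally: each chunk is a DataFrame.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10))
csv_chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=4))
csv_chunk_lengths = [len(chunk) for chunk in csv_chunks]

# JSON Lines (one record per line) can be read incrementally as well.
json_lines_text = "".join(f'{{"id": {i}}}\n' for i in range(5))
json_chunks = list(pd.read_json(io.StringIO(json_lines_text), lines=True, chunksize=2))
json_chunk_lengths = [len(chunk) for chunk in json_chunks]

# A regular JSON document has to be parsed as a whole, so Pandas rejects
# chunksize when lines=True is not set.
try:
    pd.read_json(io.StringIO('[{"id": 1}, {"id": 2}]'), chunksize=2)
    rejected_without_lines = False
except ValueError:
    rejected_without_lines = True
```

In this sketch the chunks are materialized with `list(...)` only to inspect them; a real reader would consume them one at a time.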

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

malachi-constant · 2022-09-13T18:22:28Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 03b790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-13T18:31:40Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 03b790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-13T18:37:52Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 03b790a
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-13T19:21:05Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 03b790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

jaidisido

LGTM, left a couple of comments.

jaidisido · 2022-09-14T13:15:55Z

awswrangler/s3/_read_text_core.py

+        path=path,
+        version_id=version_id,
+        mode=mode,
+        s3_block_size=10 * 1024 * 1024,  # 10 MB (10 * 2**20)


So was READER_ROW_BATCH_SIZE chosen to be 10 MiB in order to match this s3_block_size, or is it just a coincidence?

LeonLuttenberger · 2022-09-14

It was mostly about matching the batch size used for the S3 Parquet reader, which also uses 10 MiB. I also saw that the S3 block size was 10 MiB, so it seemed like a good number to use. I'm not sure whether there is any more specific guidance on how to choose this number.
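
For reference, the value in the diff above is 10 MiB rather than 10 MB, even though the inline comment says "10 MB". A small sketch of the arithmetic, with constant names chosen only for this example:

```python
MIB = 1024 * 1024  # one mebibyte, 2**20 bytes
MB = 1000 * 1000   # one megabyte, 10**6 bytes

# Same expression as the s3_block_size argument in the diff.
s3_block_size = 10 * MIB

# How far 10 MiB is from 10 MB, in bytes.
difference_from_10_mb = s3_block_size - 10 * MB
```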

jaidisido · 2022-09-14T13:16:19Z

awswrangler/s3/_read_text.py

-def _get_version_id_for(version_id: Optional[Union[str, Dict[str, str]]], path: str) -> Optional[str]:
-    if isinstance(version_id, dict):
-        return version_id.get(path, None)
+class _ReadingStrategy(abc.ABC):


Great stuff, I like this refactoring, thanks

jaidisido · 2022-09-14T13:22:22Z

awswrangler/s3/_read_text.py

@@ -171,25 +140,40 @@ def _read_text(
    version_id_dict = {path: _get_version_id_for(version_id, path) for path in paths}

    if chunksize is not None:


I have been debating whether the chunksize argument makes sense in the distributed scenario. In the current implementation of read_parquet, the chunksize branch is never reached, because the first condition (i.e. config.distributed == True) always matches first.

My rationale was that if you are reading in distributed mode, you want to leverage Ray datasets, so reading in chunks does not make sense. But based on the debate we've had with Anton about making decisions on behalf of the user, that is probably not a fair assumption. So I will follow the same order of conditions in read_parquet too, thanks.
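
The ordering being discussed can be sketched as follows. This is a hypothetical dispatcher, not the actual `_read_text` or `read_parquet` code: the function name and the returned labels are invented for illustration, and only the order of the two checks reflects the comment above.

```python
from typing import Optional


def choose_read_path(chunksize: Optional[int], distributed: bool) -> str:
    """Pick a read path, giving an explicit chunksize priority over distributed mode."""
    # Checked first, so a user who asks for chunks gets chunks
    # even when distributed mode is enabled.
    if chunksize is not None:
        return "chunked-iterator"
    # Only reached when no chunksize was requested.
    if distributed:
        return "ray-dataset"
    return "single-dataframe"
```

If the two checks were swapped, `choose_read_path(1000, True)` would return the distributed path and the chunksize argument would be silently ignored, which is the behaviour described for read_parquet.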

malachi-constant · 2022-09-14T13:30:25Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: e3e5125
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-14T13:42:19Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: e3e5125
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-14T13:52:51Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: e3e5125
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-14T21:45:15Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: bd31c75
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-14T21:55:03Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: bd31c75
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-14T22:06:56Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: bd31c75
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-09-14T23:10:00Z

awswrangler/s3/_read_text.py


+    @property
+    @abc.abstractmethod


TIL abc nice. This seems great.
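
For readers unfamiliar with the pattern, here is a minimal sketch of an abstract property, with `@property` as the outer decorator and `@abc.abstractmethod` as the inner one, as in the diff. The class and property names are hypothetical and only loosely modelled on the `_ReadingStrategy` idea; they are not taken from the PR.

```python
import abc


class ReadingStrategy(abc.ABC):
    """Hypothetical base class: each subclass states whether it can read in chunks."""

    @property
    @abc.abstractmethod
    def supports_chunking(self) -> bool:
        """Whether this strategy can load a file piece by piece."""


class CsvStrategy(ReadingStrategy):
    @property
    def supports_chunking(self) -> bool:
        return True


class JsonStrategy(ReadingStrategy):
    def __init__(self, lines: bool) -> None:
        self._lines = lines

    @property
    def supports_chunking(self) -> bool:
        # Mirrors the PR description: JSON is only chunkable when lines=True.
        return self._lines


# The abstract base cannot be instantiated until the property is overridden.
try:
    ReadingStrategy()
    base_is_instantiable = True
except TypeError:
    base_is_instantiable = False
```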

LeonLuttenberger added 10 commits September 8, 2022 13:47

Move S3 read_text functions to Ray data source

5573409

Rename PandasTextDatasource

2c888b2

Fix imports

c0f66a1

Fix boto3 session

94d4e73

Add ability to load data in separate blocks

6cea7c3

Merge branch 'release-3.0.0' into optimize-s3-read-text

0b90b2b

Add ray_max_block_size

da394c2

Rename param

10e34dc

Refactor datasource since JSON does not support chunking

0cedc51

Fix formatting

c93af0d

LeonLuttenberger requested review from kukushking, malachi-constant, jaidisido and cnfait September 13, 2022 17:51

LeonLuttenberger added 2 commits September 13, 2022 12:52

Reset timeout value for test_s3_read_json_simple

ea5edfb

Fix order of property and abstractmethod

03b790a

aws deleted a comment from malachi-constant Sep 13, 2022

LeonLuttenberger marked this pull request as ready for review September 13, 2022 19:21

jaidisido reviewed Sep 14, 2022

Merge branch 'release-3.0.0' into optimize-s3-read-text

e3e5125

Refactor Pandas data sources

bd31c75

malachi-constant reviewed Sep 14, 2022

jaidisido merged commit 90cbf10 into release-3.0.0 Sep 21, 2022

jaidisido deleted the optimize-s3-read-text branch September 21, 2022 09:45
