(perf): Distribute data types inference #1692

jaidisido · 2022-10-17T14:58:18Z

Feature or Bugfix

Performance

Data type inference is currently performed on the entire DataFrame, which causes Modin to pull the data into the driver and then loop over each column. This is extremely inefficient.

Two options are available:

1. Infer the data types from the first block or from a sample of the data. This is efficient but risky: a smaller sample can lead to incorrect inference (e.g. with sparse data).
2. Infer the data types from each Modin block and combine them. This is thorough, but we run the risk of inferring different data types for the same column in two different blocks.

The current PR implements option 2.
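To make the trade-off between the two options concrete, here is a minimal, library-free sketch. It assumes each block is a plain dict mapping column names to lists of values, and `infer_block_types` is a hypothetical stand-in for the real per-block inference; neither is taken from the PR, and the loop below runs sequentially rather than as distributed tasks.

```python
from typing import Any, Dict, List, Optional

Block = Dict[str, List[Any]]


def infer_block_types(block: Block) -> Dict[str, Optional[str]]:
    """Hypothetical stand-in for per-block inference: name the type of the
    first non-null value in each column, or None if the column is all null."""
    types: Dict[str, Optional[str]] = {}
    for column, values in block.items():
        first = next((v for v in values if v is not None), None)
        types[column] = type(first).__name__ if first is not None else None
    return types


def infer_from_first_block(blocks: List[Block]) -> Dict[str, Optional[str]]:
    """Option 1: cheap, but only as good as the first block."""
    return infer_block_types(blocks[0])


def infer_from_all_blocks(blocks: List[Block]) -> Dict[str, List[Optional[str]]]:
    """Option 2: infer per block and keep every candidate type per column,
    so that disagreements between blocks stay visible to the caller."""
    candidates: Dict[str, List[Optional[str]]] = {}
    for block in blocks:
        for column, dtype in infer_block_types(block).items():
            candidates.setdefault(column, []).append(dtype)
    return candidates


# A sparse column: "note" is entirely null in the first block.
blocks = [
    {"id": [1, 2], "note": [None, None]},
    {"id": [3, 4], "note": ["late value", None]},
]

first_block_types = infer_from_first_block(blocks)
all_block_types = infer_from_all_blocks(blocks)
print(first_block_types)
print(all_block_types)
```

With this input, option 1 finds no type for `note`, while option 2 surfaces both the missing type and the string type, which then have to be reconciled.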

Detail

Distribute data types inference based on Modin blocks

Relates

(bug) modin pyarrow type inference #1681

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

malachi-constant · 2022-10-17T15:03:54Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: b7eaa31
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

jaidisido · 2022-10-17T15:12:41Z

awswrangler/distributed/ray/modin/_data_types.py

+            for object_ref in block_object_refs
+        ]
+    )
+    return {k: v for d in result for k, v in d.items()}


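Risk: when the per-block dictionaries are merged this way, the data types inferred from the last block(s) are likely to overwrite those inferred from earlier blocks.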
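A small illustration of that risk, using the same merge expression as in the diff. The per-block values are made up for the example.

```python
# Per-block results as they might come back from two blocks (illustrative values).
result = [
    {"id": "bigint", "price": "double"},
    {"id": "bigint", "price": "string"},  # second block disagrees on "price"
]

# Same merge expression as in the diff: later dictionaries win on key collisions.
merged = {k: v for d in result for k, v in d.items()}
print(merged)  # {'id': 'bigint', 'price': 'string'}
```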

We could make use of the pyarrow.unify_schemas function. If the data types are incompatible, it simply fails, but catching the exception at least gives us a chance to log the error along with details.

The log can then be something like:

Data schemas across data blocks are incompatible. Sampling to one of the schemas

malachi-constant · 2022-10-17T15:15:00Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: b7eaa31
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-17T15:18:07Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: b7eaa31
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-17T18:47:19Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 232ed71
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-10-17T18:55:30Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 232ed71
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-17T18:58:13Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 232ed71
Result: SUCCEEDED
Build Logs (available for 30 days)

LeonLuttenberger · 2022-10-17T19:58:14Z

awswrangler/distributed/ray/modin/_data_types.py

+    # Dictionaries in list_col_types might not be equal (i.e. different col types in different blocks)
+    # In which case we return the most frequent value for each key
+    # More details here: https://github.com/aws/aws-sdk-pandas/pull/1692
+    keys = set().union(*(d.keys() for d in list_col_types))


Do we still want to use the pyarrow.unify_schemas function? It would be good at least to check whether the schemas differ and, if so, to log a warning.
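One way the "most frequent value per key" logic from the diff could be combined with such a warning is sketched below. The helper name `most_frequent_types`, the warning text, and the sample type names are assumptions for illustration and are not taken from the PR.

```python
import logging
from collections import Counter
from typing import Dict, List

_logger = logging.getLogger(__name__)


def most_frequent_types(list_col_types: List[Dict[str, str]]) -> Dict[str, str]:
    """Hypothetical helper: pick the most frequent type per column and
    warn about every column on which the blocks disagree."""
    # dict.fromkeys preserves first-seen order, unlike a set.
    keys = list(dict.fromkeys(k for d in list_col_types for k in d))
    resolved: Dict[str, str] = {}
    for key in keys:
        counts = Counter(d[key] for d in list_col_types if key in d)
        if len(counts) > 1:
            _logger.warning(
                "Column %r has conflicting types across blocks: %s", key, dict(counts)
            )
        # most_common breaks ties in favour of the type seen first.
        resolved[key] = counts.most_common(1)[0][0]
    return resolved


list_col_types = [
    {"id": "bigint", "price": "double"},
    {"id": "bigint", "price": "string"},
    {"id": "bigint", "price": "double"},
]
resolved = most_frequent_types(list_col_types)
print(resolved)  # {'id': 'bigint', 'price': 'double'}
```

Here a warning is logged for `price` only, since `id` has the same type in every block.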

malachi-constant · 2022-10-18T14:14:45Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 5b5795a
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-18T14:25:59Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 5b5795a
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-18T14:30:46Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 5b5795a
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-18T15:46:24Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 2ae5ca3
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-18T15:48:21Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 2ae5ca3
Result: FAILED
Build Logs (available for 30 days)

kukushking · 2022-10-18T15:53:15Z

Awesome! Real nice to see how easy it is to hook up alternative distributed functions

malachi-constant · 2022-10-18T16:00:10Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 2ae5ca3
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-10-18T16:27:32Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 2ae5ca3
Result: SUCCEEDED
Build Logs (available for 30 days)

(perf): Distribute data types inference

b7eaa31

jaidisido self-assigned this Oct 17, 2022

jaidisido added this to the 3.0.0 milestone Oct 17, 2022

jaidisido requested review from cnfait, kukushking, LeonLuttenberger and malachi-constant October 17, 2022 14:58

jaidisido added the enhancement and major release labels Oct 17, 2022

jaidisido linked an issue Oct 17, 2022 that may be closed by this pull request

(bug) modin pyarrow type inference #1681

Closed

jaidisido added the performance label Oct 17, 2022

jaidisido added this to In Review in AWS SDK for pandas roadmap Oct 17, 2022

jaidisido commented Oct 17, 2022

PR feedback: logic to pick most frequent datatypes

232ed71

LeonLuttenberger reviewed Oct 17, 2022

Fix - ensure order in list of keys

5b5795a

PR feedback: infer datatype based on first block

2ae5ca3

kukushking approved these changes Oct 18, 2022

jaidisido merged commit 875e2ce into release-3.0.0 Oct 18, 2022

jaidisido deleted the perf/infer-data-types-to-parquet branch October 18, 2022 16:35

kukushking moved this from In Review to Done in AWS SDK for pandas roadmap Oct 21, 2022
