(perf): Distribute timestream write with executor #1715

Merged: jaidisido merged 5 commits into release-3.0.0 from timestream-write on Oct 27, 2022

Conversation

@jaidisido (Contributor) commented Oct 25, 2022

Feature or Bugfix

  • Feature
  • Refactoring

Detail

  • Distribute timestream write with executor

In the non-distributed case, the DataFrame is split into smaller sub-DataFrames based on the number of threads.
In the distributed case, the sub-DataFrames are obtained from Ray object references, which are then submitted to the two Ray remote methods (_write_df and _write_batch).
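
For readers skimming the PR, the non-distributed path can be pictured roughly as follows. This is only a minimal sketch, not the code in this PR: a plain list of rows stands in for the DataFrame to keep it dependency-free, and `split_rows`, `write_part`, and `write_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Row = Dict[str, object]


def split_rows(rows: List[Row], splits: int) -> List[List[Row]]:
    """Split rows into at most `splits` contiguous, near-equal parts."""
    splits = max(1, min(splits, len(rows)))
    size, extra = divmod(len(rows), splits)
    parts: List[List[Row]] = []
    start = 0
    for i in range(splits):
        # The first `extra` parts take one additional row each.
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def write_part(part: List[Row]) -> int:
    """Hypothetical stand-in for the real writer: reports rows handled."""
    return len(part)


def write_all(rows: List[Row], threads: int) -> List[int]:
    """Split the rows by thread count and write each part on its own thread."""
    parts = split_rows(rows, threads)
    with ThreadPoolExecutor(max_workers=threads) as executor:
        return list(executor.map(write_part, parts))


rows = [{"value": i} for i in range(10)]
counts = write_all(rows, threads=4)
```

Each thread receives one contiguous slice, so the number of parts follows the thread count rather than the data size.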

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido self-assigned this Oct 25, 2022
@jaidisido added this to the 3.0.0 milestone Oct 25, 2022
@jaidisido added the feature, major release, and refactoring labels Oct 25, 2022
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 1cd70c2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: 1cd70c2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 1cd70c2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: b2a3312
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

awswrangler/distributed/ray/modin/_utils.py (outdated)
awswrangler/timestream.py
)
return [item for sublist in res for item in sublist]
)
return _flatten_list(ray_get(errors))
Contributor Author

Two _flatten_list(ray_get()) calls were required here because of the nested Ray remote methods (_write_df and _write_batch). This is not needed in S3 Select, for instance, because there we feed the Ray object references from the first ray_get to a Ray dataset.
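
To illustrate why two rounds of get-then-flatten are needed with nested remote calls, here is a rough sketch in which plain `concurrent.futures` stands in for Ray. Everything in it is an assumption for illustration: `get` plays the role of `ray_get`, `flatten` mirrors the list comprehension in the diff, and `write_df` / `write_batch` are simplified hypothetical versions that treat negative values as rejected records.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, List

# Separate pools so outer tasks never wait on a worker they occupy themselves.
outer_pool = ThreadPoolExecutor(max_workers=2)
inner_pool = ThreadPoolExecutor(max_workers=4)


def flatten(nested: List[List[Any]]) -> List[Any]:
    """Flatten exactly one level of nesting."""
    return [item for sublist in nested for item in sublist]


def get(futures: List[Future]) -> List[Any]:
    """Resolve futures in order (stand-in for ray_get)."""
    return [future.result() for future in futures]


def write_batch(batch: List[int]) -> List[str]:
    """Inner task: returns the rejected records of one batch."""
    return [f"rejected:{value}" for value in batch if value < 0]


def write_df(rows: List[int]) -> List[Future]:
    """Outer task: fans out into inner tasks and returns their futures."""
    batches = [rows[i:i + 2] for i in range(0, len(rows), 2)]
    return [inner_pool.submit(write_batch, batch) for batch in batches]


outer = [outer_pool.submit(write_df, part) for part in ([1, -2, 3], [-4, 5, -6, 7])]
inner_futures = flatten(get(outer))    # first round: list of lists of futures
errors = flatten(get(inner_futures))   # second round: list of lists of errors

outer_pool.shutdown()
inner_pool.shutdown()
```

The first round resolves the outer tasks into inner futures, and only the second round yields the rejected records themselves.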

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: b2a3312
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: b2a3312
  • Result: FAILED
  • Build Logs (available for 30 days)

@@ -43,6 +44,11 @@ def _to_modin(
)


def _split_modin_frame(df: modin_pd.DataFrame, splits: int) -> List[ObjectRef[Any]]: # pylint: disable=unused-argument
Contributor Author

I am not 100% convinced that this is the best way to split a modin dataframe

version: int,
boto3_session: Optional[boto3.Session] = None,
) -> List[Dict[str, str]]:
batches: List[List[Any]] = _utils.chunkify(lst=_df2list(df=df), max_length=100)
Contributor Author

Here the function receives the object reference of a split Modin DataFrame block. I assume Modin/Ray is smart enough to avoid a shuffle (i.e. pulling a block from one worker to another) and would instead run the remote functions (_write_df and _write_batch) on the worker where the block already exists...

Contributor

The blocks would be broken down into batches and sent to workers, so unfortunately some shuffle, or rather copy, will inevitably happen. One thing I'm afraid of is max_length=100: these tasks would be too fine-grained and might not be worth it because of the scheduling overhead.

Contributor Author

Fair point, the load test on 64,000 rows was fine but let me check with an even larger one tomorrow
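
To put rough numbers on the granularity concern, the sketch below counts tasks for the 64,000-row load test. It assumes the 100-record batch size in the diff has to stay as the per-request size, and shows one possible mitigation (not something this PR implements): grouping several batches into each task. `chunk` is a hypothetical helper, not awswrangler's `_utils.chunkify`, and `BATCHES_PER_TASK` is an arbitrary illustrative value.

```python
from typing import Any, List, Sequence


def chunk(items: Sequence[Any], max_length: int) -> List[List[Any]]:
    """Cut items into consecutive chunks of at most `max_length` elements."""
    return [list(items[i:i + max_length]) for i in range(0, len(items), max_length)]


ROWS = 64_000           # size of the load test mentioned above
BATCH_SIZE = 100        # matches max_length=100 in the diff
BATCHES_PER_TASK = 10   # hypothetical grouping factor

rows = list(range(ROWS))
batches = chunk(rows, BATCH_SIZE)          # one write request per batch
tasks = chunk(batches, BATCHES_PER_TASK)   # one remote task per group of batches

print(f"one task per batch: {len(batches)} tasks")
print(f"grouped batches:    {len(tasks)} tasks")
```

With one task per batch, the scheduler handles 640 small tasks; grouping ten batches per task cuts that to 64 while each request still carries 100 records.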

Contributor

@kukushking kukushking left a comment

Looks good to me overall 👍
Consider increasing the batch size to avoid overly fine-grained tasks, as per the comment above.

@jaidisido changed the base branch from release-3.0.0 to main October 27, 2022 09:23
@jaidisido changed the base branch from main to release-3.0.0 October 27, 2022 09:23
@jaidisido merged commit fdc7bef into release-3.0.0 Oct 27, 2022
@jaidisido deleted the timestream-write branch October 27, 2022 13:32
Labels
feature, major release, refactoring

3 participants