[GSProcessing] Improve Spark config to better support EMR/EMRS, small optimizations #838

Merged 5 commits on May 29, 2024

@thvasilo (Contributor) commented:

Issue #, if available:

Description of changes:

  • Change how we configure the Spark environment: we now create our own config only for SageMaker, because EMR/EMRS come with pre-configured defaults.
  • Introduce enum classes to communicate the execution environment and filesystem type more clearly.
  • Cache DataFrames (DFs) that are used multiple times, to avoid re-computation.
  • Do not use `:` in the data paths, as it can cause errors when moving the data to local filesystems.
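
The first, second, and fourth points above could be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class names, enum members, function names, the default memory value, and the choice of `_` as the replacement character are all assumptions made for the example. Only the Spark property keys (`spark.driver.memory`, `spark.driver.maxResultSize`) are real Spark settings.

```python
from enum import Enum
from typing import Dict, Optional


class ExecutionEnv(Enum):
    """Where the job runs (hypothetical member names)."""
    SAGEMAKER = "sagemaker"
    EMR = "emr"
    EMR_SERVERLESS = "emr-serverless"


class FilesystemType(Enum):
    """Where the data lives (hypothetical member names)."""
    LOCAL = "local"
    S3 = "s3"


def detect_filesystem(path: str) -> FilesystemType:
    """Classify a path by its prefix; anything that is not s3:// counts as local."""
    return FilesystemType.S3 if path.startswith("s3://") else FilesystemType.LOCAL


def custom_spark_conf(
    env: ExecutionEnv, driver_mem_gb: int = 4
) -> Optional[Dict[str, str]]:
    """Return explicit Spark settings only for SageMaker.

    For EMR and EMR Serverless this returns None, meaning "leave the
    platform's pre-configured defaults untouched".
    """
    if env is not ExecutionEnv.SAGEMAKER:
        return None
    # The values here are placeholders, not the values used by the PR.
    return {
        "spark.driver.memory": f"{driver_mem_gb}g",
        "spark.driver.maxResultSize": f"{driver_mem_gb}g",
    }


def safe_path_component(name: str) -> str:
    """Replace ':' so the name can be used as a directory name on local filesystems."""
    return name.replace(":", "_")
```

With a sketch like this, a caller would apply the returned dictionary to its Spark session builder only when it is not `None`, and would pass every type name through `safe_path_component` before building an output path.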

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo added the draft, 0.3, and gsprocessing labels May 14, 2024
@thvasilo self-assigned this May 14, 2024
@thvasilo added the ready label May 14, 2024
@thvasilo changed the title from "[GSProcessing] Improve Spark config to better support EMR/EMRS, small optimizations" to "[Draft][GSProcessing] Improve Spark config to better support EMR/EMRS, small optimizations" May 15, 2024
…tadata.json when only Parquet metadata is changed
@thvasilo changed the title from "[Draft][GSProcessing] Improve Spark config to better support EMR/EMRS, small optimizations" to "[GSProcessing] Improve Spark config to better support EMR/EMRS, small optimizations" May 17, 2024
@thvasilo removed the draft label May 17, 2024
@thvasilo requested a review from jalencato May 17, 2024 23:29
@jalencato (Collaborator) left a comment:

Some minor comments

@thvasilo requested a review from jalencato May 28, 2024 23:31

@thvasilo (Contributor, Author) commented:

Should have resolved all comments @jalencato, let me know if there's any more work left to merge.

@jalencato (Collaborator) left a comment:

LGTM

@thvasilo merged commit 1e556f9 into awslabs:main May 29, 2024
3 checks passed
@thvasilo deleted the gsp-spark-config branch May 29, 2024 17:10
Labels: 0.3, gsprocessing, ready