Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Managed Iceberg] add GiB autosharding #32612

Merged
merged 8 commits into from
Oct 4, 2024

Conversation

ahmedabu98
Copy link
Contributor

@ahmedabu98 ahmedabu98 commented Oct 1, 2024

Adds auto-sharding to Iceberg streaming writes using GroupIntoBatches.

In streaming writes, bundles are often very small and can even be single-elements. We write each bundle to a file, so this can behavior can lead to many small files.

To solve this, we group records into batches set by a triggering frequency (as well as record and byte size limits). Now, the number of written data files is more easily controlled. Essentially, every triggering frequency duration, roughly N data files are written, where N is the number of concurrent DoFns. To decrease the number of written files, one can increase their triggering frequency or reduce their parallelism.

@github-actions github-actions bot added the build label Oct 1, 2024
Copy link
Contributor

github-actions bot commented Oct 1, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ahmedabu98
Copy link
Contributor Author

assign set of reviewers

Copy link
Contributor

github-actions bot commented Oct 1, 2024

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.
R: @Abacn for label build.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Contributor

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. LGTM.


static final long DEFAULT_MAX_BYTES_PER_FILE = (1L << 40); // 1TB
// Used for auto-sharding in streaming. Limits number of records per batch/file
private static final int FILE_TRIGGERING_RECORD_COUNT = 100_000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These constants were determined by experimentation or by looking at another sink implementation ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's taken from WriteFiles:

// The record count and buffering duration to trigger flushing records to a tmp file. Mainly used
// for writing unbounded data to avoid generating too many small files.
public static final int FILE_TRIGGERING_RECORD_COUNT = 100000;
public static final int FILE_TRIGGERING_BYTE_COUNT = 64 * 1024 * 1024; // 64MiB as of now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BigQuery batch loads is similar but has a greater record count limit (500,000):

// If user triggering is supplied, we will trigger the file write after this many records are
// written.
static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
// If user triggering is supplied, we will trigger the file write after this many bytes are
// written.
static final int DEFAULT_FILE_TRIGGERING_BYTE_COUNT =
AsyncWriteChannelOptions.UPLOAD_CHUNK_SIZE_DEFAULT; // 64MiB as of now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea in a follow up PR to expose record and byte count, in case the user wants more flexibility. Also not sure if we want this current default of 100000 to be different from the old default of TRIGGERING_RECORD_COUNT=50,000

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for ManagedIO in general, it might be good to limit the number of knobs we expose. The idea is for Beam/runner to find reasonable optimal values and manage it on behalf of users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to not exposing it (at least not from the get-go)

Also not sure if we want this current default of 100000 to be different from the old default of TRIGGERING_RECORD_COUNT=50,000

Before a recent PR (#32451), the old default actually wasn't used anywhere. This IO is still pretty new and we haven't stress tested it yet to see what's most optimal. I figured a good starting point would be to follow WriteFiles (100,000) because it's essentially the same function.

@chamikaramj
Copy link
Contributor

cc: @Naireen @dustin12 in case Dataflow streaming team has additional comments on this.

@chamikaramj
Copy link
Contributor

BTW this should block the release IMO since it's an update incompatible change on top of another unreleased update incompatible change.

@liferoad
Copy link
Collaborator

liferoad commented Oct 1, 2024

Let us update CHANGES.md

@robertwb
Copy link
Contributor

robertwb commented Oct 1, 2024

Does this introduce an additional shuffle (and if so, are we OK with that)?

@ahmedabu98
Copy link
Contributor Author

@robertwb does GroupIntoBatches count as a shuffle? if so then yeah that's the cost here

I think it's a better alternative than writing a huge amount of small files though -- the difference is pretty noticeable. We also use GroupIntoBatches for other performant IOs (all BigQueryIO writes, FileIO, TextIO). Specifically for Iceberg, the table format's query planning is sensitive to the number of files. Some references: [1], [2], [3].

I was thinking of instead adding a step (after file writes) that merges files together. Iceberg does provide a Spark operation that merges files across the entire table (compaction), but I couldn't find anything more light-weight.

@chamikaramj
Copy link
Contributor

I don't think Beam "GroupIntoBatches" introduces a shuffle but I suspect Dataflow would introduce a shuffle/re-shard to make auto-sharding work (not sure how costly that is). Agree with Ahmed that benefits here seems to outweigh the associated cost. We did something very similar to BQ streaming sink to reduce the number of output streams (went from a manually configured 50 shards to auto-sharding). In practice I think, either we would introduce the sharding here or customers would have to add that manually to their pipelines. I prefer the former.

@robertwb
Copy link
Contributor

robertwb commented Oct 3, 2024

OK, we can go with that.

@ahmedabu98 ahmedabu98 merged commit d84cfff into apache:master Oct 4, 2024
22 checks passed
ahmedabu98 added a commit to ahmedabu98/beam that referenced this pull request Oct 4, 2024
* [Managed Iceberg] add GiB autosharding

* trigger iceberg integration tests

* fix test

* add to CHANGES.md

* increase GiB limits

* increase GiB limits

* data file size distribution metric; max file size 512mb
Abacn pushed a commit that referenced this pull request Oct 4, 2024
* [Managed Iceberg] add GiB autosharding

* trigger iceberg integration tests

* fix test

* add to CHANGES.md

* increase GiB limits

* increase GiB limits

* data file size distribution metric; max file size 512mb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants