[Managed Iceberg] add GiB autosharding #32612

ahmedabu98 · 2024-10-01T13:28:47Z

Adds auto-sharding to Iceberg streaming writes using GroupIntoBatches.

In streaming writes, bundles are often very small and can even be single elements. We write each bundle to a file, so this behavior can lead to many small files.

To solve this, we group records into batches bounded by a triggering frequency (as well as record-count and byte-size limits). This makes the number of written data files easier to control: in each triggering interval, roughly N data files are written, where N is the number of concurrent DoFns. To write fewer files, one can increase the triggering frequency value (a longer interval between triggers) or reduce the parallelism.
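As a rough illustration of the batching idea described above, here is a minimal stand-alone sketch. It does not use Beam or Iceberg APIs: `BatchingSketch`, `batch`, and the limits passed to it are hypothetical names and values chosen for the example, and the flush rule (emit a batch as soon as either limit is reached, then flush the remainder when the triggering interval ends) is a simplified model of the behavior, not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone sketch; it does not use Beam or Iceberg APIs. */
final class BatchingSketch {

  /**
   * Groups the records that arrive during one triggering interval into batches.
   * Returns the number of records in each batch (one batch would become one file).
   */
  static List<Integer> batch(int[] recordSizesBytes, int maxRecords, long maxBytes) {
    List<Integer> batches = new ArrayList<>();
    int count = 0;
    long bytes = 0;
    for (int size : recordSizesBytes) {
      count++;
      bytes += size;
      // Flush as soon as either the record limit or the byte limit is reached.
      if (count >= maxRecords || bytes >= maxBytes) {
        batches.add(count);
        count = 0;
        bytes = 0;
      }
    }
    // The triggering interval ended: flush whatever is still buffered.
    if (count > 0) {
      batches.add(count);
    }
    return batches;
  }

  public static void main(String[] args) {
    int[] sizes = new int[10];
    Arrays.fill(sizes, 100); // ten records of 100 bytes each

    // Record limit of 4 is hit first.
    System.out.println(batch(sizes, 4, 1_000_000L)); // prints [4, 4, 2]
    // Byte limit of 300 is hit first.
    System.out.println(batch(sizes, 100, 300L)); // prints [3, 3, 3, 1]
  }
}
```

Without batching, ten single-element bundles would produce ten files; in this model the same records produce three or four, depending on which limit applies.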

github-actions · 2024-10-01T14:06:10Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

ahmedabu98 · 2024-10-01T17:05:47Z

assign set of reviewers

github-actions · 2024-10-01T17:06:57Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.
R: @Abacn for label build.
R: @chamikaramj for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

chamikaramj

Thanks. LGTM.

chamikaramj · 2024-10-01T17:46:23Z

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WriteToDestinations.java


-  static final long DEFAULT_MAX_BYTES_PER_FILE = (1L << 40); // 1TB
+  // Used for auto-sharding in streaming. Limits number of records per batch/file
+  private static final int FILE_TRIGGERING_RECORD_COUNT = 100_000;


Were these constants determined by experimentation or by looking at another sink implementation?

It's taken from WriteFiles:

beam/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java

Lines 149 to 152 in 6a7ffa5

// The record count and buffering duration to trigger flushing records to a tmp file. Mainly used
// for writing unbounded data to avoid generating too many small files.
public static final int FILE_TRIGGERING_RECORD_COUNT = 100000;
public static final int FILE_TRIGGERING_BYTE_COUNT = 64 * 1024 * 1024; // 64MiB as of now

BigQuery batch loads is similar but has a greater record count limit (500,000):

beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java

Lines 120 to 126 in 6a7ffa5

// If user triggering is supplied, we will trigger the file write after this many records are
// written.
static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
// If user triggering is supplied, we will trigger the file write after this many bytes are
// written.
static final int DEFAULT_FILE_TRIGGERING_BYTE_COUNT =
    AsyncWriteChannelOptions.UPLOAD_CHUNK_SIZE_DEFAULT; // 64MiB as of now

It might be a good idea in a follow up PR to expose record and byte count, in case the user wants more flexibility. Also not sure if we want this current default of 100000 to be different from the old default of TRIGGERING_RECORD_COUNT=50,000

I think for ManagedIO in general, it might be good to limit the number of knobs we expose. The idea is for Beam/runner to find reasonable optimal values and manage it on behalf of users.

+1 to not exposing it (at least not from the get-go)

Also not sure if we want this current default of 100000 to be different from the old default of TRIGGERING_RECORD_COUNT=50,000

Before a recent PR (#32451), the old default actually wasn't used anywhere. This IO is still pretty new and we haven't stress tested it yet to see what's most optimal. I figured a good starting point would be to follow WriteFiles (100,000) because it's essentially the same function.
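For reference, a small sketch of how the record and byte limits quoted in this thread interact. It only does arithmetic on the WriteFiles and BatchLoads constants shown above (100,000 or 500,000 records, 64 MiB); the class and method names are hypothetical, and these are not necessarily the values this PR finally merged with.

```java
/** Hypothetical helper; uses only the constants quoted in this thread. */
final class TriggerLimits {
  static final int WRITE_FILES_RECORD_COUNT = 100_000; // WriteFiles
  static final int BATCH_LOADS_RECORD_COUNT = 500_000; // BigQuery BatchLoads
  static final long BYTE_COUNT = 64L * 1024 * 1024; // 64 MiB

  /** Average record size (bytes, rounded down) at which both limits are reached together. */
  static long breakEvenRecordBytes(long byteLimit, int recordLimit) {
    return byteLimit / recordLimit;
  }

  /** Which limit a batch of uniformly sized records reaches first. */
  static String firstLimitHit(long avgRecordBytes, long byteLimit, int recordLimit) {
    return avgRecordBytes * recordLimit >= byteLimit ? "bytes" : "records";
  }

  public static void main(String[] args) {
    System.out.println(breakEvenRecordBytes(BYTE_COUNT, WRITE_FILES_RECORD_COUNT)); // prints 671
    System.out.println(breakEvenRecordBytes(BYTE_COUNT, BATCH_LOADS_RECORD_COUNT)); // prints 134
    System.out.println(firstLimitHit(1024, BYTE_COUNT, WRITE_FILES_RECORD_COUNT)); // prints bytes
    System.out.println(firstLimitHit(100, BYTE_COUNT, WRITE_FILES_RECORD_COUNT)); // prints records
  }
}
```

With these numbers, the record count only matters for small records (below roughly 671 bytes on average for the 100,000 limit); for larger records the byte limit closes the batch first.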

chamikaramj · 2024-10-01T18:10:51Z

cc: @Naireen @dustin12 in case Dataflow streaming team has additional comments on this.

chamikaramj · 2024-10-01T18:18:09Z

BTW this should block the release IMO since it's an update incompatible change on top of another unreleased update incompatible change.

liferoad · 2024-10-01T20:07:28Z

Let us update CHANGES.md

robertwb · 2024-10-01T20:57:40Z

Does this introduce an additional shuffle (and if so, are we OK with that)?

ahmedabu98 · 2024-10-01T22:15:26Z

@robertwb Does GroupIntoBatches count as a shuffle? If so, then yes, that's the cost here.

I think it's a better alternative than writing a huge number of small files though; the difference is pretty noticeable. We also use GroupIntoBatches for other performant IOs (all BigQueryIO writes, FileIO, TextIO). Specifically for Iceberg, the table format's query planning is sensitive to the number of files. Some references: [1], [2], [3].

I was thinking of instead adding a step (after file writes) that merges files together. Iceberg does provide a Spark operation that merges files across the entire table (compaction), but I couldn't find anything more light-weight.

chamikaramj · 2024-10-01T22:53:16Z

I don't think Beam "GroupIntoBatches" introduces a shuffle, but I suspect Dataflow would introduce a shuffle/re-shard to make auto-sharding work (not sure how costly that is). I agree with Ahmed that the benefits here seem to outweigh the associated cost. We did something very similar for the BQ streaming sink to reduce the number of output streams (went from a manually configured 50 shards to auto-sharding). In practice, I think either we introduce the sharding here or customers would have to add it manually to their pipelines. I prefer the former.

robertwb · 2024-10-03T22:46:43Z

OK, we can go with that.

* [Managed Iceberg] add GiB autosharding
* trigger iceberg integration tests
* fix test
* add to CHANGES.md
* increase GiB limits
* increase GiB limits
* data file size distribution metric; max file size 512mb

[Managed Iceberg] add GiB autosharding

384d8f6

github-actions bot added java io labels Oct 1, 2024

trigger iceberg integration tests

a930233

github-actions bot added the build label Oct 1, 2024

fix test

7a8f305

github-actions bot added the Next Action: Reviewers label Oct 1, 2024

chamikaramj added this to the 2.60.0 Release milestone Oct 1, 2024

chamikaramj reviewed Oct 1, 2024

Naireen reviewed Oct 1, 2024

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

b1bd36c

…erg_autosharding

add to CHANGES.md

00817bc

ahmedabu98 added 3 commits October 2, 2024 18:13

increase GiB limits

2c674a3

increase GiB limits

47436f0

data file size distribution metric; max file size 512mb

9955868

ahmedabu98 merged commit d84cfff into apache:master Oct 4, 2024
22 checks passed

ahmedabu98 mentioned this pull request Oct 4, 2024

Cherrypick iceberg autosharding #32655

Closed

ahmedabu98 mentioned this pull request Oct 4, 2024

CP iceberg autosharding #32663

Merged

ahmedabu98 mentioned this pull request Oct 11, 2024

[Bug]: IcebergIO - Write performance issues #32746

Open

17 tasks
