
Enforce a size limit on StringSetData #32650

Merged: 2 commits merged into apache:master on Oct 8, 2024

Conversation

@Abacn (Contributor) commented Oct 4, 2024

Fix #32649

  • Make the StringSetData backing set mutable. This avoids copying into a new ImmutableSet on every update (a minimal sketch follows below).
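
For illustration only (this is not the actual Beam code; the class, field, and method names below are hypothetical, and Guava's ImmutableSet is assumed on the classpath), a minimal sketch of why a mutable backing set avoids the per-add copy:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import com.google.common.collect.ImmutableSet;

    // Hypothetical sketch, not Beam's StringSetData implementation.
    class StringSetDataSketch {
      // Before: an immutable backing set forces a full copy on every add.
      private ImmutableSet<String> immutableStrings = ImmutableSet.of();

      void addWithCopy(String element) {
        // O(n) work per added element: rebuild the whole set.
        immutableStrings =
            ImmutableSet.<String>builder().addAll(immutableStrings).add(element).build();
      }

      // After: a mutable, thread-safe set makes each add roughly O(1) with no copying.
      private final Set<String> mutableStrings = ConcurrentHashMap.newKeySet();

      void addMutable(String element) {
        mutableStrings.add(element);
      }
    }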


@Abacn (Contributor, Author) commented Oct 4, 2024

Tested TextIOIT: write then read 100,000 files

  • user counter elements_read shows 3,754,869 but should be 100,000,000; counter updates are still dropped after the overall stringset grows

  • "Elements added 100,000" and job progress both work while the job is running

  • it emits the following warning log:

    WARNING 2024-10-04T03:25:05.938Z StringSetData reaches capacity. Current size: 100023, last element size: 77. Further incoming elements will be dropped.

Dataflow job id: 2024-10-03_20_11_47-11842727745111768860

In contrast, the same TextIOIT running on master got stuck (for 20-25 min) at adding to StringSetData (see also #32649):

[screenshot]

Then, during read, the Dataflow UI counters did not update (other than the throughput chart):

[screenshot]

job id 2024-10-03_20_13_23-15614223084919276642

@Abacn force-pushed the stringsetcap branch 3 times, most recently from ad83b4d to d8c1399, on October 4, 2024 13:38
@Abacn (Contributor, Author) commented Oct 4, 2024

After more testing, I changed the cap to 1 MB. This should be good for roughly 10k elements (about 100 bytes per element).

I also tested on a simple pipeline: GenerateSequence -> report Lineage on every element.

HEAD (jobid: 2024-10-04_06_15_24-8505055598897377194)

Throughput was several hundred elements/s, and Cloud Profiler shows that the stringset add takes most of the time:

[screenshot]

This PR (job id: 2024-10-04_06_23_39-4373466717598467131)

The pipeline finishes immediately:

[screenshot]

Saw log:

StringSetData reaches capacity. Current size: 1000004, last element size: 58. Further incoming elements will be dropped.

This confirms the cap is effective.

The user counter works: elements_read shows 1,000,000.

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time
@Abacn marked this pull request as ready for review on October 4, 2024 16:58
github-actions bot commented Oct 4, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

// Check capacity both before insert and after inserting one element, so the warning is only emitted once.
if (currentSize > STRING_SET_SIZE_LIMIT) {
  LOG.warn(
      "StringSetData reaches capacity. Current size: {}, last element size: {}. Further "
Review comment (Contributor):

This sounds a bit cryptic for the user, and dropping elements sounds scary. Could we have a message that is informative and actionable from the user's perspective?

@Abacn (Contributor, Author) replied Oct 4, 2024:

I changed the wording to make clear this is about a metric (not the actual data being processed). This is not really actionable by the end user; it happens when there are lots of sources/sinks in the pipeline and the Beam metrics system cannot record all of them. For the current occurrence (file-based IO) it should be largely mitigated by #32662. The remaining case -- reading lots of files from a match that does not share a common directory -- still needs this guard to avoid breaking the metrics system.

Review comment (Contributor):

I am a little worried about this specific change (silent drop) and the possible consequences it can have. A silent drop can cause more confusion because it can go unnoticed (as things stand, most customers don't look at job logs as long as the job is working fine, which will be the case here). Without the drop, the job metrics degrade, which is bad in itself but is noticeable by the customer and actionable.

Also, the StringSet metric is a core metric in Beam; Lineage is just built on top of it and uses it. So Lineage-specific changes should not be pushed down into the StringSet metric itself, since a customer can use the StringSet metric to build their own metric of this type for something else.

The root of the issue here is the size limit:

  • 20 MB ReportWorkItem limit
  • 32 MB GFE limit (even if we bump up the ReportWorkItem limit)

This limit gets hit whenever the size of a metric (irrespective of metric type) becomes too large. Generally this happens when metrics are emitted per record (leading to an increase in data size). See a previous case where counter metrics caused this: https://buganizer.corp.google.com/issues/195841708#comment42 -- in that case the customer was recommended not to emit per record.

The challenge here is that in the above case, and in cases where the customer does this themselves, we can suggest that they change the usage pattern (limiting, but that is the approach we have for now). In the case of Lineage, however, which uses the StringSet metric, the emission is done by the system itself and the customer has no control over it.

As you said, the current occurrence (file-based IO) should be largely mitigated by #32662 (thank you for the prompt change). We still need to solve "reading lots of files from a match that does not have a common directory"; see below.

Beyond the above change for file-based IO, my thoughts are:

  1. We should not limit the size of one particular metric type like StringSet. If we decide to do this to protect customers from issues arising from the above size limits, it should be a limit on the size of the metric being reported, applicable to all metrics for consistent behavior, rather than to one particular metric type.
  2. For Lineage specifically, if we want to limit the cardinality of the reported lineage information to fit within the size limit, the limit should be applied to the Lineage being reported. For example, possible approaches are: a) limit the number of unique Dataplex FQNs that can be reported as sources/sinks in a job; b) do some grouping based on FQN patterns, e.g. if the total size > n, report one level up, etc.

Also cc: @robertwb if he has any specific thoughts/views on this.

Review comment (Contributor):

Also, if we just want to enforce a stop-gap size check for Lineage rather than doing it at the StringSet metric level, we can do it in the Lineage class to enforce that the number of sources/sinks and/or their data size stays under some threshold x (a rough sketch follows below):
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/Lineage.java#L33
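
For illustration, one way such a stop-gap could look (a hypothetical class, not Beam's actual Lineage; as the follow-up comments note, counting add attempts does not capture the deduplicated set size):

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical stop-gap sketch at the Lineage level, not Beam's actual Lineage class.
    final class LineageCapSketch {
      private static final int MAX_REPORTED = 10_000; // illustrative threshold "x"
      private final AtomicInteger attemptedReports = new AtomicInteger();

      // Returns true if another FQN may be forwarded to the underlying StringSet metric.
      // Note: this counts add attempts rather than unique elements, which is exactly the
      // visibility problem raised in the discussion below.
      boolean shouldReportAnother() {
        return attemptedReports.incrementAndGet() <= MAX_REPORTED;
      }
    }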

@Abacn (Contributor, Author) replied Oct 4, 2024:

Thanks for the comments. The trigger of the issue was a large Lineage, but the cause is that the Beam metrics implementation cannot handle large metrics (in total size), so the change is proposed in StringSetCell (the actual place where the worker stores the metrics before reporting them). We can of course revisit the change once the limitation is fixed in the implementations. For now the goal is to make sure the whole metrics system does not break due to a large stringset.

@rohitsinha54 (Contributor) left a review comment:

Besides the ongoing discussion about limiting the stringset size, the other changes LGTM.

Thanks for the prompt fix. Appreciate it a ton!

@rohitsinha54 (Contributor) left a review comment:

Can we move the size enforcement to the Lineage class rather than forcing it for a particular metric type only, which makes it inconsistent with the others?

github-actions bot commented Oct 4, 2024

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn (Contributor, Author) commented Oct 4, 2024

Can we move the size enforcement to the Lineage class rather than forcing it for a particular metric type only, which makes it inconsistent with the others?

Actually, I found this is impossible. Lineage is a wrapper around StringSet, whose actual implementation is DelegateStringSet, backed by the runner's StringSetData. The Lineage class has no visibility into the current stringset size, and even if we added tracking (e.g. a static int) of how many strings we put into Lineage, we would not know the current set size, because we do not know whether a newly added element already exists in the set.

@rohitsinha54 (Contributor) commented:

Can we move the size enforcement to the Lineage class rather than forcing it for a particular metric type only, which makes it inconsistent with the others?

Actually, I found this is impossible. Lineage is a wrapper around StringSet, whose actual implementation is DelegateStringSet, backed by the runner's StringSetData. The Lineage class has no visibility into the current stringset size, and even if we added tracking (e.g. a static int) of how many strings we put into Lineage, we would not know the current set size, because we do not know whether a newly added element already exists in the set.

Thank you for considering and exploring this. This seems like a correct description of the current implementation. We can incrementally add more to support the above need.

Here is what I am thinking: StringSet can expose an API, size(), which gives the current size of the StringSet metric. This would of course be backed by DelegateStringSet and the runner's StringSetData. It would allow the Lineage class to get visibility into the current size and to enforce a limit based on it. More generally, other consumers/users of the StringSet metric might want to know the current size of the set.

At large, besides the two points mentioned in #32650 (comment), I believe a better design is to have the StringSet metric expose APIs and features enabling consumers such as Lineage to use the metric for their use cases (in this case a size limit), bottom up, rather than top down where the consumer's use case is built into the metric itself. We want Beam's core metrics to be generic.

Also, the metric size limit is a Dataflow-specific limitation based on the ReportWorkItem/GFE limits. Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.
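
A hypothetical sketch of the size() idea above (interface and class names are illustrative, not Beam's actual StringSet or Lineage APIs):

    // Hypothetical shapes only, illustrating the proposal in the comment above.
    interface StringSetMetricSketch {
      void add(String value);

      long size(); // proposed: expose the current number of elements in the set
    }

    class LineageSketch {
      private static final long MAX_FQNS = 10_000L; // illustrative cap
      private final StringSetMetricSketch sources;

      LineageSketch(StringSetMetricSketch sources) {
        this.sources = sources;
      }

      void addSource(String fqn) {
        // The limit lives in Lineage; the StringSet metric itself stays generic.
        if (sources.size() < MAX_FQNS) {
          sources.add(fqn);
        }
      }
    }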

@Abacn (Contributor, Author) commented Oct 4, 2024

Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.

Though I didn't dig into it, I suspect there is a limit that is not specific to Dataflow. Under the portable framework, where the runner API is backed by gRPC messages, the response size when a client queries for job status should have some limit.

@rohitsinha54 (Contributor) commented:

Other runners might or might not have it. We should avoid coupling it with Beam's core metrics.

Though I didn't dig into it, I suspect there is a limit that is not specific to Dataflow. Under the portable framework, where the runner API is backed by gRPC messages, the response size when a client queries for job status should have some limit.

That makes sense.

@rohitsinha54 (Contributor) commented:

WDYT of breaking this PR in two parts:

  1. Making StringSetData set mutable
  2. Size limit enforcement (This can go in later separately)

@Abacn (Contributor, Author) commented Oct 4, 2024

WDYT of breaking this PR in two parts:

  1. Making StringSetData set mutable
  2. Size limit enforcement (This can go in later separately)

Yeah, this PR essentially contains two fixes, corresponding to the two related issues in #32649. The mutable set fixes the slowness. However, we still need the limit enforcement to fix the metrics breakage. Even with #32662 in, the metrics system will still break for FileIO read patterns like gs://bucket/folder/**/some_data.txt, so I think both fixes are needed for the current release.

@rohitsinha54 (Contributor) commented:

WDYT of breaking this PR in two parts:

  1. Making StringSetData set mutable
  2. Size limit enforcement (This can go in later separately)

Yeah, this PR essentially contains two fixes, corresponding to the two related issues in #32649. The mutable set fixes the slowness. However, we still need the limit enforcement to fix the metrics breakage. Even with #32662 in, the metrics system will still break for FileIO read patterns like gs://bucket/folder/**/some_data.txt, so I think both fixes are needed for the current release.

Yes, correct. I was thinking we could submit "Making StringSetData set mutable" today, but it is up to you if you want to keep them together.

@liferoad (Collaborator) commented Oct 8, 2024

Can we do the minimal change here to unblock Beam 2.60.0? Later, we can keep improving. @rohitsinha54 @tvalentyn

@liferoad (Collaborator) commented Oct 8, 2024

Please update CHANGES.md to mention the Beam versions impacted by this issue. @Abacn

@Abacn (Contributor, Author) commented Oct 8, 2024

Please update CHANGES.md to mention the Beam versions impacted by this issue. @Abacn

Done in #32664. The issue was introduced in Beam 2.59.0.

@rohitsinha54 (Contributor) commented:

(quoting the earlier exchange above about moving the size enforcement to the Lineage class and exposing a size() API on StringSet)

We discussed this offline.

Here is a summary of the discussion, for the record. If size is exposed based on StringSetData, then the size will be for that specific bundle, since one metric may have multiple metrics containers (backed by different StringSetData instances). This can confuse users if exposed at the API level, and it also breaks the abstraction that Beam as a programming model strives to provide.

The current size enforcement also has this issue, but it is not exposed for users to consume. The StringSetData limit happens to work with FileIO because FileIO currently reports source and sink metrics in a loop, so it happens to write to a single metrics container on a single worker. Hence the size enforcement is tied to the FileIO implementation. So this is not an ideal fix, but we need to solve the FileIO issue with large numbers of files somehow.

I would propose dropping the limit enforcement on StringSet completely anyway. We will limit the number of reported files in FileIO itself to ensure we do not hit the limit. This avoids tying FileIO's implementation to a metric type.

See this comment: #32662 (comment)
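
To make the per-bundle point above concrete, a toy sketch (types and names are hypothetical, not Beam's MetricsContainer machinery):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Toy model: one logical StringSet metric may be backed by several per-bundle containers.
    class BundleContainerSketch {
      final Set<String> stringSetData = new HashSet<>();
    }

    class SizeVisibilitySketch {
      // A size check inside a single container only sees that bundle's elements...
      static int perBundleSize(BundleContainerSketch container) {
        return container.stringSetData.size();
      }

      // ...while the value reported for the metric is the union across containers,
      // so a per-StringSetData cap is not a global cap on the metric.
      static int reportedSize(List<BundleContainerSketch> containersForOneMetric) {
        Set<String> union = new HashSet<>();
        for (BundleContainerSketch c : containersForOneMetric) {
          union.addAll(c.stringSetData);
        }
        return union.size();
      }
    }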

@Abacn (Contributor, Author) commented Oct 8, 2024

We communicated offline and decided to keep the size limit, as it is parallel to #32662. Even with the known issue fixed in #32662, the size limit could still be hit.

@Abacn merged commit 8a6f248 into apache:master on Oct 8, 2024
105 checks passed
@Abacn deleted the stringsetcap branch on October 8, 2024 23:31
Abacn added a commit to Abacn/beam that referenced this pull request Oct 8, 2024
* Enforce a size limit on StringSetData

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time

* adjust warning log
damccorm pushed a commit that referenced this pull request Oct 9, 2024
* Enforce a size limit on StringSetData

* Make StringSetData set mutable. This avoids
  copy and create new ImutableSet every time

* adjust warning log

Successfully merging this pull request may close these issues.

[Bug]: Slowness and/or broken metrics visualization when Lineage metrics is large
4 participants