[FLINK-16497][FLINK-16496][FLINK-16495][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box #12536

wuchong · 2020-06-09T02:59:41Z

What is the purpose of the change

As discussed in the JIRA issue (mainly in FLINK-16497), the current flush behavior is depending on a large buffer size, but no flush interval which lead to no outputs.

Brief change log

JDBC:

sink.buffer-flush.max-rows = 100  // previous: 5000
sink.buffer-flush.interval = 1s // previous: disabled

HBase:

sink.buffer-flush.max-size = 2mb // previous: 2mb
sink.buffer-flush.max-rows = 1000 // previous: disabled
sink.buffer-flush.interval = 1s // previous: disabled

Elasticsearch:

sink.bulk-flush.max-actions = 1000 // previous: 1000
sink.bulk-flush.max-size = 2mb // previous: 5mb
sink.bulk-flush.interval = 1s // previous: disabled

All the options can be disabled by setting to 0.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

wuchong · 2020-06-09T03:00:22Z

cc @dawidwys for the Elasticsearch part.

flinkbot · 2020-06-09T03:03:05Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 64dd9b7 (Tue Jun 09 03:03:05 UTC 2020)

✅no warnings

_{Mention the bot in a comment to re-run the automated checks.}

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into to Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-06-09T03:13:29Z

CI report:

4ea8c9c Azure: FAILURE
8555e43 Azure: PENDING

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

dawidwys

I reviewed the ES part.

dawidwys · 2020-06-09T06:06:21Z

...ava/org/apache/flink/streaming/connectors/elasticsearch/table/Elasticsearch6DynamicSink.java

-			config.getBulkFlushMaxSize().ifPresent(builder::setBulkFlushMaxSizeMb);
-			config.getBulkFlushInterval().ifPresent(builder::setBulkFlushInterval);
+			builder.setBulkFlushMaxActions(config.getBulkFlushMaxActions());
+			builder.setBulkFlushMaxSizeMb((int) (config.getBulkFlushMaxByteSize() >> 20));


I wanted to have any formatting/transformation in one place in the ElasticsearchOptions. Can we return Mb from config.getBulkFlushMaxByteSize?

I though about this, however, I think we should validate the value is in MB granularity. If we remove the remainder, it will be harder to check it.

Then return the MemorySize instead? Without copying the shifting logic which is implemented inside of the MemorySize.

But we have to move the -1 logic out of it then.

right, let's leave it as it is then.

dawidwys · 2020-06-09T06:07:32Z

...va/org/apache/flink/streaming/connectors/elasticsearch/table/ElasticsearchConfiguration.java

+	public long getBulkFlushInterval() {
+		long interval = config.get(BULK_FLUSH_INTERVAL_OPTION).toMillis();
+		// convert 0 to -1, because Elasticsearch client use -1 to disable this configuration.
+		return interval == 0 ? -1 : interval;


Why not return empty in case it is set to 0? I prefer explicitness of Optional instead of magic values. Plus we would not need to change the Elasticsearch connector code for that.

Because Elasticsearch client has a default value for them (1000 rows, 5mb), we have to use -1 to disable them, otherwise the client will use the default value.

The bridge logic seems weird, why not just let user set -1 which is synced with ES way. 0 is not a good candidate for default value.

Unfortunately, Duration and MemorySize ConfigOptions doesn't allow negative values, e.g. -1. That's why I picked zero. I think zero is used as a disabled value in the many places, e.g. Kafka batch.size, and watermark interval.

danny0405 · 2020-06-10T10:04:24Z

...nector-jdbc/src/main/java/org/apache/flink/connector/jdbc/table/JdbcDynamicTableFactory.java

-			checkState(config.getOptional(PASSWORD).isPresent(),
-				"Database username must be provided when database password is provided");
-		}
+		checkAllOrNone(config, new ConfigOption[]{


Validation for URL and TABLE_NAME is not necessary because the find service already does that.

danny0405 · 2020-06-10T10:16:41Z

...onnector-hbase/src/main/java/org/apache/flink/connector/hbase/options/HBaseWriteOptions.java

 		private long bufferFlushMaxSizeInBytes = ConnectionConfiguration.WRITE_BUFFER_SIZE_DEFAULT;
-		private long bufferFlushMaxRows = -1;
-		private long bufferFlushIntervalMillis = -1;
+		private long bufferFlushMaxRows = 0;


I would prefer -1 as the default value, which means the value is not set. 0 seems to mean "never flush" .

I have tried, but Duration and MemorySize ConfigOptions doesn't allow negative values.

wuchong · 2020-06-11T03:14:20Z

Thanks for the review @danny0405 , I have addressed the comments.

danny0405

+1, overall looks good, except for that limitation that ConfigOption does not allow -1 as a fallback value, that would mean we can not setup forbidden flag for all the SQL connectors, which i think is not that user-friendly, maybe we should fire a JIRA issue there and have some discussion.

wuchong · 2020-06-11T04:29:39Z

Agree with you @danny0405 , I created FLINK-18245 to track this. We can allow -1 in the future if it is supported.

…keep the original constraint name

…C sink for better out-of-box The default flush strategy for old JDBC sink is no flush interval and 5000 buffered rows. The new default flush strategy for new JDBC sink is '1s' flush interval and '100' buffered rows.

…ase sink for better out-of-box The default flush strategy for old HBase sink is no flush interval and 2MB buffered size. The new default flush strategy for new HBase sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size.

…r new Elasticsearch sink for better out-of-box The default flush strategy for old Elasticsearch sink is no flush interval and 5MB buffered size and 1000 rows. The new default flush strategy for new Elasticsearch sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size.

wuchong · 2020-06-11T09:00:50Z

Build is passed in my repo: https://dev.azure.com/imjark/Flink/_build/results?buildId=162&view=results
Merging...

…ase sink for better out-of-box The default flush strategy for old HBase sink is no flush interval and 2MB buffered size. The new default flush strategy for new HBase sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size. This closes #12536

…r new Elasticsearch sink for better out-of-box The default flush strategy for old Elasticsearch sink is no flush interval and 5MB buffered size and 1000 rows. The new default flush strategy for new Elasticsearch sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size. This closes #12536

…C sink for better out-of-box The default flush strategy for old JDBC sink is no flush interval and 5000 buffered rows. The new default flush strategy for new JDBC sink is '1s' flush interval and '100' buffered rows. This closes #12536

…ase sink for better out-of-box The default flush strategy for old HBase sink is no flush interval and 2MB buffered size. The new default flush strategy for new HBase sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size. This closes #12536

…r new Elasticsearch sink for better out-of-box The default flush strategy for old Elasticsearch sink is no flush interval and 5MB buffered size and 1000 rows. The new default flush strategy for new Elasticsearch sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size. This closes #12536

…C sink for better out-of-box The default flush strategy for old JDBC sink is no flush interval and 5000 buffered rows. The new default flush strategy for new JDBC sink is '1s' flush interval and '100' buffered rows. This closes apache#12536

…ase sink for better out-of-box The default flush strategy for old HBase sink is no flush interval and 2MB buffered size. The new default flush strategy for new HBase sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size. This closes apache#12536

…r new Elasticsearch sink for better out-of-box The default flush strategy for old Elasticsearch sink is no flush interval and 5MB buffered size and 1000 rows. The new default flush strategy for new Elasticsearch sink is '1s' flush interval and '1000' buffered rows and '2mb' buffered size. This closes apache#12536

rmetzger added the review=description? label Jun 9, 2020

wuchong changed the title ~~[FLINK-16497][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box~~ [FLINK-16497][FLINK-16496][FLINK-16495][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box Jun 9, 2020

wuchong force-pushed the sinks-default-flush branch from 64dd9b7 to 52e932c Compare June 9, 2020 03:03

rmetzger added component=Connectors/ElasticSearch component=TableSQL/Ecosystem labels Jun 9, 2020

dawidwys reviewed Jun 9, 2020

View reviewed changes

wuchong force-pushed the sinks-default-flush branch from c2a2337 to d0602c5 Compare June 10, 2020 06:47

danny0405 reviewed Jun 10, 2020

View reviewed changes

danny0405 approved these changes Jun 11, 2020

View reviewed changes

wuchong added 4 commits June 11, 2020 12:33

[hotfix][table-common] Fix TableSchemaUtils#getPhysicalSchema should …

c7cda60

…keep the original constraint name

wuchong force-pushed the sinks-default-flush branch from 4ea8c9c to 8555e43 Compare June 11, 2020 04:37

wuchong closed this in 6c2ff97 Jun 11, 2020

wuchong deleted the sinks-default-flush branch June 11, 2020 09:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-16497][FLINK-16496][FLINK-16495][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box #12536

[FLINK-16497][FLINK-16496][FLINK-16495][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box #12536

wuchong commented Jun 9, 2020

wuchong commented Jun 9, 2020

flinkbot commented Jun 9, 2020

flinkbot commented Jun 9, 2020 •

edited

dawidwys left a comment

dawidwys Jun 9, 2020

wuchong Jun 9, 2020

dawidwys Jun 9, 2020

wuchong Jun 9, 2020

dawidwys Jun 9, 2020

dawidwys Jun 9, 2020

wuchong Jun 9, 2020

dawidwys Jun 9, 2020

danny0405 Jun 10, 2020 •

edited

wuchong Jun 10, 2020

danny0405 Jun 10, 2020

danny0405 Jun 10, 2020

wuchong Jun 10, 2020

wuchong commented Jun 11, 2020

danny0405 left a comment

wuchong commented Jun 11, 2020

wuchong commented Jun 11, 2020

[FLINK-16497][FLINK-16496][FLINK-16495][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box #12536

[FLINK-16497][FLINK-16496][FLINK-16495][connectors] Improve default flush strategy for JDBC/Elasticsearch/HBase sink for better out-of-box #12536

Conversation

wuchong commented Jun 9, 2020

What is the purpose of the change

Brief change log

Does this pull request potentially affect one of the following parts:

Documentation

wuchong commented Jun 9, 2020

flinkbot commented Jun 9, 2020

Automated Checks

Review Progress

flinkbot commented Jun 9, 2020 • edited

CI report:

dawidwys left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danny0405 Jun 10, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wuchong commented Jun 11, 2020

danny0405 left a comment

Choose a reason for hiding this comment

wuchong commented Jun 11, 2020

wuchong commented Jun 11, 2020

flinkbot commented Jun 9, 2020 •

edited

danny0405 Jun 10, 2020 •

edited