KAFKA-8334 Make sure the thread which tries to complete delayed reque… #8657
Conversation
@rajinisivaram @junrao @windkit please take a look :)
Force-pushed from 507196d to 51884d0.
@chia7712 : Thanks for the PR. Sorry for the delay. Made a pass of non-testing files. Overall, I felt that this approach works. It adds its own complexity, but it's probably better than adding a separate thread pool. A few comments below.
Force-pushed from 51884d0 to e2f74f9.
Force-pushed from e2f74f9 to 3ac14e7.
Force-pushed from 3ac14e7 to b013041.
@junrao Could you take a look?
@chia7712 : Thanks for the updated PR. Made a pass of all files. A few more comments below.
offsetsPartitions.map(_.partition).toSet, isCommit = isCommit)
catch {
  case e: IllegalStateException if isCommit
    && e.getMessage.contains("though the offset commit record itself hasn't been appended to the log") =>
Hmm, why do we need this logic now?
TestReplicaManager#appendRecords (https://github.com/apache/kafka/blob/trunk/core/src/test/scala/unit/kafka/coordinator/AbstractCoordinatorConcurrencyTest.scala#L207) always completes the delayedProduce immediately, so the txn offset is appended as well. This PR completes the delayedProduce after releasing the group lock, so the following execution order becomes possible:
- txn prepare
- txn completion (fails)
- txn append (executed by the delayedProduce)
Thanks. I am still not sure that I fully understand this. It seems that by not completing the delayedProduce within the group lock, we are hitting IllegalStateException. That seems a bug. Do you know which code depends on that? It seems that we do hold a group lock when updating the txnOffset.
That seems a bug.

The root cause (a behavior changed by this PR) is that "txn initialization" and "txn append" are no longer executed under the same lock. The test story is as follows.

CommitTxnOffsetsOperation calls GroupMetadata.prepareTxnOffsetCommit to add CommitRecordMetadataAndOffset(None, offsetAndMetadata) to pendingTransactionalOffsetCommits (this is the link you attached). GroupMetadata.completePendingTxnOffsetCommit, called by CompleteTxnOperation, throws IllegalStateException if CommitRecordMetadataAndOffset.appendedBatchOffset is None (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadata.scala#L664).

Why didn't this cause an error before? CommitRecordMetadataAndOffset.appendedBatchOffset is updated by the callback putCacheCallback (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L407). TestReplicaManager always creates a delayedProduce to handle the putCacheCallback (https://github.com/apache/kafka/blob/trunk/core/src/test/scala/unit/kafka/coordinator/AbstractCoordinatorConcurrencyTest.scala#L188). The condition to complete the delayedProduce is completeAttempts.incrementAndGet() >= 3, and it becomes true only after both producePurgatory.tryCompleteElseWatch(delayedProduce, producerRequestKeys) and tryCompleteDelayedRequests() are called, since the former calls tryComplete twice and the latter calls it once. This means putCacheCallback is always executed by TestReplicaManager.appendRecords, and TestReplicaManager.appendRecords runs within the group lock (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupCoordinator.scala#L738). In short, txn initialization (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L464) and txn append (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L407) used to be executed under the same group lock, so the following execution order was impossible:
- txn initialization
- txn completion
- txn append

However, this PR no longer completes delayed requests within the group lock held by the caller, so the putCacheCallback that appends the txn has to acquire the group lock again.
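To make the race concrete, here is a small self-contained sketch of the ordering problem described above; all names below are hypothetical stand-ins rather than Kafka code.

import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable

// Sketch only: "prepare" and "append" no longer happen under one continuous
// lock acquisition, so a completion check can run in between and observe the
// half-initialized state.
object DelayedCompletionOrderingSketch {
  private val groupLock = new ReentrantLock()
  // producerId -> offset of the appended commit record (None until appended)
  private val pendingTxnOffsets = mutable.Map.empty[Long, Option[Long]]

  // "txn initialization": registers the pending commit without an appended offset.
  def prepareTxnOffsetCommit(producerId: Long): Unit = {
    groupLock.lock()
    try pendingTxnOffsets(producerId) = None
    finally groupLock.unlock()
  }

  // "txn append": runs later (e.g. from a delayed-produce callback) and must
  // re-acquire the lock, because the caller's lock was already released.
  def appendTxnOffset(producerId: Long, appendedOffset: Long): Unit = {
    groupLock.lock()
    try pendingTxnOffsets(producerId) = Some(appendedOffset)
    finally groupLock.unlock()
  }

  // "txn completion": fails if it observes the gap between prepare and append.
  def completePendingTxnOffsetCommit(producerId: Long): Unit = {
    groupLock.lock()
    try {
      if (pendingTxnOffsets.get(producerId).contains(None))
        throw new IllegalStateException(
          "offset commit record itself hasn't been appended to the log")
    } finally groupLock.unlock()
  }
}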
Thanks for the great explanation. I understand the issue now. Essentially, this exposed a limitation of the existing test. The existing test happens to work because the producer callbacks are always completed in the same ReplicaManager.appendRecords() call under the group lock. However, this is not necessarily the general case.
Your fix works, but may hide other real problems. I was thinking that another way to fix this is to change the test a bit. For example, we expect CompleteTxnOperation to happen after CommitTxnOffsetsOperation. So, instead of letting them run in parallel, we can change the test to make sure that CompleteTxnOperation only runs after CommitTxnOffsetsOperation completes successfully. JoinGroupOperation and SyncGroupOperation might need a similar consideration.
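For illustration, the suggested sequencing could be enforced with a latch that gates the completion on the commit having finished; the threads below are generic stand-ins, not the actual operation classes in the concurrency test.

import java.util.concurrent.CountDownLatch

// Sketch only: let the "complete" step start after the "commit" step has
// finished successfully, instead of running the two fully in parallel.
object SequencedTxnOpsSketch {
  def main(args: Array[String]): Unit = {
    val commitDone = new CountDownLatch(1)

    val commit = new Thread(() => {
      // ... run the CommitTxnOffsetsOperation-like work and verify its callback ...
      commitDone.countDown() // signal that the commit completed successfully
    })

    val complete = new Thread(() => {
      commitDone.await() // the CompleteTxnOperation-like work waits for the commit
      // ... run the completion ...
    })

    complete.start()
    commit.start()
    commit.join()
    complete.join()
  }
}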
we expect CompleteTxnOperation to happen after CommitTxnOffsetsOperation. So, instead of letting them run in parallel, we can change the test to make sure that CompleteTxnOperation only runs after CommitTxnOffsetsOperation completes successfully.
will roger that!
JoinGroupOperation and SyncGroupOperation might need a similar consideration.
I didn't notice anything interesting there. Could you share it with me?
@@ -536,6 +537,11 @@ class GroupCoordinatorTest {
    // Make sure the NewMemberTimeout is not still in effect, and the member is not kicked
    assertEquals(1, group.size)

    // prepare the mock replica manager again since the delayed join is going to complete
    EasyMock.reset(replicaManager)
    EasyMock.expect(replicaManager.getMagic(EasyMock.anyObject())).andReturn(Some(RecordBatch.MAGIC_VALUE_V1)).anyTimes()
Hmm, why do we need to mock this since replicaManager.getMagic() is only called through replicaManager.handleWriteTxnMarkersRequest()?
GroupMetadataManager#storeGroup (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L245) also calls ReplicaManager.getMagic. There are delayed ops that are completed by timer.advanceClock, so we have to mock replicaManager.getMagic. The mock is the same as https://github.com/apache/kafka/blob/trunk/core/src/test/scala/unit/kafka/coordinator/group/GroupCoordinatorTest.scala#L3823.
So... could we keep it simpler?
Force-pushed from 4a9cbc9 to d8beeab.
The flaky test is tracked by #8853.
@chia7712 : Thanks for the updated PR. Just one comment below. Also, there are a few comments not addressed from the previous round.
It will be helpful if you could preserve the commit history in future updates to the PR since that makes it easier to identify the delta changes.
my bad :( I'll keep that in mind
Force-pushed from d8beeab to 8893158.
@chia7712 : Thanks for the updated PR. Added a few more comments below.
Force-pushed from 8893158 to 85003c3.
@chia7712 : Thanks for the updated PR. Just a few more comments below.
groupCoordinator.groupManager.addPartitionOwnership(groupPartitionId)
val lock = new ReentrantLock()
val producerId = producerIdCount
producerIdCount += 1
I think the intention for the test is probably to use the same producerId since it tests more on transactional conflicts.
Got it. However, using the same producerId means the group completed by CompleteTxnOperation can be impacted by any CommitTxnOffsetsOperation (since the partitions are the same as well). Hence, the side effect is that we need a single lock to control the happens-before ordering of txn completion and commit, so the test will get slower.
Force-pushed from 85003c3 to 142a6c4.
@junrao Thanks for all the reviews again 👍
my bad. I forgot this request :( Except for
It seems we can remove
@chia7712 : Thanks for the updated PR. Just a few more minor comments.
@chia7712 : Thanks for the new update. A few more minor comments.
@chia7712 : Thanks for the updated PR. A few more minor comments below.
…sts does NOT hold any group lock
…lse to DelayedOperation
Force-pushed from 68c39cc to fbd4656.
@chia7712 : Thanks for the latest changes. LGTM.
The latest system test run has one failure:
http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-09-08--001.1599611744--chia7712--fix_8334_avoid_deadlock--fbd46565a/report.html
Also, are the Jenkins test failures related to this PR?
@ijuma @hachikuji @rajinisivaram : I think this PR is ready to be merged. Any further comments from you?
On my local machine, they are flaky on the trunk branch.
@chia7712 : Thanks a lot for staying on this tricky issue and finding a simpler solution!
Thanks for all the suggestions. I benefited a lot from them.
The main changes of this PR are shown below.
- Replace tryLock by lock for DelayedOperation#maybeTryComplete (BEFORE / AFTER comparison).
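For readers following along, here is a minimal sketch of the difference between the two locking strategies; it is simplified and hypothetical, since the real DelayedOperation carries more state and the actual patch may differ in detail.

import java.util.concurrent.locks.ReentrantLock

// Simplified model of a delayed operation guarded by a lock.
abstract class DelayedOperationSketch {
  private val lock = new ReentrantLock()

  // Subclasses decide whether the operation can be completed right now.
  def tryComplete(): Boolean

  // BEFORE (sketch): tryLock() gives up when another thread holds the lock,
  // so the completion check may be skipped and must be retried by another thread later.
  def maybeTryCompleteWithTryLock(): Boolean = {
    if (lock.tryLock()) {
      try tryComplete()
      finally lock.unlock()
    } else false
  }

  // AFTER (sketch): lock() blocks until the lock is free, so the calling
  // thread always performs the completion check itself.
  def maybeTryCompleteWithLock(): Boolean = {
    lock.lock()
    try tryComplete()
    finally lock.unlock()
  }
}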