After upgrade to 8.2: Kafka events fail to post #7123

Closed
DmitryLitvintsev opened this issue Apr 14, 2023 · 1 comment
DmitryLitvintsev commented Apr 14, 2023

After we upgraded to 8.2 we are no longer getting events into Kafka. We have three dCache instances; the one still on 7.2 continues publishing to Kafka with no issue.

I have tried:

dcache.kafka.topic=ingest.dcache.billing

dcache.kafka.maximum-block = 2000
dcache.kafka.maximum-block.unit = MILLISECONDS

xrootd.kafka.producer.configs!retries = 2
webdav.kafka.producer.configs!retries = 2
nfs.kafka.producer.configs!retries = 2
pool.kafka.producer.configs!retries = 2

xrootd.kafka.producer.configs!enable.idempotence = true
webdav.kafka.producer.configs!enable.idempotence = true
nfs.kafka.producer.configs!enable.idempotence = true
pool.kafka.producer.configs!enable.idempotence = true

xrootd.kafka.producer.configs!acks = all
webdav.kafka.producer.configs!acks = all
nfs.kafka.producer.configs!acks = all
pool.kafka.producer.configs!acks = all

xrootd.kafka.producer.configs!max.in.flight.requests.per.connection = 1
webdav.kafka.producer.configs!max.in.flight.requests.per.connection = 1
nfs.kafka.producer.configs!max.in.flight.requests.per.connection = 1
pool.kafka.producer.configs!max.in.flight.requests.per.connection = 1

xrootd.kafka.producer.configs!request.timeout.ms = 120000
webdav.kafka.producer.configs!request.timeout.ms = 120000
nfs.kafka.producer.configs!request.timeout.ms = 120000
pool.kafka.producer.configs!request.timeout.ms = 120000

xrootd.kafka.producer.configs!reconnect.backoff.max.ms = 1000
webdav.kafka.producer.configs!reconnect.backoff.max.ms = 1000
nfs.kafka.producer.configs!reconnect.backoff.max.ms = 1000
pool.kafka.producer.configs!reconnect.backoff.max.ms = 1000

xrootd.kafka.producer.configs!reconnect.backoff.ms = 50
webdav.kafka.producer.configs!reconnect.backoff.ms = 50
nfs.kafka.producer.configs!reconnect.backoff.ms = 50
pool.kafka.producer.configs!reconnect.backoff.ms = 50

I have also tried commenting out all of the above and using the defaults, with the same result:

}, timestamp=null), exception : {}
org.apache.kafka.common.errors.TimeoutException: Expiring 7 record(s) for ingest.dcache.billing-0:120000 ms has passed since batch creation

I must add that the decision to sprinkle Kafka interaction across services was not the best idea. To try different configs I have to restart doors and pools, which is unacceptable in production. If instead we limited Kafka interaction to the billing cell, no transfers would be affected while debugging.

Also note that the version we use has a print-format error that ends up dumping a metric ton of data into the log files, which is filling up the log partitions.

Another bit:

Just doing random restores one at a time on ITB, I see events going to Kafka. Production seems to report some of the events, so it is not a 100% failure, but it mostly fails.

Why is this bad? We want to rely on storage events for data handling, and right now that is completely broken.

This is 8.2.13.

I have yet to deploy the format fix that Tigran added recently.

Correction: no, I see no events from any 8.2 installation except one domain, the NFS domain. That one successfully reports to Kafka. There are no config differences between the NFS domain and the other doors/pools with regard to Kafka.

@DmitryLitvintsev

I might have figured out the problem. I enabled DEBUG logging and I see:

14 Apr 2023 16:24:26 (WebDAV00-fndcadoor01) [] [Producer clientId=WebDAV00-fndcadoor01@webdav-fndcadoor01Domain-1] Connection with lssrv06.fnal.gov/2620:6a:0:8430:f0:0:240:103 disconnected
java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50)
        at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:224)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:526)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:481)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)
        at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:328)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
        at java.base/java.lang.Thread.run(Thread.java:829)

So the connection fails. Why?

# telnet lssrv06 9092
Trying 2620:6a:0:8430:f0:0:240:103...
telnet: connect to address 2620:6a:0:8430:f0:0:240:103: Connection refused
Trying 131.225.240.103...
Connected to lssrv06.fnal.gov.
Escape character is '^]'.

It looks like the IPv6 firewall is blocking port 9092, and since 8.2 dCache prefers IPv6. I just made this hack:

[root@fndcadoor01 ~]# !d
diff -uN  /usr/share/dcache/defaults/dcache.properties /usr/share/dcache/defaults/dcache.properties.orig
--- /usr/share/dcache/defaults/dcache.properties	2023-04-14 16:28:07.965725909 -0500
+++ /usr/share/dcache/defaults/dcache.properties.orig	2023-04-14 16:27:58.498739815 -0500
@@ -237,6 +237,7 @@
     -XX:+StartAttachListener \
     -XX:+UseCompressedOops \
     -javaagent:${dcache.paths.classes}/aspectjweaver-1.9.2.jar \
+    -Djava.net.preferIPv6Addresses=system \
     ${dcache.java.options.common} \
     ${dcache.java.options.extra}=

restarted and I am getting events into Kafka...
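Rather than patching the packaged defaults file (which a package upgrade will overwrite), the same effect could plausibly be achieved through `dcache.java.options.extra`, which the defaults file shown above already appends to the `java` command line; since the last `-D` for a given property wins, a later flag overrides the earlier `preferIPv6Addresses=system`. A sketch for `/etc/dcache/dcache.conf` (verify the property is not already in use in your layout):

```
# /etc/dcache/dcache.conf -- survive package upgrades by overriding
# the address-ordering preference instead of editing the defaults.
dcache.java.options.extra=-Djava.net.preferIPv6Addresses=false
```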

da994201 (Paul Millar              2022-02-15 12:07:33 +0100  200)     -Djava.net.preferIPv6Addresses=system \
commit da994201f1e90191030a631c54241c8bbe240565
Author: Paul Millar <paul.millar@desy.de>
Date:   Tue Feb 15 12:07:33 2022 +0100

    scripts: configure JVM to use OS-supplied IP address order
    
    Motivation:
    
    WLCG wishes to see HTTP-TPC transfers use IPv6 addresses when
    transferring data between two dual-stack machines.
    
    Modification:
    
    Update the `java` command-line to specify the system configuration
    property `java.net.preferIPv6Addresses`.  This property controls in
    which order addresses are returned when resolving a hostname.  This, in
    turn, controls which protocol version (IPv4 or IPv6) is used
    preferentially.
    
    There are three valid values: `false`, `true` and `system`.  By
    selecting `system`, the script has Java honour the order provided by the
    OS.  The GNU resolver library adopts RFC 3484, and provides admins with
    the ability to tweak the behaviour (as needed).
    
    This patch also reverts commit 755ee365e5.  That patch provided a
    solution to this problem that was safer to back-port.  With this patch,
    there is no longer any need to have a custom DnsResolver instance.
    
    Result:
    
    dCache now uses the system-supplied ordering of IP addresses when
    resolving a host name.  On GNU-based systems (e.g., Linux systems), the
    order follows RFC 3484 and may be tweaked via the `/etc/gai.conf` file.
    This provides admins with better control over which address is used when
    dCache establishes network connections with other services; e.g.,
    HTTP-TPC pull transfers.  In particular, using IPv6 (instead of IPv4)
    where possible.
    
    Target: master
    Requires-notes: true
    Requires-book: no
    Patch: https://rb.dcache.org/r/13443/
    Acked-by: Tigran Mkrtchyan

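For completeness: the commit above points at `/etc/gai.conf` as the admin knob for this ordering. A hedged sketch of the conventional tweak (check gai.conf(5) on your distribution; per that man page, specifying any `precedence` line replaces the entire default table, so the RFC 3484 defaults are restated with one change, raising IPv4-mapped addresses from 35 to 100 so IPv4 sorts first):

```
# /etc/gai.conf -- sketch only; restates the default precedence table
# with ::ffff:0:0/96 (IPv4-mapped addresses) raised to 100.
precedence  ::1/128        50
precedence  ::/0           40
precedence  2002::/16      30
precedence  ::/96          20
precedence  ::ffff:0:0/96 100
```

This changes ordering host-wide for all getaddrinfo callers, not just dCache, so the JVM-flag approach is more targeted.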
mksahakyan added a commit that referenced this issue Apr 26, 2023
Motivation

After we upgraded to 8.2 we no longer are getting events into Kafka. We have 3 dCache instances. One 7.2 remaining still publishing to Kafka with no issue. (#7123).

The issue is that, according to spring-projects/spring-kafka#2251, the kafka-clients provide no hooks for determining that a send failed because the broker is down (spring-projects/spring-kafka#2250). Since this is still not fixed upstream, it should be addressed here.

Modification

Change the LoggingProducerListener so that when a TimeoutException is caught, the error message indicates that there is a connection issue or the broker is down.

Result

The log then looks like this:

24 Apr 2023 16:17:04 (pool_write) [] Producer failed to send the message, the broker is down or the connection was refused
24 Apr 2023 16:17:04 (pool_write) [] Producer failed to send the message, the broker is down or the connection was refused
24 Apr 2023 16:17:04 (NFS-localhost) [] Producer failed to send the message, the broker is down or the connection was refused

or

24 Apr 2023 17:27:51 (NFS-localhost) [door:NFS-localhost@dCacheDomain:AAX6FqLTdag pool_write DoorTransferFinished 0000C9CFA47686574B43B1EF9CF037A24780] Producer failed to send the message, the broker is down or the connection was refused
24 Apr 2023 17:27:51 (pool_write) [door:NFS-localhost@dCacheDomain:AAX6FqLTdag NFS-localhost PoolAcceptFile 0000C9CFA47686574B43B1EF9CF037A24780] Topic billing not present in metadata after 60000 ms.
24 Apr 2023 17:27:51 (NFS-localhost) [door:NFS-localhost@dCacheDomain:AAX6FqLTdag pool_write DoorTransferFinished 0000C9CFA47686574B43B1EF9CF037A24780] TEST Topic billing not present in metadata after 60000 ms. class org.springframework.kafka.KafkaException
24 Apr 2023 17:28:51 (NFS-localhost) [door:NFS-localhost@dCacheDomain:AAX6FqZqubA pool_write DoorTransferFinished 00002B30ED198C494F25A31F589AB91F903F] Producer failed to send the message, the broker is down or the co

Target: master
8.2, 9.0
 Require-book: no
 Require-notes: yes
 Patch: https://rb.dcache.org/r/13967/
 Acked-by: Lea Morschel, Abert Rossi, Tigran Mkrtchyan
mksahakyan added a commit to mksahakyan/dcache that referenced this issue May 6, 2023
mksahakyan added a commit to mksahakyan/dcache that referenced this issue May 6, 2023
lemora pushed a commit that referenced this issue May 12, 2023
lemora pushed a commit that referenced this issue May 12, 2023