Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU Usage with RabbitMQ Connector (polling interface implemented incorrectly?) #414

Closed
Thylossus opened this issue Sep 2, 2020 · 33 comments

Comments

@Thylossus
Copy link

Hi,

I'm using the Camel Rabbit MQ Connector as a source connector for my Kafka cluster. Kafka connect is running in a Docker container built with the following Dockerfile

FROM openjdk:8-jre-alpine

ARG VERSION_APACHE_KAFKA=2.2.1
ARG VERSION_SCALA=2.12
ARG VERSION_APACHE_CAMEL=0.4.0

RUN mkdir -p /kafka/connect/plugins && \
    mkdir -p /kafka/connect/app && \
    mkdir -p /kafka/connect/config && \
    wget https://archive.apache.org/dist/kafka/${VERSION_APACHE_KAFKA}/kafka_${VERSION_SCALA}-${VERSION_APACHE_KAFKA}.tgz -q -O kafka.tgz && \
    tar -xzf kafka.tgz -C /kafka/connect/app --strip 1 && \
    rm -f kafka.tgz && \
    wget https://repo1.maven.org/maven2/org/apache/camel/kafkaconnector/camel-rabbitmq-kafka-connector/${VERSION_APACHE_CAMEL}/camel-rabbitmq-kafka-connector-${VERSION_APACHE_CAMEL}-package.zip -q -O rabbitmq-connector.zip && \
    unzip -qq rabbitmq-connector.zip -d /kafka/connect/plugins && \
    rm -f rabbitmq-connector.zip && \
    apk add -q --no-cache bash

ENTRYPOINT [ "/kafka/connect/entrypoint.sh" ]

When running this configuration and adding a RabbitMQ connector with the following configuration

{
  "name": "{{ config.connector_name }}",
  "config": {
    "connector.class": "org.apache.camel.kafkaconnector.rabbitmq.CamelRabbitmqSourceConnector",
    "topics": "{{ config.topic }}",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "tasks.max": 1,
    "camel.component.rabbitmq.hostname": "{{ kafka_connect_rabbitmq_host }}",
    "camel.component.rabbitmq.portNumber": {{ kafka_connect_rabbitmq_port }},
    "camel.component.rabbitmq.username": "{{ kafka_connect_rabbitmq_user }}",
    "camel.component.rabbitmq.password": "{{ kafka_connect_rabbitmq_password }}",
    "camel.source.path.exchangeName": "{{ config.exchange }}",
    "camel.source.endpoint.exchangeType": "topic",
    "camel.source.endpoint.autoDelete": false,
    "camel.source.endpoint.queue": "{{ config.queue }}",
    "camel.source.endpoint.routingKey": "{{ config.routing_key }}"
  }
}

the connector task created by this configuration consumes all resources of its assigned CPU core.

I've looked at the thread dumps and activated TRACE logging and came up with the following conclusion:
The CamelSourceTask does not seem to properly implement the poll method of the abstract SourceTask class.

The JavaDoc for the poll method is as follows (source):

Poll this source task for new records. If no data is currently available, this method
should block but return control to the caller regularly (by returning {@code null}) in
order for the task to transition to the {@code PAUSED} state if requested to do so.

The task will be {@link #stop() stopped} on a separate thread, and when that happens
this method is expected to unblock, quickly finish up any remaining processing, and
return.

Looking at the implementation I cannot see how the contract w.r.t. "should block, but return control to the caller regularly" is fulfilled. Since consumer.receiveNoWait(); is called, exchanges are not read in a blocking manner and the subsequent code is not blocking either.

This causes the execute method of WorkerSourceTask to continuously call the poll method which immediately returns if no exchange is available (due to the break statement in the while loop) effectively creating an infinite loop which consumes all CPU resources.

I would be grateful if someone could take a look at my analysis. Maybe there is some configuration option that solves my issue, but I was not able to find it. Looking forward to hearing from anyone.

Best regards,

Tobias

P.S.: when TRACE logging is activated, the logs immediately show that there is some kind of resource-intensive loop:

2020-09-02T10:57:16.381+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} Nothing to send to Kafka. Polling source for additional records
2020-09-02T10:57:16.381+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} About to send 0 records to Kafka
2020-09-02T10:57:16.382+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} Nothing to send to Kafka. Polling source for additional records
2020-09-02T10:57:16.382+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} About to send 0 records to Kafka
2020-09-02T10:57:16.382+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} Nothing to send to Kafka. Polling source for additional records
2020-09-02T10:57:16.382+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} About to send 0 records to Kafka
2020-09-02T10:57:16.382+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} Nothing to send to Kafka. Polling source for additional records
2020-09-02T10:57:16.382+0000 [task-thread-deribw-events-0] TRACE WorkerSourceTask - WorkerSourceTask{id=deribw-events-0} About to send 0 records to Kafka

The corresponding log statements can be found in the execute method of WorkerSourceTask

@oscerd
Copy link
Contributor

oscerd commented Sep 2, 2020

This is because we created a tiny layer based on camel. That's how a camel consumer should work in plain camel. There is space for improvements but I don't think it's implemented incorrectly.. the aim of this project is leveraging the camel components and the engine we already have.

@oscerd
Copy link
Contributor

oscerd commented Sep 2, 2020

By the way I'll try to reproduce and check.

@oscerd
Copy link
Contributor

oscerd commented Sep 2, 2020

Thanks for reporting.

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

I tested the connector yesterday for about 40 minutes with this configuration

name=CamelRabbitmqSourceConnector
connector.class=org.apache.camel.kafkaconnector.rabbitmq.CamelRabbitmqSourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter

camel.source.maxPollDuration=10000

topics=test26

camel.component.rabbitmq.hostname=172.17.0.2
camel.component.rabbitmq.portnumber=5672
camel.source.path.exchangeName=queue
camel.source.endpoint.exchangeType=topic
camel.source.endpoint.autoDelete=false
camel.source.endpoint.queue=queue
camel.source.endpoint.routingKey=key

The connector didn't consume because nothing was in RabbitMQ. This is the result with Java VisualVM

general

And the monitoring part

jvisualvm

As you can see the CPU is constantly at 12%.

How are you checking the used resources? Through top? Are you using the top irix mode?

Thylossus pushed a commit to Thylossus/docker-kafka-camel-rabbitmq-connector-cpu-usage-scenario that referenced this issue Sep 3, 2020
@Thylossus
Copy link
Author

Thylossus commented Sep 3, 2020

Hi,

thank you very much for looking into the issue. I've prepared a docker-compose file to reproduce my issue: https://github.com/Thylossus/docker-kafka-camel-rabbitmq-connector-cpu-usage-scenario. I did not have the time yesterday to do so.

When I start all containers and setup the RabbitMQ connector, the CPU usage almost immediately increases significantly. I use top in Irix mode to check the resources:

Mem: 4911524K used, 8137016K free, 1252K shrd, 43904K buff, 1593012K cached
CPU:  84% usr   0% sys   0% nic  15% idle   0% io   0% irq   0% sirq
Load average: 10.14 6.92 3.24 12/1265 473
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    6     1 root     S    2688m  21%   2  79% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/kafka/connect/app/bin/../logs -Dlog4j.configuration=file:/ka
    1     0 root     S     2180   0%   5   0% bash /kafka/connect/entrypoint.sh
  462     0 root     S     1532   0%   2   0% top
  467     0 root     R     1528   0%   6   0% top -b -n 1

However, on our test systems we also have New Relic Monitoring which reports high CPU usage as well.

With the docker-compose setup on my local machine, you can also see the heavy load on the host system:

grafik

Maybe the cause lies within my connector configuration (I've also tried adding camel.source.maxPollDuration=10000 as in your configuration, but it did not change the behavior) or running Kafka Connect in distributed mode. I've spent quite some time investigating this issue and cannot explain what's the cause of the high CPU usage except for my assumption in the initial issue description.

Best regards,

Tobias

@Thylossus
Copy link
Author

I've updated the example with an exposed JMX port. Now one can also connect to the Kafka Connect process via VisualVM or other JMX tools.

In the idle state (no RabbitMQ events), Visual VM reports ~80% cpu usage:

grafik

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

Thanks for this. I forgot to say that I was testing on kafka 2.4.0

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

We'll a look. @orpiske do you want to try to reproduce?

@orpiske
Copy link
Contributor

orpiske commented Sep 3, 2020

We'll a look. @orpiske do you want to try to reproduce?

Yes I do! I will take a look at that.

@Thylossus
Copy link
Author

Thanks for this. I forgot to say that I was testing on kafka 2.4.0

I've just tested Kafka version 2.4.0 in the Connect Docker container and still got the same results.

Thank you both for investigating this 👍

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

After ten minutes of running I'm still

oscerd@ghost:~/workspace/miscellanea/docker-kafka-camel-rabbitmq-connector-cpu-usage-scenario$ bash setup.sh 
Topic rabbitmqtest does not exist yet. Creating it.
Created topic rabbitmqtest.
Creating exchange X.test
Connector rabbitmqtest does not exist yet. Creating it.
{"name":"rabbitmqtest","config":{"connector.class":"org.apache.camel.kafkaconnector.rabbitmq.CamelRabbitmqSourceConnector","topics":"rabbitmqtest","value.converter":"org.apache.kafka.connect.converters.ByteArrayConverter","tasks.max":"10","camel.component.rabbitmq.hostname":"rabbitmq","camel.component.rabbitmq.portNumber":"5672","camel.component.rabbitmq.username":"rabbitmq","camel.component.rabbitmq.password":"rabbitmq","camel.source.path.exchangeName":"X.test","camel.source.endpoint.exchangeType":"topic","camel.source.endpoint.autoDelete":"false","camel.source.endpoint.queue":"Q.test.kafka.import","camel.source.endpoint.routingKey":"events","name":"rabbitmqtest"},"tasks":[],"type":"source"}CPU usage as reported by top (Irix mode) in the Kafka Connect container
Mem: 20483064K used, 12140688K free, 1273088K shrd, 2280K buff, 8323776K cached
CPU:  11% usr   1% sys   0% nic  75% idle  12% io   0% irq   0% sirq
Load average: 2.47 1.59 1.59 5/2347 317
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2556m   7%   3   0% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
    1     0 root     S     2160   0%   6   0% {entrypoint.sh} /bin/bash /kafka/connect/entrypoint.sh
  313     0 root     R     1516   0%   4   0% top -b -n 1

CPU usage as reported by top (Irix mode) in the Kafka Connect container
Mem: 20475304K used, 12148448K free, 1268776K shrd, 2280K buff, 8320176K cached
CPU:   4% usr   2% sys   0% nic  81% idle  11% io   0% irq   0% sirq
Load average: 2.59 1.63 1.60 3/2343 350
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2584m   8%   3   5% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
    1     0 root     S     2160   0%   6   0% {entrypoint.sh} /bin/bash /kafka/connect/entrypoint.sh
  346     0 root     R     1516   0%   7   0% top -b -n 1

^Coscerd@ghost:~/workspace/miscellanea/docker-kafka-camel-rabbitmq-connector-cpu-usage-scenario$ bash setup.sh 
Topic rabbitmqtest already exists.
Creating exchange X.test
Connector rabbitmqtest already exists.
CPU usage as reported by top (Irix mode) in the Kafka Connect container
Mem: 20709508K used, 11914244K free, 1331772K shrd, 2280K buff, 8453220K cached
CPU:   7% usr   6% sys   0% nic  74% idle  12% io   0% irq   0% sirq
Load average: 1.85 1.80 1.68 5/2398 395
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2642m   8%   3   4% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
    1     0 root     S     2160   0%   6   0% {entrypoint.sh} /bin/bash /kafka/connect/entrypoint.sh
  391     0 root     R     1516   0%   0   0% top -b -n 1

CPU usage as reported by top (Irix mode) in the Kafka Connect container
Mem: 20713436K used, 11910316K free, 1332128K shrd, 2280K buff, 8453600K cached
CPU:   6% usr   1% sys   0% nic  87% idle   4% io   0% irq   0% sirq
Load average: 1.72 1.77 1.67 14/2392 401
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2641m   8%   3   3% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
    1     0 root     S     2160   0%   6   0% {entrypoint.sh} /bin/bash /kafka/connect/entrypoint.sh
  396     0 root     R     1516   0%   0   0% top -b -n 1

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

I just modified the cpu_quota here in kafka_connect service:

version: "2.1"

# Based on https://github.com/simplesteph/kafka-stack-docker-compose/blob/master/zk-single-kafka-multiple.yml
services:
  zoo1:
    image: zookeeper:3.4.9
    hostname: zoo1
    container_name: camel_demo_zoo1
    ports:
      - "2181:2181"
    environment:
      ZOO_MY_ID: 1
      ZOO_PORT: 2181
      ZOO_SERVERS: server.1=zoo1:2888:3888
    volumes:
      - ./zk-single-kafka-multiple/zoo1/data:/data
      - ./zk-single-kafka-multiple/zoo1/datalog:/datalog

  kafka1:
    image: confluentinc/cp-kafka:5.5.1
    hostname: kafka1
    container_name: camel_demo_kafka1
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka1:19092,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: "zoo1:2181"
      KAFKA_BROKER_ID: 1
      KAFKA_LOG4J_LOGGERS: "kafka.controller=INFO,kafka.producer.async.DefaultEventHandler=INFO,state.change.logger=INFO"
    volumes:
      - ./zk-single-kafka-multiple/kafka1/data:/var/lib/kafka/data
    depends_on:
      - zoo1

  kafka2:
    image: confluentinc/cp-kafka:5.5.1
    hostname: kafka2
    container_name: camel_demo_kafka2
    ports:
      - "9093:9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka2:19093,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: "zoo1:2181"
      KAFKA_BROKER_ID: 2
      KAFKA_LOG4J_LOGGERS: "kafka.controller=INFO,kafka.producer.async.DefaultEventHandler=INFO,state.change.logger=INFO"
    volumes:
      - ./zk-single-kafka-multiple/kafka2/data:/var/lib/kafka/data
    depends_on:
      - zoo1

  kafka3:
    image: confluentinc/cp-kafka:5.5.1
    hostname: kafka3
    container_name: camel_demo_kafka3
    ports:
      - "9094:9094"
    environment:
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka3:19094,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: "zoo1:2181"
      KAFKA_BROKER_ID: 3
      KAFKA_LOG4J_LOGGERS: "kafka.controller=INFO,kafka.producer.async.DefaultEventHandler=INFO,state.change.logger=INFO"
    volumes:
      - ./zk-single-kafka-multiple/kafka3/data:/var/lib/kafka/data
    depends_on:
      - zoo1

  rabbitmq:
    image: rabbitmq:3-management
    hostname: rabbitmq
    container_name: camel_demo_rabbitmq
    ports:
      - "15672:15672"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: "rabbitmq"
      RABBITMQ_DEFAULT_PASS: "rabbitmq"
      RABBITMQ_DEFAULT_VHOST: "/"

  kafka_connect:
    build: ./kafka-connect
    hostname: kafka_connect
    container_name: camel_demo_kafka_connect
    cpu_quota: 25000
    ports:
      - "8083:8083"
      - "9010:9010"
    environment:
      KAFKA_HEAP_OPTS: "-Xms256M -Xmx1G"
      KAFKA_JMX_OPTS: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false  -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0.0.0 -Dcom.sun.management.jmxremote.rmi.port=9010 "
      JMX_PORT: 9010
      KAFKA_LOG4J_OPTS: "-Dlog4j.configuration=file:/kafka/connect/config/log4j.properties"
    depends_on:
      - kafka1
      - kafka2
      - kafka3
      - rabbitmq

@orpiske
Copy link
Contributor

orpiske commented Sep 3, 2020

@oscerd I managed to reproduce it.

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

Me too. But with a little modification to conf it seems better

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

After one hour

oscerd@ghost:~/workspace/miscellanea/docker-kafka-camel-rabbitmq-connector-cpu-usage-scenario$ bash setup.sh 
Topic rabbitmqtest already exists.
Creating exchange X.test
Connector rabbitmqtest already exists.
CPU usage as reported by top (Irix mode) in the Kafka Connect container
Mem: 21581456K used, 11042296K free, 1323052K shrd, 2280K buff, 8477644K cached
CPU:   4% usr   0% sys   0% nic  82% idle  11% io   0% irq   0% sirq
Load average: 1.91 2.77 1.87 7/2374 417
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2640m   8%   3   2% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
    1     0 root     S     2160   0%   6   0% {entrypoint.sh} /bin/bash /kafka/connect/entrypoint.sh
  412     0 root     R     1516   0%   2   0% top -b -n 1

CPU usage as reported by top (Irix mode) in the Kafka Connect container
Mem: 21564480K used, 11059272K free, 1311232K shrd, 2280K buff, 8465832K cached
CPU:   3% usr   1% sys   0% nic  83% idle  12% io   0% irq   0% sirq
Load average: 1.92 2.75 1.87 10/2367 422
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    7     1 root     S    2640m   8%   3   4% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
    1     0 root     S     2160   0%   6   0% {entrypoint.sh} /bin/bash /kafka/connect/entrypoint.sh
  418     0 root     R     1516   0%   5   0% top -b -n 1

@Thylossus
Copy link
Author

@oscerd Did you get your results with or without the cpu_quota setting? I just ran the setup on a Linux machine and, again, got high CPU usage:

Mem: 7912780K used, 138900K free, 339276K shrd, 507192K buff, 2710832K cached
CPU:  96% usr   3% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 5.59 2.29 1.94 13/1688 427
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    6     1 root     S    2672m  34%   3  89% /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xms256M -Xmx1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=0.0
  422     0 root     R     1528   0%   4   3% top -b -n 1
    1     0 root     S     2180   0%   0   0% bash /kafka/connect/entrypoint.sh

@orpiske
Copy link
Contributor

orpiske commented Sep 3, 2020

As soon as I add the connector - from running the setup script - there's a big spike in both the GC activity as well as the CPU usage. I'm checking if the same thing happens under different settings / connectors / etc.

Screenshot from 2020-09-03 13-33-31

@Thylossus
Copy link
Author

When I set cpu_quota to 25000 as @oscerd did, the CPU usage is, indeed, not at > 80% anymore, but this is rather a workaround, isn't it? What's you opinion?

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

I don't think it's a workaround at all, because basically this happens only in a container context.. I mean if you run the same configuration in standalone kafka connect you won't get the same resource consumption. It's something we need to investigate for sure by the way.

@orpiske
Copy link
Contributor

orpiske commented Sep 3, 2020

I agree that we need to investigate this. I am not sure if it's a bug either. As usual for performance issues, there's a lot of things involved. I did try w/ a different connector to see if there would be any similarity and there was none with regards to CPU, but there were other strange things nonetheless. So, yeah, I guess we need to take a closer look.

@orpiske
Copy link
Contributor

orpiske commented Sep 3, 2020

Another thing I noticed: the reproducer has the tasks.max set to 10. This is not yet supported. After I decreased it to 1, the CPU usage on my case decreased significantly.

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

Yes, those connector actually doesn't work in distributed mode, we need to support it in some way. Still WIP.

@Thylossus
Copy link
Author

Another thing I noticed: the reproducer has the tasks.max set to 10. This is not yet supported. After I decreased it to 1, the CPU usage on my case decreased significantly.

Have you had a look at the load on the individual cores of your system? Setting tasks.max to 1 will, indeed, reduce the overall system load as only one worker respectively one thread is spawned, but this worker will utilize "its core" to 100%. So if you have a system with two CPU cores, the CPU usage will be at 50% (or 100% in top with Irix mode)

@orpiske
Copy link
Contributor

orpiske commented Sep 3, 2020

Another thing I noticed: the reproducer has the tasks.max set to 10. This is not yet supported. After I decreased it to 1, the CPU usage on my case decreased significantly.

Have you had a look at the load on the individual cores of your system? Setting tasks.max to 1 will, indeed, reduce the overall system load as only one worker respectively one thread is spawned, but this worker will utilize "its core" to 100%. So if you have a system with two CPU cores, the CPU usage will be at 50% (or 100% in top with Irix mode)

I did not look individually to a specific core, but the load did decrease by a factor of 10 (ie.: from ~31% previously to ~3% after that). Good catch.

@Thylossus
Copy link
Author

Thank you both for spending so much time with investigating this issue 👍 I've configured our deployment so that the docker container's quota is fixed to 0.75 CPU cores and it works fine:

CONTAINER ID        NAME                    CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
1c24d113ae09        dericon-kafka-connect   75.27%              598.1MiB / 3.851GiB   15.17%              1.25MB / 452kB      0B / 4.1kB          46

I run the connector with task.max = 1 and while this still uses the quota completely, we can handle this load on our server without impacting other applications too much. This is a great progress compared to yesterday. Your help is highly appreciated.

I've also updated the sample with the latest settings.

While we can work with the current solution, I'm not sure if this issue can be closed yet, but I'll leave the decision with you.

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

Let's leave this open

@oscerd
Copy link
Contributor

oscerd commented Sep 3, 2020

I would focus on understanding of this specific to rabbitmq connector/component

@orpiske
Copy link
Contributor

orpiske commented Sep 4, 2020

Today I did play a bit with this and I think we may have some room to improve the CamelSourceTask.

I adjusted the code a little bit and managed to reduce both the CPU usage (from ~9% to ~0.4% on this computer), GC allocations and heap usage (from ~350Mb to ~290Mb) while idle.

Before:
before

After:
after

With the change, the code spends far less time in the CamelTask.poll() than the current one we have.

Of course, this is just an initial proof of concept. Although it passes through all the tests, we need to do more serious testing, reviews, cleanups, etc ... but it looks there's something we can do here.

@oscerd
Copy link
Contributor

oscerd commented Sep 4, 2020

Really well done @orpiske . Can you open a PR with your updates?

@oscerd
Copy link
Contributor

oscerd commented Sep 4, 2020

And thanks a lot!

@orpiske
Copy link
Contributor

orpiske commented Sep 4, 2020

Really well done @orpiske . Can you open a PR with your updates?

Thanks @oscerd! That was a fun one to play with.

Yes I can. I will cleanup the patches a little bit and I'll send one for review on Monday ... Tuesday at worst. I also put some code in place to make it easier for us to debug problems like this the future.

orpiske added a commit to orpiske/camel-kafka-connector that referenced this issue Sep 7, 2020
This adds a reference implementation for checking the resource usage of
the RabbitMQ component while iddle. The motivation for this is related
to the github issue apache#414.
orpiske added a commit to orpiske/camel-kafka-connector that referenced this issue Sep 7, 2020
This adds a reference implementation for checking the resource usage of
the RabbitMQ component while iddle. The motivation for this is related
to the github issue apache#414.
orpiske added a commit that referenced this issue Sep 8, 2020
This adds a reference implementation for checking the resource usage of
the RabbitMQ component while iddle. The motivation for this is related
to the github issue #414.
@Thylossus
Copy link
Author

Thank you for looking into it! The PR looks good. Looking forward to version 0.5.0 :)

oscerd pushed a commit that referenced this issue Sep 27, 2020
This adds a reference implementation for checking the resource usage of
the RabbitMQ component while iddle. The motivation for this is related
to the github issue #414.
oscerd pushed a commit that referenced this issue Sep 27, 2020
This adds a reference implementation for checking the resource usage of
the RabbitMQ component while iddle. The motivation for this is related
to the github issue #414.
@oscerd
Copy link
Contributor

oscerd commented Oct 2, 2020

I think we can close this one.

@oscerd oscerd closed this as completed Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants