The client seems unable to handle server failure properly #386

Closed
reAsOn2010 opened this issue Jun 4, 2015 · 2 comments

@reAsOn2010
Contributor

Environment:

  • single-instance ZooKeeper (version: 3.4.6)
  • a Kafka cluster with four brokers on one machine, running under supervisor (version: 0.8.2.0)
  • Java version "1.8.0_45"
  • topic info:
Topic:timeline  PartitionCount:20   ReplicationFactor:2 Configs:
    Topic: timeline Partition: 0    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 1    Leader: 1   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 2    Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 3    Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 4    Leader: 0   Replicas: 0,1   Isr: 0,1
    <and more...>

I tried the latest code on the master branch to test cluster failover. It seems that SimpleConsumer exits with an AttributeError when a broker is shut down. Below is the stack trace:

[E 150604 11:24:47 /Users/leon/zhihu/kafka-python/kafka/conn.py:107 leon-MacBook-Pro:22207] Unable to receive data from Kafka
Traceback (most recent call last):
  File "/Users/leon/zhihu/kafka-python/kafka/conn.py", line 104, in _read_bytes
    raise socket.error("Not enough data to read message -- did server kill socket?")
error: Not enough data to read message -- did server kill socket?
[W 150604 11:24:47 /Users/leon/zhihu/kafka-python/kafka/client.py:204 leon-MacBook-Pro:22207] Could not receive response to request [00000084000100000000004a000c6b61666b612d707974686f6effffffff0000ea600000000100000001000874696d656c696e650000000500000012000000000000000900001000000000020000000000000028000010000000000e0000000000000009000010000000000a00000000000000070000100000000006000000000000000900001000] from server <KafkaConnection host=kids.aws.dev port=9094>: Kafka @ kids.aws.dev:9094 went away
Traceback (most recent call last):
  File "zhihu_logger/consumer.py", line 23, in <module>
    main()
  File "zhihu_logger/consumer.py", line 19, in main
    for m in consumer:
  File "/Users/leon/zhihu/kafka-python/kafka/consumer/simple.py", line 311, in __iter__
    message = self.get_message(True, timeout)
  File "/Users/leon/zhihu/kafka-python/kafka/consumer/simple.py", line 270, in get_message
    return self._get_message(block, timeout, get_partition_info)
  File "/Users/leon/zhihu/kafka-python/kafka/consumer/simple.py", line 283, in _get_message
    self._fetch()
  File "/Users/leon/zhihu/kafka-python/kafka/consumer/simple.py", line 344, in _fetch
    check_error(resp)
  File "/Users/leon/zhihu/kafka-python/kafka/common.py", line 218, in check_error
    if response.error:
AttributeError: 'FailedPayloadsError' object has no attribute 'error'

When fetching messages from the Kafka cluster, the consumer calls client.send_fetch_request with fail_on_error=False, so the returned responses may contain FailedPayloadsError instances. check_error cannot handle an actual exception object: it reads the response's error attribute, which does not exist on the exception.
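
For reference, this is roughly the guard I have in mind for check_error. It is only a minimal sketch: KafkaError, UnknownError, and kafka_errors are my assumptions about the names already defined in kafka/common.py, and the maintainers may prefer a different shape.

def check_error(response):
    # Responses collected with fail_on_error=False may already be exception
    # instances (e.g. FailedPayloadsError) rather than response namedtuples,
    # so re-raise them directly instead of reading a non-existent `.error`.
    if isinstance(response, KafkaError):
        raise response
    if response.error:
        error_class = kafka_errors.get(response.error, UnknownError)
        raise error_class(response)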

I fixed the problem above locally (I can open a pull request for it) and then brought the failed broker back up. The cluster waited for the recovered broker to catch up, and once it had, leadership moved back to the recovered broker. This caused a NotLeaderForPartitionError in the consumer, and the consumer exited.

Topic:timeline  PartitionCount:20   ReplicationFactor:2 Configs:
    Topic: timeline Partition: 0    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 1    Leader: 2   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 2    Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 3    Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 4    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 5    Leader: 2   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 6    Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 7    Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 8    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 9    Leader: 2   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 10   Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 11   Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 12   Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 13   Leader: 2   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 14   Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 15   Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 16   Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 17   Leader: 2   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 18   Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 19   Leader: 3   Replicas: 3,0   Isr: 3,0

The topic state changed from the above (while the broker was down) to the below (after it recovered and leadership moved back).

Topic:timeline  PartitionCount:20   ReplicationFactor:2 Configs:
    Topic: timeline Partition: 0    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 1    Leader: 1   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 2    Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 3    Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 4    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 5    Leader: 1   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 6    Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 7    Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 8    Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 9    Leader: 1   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 10   Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 11   Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 12   Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 13   Leader: 1   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 14   Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 15   Leader: 3   Replicas: 3,0   Isr: 3,0
    Topic: timeline Partition: 16   Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: timeline Partition: 17   Leader: 1   Replicas: 1,2   Isr: 2,1
    Topic: timeline Partition: 18   Leader: 2   Replicas: 2,3   Isr: 2,3
    Topic: timeline Partition: 19   Leader: 3   Replicas: 3,0   Isr: 3,0

Below is the stack trace:

Traceback (most recent call last):
  File "zhihu_logger/consumer.py", line 23, in <module>
    main()
  File "zhihu_logger/consumer.py", line 19, in main
    for m in consumer:
  File "/Users/leon/opensource/kafka-python/kafka/consumer/simple.py", line 311, in __iter__
    message = self.get_message(True, timeout)
  File "/Users/leon/opensource/kafka-python/kafka/consumer/simple.py", line 270, in get_message
    return self._get_message(block, timeout, get_partition_info)
  File "/Users/leon/opensource/kafka-python/kafka/consumer/simple.py", line 283, in _get_message
    self._fetch()
  File "/Users/leon/opensource/kafka-python/kafka/consumer/simple.py", line 346, in _fetch
    check_error(resp)
  File "/Users/leon/opensource/kafka-python/kafka/common.py", line 220, in check_error
    raise error_class(response)
kafka.common.NotLeaderForPartitionError: FetchResponse(topic='timeline', partition=5, error=6, highwaterMark=-1, messages=<generator object _decode_message_set_iter at 0x10116ab40>)

Both problems affect the producer as well as the consumer, since the relevant logic lives mainly in the Kafka client.
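
With the check_error fix applied, I am using the following workaround in my test script until a proper fix lands in the client: catch the error, refresh the client metadata, and keep consuming. This is only a sketch; the broker address, group name, and message handling are placeholders from my setup, not suggestions for the library.

from kafka import KafkaClient, SimpleConsumer
from kafka.common import FailedPayloadsError, NotLeaderForPartitionError

client = KafkaClient('kids.aws.dev:9092')        # broker from my test setup
consumer = SimpleConsumer(client, 'test-group', 'timeline')

while True:
    try:
        for message in consumer:
            print(message)                       # placeholder for real handling
    except (FailedPayloadsError, NotLeaderForPartitionError):
        # Leadership changed (e.g. preferred-leader election after the
        # recovered broker caught up); refresh metadata and keep consuming.
        client.load_metadata_for_topics()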

Looking forward to responses. Thanks! 😄

@dccmx

dccmx commented Jun 4, 2015

Same issue here. Hoping for a solution.

@dpkp
Owner

dpkp commented Jun 9, 2015

#392 and #393 should fix these issues and will be included in the forthcoming 0.9.4 release.

@dpkp dpkp closed this as completed Jun 9, 2015