New error messages seen in faust app with aiokafka 0.7.1 #166

Open
patkivikram opened this issue Jun 29, 2021 · 24 comments
Labels
bug Something isn't working

Comments

@patkivikram
Collaborator

patkivikram commented Jun 29, 2021

Checklist

  • [x] I have included information about relevant versions
  • [x] I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce

Bring up the Faust app and let it run for a while. After running for some time, the application starts logging error messages like the following:

2021-06-28 10:57:33 [1] [ERROR] faust.transport.drivers.aiokafka [^----AIOKafkaConsumerThread]: Has not committed TP(topic='test_topic', partition=27) at all since worker start (started 5.52 minutes ago).

There are multiple possible explanations for this:

1) The processing of a single event in the stream
   is taking too long.

    The timeout for this is defined by the broker_commit_livelock_soft_timeout setting,
    currently set to 300.0.  If you expect the time
    required to process an event, to be greater than this then please
    increase the timeout.
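For context, broker_commit_livelock_soft_timeout named in the message is an ordinary faust.App setting. A minimal sketch of raising it, assuming slow processing really is expected (the app id, broker URL, and the 600-second value are placeholders); note that the rest of this thread suggests the warning can fire even while commits are progressing:

import faust

# Sketch: raise the livelock check threshold when events legitimately take
# longer than the default 300 s to process.
app = faust.App(
    "example-app",                               # placeholder app id
    broker="kafka://localhost:9092",             # placeholder broker
    broker_commit_livelock_soft_timeout=600.0,   # seconds
)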

Expected behavior

These errors should only happen if the committed offset is not progressing.

Actual behavior

The app is running fine and consumer offsets are actually being updated.

Versions

* Python version 3.9
* Faust version 0.6.7
* Operating system Centos75
* Kafka version 2.4.1


patkivikram added commits that referenced this issue Jun 29, 2021

* fix for consumer errors in app #166

patkivikram added a commit that referenced this issue Jul 6, 2021
@bluefatter

Thanks for fixing this. 🎉

@bitdivision

We're still seeing this issue with v0.6.9

@patkivikram
Collaborator Author

@bitdivision what are you seeing? Is this on a single worker or on multiple?

@bitdivision

bitdivision commented Jul 12, 2021

We've now ignored all of the following logs in our Sentry configuration; however, we are still seeing them logged in 0.6.9:

        "[^---AIOKafkaConsumerThread]: Has not committed",
        "[^---AIOKafkaConsumerThread]: Aiokafka has not sent fetch request for",
        "[^---AIOKafkaConsumerThread]: Stream has not started processing",
        "[^---AIOKafkaConsumerThread]: Stream stopped processing, or is slow for",

We're running with 6 workers. I can give more details on specific logs if that's helpful?

Edit: to add to this, we've seen these messages for topics which are clearly being processed correctly. We would have seen lag increases if a topic had stopped processing, but that hasn't been the case.
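For reference, these records can also be silenced at the logging layer rather than in Sentry. A minimal stdlib-only sketch, assuming the logger name shown in the error output (faust.transport.drivers.aiokafka) and matching on the same substrings listed above:

import logging

NOISY_SUBSTRINGS = (
    "Has not committed",
    "Aiokafka has not sent fetch request for",
    "Stream has not started processing",
    "Stream stopped processing, or is slow for",
)

class DropNoisyAiokafkaErrors(logging.Filter):
    """Drop the known spurious consumer-thread warnings before any handler
    (including a Sentry logging integration) sees them."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        return not any(s in message for s in NOISY_SUBSTRINGS)

logging.getLogger("faust.transport.drivers.aiokafka").addFilter(
    DropNoisyAiokafkaErrors()
)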

@patkivikram
Collaborator Author

Can you share the logs you are seeing in 0.6.9 with more than one worker? It should not happen with one worker, I hope :)

@popadi
Contributor

popadi commented Nov 29, 2021

Any update on this? It still happens and it's hard to tell if it's a real issue or not. I'm using 0.6.10.

Edit:
Apparently one of my two agents was publishing thousands of messages every second and it was taking all the resources away from the other agents. The error stopped appearing after I added more replicas to my app and bumped the resources a bit.

patkivikram added a commit that referenced this issue Dec 13, 2021

* fix race condition when buffers are full

* Fix error messages in faust app #166
@richardhundt
Collaborator

If you run a worker with 1 topic and N partitions, it only consumes from a single partition. After 300 seconds the monitor starts complaining that N-1 partitions are idle.

I've had to set isolated_partitions=True, but this can't possibly be the intended behaviour, can it?
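For anyone else trying this, isolated_partitions is an argument to the agent decorator. A minimal sketch (app id, broker URL, topic name, and the agent body are placeholders):

import faust

app = faust.App("example-app", broker="kafka://localhost:9092")
topic = app.topic("example-topic")

# Sketch: with isolated_partitions=True, Faust runs a separate actor
# instance for each assigned partition instead of one shared stream.
@app.agent(topic, isolated_partitions=True)
async def process(stream):
    async for event in stream:
        ...  # handle the event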

@daigotanaka

daigotanaka commented Jul 21, 2022

Hi all,

I am having what seems to be the same problem with faust-streaming==0.8.5.

Here is my setup:

I have a source topic with 6 partitions.
As a test, I started a worker, and it's moving the offsets for Partitions 2, 3, 4, and 5. But 0 and 1 aren't processed, and they keep on lagging.
I added/removed a couple more workers. Each time a worker joins or leaves the group, a reassignment happens. No matter which worker got assigned to Partition 0 or 1, the same partitions never move their offsets.

Our Kafka cluster is on Confluent Cloud. (Don't think it matters?)

I am also wondering if there is a bad record that's causing something like an infinite loop. If that is the case, the fix is on us, of course.
But I could not even do

from faust.types import TP  # TP is faust's (topic, partition) tuple

tp = TP(<source_topic_name>, 0)
await app.consumer.seek(tp, <some_offset>)

to skip to a certain offset.

So I just wanted to see whether the problem in this issue is still alive, to eliminate the possibility that our issue is caused by a Faust bug.

@richardhundt: You mentioned the isolated_partitions param. I checked the docs but could not understand what it does. Can you tell me how it helped you fix your issue? Thanks!

Update: I just noticed my issue happens even with a single worker. This time, Partition 0 is stuck at offset=1. Other partitions are moving forward nicely.

@richardhundt
Collaborator

@daigotanaka

As far as I can tell, the isolated_partitions parameter causes an actor to be spawned for each partition.

I expected that concurrency lets you create N actors for M partitions, including running N actors for a single partition or 1 actor for M partitions, but that doesn't seem to be how it works. The docs are kind of hand-wavy on the details, so I'm not sure if my understanding is correct.

I also found that in order to seek I needed to use an AIOKafkaConsumer directly. Something like this:

from aiokafka.consumer import AIOKafkaConsumer
from aiokafka import TopicPartition
from faust import TopicT


async def seek_topic_partition(topic: TopicT, partition: int, offset: int):
    app = topic.app
    # Reuse the app's consumer group id; pass bootstrap_servers explicitly
    # if your brokers are not on the default localhost:9092.
    consumer = AIOKafkaConsumer(loop=app.loop, group_id=app.conf.id)
    tp = TopicPartition(topic.get_topic_name(), partition)
    await consumer.start()
    consumer.assign([tp])            # manual assignment, no subscription
    await consumer.seek(tp, offset)  # move this consumer's position
    await consumer.stop()

@daigotanaka

@richardhundt Thank you for the details! I'll try AIOKafkaConsumer :)

@joekohlsdorf
Contributor

joekohlsdorf commented Jul 22, 2022

This was introduced in 0.6.5 when we actually started calling the verification: 7a45b2b#diff-5704609ad5592d977f497ac5defed2c54606a1bf7e42f0677ddf88f59f47938bR278

The code doesn't care whether commits actually go through; offsets are set in a dictionary, and that dictionary is all we look at:

This probably never worked. I didn't have time to look into it in detail, but my guess is that the global variable is read and updated from different threads and isn't really global.
Committing to Kafka works fine; offsets advance on all partitions.

In #153 people also complained about a significant performance regression when this additional check was enabled.

Until we find the issue, you can go back to 0.6.4 or patch this check out.
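For anyone who wants the workaround in concrete terms, pinning in requirements (assuming the faust-streaming package name) looks like:

# requirements.txt -- pin to the last release before the verification check
faust-streaming==0.6.4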

@daigotanaka

A follow-up to this comment:
#166 (comment)

I stopped seeing messages like "Has not committed TP(topic='test_topic', partition=27) at all since worker start (started 5.52 minutes ago)", and all the partitions started processing as expected as multiple workers join/leave, after finding the following misconfiguration on our end:

The issue was a mismatch in the number of replicas between the app config and the Topic object.
We were using the MyTopic class workaround for changing the number of Topic replicas, adapted from here:
#76

This caused the replica settings of the app config and the topic to disagree. Properly aligning them via the env var TOPIC_REPLICATION_FACTOR resolved our issue.

This might be a novice mistake, but I'm leaving a note here anyway in case it's useful.

Thanks @richardhundt and @joekohlsdorf for providing the pointers! Reading those helped to narrow down the issue :)
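For illustration, a minimal sketch of keeping the two replica settings aligned; the app id, broker URL, topic name, env var default, and replica count are placeholders, and topic_replication_factor plus the replicas argument to app.topic are the two values that have to agree:

import os

import faust

REPLICAS = int(os.environ.get("TOPIC_REPLICATION_FACTOR", "3"))

app = faust.App(
    "example-app",
    broker="kafka://localhost:9092",
    topic_replication_factor=REPLICAS,  # app-level default for created topics
)

# Keep the topic-level replica count in sync with the app-level setting.
source_topic = app.topic("example-source", replicas=REPLICAS)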


@wbarnha wbarnha closed this as completed Oct 5, 2022
@joekohlsdorf
Contributor

joekohlsdorf commented Oct 5, 2022

@wbarnha Could you please explain why you closed this issue?

I don't see any recent changes to the problematic verification code I showed in #166 (comment)

I can still reproduce the problem and the solution posted by @daigotanaka does not work for me.

@wbarnha
Member

wbarnha commented Oct 5, 2022

Thanks for getting back to me. I thought this was fixed by @daigotanaka, but I'll go ahead and re-investigate.

@wbarnha wbarnha reopened this Oct 5, 2022
@wbarnha wbarnha added the bug Something isn't working label Oct 5, 2022
@patkivikram
Collaborator Author

patkivikram commented Oct 10, 2022

This should be fixed with #380 - can you please test it, @joekohlsdorf?

@patkivikram
Collaborator Author

Anyone still seeing this with the latest release?

@JonathanSerafini

JonathanSerafini commented Nov 11, 2022

Most if not all of our faust-streaming 0.9.2 consumers are spitting out a bunch of these errors on and off. Interestingly, this also includes the assignor leader topic:

[^---AIOKafkaConsumerThread]: Aiokafka has not sent fetch request for TP(topic='findings-processor-__assignor-__leader', partition=0) since start (started 7.57 hours ago)

@bhanuka-ilume

Still seeing the issue in faust-streaming 0.9.5

[ERROR] [^---AIOKafkaConsumerThread]: Stream has not started processing TP(topic='', partition=0) (started 18.25 hours ago).

@wbarnha
Member

wbarnha commented Feb 6, 2023

Still seeing the issue in faust-streaming 0.9.5

[ERROR] [^---AIOKafkaConsumerThread]: Stream has not started processing TP(topic='', partition=0) (started 18.25 hours ago).

I've also seen this error come up while Faust is actually running normally, so it's a bit hard to troubleshoot. I think the solution lies in reviewing our aiokafka drivers to log when everything is running nominally.

@alihoseiny

For the record, we are facing the exact same issue using faust-streaming 0.10.13 and Python 3.11.3.

@richardhundt
Collaborator

richardhundt commented Jul 19, 2023

Something to consider: if you have a large consumer_max_fetch_size (default is 1024 ** 2), have small messages, and your agent takes a long time to process each one, then you could see this issue.

What happens is that you fetch a chunk of 1048576 bytes in a single poll of Kafka; if your messages are 1 KB on average, that's about 1k messages. If each message takes 1 second to process, then you end up polling Kafka only about once every ~17 minutes, and that'll trigger this error.
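To make the arithmetic explicit, a quick back-of-the-envelope sketch using the illustrative numbers above (1 KB messages, 1 second of processing each):

# Rough estimate of how long one fetched batch keeps the consumer busy.
consumer_max_fetch_size = 1024 ** 2   # bytes per fetch (the default)
avg_message_size = 1024               # assumed ~1 KB per message
seconds_per_message = 1.0             # assumed processing time per message

messages_per_fetch = consumer_max_fetch_size / avg_message_size   # ~1024
seconds_between_polls = messages_per_fetch * seconds_per_message  # ~1024 s
print(f"~{seconds_between_polls / 60:.1f} minutes between polls")  # ~17.1 minutes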

Try setting consumer_max_fetch_size to something much smaller. There are a couple of other settings you can play with. Here's my setup for feeding an OCR system with file names where the OCR backend takes several seconds per message:

app = faust.App(
    ...
    broker_heartbeat_interval=3,
    broker_session_timeout=120,
    broker_request_timeout=240,
    broker_max_poll_records=30,
    broker_max_poll_interval=120000,
    broker_commit_livelock_soft_timeout=30 * 60,
    stream_processing_timeout=30 * 60,
    consumer_max_fetch_size=2048,
    ...
)

Here I'm trying to increase the polling frequency by limiting max poll records and max fetch size, while increasing intervals and timeouts.

@joekohlsdorf
Contributor

This is true, but it also happens in environments that process millions of small messages.
We already know that this problem was introduced with this change. I don't know what's wrong with the validation, but removing it fixes the problem (and speeds up processing significantly): 7a45b2b#diff-5704609ad5592d977f497ac5defed2c54606a1bf7e42f0677ddf88f59f47938bR278

@qlhai

qlhai commented Nov 7, 2023

I also see this error message in my project.
faust-streaming 0.10.16
aiokafka 0.8.1
kafka-python 2.0.2
Python 3.9.15
Debian GNU/Linux 11 (bullseye)

@rezblaze

We have been seeing this error for a while:

  3. The agent processing the stream is hanging (waiting for network, I/O or infinite loop).

2023-12-25 23:59:55,291 process=3269118 loglevel=ERROR request_id= correlation_id= logger=faust.transport.drivers.aiokafka _log_slow_processing() L909 [^--AIOKafkaConsumerThread]: Stream has not started processing TP(topic='build_events-BuildEvent.build_id-repartition', partition=4) (started 6.51 days ago).

There are multiple possible explanations for this:

  1. The processing of a single event in the stream
    is taking too long.

    The timeout for this is defined by the stream_processing_timeout setting,
    currently set to 300.0. If you expect the time
    required to process an event, to be greater than this then please
    increase the timeout.

  2. The stream has stopped processing events for some reason.

  3. The agent processing the stream is hanging (waiting for network, I/O or infinite loop).

2023-12-25 23:59:55,291 process=3269118 loglevel=ERROR request_id= correlation_id= logger=faust.transport.drivers.aiokafka _log_slow_processing() L909 [^--AIOKafkaConsumerThread]: Stream has not started processing TP(topic='build_events-BuildEvent.build_id-repartition', partition=1) (started 6.51 days ago).

There are multiple possible explanations for this:

  1. The processing of a single event in the stream
    is taking too long.

    The timeout for this is defined by the stream_processing_timeout setting,
    currently set to 300.0. If you expect the time
    required to process an event, to be greater than this then please
    increase the timeout.

  2. The stream has stopped processing events for some reason.

We get hundreds or thousands of these messages during a large run.

fastapi==0.90.1
uvicorn==0.14.0
python-dateutil==2.8.2
python-dotenv==1.0.0
faust-streaming==0.10.14
starlette-exporter==0.15.1
prometheus_fastapi_instrumentator==5.11.0
schedule==1.2.1
Python 3.11.3
