Major Assignments Issue In Master Branch #47
@josevalim with the …
@amacciola your fixes were not supposed to change the semantics of the code, but it seems they did. I just pushed another approach to master, can you please try it out?
@josevalim Will do, trying it now.
@josevalim I just tested it out and I see the same issue. It fixes all the crashes and errors from the :drain_after_revoke method, but now the messages do not ingest properly after starting/stopping/starting.
I see. It seems that assignments_revoked has to fail so brod cleans it up, so I think the failures will have to stay. We can rewrite the failures to something else so they are a bit more pleasant on the eyes, though. It may also be that terminate_child is too abrupt, and you may be better off using …
@josevalim makes sense. I will try using the …
Sorry I have been in here so much lately, but I am now using the … However, I am still having the same issues. This time I have dug into it more and I have more details.

The issue only happens if we stop a Broadway pipeline right after starting it. In low-latency environments like my local machine I see no issues, but in our cloud environments, with more latency to Kafka, we do see issues.

I believe that when we start a Broadway pipeline, it sends its assignments via brod to the consumer_group for each partition (please correct me if I am butchering this at all). But when we shut down the pipeline, all we are basically doing is sending a shutdown signal to a GenServer. So I do not believe it finishes the process of starting the Broadway pipeline and subscribing the assignments for the consumer_group. The next time I start the Broadway pipeline, since I am using the same consumer_group_id, it just rejoins the consumer_group instead of reconfiguring it, as it should, but now it will never fetch data from a handful of partitions.

Is there a proper way to stop a Broadway pipeline? Something like a …
Stopping a Broadway pipeline is not a scenario that we have tested, because there is no public API for it. If we implemented it, it would be implemented with GenServer.stop, and I would assume it would work in most cases, but Kafka is extremely complicated, so we would need to run a bunch of manual tests. You are welcome to send a pull request that adds this functionality, or at least to add some integration tests that reproduce the failures you are seeing, so we can try to investigate them further.
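Since there is no public stop API, a minimal sketch of what "stop via GenServer.stop" could look like follows. This assumes the pipeline was started under a registered name (the module name `MyPipeline` here is a placeholder, not something from this thread):

```elixir
defmodule PipelineControl do
  @moduledoc """
  Hypothetical sketch only: stops a named Broadway pipeline process directly.
  Assumes the pipeline was started with `Broadway.start_link(MyPipeline, name: MyPipeline, ...)`.
  """

  # GenServer.stop/3 blocks until the pipeline's supervision tree has shut
  # down (or the timeout elapses), which gives producers a chance to finish
  # their shutdown sequence rather than being killed abruptly.
  def stop_pipeline(name \\ MyPipeline, timeout \\ 30_000) do
    GenServer.stop(name, :normal, timeout)
  end
end
```

As the comment above notes, whether this interacts cleanly with an in-flight consumer-group join is exactly the untested part, so this should be treated as a starting point for integration tests, not a fix.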
In our scenario, users are able to start, pause, and restart ingestion from a UI, which stops the pipelines on the backend, and they can do these actions right after one another if they choose. The way I found a fix for this: just before I stop a pipeline, I check whether the … If only a subset of the total partitions have offsets committed, then something went wrong (the pipeline was shut down too quickly after starting), so I clear any data from our system that may have been ingested and then reset the consumer_group_id.
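The offset check described above could be sketched with brod's committed-offsets lookup. Everything here is an assumption for illustration: the client name `:my_client` is a placeholder, and the exact shape of the returned structs should be confirmed against brod's documentation:

```elixir
defmodule OffsetCheck do
  @moduledoc """
  Hypothetical sketch: before stopping a pipeline, verify that every
  partition of the topic has a committed offset for the consumer group.
  Assumes a running brod client registered as :my_client.
  """

  def all_partitions_committed?(group_id, topic, expected_partition_count) do
    # :brod.fetch_committed_offsets/2 queries the group coordinator for the
    # offsets committed under this consumer group.
    case :brod.fetch_committed_offsets(:my_client, group_id) do
      {:ok, topics} ->
        committed =
          topics
          |> Enum.filter(&(&1.name == topic))
          |> Enum.flat_map(& &1.partitions)
          # A committed_offset of -1 conventionally means "nothing committed".
          |> Enum.count(&(&1.committed_offset >= 0))

        committed == expected_partition_count

      {:error, _reason} ->
        false
    end
  end
end
```

If this returns false right after a start/stop cycle, that would match the "shut down too quickly" case described above, and the consumer_group_id reset would apply.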
Hello,
I posted issue #43 last week, and we got a PR submitted to master to fix that crash from happening. However, I believe there has been an unexpected side effect from fixing that crash.
The scenario where I am seeing the issue: a pipeline is started with a consumer_group_id, stopped, and then started again using the same consumer_group_id. When the pipeline starts, it does not seem to register or fetch messages from all the partitions for the topic. In the screenshot below, the top output is from after I started and shortly afterwards stopped the pipeline; the bottom output is from when I started it again and let all the messages ingest.
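For reference, a pipeline exercising this scenario would be configured roughly as follows. This is a minimal sketch with placeholder hosts, topic, and group names, not the reporter's actual configuration:

```elixir
defmodule MyPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayKafka.Producer,
           hosts: [localhost: 9092],
           # Reusing the same group_id across restarts is the key ingredient:
           # the restarted pipeline rejoins the existing consumer group.
           group_id: "my_consumer_group",
           topics: ["my_topic"]},
        concurrency: 1
      ],
      processors: [
        default: [concurrency: 2]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    message
  end
end
```

Starting this pipeline, stopping it almost immediately, and starting it again with the same `group_id` is the sequence that reproduces the missing-partition behavior described in this issue.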
When I reverted from the master branch back to the latest release tag, this issue went away. I will show my logs here from both cases.

LOGS WHEN USING MASTER BRANCH W/ ISSUE:
Then, when I restart the pipeline, the partition issue and the message inconsistency issues start to occur.
LOGS WHEN USING LATEST RELEASE TAG WITH NO ISSUE:
Then, when I restart the pipeline, I still get the GenServer crash error that we were trying to fix with the last PR.
However, even with the crash, I still get the correct data ingested and the partition offsets fetched, whereas the "fix" seems to have broken that.
I don't know what fix would resolve the GenServer crash without breaking the partition offset counts or the draining, but that change in master should not be released until something has been figured out.