Major Assignments Issue In Master Branch #47
@josevalim with the …
@amacciola your fixes were not supposed to change the semantics of the code, but it seems they did. I just pushed another approach to master, can you please try it out?
@josevalim Will do, trying it now.
@josevalim I just tested it out and I see the same issue. It fixes all the crashes and errors from the :drain_after_revoke method, but now the messages do not ingest properly after starting/stopping/starting.
I see. It seems that assignments_revoked has to fail so brod cleans it up, so I think the failures will have to stay. We can rewrite the failures to something else so they are a bit more pleasant on the eyes, though. It may also be that terminate_child is too abrupt, and you may be better off using …
@josevalim makes sense. I will try using the …
Sorry I have been in here so much lately, but I am now using the … However, I am still having the same issues. This time I have dug into it more and I have more details.

The issue only happens if we stop a Broadway pipeline right after starting it. In low-latency environments like my local machine I see no issues, but in our cloud environments, with more latency to Kafka, we do see issues.

I believe that when we start a Broadway pipeline, it sends its assignments via brod to the consumer_group for each partition (please correct me if I am butchering this at all). But when we shut down the pipeline, all we are basically doing is sending a shutdown signal to a GenServer. So I do not believe it finishes the process of starting the Broadway pipeline and subscribing the assignments for the consumer_group. The next time I start the Broadway pipeline, since I am using the same consumer_group_id, it just rejoins the consumer_group instead of reconfiguring it, as it should, but now it will never fetch data from a handful of partitions.

Is there a proper way to stop a Broadway pipeline? Something like a …
Stopping a Broadway pipeline is not a scenario that we have tested, because there is no public API for it. If we implemented it, it would be implemented with GenServer.stop, and I would assume it would work in most cases, but Kafka is extremely complicated, so we would need to run a bunch of manual tests. You are welcome to send a pull request that adds this functionality, or at least to add some integration tests that reproduce the failures you are seeing, so we can try to investigate them further.
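Since there is no public stop API, a minimal sketch of what "stop via GenServer.stop" could look like follows. This assumes the pipeline was started under a registered name (the module name `MyPipeline` here is a placeholder, not something from this thread):

```elixir
defmodule PipelineControl do
  @moduledoc """
  Hypothetical sketch only: stops a named Broadway pipeline process directly.
  Assumes the pipeline was started with `Broadway.start_link(MyPipeline, name: MyPipeline, ...)`.
  """

  # GenServer.stop/3 blocks until the pipeline's supervision tree has shut
  # down (or the timeout elapses), which gives producers a chance to finish
  # their shutdown sequence rather than being killed abruptly.
  def stop_pipeline(name \\ MyPipeline, timeout \\ 30_000) do
    GenServer.stop(name, :normal, timeout)
  end
end
```

As the comment above notes, whether this interacts cleanly with an in-flight consumer-group join is exactly the untested part, so this should be treated as a starting point for integration tests, not a fix.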
In our scenario, users are able to start, pause, and restart ingestion from a UI, which stops the pipelines on the backend, and they can do these actions right after one another if they choose. The way I found a fix for this: just before I stop a pipeline, I check whether the … If only a subset of the total partitions have offsets committed, then something went wrong (the pipeline was shut down too quickly after starting), so I clear any data from our system that may have been ingested and then reset the consumer_group_id.
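The offset check described above could be sketched with brod's committed-offsets lookup. Everything here is an assumption for illustration: the client name `:my_client` is a placeholder, and the exact shape of the returned structs should be confirmed against brod's documentation:

```elixir
defmodule OffsetCheck do
  @moduledoc """
  Hypothetical sketch: before stopping a pipeline, verify that every
  partition of the topic has a committed offset for the consumer group.
  Assumes a running brod client registered as :my_client.
  """

  def all_partitions_committed?(group_id, topic, expected_partition_count) do
    # :brod.fetch_committed_offsets/2 queries the group coordinator for the
    # offsets committed under this consumer group.
    case :brod.fetch_committed_offsets(:my_client, group_id) do
      {:ok, topics} ->
        committed =
          topics
          |> Enum.filter(&(&1.name == topic))
          |> Enum.flat_map(& &1.partitions)
          # A committed_offset of -1 conventionally means "nothing committed".
          |> Enum.count(&(&1.committed_offset >= 0))

        committed == expected_partition_count

      {:error, _reason} ->
        false
    end
  end
end
```

If this returns false right after a start/stop cycle, that would match the "shut down too quickly" case described above, and the consumer_group_id reset would apply.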
Hello,
I posted issue #43 last week, and we got a PR submitted to master to fix that crash from happening. However, I believe there has been an unexpected side effect from fixing that crash.
The scenario where I am seeing the issue: a pipeline is started with a consumer_group_id, stopped, and then started again using the same consumer_group_id. When the pipeline starts, it does not seem to register or fetch messages from all the partitions for the topic. In the screenshot below, the top output is from after I started and shortly afterwards stopped the pipeline; the bottom output is from when I started it again and let all the messages ingest.
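For reference, a pipeline exercising this scenario would be configured roughly as follows. This is a minimal sketch with placeholder hosts, topic, and group names, not the reporter's actual configuration:

```elixir
defmodule MyPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayKafka.Producer,
           hosts: [localhost: 9092],
           # Reusing the same group_id across restarts is the key ingredient:
           # the restarted pipeline rejoins the existing consumer group.
           group_id: "my_consumer_group",
           topics: ["my_topic"]},
        concurrency: 1
      ],
      processors: [
        default: [concurrency: 2]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    message
  end
end
```

Starting this pipeline, stopping it almost immediately, and starting it again with the same `group_id` is the sequence that reproduces the missing-partition behavior described in this issue.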
When I reverted from the master branch back to the latest release tag, this issue went away. I will show my logs here from both cases.

LOGS WHEN USING MASTER BRANCH W/ ISSUE:
Then, when I restart the pipeline, the partition issue and the message inconsistency issues start to occur.
LOGS WHEN USING LATEST RELEASE TAG WITH NO ISSUE:
Then, when I restart the pipeline, I still get the GenServer crash error that we were trying to fix with the last PR.
However, even with the crash, I still get the correct data ingested and the partition offsets fetched, whereas the "fix" seems to have broken that.
I don't know what fix would resolve the GenServer crash without breaking the partition offset counts or the draining, but that change in master should not be released until something has been figured out.