
[bug]: bridge_channel.c: Stream topology change amplification with multiple layers of Local channels #530


Closed
1 task done
InterLinked1 opened this issue Jan 12, 2024 · 15 comments · Fixed by #532
Labels
bug · support-level-core (Functionality with core support level)

Comments

@InterLinked1
Contributor

Severity

Major

Versions

21.0.0

Components/Modules

bridge_channel.c

Operating Environment

Debian 12

Frequency of Occurrence

Constant

Issue Description

I dug into a specific instance of the infamous "Exceptionally long queue length" queuing warnings that I was seeing and found that it is caused by an amplification of stream topology change control frames (AST_CONTROL_STREAM_TOPOLOGY_CHANGED) being propagated back and forth between the channels. If I watch how many frames are queued on the channel, the count reaches thousands within seconds and keeps growing for as long as the channels remain bridged.

This makes the bridge completely unusable and degrades the whole system through high CPU usage. Left unchecked, it is effectively a resource leak: frames accumulate faster than they are consumed until the channels are hung up, and if users don't hang up manually, I suspect Asterisk would eventually use all available memory. Based on my experience, a number of instances of this error message likely share the same underlying cause.
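
To make the failure mode concrete, here is a tiny standalone model (my own illustration, not Asterisk code) of the feedback loop as I understand it: each topology change frame delivered across the Local channel pair provokes further change indications on the bridged channels, so frames are queued faster than they are consumed and the queue grows without bound. The factor of two per round below is an assumption for illustration only.

#include <stdio.h>

int main(void)
{
    long queued = 1; /* the initial topology change when the bridge forms */
    for (int round = 1; round <= 12; round++) {
        /* Assumption for illustration: each consumed frame provokes a
         * change indication back through both halves of the Local pair,
         * i.e. two new frames per old one. */
        queued *= 2;
        printf("round %2d: ~%ld topology frames queued\n", round, queued);
    }
    /* Once a channel's queue passes the sanity threshold (a later comment
     * in this thread shows the warning firing at 129 total frames), every
     * additional queued frame logs "Exceptionally long queue length". */
    return 0;
}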

I've put together a sample dialplan that reproduces this, along with logs from running it (with some debug messages added to show the continuous stream topology change frames). The problem occurs precisely when dialing a queue agent through a Local channel, the agent performs an attended transfer, and the channels are then bridged.

The issue only happens when using a Local channel; however, core_unreal isn't what is constantly triggering the stream topology change frames. That may be happening purely in the bridging core.

(Either Queue or Dial can be used, though Dial is simpler for reproduction purposes).

Caller A calls B, and B initiates an attended transfer, e.g. using *2 (starting at test-operator-transfer,1,1).

The issue happens with both Wait and BridgeWait, i.e. regardless of whether the channel is masqueraded.

In the example below, party A dials extension 123, triggering the issue when B (PJSIP/Polycom5 in this example) initiates the attended transfer; more precisely, 3 seconds afterwards, when the channels are bridged.

I went ahead and reproduced it on a second system, using DAHDI instead of PJSIP for both endpoints and without any modifications to log bridging changes, and I do indeed get hundreds of the following message, so this issue is very easy to reproduce consistently:

DEBUG[136258][C-000001e4]: chan_dahdi.c:9344 dahdi_indicate: Requested indication 36 on channel DAHDI/18-1
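
For reference, indication 36 is AST_CONTROL_STREAM_TOPOLOGY_CHANGED. A quick way to double-check that mapping on a given build is to print the enum value; this sketch assumes it is compiled inside an Asterisk source tree so the headers resolve (include order may need adjusting for your tree):

#include "asterisk.h"         /* Asterisk convention: this header first */
#include "asterisk/frame.h"   /* defines enum ast_control_frame_type */
#include <stdio.h>

int main(void)
{
    printf("AST_CONTROL_STREAM_TOPOLOGY_CHANGED = %d\n",
           (int) AST_CONTROL_STREAM_TOPOLOGY_CHANGED);
    return 0;
}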

features.conf:

[featuremap]
atxfer => *2 

Dialplan:

[dialme]
exten => s,1,Wait(3)
	same => n,Answer()
	same => n,PlayTones(600)
	same => n,Wait(9999)
	same => n,Hangup()

[test-operator-transfer]
exten => 1,1,NoOp() ; this could also be Answer, doesn't matter
	same => n,Dial(Local/s@dialme/n,,G(split))
	same => n,Hangup()
	same => n(split),Goto(caller)
	same => n,Goto(callee)
	same => n(caller),Set(GLOBAL(bridgeme)=${CHANNEL})
	same => n,NoOp() ; wait to get bridged
	;same => n,Wait(9999999)
	same => n,BridgeWait(,,e(s))
	same => n,Hangup()
	same => n(callee),NoOp()
	same => n,Wait(0.1)
	same => n,Bridge(${bridgeme})
	same => n,Hangup()

[queue-agent]
exten => s,1,Progress()
	same => n,Dial(PJSIP/Polycom5) ; dial the actual agent device
	same => n,Hangup()

[from-internal]
exten => 123,1,MSet(__TRANSFER_CONTEXT=test-operator-transfer,__TRANSFER_EXTEN=1)
	same => n,Dial(Local/s@queue-agent/n,,t)
	same => n,Hangup()

GitHub complains the relevant log output is too long, so I've linked it here: https://paste.interlinked.us/zl4geodvat.txt

Relevant log output

No response

Asterisk Issue Guidelines

  • Yes, I have read the Asterisk Issue Guidelines
@jcolp
Member

jcolp commented Jan 12, 2024

Have you tested with the inclusion of the patch from #403?

@jcolp added the support-level-core (Functionality with core support level) label and removed the triage label on Jan 12, 2024
@jcolp
Member

jcolp commented Jan 12, 2024

Or, if you revert 315eb551dbd18ecd424a2f32179d4c1f6f6edd26, does that resolve it?

@InterLinked1
Contributor Author

I applied the patch from #403 and it resolved my minimal reproducible example, but not the actual issue I stumbled into in the first place, so now I'll have to come up with a new, more complex example.

This happens with all channel technologies, not just PJSIP, so I think it's safe to say it is not resolved by either of the above.

I've now narrowed it down to the point where I can make the issue reproduce or not based on replacing same => n,Dial(Local/s@dialme/n,,G(split)) with something suitably more complex. I'm trying to reduce that to something minimal and will report back.

@InterLinked1
Contributor Author

InterLinked1 commented Jan 12, 2024

Okay, I figured out how to get it to still reproduce with #403.

Just add another layer of dialing a Local channel. So add:

[dialme2]
exten => s,1,Dial(Local/${EXTEN}@dialme/n,,g)
	same => n,Hangup()

And then change the Dial to:

same => n,Dial(Local/s@dialme2/n,,G(split))
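
For what it's worth, a rough way to see why the extra layer matters (this scaling is my assumption about the mechanism, not something measured from the code): each stacked Local pair is one more point where a change indication can be reflected back, so the number of frames in flight per round grows much faster as layers are added.

#include <stdio.h>

int main(void)
{
    /* Assumed model, purely illustrative: with L stacked Local channel
     * pairs, each topology frame is re-indicated roughly once per pair
     * per round, giving growth on the order of (L + 1)^rounds. */
    for (int layers = 1; layers <= 3; layers++) {
        long frames = 1;
        for (int round = 0; round < 10; round++) {
            frames *= layers + 1;
        }
        printf("%d Local layer(s): ~%ld frames after 10 rounds\n",
               layers, frames);
    }
    return 0;
}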

@InterLinked1 InterLinked1 changed the title [bug]: bridge_channel.c: Stream topology change amplification bug [bug]: bridge_channel.c: Stream topology change amplification with multiple layers of Local channels Jan 12, 2024
@gtjoseph
Member

Try reverting the commit @jcolp mentioned above. We've had another report of this (and can reproduce it) and narrowed it down to that commit from November 2022.

@gtjoseph
Member

@InterLinked1 We were going to cut release candidates today so if you can test the revert and it fixes your issue, we can include the revert in the RCs.

@InterLinked1
Contributor Author

@InterLinked1 We were going to cut release candidates today so if you can test the revert and it fixes your issue, we can include the revert in the RCs.

This seems to be somewhat subtle. The example I showed actually didn't reproduce with DAHDI, so maybe you guys are right that it's that PJSIP thing.

However, I reverted lines 1938-1946 in res_pjsip_session.c, and the issue still happens with that modified reproducible example. Didn't bother with the test_media_add lines.

@jcolp
Member

jcolp commented Jan 12, 2024

The res_pjsip_session change is not the huge change; the change to channel.c is.

@gtjoseph
Member

I'm not following. Did you revert the changes in channel.c from 315eb55?

@InterLinked1
Contributor Author

The res_pjsip_session change is not the huge change; the change to channel.c is.

Ah, I'm dumb. I saw that, but it looked like the same code from #403, so I assumed I already had it and that #403 was just reverting that bit. Same code, but in a different file.

Yes, this resolves the new example and the original issue - thanks for the catch!

gtjoseph added a commit to gtjoseph/asterisk that referenced this issue Jan 12, 2024
Revert "core & res_pjsip: Improve topology change handling."

This reverts commit 315eb55.

Over the past year, we've had several reports of "topology storms"
occurring where 2 external facing channels connected by one or more
local channels and bridges will get themseles in a state where
they continually send each other topology change requests.  This
usually manifests itself in no-audio calls and a flood of
"Exceptionally long queue length" messages.  It appears that this
commit is the cause so we're reverting it for now until we can
determine a more appropriate solution.

Resolves: asterisk#530
@alex2grad
Contributor

If "core & res_pjsip: Improve topology change handling." is reverted by #532.
Do we still need #403?

@jcolp
Member

jcolp commented Jan 12, 2024

It's not strictly required, but does no harm in still being present.

asterisk-org-access-app bot pushed commits that referenced this issue Jan 12, 2024 (including cherry-picks of b71e338, 4715c1b, and c31cd32), each carrying the same revert message as above.
@duncannz

duncannz commented Jun 24, 2024

I'm having an issue on Asterisk 20.8.1 with the same symptoms as originally described in this issue, but I'm not quite sure whether it has the same root cause, so I wanted to check here first before opening a new issue.

I am experiencing:

  • Extremely high CPU usage
  • Log messages such as Exceptionally long queue length (0 voice / 129 total) queuing to Recorder/ARI-00000007;1
  • If I turn on debug logging, the following messages are logged a huge number of times:
    [2024-06-24 04:09:29] DEBUG[9621][C-00000007]: chan_pjsip.c:1644 chan_pjsip_indicate:  PJSIP/webrtc_client-00000005: Indicated Video Update
    [2024-06-24 04:09:29] DEBUG[9621][C-00000007]: chan_pjsip.c:1875 chan_pjsip_indicate:  PJSIP/webrtc_client-00000005

My setup is a local channel that is recorded on both sides (and each side is in a bridge with another channel using PJSIP), and this seems to be enough to trigger the issue. I'm not using video in any way.

Basically, my issue is exactly the same as this 2021 post, but that post had no replies, and that is literally the only thing I can find on Google for "Indicated Video Update".

Does this sound like the same issue that was reported here, or like a separate issue?

@jcolp
Member

jcolp commented Jun 24, 2024

It's most likely a separate issue.

@duncannz

Thank you, I've created issue #780 for it.
