
Jitterbuffer fix + basic send PLI implementation #461

Merged: 8 commits into aiortc:main, Feb 19, 2021

Conversation

@Przem83 (Contributor) commented Feb 2, 2021

Hi,
I've prepared a fix that may interest a lot of people having picture loss issues. I discovered that a big part of those issues might be related to a bad implementation of the jitter buffer that collects RTP packets before they are passed to the appropriate decoding function. Each RTP packet sent by a browser (sender) carries a packet number that is used to properly group the packets into video frames on the receiver side. In theory those packet numbers should be unique; in practice, however, they are limited by the unsigned short int size, which is 65535. Unfortunately, the aiortc jitter buffer implementation did not correctly handle the packet numbers wrapping around from 65535 back to 0, causing the jitter buffer to be purged and the video signal to be lost or at least heavily degraded. The situation repeated itself every 65536 received packets and was complicated by the fact that aiortc does not support sending PLI (picture loss indication) messages. Without PLI messages being sent, the video image could remain corrupted indefinitely, as it is the video encoder's decision if and when to send a new keyframe (in WebRTC this usually happens on a rapid scene change or when it is explicitly requested through a PLI).
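The wrap-around arithmetic described above can be sketched like this (a simplified illustration with made-up names, not the aiortc code itself):

```python
MAX_SEQ = 1 << 16  # RTP sequence numbers are unsigned 16-bit integers

def seq_delta(new: int, origin: int) -> int:
    """Distance from origin to new, modulo 2**16, so the wrap from
    65535 back to 0 counts as a step of +1 rather than -65535."""
    return (new - origin) % MAX_SEQ

# crossing the wrap boundary: sequence number 0 follows 65535
print(seq_delta(0, 65535))   # 1
# a naive subtraction would treat the same step as a huge jump backwards
print(0 - 65535)             # -65535
```

Without the modulo, a buffer keyed on raw differences sees the wrap as a massive out-of-order jump and purges itself, which matches the symptom described in this PR.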

This commit fixes the jitter buffer issue and adds a simple send PLI mechanism to the jitter buffer implementation.

This commit fixes the jitterbuffer issue that caused severe picture
loss and adds a simple send PLI mechanism
@codecov (bot) commented Feb 2, 2021

Codecov Report

Merging #461 (2c1bc4f) into main (04ebbfa) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #461   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         5498      5566   +68     
=========================================
+ Hits          5498      5566   +68     
Impacted Files Coverage Δ
src/aiortc/__init__.py 100.00% <100.00%> (ø)
src/aiortc/about.py 100.00% <100.00%> (ø)
src/aiortc/codecs/h264.py 100.00% <100.00%> (ø)
src/aiortc/contrib/media.py 100.00% <100.00%> (ø)
src/aiortc/contrib/signaling.py 100.00% <100.00%> (ø)
src/aiortc/jitterbuffer.py 100.00% <100.00%> (ø)
src/aiortc/rtcdatachannel.py 100.00% <100.00%> (ø)
src/aiortc/rtcdtlstransport.py 100.00% <100.00%> (ø)
src/aiortc/rtcicetransport.py 100.00% <100.00%> (ø)
src/aiortc/rtcpeerconnection.py 100.00% <100.00%> (ø)
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 04ebbfa...20a5455.

Ratuszek Przemysław - Hurt added 2 commits February 2, 2021 16:55
The code has been optimized and some additional bugfixes have been added
Introduced a new smart_remove() method that prevents sending corrupt
frames to the decoder.
With the new smart_remove() method I had to modify the tests to take
into account the new smart_remove behaviour
@jlaine (Collaborator) commented Feb 10, 2021

Could we please have a clean PR so there is a chance of actually merging it? If you have identified a bug related to packet number wrap-around, please submit a PR which addresses JUST that. The work on PLI seems interesting but incomplete, so this looks like a candidate for a separate PR.

@Przem83 (Contributor, Author) commented Feb 10, 2021

What exactly do you mean by a clean PR? Do you want me to squash all commits into one, or are you talking about separating the wrap-around fix from the PLI part? From my perspective they are both part of the jitterbuffer fix, as a properly implemented jitterbuffer should trigger a PLI request each time it is purged (reset).

PS. The PLI implementation I proposed is simple but it is also complete (it addresses all the cases in which a PLI should be sent). The only thing missing is a PLI re-transmission mechanism, but that can be a candidate for a separate PR, as you said.

@jlaine (Collaborator) commented Feb 10, 2021

Yes, if you could split the wrap-around fix from the PLI part, that would be perfect. Also, please make sure you remove any print() statements and commented-out lines of code.

This commit cleans the code of print() statements and commented-out lines
@Przem83 (Contributor, Author) commented Feb 12, 2021

I've cleaned up the code as you asked. As for the split, I did create a separate branch with just the wrap-around fix, but before switching the branches in the PR I would like to try to convince you that merging the PR as it stands is actually the better idea, and here is why:

  1. For static cameras (my use case), without the send PLI part, whenever there is some kind of disturbance in the network the image will be permanently lost, as there is no auto-recovery mechanism for when the jitterbuffer overflows. That is a big issue for anyone who wants to use an aiortc connection for extended periods of time, as I do.
  2. In its current form the PR should permanently fix the following issues: Jitterbuffer overflow causing up to 3000 lost frame packets #26, Send Picture Loss Indication (PLI) when a keyframe is needed #58, Video stream became 'broken' randomly #182, example videostream-cli is Failing #196, Recording Janus Video room produces periodic aberations #335, and at least improve the situation for random video interrupts #35, why the picture is frozen when we're working with asyncio.Queue? #28, ValueError: packet is too long when trying to stream from v4l2 /dev/video* #176.
  3. In my opinion, making a separate PR for a PLI implementation that consists of only two simple if statements is overkill (there is almost no difference between the branch with and without PLI).
  4. Disabling the current send PLI implementation is as simple as removing the "sendPLI=self._send_rtcp_pli" parameter from the JitterBuffer constructor (line 258 of rtcrtpreceiver.py).

I hope you see the reasoning behind my point of view and that you'll agree to merge the fix in its current form. If not, I'll of course switch the branches in the PR and we'll do it your way. I'm waiting for your final decision,
Best regards,
Przemek.

@@ -114,14 +141,14 @@ def test_add_seq_too_high_discard_four(self):
jbuffer.add(RtpPacket(sequence_number=1, timestamp=1234))
self.assertEqual(jbuffer._origin, 0)

jbuffer.add(RtpPacket(sequence_number=2, timestamp=1234))
self.assertEqual(jbuffer._origin, 0)
# jbuffer.add(RtpPacket(sequence_number=2, timestamp=1234))
Collaborator:

Please remove the code instead of commenting it.

Contributor Author:

No problem, I'll remove it. I just thought it's easier to understand what's going on this way.

jbuffer = JitterBuffer(capacity=4)

jbuffer.add(RtpPacket(sequence_number=0, timestamp=1234))
# jbuffer.add(RtpPacket(sequence_number=1, timestamp=1234))
Collaborator:

Please remove the commented line

Contributor Author:

No problem, I'll remove it. I just thought it's easier to understand this way.

@@ -73,7 +83,7 @@ def _remove_frame(self, sequence_number: int) -> Optional[JitterFrame]:
# check we have prefetched enough
frames += 1
if frames >= self._prefetch:
self.remove(remove)
self.remove(remove) # this might be a bit faster than smart_remove
Collaborator:

I don't understand the comment: should we use smart_remove() or not here?

Contributor Author:

Actually, we can get rid of the remove() function entirely and replace it with smart_remove(). I haven't done so already because I didn't want to break backward compatibility, and because smart_remove() checks the dumb_mode flag on every loop iteration (which is why it might be slightly slower). smart_remove() run with the dumb_mode flag set to True behaves exactly like the remove() function. I see 3 options here, we can:

  1. leave it as it is
  2. replace the remove() calls with smart_remove(dumb_mode=True) calls everywhere and get rid of remove() method
  3. rename smart_remove() to remove() and replace the dumb_mode flag with smart_mode flag in the current implementation - then we will have only remove(int) calls or remove(int, smart_mode=True) calls everywhere.

I leave the final decision to you, as you know better how each of the changes can impact the rest of the aiortc code.
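To illustrate the difference between the two removal strategies being discussed, here is a toy model (simplified slots holding frame timestamps as plain ints; hypothetical names, not the actual aiortc implementation):

```python
from typing import List, Optional

class JitterBufferSketch:
    """Toy model of the two removal strategies discussed in this thread."""

    def __init__(self, slots: List[Optional[int]]) -> None:
        self._capacity = len(slots)
        self._packets = list(slots)
        self._origin = 0

    def remove(self, count: int) -> None:
        # "dumb" removal: drop exactly `count` slots, ignoring frame boundaries
        for _ in range(count):
            self._packets[self._origin % self._capacity] = None
            self._origin = (self._origin + 1) & 0xFFFF

    def smart_remove(self, count: int) -> bool:
        # remove at least `count` slots, then keep going until the next
        # occupied slot starts a different frame (different timestamp), so
        # no partial frame is ever handed to the decoder; returns True if
        # the whole buffer was purged
        timestamp = None
        for i in range(self._capacity):
            pos = self._origin % self._capacity
            ts = self._packets[pos]
            if ts is not None:
                if i >= count and ts != timestamp:
                    return False  # stopped at the start of the next frame
                timestamp = ts
            self._packets[pos] = None
            self._origin = (self._origin + 1) & 0xFFFF
        return True
```

For example, asking `smart_remove(1)` on a buffer holding two packets of frame 1 followed by frame 2 removes both frame-1 packets, not just one, then stops cleanly at the frame boundary.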

delta = 0
misorder = 0
else:
delta = (packet.sequence_number - self._origin) % self.__max_number
Collaborator:

Can we use uint16_add (defined in aiortc.utils) for consistency?

Contributor Author:

I don't think so - I believe in python (-2 & 0xFFFF) = 2 while (-2 % 65536) = 65534

@jlaine (Collaborator) commented Feb 15, 2021:

Nope they are identical:

>>> -2 & 0xFFFF
65534
>>> -2 % 65536
65534

It's good news too, otherwise all our sequence number calculations for what we send would be messed up :)

"""
Send an RTCP packet to report picture loss.
"""
print("[INFO][JitterBuffer] PLI sent to media_ssrc:", media_ssrc)
Collaborator:

This print() statement seems kind of pointless. Could we instead define a function in each test which stores a value which we can assert?

Contributor Author:

Well, you're right that it does nothing currently. I just thought it looks better than a pass statement. I'm not sure I get what you want to achieve with the assert approach, though. To me this would also be pointless, as the async function will run after the test finishes, and we cannot run a synchronous method with asyncio.ensure_future() (jitterbuffer.py lines 44 and 54).

Collaborator:

My point is we are not asserting the function was actually called.

@jlaine (Collaborator) commented Feb 13, 2021

On the whole this PR looks very good, thanks so much for working on this. The one point I'm not very comfortable about is having synchronous code calling into asynchronous code. Is there any other way we could handle this?

@Przem83 (Contributor, Author) commented Feb 15, 2021

OK, I've added some comments to your review and I'll wait for your answer before pushing the next commit. I see your point about running asynchronous code from synchronous code, but I haven't found any easy way around it. I did test the code very thoroughly, though, and confirmed that everything works as intended. If avoiding this synchronous/asynchronous mix is really that important to you, then I believe the best solution would be to rewrite the jitterbuffer code and make it asynchronous, as all the rtcp send methods I've found are asynchronous as well.

self._origin = (self._origin + 1) % self.__max_number

def smart_remove(self, count: int, dumb_mode: bool = False) -> bool:
# smart_remove makes sure that all packages belonging to the same frame are removed
Collaborator:

Sorry, I missed this earlier: please make this a docstring.

@jlaine (Collaborator) commented Feb 15, 2021

There is a very simple way of avoiding the call to the asynchronous method: change the signature of add() to return Tuple[Optional[JitterFrame], bool]. The second member of the tuple would be "do we need to send a PLI?". That way the call to the async code is done by the caller, which in our case is an async method (so we won't use ensure_future at all):

encoded_frame = self.__jitter_buffer.add(packet)

Solved the async function call from synchronous code, and removed the dumb_mode flag from smart_remove(); for a "dumb" removal, remove() should be called from now on.
@@ -89,4 +99,23 @@ def remove(self, count: int) -> None:
for i in range(count):
pos = self._origin % self._capacity
self._packets[pos] = None
self._origin += 1
self._origin = (self._origin + 1) % self.__max_number
Collaborator:

self._origin = uint16_add(self._origin, 1)

break
timestamp = packet.timestamp
self._packets[pos] = None
self._origin = (self._origin + 1) % self.__max_number
Collaborator:

self._origin = uint16_add(self._origin, 1)
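For reference, a helper with this name could be as simple as the following sketch (the actual definition lives in aiortc.utils and may differ):

```python
def uint16_add(a: int, b: int) -> int:
    """Add two values and wrap the result at 16 bits,
    matching RTP sequence-number arithmetic."""
    return (a + b) & 0xFFFF

print(uint16_add(65535, 1))  # 0, wraps around
print(uint16_add(100, 1))    # 101
```

Using the shared helper instead of an ad-hoc `% self.__max_number` keeps all sequence-number arithmetic in one place.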

@@ -18,35 +19,44 @@ def __init__(self, capacity: int, prefetch: int = 0) -> None:
self._origin: Optional[int] = None
self._packets: List[Optional[RtpPacket]] = [None for i in range(capacity)]
self._prefetch = prefetch
self.__max_number = 65536
Collaborator:

Not needed with the suggested changes below

self.remove(self.capacity)
self._origin = packet.sequence_number
delta = misorder = 0
if self._capacity >= 128:
Collaborator:

What's the 128 magic value? Could we have a named constant with a descriptive name?

Contributor Author:

This is the value set in rtcrtpreceiver.py (line 257) as the jitterbuffer size for the video stream. I assume the video jitterbuffer size will only increase over time, while the audio buffer should not reach 128 any time soon. I used the jitterbuffer size (capacity) to make sure we are actually dealing with video before sending a PLI.

Contributor Author:

A viable alternative would be to add an "is_video" flag to the jitterbuffer __init__ method, as currently there is no easy way to check whether the jitterbuffer is used for video other than checking the self._capacity value.

Collaborator:

is_video sounds like a good idea!
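A sketch of what that flag might look like on the constructor (hypothetical signature following the discussion; attribute names are illustrative):

```python
from typing import List, Optional

class JitterBuffer:
    def __init__(self, capacity: int, prefetch: int = 0,
                 is_video: bool = False) -> None:
        self._capacity = capacity
        self._prefetch = prefetch
        self._is_video = is_video  # request a PLI only for video buffers
        self._packets: List[Optional[object]] = [None] * capacity

# the receiver would then declare intent explicitly instead of
# inferring "video" from a magic capacity threshold:
video_buffer = JitterBuffer(capacity=128, is_video=True)
audio_buffer = JitterBuffer(capacity=16)
```

This replaces the `if self._capacity >= 128` heuristic with an explicit, self-documenting parameter.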

excess = delta - self.capacity + 1
self.remove(excess)
if self.smart_remove(excess):
@jlaine (Collaborator) commented Feb 16, 2021:

Why does smart_remove change self._origin if we also change it right afterwards?

Contributor Author:

We change it afterwards only when the buffer has been entirely purged (in that case smart_remove returns True). If we deleted the "self._origin = packet.sequence_number" from line 52, then whenever the buffer was purged we would end up with a self._origin value pointing to a self._packets element that is None, which is a bad thing that would cause the buffer to stay 100% filled no matter what (the _remove_frame method would terminate on line 72 until the value was overwritten).

Contributor Author:

BTW, we have the same situation with remove on lines 40 and 41 of jitterbuffer.py. I found that the previous code did not handle this situation correctly when I encountered some abnormal local network conditions that caused a lot of packet drops.

@Przem83 (Contributor, Author) commented Feb 17, 2021

I'll do another iteration on the code to address the suggestions from the newest review. I should be able to push the next commit tomorrow before 15:00 (CET).

pli_flag, encoded_frame = self.__jitter_buffer.add(packet)
# check if the PLI should be sent
if pli_flag:
asyncio.ensure_future(self._send_rtcp_pli(packet.ssrc))
Collaborator:

Don't use ensure_future, just await the call to send_rtcp_pli

Small code corrections + a new test_rtp_missing_video_packet test that takes into account the send PLI method.
@jlaine (Collaborator) commented Feb 19, 2021

Ok I think we've reached a good place and you have obviously taken the time to test this code. I'm going to merge it as-is, we can always have a follow-up PR. I'd actually love to see other PRs from you if you have time.

Thanks so much for bringing the coverage back to 100%!

@jlaine jlaine merged commit a70c288 into aiortc:main Feb 19, 2021
@jlaine (Collaborator) commented Feb 19, 2021

Ah crap I forgot to squash the commits. The final commit is 42f4e0f

@Przem83 Przem83 deleted the main branch April 27, 2021 09:14
jlaine pushed a commit to jlaine/aiortc that referenced this pull request Jun 18, 2024