
Jitterbuffer fix + basic send PLI implementation #461

Merged: 8 commits into aiortc:main, Feb 19, 2021

Conversation

@Przem83 (Contributor) commented Feb 2, 2021

Hi,
I've prepared a fix that may interest a lot of people having picture loss issues. I discovered that a big part of those issues might be related to a bad implementation of the jitter buffer that collects RTP packets before they are passed to the appropriate decoding function. Each RTP packet sent by a browser (sender) carries a packet number that is used to properly group the packets into video frames on the receiver side. In theory those packet numbers should be unique; in practice, however, they are limited by the unsigned short int size, which is 65535. Unfortunately, the aiortc jitter buffer implementation did not correctly handle the packet numbers wrapping around from 65535 back to 0, causing the jitter buffer to be purged and the video signal to be lost or at least heavily degraded. The situation repeated itself every 65536 received packets and was complicated by the fact that aiortc does not support sending PLI (picture loss indication) messages. Without PLI messages being sent, the video image could remain corrupted indefinitely, as it is the video encoder's decision if and when to send a new keyframe (in WebRTC this usually happens on a rapid scene change or when it is explicitly requested through a PLI).
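The wrap-around arithmetic described above can be sketched like this (a simplified illustration with made-up names, not the aiortc code itself):

```python
MAX_SEQ = 1 << 16  # RTP sequence numbers are unsigned 16-bit integers

def seq_delta(new: int, origin: int) -> int:
    """Distance from origin to new, modulo 2**16, so the wrap from
    65535 back to 0 counts as a step of +1 rather than -65535."""
    return (new - origin) % MAX_SEQ

# crossing the wrap boundary: sequence number 0 follows 65535
print(seq_delta(0, 65535))   # 1
# a naive subtraction would treat the same step as a huge jump backwards
print(0 - 65535)             # -65535
```

Without the modulo, a buffer keyed on raw differences sees the wrap as a massive out-of-order jump and purges itself, which matches the symptom described in this PR.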

This commit fixes the jitter buffer issue and adds a simple send PLI mechanism to the jitter buffer implementation.

This commit fixes the jitterbuffer issue that caused severe picture
loss and adds a simple send PLI mechanism
@codecov (bot) commented Feb 2, 2021

Codecov Report

Merging #461 (2c1bc4f) into main (04ebbfa) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #461   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         5498      5566   +68     
=========================================
+ Hits          5498      5566   +68     
Impacted Files Coverage Δ
src/aiortc/__init__.py 100.00% <100.00%> (ø)
src/aiortc/about.py 100.00% <100.00%> (ø)
src/aiortc/codecs/h264.py 100.00% <100.00%> (ø)
src/aiortc/contrib/media.py 100.00% <100.00%> (ø)
src/aiortc/contrib/signaling.py 100.00% <100.00%> (ø)
src/aiortc/jitterbuffer.py 100.00% <100.00%> (ø)
src/aiortc/rtcdatachannel.py 100.00% <100.00%> (ø)
src/aiortc/rtcdtlstransport.py 100.00% <100.00%> (ø)
src/aiortc/rtcicetransport.py 100.00% <100.00%> (ø)
src/aiortc/rtcpeerconnection.py 100.00% <100.00%> (ø)
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 04ebbfa...20a5455.

Ratuszek Przemysław - Hurt added 2 commits February 2, 2021 16:55
The code has been optimized and some additional bugfixes have been added
Introduced a new smart_remove() method that prevents sending corrupt
frames to the decoder.
With the new smart_remove() method I had to modify the tests to take
into account the new smart_remove behaviour
@jlaine (Collaborator) commented Feb 10, 2021

Could we please have a clean PR so there is a chance of actually merging it? If you have identified a bug related to packet number wrap-around, please submit a PR which addresses JUST that. The work on PLI seems interesting but incomplete, so this looks like a candidate for a separate PR.

@Przem83 (Contributor, Author) commented Feb 10, 2021

What exactly do you mean by a clean PR? Do you want me to squash all commits into one, or are you talking about separating the wrap-around fix from the PLI part? From my perspective they are both part of the jitterbuffer fix, as a properly implemented jitterbuffer should trigger a PLI request each time it is purged (reset).

PS. The PLI implementation I proposed is simple but it is also complete (it addresses all the cases in which a PLI should be sent). The only thing missing is a PLI re-transmission mechanism, but that can be a candidate for a separate PR, as you said.

@jlaine (Collaborator) commented Feb 10, 2021

Yes, if you could split the wrap-around fix from the PLI part, that would be perfect. Also, please make sure you remove any print() statements and commented-out lines of code.

This commit cleans the code of print() statements and commented-out lines
@Przem83 (Contributor, Author) commented Feb 12, 2021

I've cleaned up the code as you asked. As for the split, I did create a separate branch with just the wrap-around fix, but before switching the branches in the PR I would like to try to convince you that merging the PR as it stands is actually the better idea, and here is why:

  1. For static cameras (my use case), without the send PLI part, whenever there is some kind of disturbance in the network the image will be permanently lost, as there is no auto-recovery mechanism for when the jitterbuffer overflows. That is a big issue for anyone who wants to use an aiortc connection for extended periods of time, as I do.
  2. In its current form the PR should permanently fix the following issues: Jitterbuffer overflow causing up to 3000 lost frame packets #26, Send Picture Loss Indication (PLI) when a keyframe is needed #58, Video stream became 'broken' randomly #182, example videostream-cli is Failing #196, Recording Janus Video room produces periodic aberations #335, and at least improve the situation for random video interrupts #35, why the picture is frozen when we're working with asyncio.Queue? #28, ValueError: packet is too long when trying to stream from v4l2 /dev/video* #176.
  3. In my opinion, making a separate PR for a PLI implementation that consists of only two simple if statements is overkill (there is almost no difference between the branch with and without PLI).
  4. Disabling the current send PLI implementation is as simple as removing the "sendPLI=self._send_rtcp_pli" parameter from the JitterBuffer constructor (line 258 of rtcrtpreceiver.py).

I hope you see the reasoning behind my point of view and that you'll agree to merge the fix in its current form. If not, I'll of course switch the branches in the PR and we'll do it your way. I'm waiting for your final decision,
Best regards,
Przemek.

@@ -114,14 +141,14 @@ def test_add_seq_too_high_discard_four(self):
jbuffer.add(RtpPacket(sequence_number=1, timestamp=1234))
self.assertEqual(jbuffer._origin, 0)

jbuffer.add(RtpPacket(sequence_number=2, timestamp=1234))
self.assertEqual(jbuffer._origin, 0)
# jbuffer.add(RtpPacket(sequence_number=2, timestamp=1234))
Collaborator:

Please remove the code instead of commenting it.

Contributor Author:

No problem, I'll remove it. I just thought it's easier to understand what's going on this way.

jbuffer = JitterBuffer(capacity=4)

jbuffer.add(RtpPacket(sequence_number=0, timestamp=1234))
# jbuffer.add(RtpPacket(sequence_number=1, timestamp=1234))
Collaborator:

Please remove the commented line

Contributor Author:

No problem, I'll remove it. I just thought it's easier to understand this way.

@@ -73,7 +83,7 @@ def _remove_frame(self, sequence_number: int) -> Optional[JitterFrame]:
# check we have prefetched enough
frames += 1
if frames >= self._prefetch:
self.remove(remove)
self.remove(remove) # this might be a bit faster than smart_remove
Collaborator:

I don't understand the comment: should we use smart_remove() or not here?

Contributor Author:

Actually, we can get rid of the remove() function entirely and replace it with smart_remove(). I haven't done so already because I didn't want to break backward compatibility, and because smart_remove() checks the dumb_mode flag on every loop iteration (which is why it might be slightly slower). smart_remove() run with the dumb_mode flag set to True behaves exactly like the remove() function. I see 3 options here, we can:

  1. leave it as it is
  2. replace the remove() calls with smart_remove(dumb_mode=True) calls everywhere and get rid of remove() method
  3. rename smart_remove() to remove() and replace the dumb_mode flag with smart_mode flag in the current implementation - then we will have only remove(int) calls or remove(int, smart_mode=True) calls everywhere.

I leave the final decision to you, as you know better how each of the changes can impact the rest of the aiortc code.
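To illustrate the difference between the two removal strategies being discussed, here is a toy model (simplified slots holding frame timestamps as plain ints; hypothetical names, not the actual aiortc implementation):

```python
from typing import List, Optional

class JitterBufferSketch:
    """Toy model of the two removal strategies discussed in this thread."""

    def __init__(self, slots: List[Optional[int]]) -> None:
        self._capacity = len(slots)
        self._packets = list(slots)
        self._origin = 0

    def remove(self, count: int) -> None:
        # "dumb" removal: drop exactly `count` slots, ignoring frame boundaries
        for _ in range(count):
            self._packets[self._origin % self._capacity] = None
            self._origin = (self._origin + 1) & 0xFFFF

    def smart_remove(self, count: int) -> bool:
        # remove at least `count` slots, then keep going until the next
        # occupied slot starts a different frame (different timestamp), so
        # no partial frame is ever handed to the decoder; returns True if
        # the whole buffer was purged
        timestamp = None
        for i in range(self._capacity):
            pos = self._origin % self._capacity
            ts = self._packets[pos]
            if ts is not None:
                if i >= count and ts != timestamp:
                    return False  # stopped at the start of the next frame
                timestamp = ts
            self._packets[pos] = None
            self._origin = (self._origin + 1) & 0xFFFF
        return True
```

For example, asking `smart_remove(1)` on a buffer holding two packets of frame 1 followed by frame 2 removes both frame-1 packets, not just one, then stops cleanly at the frame boundary.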

delta = 0
misorder = 0
else:
delta = (packet.sequence_number - self._origin) % self.__max_number
Collaborator:

Can we use uint16_add (defined in aiortc.utils) for consistency?

Contributor Author:

I don't think so - I believe in python (-2 & 0xFFFF) = 2 while (-2 % 65536) = 65534

@jlaine (Collaborator) commented Feb 15, 2021:

Nope they are identical:

>>> -2 & 0xFFFF
65534
>>> -2 % 65536
65534

It's good news too, otherwise all our sequence number calculations for what we send would be messed up :)

"""
Send an RTCP packet to report picture loss.
"""
print("[INFO][JitterBuffer] PLI sent to media_ssrc:", media_ssrc)
Collaborator:

This print() statement seems kind of pointless. Could we instead define a function in each test which stores a value which we can assert?

Contributor Author:

Well, you're right that it does nothing currently. I just thought it looks better than a pass statement. I'm not sure I get what you want to achieve with the assert approach, though. To me this would also be pointless, as the async function will run after the test finishes, and we cannot run a synchronous method with asyncio.ensure_future() (jitterbuffer.py lines 44 and 54).

Collaborator:

My point is we are not asserting the function was actually called.

@jlaine (Collaborator) commented Feb 13, 2021

On the whole this PR looks very good, thanks so much for working on this. The one point I'm not very comfortable about is having synchronous code calling into asynchronous code. Is there any other way we could handle this?

@Przem83 (Contributor, Author) commented Feb 15, 2021

OK, I've added some comments to your review and I'll wait for your answer before pushing the next commit. I see your point about running asynchronous code from synchronous code, but I haven't found any easy way around it. I did test the code very thoroughly, though, and confirmed that everything works as intended. If avoiding this synchronous/asynchronous mix is really that important to you, then I believe the best solution would be to rewrite the jitterbuffer code and make it asynchronous, as all the rtcp send methods I've found are asynchronous as well.

self._origin = (self._origin + 1) % self.__max_number

def smart_remove(self, count: int, dumb_mode: bool = False) -> bool:
# smart_remove makes sure that all packages belonging to the same frame are removed
Collaborator:

Sorry, I missed this earlier: please make this a docstring.

@jlaine (Collaborator) commented Feb 15, 2021

There is a very simple way of avoiding the call to the asynchronous method: change the signature of add() to return Tuple[Optional[JitterFrame], bool]. The second member of the tuple would be "do we need to send a PLI?". That way the call to the async code is done by the caller, which in our case is an async method (so we won't use ensure_future at all):

encoded_frame = self.__jitter_buffer.add(packet)

Solved the async function call from synchronous code, and removed the dumb_mode flag from smart_remove(); for a "dumb" removal, remove() should be called from now on.
@@ -89,4 +99,23 @@ def remove(self, count: int) -> None:
for i in range(count):
pos = self._origin % self._capacity
self._packets[pos] = None
self._origin += 1
self._origin = (self._origin + 1) % self.__max_number
Collaborator:

self._origin = uint16_add(self._origin, 1)

break
timestamp = packet.timestamp
self._packets[pos] = None
self._origin = (self._origin + 1) % self.__max_number
Collaborator:

self._origin = uint16_add(self._origin, 1)
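For reference, a helper with this name could be as simple as the following sketch (the actual definition lives in aiortc.utils and may differ):

```python
def uint16_add(a: int, b: int) -> int:
    """Add two values and wrap the result at 16 bits,
    matching RTP sequence-number arithmetic."""
    return (a + b) & 0xFFFF

print(uint16_add(65535, 1))  # 0, wraps around
print(uint16_add(100, 1))    # 101
```

Using the shared helper instead of an ad-hoc `% self.__max_number` keeps all sequence-number arithmetic in one place.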

@@ -18,35 +19,44 @@ def __init__(self, capacity: int, prefetch: int = 0) -> None:
self._origin: Optional[int] = None
self._packets: List[Optional[RtpPacket]] = [None for i in range(capacity)]
self._prefetch = prefetch
self.__max_number = 65536
Collaborator:

Not needed with the suggested changes below

self.remove(self.capacity)
self._origin = packet.sequence_number
delta = misorder = 0
if self._capacity >= 128:
Collaborator:

What's the 128 magic value? Could we have a named constant with a descriptive name?

Contributor Author:

This is the value set in rtcrtpreceiver.py (line 257) as the jitterbuffer size for the video stream. I assume the video jitterbuffer size will only increase over time, while the audio buffer should not reach 128 any time soon. I used the jitterbuffer size (capacity) to make sure we are actually dealing with video before sending a PLI.

Contributor Author:

A viable alternative would be to add an "is_video" flag to the jitterbuffer __init__ method, as currently there is no easy way to check whether the jitterbuffer is used for video other than checking the self._capacity value.

Collaborator:

is_video sounds like a good idea!
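A sketch of what that flag might look like on the constructor (hypothetical signature following the discussion; attribute names are illustrative):

```python
from typing import List, Optional

class JitterBuffer:
    def __init__(self, capacity: int, prefetch: int = 0,
                 is_video: bool = False) -> None:
        self._capacity = capacity
        self._prefetch = prefetch
        self._is_video = is_video  # request a PLI only for video buffers
        self._packets: List[Optional[object]] = [None] * capacity

# the receiver would then declare intent explicitly instead of
# inferring "video" from a magic capacity threshold:
video_buffer = JitterBuffer(capacity=128, is_video=True)
audio_buffer = JitterBuffer(capacity=16)
```

This replaces the `if self._capacity >= 128` heuristic with an explicit, self-documenting parameter.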

excess = delta - self.capacity + 1
self.remove(excess)
if self.smart_remove(excess):
@jlaine (Collaborator) commented Feb 16, 2021:

Why does smart_remove change self._origin if we also change it right afterwards?

Contributor Author:

We change it afterwards only when the buffer has been entirely purged (in that case smart_remove returns True). If we deleted the "self._origin = packet.sequence_number" from line 52, then whenever the buffer was purged we would end up with a self._origin value pointing to a self._packets element that is None, which is a bad thing that would cause the buffer to stay 100% filled no matter what (the _remove_frame method would terminate on line 72 until the value was overwritten).

Contributor Author:

BTW, we have the same situation with remove on lines 40 and 41 of jitterbuffer.py. I found that the previous code did not handle this situation correctly when I encountered some abnormal local network conditions that caused a lot of packet drops.

@Przem83 (Contributor, Author) commented Feb 17, 2021

I'll do another iteration on the code to address the suggestions from the newest review. I should be able to push the next commit tomorrow before 15:00 (CET).

pli_flag, encoded_frame = self.__jitter_buffer.add(packet)
# check if the PLI should be sent
if pli_flag:
asyncio.ensure_future(self._send_rtcp_pli(packet.ssrc))
Collaborator:

Don't use ensure_future, just await the call to send_rtcp_pli

Small code corrections + a new test_rtp_missing_video_packet test that takes into account the send PLI method.
@jlaine (Collaborator) commented Feb 19, 2021

Ok I think we've reached a good place and you have obviously taken the time to test this code. I'm going to merge it as-is, we can always have a follow-up PR. I'd actually love to see other PRs from you if you have time.

Thanks so much for bringing the coverage back to 100%!

@jlaine jlaine merged commit a70c288 into aiortc:main Feb 19, 2021
@jlaine (Collaborator) commented Feb 19, 2021

Ah crap I forgot to squash the commits. The final commit is 42f4e0f

@Przem83 Przem83 deleted the main branch April 27, 2021 09:14
jlaine pushed a commit to jlaine/aiortc that referenced this pull request Jun 18, 2024