Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jbuf: fix for adaptive mode #960

Closed
wants to merge 6 commits into from

Conversation

cspiel1
Copy link
Collaborator

@cspiel1 cspiel1 commented Sep 21, 2023

The adaptive mode was broken. Video frames are not relevant for the wish size. The estimation of the number of frames was not accurate because key frames have much higher #frame/#packet ratio then differential frames.

The wish size depends only on rdiff, the measure for how much packet sequence number differ from the expected sequence number if packets arrive out of order.

  • jbuf: wish size is given in #packets
  • test: jbuf packets with equal timestamps
  • jbuf: trace data for plot
  • jbuf: return EAGAIN only if wish size is exceeded
  • jbuf: faster reduce wish size in adaptive mode
  • jbuf: simplify rdiff computation

The plots generated from baresip/tools/jbuf of PR baresip/baresip#2724 visualize how the adaptive mode works.

The wish size for the adaptive mode depends on how much the packets are re-
ordered. It can only be measured by looking on the sequence number of the RTP
header and how much it jumps around because of re-ordered packets. Counting
video frames is not needed and causes failures when the wish size is computed.
@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

macos tests are failing. This has nothing to do with this PR. @sreimers please could you have a look?

@cspiel1 cspiel1 marked this pull request as ready for review September 21, 2023 06:35
@sreimers
Copy link
Member

But you want to specify for video how many frames are held. As you say, a video frame can be x packets (depending on keyframe and frame changes) and you want to make sure that a frame is ready (all packets in order) before decoding starts.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

I think it is not relevant if the frame is ready. The decoder should know what to do if the frame is not complete. It waits for more packets.
The jbuf should not differ between audio and video. And observing the timestamp currently has no other application and can be ignored.

The only application of the jbuf is to re-order packets. Thus observing the sequence number is enough.

edit:
Please look the plots: baresip/baresip#2724
The adaptive mode works again and the decoded video looks good with modules avcodec and x11.
I tested with: jitter simulation of 50ms. See: https://github.com/baresip/baresip/pull/2724/files#diff-40cd15ad2ef562a2c2cc1079ca43315fd35425f2e864c7dda28dd864c8753f96R14

@sreimers
Copy link
Member

Ok I can move the logic with correct video frame handling also outside of jbuf I think, to keep jbuf simple.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

Is it really relevant if the chunks of a video frame follow very quickly or with small interruptions (waiting for next RTP packets) to the decoder?

@sreimers
Copy link
Member

sreimers commented Sep 21, 2023

It is important that a frame is complete, because otherwise, say the last packet (or new frame packet) arrives before it is finished (since jbuf doesn't keep all relevant packets by default), the entire frame (all video chunks) is corrupted. Since keyframes can be hundreds of packets in a short period of time, the probability is high.

@sreimers
Copy link
Member

@sreimers
Copy link
Member

Another use-case is Generic RTP Feedback - RTPFB (RTCP), since the encoder can try to resend lost packets (makes only sense for video). Therefore the jitter buffer has to held all packets that belongs to a video frame until they arrive to order the packets.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

In a next PR, we could implement this:

  • jbuf_put() could remember timestamps where a sequence number is already missing
  • jbuf_put() resets remembered timestamps if a missing packet arrives
  • jbuf_get() holds back packets with such timestamps (for a while)
  • The upper layer then could do the RTPFB (RTCP) trick.

We could enable this behavior for jbuf by a setter and enable it only for video streams.

@sreimers
Copy link
Member

sreimers commented Sep 21, 2023

Not sure if understand the apis correctly, but jbuf has to keep always a whole frame (at least one or more), otherwise a video frame get's easily corrupted as mentioned (if you move only chunks to the decoder).

The adaptive mode wish size calculation based on packets, makes not much sense for video I think.

Edit:
Keep in mind, say you have a video with 25 fps. So every 40ms a frame. One frame could be 1 Packet or maybe 1000 Packets or more. So you can not calculate based on packets. With a jitter of 50ms you have to keep at least two frames (80ms) to avoid corrupted frames if reordering happens.

Maybe we should use ptime and timestamp based calculation instead.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

I do not agree. The job of jbuf is to reorder out-of-order RTP packets. If it is large enough to reorder all out-of-order packets (with some probability) then e.g. video frames will not be corrupted. This is a very simple invariant.

The current implementation in main is wrong and leads to freezing video if the packets arrive out-of-order.

Edit:
Note, that not video frames need to be re-ordered. Instead RTP packets may need to be re-ordered.

@sreimers
Copy link
Member

sreimers commented Sep 21, 2023

The current implementation in main is wrong and leads to freezing video if the packets arrive out-of-order.

Only in adaptive mode?

Note, that not video frames need to be re-ordered. Instead RTP packets may need to be re-ordered

Yes, and that's not the point, let's make this example:

  • TS: 1, SEQ: 1
  • TS: 1, SEQ: 2
  • TS: 2, SEQ: 31 <- out of order
    ....
  • TS: 1, SEQ: 29
  • TS: 1, SEQ: 30

If video decoder consumes chunks of packets from jbuf the frame can be corrupted right? You have to wait at least one frame of packets to ensure all packets in correct order before process within video decoder. For audio it's easy because 1 Frame = 1 Packet.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

Only in adaptive mode?

Yes, only adaptive mode. Because the computation of the ratio and thus wish size is wrong.

You have to wait at least one frame of packets ...

I think I know what you want and that it is enough to wait until:

  • we get a packet where TS increases to t+1, and
  • the buffered packets with timestamp t are a sequence where no sequence number is missing.

We could add this behavior in a following PR.

Edit:
For both audio and video it should also be possible to have a zero latency jbuf in adaptive mode if the network is perfect. This means a jbuf_put() - jbuf_get() combination should pass through the same packet. In baresip stream.c the get follows immediately the put.

Of course only if jbuf min is configured to zero.

As soon as one out-of-order case is detected the jbuf could switch from "zero latency" to the "frame complete" mode.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

Low latency video is a necessary requirement to reach lip synchronous and low latency audio+video conversation.

@sreimers
Copy link
Member

You always have to wait for a frame to complete before passing to the video decoder. I would prefer not to hold packets within the video decode module itself (avcodec uses an mbuf for this) until a frame is completely received, so jbuf can handle any out-of-order packets, at least for that frame, this adds no extra latency. jbuf can be emptied at once.

For real lip sync you need rtcp SR and calculate playout buffers to keep audio and video in sync.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

Okay. In order to decide that a frame with timestamp t is complete

  • jbuf has to wait until the first packet with timestamp t+1 is received, or
  • is it possible to wait for the marker bit (for video).

@sreimers
Copy link
Member

  • jbuf has to wait until packet with timestamp t+1 is received, or
  • is it possible to wait for the marker bit (for video).

We should check both, thats what pjsip does, as mentioned earlier:

    /* Quickly see if there may be a full picture in the jitter buffer, and
     * decode them if so. More thorough check will be done in decode_frame().
     */
    ts_diff = pj_ntohl(hdr->ts) - stream->dec_frame.timestamp.u32.lo;
    if (ts_diff != 0 || hdr->m) {

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 21, 2023

Should I add this in this PR?

@sreimers
Copy link
Member

Currently this PR breaks fixed buffer video frame handling, since it reverts it. See this example:

You want define a min. latency for jitter buffer and audio with at least 100ms, so you know one packet represents 20ms you can configure min. 5 packets. But if you want to archive the same for video you can't specify min packets, since video frames are fragmented. So you have to count/specify frames instead.

@sreimers
Copy link
Member

The jbuf should not differ between audio and video. And observing the timestamp currently has no other application and can be ignored.

That's not true, see pjsip as mentioned or webrtc modules/video_coding/packet_buffer.cc. A good jitter buffer for video needs frame handling.

@sreimers
Copy link
Member

jbuf: trace data for plot
Can you make a separate PR for this and instead DEBUG_LEVEL > 6, I would prefer #ifdef RE_JBUF_TRACE to enable explicit.

@sreimers
Copy link
Member

Another more detailed explanation why video frame handling is so useful in jbuf (besides jitter timing calculations):

If you only handle packets, the decoder pulls packets from jbuf without frame knowledge, so some packets are in the decode buffer now and other packets in jbuf. if a out-of-order packet arrives that is is already consumed/cached by decode, jbuf detects a late packet and it's lost. So a single out-of-order packet will result in the loss of an entire frame, which can be prevented if all the packets that are part of a frame are kept in jbuf.

I added a PR with better average ratio calculation (inspired by webrtc code): #962

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 25, 2023

If you only handle packets, the decoder pulls packets from jbuf without frame knowledge, so some packets are in the decode buffer now and other packets in jbuf. if a out-of-order packet arrives that is is already consumed/cached by decode, jbuf detects a late packet and it's lost. So a single out-of-order packet will result in the loss of an entire frame, which can be prevented if all the packets that are part of a frame are kept in jbuf.

My argument was that if jbuf is large enough to re-order all out-of-order packets, then also the video frames will be correct. Anyway, I'll try to implement a frame counter and a "frame complete" check.

I added a PR with better average ratio calculation (inspired by webrtc code): #962

I will have a look tomorrow.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 26, 2023

I try to implement a frame processing that guarantees that frames are complete. For now I convert this to draft.

@cspiel1 cspiel1 marked this pull request as draft September 26, 2023 10:28
@sreimers
Copy link
Member

I'm not fully understand why a complete new solution is needed? Can you explain, before putting to much work in to this.

@cspiel1
Copy link
Collaborator Author

cspiel1 commented Sep 26, 2023

The field wish was not meant to ensure that frames are complete. The adaptive mode also was not meant to deal with frames. If we ensure that frames are complete and in order, then maybe the adaptive mode will become obsolete. That's what I guess. But I have to make a proof of concept first.

I saw that the frame/packet ratio is not a good measure to adjust the wish size, even if we compute a moving average.

Further I would avoid linear effort in jbuf_put() and jbuf_get() in order to check if a frame is complete.

I close this and will re-open a draft for discussion.

@cspiel1 cspiel1 closed this Sep 26, 2023
@sreimers
Copy link
Member

I think this works best if video decode is pulling by time callback (like a audio source/destination) and frame. At the moment it's packet/frame/push driven.

@cspiel1 cspiel1 deleted the jbuf_only_rtp_packets branch September 27, 2023 09:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants