
Usage in AV applications for measuring AV desynchronization #46

Open
gizahNL opened this issue Aug 16, 2021 · 11 comments

@gizahNL

gizahNL commented Aug 16, 2021

Hi!

I'm currently looking into writing an application that encodes a timestamp into a sound signal for the purpose of measuring desynchronization between audio and video (I already transmit data in the video by writing a bit pattern into the video pixels, modifying their luminance).
For that purpose it's important that the library give me the exact timestamp of the start of the preamble's reception. Is that doable?

@ggerganov
Owner

ggerganov commented Aug 16, 2021

It should be doable:

@gizahNL
Author

gizahNL commented Aug 16, 2021

> It should be doable:

I intend to do this via the C library (the application I have is written in C): sending a struct containing the channel number and the timestamp at the time of sending, and then pushing the samples generated by the library out.

On the receiving end I'd send all received samples through the library as well. For correct timing it would be crucial to get the timestamp at which the library receives the first tone of the preamble sequence. Is it possible to extract the timestamp of the preamble's first reception? That way any processing delay can be calculated by subtracting the time of reception of the struct from the time of reception of the preamble. Since this is a live stream there is likely always a processing delay, because the transmitted sample count can vary depending on the codec and muxing strategy.

@ggerganov
Owner

The current interface of ggwave does not provide a mechanism to get the exact time of the beginning of the transmission.

Still, you can estimate it in the following way:

  • when ggwave_decode returns a non-zero result, assume that this is the exact end of the transmission
  • compute the length of the received message
  • subtract the computed length from the current timestamp

This will give you an approximate timestamp of the first tone, within +/- 100 ms of the real timestamp.
If this type of precision is enough for your application, I can give you more details on how to make the calculation.
I can probably provide a simple example program in C to demonstrate how to compute it.
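
A minimal sketch of that estimate in C, assuming a 48 kHz input rate. It leans on the fact (used later in this thread) that calling ggwave_encode with query = 2 returns the waveform length in samples, so re-encoding the decoded payload approximates the length of what was just received. The helper name, the protocol choice, and the payload buffer size are made up for illustration:

#include <stddef.h>
#include <sys/time.h>
#include "ggwave/ggwave.h"

// hypothetical helper: returns the estimated wall-clock time (in seconds) at
// which the transmission started, or a negative value if nothing was decoded
static double estimate_tx_start(ggwave_Instance instance,
                                const float * samples, int nSamples) {
    char payload[256];
    const int n = ggwave_decode(instance, (const char *) samples,
                                nSamples*(int) sizeof(float), payload);
    if (n <= 0) return -1.0; // no complete message in the buffer yet

    // query mode (query == 2): returns the number of samples that encoding
    // this payload would produce, i.e. an estimate of the received length
    const int nTx = ggwave_encode(instance, payload, n,
                                  GGWAVE_TX_PROTOCOL_ULTRASOUND_FAST, 25,
                                  NULL, 2);

    struct timeval tv;
    gettimeofday(&tv, NULL);
    const double now = tv.tv_sec + 1e-6*tv.tv_usec;

    // assume decoding succeeded right at the end of the transmission
    return now - (double) nTx/48000.0; // +/- ~100 ms, as noted above
}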

@gizahNL
Author

gizahNL commented Sep 26, 2021

Coming back to this: I have now implemented code that is able to signal timestamps encoded in our audio stream and correlate them with the timestamps encoded in the video pixels.
Unfortunately the accuracy is still below what I'd like for this purpose. The length is a bit off as well, unfortunately; I guess ggwave counts some silent samples in its size calculation?

Is it feasible, when decoding, to get the offset in samples from the end of the buffer back to the start of the signal, as well as the number of unused samples trailing in the provided buffer? I believe that should bring me closer to ~1 ms accuracy.

@gizahNL
Author

gizahNL commented Sep 26, 2021

To give an example: for the payload size I currently transmit, ggwave_encode returns a sample count of 73728, while (from testing) the actual number of samples after which the decode function starts to return data is around 68832.

@ggerganov
Owner

The current format of the audio payload when using variable-length messages is like this:

  • assuming 48 kHz sample rate is used
  • 1 frame contains 1024 samples
  • the first 16 frames (i.e. 16384 samples) are the "begin" sound marker
  • next you have F frames of data (see below for how F is determined)
  • in the end there are another 16 frames of the "end" sound marker

To determine F:

  • suppose you have n bytes of actual data that you want to transmit
  • the actual number of bytes that are transmitted is N = 3 + n + E(n):
    • 3 bytes are used to encode the length of the message (1 byte for length + 2 for ECC)
    • E(n) is the number of ECC bytes used for length n. See the function getECCBytesForLength(int len)
  • F = ((N + bytesPerTx - 1)/bytesPerTx)*framesPerTx

In your case, you have 73728 samples in total, which corresponds to 73728 / 1024 = 72 frames.
Accounting for the 16 frames of the "begin" sound marker and another 16 frames for the "end" sound marker, this leaves F = 72 - 32 = 40 frames of data.
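
The same arithmetic as a short C sketch; the payload size n is an arbitrary example here, and E stands in for the value of getECCBytesForLength(n), which has to be taken from the ggwave source:

#include <stdio.h>

int main(void) {
    const int samplesPerFrame = 1024;
    const int markerFrames    = 16; // kDefaultMarkerFrames
    const int bytesPerTx      = 1;  // protocol-dependent
    const int framesPerTx     = 6;  // protocol-dependent

    const int n = 16;               // payload bytes (example value)
    const int E = 8;                // assumption: = getECCBytesForLength(n)
    const int N = 3 + n + E;        // 3 extra bytes encode the length (1 + 2 ECC)
    const int F = ((N + bytesPerTx - 1)/bytesPerTx)*framesPerTx;

    // "begin" marker + data + "end" marker
    const int total = (markerFrames + F + markerFrames)*samplesPerFrame;
    printf("F = %d frames, total = %d samples\n", F, total);
    return 0;
}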

When you transmit such a message, the receiver must receive the first 16 frames of the "begin" sound marker together with the F frames of data before it can possibly decode the message. But that is not all.

The uncertainty comes during the receiving of the "end" marker. It cannot be predicted how many "end" frames the receiver needs to receive in order to "detect" the "end" marker. In perfect conditions, ggwave could send only 1 frame of the "end" marker and it would be enough for the receiver to detect it. However, due to noise and other imperfections, we send 16 frames and hope that the receiver will pick up at least 1 of them. The problem is that we don't know which one, so there is an uncertainty of about ~16 frames, or about 350 ms, in the detection process.

This explains why the decode function starts to return data earlier than expected.
It also explains why it can take a different number of samples each time.

All this can be improved by making the decode function report an estimate of the number of "end" frames that have been received in the current decoding. But this requires modifications to the decoding part.

One thing you can try for your use case is to reduce the number of sound marker frames from 16 to 4.
To do that, you need to edit the constant kDefaultMarkerFrames and recompile:

static constexpr auto kDefaultMarkerFrames = 16;

Reducing this number will degrade the transmission reliability, but it should improve the precision of your detection approach.

@gizahNL
Author

gizahNL commented Sep 27, 2021

Do your calculations also hold true for fixed-length data? We're setting the payloadLength parameter on init for both receiver and sender. Since we just send a struct (7 bytes of string to memcmp against, since we sometimes get garbage data, two 32-bit values for the timecode, and one uint8 for the channel number), our length is known beforehand.

Trading reliability for accuracy seems like a good trade-off in our use case, so I'll explore that route :)

Thanks so far for all the help and the great library!

@ggerganov
Owner

The fixed-length mode is a bit different, although the logic is simpler.
In this mode:

  • there are no sound marker frames at all.
  • N = n + E(n)
  • F is again given by F = ((N + bytesPerTx - 1)/bytesPerTx)*framesPerTx
  • so for n = 7, which is your case, we have E(7) = 4, hence N = 7 + 4 = 11 bytes

So to get a total of 73728 samples, you are most likely using framesPerTx = 6 and bytesPerTx = 1, although the math does not completely check out, which is strange:

  • let me know which Tx protocol you are using
  • make sure you are not actually sending 8 bytes instead of 7

But regardless of this, the decode function will again give you the result at different times, depending on noise and background sound. The reason is that in fixed-length mode the receiver continuously tries to decode the last F frames of audio data. This is in contrast to the variable-length mode, where the receiver just listens for the "begin" and "end" markers and then decodes only the data in between.

So what you observe is that once the receiver has received the first n + E(n)/2 bytes, it will sometimes already be able to decode the original data thanks to the error-correction mechanism. But of course, sometimes it will need more ECC bytes, and hence the successful decoding can occur at different times.

I'm not really sure how to improve the receive timestamp accuracy for this transfer mode.
I believe there should currently be a detection uncertainty of about U = ((E(n)/2 + bytesPerTx - 1)/bytesPerTx)*framesPerTx frames, which for n = 7, bytesPerTx = 1 and framesPerTx = 6 gives U = 12 frames, i.e. 12288 samples.
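
Putting those numbers together as a sketch (E(7) = 4 is taken from the earlier comment; the protocol constants are the guessed ones):

#include <stdio.h>

int main(void) {
    const int samplesPerFrame = 1024;
    const int bytesPerTx      = 1; // guessed above
    const int framesPerTx     = 6; // guessed above

    const int n = 7;               // fixed payload length
    const int E = 4;               // E(7) = 4
    const int N = n + E;           // no length header in fixed-length mode

    const int F = ((N + bytesPerTx - 1)/bytesPerTx)*framesPerTx;
    const int U = ((E/2 + bytesPerTx - 1)/bytesPerTx)*framesPerTx;

    printf("F = %d frames (%d samples)\n", F, F*samplesPerFrame);
    printf("uncertainty U = %d frames (%d samples)\n", U, U*samplesPerFrame);
    return 0;
}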

@gizahNL
Author

gizahNL commented Sep 28, 2021

Our size is a bit bigger than 8 bytes ;) It's 7 + 4 + 4 + 1.

typedef struct __attribute__ ((packed)) {
//    int64_t  pts; //not needed I think
//    char[8] check;
    char check[CANARY_STRING_S]; // 7 bytes
    uint32_t tv_sec;
    uint32_t tv_usec;
    uint8_t  channel_no;
} latency_signal_t;

    memcpy(signal_tx.check, CANARY_STRING, CANARY_STRING_S);
    // note: fields not set below are left uninitialized; starting from
    // ggwave_getDefaultParameters() would be safer
    ggwave_Parameters gParam;
    gParam.payloadLength = sizeof(signal_tx);
    gParam.sampleRateInp = 48000;
    gParam.sampleRateOut = 48000;
    gParam.samplesPerFrame = 1024;
    gParam.soundMarkerThreshold = 3.0f;
    gParam.sampleFormatInp = GGWAVE_SAMPLE_FORMAT_F32;
    gParam.sampleFormatOut = GGWAVE_SAMPLE_FORMAT_F32;

    ggwave_Instance ggwave = ggwave_init(gParam);
    // one instance for each audio track we might decode from
    ggwave_Instance ggwave_rx[16];
    for (int i = 0; i < 16; i++)
        ggwave_rx[i] = ggwave_init(gParam);

    size_t ggwave_bufsize = ggwave_encode(ggwave, &signal_tx, sizeof(signal_tx), GGWAVE_TX_PROTOCOL_ULTRASOUND_FAST, 25, NULL, 2) * sizeof(float);

    int64_t ggwave_latency = av_rescale_q((int64_t)(ggwave_bufsize/4), (AVRational){1, 48000}, (AVRational){1, 1000});
    //this is still inaccurate, so set it statically to 1067 ms, which is the common occurrence for the above settings
    //adding the 7 check bytes added another max latency of 367 ms
    ggwave_latency = 1067 + 367;//_NORMAL
    ggwave_latency = 967;//_FAST
    //ggwave_latency = 467;//_FASTEST

The sample count is ggwave_bufsize/4.

There is a ~60 ms variance in the latency when we calculate it. I'm assuming part of it is due to the effects you've described, and part is due to "unaligned" reads: our audio frames come in with sample counts matching the frame rate, i.e. 1920 for PAL (48000/25; not always constant, by the way, as some formats have a switching cadence), while audio codecs tend to have different on-the-wire frame sizes, so things might not line up (i.e. there is non-ggwave data at the front/back) when we call decode.
Hence I think that to get higher accuracy the library would need to report exactly how many samples have been consumed and how many unconsumed samples are left.
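
A hypothetical shape for such an interface, just to illustrate the ask (this is not the current ggwave API; all names here are made up):

// hypothetical extension of ggwave_decode -- NOT part of the library today
typedef struct {
    int samplesConsumed; // samples from the buffer start up to the end of the signal
    int samplesTrailing; // unused samples left at the end of the buffer
    int signalLength;    // length of the detected signal itself, in samples
} ggwave_DecodeInfo;

int ggwave_decode_ex(ggwave_Instance instance,
                     const char * dataBuffer, int dataSize,
                     char * outputBuffer, ggwave_DecodeInfo * info);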

@ggerganov
Owner

ggerganov commented Sep 30, 2021

Thanks for the info.
I am still not sure how to reduce this 60-ish ms variance. The only idea I have is to compute the audio power within the ultrasound frequencies and detect when it jumps. This will give you the starting point of the transmission, and from there you can compute how many samples have been consumed.

To compute the signal power we can try computing FFTs of 256 samples, which should give ~5 ms precision.
Maybe it is worth giving this a try to see how it works.
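
A sketch of that idea in C (not part of ggwave): a Goertzel filter stands in for the FFT bin, and the 18.5 kHz center frequency and jump threshold are assumptions that would need tuning:

#include <math.h>

// power of a single frequency bin over a block of samples (Goertzel algorithm)
static float bin_power(const float * x, int n, float freq, float sampleRate) {
    const float w = 2.0f*(float) M_PI*freq/sampleRate;
    const float coeff = 2.0f*cosf(w);
    float s1 = 0.0f, s2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float s0 = x[i] + coeff*s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    return s1*s1 + s2*s2 - coeff*s1*s2;
}

// returns the sample offset of the first 256-sample block whose ultrasound
// power jumps above the threshold, or -1 if no jump is found;
// 256 samples at 48 kHz gives the ~5 ms granularity mentioned above
static int find_tx_start(const float * samples, int nSamples) {
    const int   block     = 256;
    const float jumpRatio = 10.0f;    // assumption: tune against the noise floor
    const float freq      = 18500.0f; // assumption: a bin inside the Tx band
    float prev = 0.0f;
    for (int i = 0; i + block <= nSamples; i += block) {
        const float p = bin_power(samples + i, block, freq, 48000.0f);
        if (prev > 0.0f && p > jumpRatio*prev) return i;
        prev = p;
    }
    return -1;
}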

@vpalmisano
Contributor

Hi, I'm trying to use your library for measuring the audio delay in a WebRTC communication.
I see that the library emits this debug log when the end marker is actually received:

ggprintf("Received end marker. Frames left = %d, recorded = %d\n", m_rx.framesLeftToRecord, m_rx.recvDuration_frames);

Do you think we could expose this event to the JS application, providing the m_rx.recvDuration_frames value? It should contain the exact number of frames that needs to be subtracted from the arrival timestamp. Thanks!
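
For reference, the correction that value would enable, sketched in C (recvDurationFrames stands in for the reported m_rx.recvDuration_frames):

// arrivalTime: wall-clock time (in seconds) at which the end marker was detected
// returns the estimated wall-clock time at which the transmission started
static double tx_start_time(double arrivalTime, int recvDurationFrames,
                            int samplesPerFrame, double sampleRate) {
    return arrivalTime - (double) recvDurationFrames*samplesPerFrame/sampleRate;
}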
