Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BAP playback to TWS headset is desynchronized #515

Closed
pv opened this issue May 8, 2023 · 11 comments
Closed

BAP playback to TWS headset is desynchronized #515

pv opened this issue May 8, 2023 · 11 comments

Comments

@pv
Copy link
Contributor

pv commented May 8, 2023

I'm seeing BAP playback to a TWS headset (Samsung Galaxy Buds2 Pro + Intel AX210) left and right earbuds to be desynchronized immediately at playback start. Looking at audio recorded by external mic, it seems the playback on left and right earbuds is off by 10ms.

This occurs sometimes, quite often when the devices connect after taking them out of their case. Disconnect + reconnect fixes the issue, but sometimes it needs to be done more than once to get synchronized playback.

Looking at the HCI traces, I did not see the sound server (Pipewire) doing something strange, and BlueZ/kernel problem also seems unlikely, but uploading here mainly to have some place to refer to for the files.

Headset: Samsung Galaxy Buds2 Pro
Controller: Intel AX210
Bluez: 6bafe1c
Kernel: 6.3.1 + https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/commit/?id=d4b20f0b8491bfb3238ebf7df4b13243189620d2
Pipewire: https://gitlab.freedesktop.org/pvir/pipewire/-/commit/068a132203a9896ff69c0f561406355c86fbdcd5 (master/0.3.70 is also same, this is just to make absolutely sure exactly same data is written to both CIS)

This has occurred also on all previous versions of the software/firmware I've tested, also on bluetooth-next kernel (e6e576ec4e728b201a8013).

Bluetooth: hci1: Minimum firmware build 1 week 10 2014
Bluetooth: hci1: Bootloader timestamp 2019.40 buildtype 1 build 38
Bluetooth: hci1: Found device firmware: intel/ibt-0041-0041.sfi
Bluetooth: hci1: Boot Address: 0x100800
Bluetooth: hci1: Firmware Version: 98-13.23
Bluetooth: hci1: Waiting for firmware download to complete
Bluetooth: hci1: Firmware loaded in 1599630 usecs
Bluetooth: hci1: Waiting for device to boot
Bluetooth: hci1: Device booted in 25438 usecs
Bluetooth: hci1: Malformed MSFT vendor event: 0x02
Bluetooth: hci1: Found Intel DDC parameters: intel/ibt-0041-0041.ddc
Bluetooth: hci1: Applying Intel DDC parameters completed
Bluetooth: hci1: Firmware timestamp 2023.13 buildtype 1 build 62562

Input signal: ax210-samsung-gbuds2pro-tws-desync-3-input.flac.log

HCI trace 1 (desynchronized): ax210-samsung-gbuds2pro-tws-desync-3.log
External mic (close to both earbuds) recording 1: ax210-samsung-gbuds2pro-tws-desync-3.flac.log

HCI trace 2 (desynchronized): ax210-samsung-gbuds2pro-tws-desync-4.log
External mic (close to both earbuds) recording 2: ax210-samsung-gbuds2pro-tws-desync-4.flac.log

You can hear in the recordings that the blip sounds are slightly separated, and it is obvious in stereo. Inspecting the waveforms one can see there's a 10ms offset.

For comparsion (I don't see any suspicious difference however in the traces however):

HCI trace 3 (synchronization ok): ax210-samsung-gbuds2pro-tws-ok.log

@Vudentz
Copy link
Contributor

Vudentz commented Jun 21, 2023

@pv Im reading the following part of the spec and it looks like we need to sync the sequence number somehow:

BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 6, Part G
page 3041

SDUs sent to the upper layer by the ISOAL shall be given a sequence number
which is initialized to 0 when the CIS or BIS is created. SDUs received by the
ISOAL from the upper layer shall be given a sequence number which is
initialized to 0 when the CIS or BIS is created. The upper layer may
synchronize its sequence number with the sequence number in the ISOAL
once the Datapath is configured and the link is established.
The sequence number shall be incremented for each SDU_Interval, whether or
not an SDU was received from or sent to the upper layer.

That said normally we would use HCI_LE_Read_ISO_TX_Sync to sync up the sequence number but I don't think that is implemented in AX210 firmware, note that we need to increment the sequence number even when there is no data.

@pv
Copy link
Contributor Author

pv commented Jun 21, 2023

The user application probably could set some running sequence number in a CMSG, and we could then use those to align the packets for different CIS in same CIG on kernel side, and also assign new sequence numbers for the controller. Getting it right is probably easier to do on kernel side.

I'm not sure if AX210 does something with the numbers. IIRC I tried aligning the sequence numbers in the two CIS (passing them from userspace in cmsg) earlier, but this had no effect. The timestamps did something, but not possible to put correct numbers to get audio without the TX Sync. Maybe need to try this again, it was some time ago.

@Vudentz
Copy link
Contributor

Vudentz commented Jun 21, 2023

@pv Id leave this for the kernel to resolve since it might revolve into HCI traffic, and in case we move the socket around the stack so it is probably not a good idea to use CMSG for something so fundamental as sequence number, anyway if you tried that and it didn't make any difference, then maybe it is not related to that.

@pv
Copy link
Contributor Author

pv commented Jun 21, 2023

Yeah, for the CMSG I mean that it would be useful to have some mechanism for indicating to the kernel which packets correspond to the same time in different CIS, to make it easier to synchronize the streams on socket level. Running sequence numbering for the packets (unrelated to whatever kernel decides to send to adapter) same in each CIS could be one way.

@Vudentz
Copy link
Contributor

Vudentz commented Jun 22, 2023

@pv can you give it a try with following patch to see if it makes a difference:

https://patchwork.kernel.org/project/bluetooth/patch/20230621205230.1006939-1-luiz.dentz@gmail.com/

@pv
Copy link
Contributor Author

pv commented Jun 23, 2023

Doesn't seem to change it. BlueZ (except transport.send) and Pipewire I think do not make use of the getsockopt ISO QoS latency/interval values obtained after connection, but use the ASE QoS values.

I guess a suspicion was that user application should stuff N>0 extra packets to the sockets immediately when playback starts, so that kernel+adapter do not run out of data to send if there are timing glitches. I tried this earlier with different numbers of extra frames, but didn't seem to help with the desync.

Writing many initial packets also seems to make it a bit worse. The kernel seems to do this:

bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #252 [hci0] 53.405433
bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #253 [hci0] 53.405848
bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #254 [hci0] 53.406037
bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #255 [hci0] 53.406211
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #256 [hci0] 53.406409
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #257 [hci0] 53.406624
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #258 [hci0] 53.406819
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #259 [hci0] 53.407072

even though the socket writes are in interleaved CIS1 CIS2 CIS1 CIS2 ... order. Probably there's no ordering guarantee. The ordering can also swap spontaneously as kernel sends ... CIS1 CIS2 CIS2 CIS1 ... or CIS1 CIS1 CIS2 CIS2. However, I have also HCI traces with fully CIS1 CIS2 CIS1 CIS2 ... interleaved traffic that but desynchronized playback, and traces with interleaving order swaps giving synchronized playback. So it's not clear this has something to do with it.

@Vudentz
Copy link
Contributor

Vudentz commented Jun 23, 2023

Doesn't seem to change it. BlueZ (except transport.send) and Pipewire I think do not make use of the getsockopt ISO QoS latency/interval values obtained after connection, but use the ASE QoS values.

We probably need to fix that though so the CIS Established values are used since that afaik is what the controller expects.

I guess a suspicion was that user application should stuff N>0 extra packets to the sockets immediately when playback starts, so that kernel+adapter do not run out of data to send if there are timing glitches. I tried this earlier with different numbers of extra frames, but didn't seem to help with the desync.

Writing many initial packets also seems to make it a bit worse. The kernel seems to do this:

bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #252 [hci0] 53.405433
bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #253 [hci0] 53.405848
bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #254 [hci0] 53.406037
bluetoothd[5840]: < ISO Data TX: Handle 2560 flags 0x02 dlen 124                                        #255 [hci0] 53.406211
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #256 [hci0] 53.406409
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #257 [hci0] 53.406624
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #258 [hci0] 53.406819
bluetoothd[5840]: < ISO Data TX: Handle 2561 flags 0x02 dlen 124                                        #259 [hci0] 53.407072

Well these seems to be sent at the same iteration though, Im thinking on doing some changes to the how we schedule since the current scheduler don't take into account the timestamps of the packets.

even though the socket writes are in interleaved CIS1 CIS2 CIS1 CIS2 ... order. Probably there's no ordering guarantee. The ordering can also swap spontaneously as kernel sends ... CIS1 CIS2 CIS2 CIS1 ... or CIS1 CIS1 CIS2 CIS2. However, I have also HCI traces with fully CIS1 CIS2 CIS1 CIS2 ... interleaved traffic that but desynchronized playback, and traces with interleaving order swaps giving synchronized playback. So it's not clear this has something to do with it.

This is because of the current queuing discipline, it is basically a priority based one which in this case probably means there are 2 channels with the same priority so they will get each half of the buffers/credits at each scheduler iteration, which is not that great for timing sensitive packets.

@Vudentz
Copy link
Contributor

Vudentz commented Jul 28, 2023

@pv Im using these files to with new btmon -a/--analyze code:

  Found LE-ISO connection with handle 2560
        Address: 00:00:00:00:00:00 (OUI 00-00-00)
        Connection setup missing
        1878 RX packets
        2102 TX packets
        2100 TX completed packets
        9-53 msec (~13 msec) TX Latency
        124-124 octets (~124 octets) TX packet size

                                                                               
  10000 +-+----------------------------------------------------------------+   
         +|                                                                |   
         +|                                                Packets +-----+ |   
         +|         +                                                      |   
   1000 +-|         |                                                      |   
         +|         |                                                      |   
         +|         |                                                      |   
         +|         |                                                      |   
    100 +-|         |                                                      |   
         +|         |                         +                            |   
         +|         |                         |                            |   
         +|         |            +            |                            |   
          |    +   +|            |            |            +               |   
     10 +-|    |   ||            |            |            |            +  |   
         +|    |   ||            |            |            |            |  |   
         +|    |   ||           +|       +    |          ++|      +     |  |   
          |    |   ||           ||       |    |          |||      |     |  |   
      1 +-|    |   ||           ||       |    |          |||      |     |  |   
         ++----------------------------------------------------------------+   
          +      +     +      +     +      +     +      +     +      +     +   
          5      10    15     20    25     30    35     40    45     50    55  
                                    Latency (ms)
  Found LE-ISO connection with handle 2561
        Address: 00:00:00:00:00:00 (OUI 00-00-00)
        Connection setup missing
        1896 RX packets
        2102 TX packets
        2100 TX completed packets
        3-33 msec (~13 msec) TX Latency
        124-124 octets (~124 octets) TX packet size

                                                                               
  10000 +-+----------------------------------------------------------------+   
         +|                                                                |   
         +|                                                Packets +-----+ |   
         +|                       +                                        |   
   1000 +-|                       |                                        |   
         +|                       |                                        |   
         +|                       |                                        |   
         +|                       |                                        |   
    100 +-|                       |                                        |   
         +|     +                 |                                        |   
         +|     |                 |                  +                     |   
         +|     |                 |                  |                     |   
          |     |          +    + |                  |                     |   
     10 +-|     |          |    | |                  |                     |   
         +|     |          |    | |                + |                     |   
         +|     |          |    | |                | |                 +   |   
          |     |          |    | |                | |                 |   |   
      1 +-|     |          |    | |                | |                 |   |   
         ++----------------------------------------------------------------+   
          +        +         +        +        +        +         +        +   
          0        5         10       15       20       25        30       35  
                                    Latency (ms)

No wonder they get out of sync, although the vast majority are around 12-13 ms there are quite a few that are spread.

@Vudentz
Copy link
Contributor

Vudentz commented Jul 28, 2023

I suspect this is when things start to derail:

> HCI Event: Number of Completed Packets (0x13) plen 9
        Num handles: 2
        Handle: 2560 Address: 00:00:00:00:00:00 (OUI 00-00-00)
        Count: 3
        Latency: 53 msec (9-53 msec ~43 msec)
        Latency: 43 msec (9-53 msec ~43 msec)
        Latency: 33 msec (9-53 msec ~38 msec)
        Handle: 2561 Address: 00:00:00:00:00:00 (OUI 00-00-00)
        Count: 3
        Latency: 23 msec (3-33 msec ~13 msec)
        Latency: 13 msec (3-33 msec ~13 msec)
        Latency: 3 msec (3-33 msec ~8 msec)

I will probably need to add the frame number to make it clearer which are these responses.

@pv
Copy link
Contributor Author

pv commented Jul 29, 2023

From what I recall, and can still reproduce (on AX210 Firmware timestamp 2023.24 buildtype 1 build 67068, Found device firmware: intel/ibt-0041-0041.sfi, Firmware Version: 252-24.23) is that the playback sync does not "slip" during playback and there are no sound glitches. Instead, the left/right streams are desynchronized immediately at playback start, so if it is scheduling issue then culprit is probably close to the start of the streaming.

If user stops writing to the socket for some tens of ms, and then resumes sending packets, this can change the synchronization (either make it synchronized, or make it desynchronized).

pv added a commit to pv/linux that referenced this issue Feb 7, 2024
ISO streams in the same group should be synchronized, this is required
e.g. by stereo audio playback in TWS headsets.  Currently the only
synchronization mechanism we use is that the user has to supply data for
each stream at the right time and in right order, and then we send them
to the controller ASAP without explicitly trying to keep streams in sync
and hope for the best.

However, it is observed e.g. with Intel AX210 that rather often a packet
for one of the streams does not get sent but is left in the queue and
postponed, even if submitted to controller queue at the same time (as
far as able). After this, that ISO stream latency is persistently one
packet larger relative to the other streams in the group, and e.g. TWS
stereo audio is ruined.  User sending packets to the ISO sockets cannot
know this occurred, nor can it easily correct for it.

Fix the problem by adding an explicit stream synchronization mechanism
to ISO sockets, and handling stream synchronization in packet
scheduling.

The user may tag each packet sent to ISO socket via CMSG with a
monotonically increasing sequence number.  The kernel shall ensure that
SDU timings within ISO group are such that packets with same CMSG
sequence number are aligned.

The implementation here does this by delaying sending some packets until
the streams are aligned.  It will also wait for controller queue to
flush sufficiently, if it appears that streams are out of sync on the
controller side.  See the code for explanation.

This patch fixes the observed TWS desynchronization issue (currently not
completely, see below), if the sound server is modified to add the
packet seq,
https://gitlab.freedesktop.org/pvir/pipewire/-/commits/iso-seqnum

I tried various other things to try to fix this (FIFO scheduling etc.),
without much success. Requiring conn->sent alignment before sending more
stuff to the controller seems necessary.

There are a few things to think about, hence RFC:

- Any better ideas?

- The conn->sent alignment check probably should be somehow refined.  It
  seems to be enough with low transport latency, but with larger
  transport latencies one can still get streams out of sync,
  characterized by the controller not reporting Number of Completed
  packets for all streams in one go, but sending completed packet events
  for some of the streams only a bit later.

- This sort of scheduling thing that depends on controller queue or
  clock IMO needs to be done in kernel, user should just tell the kernel
  how stuff is supposed to be scheduled.

- However, not clear if packet sequence number is the best way to for
  this sort of thing, or if user should specify some sort of timestamps
  or other information instead.

  E.g. we might change this to use SDU timestamps (if they can be used),
  but user would communicate the scheduling info still the same way.

- Ideally we probably should use SDU timestamps to explicitly tell the
  controller which packets go to same interval, but at least when last
  tried AX210 didn't support HCI_LE_Read_ISO_TX_Sync, and it appears
  clock synchronization is required which cannot(?) be obtained without
  it.

- This patch can make latency to grow on transmission glitch, if it
  needs to wait for controller queue to drain so that streams can be
  aligned again. For proper flow control especially in audio use cases,
  but also more generally, would be better to have for user a way to
  know the number of packets in queue, in socket buf, and in controller,
  and the max.  TIOCOUTQ is similar, but does not give packet counts nor
  controller queue so is insufficient.

Link: bluez/bluez#515
Signed-off-by: Pauli Virtanen <pav@iki.fi>
pv added a commit to pv/linux that referenced this issue Feb 7, 2024
ISO streams in the same group should be synchronized, this is required
e.g. by stereo audio playback in TWS headsets.  Currently the only
synchronization mechanism we use is that the user has to supply data for
each stream at the right time and in right order, and then we send them
to the controller ASAP without explicitly trying to keep streams in sync
and hope for the best.

However, it is observed e.g. with Intel AX210 that rather often a packet
for one of the streams does not get sent but is left in the queue and
postponed, even if submitted to controller queue at the same time (as
far as able). After this, that ISO stream latency is persistently one
packet larger relative to the other streams in the group, and e.g. TWS
stereo audio is ruined.  User sending packets to the ISO sockets cannot
know this occurred, nor can it easily correct for it.

The issue can e.g. be triggered by bad transmission (e.g. cover one of
the TWS earbuds with hand, or move the device further away), which
appears to make a random change to the latency of its stream.

Fix the problem by adding an explicit stream synchronization mechanism
to ISO sockets, and handling stream synchronization in packet
scheduling.

The user may tag each packet sent to ISO socket via CMSG with a
monotonically increasing sequence number.  The kernel shall ensure that
SDU timings within ISO group are such that packets with same CMSG
sequence number are aligned.

The implementation here does this by delaying sending some packets until
the streams are aligned.  It will also wait for controller queue to
flush sufficiently, if it appears that streams are out of sync on the
controller side.  See the code for explanation.

This patch fixes the observed TWS desynchronization issue (currently not
completely, see below), if the sound server is modified to add the
packet seq,
https://gitlab.freedesktop.org/pvir/pipewire/-/commits/iso-seqnum

I tried various other things to try to fix this (FIFO scheduling,
different number of packets submitted at the same time, etc.), without
much success, and this is the first one to give results.  Requiring
conn->sent alignment before sending more stuff to the controller seemed
the important step here.

There are a few things to think about, hence RFC:

- Any better ideas?

  The spec seems vague here (if we can't set SDU timestamps), so some
  approach that's not based on guessing what the controller internally
  does could be better.

- The conn->sent alignment check probably should be somehow refined.  It
  seems to be enough with low transport latency, but with larger
  transport latencies one can still get streams out of sync,
  characterized by the controller not reporting Number of Completed
  packets for all streams in one go, but sending completed packet events
  for some of the streams only a bit later.

- This sort of scheduling thing that depends on controller queue or
  clock IMO needs to be done in kernel, user should just tell the kernel
  how stuff is supposed to be scheduled.

- However, not clear if packet sequence number is the best way to for
  this sort of thing, or if user should specify some sort of timestamps
  or other information instead.

  E.g. we might change this to use SDU timestamps (if they can be used),
  but user would communicate the scheduling info still the same way.

- Ideally we probably should use SDU timestamps to explicitly tell the
  controller which packets go to same interval, but at least when last
  tried AX210 didn't support HCI_LE_Read_ISO_TX_Sync, and it appears
  clock synchronization is required which cannot(?) be obtained without
  it.

- This patch can make latency to grow on transmission glitch, if it
  needs to wait for controller queue to drain so that streams can be
  aligned again. For proper flow control especially in audio use cases,
  but also more generally, would be better to have for user a way to
  know the number of packets in queue, in socket buf, and in controller,
  and the max.  TIOCOUTQ is similar, but does not give packet counts nor
  controller queue so is insufficient.

Link: bluez/bluez#515
Signed-off-by: Pauli Virtanen <pav@iki.fi>
github-actions bot pushed a commit to tedd-an/bluetooth-next that referenced this issue Feb 7, 2024
ISO streams in the same group should be synchronized, this is required
e.g. by stereo audio playback in TWS headsets.  Currently the only
synchronization mechanism we use is that the user has to supply data for
each stream at the right time and in right order, and then we send them
to the controller ASAP without explicitly trying to keep streams in sync
and hope for the best.

However, it is observed e.g. with Intel AX210 that rather often a packet
for one of the streams does not get sent but is left in the queue and
postponed, even if submitted to controller queue at the same time (as
far as able). After this, that ISO stream latency is persistently one
packet larger relative to the other streams in the group, and e.g. TWS
stereo audio is ruined.  User sending packets to the ISO sockets cannot
know this occurred, nor can it easily correct for it.

The issue can e.g. be triggered by bad transmission (e.g. cover one of
the TWS earbuds with hand, or move the device further away), which
appears to make a random change to the latency of its stream.

Fix the problem by adding an explicit stream synchronization mechanism
to ISO sockets, and handling stream synchronization in packet
scheduling.

The user may tag each packet sent to ISO socket via CMSG with a
monotonically increasing sequence number.  The kernel shall ensure that
SDU timings within ISO group are such that packets with same CMSG
sequence number are aligned.

The implementation here does this by delaying sending some packets until
the streams are aligned.  It will also wait for controller queue to
flush sufficiently, if it appears that streams are out of sync on the
controller side.  See the code for explanation.

This patch fixes the observed TWS desynchronization issue (currently not
completely, see below), if the sound server is modified to add the
packet seq,
https://gitlab.freedesktop.org/pvir/pipewire/-/commits/iso-seqnum

I tried various other things to try to fix this (FIFO scheduling,
different number of packets submitted at the same time, etc.), without
much success, and this is the first one to give results.  Requiring
conn->sent alignment before sending more stuff to the controller seemed
the important step here.

There are a few things to think about, hence RFC:

- Any better ideas?

  The spec seems vague here (if we can't set SDU timestamps), so some
  approach that's not based on guessing what the controller internally
  does could be better.

- The conn->sent alignment check probably should be somehow refined.  It
  seems to be enough with low transport latency, but with larger
  transport latencies one can still get streams out of sync,
  characterized by the controller not reporting Number of Completed
  packets for all streams in one go, but sending completed packet events
  for some of the streams only a bit later.

- This sort of scheduling thing that depends on controller queue or
  clock IMO needs to be done in kernel, user should just tell the kernel
  how stuff is supposed to be scheduled.

- However, not clear if packet sequence number is the best way to for
  this sort of thing, or if user should specify some sort of timestamps
  or other information instead.

  E.g. we might change this to use SDU timestamps (if they can be used),
  but user would communicate the scheduling info still the same way.

- Ideally we probably should use SDU timestamps to explicitly tell the
  controller which packets go to same interval, but at least when last
  tried AX210 didn't support HCI_LE_Read_ISO_TX_Sync, and it appears
  clock synchronization is required which cannot(?) be obtained without
  it.

- This patch can make latency to grow on transmission glitch, if it
  needs to wait for controller queue to drain so that streams can be
  aligned again. For proper flow control especially in audio use cases,
  but also more generally, would be better to have for user a way to
  know the number of packets in queue, in socket buf, and in controller,
  and the max.  TIOCOUTQ is similar, but does not give packet counts nor
  controller queue so is insufficient.

Link: bluez/bluez#515
Signed-off-by: Pauli Virtanen <pav@iki.fi>
github-actions bot pushed a commit to BluezTestBot/bluetooth-next that referenced this issue Feb 7, 2024
ISO streams in the same group should be synchronized, this is required
e.g. by stereo audio playback in TWS headsets.  Currently the only
synchronization mechanism we use is that the user has to supply data for
each stream at the right time and in right order, and then we send them
to the controller ASAP without explicitly trying to keep streams in sync
and hope for the best.

However, it is observed e.g. with Intel AX210 that rather often a packet
for one of the streams does not get sent but is left in the queue and
postponed, even if submitted to controller queue at the same time (as
far as able). After this, that ISO stream latency is persistently one
packet larger relative to the other streams in the group, and e.g. TWS
stereo audio is ruined.  User sending packets to the ISO sockets cannot
know this occurred, nor can it easily correct for it.

The issue can e.g. be triggered by bad transmission (e.g. cover one of
the TWS earbuds with hand, or move the device further away), which
appears to make a random change to the latency of its stream.

Fix the problem by adding an explicit stream synchronization mechanism
to ISO sockets, and handling stream synchronization in packet
scheduling.

The user may tag each packet sent to ISO socket via CMSG with a
monotonically increasing sequence number.  The kernel shall ensure that
SDU timings within ISO group are such that packets with same CMSG
sequence number are aligned.

The implementation here does this by delaying sending some packets until
the streams are aligned.  It will also wait for controller queue to
flush sufficiently, if it appears that streams are out of sync on the
controller side.  See the code for explanation.

This patch fixes the observed TWS desynchronization issue (currently not
completely, see below), if the sound server is modified to add the
packet seq,
https://gitlab.freedesktop.org/pvir/pipewire/-/commits/iso-seqnum

I tried various other things to try to fix this (FIFO scheduling,
different number of packets submitted at the same time, etc.), without
much success, and this is the first one to give results.  Requiring
conn->sent alignment before sending more stuff to the controller seemed
the important step here.

There are a few things to think about, hence RFC:

- Any better ideas?

  The spec seems vague here (if we can't set SDU timestamps), so some
  approach that's not based on guessing what the controller internally
  does could be better.

- The conn->sent alignment check probably should be somehow refined.  It
  seems to be enough with low transport latency, but with larger
  transport latencies one can still get streams out of sync,
  characterized by the controller not reporting Number of Completed
  packets for all streams in one go, but sending completed packet events
  for some of the streams only a bit later.

- This sort of scheduling thing that depends on controller queue or
  clock IMO needs to be done in kernel, user should just tell the kernel
  how stuff is supposed to be scheduled.

- However, not clear if packet sequence number is the best way to for
  this sort of thing, or if user should specify some sort of timestamps
  or other information instead.

  E.g. we might change this to use SDU timestamps (if they can be used),
  but user would communicate the scheduling info still the same way.

- Ideally we probably should use SDU timestamps to explicitly tell the
  controller which packets go to same interval, but at least when last
  tried AX210 didn't support HCI_LE_Read_ISO_TX_Sync, and it appears
  clock synchronization is required which cannot(?) be obtained without
  it.

- This patch can make latency to grow on transmission glitch, if it
  needs to wait for controller queue to drain so that streams can be
  aligned again. For proper flow control especially in audio use cases,
  but also more generally, would be better to have for user a way to
  know the number of packets in queue, in socket buf, and in controller,
  and the max.  TIOCOUTQ is similar, but does not give packet counts nor
  controller queue so is insufficient.

Link: bluez/bluez#515
Signed-off-by: Pauli Virtanen <pav@iki.fi>
pv added a commit to pv/linux that referenced this issue Feb 20, 2024
Applications sending ISO packets need to know latency on kernel and
controller side accurately, so that they can maintain it at a desired
target value.

Moreover, Intel AX210 (and maybe others) do not appear to successfully
use SDU timings or sequence numbers to maintain synchronization between
CIS in same CIG, and streams often stay out of sync persistently even
though SDUs for each CIS are sent to ISO sockets and to the controller
exactly at the same time.  As these controllers have broken LE Read ISO
TX Sync, it's not very clear in the spec how the synchronization is
supposed to work.

Add new ioctl SIOCBTREADTXINFO, which provides a reference time when a
packet was last sent and the length of the queue at that time.  From
this information and packet send times it records, user application can
reconstruct the total tx latency between submitting a packet to a
socket, and controller declaring it completed.

This patch does not implement support for LE Read ISO TX Sync. However,
the ioctl() is such that it shall be possible to implement it also via
LE Read ISO TX Sync.  Moreover, it may be useful for audio use cases to
implement it also for L2CAP later.

Link: bluez/bluez#515
Signed-off-by: Pauli Virtanen <pav@iki.fi>
ford-prefect pushed a commit to PipeWire/pipewire that referenced this issue Apr 12, 2024
Use TX timestamps to get accurate reading of queue length and latency on
kernel + controller side.

This is new kernel BT feature, so requires kernel with the necessary
patches, available currently only in bluetooth-next/master branch.
Enabling Poll Errqueue kernel experimental Bluetooth feature is also
required for this.

Use the latency information to mitigate controller issues where ISO
streams are desynchronized due to tx problems or spontaneously when some
packets that should have been sent are left sitting in the queue, and
transmission is off by a multiple of the ISO interval.  This state is
visible in the latency information, so if we see streams in a group have
persistently different latencies, drop packets to resynchronize them.

Also make corrections if the kernel/controller queues get too long, so
that we don't have too big latency there.

Since BlueZ watches the same socket for errors, and TX timestamps arrive
via the socket error queue, we need to set BT_POLL_ERRQUEUE in addition
to SO_TIMESTAMPING so that BlueZ doesn't think TX timestamps are errors.

Link: bluez/bluez#515
Link: https://lore.kernel.org/linux-bluetooth/cover.1710440392.git.pav@iki.fi/
Link: https://lore.kernel.org/linux-bluetooth/f57e065bb571d633f811610d273711c7047af335.1712499936.git.pav@iki.fi/
@pv
Copy link
Contributor Author

pv commented Apr 12, 2024

With the TX latency patches, this is now mitigated on application side.

@pv pv closed this as completed Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants