Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

occasional drops detected in output RTP stream #17

Closed
bondagit opened this issue Aug 2, 2020 · 32 comments
Closed

occasional drops detected in output RTP stream #17

bondagit opened this issue Aug 2, 2020 · 32 comments

Comments

@bondagit
Copy link
Owner

bondagit commented Aug 2, 2020

As reported by @jtc-dolby we have an issue with the ALSA RAVENNA driver causing occasional drops or stream realignment in output RTP stream.
It's possible to detect the problem by playing on the RAVENNA device a low frequency e.g. 100Hz sine wave and checking the resulting recording on another device.
The issue can be reproduced with both memory mapped and read/write interleaved modes.

See enclosed pictures:

drop_mmap
drop_no_mmap

According to my tests the problem doesn't show up on the recording side and it seems to affect the playback only.

After a number of tries I could associate the issue to the following debug prints logged by the driver (see lines in bold):

[432559.316518] entering mr_alsa_audio_pcm_trigger (substream name=subdevice #0 #0) ...
[432559.316521] mr_alsa_audio_pcm_trigger(Start), rate=48000 format=32 channels=2 period_size=48
[432559.316523] starting playback I/O
[432559.323410] LastTICCounter = 2636220 ui64TICCounter = 2636227 (Timer period = 69 [100us])
[432559.323427] Timer period out of range: 0 [ms]. Target period = 1

[432559.323428] next_wakeup: 432561021200000 now: 432561021145163
[432560.597372] LastTICCounter = 2637506 ui64TICCounter = 2637501 (Timer period = 9 [100us])
[432660.226216] LastTICCounter = 2737128 ui64TICCounter = 2737130 (Timer period = 18 [100us])
[432660.227235] LastTICCounter = 2737131 ui64TICCounter = 2737131 (Timer period = 0 [100us])
[432719.026270] Timer period out of range: 0 [ms]. Target period = 1
[432719.026276] next_wakeup: 432720725500000 now: 432720725422743

@bondagit bondagit changed the title occasional drops or stream re-alignment detected in output RTP stream occasional drops or stream realignment detected in output RTP stream Aug 2, 2020
@bondagit
Copy link
Owner Author

bondagit commented Sep 2, 2020

Hi @jtc-dolby,
the problem doesn't seem to show up anymore if I recompile the Linux kernel with the timer frequency set to 100Hz:
CONFIG_HZ=100
In the kernel menu config you can change the timer frequency in:
Kernel features -> Timer frequency
Can you try this and report if it works for you ?
Thanks

@bondagit
Copy link
Owner Author

Also this a frequency of 250Hz the problem doesn't show up.

@jtc-dolby
Copy link
Collaborator

jtc-dolby commented Sep 22, 2020

You said that the problem doesn't show up with 250Hz but this is the default frequency in kernel config file. If this is true then why would expect a difference in behaviour with 100Hz? I will try 100Hz anyway and see if it changes anything.

@jtc-dolby
Copy link
Collaborator

jtc-dolby commented Sep 23, 2020

I tested with 100Hz. Not good news. I got drops using aplay both with and without the -M option (Memory mapped mode). (Before I got drops only with the -M option). The drops were much more frequent when using the -M option: 43 vs 5 drops over a 45min period.

@drtechno1971
Copy link

drtechno1971 commented Sep 26, 2020

Try:
CONFIG_HZ=1000
CONFIG_NO_HZ_FULL=y
Problem with this driver it uses a webui instead of plain C console echo, but this might fair better.
Also, It might be better to write a driver extension to Alsa instead of using a 3rd party virtual driver to lower latency.

@drtechno1971
Copy link

Also you might want to try passing idle=mwait at boot time so the kernal doesn't reschedule the timing due to others using adaptive timing ticks.

bondagit added a commit that referenced this issue Jan 19, 2021
…ode on the network loopback interface.

This can be used t odebug issue #17.
@bondagit
Copy link
Owner Author

I have finally found some time to proceed on the debugging of this issue and I have just pushed on the branch test_issue_17 the files required to run a test on the loopback network interface and to verify the integrity of the wav file recorded.
In the test I configure a single 48Khz L24 stereo source and the corresponding sink to playback and record on the loopback network interface a sample wav file created artificially to enable simple detection of corruptions on the recording.
To run the test and create the recording:

./run_test.sh

To verify that the recording is correct:

cd test
c++ check.cc -o check
./check

The expected output is : "ok", otherwise the program prints out the file position where the corruption was detected.
I have run the test successfully many times on both ARM and x86_64 platforms and I never had a single invalid recording.
In my tests I have tried to use both mermory mapped and read/write interleaved modes.

@jtc-dolby
Copy link
Collaborator

jtc-dolby commented Jan 19, 2021 via email

@bondagit
Copy link
Owner Author

When I have originally opened the issue I was running the tests using the daemon to playback a 48Khz L24 stereo file and the Audinate Dante USB to perform the recording.
Apparently this device has strict latency constrains and when it receives a late packet (that is a packet exceeding the 2ms delay in this configuration) it replaces the late samples with zeros and this leads to such recordings.
I enclosed a picture showing such late packets as reported by the Dante device latency tab:
Dante latency

@bondagit
Copy link
Owner Author

hi @jtc-dolby , great to hear you again ;-)
Can you just try to run the simple test on the test_issue_17 branch ? I'd like to hear the results. Thanks.
Anyway this is not conclusive analysis and I am going to perform more device to device (not loopback) tests.

@bondagit
Copy link
Owner Author

bondagit commented Jan 19, 2021

At present I suspect that the issue is actually related to these late packets.
At the moment I am not sure why I have such late packets when I run tests on the X86_64 platform but I have none when using some ARM boards I tested here.
In my opinion the issue is not related to the daemon or the Merging driver but it might come from the Linux kernel or the network device or driver in use.
I enclosed a picture with the latency measurement of a stream sent by the daemon running on NanoPi2 with Linux nanopineo2 5.8.8-sunxi64 kenrl.
As you can see I have no late packets and the peak latency is 1.4ms
Dante latency NanoPi2
On this board the latency keeps stable also on a longer period of time and I don't have a single late packet, see picture below:
Dante latency NanoPi2 -2

@jtc-dolby
Copy link
Collaborator

jtc-dolby commented Jan 19, 2021 via email

@bondagit
Copy link
Owner Author

After additional tests on the X86 platform I can reproduce a corruption of the output file by changing the driver base timer period to 48 ("tic_frame_size_at_1fs": 48) in the daemon configuration file used by the test (test/daemon.conf). This was set to 192 in my commit.
So I repeated the tests changing the "tic_frame_size_at_1fs" parameter to 48 (1 ms), 96 (2 ms) and 192 (4 ms). The result is that while on the ARM platforms (I tested so far) the output file is always OK for all the values, on my X86_64 I can systematically reproduce a corruption with 48 only and it always works with 96 and 192. On this platform my attempts to use a low latency or preemptive kernel didn’t solve the issue.
The root cause of the issue is unknown to me but the test I implemented can be used to verify if a specific platform is able to serve or not at the specified base period. In case of a negative result (corruption of the output file) my suggestion is to switch to another.

@bondagit bondagit changed the title occasional drops or stream realignment detected in output RTP stream occasional drops detected in output RTP stream Jan 24, 2021
@jtc-dolby
Copy link
Collaborator

I have executed the 'run_test.sh' on the issue 17 branch several times without making any changes and the checker returns OK. Should I try again with tic set to 48 instead of 192?

@jtc-dolby
Copy link
Collaborator

I ran the tests again with TIC=48 three times and it was OK everytime.

@gugaguga24678
Copy link

When I have originally opened the issue I was running the tests using the daemon to playback a 48Khz L24 stereo file and the Audinate Dante USB to perform the recording.
Apparently this device has strict latency constrains and when it receives a late packet (that is a packet exceeding the 2ms delay in this configuration) it replaces the late samples with zeros and this leads to such recordings.
I enclosed a picture showing such late packets as reported by the Dante device latency tab:
Dante latency

Hi @bondagit ,is it possible to change the delay setting from 2ms to 5ms?
I have the same problem. When setting tic_frame_size_at_1fs to 48, the Dante Controller shows more than 50% of the packets are late. The drops were much more frequent when setting tic_frame_size_at_1fs to 96 or 192.
I ran the run_test.sh and check several times, the program prints out "error at position:" every time.

@bondagit
Copy link
Owner Author

bondagit commented Jan 26, 2021

Hi @bondagit ,is it possible to change the delay setting from 2ms to 5ms?

I don't know how to do this, check with Audinate.

I have the same problem. When setting tic_frame_size_at_1fs to 48, the Dante Controller shows more than 50% of the packets are late. The drops were much more frequent when setting tic_frame_size_at_1fs to 96 or 192.
I ran the run_test.sh and check several times, the program prints out "error at position:" every time.

What platform are you using for the tests ? What Linux Kernel version ?
The parameter "tic_frame_size_at_1fs" sets the driver base period. At each tick of this period the driver schedule the outbound RTP packets for sending and processes the inbound packets. So if your Dante is using a delay setting of 2ms you have to set tic_frame_size_at_1fs to 48 (1ms), any larger setting cannot work. Despite this as you can see you still have late packets and also the loopback test fails so your platform doesn't seem to support this setting. My suggestion is to try with another board.

@bondagit
Copy link
Owner Author

Hi @jtc-dolby

I ran the tests again with TIC=48 three times and it was OK everytime.

So your platform looks ok, or at least it's ok with a 4 mins recording. Probably we should test it for a longer period of time.
But, if I remember it correctly, you reported drops every minute in case of playback in memory mapped mode and this is weird.
Do you have a Dante AVIO USB device ? can you monitor for late packets and measure the avg and the peak latency of the output stream ?

@jtc-dolby
Copy link
Collaborator

I did see drops fairly frequently (every minute or so). I don't have a Dante AVIO device but the Lawo devices I do have gather statistics regarding late packets. I can use the same build on the same machine as I just ran the tests and see what happens. Is it possible there are differences between stream type or codec? My focus was the AM824 because this is carrying data and that is covered by a CRC so corruption is very apparent. I can put data inside L16/L24 for testing purposes.

@Aika0
Copy link

Aika0 commented Jan 26, 2021

Testing issue 17 with "tic_frame_size_at_1fs": 48 on various Raspberry Pis

Device A (10.1.14.120)
Raspberry Pi 4 Model B Rev 1.1
Hardware: BCM2711
Revision: a03111
Kernel: 5.4.83-v7l+
Result 1: ok
Result 2: ok
Result 3: ok

Device B (10.1.14.77)
Raspberry Pi 3 Model B Plus Rev 1.3
Hardware: BCM2845
Revision: a020d3
Kernel: 5.4.83-v7+
Result 1: ok
Result 2: ok
Result 3: ok

Device C (10.1.14.55)
Raspberry Pi 3 Model B Rev 1.2
Hardware: BCM2835
Revision: a02082
Kernel: 5.4.83-v7+
Result 1: ok
Result 2: ok
Result 3: ok

Device D (10.1.14.92)
Raspberry Pi 2 Model B Rev 1.1
Hardware: BCM2835
Revision: a21041
Kernel: 5.4.83-v7+
Result 1: ok
Result 2: ok
Result 3: ok

@bondagit
Copy link
Owner Author

bondagit commented Jan 27, 2021

Testing issue 17 with "tic_frame_size_at_1fs": 48 on various Raspberry Pis
....

Great job, thank you. The results were as expected as I was able to get it to work correctly even on a board with an ARM SOC older than the ones you tested.
My idea was to create a new page in the documentation with a list of compatible HWs. Anyway at the moment the test validates only the configuration with a single L24 / stereo / 48Khz source and sink which is however one of the most used.

@bondagit
Copy link
Owner Author

bondagit commented Jan 27, 2021

I did see drops fairly frequently (every minute or so). I don't have a Dante AVIO device but the Lawo devices I do have gather statistics regarding late packets. I can use the same build on the same machine as I just ran the tests and see what happens.

Thanks, I think this should help to narrow down the issue: If you have late packets it can be that your receiver device is dropping them (replacing the late samples with silence).

Is it possible there are differences between stream type or codec? My focus was the AM824 because this is carrying data and that is covered by a CRC so corruption is very apparent. I can put data inside L16/L24 for testing purposes.

Yes, a different codec can make a difference. For this reason I was also thinking to enhance this test suite to allow for different configurations. In the meantime you could try modifying my test to play an AM824 (L32) file and record an L32 wav file. Afterwards you should be able to use the CRC to detect registration errors instead of relying on the samples sequence as I did.
This way we can test using the loopback interface and we exclude any network related issue.

bondagit added a commit that referenced this issue Jan 31, 2021
To run the test suite:

./run_test.sh sample_format sample_rate channels duration
Where:
    sample_format can be one of S16_LE, S24_3LE, S32_LE
    sample_rate can be one of 44100, 48000, 96000
    channels can be one of 1, 2, 4
    duration is in the range 1 to 10 minutes

The test suite creates a raw file with the specified parameters, runs a loopback test where the file gets played and recorded using the loopback network interface and checks that the recorded file, after the initial silence, contains the expected samples sequence.
This test was developed to further investigate the issue #17.
@bondagit
Copy link
Owner Author

bondagit commented Jan 31, 2021

Hi @jtc-dolby ,
first of all please include a couple of new patches I have just released for the driver.
Apart from this I enhanced the test suite to support multiple configurations and I re-executed some tests on the platforms I have here with the results reported below.

On X86_64 with tic_frame_size_at_1fs set to 48 in test/daemon.conf, I have the following results:
./run_test.sh S16_LE 48000 2 5 -> fails
./run_test.sh S24_3LE 48000 2 5 -> fails
./run_test.sh S32_LE 48000 2 5 -> fails
On X86_64 with the tic_frame_size_at_1fs set to 96 and 192 in test/daemon.conf, I have the following results:
./run_test.sh S16_LE 48000 2 5 -> OK
./run_test.sh S24_3LE 48000 2 5 -> OK
./run_test.sh S32_LE 48000 2 5 -> OK
On ARM with tic_frame_size_at_1fs set to 48, 96, 192 in test/daemon.conf, I have the following results:
./run_test.sh S16_LE 48000 2 5 -> OK
./run_test.sh S24_3LE 48000 2 5 -> OK
./run_test.sh S32_LE 48000 2 5 -> OK

So, according to my tests the sample format doesn't affect the test result. One of the possible reasons is that the samples size of driver internal playback and capture buffers is always set to 32 bit.

Can you repeat these 3 tests using your platform ? Thanks.

bondagit added a commit that referenced this issue Feb 3, 2021
…al playback buffer corruption in case the read/write interleaved mode is used.

The problem arises from pre-buffering performed after ALSA prepare and before ALSA start trigger-
The patch enables the early startup of the audio playback interrupt and disables the cleanup (mute) of the playback buffer.
This issue can be reproduced using the test suite developed to investigate #17 and by removing the (-M) option from aplay in run_test.sh script.
@bondagit
Copy link
Owner Author

bondagit commented Feb 3, 2021

I have just found another issue in the driver. This time it's related to the playback in read-write interleaved mode and it affects the initial samples of the output stream.
The issue is solved by 77cf7bd. This patch has anyway no effect on the ./run_test.sh results reported above. The new problem can be easly reproduced by removing the option (-M) to aplay in run_test.sh script.

@bondagit
Copy link
Owner Author

bondagit commented Feb 7, 2021

Hi @jtc-dolby, did you have the chance to run the new tests? Thanks

@jtc-dolby
Copy link
Collaborator

I ran the test. Results below. I didn't reinstall the driver between tests but just edited the conf file and reran the script. Is that correct?

Here are my results from the tests:
On X86_64 - Ubuntu 19.10
with tic_frame_size_at_1fs set to 48 in test/daemon.conf, I have the following results:
./run_test.sh S16_LE 48000 2 5 -> fails
./run_test.sh S24_3LE 48000 2 5 -> fails
./run_test.sh S32_LE 48000 2 5 -> fails
with the tic_frame_size_at_1fs set to 96 in test/daemon.conf, I have the following results:
./run_test.sh S16_LE 48000 2 5 -> fails
./run_test.sh S24_3LE 48000 2 5 -> fails
./run_test.sh S32_LE 48000 2 5 -> OK
with the tic_frame_size_at_1fs set to 192 in test/daemon.conf, I have the following results:
./run_test.sh S16_LE 48000 2 5 -> OK
./run_test.sh S24_3LE 48000 2 5 -> fails
./run_test.sh S32_LE 48000 2 5 -> OK

@bondagit
Copy link
Owner Author

bondagit commented Feb 10, 2021

Yes, that's correct because the last test script uninstall and reinstall the driver at every run.
It's hard to draw a conclusion here but what I can certainly say is that you cannot achieve 1ms latency on the platform you are using, or better saying you cannot achieve it with a good level of reliabilty. And this is also consistent with the results you had using your final setup.
Please consider that the test result is very much affected by contention among the drivers & processes running on your platform and it provides only an indication of the latency you can achieve: the whole solution is not thought to offer a reliable communication channel but it's meant to keep the latency under control.
Anyway it can still be that the problem depends on some race condition or bug in the driver code.
At present I can achieve the best results using ARM boards running a minimal Linux distro dedicated to AES67 only.

@jtc-dolby
Copy link
Collaborator

OK. I think that is a reasonable theory. My machine is pretty old and 1ms is quite tight. I'd like to prove it by increasing the latency and making the problems go away. 192 (4ms) certainly looks better. Can I go higher? How about 480 (10ms)?

@bondagit
Copy link
Owner Author

yes, I have succesfully tried to raise it up 480 (10ms) included. If think even more should be possible.
I have just noticed that the last patch I released may create troubles. Please repeat the test leaving out the folloiwing patch for the driver:

ravenna-alsa-fix-playback-rw-mode.patch

This was not included at the time I ran the tests my side. I will revise it as soon as I have time.

@bondagit
Copy link
Owner Author

Hi @jtc-dolby,
I have reworked the patch for the read-write interleaved mode and also simplified and unified the driver PCM interface. See commit 8a56006
The previous patch was not always working properly and causing troubles also in memory mapped mode.
After this new patch I re-tested the configurations above multiple times on both X86_64 and ARM and I consistently have the same results already reported (by me) with memory mapped and read-write interleaved modes.
Could you re-try the tests your side using the same configurations you used previously and the latest build ?
Thanks.

@keygee
Copy link
Collaborator

keygee commented Feb 14, 2021

Hi Andrea,
I made some tests and I would like to share with you my results:

tests made:

tic_frame_size_at_1fs set to 48:

  • ./run_test.sh S16_LE 48000 2 5
  • ./run_test.sh S24_3LE 48000 2 5
  • ./run_test.sh S32_LE 48000 2 5

tic_frame_size_at_1fs set to 96:

  • ./run_test.sh S16_LE 48000 2 5
  • ./run_test.sh S24_3LE 48000 2 5
  • ./run_test.sh S32_LE 48000 2 5

tic_frame_size_at_1fs set to 192:

  • ./run_test.sh S16_LE 48000 2 5
  • ./run_test.sh S24_3LE 48000 2 5
  • ./run_test.sh S32_LE 48000 2 5

I made these tests on:

Machine 1:

  • Raspberry Pi 4 Model B Rev 1.1
    ARMv7 Processor rev 3 (v7l)
    Kernel: 4.19.108-v7l+ (cross Compiled)
    Distribution: CRUX 3.5

Machine 2

  • Sony VAIO VPCS13V9E
    Intel(R) Core(TM) i5 CPU M 460 @ 2.53GHz
    Kernel: 5.7.6
    Distribution: CRUX 3.6

Machine 3

  • DELL Optiplex 980
    Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
    Kernel: 5.7.6
    Distribution: CRUX 3.6

All results have been OK in all machines, except if I make "stressing work" during the test (compiling heavy code or transcoding videos...). Repeating the tests in a idle state resulted always OK.

If could be of any interest, I can share my .config ( or part of it ) of my kernels
Another info that could (or not) be of any interest, is that I compiled all without clang, because I haven't in my machines.

If you need more info or tests, please let me know.

Cheers
Guido

@bondagit
Copy link
Owner Author

Hi Guido ,
thanks a lot for taking the time to run these tests. Your timely contribution is very much appreciated ;-) and your tests also show that there is no open issue on the X86_64 platform.
I have also run more tests my side and the driver situation looks pretty stable now.
Of course as you experienced the tests results may be affected by other loads running on the CPU at the same time and this has been discussed already and the problem could be mitigated by setting high priority for aplay & arecord or just by taking care to stop additional loads during the tests ...
I will move the test suite to the master branch as replacement of the demo, add some documentation and prepare the repo for release 1.0

bondagit added a commit that referenced this issue Feb 16, 2021
- removed demo
- updated documentation with info about the platform compatiblity test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants