
Performance optimization for _split_bitstream #564

Merged

Conversation

@johnboiles (Contributor) commented Sep 9, 2021

In profiling the performance of #559, I noticed that a ton of time is spent in _split_bitstream, so I set out to speed it up. In my test case (passing through un-transcoded 1080p H.264), it's now 280x faster than it was. I think it's also a lot easier to read.

Here's what I'm using to profile:

python -m cProfile -o out.profile webcam.py
snakeviz -s -H 0.0.0.0 out.profile
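(For reference, the same profile can also be inspected without snakeviz using the standard-library pstats module; this is just a minimal sketch, assuming the out.profile file produced by the command above.)

    import pstats

    # Load the cProfile output written above and print the 20 entries with the
    # largest cumulative time.
    stats = pstats.Stats("out.profile")
    stats.sort_stats("cumulative").print_stats(20)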

@johnboiles (Contributor Author) commented Sep 9, 2021

I noticed a lot of time was spent calling len. It looks like buf never changes in this method, so I first made a change to call len(buf) just once. If I'm reading this right, that resulted in a 42% improvement (using the cumulative per-call metric) to _split_bitstream. Here's without the optimization:

[screenshot: profiler output before the change]

Here's with the optimization:

[screenshot: profiler output after the change]

Cumulative time per call shows an improvement of 42%! I think this is the most relevant metric since it includes calls to sub-functions.

(0.007372-0.004275)/0.007372 = 0.42010309

But total time is also improved by 30%

(0.006159-0.004275)/0.006159 = 0.30589381

To me this indicates not that len is particularly expensive, but that making any function call this many times is really expensive. This is probably the kind of thing that inlining would fix in C++.
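As a rough illustration of that first change (hypothetical functions, not the actual aiortc code), hoisting len(buf) out of the loop means it is evaluated once per call instead of once per iteration:

    # Hypothetical example: count 3-byte start codes in a buffer.
    def count_start_codes_slow(buf: bytes) -> int:
        count = 0
        i = 0
        while i < len(buf) - 2:  # len(buf) is re-evaluated on every iteration
            if buf[i] == 0 and buf[i + 1] == 0 and buf[i + 2] == 1:
                count += 1
            i += 1
        return count

    def count_start_codes_fast(buf: bytes) -> int:
        count = 0
        i = 0
        buf_len = len(buf)  # buf never changes in this function, so compute once
        while i < buf_len - 2:
            if buf[i] == 0 and buf[i + 1] == 0 and buf[i + 2] == 1:
                count += 1
            i += 1
        return count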

@johnboiles (Contributor Author) commented Sep 13, 2021

Not satisfied with a 40% improvement, I dove deeper and got a 2+ order of magnitude speed increase 🚀 😄 Apparently bytes.find is very efficient.

After the first commit (roughly the same as the second pic in the first comment): 0.004232s cumulative per call

[screenshot: profiler output after the first commit]

Now with the second commit: 0.00002629s cumulative per call (and wayyyy down in the list of functions in the profile)

[screenshot: profiler output after the second commit]

Based on the original measurement of 0.007372s cumulative per call, that's a 99.64338% improvement or 280x faster!

We can squash these commits together when we merge, but I think they're interesting on their own to see the improvement process.
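For anyone curious, here's a rough sketch of the find-based idea (hypothetical code, not the exact patch): bytes.find runs in C, so scanning for the 0x00 00 01 start code with it is far cheaper than stepping through the buffer byte by byte in Python.

    def split_nal_units(buf: bytes):
        """Sketch only: yield NAL unit payloads from an Annex B byte stream."""
        start_code = b"\x00\x00\x01"
        i = buf.find(start_code)  # anything before the first start code is ignored
        while i != -1:
            nal_start = i + 3                    # payload begins after the start code
            i = buf.find(start_code, nal_start)  # next start code, or -1 at the end
            if i == -1:
                yield buf[nal_start:]
                return
            nal_end = i
            # If the next start code is the 4-byte 0x00 00 00 01 form, the extra
            # leading zero belongs to the start code, not to the current NAL unit.
            if nal_end > nal_start and buf[nal_end - 1] == 0:
                nal_end -= 1
            yield buf[nal_start:nal_end]

A real implementation also has to decide exactly how to handle the 3- versus 4-byte start-code forms, which is what the discussion below is about.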

@X-Ryl669 commented Sep 14, 2021

You're modifying the behavior of the code here. In H.264, a NAL unit can start with either 0x00 00 01 or 0x00 00 00 01, and in your patch you've dropped the second case. If I were you, I would check the byte before the 0x00 00 01 (once found), and if it's 0x00, adjust your length and pointer accordingly. (See line 237, where nal_start should be adjusted the way you did for nal_end.) As an (incorrect, but likely true 99.999% of the time) optimization, you can assume that if nal_end uses the 4-byte start code, then nal_start uses the same format, and avoid checking it, so the lines that read:

            elif buf[i - 1] == 0:
                # 4-byte start code case, jump back one byte
                yield buf[nal_start:i - 1]

should read:

            elif i > 0 and buf[i - 1] == 0:
                # 4-byte start code case, jump back one byte
                yield buf[nal_start - 1:i - 1]

Nice improvement by the way.

EDIT: Never mind, I misread the patch; it should be safe the way you've written it.

@johnboiles (Contributor Author) commented Sep 14, 2021

Haha, yeah, reading the old code it was really hard to understand which cases it was handling. The resulting code is simple, but it's the result of a lot of time spent getting confused and then clarifying :) Thanks for thinking about it!

I also went ahead and fixed up the test to actually check that the right packets are output (not just the right number of packets). It looks like it was using byte literals wrong anyway (b'\ff' != b'\xff'). If there's any lingering doubt about how this operates, we should add a couple of lines to the test case for it. You can run it with:

python -m unittest tests.test_h264.H264Test.test_split_bitstream
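(As an aside, the literal gotcha is easy to see in a Python REPL: b'\ff' is a two-byte string, a form feed followed by the letter f, while b'\xff' is the single byte 0xff.)

    >>> b'\ff'
    b'\x0cf'
    >>> len(b'\ff'), len(b'\xff')
    (2, 1)
    >>> b'\ff' == b'\xff'
    False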

> Nice improvement by the way.

Thanks!! I'm running this on a Raspberry Pi 3 (1.2 GHz), and this change alone took my CPU usage from ~80% of a core to 20% when passing through raw 1080p H.264 data. It was extremely satisfying.

@jlaine (Collaborator) commented Dec 2, 2021

This is awesome, thanks for the very detailed description, more readable code and improved tests!

@jlaine jlaine force-pushed the john/h264-split-bitstream-improvement branch 2 times, most recently from 14a3803 to 6a9b4ef on December 2, 2021 at 14:41
@jlaine (Collaborator) commented Dec 2, 2021

I fixed the linter error, but the test suite fails. Could you check whether it's the code or the test that's wrong?

@jlaine (Collaborator) commented Dec 8, 2021

@rprata @johnboiles any chance of fixing the failing test (or the code)?

@johnboiles (Contributor Author) commented Dec 8, 2021 via email

@jlaine (Collaborator) commented Dec 31, 2021

@johnboiles any news on this? I'd love to merge it.

@johnboiles (Contributor Author)

Weird, I must have made that last part of the test without actually running it. My bad. Should be good to go now.

@tmatth commented Jan 13, 2022

Bump

@jlaine (Collaborator) commented Jan 24, 2022

Hm, the linter error is your fault; the test against appr.tc is not.

@lgrahl commented Jan 24, 2022

Is the apprtc test against the live server? Because Google decided to take it down.

@jlaine (Collaborator) commented Jan 24, 2022

> Is the apprtc test against the live server? Because Google decided to take it down.

Yeah, it was. I've put together a PR that rips out anything related to AppRTC in #623.

@jlaine jlaine force-pushed the john/h264-split-bitstream-improvement branch from fab1a95 to ffeabce on January 24, 2022 at 10:36
@codecov bot commented Jan 24, 2022

Codecov Report

Merging #564 (acfa341) into main (a4acc4c) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #564   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         5675      5623   -52     
=========================================
- Hits          5675      5623   -52     
Impacted Files                      Coverage    Δ
src/aiortc/codecs/h264.py           100.00%     <100.00%> (ø)
src/aiortc/contrib/signaling.py     100.00%     <0.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4acc4c...acfa341.

@jlaine (Collaborator) commented Jan 24, 2022

OK, well, we're almost but not quite there: there is still a branch of code which has no corresponding unit test. Could you fix this, please?

@jlaine jlaine force-pushed the john/h264-split-bitstream-improvement branch from ffeabce to acfa341 on January 24, 2022 at 12:22
@jlaine (Collaborator) commented Jan 24, 2022

I have added the missing unit test, and am merging the PR, thanks so much!

@jlaine jlaine merged commit 14bf221 into aiortc:main Jan 24, 2022