
Performance optimization for _split_bitstream #564

Merged

Conversation

@johnboiles (Contributor) commented Sep 9, 2021

In profiling the performance of #559, I noticed that a ton of time is spent in _split_bitstream, so I set out to speed it up. In my test case (passing through un-transcoded 1080p H.264), it's now 280x faster than it was. I think it's also a lot easier to read.

Here's what I'm using to profile:

python -m cProfile -o out.profile webcam.py
snakeviz -s -H 0.0.0.0 out.profile
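(For reference, the same profile can also be inspected without snakeviz using the standard-library pstats module; this is just a minimal sketch, assuming the out.profile file produced by the command above.)

    import pstats

    # Load the cProfile output written above and print the 20 entries with the
    # largest cumulative time.
    stats = pstats.Stats("out.profile")
    stats.sort_stats("cumulative").print_stats(20)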

@johnboiles (Contributor Author) commented Sep 9, 2021

I noticed a lot of time was spent calling len. It looks like buf never changes in this method, so I first made a change to call len(buf) just once. If I'm reading this right, that resulted in a 42% improvement (using the cumulative per-call metric) to _split_bitstream. Here's without the optimization:

[screenshot: profiler output before the change]

Here's with the optimization:

[screenshot: profiler output after the change]

Cumulative time per call shows an improvement of 42%! I think this is the most relevant metric since it includes calls to sub-functions.

(0.007372-0.004275)/0.007372 = 0.42010309

But total time is also improved by 30%

(0.006159-0.004275)/0.006159 = 0.30589381

To me this indicates not that len is particularly expensive, but that making any function call this many times is really expensive. This is probably the kind of thing that inlining would fix in C++.
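As a rough illustration of that first change (hypothetical functions, not the actual aiortc code), hoisting len(buf) out of the loop means it is evaluated once per call instead of once per iteration:

    # Hypothetical example: count 3-byte start codes in a buffer.
    def count_start_codes_slow(buf: bytes) -> int:
        count = 0
        i = 0
        while i < len(buf) - 2:  # len(buf) is re-evaluated on every iteration
            if buf[i] == 0 and buf[i + 1] == 0 and buf[i + 2] == 1:
                count += 1
            i += 1
        return count

    def count_start_codes_fast(buf: bytes) -> int:
        count = 0
        i = 0
        buf_len = len(buf)  # buf never changes in this function, so compute once
        while i < buf_len - 2:
            if buf[i] == 0 and buf[i + 1] == 0 and buf[i + 2] == 1:
                count += 1
            i += 1
        return count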

@johnboiles (Contributor Author) commented Sep 13, 2021

Not satisfied with a 40% improvement, I dove deeper and got a 2+ order of magnitude speed increase 🚀 😄 Apparently bytes.find is very efficient.

After the first commit (roughly the same as the second pic in the first comment): 0.004232s cumulative per call

[screenshot: profiler output after the first commit]

Now with the second commit: 0.00002629s cumulative per call (and wayyyy down in the list of functions in the profile)

[screenshot: profiler output after the second commit]

Based on the original measurement of 0.007372s cumulative per call, that's a 99.64338% improvement or 280x faster!

We can squash these commits together when we merge, but I think they're interesting on their own to see the improvement process.
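For anyone curious, here's a rough sketch of the find-based idea (hypothetical code, not the exact patch): bytes.find runs in C, so scanning for the 0x00 00 01 start code with it is far cheaper than stepping through the buffer byte by byte in Python.

    def split_nal_units(buf: bytes):
        """Sketch only: yield NAL unit payloads from an Annex B byte stream."""
        start_code = b"\x00\x00\x01"
        i = buf.find(start_code)  # anything before the first start code is ignored
        while i != -1:
            nal_start = i + 3                    # payload begins after the start code
            i = buf.find(start_code, nal_start)  # next start code, or -1 at the end
            if i == -1:
                yield buf[nal_start:]
                return
            nal_end = i
            # If the next start code is the 4-byte 0x00 00 00 01 form, the extra
            # leading zero belongs to the start code, not to the current NAL unit.
            if nal_end > nal_start and buf[nal_end - 1] == 0:
                nal_end -= 1
            yield buf[nal_start:nal_end]

A real implementation also has to decide exactly how to handle the 3- versus 4-byte start-code forms, which is what the discussion below is about.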

@X-Ryl669 commented Sep 14, 2021

You're modifying the behavior of the code here. In H.264, a NAL unit can start with either 0x00 00 01 or 0x00 00 00 01, and in your patch you've dropped the second case. If I were you, I would check the byte before the 0x00 00 01 (once found), and if it's 0x00, adjust your length and pointer accordingly. (See line 237, where nal_start should be adjusted the way you did for nal_end.) As an (incorrect, but likely true 99.999% of the time) optimization, you can assume that if nal_end uses the 4-byte start code, then nal_start uses the same format, and avoid checking it, so the lines that read:

            elif buf[i - 1] == 0:
                # 4-byte start code case, jump back one byte
                yield buf[nal_start:i - 1]

should read:

            elif i > 0 and buf[i - 1] == 0:
                # 4-byte start code case, jump back one byte
                yield buf[nal_start - 1:i - 1]

Nice improvement by the way.

EDIT: Never mind, I misread the patch; it should be safe the way you've written it.

@johnboiles (Contributor Author) commented Sep 14, 2021

Haha, yeah, reading the old code it was really hard to understand which cases it was handling. The resulting code is simple, but it's the result of a lot of time spent getting confused and then clarifying :) Thanks for thinking about it!

I also went ahead and fixed up the test to actually check that the right packets are output (not just the right number of packets). It looks like it was using byte literals wrong anyway (b'\ff' != b'\xff'). If there's any lingering doubt about how this operates, we should add a couple of lines to the test case for it. You can run it with:

python -m unittest tests.test_h264.H264Test.test_split_bitstream
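(As an aside, the literal gotcha is easy to see in a Python REPL: b'\ff' is a two-byte string, a form feed followed by the letter f, while b'\xff' is the single byte 0xff.)

    >>> b'\ff'
    b'\x0cf'
    >>> len(b'\ff'), len(b'\xff')
    (2, 1)
    >>> b'\ff' == b'\xff'
    False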

> Nice improvement by the way.

Thanks!! I'm running this on a Raspberry Pi 3 (1.2 GHz), and this change alone took my CPU usage from ~80% of a core to 20% when passing through raw 1080p H.264 data. It was extremely satisfying.

@jlaine (Collaborator) commented Dec 2, 2021

This is awesome, thanks for the very detailed description, more readable code and improved tests!

@jlaine jlaine force-pushed the john/h264-split-bitstream-improvement branch 2 times, most recently from 14a3803 to 6a9b4ef on December 2, 2021 at 14:41
@jlaine (Collaborator) commented Dec 2, 2021

I fixed the linter error, but the test suite fails. Could you check whether it's the code or the test that's wrong?

@jlaine (Collaborator) commented Dec 8, 2021

@rprata @johnboiles any chance of fixing the failing test (or the code)?

@johnboiles (Contributor Author) commented Dec 8, 2021 via email

@jlaine (Collaborator) commented Dec 31, 2021

@johnboiles any news on this? I'd love to merge it.

@johnboiles (Contributor Author)

Weird, I must have made that last part of the test without actually running it. My bad. Should be good to go now.

@tmatth commented Jan 13, 2022

Bump

@jlaine (Collaborator) commented Jan 24, 2022

Hm, the linter error is your fault; the test against appr.tc is not.

@lgrahl commented Jan 24, 2022

Is the apprtc test against the live server? Because Google decided to take it down.

@jlaine (Collaborator) commented Jan 24, 2022

> Is the apprtc test against the live server? Because Google decided to take it down.

Yeah, it was. I've put together a PR that rips out anything related to AppRTC in #623.

@jlaine jlaine force-pushed the john/h264-split-bitstream-improvement branch from fab1a95 to ffeabce on January 24, 2022 at 10:36
@codecov bot commented Jan 24, 2022

Codecov Report

Merging #564 (acfa341) into main (a4acc4c) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #564   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         5675      5623   -52     
=========================================
- Hits          5675      5623   -52     
Impacted Files                      Coverage    Δ
src/aiortc/codecs/h264.py           100.00%     <100.00%> (ø)
src/aiortc/contrib/signaling.py     100.00%     <0.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4acc4c...acfa341.

@jlaine (Collaborator) commented Jan 24, 2022

OK, well, we're almost but not quite there: there is still a branch of code which has no corresponding unit test. Could you fix this, please?

@jlaine jlaine force-pushed the john/h264-split-bitstream-improvement branch from ffeabce to acfa341 on January 24, 2022 at 12:22
@jlaine (Collaborator) commented Jan 24, 2022

I have added the missing unit test, and am merging the PR, thanks so much!

@jlaine jlaine merged commit 14bf221 into aiortc:main Jan 24, 2022