VP8 performance tweaks
The biggest single performance hog in VP8 decoding is the in-loop deblocking filter: it takes at least 1/3 of CPU time in Safari at 480p with existing transcodes, and is hard to optimize without SIMD instructions:
- works on signed bytes, which would be a natural fit for SIMD
- requires clamping at multiple stages
- requires multiplication and shifting in a couple of places
- does addition that needs clamping
Because of the clamping, multiplication, etc., we can't simply pack four bytes into a 32-bit word and operate on them together.
It might be friendly to GPU usage, but that's a large undertaking.
The recommendation in the spec for "low-power devices" is to encode with the "simple" version of the loop filter forced, which can be done by passing `-profile 1` to vpxenc, or `-profile:v 1` to ffmpeg. This requires re-encoding files and has a quality trade-off, but makes a HUGE difference in decode speed.
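As a sketch, a re-encode forcing the simple loop filter might look like this (filenames and bitrate are illustrative placeholders, not the values used in testing):

```shell
# Re-encode with VP8 profile 1, which forces the "simple" loop filter.
# Input/output names and bitrate here are placeholders.
ffmpeg -i input.webm \
  -c:v libvpx -profile:v 1 -b:v 1M \
  -c:a copy \
  output-profile1.webm
```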
Safari:
- early 2015 MacBook Pro 13: 720p (comfy) / 1080p (barely)
- iPad Pro 9.7: 720p (comfy) / 1080p (barely)
- iPad Air: 480p (comfy)
- iPhone 5c: 240p (moderate)
Edge:
- early 2015 MacBook Pro 13: 720p (comfy) / 1080p (barely)
- Atom laptop: 480p
IE:
- early 2015 MacBook Pro 13: 480p
- Atom laptop: 240p
This performance sits between the current VP8 performance (with the "normal" loop filter) and the current Theora performance (with Theora's lower complexity), and in my opinion closes the gap enough that I'd be willing to use VP8 alone, with no Theora version, for adaptive streaming in the future. (Theora requires reapplying header packets when switching resolutions, and Ogg has weird page/packet properties that could make switching streams difficult. VP8 should work with off-the-shelf DASH players with only slight modifications.)
The simple deblocking filter is not as effective and can leave slightly more visible blocking in high-motion scenes, clear blue skies, etc. However, it still usually looks better than Theora in my informal testing. Bitrate could be increased moderately to compensate, but I'm not convinced it's necessary.
VP8 allows data to be 'sliced' or 'partitioned' so some parallelism in macroblock decoding can be extracted. Like VP9's tiling, this only benefits libvpx's decoder if the encoder used the appropriate option.
Not sure if there's a minimum size for partitions. Not 100% sure if the option is a straight number or a log2 like VP9's tiles. :P Try passing '2' for low-res, '4' for high-res, etc.
With ffmpeg, the option is `-slices <n>`.
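A combined invocation covering both tweaks might look something like this (slice count, filenames, and bitrate are illustrative):

```shell
# Encode with both the simple loop filter (profile 1) and sliced data
# partitions so libvpx's decoder can parallelize macroblock decoding.
# All concrete values here are placeholders.
ffmpeg -i input.webm \
  -c:v libvpx -profile:v 1 -slices 4 -b:v 2M \
  -c:a copy \
  output-sliced.webm
```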
It's probably possible to devise a non-shared-threading way of doing the multithreaded decoding, but likely very difficult. On the other hand, supporting libvpx's existing pthreads-based code with emscripten was mostly a matter of fixing emscripten to harmonize the USE_PTHREADS and MODULARIZE options.
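As a rough sketch, the kind of emscripten invocation involved looks like this (source file, thread-pool size, and export name are hypothetical placeholders, not the actual ogv.js build configuration):

```shell
# Build a modularized, pthreads-enabled module with emscripten.
# All names and values here are placeholders.
emcc decoder.c \
  -s USE_PTHREADS=1 \
  -s PTHREAD_POOL_SIZE=4 \
  -s MODULARIZE=1 \
  -s EXPORT_NAME="'OGVDecoderVideoMT'" \
  -o ogv-decoder-video-mt.js
```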
However, this requires SharedArrayBuffer and Atomics support in the browser, which has not yet reached any release version. Safari, Firefox, and Edge have it in their dev versions (Edge behind the 'experimental JavaScript features' flag), and Chrome is adding it soon.
Currently I'm building separate instances of the decoder modules for the single-threaded and multithreaded versions, since the multithreaded runtime is larger and has slightly different interfaces, though I think it would be possible to make a single version that works in both cases.
Code on 'mt' branch currently works in Safari and Firefox. Performs well in Safari, indeterminate in Firefox. Doesn't work in Edge, looks like a browser problem.
Note IE 11 will never gain the required interfaces, so can't rely on any boosts for IE here.
VP8 multithreaded decoding provides a noticeable boost to decode speed for 2-core and 2-core/4-thread scenarios in Safari Technical Preview on my Mac range:
- MacBook Pro 13" Mid-2010 - Core 2 Duo 2.4 GHz (2-core) - up to 720p24 plays well
- MacBook Pro 15" Mid-2010 - Core i7 first-gen 2.67 GHz (2-core/4-thread) - up to 1080p24
- MacBook Pro 13" Early-2015 - Core i7 fifth-gen (?) 3.1 GHz (2-core/4-thread) - up to 1080p24
In combination with profile 1's simplified loop filter, sliced encoding and multithreaded decoding provide roughly a 20-30% boost to decode speed (eyeballing the graphs), which can push a machine over the line to a higher resolution or reduce the number of dropped frames.
Decode times for VP8 profile 1 + MT decoding are similar to those of single-core Theora decoding, making VP8 even more of an option for ogv.js usage.
Have not tested more than 2 full cores; full quad-core could benefit even more at high resolutions.
Note that VP9 seems to do better with its tile columns, but still doesn't reach full linear scaling per core.