WIP: dav1d ITX AVX2 Assembly Port #117
Can add definition |
I'm sorry, I don't see where this code is from. There are currently no AVX2 optimisations implemented on this branch, and it looks to me like the version of vvcdsp_init.c this is referencing is from my frankplow:idct-asm branch. |
Make the ITX DSP context inverse transform functions perform 2D transforms rather than 1D transforms. This will allow for more efficient assembly optimisations in future.
|
Rebase and re-target onto new main, to reflect #116. |
Previously the pixels were stored as `int`s. The maximum bit depth supported by VVC is 16, hence an `int16_t` is guaranteed to be sufficient. Smaller sizes allow for more pixels to be packed into an SIMD register and therefore faster assembly optimisations. Further optimisation could be achieved by using `int8_t`s if the bit depth of the current video is only 8.
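The packing argument above can be made concrete with a small sketch. It assumes a 256-bit AVX2 register and a 32-bit `int`; `YMM_BYTES` and `samples_per_ymm` are hypothetical names used only for illustration and are not part of FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Width of one AVX2 (ymm) register in bytes. */
#define YMM_BYTES 32

/* Number of samples of the given size that fit in one ymm register.
 * Hypothetical helper, for illustration only. */
static inline size_t samples_per_ymm(size_t sample_size)
{
    assert(sample_size > 0 && YMM_BYTES % sample_size == 0);
    return YMM_BYTES / sample_size;
}
```

Under these assumptions, one register holds 8 samples stored as 32-bit integers, 16 stored as `int16_t`, and 32 stored as `int8_t`.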
This lays the groundwork for assembly optimisations. It is easier to perform a column transform using SIMD instructions than it is to perform a row transform, which typically requires a transposition. Therefore, by expecting the output of the inverse transforms in column-major order, a transposition can be eliminated. This is also effectively free, with only a minor performance hit due to reduced cache-friendliness.
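A scalar sketch of why column-major output saves a transposition is given below. It is not the code from this branch: the 4x4 `kernel` is a toy stand-in for a real inverse transform, and all function names are hypothetical. Both passes work down columns, which is the SIMD-friendly direction, with a single transposition between them. The result comes out transposed, that is, in column-major order, so no second transposition is needed if the caller accepts that layout.

```c
#include <assert.h>
#include <stdint.h>

#define N 4

/* Toy 4-point kernel standing in for a real 1D inverse transform
 * (an assumption made for illustration). */
static const int16_t kernel[N][N] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

/* Apply the 1D kernel down every column of a row-major NxN block. */
static void column_pass(int32_t *dst, const int32_t *src)
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            int32_t sum = 0;
            for (int k = 0; k < N; k++)
                sum += kernel[r][k] * src[k * N + c];
            dst[r * N + c] = sum;
        }
}

/* Transpose a row-major NxN block. */
static void transpose(int32_t *dst, const int32_t *src)
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            dst[c * N + r] = src[r * N + c];
}

/* 2D transform of a row-major input. Element (r, c) of the result is
 * written to out[c * N + r], i.e. the output is column-major. */
static void itx_2d_column_major(int32_t *out, const int32_t *in)
{
    int32_t tmp[N * N], tmp_t[N * N];

    assert(out != in);        /* in-place operation is not supported */
    column_pass(tmp, in);     /* first 1D pass, down the columns */
    transpose(tmp_t, tmp);    /* the one remaining transposition */
    column_pass(out, tmp_t);  /* second 1D pass, result is transposed */
}
```

Producing row-major output instead would need one more call to `transpose` at the end, which is the step the commit message describes eliminating.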
When bit depth <= 10, sps_range_extension_flag must be 0, therefore sps_extended_precision_flag is not present and assumed to be 0. The derived Log2TransformRange, CoeffMin and CoeffMax are therefore constants for the 10-bit transform.
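As a sketch of what these constants look like, assuming Log2TransformRange is 15 when sps_extended_precision_flag is 0, and that CoeffMin and CoeffMax are derived as -(1 << Log2TransformRange) and (1 << Log2TransformRange) - 1: the macro and function names below are hypothetical and are not taken from the FFVVC source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; values assume sps_extended_precision_flag == 0. */
#define LOG2_TRANSFORM_RANGE_10BIT 15
#define COEFF_MIN_10BIT (-(1 << LOG2_TRANSFORM_RANGE_10BIT))
#define COEFF_MAX_10BIT ((1 << LOG2_TRANSFORM_RANGE_10BIT) - 1)

/* With these values the coefficient range is exactly that of int16_t. */
static_assert(COEFF_MIN_10BIT == INT16_MIN, "CoeffMin should equal INT16_MIN");
static_assert(COEFF_MAX_10BIT == INT16_MAX, "CoeffMax should equal INT16_MAX");

/* Clip an intermediate value to [CoeffMin, CoeffMax]. */
static inline int16_t clip_coeff_10bit(int32_t v)
{
    if (v < COEFF_MIN_10BIT)
        return COEFF_MIN_10BIT;
    if (v > COEFF_MAX_10BIT)
        return COEFF_MAX_10BIT;
    return (int16_t)v;
}
```

Because the bounds are compile-time constants, the clip needs no per-stream lookup.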
|
@nuomi2021 Not urgent at the minute, but I thought I should make a comment about this before I forget it. Can you think of a better way to deal with the memory for transform blocks' transform coefficients and pixels? After 36e4176 the transform coefficients and pixels are allocated separately (FFmpeg/libavcodec/vvc/vvcdec.c, lines 98 to 103 in 36e4176).
However this is redundant — pixels is not used before the transform and coeffs is not used afterwards.
I wondered if it would be possible to allocate for the coefficients statically for each transform block, for example at the top of FFmpeg/libavcodec/vvc/vvc_cabac.c Lines 2145 to 2152 in 2a7d709
Perhaps there is some way the same memory could be used for both |
|
@nuomi2021 I have spent some more time trying to port the dav1d assembly to FFVVC, and I have come to the conclusion that it will not be possible. The reasoning for this is quite nuanced and mathematical, so please bear with me. I am sure you are familiar with a good deal of the background here, but I have added some for quick reference and to aid any other readers.

Any N-point 1D DCT (-II) can be expressed as a matrix multiplication, requiring N² multiplications and N² - N additions. This is not the most efficient way to compute the DCT, however: the elements of the transform matrix are values of the cosine function, and so are not independent. In particular, the matrix exhibits various types of redundancy. These include simple symmetries, which can be exploited using butterfly algorithms, as well as more complex relationships which rely on trigonometric identities. Fast DCT algorithms leverage these redundancies to compute the DCT with fewer multiplications and additions. dav1d in particular uses the Chen algorithm [1] to compute the DCT. A more recent description of this algorithm can be found in [2].

Modern video coding standards do not use the actual DCT, whose transform coefficients are real numbers between -1 and 1, but instead an approximation of it using only integers. This is done for a number of pragmatic reasons centring around efficiency and consistency. Simply multiplying the DCT by a constant and rounding to the nearest integer produces an accurate approximation of the DCT and therefore good compression performance, but destroys many of the redundancies in the DCT. The design of an effective DCT approximation is a bit of an art form, balancing two competing factors: accuracy to the DCT, which results in good compression, and preservation of redundancies, which results in good performance.
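To illustrate the simple symmetries mentioned above, here is a minimal sketch of a forward 4-point transform computed two ways. It uses the 4-point integer coefficients 64, 83 and 36 from the HEVC core transform. The function names are hypothetical, and this is an even/odd butterfly only, not the Chen algorithm and not code from dav1d or FFmpeg.

```c
#include <assert.h>
#include <stdint.h>

/* 4-point integer DCT-II approximation (HEVC core transform values). */
static const int16_t dct4[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* Direct matrix multiplication: 16 multiplications, 12 additions. */
static void dct4_matrix(int32_t *y, const int16_t *x)
{
    assert(x && y);
    for (int i = 0; i < 4; i++) {
        y[i] = 0;
        for (int k = 0; k < 4; k++)
            y[i] += dct4[i][k] * x[k];
    }
}

/* Even/odd butterfly: 6 multiplications, 8 additions. */
static void dct4_butterfly(int32_t *y, const int16_t *x)
{
    assert(x && y);
    const int32_t e0 = x[0] + x[3], e1 = x[1] + x[2]; /* even part */
    const int32_t o0 = x[0] - x[3], o1 = x[1] - x[2]; /* odd part  */

    y[0] = 64 * (e0 + e1);
    y[2] = 64 * (e0 - e1);
    y[1] = 83 * o0 + 36 * o1;
    y[3] = 36 * o0 - 83 * o1;
}
```

Both functions produce identical output for any input. The butterfly relies only on the mirror symmetry of the matrix rows, which an integer approximation can keep even when it loses the trigonometric relationships that faster algorithms depend on.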
Unfortunately, the approximation used in HEVC, and the approximation derived from it in VVC, do not preserve the trigonometric redundancies required to employ the Chen algorithm used in the dav1d transform optimisation [3, section 6.2.2]. This makes the dav1d ITX assembly unusable for VVC. I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in ea7a0ce and 36e4176, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC.

[1] Wen-Hsiung Chen, C. Smith, S. Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, DOI 10.1109/TCOM.1977.1093941, (free pdf) |
Not possible. coeffs has a different layout than pixels. Suppose you have four 64x64 blocks: each block has one coeffs.
Could you explain why you want to do this? |
No worries. Please continue working on #114
|
vvdec has a very straightforward implementation in assembly. It does not use any fast algorithm like Hung, instead the transform is a simple matrix multiplication. I have read that in some instances this may be faster as it is more readily parallelisable, but I am not convinced — using I've also looked into adapting the assembly optimisations already in FFmpeg for HEVC. These are written directly in NASM assembly using x86inc.asm, so would require less work to port on that front. Additionally, they use the Hung algorithm, which theoretically should speed them up. The big obstacle in adapting them is the fact that they use 16-bit integers for both pixels and transform coefficients. This is not suitable for VVC as when
In either case, the 2x2, 64x64, 1D and rectangular transforms would also need to be implemented as these are not present in HEVC. Correction: My checkasm test was flawed and this is not the case. My work done porting these optimisations, as well as the corrected checkasm benchmark, can be found on #130. |
This PR will port dav1d's AVX2 inverse transform optimisations.