WIP: dav1d ITX AVX2 Assembly Port #117

frankplow · 2023-08-03T11:10:35Z

This PR will port dav1d's AVX2 inverse transform optimisations.

Jamaika1 · 2023-08-06T09:29:50Z

vvcdsp_init.c:264:33: error: assignment to expression with array type
  264 |     c->itx.itx[DCT2][0]         = ff_vvc_inv_dct2_2_##opt;                      \
      |                                 ^
vvcdsp_init.c:279:13: note: in expansion of macro 'IDCT2_INIT'
  279 |             IDCT2_INIT(avx2);
      |             ^~~~~~~~~~

Can add definition
#if HAVE_AVX2_EXTERNAL
...
#endif

frankplow · 2023-08-06T09:43:04Z

@Jamaika1

I'm sorry, I don't see where this code is from. There are currently no AVX2 optimisations implemented on this branch, and it looks to me like the version of vvcdsp_init.c this is referencing is from my frankplow:idct-asm branch.

Make the ITX DSP context inverse transform functions perform 2D transforms rather than 1D transforms. This will allow for more efficient assembly optimisations in future.

frankplow · 2023-08-14T09:04:20Z

Rebase and re-target onto new main, to reflect #116.

Previously the pixels were stored as `int`s. The maximum bit depth supported by VVC is 16, hence an `int16_t` is guaranteed to be sufficient. Smaller sizes allow for more pixels to be packed into an SIMD register and therefore faster assembly optimisations. Further optimisation could be achieved by using `int8_t`s if the bit depth of the current video is only 8.
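As a rough illustration of the packing argument (a sketch, not code from this PR; `lanes_per_register` and `YMM_BITS` are hypothetical names), the number of elements that fit in one 256-bit AVX2 register scales inversely with the element size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Width of an AVX2 (ymm) register in bits. */
#define YMM_BITS 256

/* Hypothetical helper: how many elements of elem_bytes bytes fit in one
 * register that is reg_bits bits wide. */
static size_t lanes_per_register(size_t reg_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return reg_bits / (8 * elem_bytes);
}
```

Under these assumptions, `lanes_per_register(YMM_BITS, sizeof(int32_t))` is 8, while `sizeof(int16_t)` gives 16 and `sizeof(int8_t)` gives 32, so halving the element size doubles the number of pixels handled per instruction.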

This is laying groundwork for assembly optimisations. It is easier to perform a column transform using SIMD instructions than it is to perform a row transform, so typically a transposition is required. Therefore, by expecting the output of the inverse transforms in column-major order, a transposition can be eliminated. This is also effectively free, with only a minor performance hit due to reduced cache-friendliness.
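A minimal scalar sketch of this idea, assuming an illustrative 4-point integer matrix `M` and omitting the rounding shifts and clipping a real decoder needs (all names here are hypothetical, not taken from the PR): two column passes with a single transpose in between produce the 2D inverse transform in column-major order.

```c
#include <assert.h>

#define N 4

/* Illustrative 4-point integer transform matrix (an assumption for this
 * sketch; each row is one basis function). */
static const int M[N][N] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* Inverse 1D transform of every column of a row-major block: out = M^T * in.
 * The innermost loop runs over x, i.e. over contiguous elements of a row,
 * so each x could map to one SIMD lane. */
static void inv_columns(const int *in, int *out)
{
    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++)
            out[y * N + x] = 0;
        for (int k = 0; k < N; k++)
            for (int x = 0; x < N; x++)
                out[y * N + x] += M[k][y] * in[k * N + x];
    }
}

static void transpose(const int *in, int *out)
{
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            out[x * N + y] = in[y * N + x];
}

/* 2D inverse transform built from two column passes and ONE transpose.
 * The result is the transpose of the usual row-major output, which is the
 * same data laid out in column-major order. */
static void inv_2d_column_major(const int *coeffs, int *out)
{
    int tmp[N * N], tmp_t[N * N];

    inv_columns(coeffs, tmp);   /* M^T * C                         */
    transpose(tmp, tmp_t);      /* C^T * M                         */
    inv_columns(tmp_t, out);    /* M^T * C^T * M = (M^T * C * M)^T */
}
```

A caller that wants row-major output would need a second transpose after the last pass; accepting column-major output is what removes it.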

When bit depth <= 10, sps_range_extension_flag must be 0, therefore sps_extended_precision_flag is not present and assumed to be 0. The derived Log2TransformRange, CoeffMin and CoeffMax are therefore constants for the 10-bit transform.
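As a sketch of what "constants" means here, assuming Log2TransformRange takes its non-extended value of 15 (the macro and function names below are hypothetical, not the identifiers used in the decoder):

```c
#include <assert.h>

/* Assumed value of Log2TransformRange when sps_extended_precision_flag is 0. */
#define LOG2_TRANSFORM_RANGE 15
#define COEFF_MIN (-(1 << LOG2_TRANSFORM_RANGE))     /* -32768 */
#define COEFF_MAX ((1 << LOG2_TRANSFORM_RANGE) - 1)  /*  32767 */

/* Hypothetical helper: clip an intermediate value to the coefficient range. */
static int clip_coeff(int v)
{
    return v < COEFF_MIN ? COEFF_MIN : v > COEFF_MAX ? COEFF_MAX : v;
}
```

Because the bounds are compile-time constants, a 10-bit code path can clip without looking anything up in the sequence parameter set.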

frankplow · 2023-08-17T12:00:37Z

@nuomi2021 Not urgent at the minute, but I thought I should make a comment about this before I forget it. Can you think of a better way to deal with the memory for transform blocks' transform coefficients and pixels?

After 36e4176 the transform coefficients and pixels are allocated separately in the VVCFrameContext:

FFmpeg/libavcodec/vvc/vvcdec.c

Lines 98 to 103 in 36e4176

    
    fc->tab.coeffs = av_malloc(ctu_count * sizeof(*fc->tab.coeffs) * ctu_size * VVC_MAX_SAMPLE_ARRAYS);
    if (!fc->tab.coeffs)
        return AVERROR(ENOMEM);
    fc->tab.pixels = av_malloc(ctu_count * sizeof(*fc->tab.pixels) * ctu_size * VVC_MAX_SAMPLE_ARRAYS);
    if (!fc->tab.pixels)
        return AVERROR(ENOMEM);

However this is redundant — pixels is not used before the transform and coeffs is not used afterwards.

I wondered if it would be possible to allocate for the coefficients statically for each transform block, for example at the top of itransform, however this is not possible as the CABAC must be able to put data into the coefficients array:

FFmpeg/libavcodec/vvc/vvc_cabac.c

Lines 2145 to 2152 in 2a7d709

    
    if (*abs_level) {
        tb->coeffs[off] = *coeff_sign_level * *abs_level;
        tb->max_scan_x = FFMAX(xc, tb->max_scan_x);
        tb->max_scan_y = FFMAX(yc, tb->max_scan_y);
        tb->min_scan_x = FFMIN(xc, tb->min_scan_x);
        tb->min_scan_y = FFMIN(yc, tb->min_scan_y);
    } else {
        tb->coeffs[off] = 0;

Perhaps there is some way the same memory could be used for both pixels and coeffs, as before, but reinterpreted as int16_ts or ints at different points in the program? The size of a pixel is always <= the size of a transform coefficient in the current version of VVC, the extreme case being if the bit depth is 16 and extended_precision_flag is 0, then both are 16 bits. Would just setting tb->pixels = (int16_t *) tb->coeffs be acceptable given this is the case?
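To make the size argument concrete, here is a minimal sketch of reusing one block's coefficient storage for 16-bit values. It assumes a single contiguous block and a hypothetical helper `narrow_in_place`; it says nothing about whether the decoder's actual buffer layout permits this. Every access goes through `memcpy`, so no object is read through a pointer of the wrong type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow n 32-bit coefficients stored at the start of
 * buf into n 16-bit values stored at the start of the same buffer.
 * Element i is written to bytes [2i, 2i + 2), which never overlaps the
 * not-yet-read coefficients at byte 4(i + 1) and beyond, so processing in
 * ascending order is safe. */
static void narrow_in_place(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t c;
        int16_t p;

        memcpy(&c, buf + i * sizeof(c), sizeof(c));
        assert(c >= INT16_MIN && c <= INT16_MAX); /* must fit in 16 bits */
        p = (int16_t)c;
        memcpy(buf + i * sizeof(p), &p, sizeof(p));
    }
}
```

A plain pointer cast such as `(int16_t *) tb->coeffs` relies on the same size relationship, but would additionally have to respect C's aliasing rules.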

frankplow · 2023-08-21T12:19:20Z

@nuomi2021 I have spent some more time trying to port the dav1d assembly to FFVVC, and I have come to the conclusion that it will not be possible. The reasoning for this is quite nuanced and mathematical so please bear with me. I am sure you are familiar with a good deal of the background here but I have added some for quick reference and to aid any other readers.

Any N-point 1D DCT (-II) can be expressed as a matrix multiplication, requiring N² multiplications and N² - N additions. This is not the most efficient way to compute the DCT however — the elements of the transform matrix are values of the cosine function, and so are not independent. In particular, the matrix exhibits various types of redundancies. This includes simple symmetries which can be accounted for using butterfly algorithms, as well as more complex relationships which rely on trigonometric identities. Fast DCT algorithms leverage these redundancies to compute the DCT with fewer multiplications and additions. dav1d in particular uses the Chen algorithm [1] to compute the DCT. A more recent description of this algorithm can be found in [2].
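The simplest of these redundancies, the even/odd symmetry, can be sketched for N = 4 as follows. The integer matrix `T` is an illustrative assumption chosen so the example avoids floating point; the function names are hypothetical.

```c
#include <assert.h>

/* Illustrative 4-point integer matrix with the even/odd symmetry of the
 * DCT-II (an assumption for this sketch). */
static const int T[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* Direct matrix multiplication: N^2 = 16 multiplications and
 * N^2 - N = 12 additions (not counting the zero initialisation). */
static void fwd4_matrix(const int x[4], int y[4])
{
    for (int k = 0; k < 4; k++) {
        y[k] = 0;
        for (int n = 0; n < 4; n++)
            y[k] += T[k][n] * x[n];
    }
}

/* Butterfly form: 6 multiplications and 8 additions/subtractions. */
static void fwd4_butterfly(const int x[4], int y[4])
{
    int e0 = x[0] + x[3], e1 = x[1] + x[2];  /* even part */
    int o0 = x[0] - x[3], o1 = x[1] - x[2];  /* odd part  */

    y[0] = 64 * (e0 + e1);
    y[2] = 64 * (e0 - e1);
    y[1] = 83 * o0 + 36 * o1;
    y[3] = 36 * o0 - 83 * o1;
}
```

Both functions compute the same outputs. The butterfly only exploits the mirror symmetry of the rows, not the trigonometric identities that faster factorisations depend on.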

Modern video coding standards do not use the actual DCT, whose transform coefficients are real numbers between -1 and 1, but instead an approximation of it using only integers. This is done for a number of pragmatic reasons centring around efficiency and consistency. Simply multiplying the DCT by a constant and rounding to the nearest integer produces an accurate approximation of the DCT, and therefore good compression performance, but it destroys many of the redundancies in the DCT. The design of an effective DCT approximation is a bit of an art form involving the balancing of these two competing factors: accuracy to the DCT, which results in good compression, and preservation of redundancies, which results in good performance.
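A small sketch of the scale-and-round approach for the 4-point case. The scale factor of 128 and all names are assumptions for illustration, and the cosine values are hardcoded so the example does not depend on the maths library.

```c
#include <assert.h>

/* Entries of the orthonormal 4-point DCT-II needed for this sketch:
 * c(k, n) = sqrt(2/4) * cos((2n + 1) * k * pi / 8) for k > 0. */
static const double DCT4_DC = 0.5;                          /* 1 / sqrt(4)            */
static const double DCT4_C1 = 0.7071067812 * 0.9238795325;  /* sqrt(1/2) * cos(pi/8)  */
static const double DCT4_C3 = 0.7071067812 * 0.3826834324;  /* sqrt(1/2) * cos(3pi/8) */

/* Scale by a constant and round to the nearest integer (x >= 0 only). */
static int scale_and_round(double x, double scale)
{
    return (int)(x * scale + 0.5);
}
```

With a scale of 128 this yields 64, 84 and 35. In the real-valued DCT these entries satisfy c1² + c3² = 2·c0², but after rounding 84² + 35² = 8281 whereas 2·64² = 8192, so the relationship no longer holds exactly.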

Unfortunately, the approximation used in HEVC, and the derived approximation in VVC, do not preserve the trigonometric redundancies required to employ the Chen algorithm used in the dav1d transform optimisation [3, section 6.2.2]. This makes the dav1d ITX assembly unusable for VVC.

I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in ea7a0ce and 36e4176, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC.

[1] Wen-Hsiung Chen, C. Smith, S. Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, DOI 10.1109/TCOM.1977.1093941
[2] C. J. Tablada, T. L. T. da Silveira, R. J. Cintra, F. M. Bayer, DCT Approximations Based on Chen's Factorization, DOI 10.1016/j.image.2017.06.014
[3] Vivienne Sze, Madhukar Budagavi, Gary J. Sullivan, High Efficiency Video Coding (HEVC) Algorithms and Architectures, DOI 10.1007/978-3-319-06895-4

nuomi2021 · 2023-08-22T14:28:14Z

> However this is redundant: pixels is not used before the transform and coeffs is not used afterwards.

Not possible. coeffs has a different layout from pixels. Suppose you have four 64x64 blocks, and each block has one coefficient. We only use the first four entries of coeffs. Combining the coeffs buffer with pixels would overwrite the second block's coefficients.

> Perhaps there is some way the same memory could be used for both pixels and coeffs, as before, but reinterpreted as int16_ts or ints at different points in the program? The size of a pixel is always <= the size of a transform coefficient in the current version of VVC, the extreme case being if the bit depth is 16 and extended_precision_flag is 0, then both are 16 bits. Would just setting tb->pixels = (int16_t *) tb->coeffs be acceptable given this is the case?

Could you explain why you want to do this?
Thank you.

nuomi2021 · 2023-08-22T14:34:30Z

> I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in ea7a0ce and 36e4176, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC.

Thank you for your hard work.

No worries. Please continue working on #114.
In the meantime, could you check the vvdec C and asm code?


frankplow · 2023-08-27T13:43:31Z

@nuomi2021

> No worries. Please continue working on #114. In the meantime, could you check the vvdec C and asm code?

vvdec has a very straightforward implementation in assembly. It does not use any fast algorithm like Hung, instead the transform is a simple matrix multiplication. I have read that in some instances this may be faster as it is more readily parallelisable, but I am not convinced — using perf.py I measure only a 1.1% performance improvement when toggling vvdec's transform optimisations, which is less than the difference I got with my optimisations using the Hung algorithm in #114. Additionally, it would be quite a lot of work to port these, as they are written using intrinsics and C++ templates.

I've also looked into adapting the assembly optimisations already in FFmpeg for HEVC. These are written directly in NASM assembly using x86inc.asm, so would require less work to port on that front. Additionally, they use the Hung algorithm, which theoretically should speed them up. The big obstacle in adapting them is the fact that they use 16-bit integers for both pixels and transform coefficients. This is not suitable for VVC as when sps_extended_precision_flag is set, the transform coefficients are bit_depth + 6 and so can be up to 22-bit. I can see two ways around this:

1. The easiest fix is to convert the 32-bit transform coefficients to 16-bit within the functions when possible, and disable the optimisations otherwise. This is complicated by the fact that these optimisations do a lot of reading from and writing to memory, so there is no single place where the `mova`s that load from memory can be replaced with `packssdw`s, for example. I have already implemented the naive approach of doing O(n²) `packssdw`s at the beginning of the function, ~~however this degraded the performance to below the C implementation.~~ (see correction below)
2. Alternatively, the function prototype itself could be changed to use 16-bit transform coefficients, either when the bit depth <= 10 (this implies sps_extended_precision_flag = 0) or based on sps_extended_precision_flag itself. This wouldn't require major changes to the assembly implementation, however there is currently no mechanism to have function prototypes change based on bit depth. I think this is the more worthwhile solution, as the ability to use smaller types when possible enables faster optimisations, but how much work do you think this would be?

In either case, the 2x2, 64x64, 1D and rectangular transforms would also need to be implemented as these are not present in HEVC.
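For reference, a scalar model of the narrowing step discussed in the first option. This is only a sketch: `sat16` mirrors the per-element signed saturation that `packssdw` performs, the lane ordering of the real instruction is not modelled, and `narrow_coeffs` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a signed saturating 32-bit to 16-bit narrowing, the
 * per-element operation performed by packssdw. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical helper: narrow n coefficients and report whether any value
 * was saturated, i.e. whether a 16-bit code path would have been lossy. */
static int narrow_coeffs(const int32_t *src, int16_t *dst, size_t n)
{
    int saturated = 0;

    for (size_t i = 0; i < n; i++) {
        dst[i] = sat16(src[i]);
        saturated |= dst[i] != src[i];
    }
    return saturated;
}
```

A non-zero return value marks the case where the 16-bit optimisations would have to be disabled, for example when extended precision produces coefficients wider than 16 bits.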

Correction: My checkasm test was flawed and this is not the case. My work done porting these optimisations, as well as the corrected checkasm benchmark, can be found on #130.

frankplow changed the title ~~dav1d ITX AVX2 Assembly Port~~ WIP: dav1d ITX AVX2 Assembly Port Aug 3, 2023

frankplow added 4 commits August 14, 2023 09:50

lavc/vvc: Make ITX modular at 2D level

ea7a0ce

lavc/vvc: Fix ITX non-zero width usage

76a20a7

libavutil/x86inc.asm: Add REPX from x264

2d8c399

tests/checkasm: Add VVC ITX test

b1ef0d3

frankplow force-pushed the dav1d-idct branch from ab168bb to b1ef0d3 Compare August 14, 2023 08:54

frankplow changed the base branch from 20230811 to main August 14, 2023 09:02

frankplow added 6 commits August 15, 2023 10:41

tests/checkasm/vvc_itx: Use int16_t dst arrays

719fe2a

tests/checkasm/vvc_itx: Fix dst stride

d1fa8b0

lavc/vvc: Port 4x4 DCT2/DCT2 10-bit from dav1d

df05755

lavc/vvc: Remove redundant warnings

2a7d709

frankplow added 3 commits August 18, 2023 09:31

lavc/x86/vvc_itx: Fix YASM compilation

6a1dfca

lavc/x86/vvc_itx: Import remaining dav1d ASM

56dfa0f

lavc/x86/vvc_itx: Change matrix coefficients to match VVC

e5dc619

frankplow closed this Aug 21, 2023

frankplow mentioned this pull request Aug 30, 2023

FFmpeg HEVC IDCT port #130

Draft

10 tasks
