vvc_sao: Move accelerated HEVC SAO assembly code to VVC #53

zackerthescar · 2023-03-27T20:52:05Z

Successfully moved accelerated HEVC SAO assembly code to VVC. Both assembly files are essentially untouched from the HEVC SAO code, other than changing function names. This hasn't been tested yet, but compiles and if issue #45 is correct, should run just fine.

Function pointers are initialized to the exact index as they are in x86/hevcdsp_init.c. Hopefully hevcdsp.c fills in the other function pointers or this will segfault.

nuomi2021 · 2023-03-28T15:21:41Z

Hi @zackerthescar ,
Thank you for the patch.
Could you attach the 3 times average fps before and after the patch? Just like #42
It's wonderful if you can also provide checksam results for sao f94f48d

thank you

zackerthescar · 2023-03-29T20:52:28Z

Improvements of around 6%-12%, although it seems like there are some issues

Command is ./ffmpeg -y -i [clip name] -c:v mpeg4 -q:v 2 -f mp4 [output]
Results obtained on Intel Core i7-8700
Results are an average of 3 runs

Clip	C	AVX2	delta
BasketballDrive_1920x1080_50_10_420_37_RA.266	62.67	67	6.09%
RitualDance_1920x1080_60_10_420_32_LD.266	67.67	76	12.3%
RitualDance_1920x1080_60_10_420_37_RA.266	70	74.33	6.19%
Tango2_3840x2160_60_10_420_27_LD.266	17.33	19	9.63%

See attached images below, shows some issues. These are not present when using the C SAO filter.

~~I suspect the issue might be because I only populate [0-4] of the function pointer array. I'm not exactly sure what to populate pointers [5-8] with. Any hints or suggestions is welcome.~~ See below
I will work on checkasm for sao now. Hopefully that reveals some issues.

zackerthescar · 2023-03-29T20:53:20Z

Closed by accident, sorry.

zackerthescar · 2023-03-30T04:35:53Z

Added vvc_sao checks for checkasm. Revealed some issues, to say the least.

In the avx2 case, SAO band filter passes completely, but none of the edge filters pass.

./checkasm
...
AVX2:
...
 - vvc_sao.sao_band                [OK]
   vvc_sao_edge_32_8_avx2 (vvc_sao.c:136)
   vvc_sao_edge_48_8_avx2 (vvc_sao.c:136)
   vvc_sao_edge_64_8_avx2 (vvc_sao.c:136)
   vvc_sao_edge_8_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_16_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_32_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_48_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_64_10_avx2 (vvc_sao.c:136)
 - vvc_sao.sao_edge                [FAILED]
 ...

I'm pretty sure that's why I'm getting the artifacts on the output file.
EDIT: Just confirmed it by removing the AVX2 accelerated edge filter from init. Output is clean with AVX2 band filter and C edge filter.

nuomi2021 · 2023-03-30T13:03:44Z

Improvements of around 6%-12%, although it seems like there are some issues

Command is ./ffmpeg -y -i [clip name] -c:v mpeg4 -q:v 2 -f mp4 [output] Results obtained on Intel Core i7-8700 Results are an average of 3 runs

this will do the mprg4 encode. Please use
./ffmpeg_g -i Tango2_3840x2160_60_10_420_27_LD.266 -vsync 0 -f rawvideo /dev/null -y
to test

nuomi2021 · 2023-03-30T13:05:13Z

I'm pretty sure that's why I'm getting the artifacts on the output file. EDIT: Just confirmed it by removing the AVX2 accelerated edge filter from init. Output is clean with AVX2 band filter and C edge filter.

could you help debug why it's mismatched?

Jamaika1 · 2023-03-30T14:05:09Z

vvcdsp_init.c:166:32: warning: assignment to 'void (*)(uint8_t *, uint8_t *, ptrdiff_t,  int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, unsigned char *, long long int,  short int *, int,  int,  int)'} from incompatible pointer type 'void (*)(uint8_t *, const uint8_t *, ptrdiff_t,  const int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, const unsigned char *, long long int,  const short int *, int,  int,  int)'} [-Wincompatible-pointer-types]
  166 |     c->sao.edge_filter[1]      = ff_vvc_sao_edge_filter_16_##bitd##_##opt;  \
      |                                ^
vvcdsp_init.c:208:17: note: in expansion of macro 'SAO_EDGE_INIT'
  208 |                 SAO_EDGE_INIT(10, avx2);
      |                 ^~~~~~~~~~~~~

zackerthescar · 2023-03-30T18:45:27Z

I'm pretty sure that's why I'm getting the artifacts on the output file. EDIT: Just confirmed it by removing the AVX2 accelerated edge filter from init. Output is clean with AVX2 band filter and C edge filter.

could you help debug why it's mismatched?

Yep, working on this right now. Hoping to debug this soon. I will post the results w/ no encoder running soon.

zackerthescar · 2023-03-30T18:47:28Z

vvcdsp_init.c:166:32: warning: assignment to 'void (*)(uint8_t *, uint8_t *, ptrdiff_t,  int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, unsigned char *, long long int,  short int *, int,  int,  int)'} from incompatible pointer type 'void (*)(uint8_t *, const uint8_t *, ptrdiff_t,  const int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, const unsigned char *, long long int,  const short int *, int,  int,  int)'} [-Wincompatible-pointer-types]
  166 |     c->sao.edge_filter[1]      = ff_vvc_sao_edge_filter_16_##bitd##_##opt;  \
      |                                ^
vvcdsp_init.c:208:17: note: in expansion of macro 'SAO_EDGE_INIT'
  208 |                 SAO_EDGE_INIT(10, avx2);
      |                 ^~~~~~~~~~~~~

This is due to the fact that HEVC's SAO filter declares the source pointers const. I already have a build that doesn't have these pointers const'ed, and I will push it as soon as possible.

nuomi2021 · 2023-04-01T03:30:14Z

@zackerthescar , we are close to the GSOC deadline.
You can focus on small functions, like band_filter 4x4, and make it pass the checkasm test. then we can check the code in
You can write down the improvement plan in your proposal, and finish SAO and Deblock for the GSOC project.

thank you.

zackerthescar · 2023-04-02T03:25:21Z

I have added commit 4104b68 which removes the code to initialize the AVX2-accelerated edge functions. This allows checkasm to pass. I will post benchmarks in the next couple of hours, then work on the smaller band_filter 4x4 function and making sure it passes checkasm as you suggested.

I have written up a GSoC proposal and will submit it soon, within the next few hours.

Update:
Benchmark with only the SAO Band filter initialized to AVX2.
Command is ./ffmpeg -y -i [clip name] -vsync 0 -f rawvideo /dev/null
Results obtained on Intel Core i7-8700
Results are an average of 3 runs

Clip	C	AVX2	delta
BasketballDrive_1920x1080_50_10_420_37_RA.266	75	75	0%
RitualDance_1920x1080_60_10_420_32_LD.266	87.33	87.67	1.00%
RitualDance_1920x1080_60_10_420_37_RA.266	81.33	82.33	1.01%
Tango2_3840x2160_60_10_420_27_LD.266	21	21	0%

Thanks.

libavcodec/x86/vvcdsp_init.c

nuomi2021 · 2023-04-02T06:38:29Z

.vscode/settings.json

@@ -0,0 +1,5 @@
+{


we do not need this

This was something automatically added by vscode. Removed in my rebase.

nuomi2021 · 2023-04-02T06:40:23Z

libavcodec/x86/vvcdsp_init.c

@@ -120,19 +120,17 @@ static void alf_filter_chroma_8_avx2(uint8_t *dst, ptrdiff_t dst_stride, const u
        c->alf.filter[CHROMA] = alf_filter_chroma_##depth##_avx2;               \
    } while (0)

-#define SAO_BAND_FILTER_FUNCS(bitd, opt)                                                                                      \


We do not need to change this. how about we rebase and combine this with other commit

zackerthescar · 2023-04-04T05:09:04Z

Sorry about the delay, I've been busy with some schoolwork.

I've rebased and force-pushed only the needed changes based on your comments. checkasm should now pass, and

How about we make _src and sao_offset_val as const.

has been done.

Thanks.

nuomi2021 · 2023-04-06T12:52:54Z

Hi @zackerthescar ,
thank you for the update.
On my i7 4770k, vvc_sao_band_64_10_avx2 3 times slower than vvc_sao_band_64_10_c, are you happen to know why?
thank you

AVX2:

vvc_sao.sao_band [OK]
checkasm: all 10 tests passed
vvc_sao_band_8_8_c: 825.2
vvc_sao_band_8_8_avx2: 65.7
vvc_sao_band_8_10_c: 878.2
vvc_sao_band_8_10_avx2: 56.0
vvc_sao_band_16_8_c: 3047.2
vvc_sao_band_16_8_avx2: 182.2
vvc_sao_band_16_10_c: 3491.2
vvc_sao_band_16_10_avx2: 98.5
vvc_sao_band_32_8_c: 23144.7
vvc_sao_band_32_8_avx2: 345.0
vvc_sao_band_32_10_c: 13307.5
vvc_sao_band_32_10_avx2: 350.2
vvc_sao_band_48_8_c: 26881.7
vvc_sao_band_48_8_avx2: 1040.5
vvc_sao_band_48_10_c: 29488.7
vvc_sao_band_48_10_avx2: 765.7
vvc_sao_band_64_8_c: 47141.5
vvc_sao_band_64_8_avx2: 1302.0
vvc_sao_band_64_10_c: 53341.0
vvc_sao_band_64_10_avx2: 154115.2

zackerthescar · 2023-04-09T04:08:31Z

Hi @zackerthescar , thank you for the update. On my i7 4770k, vvc_sao_band_64_10_avx2 3 times slower than vvc_sao_band_64_10_c, are you happen to know why? thank you

AVX2:

* vvc_sao.sao_band [OK]
  checkasm: all 10 tests passed
  vvc_sao_band_8_8_c: 825.2
  vvc_sao_band_8_8_avx2: 65.7
  vvc_sao_band_8_10_c: 878.2
  vvc_sao_band_8_10_avx2: 56.0
  vvc_sao_band_16_8_c: 3047.2
  vvc_sao_band_16_8_avx2: 182.2
  vvc_sao_band_16_10_c: 3491.2
  vvc_sao_band_16_10_avx2: 98.5
  vvc_sao_band_32_8_c: 23144.7
  vvc_sao_band_32_8_avx2: 345.0
  vvc_sao_band_32_10_c: 13307.5
  vvc_sao_band_32_10_avx2: 350.2
  vvc_sao_band_48_8_c: 26881.7
  vvc_sao_band_48_8_avx2: 1040.5
  vvc_sao_band_48_10_c: 29488.7
  vvc_sao_band_48_10_avx2: 765.7
  vvc_sao_band_64_8_c: 47141.5
  vvc_sao_band_64_8_avx2: 1302.0
  vvc_sao_band_64_10_c: 53341.0
  **vvc_sao_band_64_10_avx2: 154115.2**

Hi, sorry for the late reply. I was able to confirm that vvc_sao_band_64_10_avx2 is indeed slower on my machine as well. I'm quite confused at this point, since the code is identical to the HEVC version and it doesn't exhibit this behaviour there.

I'm guessing this is due to the fact that we have 9 pointers in VVC (vvcdsp.h) versus 5 in HEVC (hevcdsp.h)

Perhaps we need more widths beyond 8, 16, 32, 48, 64 to fill all 9 func pointers?

nuomi2021 · 2023-04-09T05:49:43Z

Hi @zackerthescar ,
Thank you for the reply.
hevc support 64x64 block too. did you check hevc's 10 bits 64x64 block status?

zackerthescar · 2023-04-11T05:49:57Z

Hi @zackerthescar , Thank you for the reply. hevc support 64x64 block too. did you check hevc's 10 bits 64x64 block status?

Sorry for the late reply. As far as I know, the same code for 64x64 block in HEVC is faster in AVX2 than C.

$ tests/checkasm/checkasm --test=hevc_sao --benchmark

..
hevc_sao_band_64_10_c: 11517.3
hevc_sao_band_64_10_sse2: 2337.3

$ tests/checkasm/checkasm --test=vvc_sao --benchmark

...
vvc_sao_band_64_10_c: 9396.0
vvc_sao_band_64_10_avx2: 25947.0

This is why I believe that not populating the rest of the pointers is causing the issue. What are all of VVC's supported block sizes? I can't seem to find that in the ITU document...

nuomi2021 · 2023-04-11T15:00:40Z

why hevc uses sse2 and vvc uses avx2?
vvc's max CTU size is 128x128

zackerthescar · 2023-04-11T18:12:27Z

Seems to be a copy-paste error on my part (copy-pasted the wrong line). Will update the results when I get back home.

I'll try to make functions for all corresponding CTU sizes until 128x128, see if that helps.

zackerthescar mentioned this pull request Mar 27, 2023

Add AVX2 assembly code for SAO filter #45

Closed

zackerthescar closed this Mar 29, 2023

zackerthescar reopened this Mar 29, 2023

nuomi2021 reviewed Apr 2, 2023

View reviewed changes

libavcodec/x86/vvcdsp_init.c Outdated Show resolved Hide resolved

zackerthescar marked this pull request as ready for review April 2, 2023 05:23

nuomi2021 reviewed Apr 2, 2023

View reviewed changes

zackerthescar force-pushed the main branch from 3f52d7d to bfded13 Compare April 4, 2023 05:05

zackerthescar closed this May 22, 2023

zackerthescar force-pushed the main branch from f69ca7d to dc96762 Compare May 22, 2023 19:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vvc_sao: Move accelerated HEVC SAO assembly code to VVC #53

vvc_sao: Move accelerated HEVC SAO assembly code to VVC #53

zackerthescar commented Mar 27, 2023

nuomi2021 commented Mar 28, 2023

zackerthescar commented Mar 29, 2023 •

edited

Loading

zackerthescar commented Mar 29, 2023

zackerthescar commented Mar 30, 2023 •

edited

Loading

nuomi2021 commented Mar 30, 2023

nuomi2021 commented Mar 30, 2023

Jamaika1 commented Mar 30, 2023

zackerthescar commented Mar 30, 2023 •

edited

Loading

zackerthescar commented Mar 30, 2023

nuomi2021 commented Apr 1, 2023

zackerthescar commented Apr 2, 2023 •

edited

Loading

nuomi2021 Apr 2, 2023

zackerthescar Apr 4, 2023

nuomi2021 Apr 2, 2023

zackerthescar commented Apr 4, 2023 •

edited

Loading

nuomi2021 commented Apr 6, 2023

zackerthescar commented Apr 9, 2023

nuomi2021 commented Apr 9, 2023

zackerthescar commented Apr 11, 2023 •

edited

Loading

nuomi2021 commented Apr 11, 2023

zackerthescar commented Apr 11, 2023 •

edited

Loading

vvc_sao: Move accelerated HEVC SAO assembly code to VVC #53

vvc_sao: Move accelerated HEVC SAO assembly code to VVC #53

Conversation

zackerthescar commented Mar 27, 2023

nuomi2021 commented Mar 28, 2023

zackerthescar commented Mar 29, 2023 • edited Loading

zackerthescar commented Mar 29, 2023

zackerthescar commented Mar 30, 2023 • edited Loading

nuomi2021 commented Mar 30, 2023

nuomi2021 commented Mar 30, 2023

Jamaika1 commented Mar 30, 2023

zackerthescar commented Mar 30, 2023 • edited Loading

zackerthescar commented Mar 30, 2023

nuomi2021 commented Apr 1, 2023

zackerthescar commented Apr 2, 2023 • edited Loading

nuomi2021 Apr 2, 2023

Choose a reason for hiding this comment

zackerthescar Apr 4, 2023

Choose a reason for hiding this comment

nuomi2021 Apr 2, 2023

Choose a reason for hiding this comment

zackerthescar commented Apr 4, 2023 • edited Loading

nuomi2021 commented Apr 6, 2023

zackerthescar commented Apr 9, 2023

nuomi2021 commented Apr 9, 2023

zackerthescar commented Apr 11, 2023 • edited Loading

nuomi2021 commented Apr 11, 2023

zackerthescar commented Apr 11, 2023 • edited Loading

zackerthescar commented Mar 29, 2023 •

edited

Loading

zackerthescar commented Mar 30, 2023 •

edited

Loading

zackerthescar commented Mar 30, 2023 •

edited

Loading

zackerthescar commented Apr 2, 2023 •

edited

Loading

zackerthescar commented Apr 4, 2023 •

edited

Loading

zackerthescar commented Apr 11, 2023 •

edited

Loading

zackerthescar commented Apr 11, 2023 •

edited

Loading