Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vvc_sao: Move accelerated HEVC SAO assembly code to VVC #53

Closed
wants to merge 0 commits into from

Conversation

zackerthescar
Copy link
Contributor

Successfully moved accelerated HEVC SAO assembly code to VVC. Both assembly files are essentially untouched from the HEVC SAO code, other than changing function names. This hasn't been tested yet, but compiles and if issue #45 is correct, should run just fine.

Function pointers are initialized to the exact index as they are in x86/hevcdsp_init.c. Hopefully hevcdsp.c fills in the other function pointers or this will segfault.

@nuomi2021
Copy link
Member

Hi @zackerthescar ,
Thank you for the patch.
Could you attach the 3 times average fps before and after the patch? Just like #42
It's wonderful if you can also provide checksam results for sao f94f48d

thank you

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Mar 29, 2023

Improvements of around 6%-12%, although it seems like there are some issues

Command is ./ffmpeg -y -i [clip name] -c:v mpeg4 -q:v 2 -f mp4 [output]
Results obtained on Intel Core i7-8700
Results are an average of 3 runs

Clip C AVX2 delta
BasketballDrive_1920x1080_50_10_420_37_RA.266 62.67 67 6.09%
RitualDance_1920x1080_60_10_420_32_LD.266 67.67 76 12.3%
RitualDance_1920x1080_60_10_420_37_RA.266 70 74.33 6.19%
Tango2_3840x2160_60_10_420_27_LD.266 17.33 19 9.63%

See attached images below, shows some issues. These are not present when using the C SAO filter.
Screenshot from bb_ra mp4
Screenshot from ra_ld mp4
Screenshot from tg_ld mp4

I suspect the issue might be because I only populate [0-4] of the function pointer array. I'm not exactly sure what to populate pointers [5-8] with. Any hints or suggestions is welcome. See below
I will work on checkasm for sao now. Hopefully that reveals some issues.

@zackerthescar
Copy link
Contributor Author

Closed by accident, sorry.

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Mar 30, 2023

Added vvc_sao checks for checkasm. Revealed some issues, to say the least.

In the avx2 case, SAO band filter passes completely, but none of the edge filters pass.

./checkasm
...
AVX2:
...
 - vvc_sao.sao_band                [OK]
   vvc_sao_edge_32_8_avx2 (vvc_sao.c:136)
   vvc_sao_edge_48_8_avx2 (vvc_sao.c:136)
   vvc_sao_edge_64_8_avx2 (vvc_sao.c:136)
   vvc_sao_edge_8_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_16_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_32_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_48_10_avx2 (vvc_sao.c:136)
   vvc_sao_edge_64_10_avx2 (vvc_sao.c:136)
 - vvc_sao.sao_edge                [FAILED]
 ...

I'm pretty sure that's why I'm getting the artifacts on the output file.
EDIT: Just confirmed it by removing the AVX2 accelerated edge filter from init. Output is clean with AVX2 band filter and C edge filter.

@nuomi2021
Copy link
Member

Improvements of around 6%-12%, although it seems like there are some issues

Command is ./ffmpeg -y -i [clip name] -c:v mpeg4 -q:v 2 -f mp4 [output] Results obtained on Intel Core i7-8700 Results are an average of 3 runs

this will do the mprg4 encode. Please use
./ffmpeg_g -i Tango2_3840x2160_60_10_420_27_LD.266 -vsync 0 -f rawvideo /dev/null -y
to test

@nuomi2021
Copy link
Member

I'm pretty sure that's why I'm getting the artifacts on the output file. EDIT: Just confirmed it by removing the AVX2 accelerated edge filter from init. Output is clean with AVX2 band filter and C edge filter.

could you help debug why it's mismatched?

@Jamaika1
Copy link

vvcdsp_init.c:166:32: warning: assignment to 'void (*)(uint8_t *, uint8_t *, ptrdiff_t,  int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, unsigned char *, long long int,  short int *, int,  int,  int)'} from incompatible pointer type 'void (*)(uint8_t *, const uint8_t *, ptrdiff_t,  const int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, const unsigned char *, long long int,  const short int *, int,  int,  int)'} [-Wincompatible-pointer-types]
  166 |     c->sao.edge_filter[1]      = ff_vvc_sao_edge_filter_16_##bitd##_##opt;  \
      |                                ^
vvcdsp_init.c:208:17: note: in expansion of macro 'SAO_EDGE_INIT'
  208 |                 SAO_EDGE_INIT(10, avx2);
      |                 ^~~~~~~~~~~~~

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Mar 30, 2023

I'm pretty sure that's why I'm getting the artifacts on the output file. EDIT: Just confirmed it by removing the AVX2 accelerated edge filter from init. Output is clean with AVX2 band filter and C edge filter.

could you help debug why it's mismatched?

Yep, working on this right now. Hoping to debug this soon. I will post the results w/ no encoder running soon.

@zackerthescar
Copy link
Contributor Author

vvcdsp_init.c:166:32: warning: assignment to 'void (*)(uint8_t *, uint8_t *, ptrdiff_t,  int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, unsigned char *, long long int,  short int *, int,  int,  int)'} from incompatible pointer type 'void (*)(uint8_t *, const uint8_t *, ptrdiff_t,  const int16_t *, int,  int,  int)' {aka 'void (*)(unsigned char *, const unsigned char *, long long int,  const short int *, int,  int,  int)'} [-Wincompatible-pointer-types]
  166 |     c->sao.edge_filter[1]      = ff_vvc_sao_edge_filter_16_##bitd##_##opt;  \
      |                                ^
vvcdsp_init.c:208:17: note: in expansion of macro 'SAO_EDGE_INIT'
  208 |                 SAO_EDGE_INIT(10, avx2);
      |                 ^~~~~~~~~~~~~

This is due to the fact that HEVC's SAO filter declares the source pointers const. I already have a build that doesn't have these pointers const'ed, and I will push it as soon as possible.

@nuomi2021
Copy link
Member

@zackerthescar , we are close to the GSOC deadline.
You can focus on small functions, like band_filter 4x4, and make it pass the checkasm test. then we can check the code in
You can write down the improvement plan in your proposal, and finish SAO and Deblock for the GSOC project.

thank you.

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Apr 2, 2023

I have added commit 4104b68 which removes the code to initialize the AVX2-accelerated edge functions. This allows checkasm to pass. I will post benchmarks in the next couple of hours, then work on the smaller band_filter 4x4 function and making sure it passes checkasm as you suggested.

I have written up a GSoC proposal and will submit it soon, within the next few hours.

Update:
Benchmark with only the SAO Band filter initialized to AVX2.
Command is ./ffmpeg -y -i [clip name] -vsync 0 -f rawvideo /dev/null
Results obtained on Intel Core i7-8700
Results are an average of 3 runs

Clip C AVX2 delta
BasketballDrive_1920x1080_50_10_420_37_RA.266 75 75 0%
RitualDance_1920x1080_60_10_420_32_LD.266 87.33 87.67 1.00%
RitualDance_1920x1080_60_10_420_37_RA.266 81.33 82.33 1.01%
Tango2_3840x2160_60_10_420_27_LD.266 21 21 0%

Thanks.

@zackerthescar zackerthescar marked this pull request as ready for review April 2, 2023 05:23
@@ -0,0 +1,5 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not need this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was something automatically added by vscode. Removed in my rebase.

@@ -120,19 +120,17 @@ static void alf_filter_chroma_8_avx2(uint8_t *dst, ptrdiff_t dst_stride, const u
c->alf.filter[CHROMA] = alf_filter_chroma_##depth##_avx2; \
} while (0)

#define SAO_BAND_FILTER_FUNCS(bitd, opt) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need to change this. how about we rebase and combine this with other commit

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Apr 4, 2023

Sorry about the delay, I've been busy with some schoolwork.

I've rebased and force-pushed only the needed changes based on your comments. checkasm should now pass, and

How about we make _src and sao_offset_val as const.

has been done.

Thanks.

@nuomi2021
Copy link
Member

Hi @zackerthescar ,
thank you for the update.
On my i7 4770k, vvc_sao_band_64_10_avx2 3 times slower than vvc_sao_band_64_10_c, are you happen to know why?
thank you

AVX2:

  • vvc_sao.sao_band [OK]
    checkasm: all 10 tests passed
    vvc_sao_band_8_8_c: 825.2
    vvc_sao_band_8_8_avx2: 65.7
    vvc_sao_band_8_10_c: 878.2
    vvc_sao_band_8_10_avx2: 56.0
    vvc_sao_band_16_8_c: 3047.2
    vvc_sao_band_16_8_avx2: 182.2
    vvc_sao_band_16_10_c: 3491.2
    vvc_sao_band_16_10_avx2: 98.5
    vvc_sao_band_32_8_c: 23144.7
    vvc_sao_band_32_8_avx2: 345.0
    vvc_sao_band_32_10_c: 13307.5
    vvc_sao_band_32_10_avx2: 350.2
    vvc_sao_band_48_8_c: 26881.7
    vvc_sao_band_48_8_avx2: 1040.5
    vvc_sao_band_48_10_c: 29488.7
    vvc_sao_band_48_10_avx2: 765.7
    vvc_sao_band_64_8_c: 47141.5
    vvc_sao_band_64_8_avx2: 1302.0
    vvc_sao_band_64_10_c: 53341.0
    vvc_sao_band_64_10_avx2: 154115.2

@zackerthescar
Copy link
Contributor Author

Hi @zackerthescar , thank you for the update. On my i7 4770k, vvc_sao_band_64_10_avx2 3 times slower than vvc_sao_band_64_10_c, are you happen to know why? thank you

AVX2:

* vvc_sao.sao_band [OK]
  checkasm: all 10 tests passed
  vvc_sao_band_8_8_c: 825.2
  vvc_sao_band_8_8_avx2: 65.7
  vvc_sao_band_8_10_c: 878.2
  vvc_sao_band_8_10_avx2: 56.0
  vvc_sao_band_16_8_c: 3047.2
  vvc_sao_band_16_8_avx2: 182.2
  vvc_sao_band_16_10_c: 3491.2
  vvc_sao_band_16_10_avx2: 98.5
  vvc_sao_band_32_8_c: 23144.7
  vvc_sao_band_32_8_avx2: 345.0
  vvc_sao_band_32_10_c: 13307.5
  vvc_sao_band_32_10_avx2: 350.2
  vvc_sao_band_48_8_c: 26881.7
  vvc_sao_band_48_8_avx2: 1040.5
  vvc_sao_band_48_10_c: 29488.7
  vvc_sao_band_48_10_avx2: 765.7
  vvc_sao_band_64_8_c: 47141.5
  vvc_sao_band_64_8_avx2: 1302.0
  vvc_sao_band_64_10_c: 53341.0
  **vvc_sao_band_64_10_avx2: 154115.2**

Hi, sorry for the late reply. I was able to confirm that vvc_sao_band_64_10_avx2 is indeed slower on my machine as well. I'm quite confused at this point, since the code is identical to the HEVC version and it doesn't exhibit this behaviour there.

I'm guessing this is due to the fact that we have 9 pointers in VVC (vvcdsp.h) versus 5 in HEVC (hevcdsp.h)

Perhaps we need more widths beyond 8, 16, 32, 48, 64 to fill all 9 func pointers?

@nuomi2021
Copy link
Member

Hi @zackerthescar ,
Thank you for the reply.
hevc support 64x64 block too. did you check hevc's 10 bits 64x64 block status?

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Apr 11, 2023

Hi @zackerthescar , Thank you for the reply. hevc support 64x64 block too. did you check hevc's 10 bits 64x64 block status?

Sorry for the late reply. As far as I know, the same code for 64x64 block in HEVC is faster in AVX2 than C.

$ tests/checkasm/checkasm --test=hevc_sao --benchmark

..
hevc_sao_band_64_10_c: 11517.3
hevc_sao_band_64_10_sse2: 2337.3
$ tests/checkasm/checkasm --test=vvc_sao --benchmark

...
vvc_sao_band_64_10_c: 9396.0
vvc_sao_band_64_10_avx2: 25947.0

This is why I believe that not populating the rest of the pointers is causing the issue. What are all of VVC's supported block sizes? I can't seem to find that in the ITU document...

@nuomi2021
Copy link
Member

why hevc uses sse2 and vvc uses avx2?
vvc's max CTU size is 128x128

@zackerthescar
Copy link
Contributor Author

zackerthescar commented Apr 11, 2023

Seems to be a copy-paste error on my part (copy-pasted the wrong line). Will update the results when I get back home.

I'll try to make functions for all corresponding CTU sizes until 128x128, see if that helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants