AVX2 implementation of DMVR SAD for VVC #214

stone-d-chen · 2024-04-18T00:33:00Z

As per #212 (comment), this is based on ffmpeg/main and is meant to be pulled into ffvvc/up.

Adds AVX2 assembly for the SAD used in DMVR (decoder-side motion vector refinement). The main difference is that in VVC, SAD is only calculated on even rows of the PU to reduce complexity. Implements SAD via min/max/sub on 16-bit values, i.e. each absolute difference is computed as max(a, b) - min(a, b).
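For readers who want a scalar picture of what the assembly computes, here is a minimal C sketch of an even-rows-only SAD on 16-bit samples. It is not the code from this PR: the function name `dmvr_sad_even_rows`, the parameter order, and the convention that strides are given in samples (not bytes) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference (not the PR's code): sum of absolute
 * differences over rows 0, 2, 4, ... only. Strides are in samples. */
static int dmvr_sad_even_rows(const int16_t *src0, ptrdiff_t stride0,
                              const int16_t *src1, ptrdiff_t stride1,
                              int width, int height)
{
    int sad = 0;

    assert(width > 0 && height > 0);

    for (int y = 0; y < height; y += 2) {        /* skip odd rows */
        const int16_t *row0 = src0 + y * stride0;
        const int16_t *row1 = src1 + y * stride1;

        for (int x = 0; x < width; x++) {
            int a  = row0[x];
            int b  = row1[x];
            int hi = a > b ? a : b;              /* max(a, b) */
            int lo = a > b ? b : a;              /* min(a, b) */
            sad += hi - lo;                      /* |a - b| as max - min */
        }
    }
    return sad;
}
```

The max/min/sub form mirrors the approach described above: it avoids a dedicated absolute-value step, which maps naturally onto packed min, max and subtract instructions in a SIMD version.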

DMVR is restricted to PUs with width >= 8, height >= 8, and width * height >= 128 (i.e. 8x8 is not a valid size).
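The size restriction can be written as a small predicate. This is only a sketch of the three conditions stated above; the helper name `dmvr_size_ok` is hypothetical and does not come from the decoder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true when a PU of the given size satisfies the
 * three DMVR size conditions (width >= 8, height >= 8, area >= 128). */
static bool dmvr_size_ok(int width, int height)
{
    assert(width > 0 && height > 0);
    return width >= 8 && height >= 8 && width * height >= 128;
}
```

With this predicate, 8x8 is rejected (area 64), while 8x16 and 16x8 are the smallest accepted sizes.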


AVX2:
 - vvc_sad.check_vvc_sad_8_16bpc   [OK]
 - vvc_sad.check_vvc_sad_16_16bpc  [OK]
 - vvc_sad.check_vvc_sad_32_16bpc  [OK]
 - vvc_sad.check_vvc_sad_64_16bpc  [OK]
 - vvc_sad.check_vvc_sad_128_16bpc [OK]
checkasm: all 5 tests passed
vvc_sad_8_16bpc_c: 122.5
vvc_sad_8_16bpc_avx2: 12.5
vvc_sad_16_16bpc_c: 262.5
vvc_sad_16_16bpc_avx2: 22.5
vvc_sad_32_16bpc_c: 1012.5
vvc_sad_32_16bpc_avx2: 92.5
vvc_sad_64_16bpc_c: 3922.5
vvc_sad_64_16bpc_avx2: 372.5
vvc_sad_128_16bpc_c: 16682.5
vvc_sad_128_16bpc_avx2: 1892.5
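To compare the C and AVX2 timings above, the speedup is simply the ratio of the two numbers for the same block width. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: speedup of the SIMD version over the C version,
 * given two timings reported in the same unit. */
static double speedup(double c_time, double simd_time)
{
    assert(simd_time > 0.0);
    return c_time / simd_time;
}
```

For example, `speedup(122.5, 12.5)` gives 9.8 for the width-8 case, and `speedup(16682.5, 1892.5)` gives roughly 8.8 for the width-128 case.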

| File | Before | After |
|---|---|---|
| BQTerrace_1920x1080_60_10_420_22_RA.vvc | 80.3 | 81.3 |
| Chimera_8bit_1080P_1000_frames.vvc | 157.3 | 165.0 |
| NovosobornayaSquare_1920x1080.bin | 160.0 | 164.7 |
| RitualDance_1920x1080_60_10_420_37_RA.266 | 146.7 | 150.0 |

Ran on AMD 7940HS


…threads memset tables in the main thread can become a bottleneck for the decoder. For example, if it takes 2% of the processing time for one core, the maximum achievable FPS will be 50. Moving the memset to worker threads fixes the issue.
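The arithmetic behind the 50 FPS figure can be sketched as follows. This assumes one particular reading of the sentence above, namely that the main thread spends 2% of one second (0.02 s) per frame on work that cannot be parallelized; the helper name `max_fps` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on frames per second when every frame
 * needs serial_seconds_per_frame of work that only the main thread does.
 * Assumption: worker threads make all other work arbitrarily fast. */
static double max_fps(double serial_seconds_per_frame)
{
    assert(serial_seconds_per_frame > 0.0);
    return 1.0 / serial_seconds_per_frame;
}
```

Under that assumption, `max_fps(0.02)` is 50: however many worker threads are added, the serial memset caps throughput, which is why moving it off the main thread helps.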


stone-d-chen · 2024-04-18T00:35:46Z

Messed up the rebase again

nuomi2021 and others added 14 commits February 8, 2024 14:48

add github workflow

3dcc7fb

avcodec/vvcdec: refact, combine bs tab with tu tab

ad989c0

avcodec/vvcdec: remove unnecessary perframe initializations

5ea7e55

deblock, sao, alf skip, imtf, ipm, cqt_depth, cb_pos_x, cb_pos_y, cb_height, cp_mv, tb_pos_x0, tb_pos_y0, tb_width, tb_height

avcodec/vvcdec: do not zero frame ctu table

c4b03f3

avcodec/vvcdec: refact out is_available from is_a0_available

19b85df

avcodec/vvcdec: do not zero frame mvf table

037c1e8

avcodec/vvcdec: do not zero frame cpm table

d98f6b9

avcodec/vvcdec: do not zero frame msf mmi table

8acad1f

avcodec/vvcdec: do not zero frame qp table

44fcfa0

For luma, qp can only change at the CU level, so the qp tab size is related to the CU. For chroma, considering the joint CbCr, the QP tab size is related to the TU.

dsp: itx, use 2d itx instead 1d

a36ef5a

avcodec/vvcdec: add asm code

53449ee

avcodec/vvcdec: add checkasm

e81b6d7

stone-d-chen mentioned this pull request Apr 18, 2024

AVX2 implementation of DMVR SAD for VVC #213

Closed

stone-d-chen closed this Apr 18, 2024
