AVX2 implementation of DMVR SAD for VVC #214

Closed
wants to merge 14 commits into from


stone-d-chen
Collaborator

@stone-d-chen stone-d-chen commented Apr 18, 2024

As per #212 (comment), this is based on ffmpeg/main and is to be pulled into ffvvc/up.

Adds AVX2 assembly for the SAD used in DMVR (decoder-side motion vector refinement). The main difference is that in VVC, SAD is only calculated on the even rows of the PU to reduce complexity. SAD is implemented via min/max/sub on 16-bit values.

DMVR is restricted to PUs with width >= 8, height >= 8, and width * height >= 128 (i.e., 8x8 is not a valid size).
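
For readers who want the idea without the assembly, here is a minimal scalar sketch of the two points above: the size restriction and the even-row SAD with the absolute difference taken as max minus min. The names `dmvr_size_ok` and `sad_even_rows`, the parameter order, and the single shared stride are assumptions made for illustration; they are not the functions or signatures added by this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the DMVR size restriction as stated in the description. */
static int dmvr_size_ok(int w, int h)
{
    return w >= 8 && h >= 8 && w * h >= 128;
}

/* Hypothetical scalar reference: SAD over even rows only (0, 2, 4, ...).
 * |p - q| is computed as max(p, q) - min(p, q), the scalar analogue of
 * the min/max/sub sequence on 16-bit values. `stride` is in samples and
 * is assumed to be the same for both blocks. */
static int sad_even_rows(const int16_t *a, const int16_t *b,
                         ptrdiff_t stride, int w, int h)
{
    int sum = 0;

    for (int y = 0; y < h; y += 2) {          /* skip odd rows */
        for (int x = 0; x < w; x++) {
            int16_t p  = a[y * stride + x];
            int16_t q  = b[y * stride + x];
            int16_t hi = p > q ? p : q;       /* max */
            int16_t lo = p > q ? q : p;       /* min */
            sum += hi - lo;                   /* sub, always >= 0 */
        }
    }
    return sum;
}
```

A SIMD version would do the same min/max/sub on a vector of 16-bit lanes per row and then reduce the lanes horizontally; the sketch only fixes the arithmetic that such a version has to match.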


```
AVX2:
 - vvc_sad.check_vvc_sad_8_16bpc   [OK]
 - vvc_sad.check_vvc_sad_16_16bpc  [OK]
 - vvc_sad.check_vvc_sad_32_16bpc  [OK]
 - vvc_sad.check_vvc_sad_64_16bpc  [OK]
 - vvc_sad.check_vvc_sad_128_16bpc [OK]
checkasm: all 5 tests passed
vvc_sad_8_16bpc_c: 122.5
vvc_sad_8_16bpc_avx2: 12.5
vvc_sad_16_16bpc_c: 262.5
vvc_sad_16_16bpc_avx2: 22.5
vvc_sad_32_16bpc_c: 1012.5
vvc_sad_32_16bpc_avx2: 92.5
vvc_sad_64_16bpc_c: 3922.5
vvc_sad_64_16bpc_avx2: 372.5
vvc_sad_128_16bpc_c: 16682.5
vvc_sad_128_16bpc_avx2: 1892.5
```

| Clip | Before | After |
| --- | --- | --- |
| BQTerrace_1920x1080_60_10_420_22_RA.vvc | 80.3 | 81.3 |
| Chimera_8bit_1080P_1000_frames.vvc | 157.3 | 165.0 |
| NovosobornayaSquare_1920x1080.bin | 160.0 | 164.7 |
| RitualDance_1920x1080_60_10_420_37_RA.266 | 146.7 | 150.0 |

Ran on AMD 7940HS

nuomi2021 and others added 14 commits February 8, 2024 14:48
deblock, sao, alf
skip, imtf, ipm, cqt_depth, cb_pos_x, cb_pos_y, cb_height, cp_mv,
tb_pos_x0, tb_pos_y0, tb_width, tb_height
For luma, qp can only change at the CU level, so the qp tab size is related to the CU.
For chroma, considering the joint CbCr, the QP tab size is related to the TU.
…threads

memset tables in the main thread can become a bottleneck for the decoder.
For example, if it takes 2% of the processing time for one core, the maximum achievable FPS will be 50.
Moving the memset to worker threads fixes the issue.
@stone-d-chen
Collaborator Author

Messed up the rebase again
