Source: Solve HLAE recording bottlenecking on one thread for potentially non-required format conversion #295

Closed
9 tasks done
dtugend opened this issue Mar 14, 2020 · 17 comments

@dtugend
Member

dtugend commented Mar 14, 2020

We are simply hitting the single-threaded CPU throughput limit here:

I am guessing the main culprit here is HLAE itself: after downloading the image from the GPU, it converts RGB to BGR on the CPU on the drawing thread, i.e. on one single thread:
https://github.com/advancedfx/advancedfx/blob/main/AfxHookSource/AfxStreams.cpp#L739

Edit: There's also unneeded conversion by CS:GO's ReadPixels itself (e.g. removing padding).
Edit 2: The bottleneck is probably RAM frequency and not the CPU.


Image by @Purple-CSGO:
20200314150145


  • remove format conversions not required for FFMPEG
  • prevent bottlenecking on the drawing thread, since we could at least use multiple threads for the required post-processing
  • output multiple files ("lanes") at once from multiple threads (merged from Optimized / Faster recording speeds? #235)
  • also increase FFMPEG pipe buffer to 1 MB per stream
  • make TwinStream parallel
  • Fix OpenExr (depth24 + depthF)
  • stop wasting a GPU frame when recording
  • remove unnecessary texture allocation and deallocation
  • remove unnecessary blitting (StretchRect)
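
To illustrate the first two items, here is a minimal sketch (not HLAE's actual code, which is C++) of the kind of post-processing described above: dropping row padding and swapping RGB to BGR, with the rows split into ranges that independent workers can process. The function names, the `pitch` parameter and the worker count are assumptions made for the example. In CPython the threads will not give a real speed-up because of the GIL, so this only shows how the work can be partitioned.

```python
from concurrent.futures import ThreadPoolExecutor


def rows_rgb_to_bgr(src, width, pitch, row_start, row_end):
    """Convert rows [row_start, row_end) from padded RGB to tightly packed BGR."""
    row_bytes = width * 3
    out = bytearray((row_end - row_start) * row_bytes)
    for i, y in enumerate(range(row_start, row_end)):
        row = src[y * pitch : y * pitch + row_bytes]  # drop the row padding
        dst = i * row_bytes
        out[dst + 0 : dst + row_bytes : 3] = row[2::3]  # B comes from source byte 2
        out[dst + 1 : dst + row_bytes : 3] = row[1::3]  # G stays in the middle
        out[dst + 2 : dst + row_bytes : 3] = row[0::3]  # R comes from source byte 0
    return bytes(out)


def convert_parallel(src, width, height, pitch, workers=4):
    """Split the image into row ranges and convert each range in its own task."""
    step = max(1, -(-height // workers))  # ceiling division, at least one row
    ranges = [(s, min(s + step, height)) for s in range(0, height, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: rows_rgb_to_bgr(src, width, pitch, *r), ranges)
        return b"".join(parts)


# 2x2 image, 3 bytes per pixel, rows padded to 8 bytes (hypothetical layout).
padded = bytes([1, 2, 3, 4, 5, 6, 0, 0,
                7, 8, 9, 10, 11, 12, 0, 0])
packed = convert_parallel(padded, width=2, height=2, pitch=8, workers=2)
```

Because each row range writes to its own output buffer, the workers never touch the same memory, which is what makes this step easy to move off the drawing thread.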

Cancelled:

  • make multi-settings multi-threaded(?) (too rare use case to justify the effort currently)
  • optimize Sampling system (can of worms on its own)
  • optimize GoldSrc too (can of worms on its own)

Related:

@dtugend dtugend added the enhancement New feature or request label Mar 14, 2020
@dtugend dtugend modified the milestones: upcoming, v3.2 Mar 17, 2020
@dtugend dtugend removed this from the v3.2 milestone Feb 13, 2021
@dtugend dtugend closed this as completed Feb 13, 2021
@advancedfx advancedfx locked and limited conversation to collaborators Feb 13, 2021
@advancedfx advancedfx unlocked this conversation Jul 5, 2021
@dtugend dtugend reopened this Jul 5, 2021
@dtugend dtugend closed this as not planned Sep 11, 2022
@dtugend dtugend reopened this Oct 9, 2022
@dtugend dtugend self-assigned this Dec 3, 2022
dtugend added a commit that referenced this issue Dec 24, 2022
@dtugend
Member Author

dtugend commented Jan 6, 2023

branch: https://github.com/advancedfx/advancedfx/tree/faster-recording-295
commit: d15e263
current compile: AfxHookSource_20220106T1317Z.zip

improvements already implemented:

  • remove format conversions not required for FFMPEG
  • prevent bottlenecking on the drawing thread, since we could at least use multiple threads for the required post-processing
  • output multiple files ("lanes") at once from multiple threads
  • also increase FFMPEG pipe buffer to 1 MB per stream

planned improvements (not implemented yet!):

  • Solve HLAE wasting a frame when recording even when it shouldn't
  • Multi-threaded sampling system support

breaking changes:

  • if you used vflip video filter in custom FFMPEG settings, you probably need to remove it!

Testing

If you only have time / motivation to test a subset of the tests, that's ok too.
Please report bugs that you find (even in combinations not covered by these tests)!

  • Each test should be done with and without MSAA and otherwise same settings.
  • Each test should be done with stable public version and the current compile above

A: Single stream performance

  • Add only a normal stream: mirv_streams add normal norm
  • mirv_streams settings edit afxDefault settings afxFfmpeg (if you want to use other presets that's ok, but please report you did so)
  • Record start + record end to make the FFMPEG binaries hot
  • measure start + record start, record end + measure end
  • report number of frames recorded (output FPS * output time recorded)

B: Multi stream performance

  • exec afx/updateWorkaround to add a default set of streams
  • mirv_streams settings edit afxDefault settings afxFfmpeg (if you want to use other presets that's ok, but please report you did so)
  • Record start + record end to make the FFMPEG binaries hot
  • measure start + record start, record end + measure end
  • report number of frames recorded (output FPS * output time recorded)

@Dechno1337
Member

Dechno1337 commented Jan 7, 2023

TL;DR: MSAA is preventing streams from being recorded.

1) My usual mm cfg w/ highest quality settings (8x MSAA) and huffyuv ffmpeg recording, basefx & depth stream test:

  • AFXERROR: Captured image transform failed for stream baseFx.
  • The basefx stream did not get written.
  • The depth stream was recorded and written to a working huffyuv .avi file.

2) Did not execute my mm cfg, tried A: Single stream performance w/ afxFfmpeg (8x MSAA)

  • AFXERROR: Captured image transform failed for stream norm.
  • Only the audio.wav was written.

Disabling MSAA seems to have solved it.

3) B: multi stream performance w/ afxFfmpeg (8x MSAA)

Same as before, MSAA prevented the streams that use it from being recorded, while depth worked.

I didn't test the speed yet as I wanted to focus on this MSAA issue first.

@dtugend
Member Author

dtugend commented Jan 7, 2023

Thanks to @DuKeM-CSGO and @Dechno1337 we fixed a crash upon recording caused by out-of-bounds reads on the transforming thread(s), as well as MSAA not working:
Commit: 7251ff9
Download: AfxHookSource_20230107T1201Z.zip

For desired testing see my previous post.

@Dechno1337
Member

Dechno1337 commented Jan 8, 2023

Compared the speed with 2 streams:

RecordCompare

(I will do a single stream comparison as well sometime) 😄 👇

@Dechno1337
Member

Dechno1337 commented Jan 8, 2023

Single stream performance (SSD):

Record Compare 1 stream

Single stream performance (HDD):

Record Compare 1 stream HDD

Most likely an HDD bottleneck, since the performance is basically the same.

@Purple-CSGO
Member

Purple-CSGO commented Jan 10, 2023

Test settings and results are listed below. The speed improvement is about 50%.

Hardware

13700K 5G+4.1G
32G Dual DDR4 3733MHz
RTX3080 12G 85% TDP Limit
NVME SSD (no bottleneck)

Config

// Basic Settings
sv_cheats                    1
mirv_campath enabled         1
cl_clock_correction          0
mirv_fix playerAnimState     1
mirv_fix blockObserverTarget 1
mirv_fov handlezoom enabled  1
mirv_streams record name    "D:\benchmark"

sv_disablefreezecam        1
sv_nomvp                   1
sv_nonemesis               1
sv_holiday_mode            0
fog_override               1
net_graph                  0
mat_postprocess_enable     0
mp_display_kill_assists    0
cl_showpos                 0
cl_show_observer_crosshair 0
cl_spec_follow_grenade_key 2
cl_updaterate              128
host_syncfps               1
hud_showtargetid           0
hud_drawhistory_time       0
engine_no_focus_sleep      0
mirv_cvar_unhide_all
demo_pause
fps_max                    0

spec_cameraman_disable_with_user_control 1;spec_cameraman_ui 0;spec_cameraman_xray 0;spec_cameraman_set_xray 0;

mirv_exec alias continue  "demo_resume";
mirv_exec alias rec       "HlaeRecord;demo_timescale 1;mirv_snd_timescale 1;host_timescale 0;fps_max 0;mirv_streams record start;echo {QUOTE}>>> HLAE recording started{QUOTE}";
mirv_exec alias rec_end   "host_framerate 0;host_timescale 1;mirv_streams record end;echo {QUOTE}>>> HLAE recording ended{QUOTE}";

// Stream
mirv_streams add baseFx raw;mirv_streams edit raw drawHud -1;mirv_streams edit raw record 1;

// Recording
mirv_streams settings add ffmpeg p422   "-c:v prores  -profile:v 2 -pix_fmt yuv422p10le {QUOTE}{AFX_STREAM_PATH}.mov{QUOTE}"
mirv_streams settings add ffmpeg p4444  "-c:v prores  -profile:v 4 {QUOTE}{AFX_STREAM_PATH}.mov{QUOTE}"
mirv_streams settings add ffmpeg p0pro  "-c:v libx264 -preset 0 -qp 0  -g 300 -keyint_min 300 -pix_fmt yuv422p10le {QUOTE}{AFX_STREAM_PATH}.mp4{QUOTE}"
mirv_streams settings add ffmpeg p1     "-c:v libx264 -preset 1 -crf 2 -qmax 20 -g 300 -keyint_min 300 -x264-params ref=3:me=hex:subme=3:merange=12:b-adapt=1:aq-mode=2:aq-strength=0.9:no-fast-pskip=1 {QUOTE}{AFX_STREAM_PATH}.mp4{QUOTE}"
mirv_streams settings add ffmpeg x265   "-c:v libx265 -preset 1 -crf 8 -g 300 -pix_fmt yuv422p10le {QUOTE}{AFX_STREAM_PATH}.mp4{QUOTE}"
mirv_streams settings add ffmpeg n0     "-c:v h264_nvenc -g 300 -tune lossless -pix_fmt yuv444p {QUOTE}{AFX_STREAM_PATH}.mp4{QUOTE}"
mirv_streams settings add ffmpeg n1     "-c:v h264_nvenc -g 300 -preset medium -tune hq -rc constqp -qp 12 -pix_fmt yuv444p {QUOTE}{AFX_STREAM_PATH}.mp4{QUOTE}"
mirv_streams settings add ffmpeg n2     "-c:v hevc_nvenc -g 300 -preset medium -tune hq -rc constqp -qp 14 -pix_fmt yuv444p {QUOTE}{AFX_STREAM_PATH}.mp4{QUOTE}"

mirv_exec alias tga   "mirv_streams settings edit afxDefault settings afxClassic;echo;echo {QUOTE}Current Record Setting: afxClassic{QUOTE};echo;";
mirv_exec alias p422  "mirv_streams settings edit afxDefault settings p422 ;echo;echo {QUOTE}Current Record Setting: Prores 422{QUOTE};echo;";
mirv_exec alias p4444 "mirv_streams settings edit afxDefault settings p4444;echo;echo {QUOTE}Current Record Setting: Prores 4444{QUOTE};echo;";
mirv_exec alias p0pro "mirv_streams settings edit afxDefault settings p0pro;echo;echo {QUOTE}Current Record Setting: p0pro{QUOTE};echo;";
mirv_exec alias p1    "mirv_streams settings edit afxDefault settings p1   ;echo;echo {QUOTE}Current Record Setting: p1   {QUOTE};echo;";
mirv_exec alias x265  "mirv_streams settings edit afxDefault settings x265 ;echo;echo {QUOTE}Current Record Setting: x265 {QUOTE};echo;";
mirv_exec alias n0    "mirv_streams settings edit afxDefault settings n0;echo;echo {QUOTE}Current Record Setting: n0 - h264_nvenc lossess{QUOTE};echo;";
mirv_exec alias n1    "mirv_streams settings edit afxDefault settings n1;echo;echo {QUOTE}Current Record Setting: n1 - h264_nvenc cqp 12 yuv444{QUOTE};echo;";
mirv_exec alias n2    "mirv_streams settings edit afxDefault settings n2;echo;echo {QUOTE}Current Record Setting: n2 - h265_nvenc cqp 14 yuv444{QUOTE};echo;";

// jump to tick
demo_gototick 163900;
alias go "demo_gototick 163900"

// Fixed Start/End Tick
mirv_cmd addAtTick 164320 rec;
mirv_cmd addAtTick 168160 rec_end; // 30s
// mirv_cmd addAtTick 172000 rec_end; // 60s

Setting

  • Match: m0NESY 1v4 Inferno AWP Round 9 · MatchPage

  • tick: 164320 ~ 168160

  • duration: 30 seconds

  • Resolution: 1920x1080

  • Framerate: 300
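
As a quick sanity check of these settings, the clip length and the expected number of output frames (output FPS * output time, as asked for in the test instructions) can be worked out as below. This is only a sketch: the tick rate of 128 is an assumption that happens to match the stated 30 seconds, and the function names are made up for the example.

```python
def clip_seconds(start_tick, end_tick, tick_rate):
    """Length of the recorded demo segment in seconds."""
    return (end_tick - start_tick) / tick_rate


def frames_recorded(output_fps, seconds):
    """Number of frames written: output FPS * output time recorded."""
    return round(output_fps * seconds)


# Tick range and frame rate from the settings above; tick_rate=128 is assumed.
seconds = clip_seconds(164320, 168160, tick_rate=128)
frames = frames_recorded(300, seconds)
print(seconds, frames)  # 30.0 9000
```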

Result

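| Preset | Cost Time | Cost Time (Beta) | Improved | File Size Per Minute |
|--------|-----------|------------------|----------|----------------------|
| tga    | 149s      | 95s              | 57%      | 104 GB               |
| p422   | 151s      | 103s             | 46%      | 15 GB                |
| p4444  | 164s      | 117s             | 40%      | 31 GB                |
| p0pro  | 156s      | 113s             | 38%      | 32 GB                |
| p1     | 170s      | 128s             | 33%      | 6.5 GB               |
| x265   | 229s      | 212s             | 8%       | 2.3 GB               |
| n0     | 126s      | 78s              | 61%      | 22.2 GB              |
| n1     | 124s      | 81s              | 53%      | 3.8 GB               |
| n2     | -         | 79s              | -        | 2.1 GB               |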
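
The "Improved" column appears to be the speed-up of the beta over the previous build, i.e. old time / new time - 1. That formula is inferred from the numbers and not stated in the post; recomputing it for each row lands within about one percentage point of the reported values. A small sketch, with a hypothetical function name:

```python
def improvement_percent(old_seconds, new_seconds):
    """Speed-up of the new build, as a percentage of the new build's time."""
    return (old_seconds / new_seconds - 1.0) * 100.0


# (old time, beta time, reported improvement) per preset, from the result table.
results = {
    "tga": (149, 95, 57),
    "p422": (151, 103, 46),
    "p4444": (164, 117, 40),
    "p0pro": (156, 113, 38),
    "p1": (170, 128, 33),
    "x265": (229, 212, 8),
    "n0": (126, 78, 61),
    "n1": (124, 81, 53),
}

for preset, (old, new, reported) in results.items():
    print(f"{preset}: computed {improvement_percent(old, new):.1f}%, reported {reported}%")
```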

@eirisocherry

Screenshot_15

new dll doesn't work with alpha layers

@DuKeM-CSGO

Using mostly the same settings as #295 (comment), except for the following:

Hardware

i7-6700HQ 2.60GHz
16G SK Hynix DDR4 2133MHz
GTX960M
HDD (heavy bottleneck)

Config

Same.

Settings

Same except resolution is 1920 * 850.

Result

image

@Dechno1337
Member

new dll doesn't work with alpha layers

Can confirm. I tried huffyuv / afxFfmpegHuffyuv, but I guess it's not related to the FFMPEG presets/settings.

dtugend added a commit that referenced this issue Jan 14, 2023
Regerding #295

- Fixed matte stream with new recording.
- Made matte stream merging parallel

Todo:
- make TwinStream parallel
- Test OpenExr / depth24
- make Sampling parallel
- stop wasting a frame
@dtugend
Member Author

dtugend commented Jan 14, 2023

Dec requested a new dll:
AfxHookSource_20230114T1011Z.zip

This one has the matte stream fixed and optimized (parallel CPU threads) and has all features of the "upcoming" milestone as of today.

Still left to do for me:

  • make TwinStream parallel
  • Test OpenExr / depth24
  • make Sampling parallel
  • stop wasting a frame

@dtugend
Member Author

dtugend commented Jan 16, 2023

515a499
AfxHookSource_20230116T0516Z.zip

Fixed mirv_streams capture types depthF, depthFZIP, depth24, depth24ZIP

(I don't recommend retesting.)

@dtugend dtugend added this to the upcoming milestone Jan 16, 2023
dtugend added a commit that referenced this issue Jan 18, 2023
@dtugend
Member Author

dtugend commented Jan 18, 2023

Stop wasting a GPU frame when recording
3b46279
AfxHookSource_20220118T2005Z.zip

This is the last bigger change planned before pre-release (TwinStream optimization and Sampling System optimization I consider smaller things).

This one has heavy stream recording logic changes and tries to be as optimal as possible in terms of not wasting GPU frames.

You can use __mirv_show_renderview_count 1 to get debug information about how many render passes are required for your current setup.
The last pass (highest m_DoRenderViewCount: number) should always have (m_ForceCacheFullSceneState: 0); all others should have (m_ForceCacheFullSceneState: 1). Otherwise the system is bugged.
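
The rule above can be written down as a small check. This sketch assumes the debug output has already been parsed by hand into (m_DoRenderViewCount, m_ForceCacheFullSceneState) pairs; the exact text format printed by __mirv_show_renderview_count is not shown here, so the input shape and the function name are hypothetical.

```python
def passes_look_consistent(passes):
    """passes: list of (do_render_view_count, force_cache_full_scene_state) pairs.

    Returns True if only the pass with the highest count has the flag set to 0
    and every other pass has it set to 1.
    """
    if not passes:
        return False
    last = max(count for count, _ in passes)
    return all(flag == (0 if count == last else 1) for count, flag in passes)


# Hypothetical values for a setup that needs three render passes.
print(passes_look_consistent([(1, 1), (2, 1), (3, 0)]))  # True
print(passes_look_consistent([(1, 1), (2, 0), (3, 0)]))  # False
```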

The following properties are known to have a relevant influence upon the test results and optimization:

  • mirv_streams mainStream, which determines how the mainStream is selected (defaults to the first active one, but other options are possible)
  • the number of streams recorded (or not) and previewed (or not; you can preview multiple streams by giving a slot number after streamName in the preview command, but this cannot be optimized much)
  • whether a stream is a truly "normal" stream or not (a normal stream with default settings, no baseFx)
  • not sure if multi streams (e.g. matte stream with alpha) still work as expected; they should, though

As a rule of thumb, the optimization works best if you also put the mainStream (default = first stream, see above) into preview, or if it is a truly normal stream; in that case you get the lowest possible number of GPU frame render passes.

(On fast GPUs the optimization might not be very noticeable; it will only be noticeable if you are GPU-bottlenecked, like my setup.)

@dtugend
Member Author

dtugend commented Jan 19, 2023

The last version has a bug and loses MSAA under certain conditions due to a shortcut I took 🤦
Will try to fix on the weekend and implement my own render target push / pop instead of using the CS:GO one, I tried that earlier, but it didn't work too well.

dtugend added a commit that referenced this issue Jan 19, 2023
@dtugend
Member Author

dtugend commented Jan 19, 2023

@dtugend
Member Author

dtugend commented Jan 21, 2023

(Not doing 1. of #194 as part of this feature request, because it's a can of worms on its own.)

@dtugend dtugend changed the title from "Solve HLAE recording bottlenecking on one thread for potentially non-required format conversion" to "Source: Solve HLAE recording bottlenecking on one thread for potentially non-required format conversion" Jan 21, 2023
@dtugend dtugend removed their assignment Jan 21, 2023
@dtugend dtugend closed this as completed Jan 21, 2023
dtugend added a commit that referenced this issue Jan 21, 2023
commit 4345557
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Jan 21 07:33:32 2023 +0100

    Optimize TwinStreams

commit 3b92d5c
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Thu Jan 19 22:25:07 2023 +0100

    Fix missing MSAA in certain conditions

    - Addresses #295

commit 3b46279
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Wed Jan 18 21:03:00 2023 +0100

    Stop wasting a GPU frame when recording

    - Addresses #295

commit 515a499
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Mon Jan 16 06:14:30 2023 +0100

    Fix mirv_streams capture types depthF(ZIP), depth24(ZIP)

commit e86c0e1
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Jan 14 11:18:14 2023 +0100

    Fix + optimize + todo

    Regerding #295

    - Fixed matte stream with new recording.
    - Made matte stream merging parallel

    Todo:
    - make TwinStream parallel
    - Test OpenExr / depth24
    - make Sampling parallel
    - stop wasting a frame

commit 3ebad81
Merge: 7251ff9 29679d6
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Jan 14 07:29:07 2023 +0100

    Merge branch 'main' into faster-recording-295

commit 7251ff9
Merge: 7d52c88 cb11366
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Jan 7 12:58:36 2023 +0100

    Merge branch 'main' into faster-recording-295

commit 7d52c88
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Jan 7 12:54:05 2023 +0100

    Fix MSAA recording not working

    Thanks for testing @Dechno1337

commit 5dea86c
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Jan 7 12:53:01 2023 +0100

    Fix out of bounds memory access in transform threads

    Thanks for testing @DuKeM-CSGO

commit d15e263
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Fri Jan 6 13:59:09 2023 +0100

    Backup

commit 658ed31
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Tue Jan 3 00:15:24 2023 +0100

    Backup (doesn't work yet ;)

commit 41c342a
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Fri Dec 30 14:07:09 2022 +0100

    Backup (work in progress)

    Will not compile or work.

commit bafdd2b
Author: Dominik Tugend <dominik@matrixstorm.com>
Date:   Sat Dec 24 16:22:21 2022 +0100

    WIP (will not compile), see #295
@dtugend dtugend reopened this Jan 22, 2023
@dtugend
Member Author

dtugend commented Jan 22, 2023

d74215a
AfxHookSource_20230122T2100Z.zip

This should have most problems fixed.
I had random freezes upon stopping recording during programming, not sure if reproducible with the fixed version and if it's a duplicate of #806 or not.

I need to test:

  • Reshade_advancedfx
  • -afxInterop (unity)
  • -afxInteropLight (cefhud)

@Dechno1337
Copy link
Member

Dechno1337 commented Feb 12, 2023

Final graph with the official HLAE 2.144.2 version.

HLAE Recording 2023

Testing methodology:

In each CS:GO session I recorded the same scene (set up with mirv_cmd) three times. The first pass was without any caching. The second had the advantage of cached FFmpeg files/binaries (in my case this didn't give any meaningful improvement). For the third pass I previewed the first (mainStream) stream, which with the newest HLAE version improves the recording speed even further, usually by 5-15%.

I used a stopwatch to count the time it would take to record the scene, but I would round it to the nearest half second, and then average all three passes. For the final graph numbers, I averaged the results for each stream setup.

After each session I would close CS:GO, check the recorded footage to see if it was intact, and then delete it. After I had tested all four presets for a specific number of streams I would restart my PC. In hindsight I should've restarted my PC after each game session, but ultimately I don't think it would have mattered.

I forgot to mention that hardware-accelerated GPU scheduling was also turned on for both versions. When I tested earlier builds it was disabled and I noticed that it helped with the recording speeds in most cases when enabled.

Raw data:

[4 streams; updateWorkaround] [de_mirage] // 2023 HLAE version

afxFfmpeg: 123 sec // 122 sec // 115,5 sec (preview)
afxFfmpegRaw: 81 sec // 80 sec // 76,5 sec (preview)
huffyuv: 80,5 sec // 78,5 sec // 76 sec (preview)
tga: 69,5 sec // 69,5 sec // 64 sec (preview)

[4 streams; updateWorkaround] [de_mirage] // 2022 HLAE version

afxFfmpeg: 211 sec // 211 sec // 212,5 sec (preview)
afxFfmpegRaw: 169,5 sec // 171 sec // 170 sec (preview)
huffyuv: 177 sec // 188 sec // 175 sec (preview)
tga: 185,5 sec // 187,5 sec // 183,5 sec (preview)

[2 streams; basefx+depth] [de_mirage] // 2023 HLAE version

afxFfmpeg: 68,5 sec // 70 sec // 65,5 sec (preview)
afxFfmpegRaw: 41,5 sec // 42 sec // 35,5 sec (preview)
huffyuv: 43 sec // 43 sec // 36 sec (preview)
tga: 35 sec // 36 sec // 31 sec (preview)

[2 streams; basefx+depth] [de_mirage] // 2022 HLAE version

afxFfmpeg: 112,5 sec // 113 sec // 114 sec (preview)
afxFfmpegRaw: 90 sec // 90,5 sec // 90 sec (preview)
huffyuv: 90,5 sec // 96,5 sec // 92 sec (preview)
tga: 95 sec // 97,5 sec // 95,5 sec (preview)

[1 stream; basefx] [de_mirage] // 2023 HLAE version

afxFfmpeg: 46 sec // 46 sec // 43,5 sec (preview)
afxFfmpegRaw: 22 sec // 21,5 sec // 19 sec (preview)
huffyuv: 23 sec // 23 sec // 20 sec (preview)
tga: 19,5 sec // 19,5 sec // 17 sec (preview)

[1 stream; basefx] [de_mirage] // 2022 HLAE version

afxFfmpeg: 68 sec // 65 sec // 66 sec (preview)
afxFfmpegRaw: 50,5 sec // 51 sec // 52 sec (preview)
huffyuv: 51 sec // 51 sec // 52,5 sec (preview)
tga: 51 sec // 53,5 sec // 54 sec (preview)
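
For anyone who wants to recompute averages from the raw data, the following sketch parses one line in the format used above (three passes separated by `//`, decimal commas) and averages the three passes as described in the methodology. The function names are made up for the example, and this is not necessarily how the numbers in the graph were produced.

```python
import re
from statistics import mean


def parse_passes(line):
    """Return the pass times in seconds from a raw-data line with decimal commas."""
    _, _, values = line.partition(":")
    found = re.findall(r"(\d+(?:,\d+)?) sec", values)
    return [float(value.replace(",", ".")) for value in found]


def average_seconds(line):
    """Average of all passes on one raw-data line."""
    return mean(parse_passes(line))


# 4 streams, afxFfmpeg preset: 2023 version vs. 2022 version.
new = average_seconds("afxFfmpeg: 123 sec // 122 sec // 115,5 sec (preview)")
old = average_seconds("afxFfmpeg: 211 sec // 211 sec // 212,5 sec (preview)")
print(f"2023: {new:.1f} s, 2022: {old:.1f} s, ratio: {old / new:.2f}x")
```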
