New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vertex streaming #3504

Merged
merged 2 commits into from Mar 17, 2018

Conversation

Projects
None yet
10 participants
@degasus
Contributor

degasus commented Mar 11, 2018

Use ARB_buffer_storage with persistent mappings to stream the vertex data to the GPU.

Update:

I've picked the helper classes from #3499. This code uses GL_sync to block if the buffer is full instead of orphaning. The performance however should be close to identical. Too big draw call, which won't fit in the buffer, will be splitted in small chucks (tested with a tiny buffer).


This change is Reviewable

@degasus degasus force-pushed the degasus:vertex_streaming branch 2 times, most recently from 0ba7f7f to 2ad1485 Mar 11, 2018

@@ -50,6 +50,18 @@ RasterizerOpenGL::RasterizerOpenGL() : shader_dirty(true) {
state.draw.uniform_buffer = uniform_buffer.handle;
state.Apply();
// Allocate vertex buffer
if (GLAD_GL_ARB_buffer_storage) {

This comment has been minimized.

@lioncash

lioncash Mar 11, 2018

Member

Can't this and the other added code below be moved to its own separate functions?

This comment has been minimized.

@degasus

degasus Mar 15, 2018

Contributor

Yes, this functions are picked from #3499

@tywald

This comment has been minimized.

tywald commented Mar 14, 2018

Profiled Monster Hunter Generations.

i7-3770K @ 4.3GHz
GTX 680
vtxstream_gtx680
nightly_gtx680

i7-3770K @ 4.3GHz
Intel HD 4000
vtxstream_intel_hd_4000
nightly_intel_hd_4000

Will add more games in the next few days. Let me know if I should change anything.
Life complications, will need to postpone about adding more games.

@Subv

Friendly reminder that this supports a max size of 1024 * 1024. There should probably be an assert about that

@degasus degasus force-pushed the degasus:vertex_streaming branch from 2ad1485 to 6b7d900 Mar 15, 2018

@cluezbot

This comment has been minimized.

cluezbot commented Mar 15, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-03-15T06:16:11Z: 0ad38cf...degasus:6b7d9000d9e9f1d2b51187c796e52ec4ddf90665

@degasus degasus force-pushed the degasus:vertex_streaming branch from 6b7d900 to c2d6037 Mar 15, 2018

@cluezbot

This comment has been minimized.

cluezbot commented Mar 15, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-03-15T06:49:13Z: 0ad38cf...degasus:c2d60373da5baa5e8a64b0dd08f66a083934fd0d

@degasus

This comment has been minimized.

Contributor

degasus commented Mar 15, 2018

@tywald Thanks for testing this PR. I've picked some common code from #3499, but this should not affect the performance (feel free to validate it through). But those -4ms on intel and -8ms on nvidia seems very worth it. Do you have the stats how many vertices are rendered in this frame?

@Subv The new code splits too big draw calls, so there is no need for an assert any more.

}
OGLStreamBuffer::~OGLStreamBuffer() {
Release();

This comment has been minimized.

@wwylele

wwylele Mar 15, 2018

Member

This may not do what you want. Calling virtual function in constructor/destructor is dangerous, as it always calls the function that belongs to the constructing/destructing type (OGLStreamBuufer::Release(){/*empty*/} in this case), not the one overridden by derived classes.

Instead, you should just override the destructor in derived classes.

This comment has been minimized.

@degasus

degasus Mar 15, 2018

Contributor

heh, indeed. I shouldn't have just copied those classes. Where shall I amend this fixes, here or in #3499 ?

This comment has been minimized.

@wwylele

wwylele Mar 15, 2018

Member

I'd say amend this here and let @jroweboy rebase #3499

}
OGLStreamBuffer::~OGLStreamBuffer() {
Release();

This comment has been minimized.

@wwylele

wwylele Mar 15, 2018

Member

I'd say amend this here and let @jroweboy rebase #3499

auto map = vertex_buffer->Map(vertex_size, 1);
memcpy(map.first, vertex_batch.data() + base_vertex, vertex_size);
vertex_buffer->Unmap();
glDrawArrays(GL_TRIANGLES, map.second / sizeof(HardwareVertex), (GLsizei)vertices);

This comment has been minimized.

@wwylele

wwylele Mar 15, 2018

Member

Question: if a glDrawArrays has an imcomplete triangle at the end, would the next glDrawArrays calls continue this triangle or restart the primitive? sorry didn't see the /3 *3 above

@Subv

Subv approved these changes Mar 15, 2018

LGTM

@degasus The "Shader" row's "Count" column in Microprofile is the number of vertices that were processed in the CPU vertex shader, the values are around 69700 for the scenes @tywald posted

@degasus degasus force-pushed the degasus:vertex_streaming branch from c2d6037 to be649d2 Mar 15, 2018

@cluezbot

This comment has been minimized.

cluezbot commented Mar 15, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-03-15T23:02:39Z: a0f7091...degasus:be649d206b86ff8813c2a7dcb448db231b616832

public:
OGLSync() = default;
OGLSync(OGLSync&& o) {
std::swap(handle, o.handle);

This comment has been minimized.

@wwylele

wwylele Mar 16, 2018

Member

Sorry I missed this.Please use std::exchange the same way in #3506. Also in the assignment operator

This comment has been minimized.

@degasus

degasus Mar 16, 2018

Contributor

done.

@degasus degasus force-pushed the degasus:vertex_streaming branch from be649d2 to 7312499 Mar 16, 2018

@cluezbot

This comment has been minimized.

cluezbot commented Mar 16, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-03-16T23:36:04Z: 935bcdb...degasus:731249967b851f2b9e7ef769332b5627a8fc0436

Release();
}
OGLSync& operator=(OGLSync&& o) {
handle = std::exchange(o.handle, nullptr);

This comment has been minimized.

@wwylele

wwylele Mar 16, 2018

Member

Need to release the handle before assignment

This comment has been minimized.

@degasus

degasus Mar 17, 2018

Contributor

sorry... I should go to bed

@degasus degasus force-pushed the degasus:vertex_streaming branch from 7312499 to ac92664 Mar 17, 2018

@cluezbot

This comment has been minimized.

cluezbot commented Mar 17, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-03-17T01:02:48Z: 935bcdb...degasus:ac92664aa7ed7983b7fed3c4face6af143369b52

@jroweboy jroweboy merged commit 8c44447 into citra-emu:master Mar 17, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@degasus degasus deleted the degasus:vertex_streaming branch Mar 17, 2018

@kemenaran

This comment has been minimized.

Contributor

kemenaran commented Mar 22, 2018

@degasus Thanks for this work!

Although, for what is worth, this PR triggered a noticeable performance regression for me (from 45 to 27 FPS in A Link Between Worlds).

Intel Core i7 @ 2.8 GHz
Intel HD 5100
macOS 10.12.6

Before Vertex Streaming (935bcdb)

before vertex streaming

After Vertex Streaming (8c44447)

after vertex streaming

@degasus

This comment has been minimized.

Contributor

degasus commented Mar 22, 2018

@kemenaran Thanks for the OSX test, and I'm sorry for the performance regression. Honestly, I was aware of this performance issue as there are two vendors without buffer_storage support: OSX and Mali. Both got a huge performance hit. But the good point, OSX supports fast unsynchronized mappings, so the performance can be improved a lot again, but I've decided not to touch the streaming buffer code here. Please keep in mind that this PR picked the big commit from #3499.

I would like to rewrite the streaming buffer, but I'm not sure if I shall do this now, or after the merge of #3499. I'd like to replace the syncing with orphaning (simpler code), to separate allocation and commited size (required for unknown size buffers, makes syncing harder), to provide an additional flag for invalidation of older uploads (required for texture buffer streaming), and to replace the buffer_storage fallback to use unsynchronized mappings instead of glBufferSubData. The updated interface might be simple to patch in #3499, the actual usage of those features however isn't.

@kemenaran

This comment has been minimized.

Contributor

kemenaran commented Mar 22, 2018

@degasus Oh, right, I didn't notice this was a part of #3499. No problem anyway, this was just for reporting, not for complaining :) Indeed there are still plenty of things that can affect final performances, so let's see how it goes as long as commits extracted from #3499 are merged.

Thanks for the informations!

@degasus

This comment has been minimized.

Contributor

degasus commented Mar 22, 2018

@kemenaran

This comment has been minimized.

Contributor

kemenaran commented Mar 23, 2018

@degasus sure! This code on my machine gives a solid 30% performance increase comparing to the initial vertex streaming PR.

On A Link between Worlds opening screen:

  • Before this PR: 45 FPS
  • After this PR: 27 FPS
  • Using your fork on degasus/vertex_streaming: 34 FPS

degasus vertex_streaming

@wwylele

This comment has been minimized.

Member

wwylele commented Apr 9, 2018

@degasus I also encountered on AMD that storage buffer is much slower than orphan buffer, and I like your idea. Feel free to open a new PR any time. I'll manage rebasing #3499

@degasus

This comment has been minimized.

Contributor

degasus commented May 6, 2018

@kemenaran Has the performance on master changed for you again? The merged PR #3711 might have an impact again.

@kemenaran

This comment has been minimized.

Contributor

kemenaran commented May 6, 2018

@degasus I just tested with a recent build including the code merged in #3711. Performance is about the same than when I previously tested on the degasus/vertex_streaming branch.

  • Before this PR: 45 FPS
  • Just after this PR: 27 FPS
  • Using your fork on degasus/vertex_streaming: 34 FPS
  • Recent build with #3711 merged: 35 FPS

capture d ecran 2018-05-06 a 20 36 35

@Dragios

This comment has been minimized.

Contributor

Dragios commented May 19, 2018

Apparently with this PR merged with #3499, it cause graphic corruption to happen again (a regression). Two games that I found affected are Monster Hunter Generation and Metal Gear Solid: Snake Eater 3D.
Tested Canary-481 instead due to previous tag does not have any build generated. I have check the commits between Canary-480 and Canary-481 and find it irrelevant since those two commits did not touch video core at all.

Canary-478 Canary-481
Monster Hunter Generation honeycam 2018-05-19 12-46-40 honeycam 2018-05-19 11-02-13
Metal Gear Solid: Snake Eater 3D honeycam 2018-05-19 13-22-24 honeycam 2018-05-19 13-34-14

Note: Canary-478 is same as hardware.

@degasus

This comment has been minimized.

Contributor

degasus commented May 19, 2018

@Dragios Might you try the workaround in issue #3731 ? I fear both issues appeared because we use undefined behavior here. Improving the performance makes this issue more likely :/

@Dragios

This comment has been minimized.

Contributor

Dragios commented May 19, 2018

@degasus With the workaround suggested, the artifact corruption in Monster Hunter Generation still occur on native but disappear when using higher internal resolution.

For Metal Gear Solid: Snake Eater 3D it looks better with the workaround (the blooming effect is working but flickers.) This is how it looks like when I used auto scale.
Sample video

@wwylele

This comment has been minimized.

Member

wwylele commented May 19, 2018

@Dragios it is unrelated to this PR anyway as we already discussed. Please continue the discussion in #3731

@degasus

This comment has been minimized.

Contributor

degasus commented May 21, 2018

@kemenaran May I ask you to try it again? Both with nightly (hardware shader) and #3759 (lighting lut updates). I hope it is now as fast as before we've touched half of our GL code.

@kemenaran

This comment has been minimized.

Contributor

kemenaran commented May 23, 2018

@degasus unfortunately, due to #3772, I can't test hardware-shaders on macOS.

However I tested Hardware shaders on Windows (45aa950, without changes from #3759) : the performance was quite good. The same game was running around 55-57 FPS, probably better than before the refactoring started (although it's difficult to compare a Windows performances tests to macOS).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment