High-level description of RAVU algorithm? #9

Closed
haasn opened this issue Jul 28, 2017 · 9 comments
haasn (Collaborator) commented Jul 28, 2017

I'd like to see if I can figure out any tricks to make it faster, especially with compute shaders.

How does RAVU work at a high level? It has four passes; what do those passes do? And why the weird weights texture?

bjin (Owner) commented Jul 28, 2017

Actually, I already have an idea of how to improve RAVU with compute shaders, but the current ravu.py is a mess and I need to do some refactoring first to make adding compute shader routines easier.

Anyway, RAVU works like this. The convolution kernel has size 2*r x 2*r: it covers a square of 2*r x 2*r known texels, and the texel to be interpolated sits at the intersection of the two diagonals of this square. RAVU assumes the interpolated texel is a weighted sum of all 4*r*r known texels (samples), hence a convolution kernel.
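To make the shape concrete, here is a minimal GLSL sketch of that weighted sum for r = 2 (a 4x4 kernel, 16 known samples); the function and parameter names are illustrative, not taken from the generated shaders:

```glsl
// Interpolate one unknown texel from its 4x4 neighbourhood of known texels.
// For radius r = 2 there are 4*r*r = 16 samples; the kernel weights come
// from the weights texture described further below.
float interpolate(float samples[16], float weights[16]) {
    float result = 0.0;
    for (int i = 0; i < 16; i++)
        result += weights[i] * samples[i];
    return result;
}
```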

RAVU upscales an image lowres of size NxM to an image highres of size 2Nx2M. It divides the texels of highres into four classes, with the texel at (x, y) going to class (x%2, y%2). The texels of each class, taken together, form an image of size NxM. RAVU assumes the combined image from class (0, 0) is lowres itself. Each of the first three passes renders the combined image of one of the remaining classes:

  1. Using texels from class (0, 0) and the convolution kernel, all texels of class (1, 1) can be interpolated.
  2. Using texels from both class (0, 0) and class (1, 1), and the convolution kernel rotated by 45 degrees, all texels of class (0, 1) can be interpolated.
  3. Repeating step 2 with a slight offset, all texels of class (1, 0) can be interpolated as well.
  4. Combining the texels of all four classes produces highres (a sketch of this interleaving step follows below).
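A minimal sketch of that pass-4 interleaving (the sampler names are hypothetical, and this is not the actual ravu.py output):

```glsl
// Select the class image by output-pixel parity and fetch the matching texel.
uniform sampler2D lowres, class11, class01, class10; // the four NxM images

vec4 interleave(ivec2 pos) {       // pos: texel position in 2Nx2M highres
    ivec2 src = pos / 2;           // corresponding texel in each NxM image
    ivec2 p   = pos % 2;           // class (x%2, y%2)
    if (p == ivec2(0, 0)) return texelFetch(lowres,  src, 0);
    if (p == ivec2(1, 1)) return texelFetch(class11, src, 0);
    if (p == ivec2(0, 1)) return texelFetch(class01, src, 0);
    return texelFetch(class10, src, 0);
}
```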

This is the same scheme superxbr uses, but explicitly dividing it into four passes and saving results to intermediate textures is much faster, since all pixel rendering threads now take roughly the same time to finish (each pixel rendered in the first three passes applies the convolution kernel exactly once), and no thread finishes super early and then sits idle the whole time.

The convolution kernels are extracted from the weights texture based on a key (angle, strength, coherence), which is calculated from the eigenvalues and eigenvectors of the local gradient matrix. Basically, samples in the training set are divided into quant_angle * quant_strength * quant_coherence classes based on this key; the samples in each class (along with their labels) are fed to a linear regression model to find the convolution kernel weights that minimize the squared error (equivalently, maximize PSNR). All those kernels are stored in the weights texture, indexed by the key.
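As a hedged sketch of how such a key can be derived (this follows the RAISR-style construction; the exact formulas, counts, and quantization thresholds in ravu.py may differ, and the bounds below are placeholders):

```glsl
// Derive (angle, strength, coherence) from the symmetric 2x2 structure
// matrix [a b; b d] accumulated from local gradients (gx, gy):
// a = sum gx*gx, b = sum gx*gy, d = sum gy*gy.
const float PI = 3.14159265358979;
const int quant_angle = 24; // placeholder class counts

ivec3 gradient_key(float a, float b, float d) {
    // Eigenvalues of the symmetric 2x2 matrix
    float T = a + d;
    float delta = sqrt(max(T * T / 4.0 - (a * d - b * b), 0.0));
    float L1 = T / 2.0 + delta, L2 = max(T / 2.0 - delta, 0.0);

    // Dominant eigenvector (b, L1 - a) gives the local gradient angle
    float theta = atan(L1 - a, b);
    if (theta < 0.0) theta += PI;

    float sqrtL1 = sqrt(L1), sqrtL2 = sqrt(L2);
    float strength  = sqrtL1;                         // gradient magnitude
    float coherence = sqrtL1 + sqrtL2 > 0.0           // anisotropy in [0,1]
                    ? (sqrtL1 - sqrtL2) / (sqrtL1 + sqrtL2) : 0.0;

    int ka = clamp(int(theta / PI * float(quant_angle)), 0, quant_angle - 1);
    int ks = strength  < 0.005 ? 0 : (strength  < 0.03 ? 1 : 2); // placeholder
    int kc = coherence < 0.25  ? 0 : (coherence < 0.5  ? 1 : 2); // placeholder
    return ivec3(ka, ks, kc);
}
```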

Let me know if you have further questions.

bjin (Owner) commented Jul 28, 2017

My idea for using a compute shader for RAVU is basically the same as your compute shader sample with the 9x9 averaging kernel. It's a bit more complicated for RAVU, since in pass 2 and pass 3 we sample and apply the convolution kernel across two textures instead of one. But within each texture the texels to sample are contiguous (I actually exploited this and applied textureGatherOffset in these two passes as well), which makes life much easier than dealing with odd and even texels separately.
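For reference, a small sketch of that gather trick; the names "tex" and "pos" are illustrative:

```glsl
// Each textureGatherOffset call returns one channel of a 2x2 quad of texels,
// so a 4x4 single-channel neighbourhood costs 4 fetches instead of 16.
uniform sampler2D tex;

void fetch4x4(vec2 pos, out vec4 quads[4]) {
    quads[0] = textureGatherOffset(tex, pos, ivec2(-1, -1), 0);
    quads[1] = textureGatherOffset(tex, pos, ivec2( 1, -1), 0);
    quads[2] = textureGatherOffset(tex, pos, ivec2(-1,  1), 0);
    quads[3] = textureGatherOffset(tex, pos, ivec2( 1,  1), 0);
    // each vec4 holds its quad as w=(0,0), z=(1,0), x=(0,1), y=(1,1)
}
```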

haasn (Collaborator) commented Jul 28, 2017

It sounds like you're essentially doing three big convolutions, and then combining the three intermediate results into the final image, yes?

This seems like it involves a bunch of redundant sampling work. It would be faster to do the sampling once, generate all three intermediate results at the same time, and then write them out to the resulting image from a single thread.

Basically, what you could do (in principle) is this:

  1. Sample all of the input texels. (Ideally, use an array in shmem to share work across threads, but doing it per-thread is also an option for now)
  2. Compute the weighted sum for (1,1); then for (0,1) and (1,0)
  3. Write class (x,y) to coordinate ivec2(gl_GlobalInvocationID) + ivec2(x,y) in the output image

Right now, the user convolution shaders implicitly insert an imageStore(out_image, ivec2(gl_GlobalInvocationID), color); at the end of the shader, but I could very easily drop this implicit behavior and let you do the output sample writing yourself. I would also need to give you some mechanism for making the compute shader dispatch size != the output texture size.
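A minimal sketch of that layout (everything other than the GLSL built-ins is hypothetical, and the actual RAVU convolutions are stubbed out with a plain average):

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;
layout(rgba16f) uniform writeonly image2D out_image; // 2Nx2M output
uniform sampler2D lowres;                            // NxM input

// Stand-in for the real per-class convolution: just average a 2x2 block.
vec4 stub_conv(ivec2 src) {
    return 0.25 * (texelFetch(lowres, src,               0)
                 + texelFetch(lowres, src + ivec2(1, 0), 0)
                 + texelFetch(lowres, src + ivec2(0, 1), 0)
                 + texelFetch(lowres, src + ivec2(1, 1), 0));
}

void main() {
    ivec2 src = ivec2(gl_GlobalInvocationID.xy); // one thread per NxM texel
    ivec2 dst = src * 2;                         // its 2x2 block in highres
    imageStore(out_image, dst,               texelFetch(lowres, src, 0));
    imageStore(out_image, dst + ivec2(1, 1), stub_conv(src)); // class (1,1)
    imageStore(out_image, dst + ivec2(0, 1), stub_conv(src)); // class (0,1)
    imageStore(out_image, dst + ivec2(1, 0), stub_conv(src)); // class (1,0)
}
```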

bjin (Owner) commented Jul 29, 2017

It's an interesting idea. With direct access to out_image and control over the dispatch size, I think I can make RAVU run in just two passes (not one, as you suggest; I will explain why below). Since there is no other potential user of compute user shaders at this point, I think we are fine making some backward incompatible changes (mostly regarding out_image).

How about this:

  1. Add two optional arguments to the COMPUTE header. If COMPUTE bw bh lw lh is specified, dispatch bw * bh threads per workgroup as usual, but with ceil(width / lw) * ceil(height / lh) workgroups in total, with (lw, lh) as the logical block size (see the sketch after this list).
  2. Remove the use of the hook() function; the user shader will write to out_image directly. This is also necessary since gl_GlobalInvocationID is useless once the dispatch size is changed.
  3. Add a mechanism to allow user shaders to enable GLSL extensions (GL_ARB_arrays_of_arrays in my case). A simple EXTENSION header would be fine; it could be used multiple times to enable more than one extension.
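For item 1, a sketch of how a shader could consume those semantics (the constants are illustrative, and the store is a stub for the real work):

```glsl
#version 430
layout(local_size_x = 32, local_size_y = 8) in;      // bw = 32, bh = 8
layout(rgba16f) uniform writeonly image2D out_image;

const ivec2 logical = ivec2(64, 16);                 // lw = 64, lh = 16

void main() {
    // With ceil(width/64) * ceil(height/16) workgroups dispatched, each
    // workgroup covers a 64x16 logical block using only 32x8 threads.
    ivec2 base = ivec2(gl_WorkGroupID.xy) * logical;
    for (int y = int(gl_LocalInvocationID.y); y < logical.y; y += 8)
    for (int x = int(gl_LocalInvocationID.x); x < logical.x; x += 32)
        imageStore(out_image, base + ivec2(x, y), vec4(0.0)); // real work here
}
```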

The reason we still need two passes:

  1. Pass 1 samples from the original texture only, while pass 2 and pass 3 sample from two textures. How they work, and their kernel offset ranges (pass 1 has a much larger kernel offset), are quite different.
  2. Pass 2 and pass 3 need to sample pixels that pass 1 produced in other workgroups.

I think a compute shader based two-pass version of RAVU will bring a huge performance boost:

  1. It significantly reduces redundant sampling, as you suggested.
  2. It removes pass 4 completely. Pass 4 is actually the bottleneck when using ravu-r4-gather on my MacBook Pro: it alone takes 3.4 ms, while the first three passes each take around 2.8 ms. With the two-pass compute shader version, at the very least we only have to dispatch 1/4 of the threads. (I can't use compute shaders on my MacBook Pro, though.)

bjin (Owner) commented Jul 29, 2017

@haasn Actually, shared memory size permitting, we could even share the calculation of gradients (the gx and gy variables in the currently generated shaders) among threads in the same workgroup, and also between pass 2 and pass 3 (after they are combined).
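A hedged sketch of what that sharing could look like (the tile size and names are illustrative; edge clamping is omitted for brevity):

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;
uniform sampler2D lowres;

// 8x8 tile plus a 1-texel border; arrays of arrays need GLSL 4.30, or
// GL_ARB_arrays_of_arrays on older versions.
shared vec2 grad[10][10];

void main() {
    ivec2 tile = ivec2(gl_WorkGroupID.xy) * 8 - 1;  // tile origin incl. border
    // 64 threads cooperatively fill all 100 entries with central differences
    for (uint i = gl_LocalInvocationIndex; i < 100u; i += 64u) {
        ivec2 p = tile + ivec2(int(i % 10u), int(i / 10u));
        float l = texelFetch(lowres, p - ivec2(1, 0), 0).x;
        float r = texelFetch(lowres, p + ivec2(1, 0), 0).x;
        float d = texelFetch(lowres, p - ivec2(0, 1), 0).x;
        float u = texelFetch(lowres, p + ivec2(0, 1), 0).x;
        grad[i / 10u][i % 10u] = 0.5 * vec2(r - l, u - d);
    }
    barrier();
    // ... every thread can now read grad[][] for its whole neighbourhood ...
}
```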

Could you share more information about the size limit of shared memory? How large is it typically, across different cards? Is it a hard limit or a soft limit? What happens if the size is exceeded: a huge performance drop? By how much?

haasn (Collaborator) commented Jul 29, 2017

> Add a mechanism to allow user shaders to enable GLSL extensions (GL_ARB_arrays_of_arrays in my case). A simple EXTENSION header would be fine; it could be used multiple times to enable more than one extension.

I think you can already do that by just inserting the appropriate GLSL into the top of the pass.

haasn (Collaborator) commented Jul 29, 2017

> Could you share more information about the size limit of shared memory?

Typically 32-64 kB. On my nvidia GPU it's 49 kB. You can see a list here: http://opengl.gpuinfo.org/gl_stats_caps_single.php?listreportsbycap=GL_MAX_COMPUTE_SHARED_MEMORY_SIZE

> What happens if the size is exceeded: a huge performance drop? By how much?

Compile failure.

bjin (Owner) commented Jul 29, 2017

> I think you can already do that by just inserting the appropriate GLSL into the top of the pass.

See https://www.khronos.org/opengl/wiki/Core_Language_(GLSL)#Extensions; these directives are required to come immediately after the #version declaration:

"You should put these definitions before any other language features, but after the version declaration."

> Typically 32-64 kB.

Seems to be enough for RAVU, but good to keep in mind.

bjin (Owner) commented Aug 2, 2017

Closing, since the compute shader port has been implemented and optimized to a fair degree.

@haasn Feel free to ignore the //!EXTENSION proposal; I don't use my old laptop very often now.
