This repository has been archived by the owner on Sep 27, 2023. It is now read-only.

Black squares in output. #78

Closed
rewolff opened this issue May 9, 2019 · 16 comments

@rewolff

rewolff commented May 9, 2019

On the Nouveau graphics driver, with a GF119 [GeForce GT 610] as the hardware, I get black boxes in the output. It happens where the normal of the surface is perpendicular to the viewing direction.

The boxes are 4x8 pixels (4 wide, 8 high).
I don't do drag-n-drop. So I can't attach a picture here. I've uploaded it. http://prive.bitwizard.nl/curv_black_boxes.png

@doug-moen
Member

Hypothesis:

  • The Nouveau GPU driver subdivides the viewport into 4x8 tiles, and assigns a group of 32 cores to each tile, running in lockstep as a SIMD group (single instruction, multiple data). If that group runs into trouble, the computation is cancelled and the tile is coloured black.
  • Curv renders a 3D shape using a ray-casting algorithm, and employs a loop that marches a ray from the eye point to the intersection of the ray with the shape. I think that if the loop runs for too many iterations, then Nouveau aborts the shader program and colours the tile black.
  • The max number of iterations in the ray-caster is controlled by two magic constants. If I expose those as configuration values, maybe @rewolff can tweak the constants to lower values in order to make the visual glitches go away.

@rewolff
Author

rewolff commented May 9, 2019

Hypothesis: It looks to me as if the core group runs into a division by zero because the normal is perpendicular to the ray. (Say: "what is the tan of the angle at which the ray hits the pixel?" is answered by "infinity" or a division by zero.) In the example above, you can see that, where mathematically a precisely perpendicular ray would exist on each edge pixel, only a few get the division by zero. This is because the actual ray is likely to miss the precise perpendicular point.

@doug-moen
Member

The math is done in floating point. Division by zero on a GPU doesn't cause an exception, it produces either Infinity or NaN values, and the computation keeps running. There's nothing that the Nouveau driver can do to change this behaviour. The Nvidia proprietary driver doesn't produce these glitches, and nobody has reported this problem before. So we are looking for a mechanism that is specific to Nouveau.

Those glitches appear in places where the sphere-tracing algorithm that I use for ray-marching will normally use an increased number of iterations. We can test my hypothesis once I make those constants configurable.

@doug-moen
Member

I just added two new command line options:

curv -Oray_max_iter=200 -Oray_max_depth=400 ...

You can try reducing ray_max_iter from its default value of 200 to see if the glitches go away in the rainbow.curv model. This value directly limits the number of iterations in the ray-marching loop.

ray_max_depth indirectly limits the number of iterations. If the ray has travelled more than 400 units (or whatever value you supply), then the loop terminates early. I don't think this one will fix the glitches in rainbow.curv. However, reducing the value may speed up rendering and reduce glitching in some other shapes.
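
To make the roles of the two options concrete, here is a minimal sphere-tracing sketch in Python. It is not Curv's actual shader code: the function names, the `eps` hit tolerance, and the unit-sphere test shape are assumptions made for illustration. Only the names `ray_max_iter` and `ray_max_depth` and their defaults (200 and 400) come from the options above.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin (illustrative shape)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def sphere_trace(origin, direction, sdf, ray_max_iter=200, ray_max_depth=400.0, eps=1e-4):
    """March a ray along a unit-length direction.

    Returns (hit, distance_travelled, iterations_used).
    eps is an assumed hit tolerance, not a Curv constant.
    """
    t = 0.0
    for i in range(ray_max_iter):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            # Close enough to the surface: report a hit.
            return True, t, i + 1
        # Sphere tracing: it is safe to advance by the distance to the nearest surface.
        t += dist
        if t > ray_max_depth:
            # The ray has travelled too far: give up early.
            return False, t, i + 1
    # Iteration budget exhausted without hitting or escaping.
    return False, t, ray_max_iter


# A ray aimed at the centre of the sphere, and one that narrowly misses its edge.
center_ray = sphere_trace((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), sphere_sdf)
grazing_ray = sphere_trace((1.0005, 0.0, -5.0), (0.0, 0.0, 1.0), sphere_sdf)
print("centre ray:", center_ray)
print("grazing ray:", grazing_ray)
```

In this sketch, the ray aimed at the centre converges in a couple of steps, while the ray that just grazes the surface keeps taking tiny steps and uses far more of the iteration budget. Silhouette edges are where that happens, which matches where the glitches show up.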

@doug-moen doug-moen added the GPU label May 9, 2019
@rewolff
Author

rewolff commented May 9, 2019

Thanks! Pulled, compiled and tested. Starting from the default you show above, neither of the two parameters seems to do anything. I tested 2x more and 2x less.

Edit: Ah! The "change by a factor of two" was not aggressive enough. 80: no change. 40: only two black blobs. 20: no blobs. (5 is the "normal" number of blobs for the rainbow cylinder that I use to test.)

@doug-moen
Member

doug-moen commented May 9, 2019 via email

@rewolff
Author

rewolff commented May 9, 2019

yeah. With 40 "some" are fixed, but at 20 they are all fixed and render correctly. So 40 is "on the edge" and 20 is enough....

@rewolff
Author

rewolff commented May 9, 2019

Testing with: rainbow.curv, counting only the black blocks on one side of the cylinder:

0-19 does not render correctly (part of the cylinder is missing).
20-37 renders correctly.
38 renders with one black block.
39-42 renders with two black blocks.
43 renders with three black blocks.
44 renders with four black blocks.
45-200 renders with five black blocks.

@doug-moen
Copy link
Member

All GPU drivers render by partitioning the viewport into tiles, and rendering the pixels within each tile in parallel. Multiple tiles are also rendered in parallel, depending on how many cores you have.

In Curv, the time required to compute a tile can vary greatly. Background tiles are usually very fast. Certain tiles, like the rounded edges of the rainbow cylinder, can be slow. There could easily be a 50 to 1 or 100 to 1 difference in rendering times between tiles, depending on the shape, but if the slow tiles are rare, then you still get fast average tile rendering times, and the user can't tell the difference.

The Nouveau driver appears to impose a hard limit on the rendering time of each tile. It is the slow tiles that are turning black, and we can eliminate the problem by speeding up the ray marcher. I would guess that Nouveau attempts to guarantee 30 frames per second, based on the assumption that all tiles take the same time to render, and imposes a hard time limit based on these assumptions. If the slowest tile in a Curv program is required to meet this deadline, then the net effect is as if the GPU is 10 or 50 times slower than it actually is.
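
The effect of lockstep execution on tile times can be sketched as follows. The iteration counts are made-up illustrative numbers, not measurements, and `tile_cost` and `frame_stats` are hypothetical helper names. The only assumption carried over from the discussion is that a tile finishes when its slowest pixel does.

```python
def tile_cost(iteration_counts):
    """Cost of one tile when its pixels run in lockstep: the slowest pixel decides."""
    return max(iteration_counts)


def frame_stats(tiles):
    """Return (average tile cost, worst tile cost) for a list of tiles."""
    costs = [tile_cost(t) for t in tiles]
    return sum(costs) / len(costs), max(costs)


# Illustrative frame: 99 cheap background tiles of 32 pixels each,
# plus one edge tile in which a single pixel needs 200 iterations.
tiles = [[2] * 32 for _ in range(99)] + [[2] * 31 + [200]]
mean_cost, worst_cost = frame_stats(tiles)
print("average tile cost:", mean_cost)
print("worst tile cost:", worst_cost)
```

With these numbers the average tile is cheap, but a per-tile deadline has to be sized for the single slow tile, which is roughly 50 times more expensive than the average.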

There is at least one more simple trick for getting a bit more performance out of the ray marcher, but no easy way to get 10x or 50x more performance. I think that Nouveau is not suitable for use with Curv, and I recommend installing the Nvidia proprietary GPU driver.

@doug-moen
Member

I couldn't find a clear explanation of why Nouveau works this way. No other Mesa-based GPU driver has this "black rectangle" bug.

But, we do know that Nouveau suffers performance problems because Nvidia is blocking the Nouveau project from doing thermal management. (Those APIs are blocked, due to a requirement for digitally signed firmware on some hardware models, and due to implied legal threats if they reverse engineer the proprietary driver.) This means that Nouveau must be careful to avoid doing anything that would cause your GPU to overheat and become damaged. This is consistent with my theory that Nouveau aborts a SIMD group if it runs too long.

I looked to see if there is a way of disabling the "black rectangle" behaviour, but I couldn't find anything.

@rewolff
Author

rewolff commented May 10, 2019

So, now we have a "workaround in curv" and possibly a demonstration case, I think it is time to report this as a bug in Nouveau.
When I find that 20-37 "renders correctly" I suspect that this holds true for the very simple case of the colored cylinder and not for more complex objects, and that "> 100" is likely required for realistic 3D-printable objects. That's why you set the default here to 200.

For my understanding: you have a "ray_max_depth" that says how far from the viewpoint the rays can be broken off. This explains why some things that look infinite seem to have an end, but when you move the viewpoint the actual end stays just as far as the ray depth is measured from the camera position. Right?

@doug-moen
Member

A suggestion from the Nouveau bug tracker is to use this environment variable:

export LIBGL_ALWAYS_SOFTWARE=1 

This will disable the Nouveau GPU driver and use software rendering of OpenGL calls instead (meaning the work is done on the CPU). The results may be unacceptably slow, but there should be no rendering artifacts.

This is not a serious or practical suggestion, due to the loss of rendering performance, but I'm including it for completeness of the historical record.
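
For anyone who wants to try it from a script rather than a shell, a minimal Python sketch follows. The helper name `software_rendering_env` is hypothetical, and the commented-out launch line assumes that `curv` is on the PATH and that rainbow.curv is in the current directory.

```python
import os


def software_rendering_env(base_env=None):
    """Return a copy of the environment with LIBGL_ALWAYS_SOFTWARE set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    return env


# Hypothetical usage (not run here):
# import subprocess
# subprocess.run(["curv", "rainbow.curv"], env=software_rendering_env(), check=True)
```

Setting the variable only in the child process's environment leaves the rest of the desktop session on the GPU driver.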

@doug-moen
Member

The Nouveau driver is not supported until this issue is resolved upstream. I don't think this is a simple bug fix; instead, Nvidia will need to change their corporate policy and support the Nouveau project before the issue can be resolved.

@rewolff
Author

rewolff commented May 24, 2019

Might I make a suggestion?
The "not supported" means to you: "won't work without issues". When I first read that I interpreted it as: "you don't stand a chance of getting that to work".

I think the difference is important: I almost gave up on "giving curv a test-run" because of your "not supported" status. While in fact it is quite usable, if you know that the black rectangles are a rendering artifact.

Getting people to test-drive curv and subsequently get interested in curv works both ways: with a bit of luck someone might fix the nouveau bugs that cause this issue, or maybe someone fixes it by modifying curv in such a way that the nouveau issues no longer occur.

@doug-moen
Member

Here's sort of good news, a way to work around the Nouveau driver bug. But in the end, it's still easier and safer to just install the Nvidia proprietary driver.

More information about the Nouveau bug:

The main blocker for making the open-source NVIDIA driver viable for Linux desktop users and gamers though is re-clocking support for newer generations of hardware... With the GeForce GTX 900 Maxwell series and newer, there isn't yet any re-clocking support so graphics cards are stuck to operating at their boot frequencies, which generally is quite low compared to their rated base/boost clock frequencies. Until NVIDIA releases the signed PMU firmware or the Nouveau developers achieve a workaround, any GPUs newer than the GTX 600/700 Kepler or GTX 750 Maxwell series is a no-go if you want decent performance. It's not known if/when a solution will be in place for better supporting these newer generations of NVIDIA GPUs.

Thus for now the best Nouveau open-source driver support remains with the GTX 600/700 Kepler series since at least there the graphics card can be manually re-clocked by writing a value to DebugFS... Still no automatic/dynamic re-clocking, but at least users can force their Kepler (and GTX 750 Maxwell1) parts to the rated frequencies.

And here is the official Nouveau web site.

It looks like the "black squares" performance problem can be mitigated by "manual reclocking", at least on the older pre-GTX-900 GPUs that support this. This is a risky procedure that involves setting nouveau.pstate=1 (for kernels earlier than 4.5) and then writing a value to /sys/... (the path is kernel dependent). How to do this? The Nouveau wiki provides some information:

WARNING: Power management is a very experimental feature and is not expected to work. If you decided to upclock your GPU, please acknowledge that your card may overheat. Please check the temperature of your GPU at all time!

Raising the card performance mode might help. Ask on IRC, #nouveau channel, how to do that. Instructions are not given here, because in the worst case, it may destroy your card, because power management is still a work in progress.

Phoronix provides more helpful instructions: https://www.phoronix.com/scan.php?page=news_item&px=linux-4.5-nouveu-pstate-howto

I don't recommend following this procedure. It's far less difficult, and far less risky, to install the Nvidia proprietary driver. And you'll get better results than with the Nouveau driver + reclocking.

@doug-moen
Member

@rewolff

Might I make a suggestion?
The "not supported" means to you: "won't work without issues". When I first read that I interpreted it as: "you don't stand a chance of getting that to work".

I updated the GPU requirement section of the README with better wording and more information. Thanks for the suggestion.
