Bad GPU results #17

adam-m-jcbs · 2016-10-14T18:28:09Z

Many of the results from GPU-accelerated unit-test code appear to be wrong. As a concrete example, I've built both an accelerated and a CPU-only executable of the test_react unit test.

Build and execute an accelerated binary, then move the output aside for later comparison (note that I've suppressed the output of the commands):

cd $MICROPHYSICS_HOME/unit_test/test_react
make COMP=PGI NETWORK_DIR=ignition_simple ACC=t -j6
./main.Linux.PGI.acc.exe inputs_ignition.BS
mv react_ignition_test_react.BS react_ignition_test_react.BS.ACC

Build and execute a CPU-only binary:

make COMP=PGI NETWORK_DIR=ignition_simple -j6
./main.Linux.PGI.exe inputs_ignition.BS

Comparing the two output files shows they're very different:

fcompare.Linux.gfortran.exe --infile1 react_ignition_test_react.BS --infile2 react_ignition_test_react.BS.ACC

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------
 level =  1
 density                           0.2384185791E-06          0.1192092896E-15
 temperature                       0.6854534149E-06          0.9792191642E-15
 Xnew_carbon-12                    0.9999999997              0.9999999999    
 Xnew_oxygen-16                    0.7999999999              0.9999999999    
 Xnew_magnesium-24                 0.9999999997               9.999436761    
 Xold_carbon-12                    0.9999999997              0.9999999999    
 Xold_oxygen-16                    0.7999999999              0.9999999999    
 Xold_magnesium-24                 0.9999999997               9.999999997    
 wdot_carbon-12                    0.2812178371E-03           1.000000000    
 wdot_oxygen-16                    0.1110223025E-14           1.000000000    
 wdot_magnesium-24                 0.2812178371E-03           1.000000000    
 rho_Hnuc                          0.3150192097E+24           1.000000000

So while many networks and integrators seem to be able to compile and run without crashing, it's not clear how many are generating correct physical results. I've seen a similar issue with the VBDF integrator, so it doesn't appear to be specific to an integrator or network. These results are from bender, which has PGI 16.9 and a GeForce GTX 960 GPU (with CUDA 8.0 drivers and CUDA 7.5 compilers).


zingale · 2016-10-14T18:33:02Z

the fact that the density is different is telling -- nothing should be changing the density in this unit test.

although it is roundoff-level different

dwillcox · 2016-10-15T18:02:57Z

I've compared temp_zone and dens_zone, as calculated inside the loop, between PGI serial (debug) and gfortran serial. (This was with aprox13's input; I haven't tried it for ignition_simple.)

Printing those variables with 15 significant figures shows they can differ by about 1E-7, which is the absolute difference in density you see above. My guess is that the roundoff error differs between the log10 or power operations as implemented by PGI vs. GNU.

Doing the same for PGI-serial vs PGI-acc, I see smaller differences, but at least one difference nonetheless, e.g.

dens_zone = 3414548.873833601

vs.

dens_zone = 3414548.873833600

That suggests the difference you see in density, as Mike said, is roundoff, and that the integrator may not necessarily be doing anything to the density.
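
For anyone who wants to sanity-check the roundoff argument, here is a minimal Python sketch (not part of the test suite). It measures the two dens_zone values above, and the density error from the fcompare table in the first comment, in units of the double-precision spacing (ulp). The `ulp` helper is a hypothetical name introduced here, and the implied ||A|| is inferred by dividing the reported absolute error by the reported relative error, so treat it as an assumption rather than a value read from the plotfiles.

```python
import math

def ulp(x):
    """Spacing between adjacent doubles near x (assumes normal, finite, nonzero x)."""
    _, exponent = math.frexp(abs(x))
    return math.ldexp(1.0, exponent - 53)

# The two dens_zone values printed above (PGI serial vs. PGI-acc).
serial = 3414548.873833601
acc = 3414548.873833600
diff_in_ulps = abs(serial - acc) / ulp(serial)

# Density row of the fcompare table in the first comment.
abs_err = 0.2384185791e-06
rel_err = 0.1192092896e-15
norm_a = abs_err / rel_err            # implied ||A|| (an inference, see above)
ulps_at_norm = abs_err / ulp(norm_a)  # reported error in units of ulp at ||A||

print(f"dens_zone difference: {diff_in_ulps:.1f} ulp")
print(f"implied ||A|| = {norm_a:.3e}, reported error = {ulps_at_norm:.6f} ulp")
```

Both differences come out at the level of a few ulp or less, which is what you would expect from differing math-library implementations rather than from the integrator modifying the density.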

zingale · 2016-10-15T18:04:44Z

good -- forgot that we are doing the exponentiation there.

adam-m-jcbs · 2016-10-17T16:12:14Z

Some sleuthing indicates that xn_zone contains junk on the GPU. After adding a print statement to main.f90, running

make COMP=PGI  NDEBUG=t   -j6
./main.Linux.PGI.exe inputs_3alpha.BS

indicates that all xn_zone values are bounded by 0 <= xn_zone <= 1.0, as they should be. However,

make COMP=PGI ACC=t NDEBUG=t   -j6
./main.Linux.PGI.acc.exe inputs_3alpha.BS.ACC

indicates bad output, such as

...
 j, kk, xn_zone:             4           13    1584893192.466650     
 j, kk, xn_zone:             4           13    1584893192.466650     
 j, kk, xn_zone:             4           13    1584893192.466650     
 j, kk, xn_zone:             4           13    1584893192.466650     
 j, kk, xn_zone:             4           13    1584893192.466650     
 j, kk, xn_zone:             4           13    1584893192.466650     
 j, kk, xn_zone:             1           15   1.1651353297957938E-004
 j, kk, xn_zone:             1           15   1.1651353297957938E-004
...
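
The bounds check can also be done mechanically rather than by eye. Below is a rough Python sketch that scans captured output for `xn_zone` lines and reports any mass fraction outside [0, 1]. The function name `out_of_range` is hypothetical, and the line format is assumed to match the print statement output shown above; the sample lines are copied from that output.

```python
def out_of_range(lines, lo=0.0, hi=1.0):
    """Return (j, kk, value) for every xn_zone line whose value is outside [lo, hi]."""
    bad = []
    for line in lines:
        label, sep, rest = line.partition(":")
        if not sep or "xn_zone" not in label:
            continue  # skip "..." and any unrelated output
        fields = rest.split()
        j, kk, value = int(fields[0]), int(fields[1]), float(fields[2])
        # Written as "not (lo <= value <= hi)" so that NaN is flagged too.
        if not (lo <= value <= hi):
            bad.append((j, kk, value))
    return bad

# Two of the lines printed by the accelerated build above.
sample = [
    " j, kk, xn_zone:             4           13    1584893192.466650     ",
    " j, kk, xn_zone:             1           15   1.1651353297957938E-004",
]
bad = out_of_range(sample)
for j, kk, value in bad:
    print(f"bad xn_zone at j={j}, kk={kk}: {value}")
```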

I'll continue investigating the origin of this, but wanted to note it in the issue thread.

adam-m-jcbs · 2016-10-19T00:48:37Z

A quick note for the record: Max and I looked into this in depth and the origin of the issue appears to be the fact that 1) we're using pf on the GPU without ever having it in a data statement (seems PGI should've complained) and 2) pf has a Fortran character array (not supported by PGI on GPU) and a bound procedure (also not something I would expect to work on the GPU, though we don't actually try to use the procedure or character array). It's not clear how, but using this type on the GPU seems to be messing with memory, which may be why xn_zone contains garbage. Will look into this more tomorrow.

zingale · 2016-10-19T00:57:37Z

oh fun.

adam-m-jcbs · 2016-10-28T17:38:18Z

After the code was changed to not use pf, the error appears to have gone away. GPU and CPU comparison now yields

fcompare react_ignition_test_react.VBDF react_ignition_test_react.VBDF.ACC/

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------
 level =  1
 density                            0.000000000               0.000000000    
 temperature                        0.000000000               0.000000000    
 Xnew_carbon-12                    0.1465605415E-10          0.4396816244E-10
 Xnew_oxygen-16                     0.000000000               0.000000000    
 Xnew_magnesium-24                 0.1465605415E-10          0.4396775531E-10
 Xold_carbon-12                     0.000000000               0.000000000    
 Xold_oxygen-16                     0.000000000               0.000000000    
 Xold_magnesium-24                  0.000000000               0.000000000    
 wdot_carbon-12                    0.1465605415E-09          0.4748322189E-05
 wdot_oxygen-16                     0.000000000               0.000000000    
 wdot_magnesium-24                 0.1465605415E-09          0.4748322189E-05
 rho_Hnuc                          0.1581619200E+18          0.4574365704E-05

Relative errors are at most about 5e-6 between CPU and GPU.
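
If it is useful for future regression checks, the "at most about 5e-6" statement can be verified with a small script. The Python sketch below pulls the largest relative error out of fcompare-style text; the function name `max_relative_error` and the 1e-5 tolerance are assumptions made for illustration, not anything fcompare provides. The sample rows are copied from the table above.

```python
def max_relative_error(report):
    """Return (variable, relative error) for the row with the largest relative error."""
    worst = ("", 0.0)
    for line in report.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # header, separator, or blank line
        try:
            float(fields[1])          # absolute error column must be numeric
            rel = float(fields[2])    # relative error column
        except ValueError:
            continue                  # e.g. the "level =  1" line
        if rel > worst[1]:
            worst = (fields[0], rel)
    return worst

report = """
 level =  1
 density                            0.000000000               0.000000000
 Xnew_carbon-12                    0.1465605415E-10          0.4396816244E-10
 wdot_carbon-12                    0.1465605415E-09          0.4748322189E-05
 rho_Hnuc                          0.1581619200E+18          0.4574365704E-05
"""

TOLERANCE = 1e-5  # hypothetical acceptance threshold
name, rel = max_relative_error(report)
print(f"largest relative error: {name} = {rel:.3e}, within tolerance: {rel < TOLERANCE}")
```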

adam-m-jcbs added bug GPU labels Oct 14, 2016

adam-m-jcbs self-assigned this Oct 14, 2016

adam-m-jcbs closed this as completed Oct 28, 2016
