[SYSTEMDS-3020] Initial GPU junit tests #1317

Closed
wants to merge 4 commits into from


corepointer
Contributor

More tests will be added as we go. For the tests to run, it is advisable not to start multiple test cases simultaneously in the same JVM. To run them from the command line, use something like this:

mvn -ntp test -DenableGPU=true -Dmaven.test.skip=false -Dtest-parallel=suites -Dtest-threadCount=1 -Dtest-forkCount=1 -D automatedtestbase.outputbuffering=false -Dtest=org.apache.sysds.test.gpu.**

  • This test suite should not be included in the automated testing for the time being as we don't have GPU testing infrastructure set up.
  • Two of the tests are failing atm - working on that ;-)

@phaniarnab
Contributor

Thanks, @corepointer for the gpu junit test infrastructure.
I did not start executing the tests yet. But I have some questions regarding the way it is done.

  1. Does this setup allow writing feature tests on the GPU, where the baseline also runs with -gpu (e.g. compare -gpu -lineage with -gpu)?
  2. I think this is an easy enough way to run regression tests on a GPU. But effect-wise, how is it different from adding -gpu to the existing test classes?

@corepointer
Contributor Author

> 1. Does this setup allow writing feature tests on the GPU, where the baseline also runs with `-gpu` (e.g. compare `-gpu -lineage` with `-gpu`)?

All tests add -gpu. It is up to your test to add more. So -lineage -gpu is definitely possible.

> 2. I think this is an easy enough way to run regression tests on a GPU. But effect-wise, how is it different from adding `-gpu` to the existing test classes?

At the moment it is no different (other than the test checking for the appearance of the corresponding gpu instruction in the heavy hitter output). The content of new tests is up to its author - anything's possible ;-)
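
The heavy hitter check mentioned above could look roughly like the following. This is a minimal sketch and not the test suite's actual code: the class and method names are hypothetical, the `gpu_` prefix for GPU instructions is an assumption, and the sample text is made up rather than real `-stats` output.

```java
import java.util.Arrays;

public class HeavyHitterCheck {
    // Assumed prefix that marks GPU instructions in the stats output.
    static final String GPU_PREFIX = "gpu_";

    // Returns true if any whitespace-separated token of the stats output
    // equals the GPU-prefixed opcode, e.g. "gpu_exp" for opcode "exp".
    static boolean usesGpuInstruction(String statsOutput, String opcode) {
        String expected = GPU_PREFIX + opcode;
        return Arrays.stream(statsOutput.split("\\s+"))
                .anyMatch(expected::equals);
    }

    public static void main(String[] args) {
        // Made-up sample resembling a heavy hitter table; not real output.
        String sample = String.join("\n",
            "Heavy hitter instructions:",
            "  1  gpu_exp  0.012  1",
            "  2  rand     0.004  1");
        System.out.println(usesGpuInstruction(sample, "exp"));
        System.out.println(usesGpuInstruction(sample, "rand"));
    }
}
```

Matching whole tokens rather than substrings avoids a false positive when one opcode is a prefix of another.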

@j143
Contributor

j143 commented Nov 8, 2021

Hi @corepointer ,

the testing does not work on this runner. Any pointers on how to resolve this one?

image

CUDA and cuDNN are available from the command line:

run results here: https://github.com/j143/systemds/runs/4137852455


This one runs fine

java -Xmx4g -Xms4g -Xmn400m -cp target/SystemDS.jar:target/lib/*:target/SystemDS-*.jar org.apache.sysds.api.DMLScript -f ../main.dml -exec singlenode -gpu

corepointer and others added 4 commits November 8, 2021 23:37
This commit is part of the GPU test suite epic [SYSTEMDS-3019] and introduces:
* the gpu test java package
* tests for cellwise/rowwise codegen
* test for unary builtin functions (incomplete)
* Move some maven surefire plugin settings to the properties section (with same defaults as before) to make them settable from command line (need to reduce thread count for GPU tests)
* provide an integer when appending "-stats" to a test run (used to crash some tests without it)
* Conv2DTest requests "recompile_runtime" explain mode without adding "-explain" so output would not print
@corepointer
Contributor Author

> Hi @corepointer ,
>
> the testing does not work on this runner. Any pointers on how to resolve this one?
>
> image
>
> CUDA and cuDNN are available from the command line:
>
> run results here: https://github.com/j143/systemds/runs/4137852455

You're not rebuilding the binaries yet with cmake, are you? Because then there might have been the issue that Jitify is not there. You need to clone with --recursive to fetch the external dependency. But that is an issue once the current binaries run the test.

Two things you could check: Is it CUDA version 10.2 that is installed? This is currently the latest version we support (I'm working on CUDA 11.x support - it's almost there). The other thing to check: Is a CUDA-capable device visible to your VM? You could add the command "nvidia-smi" to your runner. That should print the available CUDA devices.

> This one runs fine
>
> java -Xmx4g -Xms4g -Xmn400m -cp target/SystemDS.jar:target/lib/*:target/SystemDS-*.jar org.apache.sysds.api.DMLScript -f ../main.dml -exec singlenode -gpu

This one isn't using any GPU instructions though ;-)

PS: I've rebased to current main branch and cherry picked the commits you added.

@j143
Contributor

j143 commented Nov 9, 2021

  • 1. Checking for installation of cuda
ubuntu@ip-10-0-0-4:~/repo/systemds$ nvidia-smi
Tue Nov  9 06:50:01 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0    70W / 149W |    243MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9955      C   .../lib/jvm/java-11-openjdk-amd64/bin/java   232MiB |
+-----------------------------------------------------------------------------+

cudnn is at /usr/include/cudnn.h version 7.6.5 as per our docs.
I could successfully run the cuda samples.

@corepointer
Contributor Author

> | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |

Sorry, my bad. I didn't realize that I raised the bar to compute capability 6 and above when I introduced atomicAdd() for double values (that's exactly the function that is called in the line referenced by the error messages). The Tesla K80 has a compute capability of 3.7 [1]. So for the time being, code gen is for cc 6+ only unless we find a workaround.

[1] https://developer.nvidia.com/cuda-gpus
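
To make the threshold concrete, here is a minimal sketch of a guard a test could use to decide whether a device qualifies. The class and method names are hypothetical and nothing like this is part of the PR. It assumes the compute capability is available as a "major.minor" string, for example looked up from the table in [1].

```java
public class ComputeCapabilityCheck {
    // Native atomicAdd() on double requires compute capability 6.0 or higher.
    static final int MIN_MAJOR_FOR_DOUBLE_ATOMIC_ADD = 6;

    // Parses a "major.minor" string such as "3.7" and reports whether the
    // device natively supports atomicAdd() on double values.
    static boolean supportsDoubleAtomicAdd(String computeCapability) {
        String[] parts = computeCapability.trim().split("\\.");
        int major = Integer.parseInt(parts[0]);
        return major >= MIN_MAJOR_FOR_DOUBLE_ATOMIC_ADD;
    }

    public static void main(String[] args) {
        System.out.println("Tesla K80 (3.7): " + supportsDoubleAtomicAdd("3.7"));
        System.out.println("cc 6.0 device:   " + supportsDoubleAtomicAdd("6.0"));
    }
}
```

Only the major version matters for this particular feature, so the minor part is ignored.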

@j143
Contributor

j143 commented Nov 17, 2021

This LGTM. 👍

The workaround for the last failing test can be resolved later. :)

@corepointer
Contributor Author

> This LGTM. +1
>
> The workaround for the last failing test can be resolved later. :)

Thank you! I'll fix it and merge it in after my next paper deadline on Dec 10.

@j143
Contributor

j143 commented Nov 25, 2021

Or should I comment out the failing test and merge the rest? This would avoid rebasing on my gpu runner test fork.

If that is okay for you.

@corepointer
Contributor Author

> Or should I comment out the failing test and merge the rest? This would avoid rebasing on my gpu runner test fork.

You can use the "@Ignore" annotation that we already use in the row template test case (I think test #18).

> If that is okay for you.

Yes, please go ahead.

@j143 j143 self-requested a review November 28, 2021 18:45
@asfgit asfgit closed this in d8dd694 Nov 28, 2021