
[SYSTEMML-1034] implemented gpu solve#476

Closed
nakul02 wants to merge 3 commits into apache:master from nakul02:gpu_solve

Conversation

@nakul02
Member

@nakul02 nakul02 commented Apr 28, 2017

Implemented the GPU solve() function.
Ping @niketanpansare, @bertholdreinwald, @dusenberrymw

@iyounus - can you please try this out and also check for correctness? I've only checked on smaller data.

This should benefit several algorithms. Based on a simple grep, I see solve() being used in:

  • ALS-DS.dml
  • CsplineDS.dml
  • LinearRegDS.dml
  • StepLinearRegDS.dml

For me, I seem to get a 30x speedup in an example I tried on my own machine (Core i7 quad core, 32 GB RAM, GTX 1070).

Program:

m = 12345
n = 4321

A = rand(rows=m, cols=n)
B = rand(rows=m, cols=1)

x = solve(A,B)
write(x, "xout")
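For reference, here is a scaled-down NumPy analogue of the DML program above (the sizes here are illustrative, not the m=12345, n=4321 from the benchmark). `np.linalg.lstsq` stands in for SystemML's `solve()`, which for an overdetermined system returns the least-squares solution:

```python
import numpy as np

# Scaled-down analogue of the DML script above (hypothetical sizes).
m, n = 120, 40
rng = np.random.default_rng(0)
A = rng.random((m, n))
B = rng.random((m, 1))

# x minimizes ||Ax - B||_2, i.e. it satisfies the normal equations
# (A^T A) x = A^T B. NumPy's lstsq is a stand-in for solve(A, B).
x, residuals, rank, sv = np.linalg.lstsq(A, B, rcond=None)
print(x.shape)
```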

Output

➜  incubator-systemml git:(gpu_solve) ✗ bin/systemml solve.dml -gpu force -stats
================================================================================
Output dir: /home/njindal/git/incubator-systemml/temp
================================================================================
17/04/28 00:14:42 INFO api.DMLScript: BEGIN DML run 04/28/2017 00:14:42
17/04/28 00:14:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/28 00:14:42 INFO context.GPUContext: Initializing CUDA
17/04/28 00:14:43 INFO context.GPUContext:  GPU memory - Total: 8506.769408 MB, Available: 6889.2098559999995 MB on GPUContext{deviceNum=0}
17/04/28 00:14:43 INFO context.GPUContext: Total number of GPUs on the machine: 1
17/04/28 00:14:46 INFO api.DMLScript: SystemML Statistics:
Total elapsed time:		4.561 sec.
Total compilation time:		0.348 sec.
Total execution time:		4.213 sec.
Number of compiled MR Jobs:	0.
Number of executed MR Jobs:	0.
CUDA/CuLibraries init time:	0.660/0.514 sec.
Number of executed GPU inst:	1.
GPU mem tx time  (alloc/dealloc/set0/toDev/fromDev):	0.004/0.000/0.000/0.051/0.000 sec.
GPU mem tx count (alloc/dealloc/set0/toDev/fromDev/evict):	10/0/11/0/2/1/0.
GPU conversion time  (sparseConv/sp2dense/dense2sp):	0.000/0.000/0.000 sec.
GPU conversion count (sparseConv/sp2dense/dense2sp):	0/0/0.
Cache hits (Mem, WB, FS, HDFS):	2/0/0/0.
Cache writes (WB, FS, HDFS):	3/0/1.
Cache times (ACQr/m, RLS, EXP):	0.000/0.000/0.001/0.040 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		0.577 sec.
Total JVM GC count:		0.
Total JVM GC time:		0.0 sec.
Heavy hitter instructions (name, time, count):
-- 1) 	gpu_solve 	3.118 sec 	1	
-- 2) 	rand 	0.392 sec 	2	
-- 3) 	write 	0.040 sec 	1	
-- 4) 	createvar 	0.001 sec 	3	
-- 5) 	rmvar 	0.000 sec 	3	

17/04/28 00:14:46 INFO api.DMLScript: END DML run 04/28/2017 00:14:46
➜  incubator-systemml git:(gpu_solve) ✗ bin/systemml solve.dml  -stats       
================================================================================
Output dir: /home/njindal/git/incubator-systemml/temp
================================================================================
17/04/28 00:14:53 INFO api.DMLScript: BEGIN DML run 04/28/2017 00:14:53
17/04/28 00:14:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/28 00:17:06 INFO api.DMLScript: SystemML Statistics:
Total elapsed time:		132.744 sec.
Total compilation time:		0.345 sec.
Total execution time:		132.398 sec.
Number of compiled MR Jobs:	0.
Number of executed MR Jobs:	0.
Cache hits (Mem, WB, FS, HDFS):	2/0/0/0.
Cache writes (WB, FS, HDFS):	3/0/1.
Cache times (ACQr/m, RLS, EXP):	0.000/0.000/0.001/0.031 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		1.233 sec.
Total JVM GC count:		2.
Total JVM GC time:		0.448 sec.
Heavy hitter instructions (name, time, count):
-- 1) 	solve 	131.950 sec 	1	
-- 2) 	rand 	0.413 sec 	2	
-- 3) 	write 	0.031 sec 	1	
-- 4) 	createvar 	0.001 sec 	3	
-- 5) 	rmvar 	0.000 sec 	3	

17/04/28 00:17:06 INFO api.DMLScript: END DML run 04/28/2017 00:17:06

@akchinSTC
Contributor

Build failed, see build log for details

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1411/

@mboehm7
Contributor

mboehm7 commented Apr 28, 2017

well, I could indeed imagine such a speedup as we're currently only calling out to commons-math. But solve is by far not the bottleneck in ALS or LinregDS (it is only called for tiny matrices, sized by the rank or the number of features).

@nakul02
Member Author

nakul02 commented Apr 28, 2017

@mboehm7 - understood; still, this PR provides value. The more operations in a loop that run on the GPU, the less data ping-pongs between host and device memory.

@mboehm7
Contributor

mboehm7 commented Apr 28, 2017

sure - this is absolutely fine; I'm just setting the expectations straight: for example for LinregDS, it's called once and is even for 1k features in the sub-second range. However, down the road, once we have a distributed solve, there might be more algorithms that could benefit from it.

@nakul02
Member Author

nakul02 commented Apr 28, 2017

Distributing solve is a great idea. In fact, that is exactly what @iyounus is trying to do in DML using the single-node builtin functions qr, cholesky, and lu. I think #368 is some work towards that. There was an earlier version which used parfor to do the distribution.
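For context, one of the decomposition routes mentioned above (the Cholesky route, which solves the normal equations the way LinregDS conceptually does) can be sketched in a few lines of NumPy. This is a sketch with illustrative sizes, not code from the PR:

```python
import numpy as np

# Least-squares solve via Cholesky on the normal equations:
#   (A^T A) x = A^T b,  with  A^T A = L L^T  (L lower-triangular).
rng = np.random.default_rng(1)
A = rng.random((200, 30))
b = rng.random((200, 1))

G = A.T @ A                # Gram matrix, SPD for full-rank A
L = np.linalg.cholesky(G)  # G = L @ L.T

# Two triangular solves: L y = A^T b, then L^T x = y.
# (np.linalg.solve is used for brevity; a real implementation would
# use dedicated triangular solvers.)
y = np.linalg.solve(L, A.T @ b)
x = np.linalg.solve(L.T, y)
```

The QR route avoids forming A^T A explicitly and is numerically safer for ill-conditioned A, which is why library solvers typically prefer it.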

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1412/

@deroneriksson
Member

LGTM. Since this is GPU work, I feel @nakul02 and @niketanpansare are the owners of this area and should merge and move forward for our 1.0.0 release.

@nakul02
Member Author

nakul02 commented Apr 29, 2017

thanks @deroneriksson !

@iyounus
Contributor

iyounus commented Apr 29, 2017

I've checked the results from GPU solve and they are correct.

@niketanpansare
Contributor

LGTM, Thanks Nakul 👍

@nakul02
Member Author

nakul02 commented May 1, 2017

Thanks, I shall merge.

@asfgit asfgit closed this in e8fbc75 May 1, 2017
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017