
[SYSTEMML-1034] implemented gpu solve#476

Closed
nakul02 wants to merge 3 commits into apache:master from nakul02:gpu_solve

Conversation

@nakul02
Member

@nakul02 nakul02 commented Apr 28, 2017

Implemented the GPU solve() function.
Ping @niketanpansare, @bertholdreinwald, @dusenberrymw

@iyounus - can you please try this out and also check for correctness? I've only checked on smaller data.

This should benefit several algorithms. Based on a simple grep, I see solve() being used in:

  • ALS-DS.dml
  • CsplineDS.dml
  • LinearRegDS.dml
  • StepLinearRegDS.dml

For me, I seem to get a 30x speedup in an example I tried on my own machine (Core i7 quad core, 32 GB RAM, GTX 1070).

Program:

m = 12345
n = 4321

A = rand(rows=m, cols=n)
B = rand(rows=m, cols=1)

x = solve(A,B)
write(x, "xout")
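For reference, here is a scaled-down NumPy analogue of the DML program above (the sizes here are illustrative, not the m=12345, n=4321 from the benchmark). `np.linalg.lstsq` stands in for SystemML's `solve()`, which for an overdetermined system returns the least-squares solution:

```python
import numpy as np

# Scaled-down analogue of the DML script above (hypothetical sizes).
m, n = 120, 40
rng = np.random.default_rng(0)
A = rng.random((m, n))
B = rng.random((m, 1))

# x minimizes ||Ax - B||_2, i.e. it satisfies the normal equations
# (A^T A) x = A^T B. NumPy's lstsq is a stand-in for solve(A, B).
x, residuals, rank, sv = np.linalg.lstsq(A, B, rcond=None)
print(x.shape)
```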

Output

➜  incubator-systemml git:(gpu_solve) ✗ bin/systemml solve.dml -gpu force -stats
================================================================================
Output dir: /home/njindal/git/incubator-systemml/temp
================================================================================
17/04/28 00:14:42 INFO api.DMLScript: BEGIN DML run 04/28/2017 00:14:42
17/04/28 00:14:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/28 00:14:42 INFO context.GPUContext: Initializing CUDA
17/04/28 00:14:43 INFO context.GPUContext:  GPU memory - Total: 8506.769408 MB, Available: 6889.2098559999995 MB on GPUContext{deviceNum=0}
17/04/28 00:14:43 INFO context.GPUContext: Total number of GPUs on the machine: 1
17/04/28 00:14:46 INFO api.DMLScript: SystemML Statistics:
Total elapsed time:		4.561 sec.
Total compilation time:		0.348 sec.
Total execution time:		4.213 sec.
Number of compiled MR Jobs:	0.
Number of executed MR Jobs:	0.
CUDA/CuLibraries init time:	0.660/0.514 sec.
Number of executed GPU inst:	1.
GPU mem tx time  (alloc/dealloc/set0/toDev/fromDev):	0.004/0.000/0.000/0.051/0.000 sec.
GPU mem tx count (alloc/dealloc/set0/toDev/fromDev/evict):	10/0/11/0/2/1/0.
GPU conversion time  (sparseConv/sp2dense/dense2sp):	0.000/0.000/0.000 sec.
GPU conversion count (sparseConv/sp2dense/dense2sp):	0/0/0.
Cache hits (Mem, WB, FS, HDFS):	2/0/0/0.
Cache writes (WB, FS, HDFS):	3/0/1.
Cache times (ACQr/m, RLS, EXP):	0.000/0.000/0.001/0.040 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		0.577 sec.
Total JVM GC count:		0.
Total JVM GC time:		0.0 sec.
Heavy hitter instructions (name, time, count):
-- 1) 	gpu_solve 	3.118 sec 	1	
-- 2) 	rand 	0.392 sec 	2	
-- 3) 	write 	0.040 sec 	1	
-- 4) 	createvar 	0.001 sec 	3	
-- 5) 	rmvar 	0.000 sec 	3	

17/04/28 00:14:46 INFO api.DMLScript: END DML run 04/28/2017 00:14:46
➜  incubator-systemml git:(gpu_solve) ✗ bin/systemml solve.dml  -stats       
================================================================================
Output dir: /home/njindal/git/incubator-systemml/temp
================================================================================
17/04/28 00:14:53 INFO api.DMLScript: BEGIN DML run 04/28/2017 00:14:53
17/04/28 00:14:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/28 00:17:06 INFO api.DMLScript: SystemML Statistics:
Total elapsed time:		132.744 sec.
Total compilation time:		0.345 sec.
Total execution time:		132.398 sec.
Number of compiled MR Jobs:	0.
Number of executed MR Jobs:	0.
Cache hits (Mem, WB, FS, HDFS):	2/0/0/0.
Cache writes (WB, FS, HDFS):	3/0/1.
Cache times (ACQr/m, RLS, EXP):	0.000/0.000/0.001/0.031 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		1.233 sec.
Total JVM GC count:		2.
Total JVM GC time:		0.448 sec.
Heavy hitter instructions (name, time, count):
-- 1) 	solve 	131.950 sec 	1	
-- 2) 	rand 	0.413 sec 	2	
-- 3) 	write 	0.031 sec 	1	
-- 4) 	createvar 	0.001 sec 	3	
-- 5) 	rmvar 	0.000 sec 	3	

17/04/28 00:17:06 INFO api.DMLScript: END DML run 04/28/2017 00:17:06

@akchinSTC
Contributor

Build failed, see build log for details

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1411/

@mboehm7
Contributor

mboehm7 commented Apr 28, 2017

well, I could indeed imagine such a speedup as we're currently only calling out to commons-math. But solve is by far not the bottleneck in ALS or LinregDS (it is only called for tiny matrices, sized by the rank or the number of features).

@nakul02
Member Author

nakul02 commented Apr 28, 2017

@mboehm7 - understood; still, this PR provides value. The more operations in a loop that run on the GPU, the less data ping-pongs between host and device memory.

@mboehm7
Contributor

mboehm7 commented Apr 28, 2017

sure - this is absolutely fine; I'm just setting the expectations straight: for example for LinregDS, it's called once and is even for 1k features in the sub-second range. However, down the road, once we have a distributed solve, there might be more algorithms that could benefit from it.

@nakul02
Member Author

nakul02 commented Apr 28, 2017

Distributing solve is a great idea. In fact, that is exactly what @iyounus is trying to do in DML using the single-node builtin functions qr, cholesky, and lu. I think #368 is some work towards that. There was an earlier version which used parfor to do the distribution.
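For context, one of the decomposition routes mentioned above (the Cholesky route, which solves the normal equations the way LinregDS conceptually does) can be sketched in a few lines of NumPy. This is a sketch with illustrative sizes, not code from the PR:

```python
import numpy as np

# Least-squares solve via Cholesky on the normal equations:
#   (A^T A) x = A^T b,  with  A^T A = L L^T  (L lower-triangular).
rng = np.random.default_rng(1)
A = rng.random((200, 30))
b = rng.random((200, 1))

G = A.T @ A                # Gram matrix, SPD for full-rank A
L = np.linalg.cholesky(G)  # G = L @ L.T

# Two triangular solves: L y = A^T b, then L^T x = y.
# (np.linalg.solve is used for brevity; a real implementation would
# use dedicated triangular solvers.)
y = np.linalg.solve(L, A.T @ b)
x = np.linalg.solve(L.T, y)
```

The QR route avoids forming A^T A explicitly and is numerically safer for ill-conditioned A, which is why library solvers typically prefer it.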

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1412/

@deroneriksson
Member

LGTM. Since this is GPU work, I feel @nakul02 and @niketanpansare are the owners of this area and should merge and move forward for our 1.0.0 release.

@nakul02
Member Author

nakul02 commented Apr 29, 2017

thanks @deroneriksson !

@iyounus
Contributor

iyounus commented Apr 29, 2017

I've checked the results from GPU solve and they are correct.

@niketanpansare
Contributor

LGTM, Thanks Nakul 👍

@nakul02
Member Author

nakul02 commented May 1, 2017

Thanks, I shall merge.

@asfgit asfgit closed this in e8fbc75 May 1, 2017
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017