[SPARK-3066][MLLIB] Support recommendAll in matrix factorization model #5829

mengxr · 2015-05-01T06:27:01Z

This is based on #3098 from @debasish83.

1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~
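
As a rough illustration of the GEMM point above: scoring a block of users against a block of items is a single matrix product, S = U * V^T, where each entry is one user-item inner product. The sketch below is plain Python standing in for a BLAS call; it is not the code in this PR, and `score_block` and the sample factors are hypothetical names and values chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def score_block(user_factors: Matrix, item_factors: Matrix) -> Matrix:
    """Return S = U * V^T, where S[i][j] is the inner product of
    user row i and item row j. A BLAS GEMM call computes the same
    product in one shot; this loop version only shows the arithmetic."""
    if not user_factors or not item_factors:
        return []
    rank = len(user_factors[0])
    # Every factor vector must have the same rank for the product to be defined.
    assert all(len(row) == rank for row in user_factors + item_factors)
    return [
        [sum(u[k] * v[k] for k in range(rank)) for v in item_factors]
        for u in user_factors
    ]


# Hypothetical rank-2 factors: 2 users, 3 items.
users = [[1.0, 0.0], [0.5, 2.0]]
items = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
scores = score_block(users, items)
```

Batching the inner products this way is what lets an optimized GEMM routine replace many separate dot-product calls.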

Closes #3098

@debasish83 @coderxiang


debasish83 · 2015-05-01T06:44:31Z

@mengxr looks good to me. I will fix SPARK-4231 based on this merge. I need blockify for rowSimilarities (row similarities on tall, skinny sparse matrices). Should we extract it out to IndexedRow? I can do that cleanup in my row similarities PR.

SparkQA · 2015-05-01T07:35:13Z

Test build #31524 has finished for PR 5829 at commit 49953de.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

debasish83 · 2015-05-01T07:37:50Z

mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala

I have to look closely into it tomorrow. I have been using topByKey internally and do not remember seeing this bug.

Yup, the topByKey behavior as implemented was correct.

It was correct. Pushed an update.
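
For readers unfamiliar with the operation under discussion, here is a minimal sketch of a top-by-key step that keeps only the `num` largest values per key using a bounded min-heap, in the spirit of the BoundedPriorityQueue approach mentioned in the commit messages. It is a single-process Python stand-in, not the MLPairRDDFunctions implementation; `top_by_key` and the sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def top_by_key(pairs, num):
    """For each key, keep the `num` largest values, returned in descending order.

    Each per-key heap is a min-heap capped at `num` entries, so its root is
    always the smallest value currently kept and is the one to evict."""
    if num <= 0:
        return {}
    heaps = defaultdict(list)
    for key, value in pairs:
        heap = heaps[key]
        if len(heap) < num:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # The new value beats the current minimum, so swap it in.
            heapq.heapreplace(heap, value)
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}


# Hypothetical (user, (score, item)) pairs.
pairs = [
    ("u1", (0.9, "a")),
    ("u1", (0.2, "b")),
    ("u1", (0.7, "c")),
    ("u2", (0.4, "a")),
]
top2 = top_by_key(pairs, 2)
```

Memory per key stays bounded by `num` regardless of how many values arrive for that key.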

SparkQA · 2015-05-01T07:40:46Z

Test build #31525 has finished for PR 5829 at commit 336202d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-01T08:14:15Z

Test build #31530 has finished for PR 5829 at commit 389b381.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

debasish83 · 2015-05-01T08:25:41Z

mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala

Normally the item side is skinny (~1M) and ranks are low (~50), so 1M x 50 bytes ~ 50 MB; with 8M products, it's 400 MB. I still think that cartesian will be slower than the version I added in terms of runtime. Did you run any benchmark with the old code?

That depends on the data. It is also common to have a near-square rating matrix. This should provide similar performance if the items/products are not super small, but I didn't test the performance. The advantage is that this approach doesn't touch the driver, so it could be more scalable.

I also like it better, as it should scale fine assuming the cartesian keys are under control, say to 100M x 10M with 400 factors.
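
The approach being compared here can be sketched roughly as follows: group user and item factors into blocks, pair every user block with every item block (the cartesian step), score each pair, and keep a bounded top-k per user. This is a single-process Python sketch under those assumptions, not the Spark implementation; `blockify` mirrors the name used earlier in the thread, while `recommend_all`, the block size, and the sample factors are hypothetical.

```python
import heapq
from itertools import product


def blockify(factors, block_size):
    """Split a list of (id, vector) pairs into chunks of at most block_size."""
    return [factors[i:i + block_size] for i in range(0, len(factors), block_size)]


def recommend_all(user_factors, item_factors, num, block_size=2):
    """Return {user_id: [(item_id, score), ...]} with the top `num` items per user."""
    if num <= 0:
        return {}
    best = {}  # user id -> min-heap of (score, item id), capped at `num` entries
    user_blocks = blockify(user_factors, block_size)
    item_blocks = blockify(item_factors, block_size)
    # Cartesian product of blocks: every user block meets every item block once.
    for user_block, item_block in product(user_blocks, item_blocks):
        for user_id, u in user_block:
            heap = best.setdefault(user_id, [])
            for item_id, v in item_block:
                score = sum(a * b for a, b in zip(u, v))
                if len(heap) < num:
                    heapq.heappush(heap, (score, item_id))
                elif (score, item_id) > heap[0]:
                    heapq.heapreplace(heap, (score, item_id))
    return {
        user_id: [(item_id, score) for score, item_id in sorted(heap, reverse=True)]
        for user_id, heap in best.items()
    }


# Hypothetical rank-2 factors: 3 users, 3 items.
user_factors = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]
item_factors = [(10, [2.0, 0.0]), (11, [0.0, 3.0]), (12, [1.0, 0.5])]
recs = recommend_all(user_factors, item_factors, num=2)
```

The number of block pairs is (users / block_size) x (items / block_size), which is the quantity the "cartesian keys under control" remark refers to: larger blocks mean fewer pairs but more memory per pair.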

SparkQA · 2015-05-01T09:33:05Z

Test build #31541 has finished for PR 5829 at commit 22e6a87.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-05-01T15:27:35Z

Okay, I'm merging this in. Please submit a PR for SPARK-4231 if you get time to work on it. Thanks!

This is based on apache#3098 from debasish83.

1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~

Closes apache#3098

debasish83 coderxiang

Author: Debasish Das <debasish.das@one.verizon.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5829 from mengxr/SPARK-3066 and squashes the following commits:

22e6a87 [Xiangrui Meng] topByKey was correct. update its usage
389b381 [Xiangrui Meng] fix indentation
49953de [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3066
cb9799a [Xiangrui Meng] revert MovieLensALS
f864f5e [Xiangrui Meng] update test and fix a bug in topByKey
c5e0181 [Xiangrui Meng] use GEMM and topByKey
3a0c4eb [Debasish Das] updated with spark master
98fa424 [Debasish Das] updated with master
ee99571 [Debasish Das] addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
3f97c49 [Debasish Das] fixed spark coding style for imports
7163a5c [Debasish Das] Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
d144f57 [Debasish Das] recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
f38a1b5 [Debasish Das] use sampleByKey for per user sampling
10cbb37 [Debasish Das] provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
9fa063e [Debasish Das] import scala.math.round
4bbae0f [Debasish Das] comments fixed as per scalastyle
cd3ab31 [Debasish Das] merged with AbstractParams serialization bug
9b3951f [Debasish Das] validate user/product on MovieLens dataset through user input and compute map measure along with rmse

Debasish Das and others added 16 commits November 4, 2014 17:23

9b3951f validate user/product on MovieLens dataset through user input and compute map measure along with rmse
cd3ab31 merged with AbstractParams serialization bug
4bbae0f comments fixed as per scalastyle
9fa063e import scala.math.round
10cbb37 provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
f38a1b5 use sampleByKey for per user sampling
d144f57 recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
7163a5c Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
3f97c49 fixed spark coding style for imports
ee99571 addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
98fa424 updated with master
3a0c4eb updated with spark master
c5e0181 use GEMM and topByKey
f864f5e update test and fix a bug in topByKey
cb9799a revert MovieLensALS
49953de Merge remote-tracking branch 'apache/master' into SPARK-3066
389b381 fix indentation

mengxr force-pushed the SPARK-3066 branch from 336202d to 389b381 on May 1, 2015 07:05

debasish83 reviewed May 1, 2015

22e6a87 topByKey was correct. update its usage

debasish83 reviewed May 1, 2015

asfgit closed this in 3b514af May 1, 2015
