New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3066][MLLIB] Support recommendAll in matrix factorization model #5829

Closed
wants to merge 18 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@mengxr
Contributor

mengxr commented May 1, 2015

This is based on #3098 from @debasish83.

  1. BLAS' GEMM is used to compute inner products.
  2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
  3. Fixed a bug in topByKey

Closes #3098

@debasish83 @coderxiang

@debasish83

This comment has been minimized.

Show comment
Hide comment
@debasish83

debasish83 May 1, 2015

@mengxr looks good to me...I will fix SPARK-4321 based on this merge...I need blockify for rowSimilarities (tall skinny sparse matrices for row similarities)...should we extract it out to IndexedRow ? I can do that cleanup in my row similarities PR...

debasish83 commented May 1, 2015

@mengxr looks good to me...I will fix SPARK-4321 based on this merge...I need blockify for rowSimilarities (tall skinny sparse matrices for row similarities)...should we extract it out to IndexedRow ? I can do that cleanup in my row similarities PR...

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA May 1, 2015

Test build #31524 has finished for PR 5829 at commit 49953de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 1, 2015

Test build #31524 has finished for PR 5829 at commit 49953de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA May 1, 2015

Test build #31525 has finished for PR 5829 at commit 336202d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 1, 2015

Test build #31525 has finished for PR 5829 at commit 336202d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA May 1, 2015

Test build #31530 has finished for PR 5829 at commit 389b381.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 1, 2015

Test build #31530 has finished for PR 5829 at commit 389b381.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
num: Int): RDD[(Int, Array[(Int, Double)])] = {
val srcBlocks = blockify(rank, srcFeatures)
val dstBlocks = blockify(rank, dstFeatures)
val ratings = srcBlocks.cartesian(dstBlocks).flatMap {

This comment has been minimized.

@debasish83

debasish83 May 1, 2015

Normally items are skinny ~ 1M...and ranks are low...50...so 1Mx50 bytes ~ 50 MB...with 8M products, its 400 MB...I still think that cartesian will be slower than the version I added in terms of runtime....did you run any benchmark with the old code ?

@debasish83

debasish83 May 1, 2015

Normally items are skinny ~ 1M...and ranks are low...50...so 1Mx50 bytes ~ 50 MB...with 8M products, its 400 MB...I still think that cartesian will be slower than the version I added in terms of runtime....did you run any benchmark with the old code ?

This comment has been minimized.

@mengxr

mengxr May 1, 2015

Contributor

That depends on the data. It is also common to have near-squared rating matrix. This should provide similar performance if the items/products are not super small, but I didn't test the performance. The advantage is that this approach doesn't touch the driver, so it could be more scalable.

@mengxr

mengxr May 1, 2015

Contributor

That depends on the data. It is also common to have near-squared rating matrix. This should provide similar performance if the items/products are not super small, but I didn't test the performance. The advantage is that this approach doesn't touch the driver, so it could be more scalable.

This comment has been minimized.

@debasish83

debasish83 May 1, 2015

I also like it better as it should scale fine assuming cartesian keys are under control...say to 100M x 10M with 400 factors....

@debasish83

debasish83 May 1, 2015

I also like it better as it should scale fine assuming cartesian keys are under control...say to 100M x 10M with 400 factors....

@SparkQA

This comment has been minimized.

Show comment
Hide comment
@SparkQA

SparkQA May 1, 2015

Test build #31541 has finished for PR 5829 at commit 22e6a87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 1, 2015

Test build #31541 has finished for PR 5829 at commit 22e6a87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@mengxr

This comment has been minimized.

Show comment
Hide comment
@mengxr

mengxr May 1, 2015

Contributor

Okay, I'm merging this in. Please submit a PR for SPARK-4231 if you get time to work on it. Thanks!

Contributor

mengxr commented May 1, 2015

Okay, I'm merging this in. Please submit a PR for SPARK-4231 if you get time to work on it. Thanks!

@asfgit asfgit closed this in 3b514af May 1, 2015

jeanlyn added a commit to jeanlyn/spark that referenced this pull request May 28, 2015

[SPARK-3066] [MLLIB] Support recommendAll in matrix factorization model
This is based on #3098 from debasish83.

1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~

Closes #3098

debasish83 coderxiang

Author: Debasish Das <debasish.das@one.verizon.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5829 from mengxr/SPARK-3066 and squashes the following commits:

22e6a87 [Xiangrui Meng] topByKey was correct. update its usage
389b381 [Xiangrui Meng] fix indentation
49953de [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3066
cb9799a [Xiangrui Meng] revert MovieLensALS
f864f5e [Xiangrui Meng] update test and fix a bug in topByKey
c5e0181 [Xiangrui Meng] use GEMM and topByKey
3a0c4eb [Debasish Das] updated with spark master
98fa424 [Debasish Das] updated with master
ee99571 [Debasish Das] addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
3f97c49 [Debasish Das] fixed spark coding style for imports
7163a5c [Debasish Das] Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
d144f57 [Debasish Das] recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
f38a1b5 [Debasish Das] use sampleByKey for per user sampling
10cbb37 [Debasish Das] provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
9fa063e [Debasish Das] import scala.math.round
4bbae0f [Debasish Das] comments fixed as per scalastyle
cd3ab31 [Debasish Das] merged with AbstractParams serialization bug
9b3951f [Debasish Das] validate user/product on MovieLens dataset through user input and compute map measure along with rmse

jeanlyn added a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015

[SPARK-3066] [MLLIB] Support recommendAll in matrix factorization model
This is based on #3098 from debasish83.

1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~

Closes #3098

debasish83 coderxiang

Author: Debasish Das <debasish.das@one.verizon.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5829 from mengxr/SPARK-3066 and squashes the following commits:

22e6a87 [Xiangrui Meng] topByKey was correct. update its usage
389b381 [Xiangrui Meng] fix indentation
49953de [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3066
cb9799a [Xiangrui Meng] revert MovieLensALS
f864f5e [Xiangrui Meng] update test and fix a bug in topByKey
c5e0181 [Xiangrui Meng] use GEMM and topByKey
3a0c4eb [Debasish Das] updated with spark master
98fa424 [Debasish Das] updated with master
ee99571 [Debasish Das] addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
3f97c49 [Debasish Das] fixed spark coding style for imports
7163a5c [Debasish Das] Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
d144f57 [Debasish Das] recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
f38a1b5 [Debasish Das] use sampleByKey for per user sampling
10cbb37 [Debasish Das] provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
9fa063e [Debasish Das] import scala.math.round
4bbae0f [Debasish Das] comments fixed as per scalastyle
cd3ab31 [Debasish Das] merged with AbstractParams serialization bug
9b3951f [Debasish Das] validate user/product on MovieLens dataset through user input and compute map measure along with rmse

nemccarthy added a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

[SPARK-3066] [MLLIB] Support recommendAll in matrix factorization model
This is based on #3098 from debasish83.

1. BLAS' GEMM is used to compute inner products.
2. Reverted changes to MovieLensALS. SPARK-4231 should be addressed in a separate PR.
3. ~~Fixed a bug in topByKey~~

Closes #3098

debasish83 coderxiang

Author: Debasish Das <debasish.das@one.verizon.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5829 from mengxr/SPARK-3066 and squashes the following commits:

22e6a87 [Xiangrui Meng] topByKey was correct. update its usage
389b381 [Xiangrui Meng] fix indentation
49953de [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3066
cb9799a [Xiangrui Meng] revert MovieLensALS
f864f5e [Xiangrui Meng] update test and fix a bug in topByKey
c5e0181 [Xiangrui Meng] use GEMM and topByKey
3a0c4eb [Debasish Das] updated with spark master
98fa424 [Debasish Das] updated with master
ee99571 [Debasish Das] addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization
3f97c49 [Debasish Das] fixed spark coding style for imports
7163a5c [Debasish Das] Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split
d144f57 [Debasish Das] recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top
f38a1b5 [Debasish Das] use sampleByKey for per user sampling
10cbb37 [Debasish Das] provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset
9fa063e [Debasish Das] import scala.math.round
4bbae0f [Debasish Das] comments fixed as per scalastyle
cd3ab31 [Debasish Das] merged with AbstractParams serialization bug
9b3951f [Debasish Das] validate user/product on MovieLens dataset through user input and compute map measure along with rmse
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment