[SPARK-3541][MLLIB] New ALS implementation with improved storage #3720

Closed
wants to merge 21 commits

Conversation

@mengxr (Contributor) commented Dec 17, 2014

This PR adds a new ALS implementation to spark.ml using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under spark.mllib, the new implementation

  1. uses the same algorithm,
  2. uses float type for ratings,
  3. uses primitive arrays to avoid GC,
  4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance.
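
To illustrate points 3 and 4, here is a minimal, hypothetical sketch of the per-block normal-equation idea, using illustrative names (`NormalEquationSketch`, `ata`, `atb`) rather than the actual `spark.ml` classes; a Cholesky solve of the accumulated system, omitted here, would finish each least-squares subproblem:

```scala
// Hypothetical sketch (not the actual spark.ml code): one reusable normal-equation
// accumulator per block, backed by primitive arrays, so every least-squares
// subproblem min_x ||A x - b||^2 reuses the same buffers.
class NormalEquationSketch(val k: Int) {
  // Packed lower-triangular part of A^T A (row-major), plus A^T b.
  val ata = new Array[Double](k * (k + 1) / 2)
  val atb = new Array[Double](k)

  /** Adds one rating b with the other side's factor vector a (ratings kept as Float). */
  def add(a: Array[Float], b: Float): this.type = {
    require(a.length == k)
    var i = 0
    var pos = 0
    while (i < k) {
      var j = 0
      while (j <= i) {
        ata(pos) += a(i).toDouble * a(j)
        pos += 1
        j += 1
      }
      atb(i) += a(i).toDouble * b
      i += 1
    }
    this
  }

  /** Clears the buffers so the same instance can be used for the next subproblem. */
  def reset(): Unit = {
    java.util.Arrays.fill(ata, 0.0)
    java.util.Arrays.fill(atb, 0.0)
  }
}
```

Because the buffers are flat primitive arrays that are reset between subproblems, solving an entire block allocates essentially no per-rating objects, which is what keeps GC pressure low (point 3).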

The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html):
![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png)

I keep the spark.mllib's ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under spark.mllib and then make it a wrapper of the new implementation, in a separate PR.

TODO:

  • Add unit tests for implicit preferences.

@SparkQA commented Dec 17, 2014

Test build #24537 has finished for PR 3720 at commit 3f2d81a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(user: Int, product: Int, rating: Float)
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • trait ParquetTest

@SparkQA commented Dec 19, 2014

Test build #24653 has finished for PR 3720 at commit 4937fd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams

@SparkQA commented Dec 30, 2014

Test build #24911 has finished for PR 3720 at commit 213d163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams

@SparkQA commented Jan 8, 2015

Test build #25195 has finished for PR 3720 at commit 2a8deb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

@mengxr changed the title from "[WIP][SPARK-3541][MLLIB] New ALS implementation with improved storage" to "[SPARK-3541][MLLIB] New ALS implementation with improved storage" on Jan 8, 2015

@mengxr (Contributor, Author) commented Jan 8, 2015

@srowen @coderxiang This PR is almost ready, pending a few unit tests. Would you be interested in reviewing the code?

@SparkQA commented Jan 8, 2015

Test build #25207 has finished for PR 3720 at commit a76da7b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

    val err = rating.toDouble - prediction
    val err2 = err * err
    if (err2.isNaN) {
      Iterator.empty

Member:

Tiny: would it be clearer to return Some and None? This works too of course.

Contributor Author:

Done.
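
For context, a hypothetical, self-contained version of the Some/None variant discussed above (`ratesAndPreds`, `rating`, and `prediction` are illustrative names, not the PR's actual test code):

```scala
// Collect squared errors while skipping NaN predictions, then compute the MSE.
val ratesAndPreds: Seq[(Float, Double)] = Seq((3.0f, 2.5), (4.0f, Double.NaN))
val errors = ratesAndPreds.flatMap { case (rating, prediction) =>
  val err = rating.toDouble - prediction
  val err2 = err * err
  if (err2.isNaN) None else Some(err2) // drop NaN entries instead of emitting them
}
val mse = errors.sum / errors.size
```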

@SparkQA commented Jan 9, 2015

Test build #25337 has finished for PR 3720 at commit b84f41c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

@SparkQA commented Jan 10, 2015

Test build #25353 has finished for PR 3720 at commit dd0d0e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

@mengxr (Contributor, Author) commented Jan 13, 2015

test this please

@SparkQA commented Jan 13, 2015

Test build #25485 has finished for PR 3720 at commit dd0d0e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

@mengxr (Contributor, Author) commented Jan 19, 2015

test this please

@mengxr (Contributor, Author) commented Jan 19, 2015

@srowen @coderxiang Any thoughts on the implementation?

@SparkQA commented Jan 19, 2015

Test build #25765 has finished for PR 3720 at commit dd0d0e8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

@SparkQA commented Jan 19, 2015

Test build #25767 has finished for PR 3720 at commit 1b9e852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    • case class Movie(movieId: Int, title: String, genres: Seq[String])
    • case class Params(
    • class ALS extends Estimator[ALSModel] with ALSParams
    • case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], ratings: Array[Float])

@mengxr (Contributor, Author) commented Jan 22, 2015

@srowen @coderxiang Do you have more comments? I'm thinking about merging this and then porting nonnegative support. After that, we can replace the ALS implementation under "spark.mllib".

@srowen (Member) commented Jan 22, 2015

@mengxr Given how familiar you are with this implementation and the tests, I can only be pretty sure it works. I didn't see any style issues, and I thought through some of the loops for speed/correctness, but every spot check was fine. It is looking good to me.

@coderxiang (Contributor) commented:

@mengxr The logic and style also look good to me.

@mengxr (Contributor, Author) commented Jan 23, 2015

Thanks! I've merged this into master. I'll send follow-up PRs soon.

@asfgit closed this in ea74365 on Jan 23, 2015
scwf pushed a commit to scwf/spark that referenced this pull request Jan 25, 2015
commit ea74365
Author: Xiangrui Meng <meng@databricks.com>
Date:   Thu Jan 22 22:09:13 2015 -0800

    [SPARK-3541][MLLIB] New ALS implementation with improved storage

    This PR adds a new ALS implementation to `spark.ml` using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under `spark.mllib`, the new implementation

    1. uses the same algorithm,
    2. uses float type for ratings,
    3. uses primitive arrays to avoid GC,
    4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance.

    The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html):
    ![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png)

    I keep the `spark.mllib`'s ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under `spark.mllib` and then make it a wrapper of the new implementation, in a separate PR.

    TODO:
    - [X] Add unit tests for implicit preferences.

    Author: Xiangrui Meng <meng@databricks.com>

    Closes apache#3720 from mengxr/SPARK-3541 and squashes the following commits:

    1b9e852 [Xiangrui Meng] fix compile
    5129be9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3541
    dd0d0e8 [Xiangrui Meng] simplify test code
    c627de3 [Xiangrui Meng] add tests for implicit feedback
    b84f41c [Xiangrui Meng] address comments
    a76da7b [Xiangrui Meng] update ALS tests
    2a8deb3 [Xiangrui Meng] add some ALS tests
    857e876 [Xiangrui Meng] add tests for rating block and encoded block
    d3c1ac4 [Xiangrui Meng] rename some classes for better code readability add more doc and comments
    213d163 [Xiangrui Meng] org imports
    771baf3 [Xiangrui Meng] chol doc update
    ca9ad9d [Xiangrui Meng] add unit tests for chol
    b4fd17c [Xiangrui Meng] add unit tests for NormalEquation
    d0f99d3 [Xiangrui Meng] add tests for LocalIndexEncoder
    80b8e61 [Xiangrui Meng] fix imports
    4937fd4 [Xiangrui Meng] update ALS example
    56c253c [Xiangrui Meng] rename product to item
    bce8692 [Xiangrui Meng] doc for parameters and project the output columns
    3f2d81a [Xiangrui Meng] add doc
    1efaecf [Xiangrui Meng] add example code
    8ae86b5 [Xiangrui Meng] add a working copy of the new ALS implementation

@hy2014 commented Apr 15, 2015

Hi, we ran the ALS example with the same data and the same input arguments but got different results: Spark 1.2.0 returns the data we expect, while Spark 1.3.0 does not return the right data; the userFeature matrix comes back all zeros.
