
SPARK-2465. Use long as user / item ID for ALS #1393

Closed
wants to merge 2 commits

Conversation

@srowen
Member

srowen commented Jul 13, 2014

I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation.

The main reason is that identifiers are generally not numeric at all, and so must be hashed to an integer. (How the hashing happens is a separate issue.) Hashing to 32 bits makes collisions likely after hundreds of thousands of users and items, which is not an unrealistic scale; hashing to 64 bits pushes that threshold back to billions.
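
For concreteness, here is a minimal sketch (not part of this patch; the helper name and seeds are illustrative) of hashing an arbitrary string identifier to 64 bits by combining two independently seeded 32-bit hashes:

import scala.util.hashing.MurmurHash3

// Illustrative helper, not part of this patch: derive a 64-bit ID from an
// arbitrary string key by combining two independently seeded 32-bit hashes.
def hashId(id: String): Long = {
  val hi = MurmurHash3.stringHash(id, 1)
  val lo = MurmurHash3.stringHash(id, 2)
  (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
}

// Birthday bound: collisions become likely around sqrt(N) distinct keys,
// i.e. sqrt(2^32) ~ 65,000 for Int hashes vs sqrt(2^64) ~ 4 billion for Long.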

It would also mean that numeric IDs which happen to be larger than the largest Int could be used directly as identifiers.

On the downside, of course: each ID takes 8 bytes instead of 4, adding 8 bytes of memory per Rating.

Thoughts?

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull

@witgo
Contributor

witgo commented Jul 13, 2014

How much does memory use increase overall? Is there a detailed comparison?

@srowen
Member Author

srowen commented Jul 13, 2014

I think the most significant change is to the Rating object. It goes from 8 (ref) + 8 (object header) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 (object header) + 8 (long) + 8 (long) + 8 (double) = 40 bytes. At, say, a billion ratings, that is roughly 8 GB of additional heap before serialization.

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1393:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull

@@ -53,7 +53,7 @@ class MatrixFactorizationModel private[mllib] (
    * @param usersProducts RDD of (user, product) pairs.
    * @return RDD of Ratings.
    */
-  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
+  def predict(usersProducts: RDD[(Long, Long)]): RDD[Rating] = {
Contributor

Unfortunately this is a breaking API change, and because of type erasure, users who compiled against an old MLlib (passing RDD[(Int, Int)]) will have their code link against the new version and then get a ClassCastException at runtime.
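
A minimal, self-contained sketch of this hazard (illustrative, using a plain List in place of an RDD):

object ErasureDemo {
  // After the change this method erases to predict(List) -- the exact JVM
  // signature the old List[(Int, Int)] version had, so old callers still link.
  def predict(usersProducts: List[(Long, Long)]): Unit =
    usersProducts.foreach { case (u, p) => println(u + p) }

  def main(args: Array[String]): Unit = {
    // Simulates bytecode compiled against the old Int-based signature; the
    // cast is unchecked due to erasure, so the failure only surfaces when
    // the tuple elements are unboxed inside predict:
    val fromOldCaller = List((1, 2)).asInstanceOf[List[(Long, Long)]]
    predict(fromOldCaller) // ClassCastException: Integer cannot be cast to Long
  }
}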

Member Author

I had understood all of this to be an @Experimental API, though it is not consistently marked. For example, Rating is marked experimental, but it is bound up with this non-experimental API here.

Contributor

Interesting, I think Rating shouldn't have been marked experimental. We should definitely fix that.

@mateiz
Contributor

mateiz commented Jul 15, 2014

I'm somewhat worried about this because it seems difficult to do without breaking API changes. Do users really want to rely on luck to avoid collisions? It might be better to add some efficient way to assign unique IDs.
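
For example (a hedged sketch, not something this PR adds; names are illustrative), Spark's existing RDD.zipWithUniqueId can build a collision-free mapping from raw keys to Long IDs:

import org.apache.spark.rdd.RDD

// Illustrative only: assign each distinct raw identifier a unique Long.
// Callers would persist this mapping and reuse it at prediction time.
def assignIds(rawIds: RDD[String]): RDD[(String, Long)] =
  rawIds.distinct().zipWithUniqueId()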

@ScrapCodes
Member

I am trying to figure out why it happened. This may not be the final conclusion, but at the moment I believe that because this class has a private[mllib] constructor, there is an entry in the ignores file, org.apache.spark.mllib.recommendation.MatrixFactorizationModel.<init>, and that entry makes the whole class ignored by the MiMa tool.

@ScrapCodes
Member

And to my surprise, it also has org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict; I'm not sure why that entry is there.

@ScrapCodes
Member

Ah, understood (please ignore my previous theory). It happened because there is a @DeveloperApi function with the same name in the same class, so our GenerateMimaIgnoreTool added it to the ignores file, and as a result the MiMa check for all predict methods was disabled. I'm not sure whether MiMa can disambiguate them; I will check.

@srowen
Member Author

srowen commented Jul 15, 2014

Yes, you could also tell callers to track their own user-ID mapping and maintain it consistently everywhere; callers then have to share that state somehow. Hashing is easier, and 64 bits makes it work for practical purposes.

A caller has to do something like one of these to deal with real-world identifiers, because an Int ID API by itself doesn't quite work. This is an instance of a meta-concern of mine: an API which (from my perspective) is going to be problematic at scale is already unchangeable before being battle-tested. (I actually thought all of MLlib was de facto @Experimental?)

That said, you can layer on other APIs to fix it, or use @deprecated in cases like this to keep existing methods while adding new signatures. I think that would be the simplest solution to this particular concern.

The question of serialized size is still out there. That is worth weighing in on.

@mateiz
Contributor

mateiz commented Jul 15, 2014

No, MLlib is not experimental; only the parts annotated with @Experimental are. The reason is that we felt we could continue supporting these low-level APIs indefinitely and add other ones later if we need to. Again, for real users, API stability matters much more than you'd think -- there's nothing more annoying than having to change your app just to take a software upgrade, and it causes fragmentation of the user base as users stick to an older version instead of upgrading.

In this particular case, there are a few things we can do. We can think of additions to the API here that preserve the old methods but add new versions of predict. We can add a new class called LongALS or something like that, have the existing classes call into it, and get back a LongMatrixFactorizationModel. Or we can offer a utility to generate unique IDs.

The reason I was asking about hash collisions above is that even with 64-bit IDs, you're not guaranteed to be collision-free. With 2-3 billion users you actually have a good chance of a collision. So if applications care about that, they may not be okay with this solution either.

@srowen
Member Author

srowen commented Jul 15, 2014

Yeah, API stability is very important. I keep banging on about the flip side: freezing an API that may still need to change creates a different, equally important problem. I'm sure everyone gets that; it's a judgment call and a trade-off.

I will change the PR to preserve the existing methods and add new ones. That's the thing we can consider and merge or not. I'm not offended if nobody else is feeling this one; I can always fork/wrap this aspect to fit what I need it to do. (And I have other API suggestions I'd rather spend time on, if anything.)

I wouldn't want to add the overhead of a separate set of implementations just for 64 bit values. Users would have a hard time understanding the difference and choosing.

3 billion people is a lot! It could happen, yes. Maybe not with people but with, say, URLs. If collisions mattered much, then with many billions of things, you can't use the ALS implementation as it stands, since most IDs would collide no matter how you map or hash. That's the best motivation I can offer for this change.

…DD[(Long,Long)]) becomes predictRDD because it can't overload predict(RDD[(Int,Int)])
@SparkQA

SparkQA commented Jul 15, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16694/consoleFull

@mateiz
Contributor

mateiz commented Jul 15, 2014

I didn't suggest having a new implementation for long IDs, only a new API. They can run on the same implementation (e.g. the current Int-based one transforms the Ints to Longs and calls that one). This is a much more sensible way to evolve the API and it's very common in other software. All our MLlib APIs were designed to support this kind of evolution (e.g. you set your parameters using a builder pattern, where we can add new methods, and the top-level API is just functions you can call that we can easily map to more complex versions of the functions).
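
As a concrete illustration of that evolution-friendly style, the existing entry points already look roughly like this (a sketch; the parameter values are arbitrary):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Builder-style configuration: new setters can be added in later releases
// without breaking any existing caller.
def train(ratings: RDD[Rating]) =
  new ALS()
    .setRank(10)
    .setIterations(20)
    .setLambda(0.01)
    .run(ratings)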

The place I'm coming from is that there are far more complex APIs than ours that have retained backwards compatibility over decades, and were maintained by a similar-sized team. One great example is Java's class library, which is not only a great library but has also been compatible since 1.0. There are well-known ways to retain compatibility while still improving the API, such as adding a new package (e.g. java.nio vs java.io). I would be totally fine doing that with MLlib as we gain experience with it, but there's no reason to break the old API in the process. Again, I feel that people from today's tech company world think way too much about "perfecting" an API by repeatedly tweaking it, and while that works within a single engineering team, it doesn't work in software that you expect someone else to use.

@srowen
Member Author

srowen commented Jul 15, 2014

Wise words, Matei! Anyway, here's another cut that preserves the original API. Tests are still running. It's up to your judgment whether it's worthwhile.

 */
 @Experimental
-case class Rating(user: Int, product: Int, rating: Double)
+case class Rating(user: Long, product: Long, rating: Double)
Contributor

This is also a breaking change; we need to add another constructor that takes Ints. You may be able to add one like this:

case class Rating(user: Long, product: Long, rating: Double) {
  def this(u: Int, p: Int, r: Double) = this(u.toLong, p.toLong, r)
}

Or, if that doesn't work, you need to turn this into a plain class instead of a case class and add extends Serializable and the various constructors yourself.

Contributor

Actually this is kind of tricky; we probably need a separate LongRating of some kind, because "user" and "product" are public fields that used to be Int.

@srowen
Member Author

srowen commented Jul 15, 2014

@mateiz yeah, it was @Experimental, but if you want to preserve the API of this class, I can only imagine it holding a Long field internally while still exposing Int accessors under the current names, which just truncate or something. It would still work the same for values that fit in an Int, I suppose. I can try that to see how it looks.

I'd not be offended if this is better considered for 2.x, after more evidence exists as to whether this is really causing problems or not.

@SparkQA

SparkQA commented Jul 15, 2014

QA results for PR 1393:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16694/consoleFull

@mateiz
Contributor

mateiz commented Jul 16, 2014

Yeah, this is tricky. Maybe we need to create a LongRating class, even though it's ugly. @mengxr what do you think? I do think we'd like to have support for Longs here.

@mengxr
Contributor

mengxr commented Jul 16, 2014

Besides breaking the API, I'm also worried about two things:

  1. The increase in storage. We had some discussion before v1.0 about whether we should switch to Long or not. ALS is not computation-heavy for small k, but it is communication-heavy. I posted some screenshots on the JIRA page, where ALS shuffles ~200GB of data in each iteration. With Long IDs, that number may grow to ~300GB, and hence ALS may slow down by 50%. Instead of upgrading the ID type to Long, I'm actually thinking about downgrading the rating type to Float.
  2. Is collision really that bad? ALS needs a somewhat "dense" matrix to compute good recommendations. If there are 3 billion users but each user gives only 1 or 2 ratings, ALS is very likely to overfit. In that case, a random projection on the user side would certainly help, and hashing is one of the commonly used techniques for random projection. There will be bad recommendations whether or not there are hash collisions. So I'm really interested in some measurements of the downside of hash collisions.

@srowen
Member Author

srowen commented Jul 16, 2014

Let me close this PR for now; I will fork or wrap as necessary. Keep it in mind, and maybe this can be revisited in a 2.x release. (Matei, I ran into more problems with the Rating class retrofit anyway.)

Yes, storage is the downside. Your comments on JIRA about serialization compressing away the difference are promising. I completely agree with using Float for ratings, and even for feature vectors.

Yes, I understand why random projections are helpful. But this doesn't help accuracy; at best it only trivially hurts it in return for some performance gain. If I have just one rating, it doesn't make my recommendations better to arbitrarily add your ratings to mine. Sure, that's denser, and maybe you get less overfitting, but it's fitting the wrong input for both of us.

A collision here and there is probably acceptable. One in a million customers? OK. One percent? Maybe a problem. I agree you'd have to quantify this to decide. If I'm an end user of MLlib bringing even millions of things to my model, I have to make that call, and if it is a problem, I have to maintain a lookup table to use it.

It seemed simplest to moot the problem with a much bigger key space and engineer around the storage issue. A bit more memory is cheap; accuracy and engineer time are expensive.

@mateiz
Contributor

mateiz commented Jul 16, 2014

Sean, I'd still be okay with adding a LongALS class if you see a benefit to it in some use cases. Let's just see how it works in comparison.

@srowen
Member Author

srowen commented Jul 16, 2014

@mateiz I think it would mean mostly cloning ALS.scala, as the Rating object is woven throughout, though some large chunks could probably be refactored and shared. Is that what you mean? Even then, I'm not sure two APIs are worth the trouble. When I get to the point of seeing what a 64-bit key implementation takes, I can propose what that would look like.

@mateiz
Contributor

mateiz commented Jul 16, 2014

Yeah, that's what I meant: we can clone it at first, but we might be able to share code later (at least the math code we run on each block, or things like that). But let's do it only if you find it's worth it.

@mateiz
Contributor

mateiz commented Jul 16, 2014

BTW this could also be a place to use the dreaded Scala @specialized annotation to template the code for Ints vs Longs, though as far as I know that's being deprecated by the Scala developers.
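
For reference, a sketch of what that templating might look like (illustrative class and field names; not proposed code):

// @specialized makes the compiler emit separate Int and Long variants of the
// class, avoiding boxing without duplicating source code.
class IdBlock[@specialized(Int, Long) T](val ids: Array[T])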

srowen closed this Jul 16, 2014
srowen deleted the SPARK-2465 branch August 1, 2014 15:39