
SPARK-2465. Use long as user / item ID for ALS #1393

Closed
wants to merge 2 commits

Conversation

@srowen
Member

srowen commented Jul 13, 2014

I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation.

The main reason is that identifiers are generally not numeric at all, and so must be hashed to an integer. (How the hashing happens is a separate issue.) Hashing to 32 bits makes collisions likely after hundreds of thousands of users and items, which is not an unrealistic scale; hashing to 64 bits pushes that threshold back to billions.
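
For concreteness, here is a minimal sketch (not part of this patch; the helper name and seeds are illustrative) of hashing an arbitrary string identifier to 64 bits by combining two independently seeded 32-bit hashes:

import scala.util.hashing.MurmurHash3

// Illustrative helper, not part of this patch: derive a 64-bit ID from an
// arbitrary string key by combining two independently seeded 32-bit hashes.
def hashId(id: String): Long = {
  val hi = MurmurHash3.stringHash(id, 1)
  val lo = MurmurHash3.stringHash(id, 2)
  (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
}

// Birthday bound: collisions become likely around sqrt(N) distinct keys,
// i.e. sqrt(2^32) ~ 65,000 for Int hashes vs sqrt(2^64) ~ 4 billion for Long.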

It would also mean that numeric IDs which happen to be larger than the largest Int could be used directly as identifiers.

On the downside, of course: each ID takes 8 bytes instead of 4, adding 8 bytes of memory per Rating.

Thoughts?

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull

@witgo
Contributor

witgo commented Jul 13, 2014

How much does memory use increase overall? Is there a detailed comparison?

@srowen
Member Author

srowen commented Jul 13, 2014

I think the most significant change is to the Rating object. It goes from 8 (ref) + 8 (object header) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 (object header) + 8 (long) + 8 (long) + 8 (double) = 40 bytes. At, say, a billion ratings, that is roughly 8 GB of additional heap before serialization.

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1393:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull

@@ -53,7 +53,7 @@ class MatrixFactorizationModel private[mllib] (
    * @param usersProducts RDD of (user, product) pairs.
    * @return RDD of Ratings.
    */
-  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
+  def predict(usersProducts: RDD[(Long, Long)]): RDD[Rating] = {
Contributor

Unfortunately this is a breaking API change, and because of type erasure, users who compiled against an old MLlib (passing RDD[(Int, Int)]) will have their code link against the new version and then get a ClassCastException at runtime.
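
A minimal, self-contained sketch of this hazard (illustrative, using a plain List in place of an RDD):

object ErasureDemo {
  // After the change this method erases to predict(List) -- the exact JVM
  // signature the old List[(Int, Int)] version had, so old callers still link.
  def predict(usersProducts: List[(Long, Long)]): Unit =
    usersProducts.foreach { case (u, p) => println(u + p) }

  def main(args: Array[String]): Unit = {
    // Simulates bytecode compiled against the old Int-based signature; the
    // cast is unchecked due to erasure, so the failure only surfaces when
    // the tuple elements are unboxed inside predict:
    val fromOldCaller = List((1, 2)).asInstanceOf[List[(Long, Long)]]
    predict(fromOldCaller) // ClassCastException: Integer cannot be cast to Long
  }
}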

Member Author

I had understood all of this to be an @Experimental API, though it is not consistently marked. For example, Rating is marked experimental, but it is bound up with this non-experimental API here.

Contributor

Interesting, I think Rating shouldn't have been marked experimental. We should definitely fix that.

@mateiz
Contributor

mateiz commented Jul 15, 2014

I'm somewhat worried about this because it seems difficult to do without breaking API changes. Do users really want to rely on luck to avoid collisions? It might be better to add some efficient way to assign unique IDs.
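
For example (a hedged sketch, not something this PR adds; names are illustrative), Spark's existing RDD.zipWithUniqueId can build a collision-free mapping from raw keys to Long IDs:

import org.apache.spark.rdd.RDD

// Illustrative only: assign each distinct raw identifier a unique Long.
// Callers would persist this mapping and reuse it at prediction time.
def assignIds(rawIds: RDD[String]): RDD[(String, Long)] =
  rawIds.distinct().zipWithUniqueId()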

@ScrapCodes
Member

I am trying to figure out why it happened. This may not be the final conclusion, but at the moment I believe that because this class has a private[mllib] constructor, there is an entry in the ignores file, org.apache.spark.mllib.recommendation.MatrixFactorizationModel.<init>, and that entry makes the whole class ignored by the MiMa tool.

@ScrapCodes
Member

And to my surprise, it also has org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict; I'm not sure why that entry is there.

@ScrapCodes
Member

Ah, understood (please ignore my previous theory). It happened because there is a @DeveloperApi function with the same name in the same class, so our GenerateMimaIgnoreTool added it to the ignores file, and as a result the MiMa check for all predict methods was disabled. I'm not sure whether MiMa can disambiguate them; I will check.

@srowen
Member Author

srowen commented Jul 15, 2014

Yes, you could also tell callers to track their own user-ID mapping and maintain it consistently everywhere; callers then have to share that state somehow. Hashing is easier, and 64 bits makes it work for practical purposes.

A caller has to do something like one of these to deal with real-world identifiers, because an Int ID API by itself doesn't quite work. This is an instance of a meta-concern of mine: an API which (from my perspective) is going to be problematic at scale is already unchangeable before being battle-tested. (I actually thought all of MLlib was de facto @Experimental?)

That said, you can layer on other APIs to fix it, or use @deprecated in cases like this to keep existing methods while adding new signatures. I think that would be the simplest solution to this particular concern.

The question of serialized size is still out there. That is worth weighing in on.

@mateiz
Contributor

mateiz commented Jul 15, 2014

No, MLlib is not experimental; only the parts annotated with @Experimental are. The reason is that we felt we could continue supporting these low-level APIs indefinitely and add other ones later if we need to. Again, for real users, API stability matters much more than you'd think -- there's nothing more annoying than having to change your app just to take a software upgrade, and it causes fragmentation of the user base as users stick to an older version instead of upgrading.

In this particular case, there are a few things we can do. We can think of additions to the API here that preserve the old methods but add new versions of predict. We can add a new class called LongALS or something like that, have the existing classes call into it, and get back a LongMatrixFactorizationModel. Or we can offer a utility to generate unique IDs.

The reason I was asking about hash collisions above is that even with 64-bit IDs, you're not guaranteed to be collision-free. With 2-3 billion users you actually have a good chance of a collision. So if applications care about that, they may not be okay with this solution either.

@srowen
Member Author

srowen commented Jul 15, 2014

Yeah, API stability is very important. I keep banging on about the flip side: freezing an API that may still need to change creates a different, equally important problem. I'm sure everyone gets that; it's a judgment call and a trade-off.

I will change the PR to preserve the existing methods and add new ones. That's the thing we can consider and merge or not. I'm not offended if nobody else is feeling this one; I can always fork/wrap this aspect to fit what I need it to do. (And I have other API suggestions I'd rather spend time on, if anything.)

I wouldn't want to add the overhead of a separate set of implementations just for 64 bit values. Users would have a hard time understanding the difference and choosing.

3 billion people is a lot! It could happen, yes. Maybe not with people but with, say, URLs. If collisions mattered much, then with many billions of things, you can't use the ALS implementation as it stands, since most IDs would collide no matter how you map or hash. That's the best motivation I can offer for this change.

…DD[(Long,Long)]) becomes predictRDD because it can't overload predict(RDD[(Int,Int)])
@SparkQA

SparkQA commented Jul 15, 2014

QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16694/consoleFull

@mateiz
Contributor

mateiz commented Jul 15, 2014

I didn't suggest having a new implementation for long IDs, only a new API. They can run on the same implementation (e.g. the current Int-based one transforms the Ints to Longs and calls that one). This is a much more sensible way to evolve the API and it's very common in other software. All our MLlib APIs were designed to support this kind of evolution (e.g. you set your parameters using a builder pattern, where we can add new methods, and the top-level API is just functions you can call that we can easily map to more complex versions of the functions).
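
As a concrete illustration of that evolution-friendly style, the existing entry points already look roughly like this (a sketch; the parameter values are arbitrary):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Builder-style configuration: new setters can be added in later releases
// without breaking any existing caller.
def train(ratings: RDD[Rating]) =
  new ALS()
    .setRank(10)
    .setIterations(20)
    .setLambda(0.01)
    .run(ratings)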

The place I'm coming from is that there are far more complex APIs than ours that have retained backwards compatibility over decades, and were maintained by a similar-sized team. One great example is Java's class library, which is not only a great library but has also been compatible since 1.0. There are well-known ways to retain compatibility while still improving the API, such as adding a new package (e.g. java.nio vs java.io). I would be totally fine doing that with MLlib as we gain experience with it, but there's no reason to break the old API in the process. Again, I feel that people from today's tech company world think way too much about "perfecting" an API by repeatedly tweaking it, and while that works within a single engineering team, it doesn't work in software that you expect someone else to use.

@srowen
Member Author

srowen commented Jul 15, 2014

Wise words, Matei! Anyway, here's another cut that preserves the original API. Tests are still running. It's up to your judgment whether it's worthwhile.

 */
 @Experimental
-case class Rating(user: Int, product: Int, rating: Double)
+case class Rating(user: Long, product: Long, rating: Double)
Contributor

This is also a breaking change; we need to add another constructor that takes Ints. You may be able to add one like this:

case class Rating(user: Long, product: Long, rating: Double) {
  def this(u: Int, p: Int, r: Double) = this(u.toLong, p.toLong, r)
}

Or, if that doesn't work, you need to turn this into a plain class instead of a case class and add extends Serializable and the various constructors yourself.

Contributor

Actually this is kind of tricky; we probably need a separate LongRating of some kind, because "user" and "product" are public fields that used to be Int.

@srowen
Member Author

srowen commented Jul 15, 2014

@mateiz yeah, it was @Experimental, but if you want to preserve the API of this class, I can only imagine it holding a Long field internally while still exposing Int accessors under the current names, which just truncate or something. It would still work the same for values that fit in an Int, I suppose. I can try that to see how it looks.

I'd not be offended if this is better considered for 2.x, after more evidence exists as to whether this is really causing problems or not.

@SparkQA

SparkQA commented Jul 15, 2014

QA results for PR 1393:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16694/consoleFull

@mateiz
Contributor

mateiz commented Jul 16, 2014

Yeah, this is tricky. Maybe we need to create a LongRating class, even though it's ugly. @mengxr what do you think? I do think we'd like to have support for Longs here.

@mengxr
Contributor

mengxr commented Jul 16, 2014

Besides breaking the API, I'm also worried about two things:

  1. The increase in storage. We had some discussion before v1.0 about whether we should switch to Long or not. ALS is not computation-heavy for small k, but it is communication-heavy. I posted some screenshots on the JIRA page, where ALS shuffles ~200GB of data in each iteration. With Long IDs, that number may grow to ~300GB, and hence ALS may slow down by 50%. Instead of upgrading the ID type to Long, I'm actually thinking about downgrading the rating type to Float.
  2. Is collision really that bad? ALS needs a somewhat "dense" matrix to compute good recommendations. If there are 3 billion users but each user gives only 1 or 2 ratings, ALS is very likely to overfit. In that case, a random projection on the user side would certainly help, and hashing is one of the commonly used techniques for random projection. There will be bad recommendations whether or not there are hash collisions. So I'm really interested in some measurements of the downside of hash collisions.

@srowen
Member Author

srowen commented Jul 16, 2014

Let me close this PR for now; I will fork or wrap as necessary. Keep it in mind, and maybe this can be revisited in a 2.x release. (Matei, I ran into more problems with the Rating class retrofit anyway.)

Yes, storage is the downside. Your comments on JIRA about serialization compressing away the difference are promising. I completely agree with using Float for ratings, and even for feature vectors.

Yes, I understand why random projections are helpful. But this doesn't help accuracy; at best it only trivially hurts it in return for some performance gain. If I have just one rating, it doesn't make my recommendations better to arbitrarily add your ratings to mine. Sure, that's denser, and maybe you get less overfitting, but it's fitting the wrong input for both of us.

A collision here and there is probably acceptable. One in a million customers? OK. One percent? Maybe a problem. I agree you'd have to quantify this to decide. If I'm an end user of MLlib bringing even millions of things to my model, I have to make that call, and if it is a problem, I have to maintain a lookup table to use it.

It seemed simplest to moot the problem with a much bigger key space and engineer around the storage issue. A bit more memory is cheap; accuracy and engineer time are expensive.

@mateiz
Contributor

mateiz commented Jul 16, 2014

Sean, I'd still be okay with adding a LongALS class if you see a benefit to it in some use cases. Let's just see how it works in comparison.

@srowen
Member Author

srowen commented Jul 16, 2014

@mateiz I think it would mean mostly cloning ALS.scala, as the Rating object is woven throughout, though some large chunks could probably be refactored and shared. Is that what you mean? Even then, I'm not sure two APIs are worth the trouble. When I get to the point of seeing what a 64-bit key implementation takes, I can propose what that would look like.

@mateiz
Contributor

mateiz commented Jul 16, 2014

Yeah, that's what I meant: we can clone it at first, but we might be able to share code later (at least the math code we run on each block, or things like that). But let's do it only if you find it's worth it.

@mateiz
Contributor

mateiz commented Jul 16, 2014

BTW this could also be a place to use the dreaded Scala @specialized annotation to template the code for Ints vs Longs, though as far as I know that's being deprecated by the Scala developers.
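
For reference, a sketch of what that templating might look like (illustrative class and field names; not proposed code):

// @specialized makes the compiler emit separate Int and Long variants of the
// class, avoiding boxing without duplicating source code.
class IdBlock[@specialized(Int, Long) T](val ids: Array[T])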

srowen closed this Jul 16, 2014
srowen deleted the SPARK-2465 branch August 1, 2014 15:39