SPARK-2465. Use long as user / item ID for ALS #1393
Conversation
QA tests have started for PR 1393. This patch merges cleanly.

How much does this increase memory overall? Is there a detailed comparison?

I think the most significant change is the Rating object. It goes from 8 (ref) + 8 (object header) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 (object header) + 8 (long) + 8 (long) + 8 (double) = 40 bytes.
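The arithmetic above can be written out directly. This is a rough sketch of per-instance size on a typical 64-bit JVM; the 8-byte reference and 8-byte object header are assumptions, and real layouts vary with JVM flags such as compressed oops:

```scala
// Assumed layout constants for a plain 64-bit JVM (no compressed oops).
val refBytes = 8    // reference to the Rating object
val headerBytes = 8 // object header

// Rating(user: Int, product: Int, rating: Double)
val intRatingBytes = refBytes + headerBytes + 4 + 4 + 8

// Rating(user: Long, product: Long, rating: Double)
val longRatingBytes = refBytes + headerBytes + 8 + 8 + 8
```

So the per-Rating overhead of the change is 8 bytes, a 25% increase on this estimate.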
QA results for PR 1393:

QA tests have started for PR 1393. This patch merges cleanly.

QA results for PR 1393:

QA tests have started for PR 1393. This patch merges cleanly.

QA results for PR 1393:
@@ -53,7 +53,7 @@ class MatrixFactorizationModel private[mllib] (
    * @param usersProducts RDD of (user, product) pairs.
    * @return RDD of Ratings.
    */
-  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
+  def predict(usersProducts: RDD[(Long, Long)]): RDD[Rating] = {
This is a breaking API change, unfortunately. Because of type erasure, users who compiled against an old MLlib (passing RDD[(Int, Int)]) will have their code link against the new one and get a ClassCastException at runtime.
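A small sketch of the erasure hazard, using List in place of RDD for self-containment: List[(Int, Int)] and List[(Long, Long)] erase to the same runtime type, so old client code links without complaint and fails only when a value is actually read.

```scala
// The new API's signature after the change (illustrative stand-in for predict).
def firstUser(pairs: List[(Long, Long)]): Long = pairs.head._1

// Data as an old client compiled it, against the Int-based signature.
val oldClientData: List[(Int, Int)] = List((1, 2))

// This cast mirrors what linking against the new binary effectively does;
// it succeeds, because both types erase to List[Tuple2].
val relinked = oldClientData.asInstanceOf[List[(Long, Long)]]

// The boxed Integer only fails when unboxed as a Long.
def blowsUp(): Boolean =
  try { firstUser(relinked); false }
  catch { case _: ClassCastException => true }
```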
I had understood all of this to be an @Experimental API, though it is not consistently marked. For example, Rating is experimental but its API is actually bound to this API here.
Interesting, I think Rating shouldn't have been marked experimental. We should definitely fix that.
I'm somewhat worried about this because it seems difficult to do without breaking API changes. Do users really want to rely on luck to avoid collisions? It might be better to add some efficient way to assign unique IDs.
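One hedged sketch of "an efficient way to assign unique IDs": map each distinct raw identifier to a dense Long index instead of hashing. In Spark itself this could build on something like RDD.zipWithUniqueId; plain Scala is used here for illustration.

```scala
// Assign a collision-free Long ID to every distinct raw identifier.
// Simple dictionary approach; names here are illustrative only.
def assignIds(rawIds: Seq[String]): Map[String, Long] =
  rawIds.distinct.zipWithIndex.map { case (id, i) => id -> i.toLong }.toMap
```

The trade-off versus hashing is that callers must keep this mapping around (and shared) to translate IDs back, which is exactly the bookkeeping burden discussed below.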
I am trying to figure out why it happened; this might not be the right conclusion, but at the moment I feel that since this class has a private[mllib] constructor, there is an entry in the ignores file as follows
And to my surprise it also has
Ahh, understood (please ignore my previous theory). So it happened because we have a function which is
Yes, you could also tell callers to track their own user-ID mapping and maintain it consistently everywhere, but then callers have to share that state somehow. Hashing is easier, and 64 bits makes it work for practical purposes. A caller has to do something like this to deal with real-world identifiers anyway.

Yeah, however you can layer other APIs on top to fix it.

The question of serialized size is still out there. That is worth weighing in on.
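For concreteness, this is roughly what a caller's hashing of arbitrary string identifiers into 64-bit IDs might look like. FNV-1a is just one illustrative choice of hash; nothing in the PR prescribes it.

```scala
// 64-bit FNV-1a hash of an arbitrary string identifier.
// Deterministic, so the same raw ID always maps to the same Long.
def hash64(id: String): Long = {
  var h = 0xcbf29ce484222325L            // FNV-1a 64-bit offset basis
  for (b <- id.getBytes("UTF-8")) {
    h ^= (b & 0xffL)
    h *= 0x100000001b3L                  // FNV-1a 64-bit prime
  }
  h
}
```

With 32-bit IDs the same approach runs out of room much sooner, which is the motivation stated in the JIRA description below.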
No, MLlib is not experimental, only the parts annotated with @Experimental are. The reason is that we felt we could continue supporting these low-level APIs indefinitely and add other ones later if we need to.

Again, for real users, API stability matters much more than you'd think -- there's nothing more annoying than having to change your app to implement a software upgrade, and it causes fragmentation of the user base as users stick to an older version instead of upgrading.

In this particular case, there are a few things we can do. We can think of additions to the API here that preserve the old methods but add new versions of predict. We can add a new class called LongALS or something like that, and have these ones call it and get back a LongMatrixFactorizationModel. Or we can offer a utility to generate unique IDs.

The reason I was asking about hash collisions above is that even with 64-bit IDs, you're not guaranteed to be collision-free. With 2-3 billion users you actually have a good chance of a collision. So if applications care about that, they may not be okay with this solution either.
Yeah, API stability is very important. I keep banging on about the flip side -- freezing an API that may still need to change creates a different, important problem. I'm sure everyone gets that, and it's a judgment call and trade-off.

I will change the PR to preserve the existing methods and add new ones. That's the thing we can consider and merge or not. I'm not offended if nobody else is feeling this one; I can always fork/wrap this aspect to fit what I need it to do. (And I have other API suggestions I'd rather spend time on, if anything.)

I wouldn't want to add the overhead of a separate set of implementations just for 64-bit values. Users would have a hard time understanding the difference and choosing.

3 billion people is a lot! It could happen, yes -- maybe not with people but with, say, URLs. If collisions mattered much, then with many billions of things, you can't use the ALS implementation as it stands, since most IDs would collide no matter how you map or hash. That's the best motivation I can offer for this change.
…DD[(Long,Long)]) becomes predictRDD because it can't overload predict(RDD[(Int,Int)])
QA tests have started for PR 1393. This patch merges cleanly.
I didn't suggest having a new implementation for long IDs, only a new API. They can run on the same implementation (e.g. the current Int-based one transforms the Ints to Longs and calls that one).

This is a much more sensible way to evolve the API and it's very common in other software. All our MLlib APIs were designed to support this kind of evolution (e.g. you set your parameters using a builder pattern, where we can add new methods, and the top-level API is just functions you can call that we can easily map to more complex versions of the functions).

The place I'm coming from is that there are far more complex APIs than ours that have retained backwards compatibility over decades, and were maintained by a similar-sized team. One great example is Java's class library, which is not only a great library but has also been compatible since 1.0. There are well-known ways to retain compatibility while still improving the API, such as adding a new package (e.g. java.nio vs java.io). I would be totally fine doing that with MLlib as we gain experience with it, but there's no reason to break the old API in the process.

Again, I feel that people from today's tech company world think way too much about "perfecting" an API by repeatedly tweaking it, and while that works within a single engineering team, it doesn't work in software that you expect someone else to use.
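The evolution pattern described here can be sketched in a few lines: keep the old Int-based entry point and forward it to a new Long-based one. The names and the trivial scoring below are illustrative, not the actual MLlib signatures.

```scala
// New Long-based method; the placeholder "score" stands in for real ALS prediction.
def predictLong(usersProducts: Seq[(Long, Long)]): Seq[Double] =
  usersProducts.map { case (u, p) => (u + p).toDouble }

// Old Int-based signature preserved for binary/source compatibility:
// it just widens the Ints and delegates.
def predict(usersProducts: Seq[(Int, Int)]): Seq[Double] =
  predictLong(usersProducts.map { case (u, p) => (u.toLong, p.toLong) })
```

Old callers keep compiling and linking against predict, while new callers can use the wider ID space via predictLong.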
Wise words, Matei! Anyway, here's another cut that preserves the original API. Tests are still running. Up to you guys' judgment on whether it's worthwhile.
   */
  @Experimental
-case class Rating(user: Int, product: Int, rating: Double)
+case class Rating(user: Long, product: Long, rating: Double)
This is also a breaking change; we need to add another constructor that takes Ints. You may be able to add one like this:
case class Rating(user: Long, product: Long, rating: Double) {
def this(u: Int, p: Int, r: Double) = this(u.toLong, p.toLong, r)
}
Or if it doesn't work, you need to turn this into a class instead of a case class and add extends Serializable and the various constructors yourself.
Actually this is kind of tricky, we probably need a separate LongRating of some kind. This is because "user" and "product" and such are public fields and they used to be Int.
@mateiz yeah, it was. I'd not be offended if this is better considered for 2.x, after more evidence exists on whether this is really causing problems or not.
QA results for PR 1393:
Yeah, this is tricky. Maybe we need to create a LongRating class, even though it's ugly. @mengxr what do you think? I do think we'd like to have support for Longs here. |
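The LongRating idea could look something like the following sketch. The names are illustrative (IntRating stands in for the existing Rating so the example is self-contained), and nothing here is the actual resolution of this thread.

```scala
// Existing public API keeps its Int fields untouched.
case class IntRating(user: Int, product: Int, rating: Double)

// New, separate class carries Long IDs for the wider key space.
case class LongRating(user: Long, product: Long, rating: Double)

// Bridging old ratings into the new type is a trivial widening.
def widen(r: IntRating): LongRating =
  LongRating(r.user.toLong, r.product.toLong, r.rating)
```

The ugliness Matei mentions is real: two parallel rating types means every API touching Rating eventually needs a Long twin or a conversion.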
Besides breaking the API, I'm also worried about two things:
Let me close this PR for now. I will fork or wrap as necessary. Keep it in mind; maybe in a 2.x release this can be revisited.

Yes, storage is the downside. Your comments on JIRA about the effects of serialization in compressing away the difference are promising.

Yes, I understand why random projections are helpful. It doesn't help accuracy, but may only trivially hurt it in return for some performance gain. If I have just 1 rating, it doesn't make my recs better to arbitrarily add your ratings to mine. Sure, that's denser, and maybe you're getting less overfitting, but it's fitting the wrong input for both of us.

A collision here and there is probably acceptable. One in a million customers? OK. 1%? Maybe a problem. I agree, you'd have to quantify this to decide. If I'm an end user of MLlib bringing even millions of things to my model, I have to decide, and if it's a problem, I have to maintain a lookup table to use it. It seemed simplest to moot the problem with a much bigger key space and engineer around the storage issue. A bit more memory is cheap; accuracy and engineer time are expensive.
Sean, I'd still be okay with adding a LongALS class if you see benefit for it in some use cases. Let's just see how it works in comparison. |
@mateiz I think it would mean mostly cloning
Yeah, that's what I meant, we can clone it at first but we might be able to share code later (at least the math code we run on each block, or stuff like that). But let's do it only if you find out it's worth it. |
BTW this could also be a place to use the dreaded Scala @specialized annotation to template the code for Ints vs Longs, though as far as I know that's being deprecated by the Scala developers. |
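A minimal sketch of what @specialized templating could look like here: one generic class of ID-typed data that the compiler duplicates for Int and Long so neither path boxes. This is illustrative only, not a proposal for the actual class layout.

```scala
// The compiler emits unboxed variants of this class for Int and Long.
class IdPair[@specialized(Int, Long) T](val user: T, val product: T)

// Generic code over the specialized type parameter.
def userOf[@specialized(Int, Long) T](p: IdPair[T]): T = p.user
```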
I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation.
The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions.
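These collision estimates follow from the standard birthday approximation, which can be sketched directly. The formula, not anything in this PR, is the source of the numbers: P(collision among n random b-bit IDs) ≈ 1 − exp(−n² / 2^(b+1)).

```scala
// Birthday-bound approximation of the collision probability for n random
// IDs drawn uniformly from a b-bit space.
def collisionProb(n: Double, bits: Int): Double =
  1.0 - math.exp(-(n * n) / (2.0 * math.pow(2.0, bits)))
```

For example, a million IDs hashed to 32 bits collide almost surely, while the same million into 64 bits have a negligible collision probability; only around the billions does 64-bit hashing start to carry real collision risk, matching the 2-3 billion concern raised above.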
It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers.
On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating.
Thoughts?