With a small number of raters, refreshing similarities and recommendations is fast and works great. Unfortunately, my system needs to support more than half a million raters, and refresh time for a single rater grows significantly with the total number of raters. Even at 10,000 raters, a refresh can take 1 to 3 minutes.
I've tried tweaking (and lowering) the `config.nearest_neighbors`, `config.furthest_neighbors`, and `config.recommendations_to_store` settings without seeing much improvement.
Obviously, 1 to 3 minutes will only grow as the number of raters approaches half a million. So here are a few questions:
First, is this situation unique to me, or is it typical?
Is there a way to limit the similar raters calculation to use a subset of the total raters in the system?
Are there any other places you can suggest I could look to decrease refreshing time?
I noticed a few places in the recommendable code where Redis operations are called inside a loop; is there any reason not to multi or pipeline in those situations?
Any direction you can offer is appreciated!
Any recommender system will see degraded performance as the data set grows, but I do suspect it's particularly not great with Recommendable. With regards to multi/pipelining, this has actually been something I want to try, but I simply haven't gotten around to it yet. I haven't really worked with Redis pipelining before, so it's the sort of thing that would be awesome to get a PR for, if you'd be willing to help out with it?
I'll pick through the code and see if I can't get a PR together to minimize Redis connections, although I suspect that will only marginally boost performance. To really handle large data sets, I'm inclined to think it will need to support something along the lines of set sampling.
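Set sampling could look something like the sketch below: instead of scoring similarity against every rater in the system, pull a random subset with `SRANDMEMBER` and score only those. `FakeRedis` is a trivial in-memory stand-in so the example runs without a server, and the key name is illustrative, not Recommendable's actual key; redis-rb exposes the real command as `srandmember(key, count)`.

```ruby
# Sketch of set sampling: compare a rater against a random subset of the
# population instead of everyone. FakeRedis is an in-memory stand-in for a
# real client; the underlying Redis command is SRANDMEMBER key count.
class FakeRedis
  def initialize
    @sets = Hash.new { |h, k| h[k] = [] }
  end

  def sadd(key, member)
    @sets[key] << member
  end

  def scard(key)
    @sets[key].size
  end

  # With a positive count, SRANDMEMBER returns that many distinct members.
  def srandmember(key, count)
    @sets[key].sample(count)
  end
end

redis = FakeRedis.new
100_000.times { |i| redis.sadd("recommendable:users", "user:#{i}") } # hypothetical key

# Score similarity against 1,000 random raters rather than all 100,000:
candidates = redis.srandmember("recommendable:users", 1_000)
puts candidates.size # => 1000
```

The trade-off is that similarities become approximate, but the per-refresh cost stops scaling with the total rater count.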
In the meantime, do you know of anyone successfully using Recommendable with larger data sets in the wild?
If any of my users have large data sets, they haven't let me know. I think Recommendable definitely needs some performance improvements but, truth be told, I'm not really up to date on the computer science behind really performant and scalable recommender systems. Stuff's not easy. When I built Recommendable, my primary focus was a really nice and elegant interface, and I figured I'd improve the performance as I went along. I've done so bit by bit, but at this point I only have a few ideas for improving the current performance of the Redis logic, and like you said, I worry they'd only marginally boost performance. One of those ideas is using multi/pipelining; the other is putting all of that logic into a Lua script that Redis can run directly.
Though for the next major version I've been meaning to support alternate ways (with a pluggable system) of generating similarity values and recommendations that don't rely on Redis.
pipeline independent redis commands
There's the first stab at it. I'll keep plugging away at it and attempt some benchmarking.
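In outline, the change batches independent commands so they go out in one network round trip instead of one per loop iteration. Here's a minimal sketch of the pattern; `FakeRedis` is an in-memory stand-in so it runs without a server, and the key/member names are made up for illustration. With the redis-rb gem, the real client exposes the same block form via `#pipelined`.

```ruby
# Minimal sketch of pipelining independent Redis commands. FakeRedis stands
# in for a redis-rb client; the real #pipelined queues the commands and
# flushes them to the server in a single round trip.
class FakeRedis
  def initialize
    @zsets = Hash.new { |h, k| h[k] = {} }
  end

  def zadd(key, score, member)
    @zsets[key][member] = score
  end

  def zscore(key, member)
    @zsets[key][member]
  end

  # The real #pipelined batches commands; this stub just yields itself,
  # which keeps the calling code identical in shape.
  def pipelined
    yield self
  end
end

redis = FakeRedis.new

# Hypothetical similarity scores for one rater (names are illustrative only).
similarities = { "user:2" => 0.5, "user:3" => -0.25 }

# Instead of one round trip per ZADD inside the loop, batch them all:
redis.pipelined do |pipe|
  similarities.each do |other_id, score|
    pipe.zadd("recommendable:similarities:user:1", score, other_id)
  end
end

puts redis.zscore("recommendable:similarities:user:1", "user:2") # => 0.5
```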
pipeline + extract method
ensure similarity values are floats
Surprisingly, pipelining has yielded a significant performance boost (at least on my data set).
Updating similarities and recommendations 10 times:

```
Before pipelining:
      user     system      total        real
836.470000 113.540000 950.010000 (953.499563)

After pipelining:
      user     system      total        real
200.050000  48.630000 248.680000 (250.104407)
```
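For reference, numbers in that format come from Ruby's stdlib `Benchmark` module. A sketch of how such timings can be gathered, with a trivial stand-in workload in place of the actual refresh call (any real method name here would be a guess on my part):

```ruby
require "benchmark"

# Stand-in workload; in a real benchmark this would be the refresh call.
refresh = -> { 10_000.times.reduce(:+) }

# Benchmark.bm prints user/system/total/real columns like the output above.
Benchmark.bm(12) do |bm|
  bm.report("10 refreshes") { 10.times { refresh.call } }
end
```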
Daaaaang please PR that.
The extra `.to_f` on the end probably isn't needed since each element in the array is already a float now.
@mhuggins It would seem that way, but the tests fail without it. The reason is that sometimes there are no members in the `liked_by_set`/`disliked_by_set`, which means the similarity values array is empty.
`.map(&:to_f)` returns the same empty array, but `.reduce(&:+)` returns `nil`. So we need that `to_f` to ensure that, in the case of empty liked sets, we end up with `0.0`. It threw me for a loop when I was first writing that line of code.
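A minimal reproduction of that edge case:

```ruby
# Why the trailing .to_f matters: reduce on an empty array returns nil.
similarity_values = [] # e.g. liked_by_set and disliked_by_set are both empty

puts similarity_values.map(&:to_f).inspect # => []
puts similarity_values.reduce(&:+).inspect # => nil, not 0.0

# nil.to_f coerces the empty case back to a number:
sum = similarity_values.reduce(&:+).to_f
puts sum # => 0.0

# An alternative is to give reduce an initial value, which avoids nil entirely:
puts [].reduce(0.0, :+) # => 0.0
```

Seeding `reduce` with `0.0` would make the trailing `.to_f` unnecessary, at the cost of a slightly less obvious one-liner.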
@davidcelis Isn't this already a PR?
Oh it is. Crazy. I forgot you could turn issues into a PR. I'll take a closer look and probably merge when I'm not on a bus!
@ajjahn makes sense RE: to_f :)
Rad! Thanks! I'll keep digging for other ways to optimize, but I'd say this was a good start.