
Maximum inner product for Annoy #10

Closed
erikbern opened this issue Mar 9, 2018 · 8 comments


erikbern commented Mar 9, 2018

Just read http://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

How are you running Annoy for maximum inner product? Annoy doesn't support it out of the box, so I wonder if that's the reason it doesn't perform well.


benfred commented Mar 9, 2018

It is weird with Annoy - I'm not sure why it didn't work well here (given that, for a cosine lookup instead of inner product on the same dataset, I found it performed roughly the same as faiss).

The code to get annoy working with inner product search is here: https://github.com/benfred/implicit/blob/master/implicit/approximate_als.py . It just requires transforming the item factors so that each item has the same norm by adding an extra dimension (and then setting that extra dimension to 0 for users):

import numpy


def augment_inner_product_matrix(factors):
    """Transform a factor matrix so that an angular (cosine) nearest
    neighbours search over the result returns the top items by inner product.

    This adds one extra dimension to each row so that every row has the same
    norm, as suggested in the paper "Speeding Up the Xbox Recommender System
    Using a Euclidean Transformation for Inner-Product Spaces":
    https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf

    Once all rows share the same norm, the cosine of a transformed item vector
    with a query is proportional to the original dot product, provided the
    query vector has a 0 in the extra dimension.
    """
    norms = numpy.linalg.norm(factors, axis=1)
    max_norm = norms.max()

    # add an extra dimension so that the norm of each row is the same (max_norm)
    extra_dimension = numpy.sqrt(max_norm ** 2 - norms ** 2)
    return max_norm, numpy.append(factors, extra_dimension.reshape(norms.shape[0], 1), axis=1)

After applying that transformation to the item factors matrix, I indexed the transformed matrix with annoy and then query like https://github.com/benfred/implicit/blob/master/implicit/approximate_als.py#L248 .
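As a quick sanity check (a small numpy sketch, not code from the linked repo - the toy factors here are made up for illustration), ranking items by cosine similarity over the augmented matrix gives the same order as ranking by inner product over the original factors, once the query vector is padded with a 0 in the extra dimension:

```python
import numpy

# hypothetical small example: 5 items x 3 latent factors
factors = numpy.array([
    [0.5, 1.0, -0.3],
    [2.0, 0.1, 0.4],
    [-1.0, 0.7, 0.2],
    [0.3, -0.2, 1.5],
    [0.9, 0.9, 0.9],
])
user = numpy.array([1.0, -0.5, 2.0])

# same transform as augment_inner_product_matrix above
norms = numpy.linalg.norm(factors, axis=1)
max_norm = norms.max()
extra = numpy.sqrt(max_norm ** 2 - norms ** 2)
augmented = numpy.append(factors, extra.reshape(-1, 1), axis=1)

# pad the user query with a 0 in the extra dimension
query = numpy.append(user, 0.0)

# every augmented row now has norm max_norm, so dividing by the (constant)
# row norms doesn't change the order: cosine ranking over the augmented
# matrix equals inner-product ranking over the original factors
by_inner = numpy.argsort(-(factors @ user))
by_cosine = numpy.argsort(-(augmented @ query) / numpy.linalg.norm(augmented, axis=1))
assert (by_inner == by_cosine).all()
```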

The code to get this working with ann-benchmarks is a bit of a hack, and I don't think it will still work with your latest changes =( I just dumped out a npz file in the format that ann-benchmarks used to use with the transformed data: https://github.com/benfred/bens-blog-code/blob/master/approximate_als/create_ann_benchmarks_data.py#L51 . I think you changed to hdf5 so I doubt this will still work.


benfred commented Mar 9, 2018

Totally unrelated, but I read your post this morning https://erikbern.com/2018/03/07/lessons-from-content-marketing-myself-aka-blogging-for-five-years.html - and totally agree with everything you said there.

One small thing though, even with more than 1K followers on feedly you can still get the exact count in a tooltip by hovering over the count:
[screenshot: Feedly tooltip showing the exact follower count, 2018-03-09]


erikbern commented Mar 9, 2018

ok, let me think about that. are you doing the same thing for all the libraries, or is this in particular for annoy?

thanks for the feedly thing!


benfred commented Mar 9, 2018

I'm doing the same thing for NMSLIB and Faiss in that post, to be consistent. I could have special-cased faiss (since it has an inner product index option), but it was easier to treat faiss the same as nmslib/annoy for that comparison with ann-benchmarks.

However, I'm using faiss's inner product index option in my implicit library for generating recommendations instead of doing the transform myself.
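For reference, an exact inner-product index of the kind faiss offers amounts to a brute-force dot product over all items plus top-k selection - no transform needed. A hedged numpy sketch of that computation (the function name is illustrative, not faiss's API):

```python
import numpy

def exact_inner_product_search(item_factors, query, k=10):
    """Brute-force maximum inner product search: score every item by its
    dot product with the query and return the indices of the k best,
    sorted by descending score (what an exact inner-product index
    computes, without any approximation)."""
    scores = item_factors @ query
    # argpartition finds an unordered top-k in O(n); sort just those k
    top_k = numpy.argpartition(-scores, k - 1)[:k]
    return top_k[numpy.argsort(-scores[top_k])]
```

For generating recommendations this is exact but O(items) per query; the approximate indexes in the thread trade a little recall for much faster lookups.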


erikbern commented Mar 9, 2018

ok, interesting, not sure why annoy performs so much worse than the other ones. if i have time i might try to reproduce your case (tempted to add it to ann-benchmarks actually)


erikbern commented Mar 9, 2018

guess we can close this for now

@erikbern erikbern closed this as completed Mar 9, 2018

erikbern commented Jul 24, 2018

just rediscovered this thread because someone else brought up maximum inner product search for annoy in spotify/annoy#303

random thought i had – wonder if it would make sense to add your dataset to ann-benchmarks?


benfred commented Aug 5, 2018

@erikbern: I think it's a good idea to add to ann-benchmarks! Someone else was pinging me about reproducing my results in that post - but I accidentally deleted my fork of ann-benchmarks with the changes required to reproduce 🙁

I created a PR here: erikbern/ann-benchmarks#91
