
Distance-weighted random sampling of non-neighbor negatives #67

Open
sheffe opened this issue Apr 27, 2020 · 2 comments

@sheffe
Contributor

sheffe commented Apr 27, 2020

Not a fully-baked feature request, just a directional hunch. I've found the conclusions from this paper Sampling Matters in Deep Embedding Learning pretty intuitive -- (1) the method for choosing negative samples is critical to the overall embedding, maybe more than the specific loss function, and (2) a distance-weighted sampling of negatives had some nice properties during training and better results compared to uniform random sampling or oversampling hard cases.

I'm brand-new to Annoy and not confident about the implementation details or performance implications here, but I suspect that the prebuilt index could be used for both positive and negative sampling. An example: the current approach draws random negatives in sequence and chooses the first index not in a neighbor list. A distance-weighted approach for choosing a negative for each triplet might work like this:

  • Draw a random set of candidate negatives
  • Drop any candidate negatives already in the neighbor list
  • Choose from the remaining set of candidates with probabilities proportional to a weight f(dist(i, j)), where f(dist) could be 1/dist, 1/sqrt(dist), etc

Annoy gives us the dist(i, j) without much of a performance hit. Weighted choice of the candidate negatives puts a (tunable) thumb on the scale for triplets that contain closer/harder-negative matches.

This idea probably adds some hyperparameter-selection headaches. I think the impactful choices here are the size of the initial set of candidate negatives and (especially) f(dist).
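The steps above could be sketched roughly as follows. This is illustrative only, not ivis code: `dist_fn` stands in for a pairwise-distance lookup (e.g. Annoy's `get_distance`), and the function and parameter names are made up for the sketch.

```python
import random

def sample_weighted_negative(anchor, neighbors, n_items, dist_fn,
                             n_candidates=10, weight_fn=lambda d: 1.0 / d):
    """Pick one negative for `anchor`, weighting closer (harder) candidates more.

    All names here are hypothetical; `dist_fn(i, j)` stands in for a
    distance lookup such as Annoy's get_distance.
    """
    excluded = set(neighbors) | {anchor}
    # Draw a random set of candidate negatives, dropping any already
    # in the neighbor list (or the anchor itself).
    candidates = [j for j in random.sample(range(n_items), n_candidates)
                  if j not in excluded]
    if not candidates:
        return None  # caller could redraw with a larger candidate set
    # Weight each candidate by f(dist); clamp distance to avoid division
    # by zero for exact duplicates.
    weights = [weight_fn(max(dist_fn(anchor, j), 1e-12)) for j in candidates]
    # random.choices performs a weighted draw (weights need not sum to 1).
    return random.choices(candidates, weights=weights, k=1)[0]
```

With `weight_fn=lambda d: 1.0 / d` this favors near (hard) negatives; a gentler choice like `1/sqrt(d)` would shade back toward uniform sampling, which is exactly the tunable thumb on the scale described above.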

@Szubie
Collaborator

Szubie commented Apr 28, 2020

Hi, thanks for the feedback and the link to the paper, it was a good read!

I like the idea and think it would be interesting to try out and see how it impacts the results of the embeddings. The idea of weighting the sampling of triplets did occur to me back when I was initially writing ivis, but I was thinking about weighting the selection of positive examples rather than negatives at that time, as it would be almost free. Weighting the selection of neighbors never made it past initial prototyping though, since it didn't have a positive impact.

From the paper you linked, it does sound like weighting the sampling of negative examples may well be worth trying, though. Hopefully the performance impact is negligible, since Annoy is pretty efficient, but we'll keep an eye on it. It would be nice to keep the number of hyperparameters low, but if there are clear benefits from weighted sampling it shouldn't be a problem to expand the number slightly.

@idroz
Collaborator

idroz commented Apr 28, 2020

I think this would be worth trying, provided we can benchmark this feature against current performance. We have some benchmarking code that was used to assess L1/L2 distances between points in embedding and original spaces, e.g. https://bering-ivis.readthedocs.io/en/latest/comparisons.html#quantitative-evaluation

I can make that notebook available in the ivis repo and we could use it with a new feature branch.
