This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Describe how sentence vectors are generated #309

Closed
matthias-samwald opened this issue Sep 6, 2017 · 4 comments

Comments


matthias-samwald commented Sep 6, 2017

It would be great to have a better understanding of how the sentence vectors are generated. Superficially, there are similarities to Sent2Vec (https://github.com/epfml/sent2vec / paper) -- is that algorithm being used?

@matthias-samwald
Author

Upon looking at the codebase a bit more, it seems like vectors are currently generated via simple averaging of word vectors?
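The simple averaging mentioned above can be sketched as follows. This is a minimal illustration, not fastText's actual implementation (which differs in details such as normalization and subword handling); the toy 4-dimensional vectors are invented for the example.

```python
import numpy as np

# Toy word vectors, invented for illustration only.
word_vectors = {
    "the": np.array([0.1, 0.3, -0.2, 0.5]),
    "cat": np.array([0.7, -0.1, 0.4, 0.0]),
    "sat": np.array([-0.3, 0.2, 0.6, -0.4]),
}

def sentence_vector(tokens, vectors):
    """Sentence embedding as the mean of the in-vocabulary word vectors."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known words: fall back to a zero vector of the right dimension.
        return np.zeros(next(iter(vectors.values())).shape)
    return np.mean(found, axis=0)

vec = sentence_vector(["the", "cat", "sat"], word_vectors)
print(vec)  # the element-wise mean of the three word vectors
```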


spate141 commented Sep 6, 2017

@matthias-samwald have you tried using sentence embeddings generated by sent2vec in any task, or done any kind of comparison with fastText embeddings?
Thanks!

Author

matthias-samwald commented Sep 6, 2017

@spate141 No, I have not done anything like that yet, but I probably will in the future (on biomedical text).
In the meantime I have read up a little more and found that my original question was somewhat misguided: sent2vec does not (at least primarily) offer special functionality for deriving sentence vectors from existing word embeddings. Rather, it uses a different training objective at the stage of training the word vectors themselves.
In contrast, TF-IDF weighting of word vectors, or Smooth Inverse Frequency (SIF) weighting (https://github.com/PrincetonML/SIF), can be applied post hoc to normally trained word vectors, but both require word frequency information.
The superiority of these sentence embeddings over simple averaging or max-pooling of word vectors seems to be a robust finding across several evaluation sets. Perhaps this could be a potential new feature for fastText?
I think I will start out with sent2vec, since it specializes in sentence embeddings and also works with word n-grams (whereas it might be extra work to apply SIF weighting to word n-grams).
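For reference, a hedged sketch of the SIF weighting scheme from the repository linked above: weight each word vector by a/(a + p(w)), average, then remove the projection onto the first principal component of the sentence-vector matrix. The word vectors and unigram probabilities here are invented toy values, and `a = 1e-3` is the commonly cited default; this is an illustration of the idea, not the reference implementation.

```python
import numpy as np

def sif_embed(sentences, vectors, word_prob, a=1e-3):
    """SIF sentence embeddings (sketch).

    sentences: list of token lists
    vectors:   dict word -> np.array (word vector)
    word_prob: dict word -> unigram probability p(w)
    """
    dim = next(iter(vectors.values())).shape[0]
    emb = []
    for tokens in sentences:
        vecs = [vectors[t] for t in tokens if t in vectors]
        # Each word is down-weighted by its frequency: a / (a + p(w)).
        weights = [a / (a + word_prob[t]) for t in tokens if t in vectors]
        emb.append(np.average(vecs, axis=0, weights=weights)
                   if vecs else np.zeros(dim))
    X = np.vstack(emb)
    # Remove the common component: project out the first right singular
    # vector (first principal direction) of the sentence-vector matrix.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)

# Toy usage with invented vectors and frequencies.
vecs = {"cat": np.array([1.0, 0.0]),
        "dog": np.array([0.0, 1.0]),
        "the": np.array([0.5, 0.5])}
probs = {"cat": 0.01, "dog": 0.01, "the": 0.2}
out = sif_embed([["the", "cat"], ["the", "dog"]], vecs, probs)
```

Note how a frequent word like "the" contributes much less than the rarer content words, which is the intuition behind the scheme.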

@cpuhrsch
Contributor

Hello @matthias-samwald,

Please see issue #323, which details a discussion of how we calculate sentence embeddings. It appears that you have already figured this out yourself; I just want to make sure it is referenced here. It could indeed be on fastText's roadmap to implement some of Sent2Vec's features, if deemed relevant, so stay tuned. I'm closing this issue now, but please feel encouraged to reopen it at any point if you don't consider it resolved.

Thanks,
Christian
