USE multilingual = 16 languages, sentence-transformers distilled version = 15 languages? #2565
Comments
Hello! I suspect that this is indeed a mistake, although I can't be certain, as I don't know whether the model is a direct copy of the one from TensorFlow Hub or whether it was further (?) tuned for those 15 languages in particular.
I also can't find any information on that. I've read what I can about how model distillation is done, but that unfortunately doesn't definitively shed light on the status of Japanese in this model since we don't know the specifics of how it was trained. It would be nice to have an official answer if possible (@nreimers ?). Regarding whether it is a direct copy of the one from TensorFlow, the vectors from But the alignment between models is also an interesting question since if I have selected
I think I didn't include Japanese in the distillation process. But it might still work for Japanese, as the underlying pretrained model supports Japanese. I would recommend running your own tests and then selecting the model that works best.
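One way to run such a test is a small translation-retrieval check: encode English sentences and their Japanese translations with each model, then count how often a sentence's nearest neighbour by cosine similarity is its own translation. The following is only a minimal sketch under stated assumptions. The helper names `cosine`, `retrieval_accuracy`, and `compare_models` are hypothetical, the sentence pairs in the comment are made-up illustrations rather than a benchmark, and it assumes `sentence-transformers` is installed so that `SentenceTransformer(name).encode(...)` can load both checkpoints.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, its translation)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_accuracy(encode: Callable[[List[str]], Sequence], pairs: Sequence[Pair]) -> float:
    """Fraction of source sentences whose most similar target is their own translation."""
    src_vecs = encode([src for src, _ in pairs])
    tgt_vecs = encode([tgt for _, tgt in pairs])
    hits = 0
    for i, src_vec in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src_vec, tgt_vecs[j]))
        hits += int(best == i)
    return hits / len(pairs)


def compare_models(model_names: Sequence[str], pairs: Sequence[Pair]) -> Dict[str, float]:
    """Score each named model on the same pairs (downloads the models on first use)."""
    # Imported lazily so the pure helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    return {
        name: retrieval_accuracy(SentenceTransformer(name).encode, pairs)
        for name in model_names
    }


# Illustrative call (the pairs are placeholders; use your own in-domain data):
# pairs = [
#     ("The weather is nice today.", "今日は天気がいいです。"),
#     ("Where is the train station?", "駅はどこですか。"),
#     ("I would like a cup of coffee.", "コーヒーを一杯ください。"),
# ]
# print(compare_models(
#     ["distiluse-base-multilingual-cased-v1", "distiluse-base-multilingual-cased-v2"],
#     pairs,
# ))
```

With only a handful of pairs the scores are very noisy, so a few hundred in-domain sentence pairs would give a more trustworthy comparison between v1 and v2.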
Thanks @nreimers for getting back to me. I do have a couple of questions:
I will do some more testing as you suggested in the short term, though, since as I pointed out, the existing v1 model does actually seem to outperform v2 on Japanese despite Japanese not being included in the distillation process.
I'm wondering, is this possibly a mistake in the documentation?
The original multilingual Universal Sentence Encoder upon which this is based supports 16 languages, but in the distilled version, Japanese is missing.
distiluse-base-multilingual-cased-v1 seems to work just fine with Japanese, and it seems it supports Japanese better than distiluse-base-multilingual-cased-v2, even though the former doesn't declare support for it while the latter does. 🤔