
USE multilingual = 16 languages, sentence-transformers distilled version = 15 languages? #2565

Open
ryanheise opened this issue Mar 30, 2024 · 4 comments


@ryanheise

I'm wondering, is this possibly a mistake in the documentation?

distiluse-base-multilingual-cased-v1: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.

The original multilingual Universal Sentence Encoder upon which this is based supports 16 languages, but in the distilled version, Japanese is missing.

distiluse-base-multilingual-cased-v1 seems to work just fine with Japanese, and it seems it supports Japanese better than distiluse-base-multilingual-cased-v2, even though the former doesn't declare support for it while the latter does. 🤔
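For reference, here is roughly how I've been comparing the two models on Japanese. It's just a quick sanity check rather than a proper benchmark, and the Japanese/English pair is a made-up example:

```python
from sentence_transformers import SentenceTransformer, util

# Load both distilled multilingual models for a side-by-side check
models = {
    "v1": SentenceTransformer("distiluse-base-multilingual-cased-v1"),
    "v2": SentenceTransformer("distiluse-base-multilingual-cased-v2"),
}

# A made-up Japanese/English pair with the same meaning
ja = "猫はソファーで寝ています。"
en = "The cat is sleeping on the sofa."

for name, model in models.items():
    ja_emb = model.encode(ja, convert_to_tensor=True)
    en_emb = model.encode(en, convert_to_tensor=True)
    # Higher cross-lingual similarity suggests better handling of Japanese
    print(name, util.cos_sim(ja_emb, en_emb).item())
```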

@tomaarsen
Collaborator

Hello!

I suspect that this is indeed a mistake, although I can't be certain: I don't know whether the model is a direct copy of the one from TensorFlow Hub or whether it was further (?) tuned for those 15 languages in particular.

  • Tom Aarsen

@ryanheise
Author

I also can't find any information on that. I've read what I can about how model distillation is done, but that unfortunately doesn't definitively shed light on the status of Japanese in this model since we don't know the specifics of how it was trained. It would be nice to have an official answer if possible (@nreimers ?).

Regarding whether it is a direct copy of the one from TensorFlow Hub: the vectors from distiluse-base-multilingual-cased-v1 and distiluse-base-multilingual-cased-v2 do seem to align with each other for the same inputs, but they don't seem to align at all with the vectors from the original multilingual USE. I may be missing something here.
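For what it's worth, the v1/v2 part of that check was just something along these lines (the sentence is arbitrary, and comparing against the original mUSE would require loading it from TensorFlow Hub, which I've left out here):

```python
from sentence_transformers import SentenceTransformer, util

v1 = SentenceTransformer("distiluse-base-multilingual-cased-v1")
v2 = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentence = "This is a simple test sentence."

e1 = v1.encode(sentence, convert_to_tensor=True)
e2 = v2.encode(sentence, convert_to_tensor=True)

# If the two embedding spaces are roughly aligned, the same sentence
# should map to nearby vectors under both models.
print(util.cos_sim(e1, e2).item())
```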

But the alignment between models is also an interesting question in its own right. If I originally selected distiluse-base-multilingual-cased-v2 for my use case because of the Japanese requirement, and it turns out that distiluse-base-multilingual-cased-v1 actually works better for Japanese, then being able to swap one model for the other without reprocessing all of the data I've already processed would be a big cost saving. That assumes the vectors produced for English, Spanish and so on under distiluse-base-multilingual-cased-v2 will still be meaningful when compared with new vectors produced under distiluse-base-multilingual-cased-v1.

@nreimers
Member

nreimers commented Apr 4, 2024

I think I didn't include Japanese in the distillation process. But it might still work for Japanese, as the underlying pre-trained model supports Japanese.

I would recommend running your own tests and then selecting the model that works best.

@ryanheise
Author

Thanks @nreimers for getting back to me. I do have a couple of questions:

  1. Would it be appropriate for me to contribute a donation to help get Japanese properly included? (Or was there some reason that made including Japanese infeasible?)
  2. Or, if you would advise that I run the distillation process myself, would you be able to share any more details on the specifics you used beyond this example? (A rough sketch of what I have in mind is at the end of this comment.)

In the short term, though, I will do some more testing as you suggested, since, as I pointed out, the existing v1 model actually seems to outperform v2 on Japanese despite Japanese not being included in the distillation process.
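To make question 2 above more concrete, here is the rough shape of what I have in mind, based on my reading of the multilingual knowledge-distillation example in the sentence-transformers docs. The teacher/student choices, the data file name and the training settings below are purely my assumptions, not anything confirmed about how v1 or v2 were actually trained:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the existing model whose embedding space should be preserved.
# Student: a copy of the same model, which will additionally learn Japanese.
teacher = SentenceTransformer("distiluse-base-multilingual-cased-v1")
student = SentenceTransformer("distiluse-base-multilingual-cased-v1")

# Parallel data: one tab-separated source/Japanese pair per line.
# (The file name is hypothetical.)
data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
data.load_data("parallel-sentences-en-ja.tsv.gz")

loader = DataLoader(data, shuffle=True, batch_size=64)

# MSE between the student's embedding of the Japanese sentence and
# the teacher's embedding of the source sentence.
loss = losses.MSELoss(model=student)

student.fit(
    train_objectives=[(loader, loss)],
    epochs=5,
    warmup_steps=1000,
    output_path="distiluse-base-multilingual-cased-v1-ja",  # hypothetical
)
```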
