Support Language Detection #9

Open
PJ-Finlay opened this issue Dec 21, 2020 · 13 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@PJ-Finlay
Collaborator

The plan for this was to train a model using the existing infrastructure that maps from input text to a language code. This would require adding a way to generate this data in the training scripts, plus what is hopefully a fairly small code change to support it. I'd be optimistic about this working well out of the box, but it may take some tweaking.
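For illustration, here is a rough sketch of that data-generation step, assuming the corpus is a set of per-language sentence files; the function name and file layout are hypothetical, not part of the current training scripts:

```python
# Hypothetical sketch: build (text, language_code) pairs so a detection
# model can be trained with the same seq2seq pipeline as translation,
# where the "translation" of a sentence is simply its language code.
import random
from pathlib import Path

def generate_detection_pairs(corpora: dict, samples_per_lang: int = 10000):
    """corpora maps a language code to a Path containing one sentence per line."""
    pairs = []
    for lang_code, corpus_path in corpora.items():
        lines = corpus_path.read_text(encoding="utf-8").splitlines()
        sample = random.sample(lines, min(samples_per_lang, len(lines)))
        pairs.extend((line.strip(), lang_code) for line in sample if line.strip())
    random.shuffle(pairs)
    return pairs
```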

@argosopentech added the help wanted and enhancement labels on Dec 21, 2020
@pierotofy
Contributor

This would be pretty useful for any automated translation mechanism!

@PJ-Finlay
Collaborator Author

Interesting. I think using the same pipeline would be a good long-term solution, but this could be something to do in the meantime. One issue with using the pipeline is that as soon as we add a new language we also have to retrain the detector. This would probably also be lighter weight than a 100MB model file. The main interest in this is currently from LibreTranslate, so if someone wants to extend the Python API to use this, that would be welcome; the API could then be reimplemented in the future if it makes sense.

@thomas536

Some support was added to LibreTranslate in LibreTranslate/LibreTranslate#12

@hollorol
Contributor

Recently I saw an article comparing language detection tools. FastText could be a viable option instead of langdetect because it is a lot faster.

We have another option which can be quite accurate for longer texts: n-grams. There are predetermined n-gram lists for all supported languages, and it is easy to generate new ones. The advantages of this approach are that the models are really small, the implementation is easy, and it does not need any extra library. In any case, if help is needed, I can implement these.
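A minimal sketch of the n-gram idea using only the standard library (the profile size, n-gram length, and the out-of-place ranking are placeholder choices, not an agreed design):

```python
# Character n-gram language detection sketch. Real per-language profiles
# would be pre-generated and shipped as small files; here they are built
# from tiny sample texts just to show the mechanics.
from collections import Counter

def ngram_profile(text: str, n: int = 3, size: int = 300) -> list:
    """Return the most frequent character n-grams of the text, ranked."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(size)]

def out_of_place_distance(text_profile: list, lang_profile: list) -> int:
    """Sum of rank differences; missing n-grams get the maximum penalty."""
    ranks = {gram: i for i, gram in enumerate(lang_profile)}
    max_rank = len(lang_profile)
    return sum(abs(i - ranks.get(gram, max_rank)) for i, gram in enumerate(text_profile))

def detect(text: str, profiles: dict) -> str:
    """Return the language code whose profile is closest to the text."""
    text_profile = ngram_profile(text)
    return min(profiles, key=lambda code: out_of_place_distance(text_profile, profiles[code]))

# Example usage with toy sample texts:
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and runs away"),
    "es": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso y se escapa"),
}
print(detect("the dog runs over the fox", profiles))  # expected: "en"
```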

@PJ-Finlay
Collaborator Author

@hollorol If you can do this with just the Python standard library, a pull request would be appreciated.

@hollorol
Contributor

@PJ-Finlay, I'll do it only for the CLI, because I don't use the GUI part of the program; but I guess adapting it to the GUI afterwards will be easy.

@PJ-Finlay
Collaborator Author

That sounds good. It should probably be its own file/module that can be integrated into the CLI.

@TechnologyClassroom
Contributor

Lingua might be useful for this. Lingua is written in Python, works with short strings, works offline, and is licensed under Apache-2.0.
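For example, a possible use of the lingua-language-detector package (API as I understand it from the lingua-py project; nothing here is wired into Argos Translate):

```python
# Sketch using lingua-py (pip install lingua-language-detector); the
# language list and the mapping to an ISO code are illustrative only.
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
).build()

language = detector.detect_language_of("languages are awesome")
if language is not None:
    print(language.iso_code_639_1.name.lower())  # "en"
```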

@PJ-Finlay
Collaborator Author

LibreTranslate already has a system for language detection, so this hasn't been a priority. My plan was to use CTranslate2 models to map input text to a language code, but I'm open to suggestions.

@TechnologyClassroom
Contributor

Not everyone uses LibreTranslate.

@PJ-Finlay
Collaborator Author

The way Argos Translate currently works, adding this would be a breaking change, but I'm planning to add it in the next major version. It would also be possible to add language detection to the GUI (which is in a separate repo) using a third-party library like Lingua.

@TechnologyClassroom
Contributor

I could see it being used as a special input that triggers language detection. The syntax could be something like this:

echo "Text to translate" | argos-translate --from-lang auto-detect --to-lang en

@PJ-Finlay
Collaborator Author

This is the way to do it for core Argos Translate; the only thing I might change is "detect" instead of "auto-detect".

animeavi pushed a commit to animeavi/pleroma that referenced this issue Dec 20, 2022
Argos Translate is a Python module for translation and can be used as a command line tool.

This is also the engine for LibreTranslate, for which we already have a module.
Here we can use the engine directly from our server without making requests to a third party or having to install our own LibreTranslate webservice (although you obviously do have to install Argos Translate).

One thing that's currently still missing from Argos Translate is auto-detection of languages (see <argosopentech/argos-translate#9>). For now, when no source language is provided, we just return the text unchanged, supposedly translated from the target language. That way you get a near immediate response in pleroma-fe when clicking Translate, after which you can select the source language from a dropdown.

Argos Translate also doesn't seem to handle HTML very well. Therefore we give admins the option to strip the HTML before translating. I made this an option because I'm unsure if/how this will change in the future.

Co-authored-by: ilja <git@ilja.space>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/351
Co-authored-by: ilja <akkoma.dev@ilja.space>
Co-committed-by: ilja <akkoma.dev@ilja.space>