Support Language Detection #9

Open
PJ-Finlay opened this issue Dec 21, 2020 · 13 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@PJ-Finlay
Collaborator

The plan for this was to train a model using the existing infrastructure that maps from input text to a language code. This would require adding a way to generate this data in the training scripts, plus what is hopefully a fairly small code change to support it. I'd be optimistic about this working well out of the box, but it may take some tweaking.
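For illustration, here is a rough sketch of that data-generation step, assuming the corpus is a set of per-language sentence files; the function name and file layout are hypothetical, not part of the current training scripts:

```python
# Hypothetical sketch: build (text, language_code) pairs so a detection
# model can be trained with the same seq2seq pipeline as translation,
# where the "translation" of a sentence is simply its language code.
import random
from pathlib import Path

def generate_detection_pairs(corpora: dict, samples_per_lang: int = 10000):
    """corpora maps a language code to a Path containing one sentence per line."""
    pairs = []
    for lang_code, corpus_path in corpora.items():
        lines = corpus_path.read_text(encoding="utf-8").splitlines()
        sample = random.sample(lines, min(samples_per_lang, len(lines)))
        pairs.extend((line.strip(), lang_code) for line in sample if line.strip())
    random.shuffle(pairs)
    return pairs
```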

@argosopentech added the help wanted and enhancement labels on Dec 21, 2020
@pierotofy
Contributor

This would be pretty useful for any automated translation mechanism!

@PJ-Finlay
Collaborator Author

Interesting. I think using the same pipeline would be a good long-term solution, but this could be something to do in the meantime. One issue with using the pipeline is that as soon as we add a new language we also have to retrain the detector. This would probably also be lighter weight than a 100MB model file. The main interest in this is currently from LibreTranslate, so if someone wants to extend the Python API to use this, that would be welcome; the API could then be reimplemented in the future if it makes sense.

@thomas536

Some support was added to LibreTranslate in LibreTranslate/LibreTranslate#12

@hollorol
Contributor

Recently I saw an article comparing language detection tools. FastText could be a viable option instead of langdetect because it is a lot faster.

We have another option which can be quite accurate for longer texts: n-grams. There are predetermined n-gram lists for all supported languages, and it is easy to generate new ones. The advantages of this approach are that the models are really small, the implementation is easy, and it does not need any extra library. In any case, if help is needed, I can implement these.
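A minimal sketch of the n-gram idea using only the standard library (the profile size, n-gram length, and the out-of-place ranking are placeholder choices, not an agreed design):

```python
# Character n-gram language detection sketch. Real per-language profiles
# would be pre-generated and shipped as small files; here they are built
# from tiny sample texts just to show the mechanics.
from collections import Counter

def ngram_profile(text: str, n: int = 3, size: int = 300) -> list:
    """Return the most frequent character n-grams of the text, ranked."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(size)]

def out_of_place_distance(text_profile: list, lang_profile: list) -> int:
    """Sum of rank differences; missing n-grams get the maximum penalty."""
    ranks = {gram: i for i, gram in enumerate(lang_profile)}
    max_rank = len(lang_profile)
    return sum(abs(i - ranks.get(gram, max_rank)) for i, gram in enumerate(text_profile))

def detect(text: str, profiles: dict) -> str:
    """Return the language code whose profile is closest to the text."""
    text_profile = ngram_profile(text)
    return min(profiles, key=lambda code: out_of_place_distance(text_profile, profiles[code]))

# Example usage with toy sample texts:
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and runs away"),
    "es": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso y se escapa"),
}
print(detect("the dog runs over the fox", profiles))  # expected: "en"
```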

@PJ-Finlay
Collaborator Author

@hollorol If you can do this with just the Python standard library, a pull request would be appreciated.

@hollorol
Contributor

@PJ-Finlay, I'll do it only for the CLI, because I don't use the GUI part of the program; but I guess adapting it to the GUI afterwards will be easy.

@PJ-Finlay
Collaborator Author

That sounds good. It should probably be its own file/module that can be integrated into the CLI.

@TechnologyClassroom
Contributor

Lingua might be useful for this. Lingua is written in Python, works with short strings, works offline, and is licensed under Apache-2.0.
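For example, a possible use of the lingua-language-detector package (API as I understand it from the lingua-py project; nothing here is wired into Argos Translate):

```python
# Sketch using lingua-py (pip install lingua-language-detector); the
# language list and the mapping to an ISO code are illustrative only.
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
).build()

language = detector.detect_language_of("languages are awesome")
if language is not None:
    print(language.iso_code_639_1.name.lower())  # "en"
```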

@PJ-Finlay
Collaborator Author

LibreTranslate already has a system for language detection, so this hasn't been a priority. My plan was to use CTranslate2 models to map input text to a language code, but I'm open to suggestions.

@TechnologyClassroom
Contributor

Not everyone uses LibreTranslate.

@PJ-Finlay
Collaborator Author

The way Argos Translate currently works, adding this would be a breaking change, but I'm planning to add it in the next major version. It would also be possible to add language detection to the GUI (which is in a separate repo) using a third-party library like Lingua.

@TechnologyClassroom
Contributor

I could see it being used as a special input that triggers language detection. The syntax could be something like this:

echo "Text to translate" | argos-translate --from-lang auto-detect --to-lang en

@PJ-Finlay
Collaborator Author

This is the way to do it for core Argos Translate; the only thing I might change is "detect" instead of "auto-detect".

animeavi pushed a commit to animeavi/pleroma that referenced this issue Dec 20, 2022
Argos Translate is a Python module for translation and can be used as a command line tool.

This is also the engine for LibreTranslate, for which we already have a module.
Here we can use the engine directly from our server without making requests to a third party or having to install our own LibreTranslate webservice (although you obviously do have to install Argos Translate).

One thing that's currently still missing from Argos Translate is auto-detection of languages (see <argosopentech/argos-translate#9>). For now, when no source language is provided, we just return the text unchanged, supposedly translated from the target language. That way you get a near immediate response in pleroma-fe when clicking Translate, after which you can select the source language from a dropdown.

Argos Translate also doesn't seem to handle HTML very well. Therefore we give admins the option to strip the HTML before translating. I made this an option because I'm unsure if/how this will change in the future.

Co-authored-by: ilja <git@ilja.space>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/351
Co-authored-by: ilja <akkoma.dev@ilja.space>
Co-committed-by: ilja <akkoma.dev@ilja.space>