avoid translation of Named Entities #127

Open
opme opened this issue Jan 13, 2023 · 3 comments

Comments

@opme

opme commented Jan 13, 2023

I noticed that named entities like company names are getting translated.

I was thinking of running a preprocessor model, such as one from spaCy (spacy.io), to flag all the named entities, and then avoid translating those.

I am wondering if there is an official way to prevent translation of parts of the text sent to translateLocally.

For example: China Nonferrous Gold Limited -> Kiina Non Iron Gold Limited (Finnish, from the OPUS-MT student model)

Using the student models is the best solution for translating large amounts of text with limited computing power. I am playing around with translating a site I am building into many languages, but just 80k paragraphs was going to take months on a single computer. Here I can do it in one night.

@jelmervdl
Collaborator

It's an issue we're aware of, but don't have a solution for yet. We're thinking along the same lines though!

Our plan was to add support for placeholders, i.e. placeholders in the input sentence would be copied as-is into the output sentence (but in the proper position). We could then replace some or all named entities, URLs, email addresses, etc. with placeholders and put them back in after translation. The problem with this approach is that the model has to be trained with placeholder support, so this won't work with our current models.

What you could try is to use the support for HTML translation that's in bergamot-translator (the library backing translateLocally). I just pushed a commit to the main branch to make that accessible from the command line. With that version, you should be able to do something like:

echo "The train leaves for <span>London St. Pancras</span> at quarter past six." | ./translateLocally -m eng-fin-tiny --html
Juna lähtee <span>Lontoo St. Pancras</span> neljännestä yli kuusi.

HTML support is not really meant for this, but it might get you at least halfway. You can add tags such as <span id="1"> around the named entities. They will still be translated, but at least you know where they are and you can put the original back in. If you'd rather hide them from the translation engine, you can insert an empty <span id="1"></span> instead, but that might confuse the translation model even more (because it's not trained on sentences with missing words).
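
The wrap-and-restore idea described above could look roughly like the following Python sketch. It is not part of translateLocally or bergamot-translator: the function names `wrap_entities` and `restore_entities` are hypothetical, the entity offsets are assumed to come from some NER step, and the text is assumed to contain no other HTML or special characters that would need escaping.

```python
import re

# Matches the numbered spans this sketch inserts, e.g. <span id="1">...</span>
SPAN_RE = re.compile(r'<span id="(\d+)">(.*?)</span>', re.DOTALL)


def wrap_entities(text, spans):
    """Wrap each (start, end) character range in a numbered <span>.

    `spans` must not overlap. Returns the marked-up text and a dict
    mapping span id (as a string) to the original entity text.
    """
    pieces, originals, pos = [], {}, 0
    for i, (start, end) in enumerate(sorted(spans), start=1):
        originals[str(i)] = text[start:end]
        pieces.append(text[pos:start])
        pieces.append(f'<span id="{i}">{text[start:end]}</span>')
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces), originals


def restore_entities(translated, originals):
    """Replace every numbered span with the untranslated entity it stood for.

    Spans with an unknown id are simply unwrapped (their text is kept).
    """
    return SPAN_RE.sub(
        lambda m: originals.get(m.group(1), m.group(2)), translated
    )
```

With the example sentence above, `wrap_entities` would be called with the character range of "London St. Pancras", the marked-up text would be piped through `./translateLocally -m eng-fin-tiny --html`, and `restore_entities` would then turn "Lontoo St. Pancras" back into "London St. Pancras".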

@opme
Author

opme commented Jan 15, 2023

Thank you. It is working with the HTML support in all languages except Estonian. The HTML support looks to be broken in the Estonian model. I'm doing the preprocessing with a spaCy model that is able to detect full names. I then add the spans, and after the translation I use a regex to put the originals back.

I'm also seeing what looks like a memory leak, though it can be worked around by restarting the subprocess every 1000 iterations.
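
A minimal sketch of that restart workaround, assuming the translator reads one paragraph per line on stdin and writes one translated line per input line on stdout (the thread does not confirm this). The helper name `translate_in_batches` is hypothetical. Each batch runs in a fresh process, so any memory leaked during a batch is released when that process exits.

```python
import subprocess


def translate_in_batches(lines, cmd, batch_size=1000):
    """Run `cmd` once per batch of input lines and collect the output lines.

    Assumes `cmd` emits exactly one output line per input line and that
    the input lines themselves contain no newline characters.
    """
    results = []
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i + batch_size]
        proc = subprocess.run(
            cmd,
            input="\n".join(batch) + "\n",
            capture_output=True,
            encoding="utf-8",
            check=True,  # raise if the translator exits with an error
        )
        out = proc.stdout.splitlines()
        if len(out) != len(batch):
            raise RuntimeError(
                f"expected {len(batch)} output lines, got {len(out)}"
            )
        results.extend(out)
    return results


# Example command, using the flags shown earlier in this thread:
# cmd = ["./translateLocally", "-m", "eng-fin-tiny", "--html"]
```

The line-count check is there so that a misaligned batch fails loudly instead of silently pairing paragraphs with the wrong translations.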

I am still working on the scripts and will post an example when it is stable.

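    import spacy
    from spacy.matcher import Matcher
    from transformers import pipeline

    # Load models to handle named entities.
    # spaCy pipeline plus a rule-based matcher over its vocabulary
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # Hugging Face NER model; "simple" aggregation merges word pieces into whole entities
    model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
    token_classifier = pipeline(
        "token-classification", model=model_checkpoint, aggregation_strategy="simple"
    )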

Example of the Estonian issue:

echo "<span id=\"1\">Atrium Mortgage Investment Corporation</span>, a non-bank lender, provides financing solutions to the real estate communities in Ontario, Alberta, and British Columbia. It offers various types of mortgage loans for residential, multi-residential, and commercial real properties" | ./translateLocally -m en-et-tiny --html

<span id="1">Panka</span> mittekuuluva laenuandja <span id="1">Atrium Mortgage Investment Corporation</span> pakub rahastamislahendusi Ontario, Alberta ja Briti Columbia kinnisvarakogukondadele. Ta pakub erinevat tüüpi eluasemelaenu elamu-, multiresidentide ja ärikinnisvarale

@jelmervdl
Collaborator

I think you're seeing the results of using alignment scores for inserting HTML, and why it isn't ideal for your use case. What it basically does is check, for each output token, which source token aligns best according to an alignment model.

There's no guarantee of a 1-to-1 mapping, and the HTML reconstruction is allowed to duplicate elements if it thinks that a span in the input sentence got split up in the translated sentence. You might want to do some post-processing to decide which of the spans is the actual named entity.
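
One possible form of that post-processing, sketched in Python under stated assumptions: when several spans share an id, keep the one whose text is most similar to the original entity and unwrap the rest. The string-similarity heuristic (via `difflib`) and the name `resolve_duplicate_spans` are illustrative choices, not something bergamot-translator provides, and the heuristic may pick the wrong span when the entity itself was heavily translated.

```python
import re
from difflib import SequenceMatcher

# Matches numbered spans such as <span id="1">...</span>
SPAN_RE = re.compile(r'<span id="(\d+)">(.*?)</span>', re.DOTALL)


def resolve_duplicate_spans(translated, originals):
    """Keep one span per id (the best match) and restore the original there.

    `originals` maps span id (string) to the untranslated entity text.
    All other spans, including those with unknown ids, are unwrapped.
    """
    # First pass: per id, remember where the most similar span starts.
    best = {}  # id -> (similarity score, match start offset)
    for m in SPAN_RE.finditer(translated):
        sid = m.group(1)
        if sid not in originals:
            continue
        score = SequenceMatcher(
            None, originals[sid].lower(), m.group(2).lower()
        ).ratio()
        if sid not in best or score > best[sid][0]:
            best[sid] = (score, m.start())

    # Second pass: substitute the original at the best span, unwrap the rest.
    def replace(m):
        sid = m.group(1)
        if sid in best and best[sid][1] == m.start():
            return originals[sid]
        return m.group(2)

    return SPAN_RE.sub(replace, translated)
```

Applied to the Estonian output above, this would leave "Panka" as plain text and keep "Atrium Mortgage Investment Corporation" as the entity.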
