
[QUESTION] What would be a reasonable/sound approach when we only have translation with reference? #51

Closed
alvations opened this issue Jan 13, 2022 · 6 comments
Labels
question Further information is requested

Comments

@alvations
Contributor

Sometimes we only have translations with their references and no source. But the default COMET expects something like:

{"src": src, "mt": hyp, "ref": ref}

or, for QE COMET:

{"src": src, "mt": hyp}

Is there a way to let COMET take

{"mt": hyp, "ref": ref}

Would this be a feasible approach?

{"src": ref, "mt": hyp, "ref": ref}
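In code, the workaround of reusing the reference as the source could be a small helper like this (`ref_as_src` is a hypothetical name, not part of the COMET API; the resulting list could then be passed to a COMET model as usual):

```python
def ref_as_src(samples):
    """Fill the missing "src" field with the reference so that the
    samples match the {"src", "mt", "ref"} triplet format COMET expects.
    Hypothetical helper; not part of the COMET library."""
    return [{"src": s["ref"], "mt": s["mt"], "ref": s["ref"]} for s in samples]

data = ref_as_src([
    {"mt": "The fire could be stopped",
     "ref": "They were able to control the fire."}
])
# data[0]["src"] is now "They were able to control the fire."
```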
@alvations added the question label on Jan 13, 2022
@ricardorei
Collaborator

Hi @alvations! We do not have any model that takes only the MT and the reference; all our models receive the source.

The best way to get this would be to retrain the QE model, replacing the "src" field with the reference.
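A minimal sketch of what that data preparation could look like, assuming a QE-style training file with src, mt, and score columns (the column names and the example score are assumptions for illustration; check the COMET training docs for the actual expected format):

```python
import csv
import io

# Hypothetical sketch: build a QE-style training CSV where the "src"
# column is filled with the reference instead of the true source.
rows = [
    {"ref": "They were able to control the fire.",
     "mt": "The fire could be stopped",
     "score": 0.5},  # assumed human quality score for illustration
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["src", "mt", "score"])
writer.writeheader()
for r in rows:
    # The reference stands in for the missing source.
    writer.writerow({"src": r["ref"], "mt": r["mt"], "score": r["score"]})

csv_text = buf.getvalue()
```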

@ricardorei
Collaborator

{"mt": hyp, "ref": ref}
This is a bad idea for a QE model: the embeddings of the hypothesis and the reference will be very close to each other, so the QE model would assign a very high score.

{"src": ref, "mt": hyp, "ref": ref}
This might give you something (because the reference-based model relies much more on the reference than it does on the source). Yet, the score might be a bit biased towards higher values...

@alvations
Contributor Author

Thanks for the explanation! Yes, it makes sense that {"src": ref, "mt": hyp, "ref": ref} would be biased towards higher values. It would be nice to compare it against {"src": src, "mt": hyp, "ref": ref}.

@ogencoglu

I am also interested in the case of only having a translation and a ground truth, without any source.

What is the exact effect of src on the score for "Unbabel/wmt22-comet-da"? Here are 4 experiments:

data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    }
]

which results in 0.8386

Then I replaced the source with some random Japanese text ("The dog is barking at the mail carrier."), which is not related to mt or ref at all:

data = [
    {
        "src": "犬が郵便配達員に向かって吠えている。",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    }
]

which results in 0.8260

Then I just passed empty string:

data = [
    {
        "src": "",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    }
]

which results in 0.8229

And finally passed ref:

data = [
    {
        "src": "They were able to control the fire.",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    }
]

which results in 0.8423

There does not seem to be a significant difference between these scores. Any comments would be appreciated.

@ogencoglu

The difference in scores is even less significant considering that the following gives a score of 0.4236.

data = [
    {
        "src": "",
        "mt": "",
        "ref": "They were able to control the fire."
    }
]
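Putting the reported numbers side by side makes the point concrete (scores copied from the experiments above):

```python
scores = {
    "true src":      0.8386,
    "random ja src": 0.8260,
    "empty src":     0.8229,
    "ref as src":    0.8423,
    "empty src+mt":  0.4236,
}

# Spread among the four source perturbations that keep a real hypothesis:
src_variants = [v for k, v in scores.items() if k != "empty src+mt"]
spread = max(src_variants) - min(src_variants)     # 0.8423 - 0.8229
# Gap between the worst source perturbation and an empty hypothesis:
drop = min(src_variants) - scores["empty src+mt"]  # 0.8229 - 0.4236

print(round(spread, 4), round(drop, 4))  # 0.0194 0.3993
```

So perturbing the source moves the score by about 0.02, while removing the hypothesis moves it by about 0.40: the model is far more sensitive to the hypothesis than to the source.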

@ricardorei
Collaborator

Hey! Two things to consider for Unbabel/wmt22-comet-da:

  1. It relies much more on the reference than on the source. The source seems to help just a little, to disambiguate a few phenomena, but overall the model gives much more importance to the reference.
  2. Empty strings are out-of-distribution (OOD). It is likely that there are no empty strings in the WMT data used to train the model. The model can sometimes behave strangely on such very low-quality translations because they essentially never occur in training. That is why I believe it is important to pair COMET with a lexical metric like chrF.

From your examples, the range of score changes under your perturbations seems small (this is not desirable), yet if we look at the rankings, they seem correct: the score is lowest for empty src and mt, an empty source and a random source get similar scores, and the highest score comes from using all the data correctly.
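To illustrate why a lexical metric catches the empty-hypothesis case that the neural model scores at 0.4236, here is a much-simplified chrF-style character n-gram F-score (illustrative only; the real metric, e.g. sacrebleu's chrF, differs in details such as averaging and defaults):

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF-style: ignore whitespace when extracting character n-grams.
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Very simplified chrF-style score: average character n-gram
    precision and recall for n = 1..max_n, combined into an F-beta
    score (beta=2 weighs recall higher, as chrF does). Use sacrebleu's
    chrF implementation for real evaluation."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# An empty hypothesis gets 0.0 here, unlike the neural metric's 0.4236:
print(simple_chrf("", "They were able to control the fire."))  # 0.0
```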


3 participants