
[INPUT] Text Length of Input (source, reference, and hypothesis) #209

Open
foreveronehundred opened this issue Mar 28, 2024 · 2 comments
Labels: question (Further information is requested)

Comments


foreveronehundred commented Mar 28, 2024

For translation quality estimation of COMET, I think there is no limitation of the text length. However, from my personal perspective, I do not think the estimation will be accurate if the text is too long.

So, what text length (of source, reference, and hypothesis) do you recommend?

foreveronehundred added the question label on Mar 28, 2024
ricardorei (Collaborator) commented

Hi @foreveronehundred! The code does not break when running very large segments, BUT the models truncate the input once it goes above 512 tokens. For models like wmt22-cometkiwi-da, the input is shared between source and translation, which means that the total number of tokens from source and translation combined should not exceed 512.

Still, 512 tokens is a long input. It's more than enough to feed in several sentences together and evaluate entire paragraphs, though probably not enough for an entire two-page document.

A quick way to test this is to tokenize both inputs and check their lengths:

from transformers import XLMRobertaTokenizer
source = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]
# This is the same for most COMET models
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base") 
# Tokenize and count tokens for each pair
for src, trans in zip(source, translations):
    # Tokenize each side separately (no special tokens)
    src_tokens = tokenizer.encode(src, add_special_tokens=False)
    trans_tokens = tokenizer.encode(trans, add_special_tokens=False)

    # Jointly encode and count tokens
    joint_tokens = tokenizer.encode(src, trans, add_special_tokens=True, truncation=True)

    # Output token counts
    print(f"Source: {src}\nTranslation: {trans}")
    print(f"Source tokens: {len(src_tokens)}")
    print(f"Translation tokens: {len(trans_tokens)}")
    print(f"Jointly encoded tokens: {len(joint_tokens)}")
    print("="*30)

foreveronehundred (Author) commented

Thanks for the reply. I think that length is enough for general cases.
By the way, I would like to know the token lengths of the training data. Could you share some statistics (mean, STD, etc.)?
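
In the meantime, here is the rough check I am running on my own data to get those numbers (the file path and column names below are just placeholders for my setup):

import csv
import statistics

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# "my_data.tsv" with "src" and "mt" columns is a placeholder for whatever data you want to profile
lengths = []
with open("my_data.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        src_len = len(tokenizer.encode(row["src"], add_special_tokens=False))
        mt_len = len(tokenizer.encode(row["mt"], add_special_tokens=False))
        # Joint length, since source and translation share the 512-token budget
        lengths.append(src_len + mt_len)

print(f"Mean: {statistics.mean(lengths):.1f}")
print(f"STD: {statistics.stdev(lengths):.1f}")
print(f"Max: {max(lengths)}")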
