Translations are incorrect when the source contains numbers in different formats. #87

Closed
Sab8605 opened this issue Jul 29, 2024 · 4 comments

Comments

@Sab8605

Sab8605 commented Jul 29, 2024

I have set up the models and am using the En-to-Indic model for translation, following the README.
I observed some issues with numbers.

Issues with Numerical Handling in English-to-Indic Translations:

Missing Numbers in Translation:
The model fails to output some numerical values in translations. For instance, for the sentence "A reduction of 20% from the existing liability of Rs. 1,87,500," the Hindi translation is: "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।" The value 1,87,500 is missing from the output. Upon debugging, I found that the model itself does not return the token; the raw translated output is: ['▁रुपये ▁की ▁मौजूदा ▁देन दारी ▁से ▁20 ▁प्रतिशत ▁की ▁कमी ▁।'].
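A dropped number like this can be flagged automatically by comparing the numeric tokens in the source and the translation. The following is only a minimal sketch: `NUMBER_RE` and `missing_numbers` are hypothetical names that are not part of the project, and the pattern assumes numbers are written with digits, commas, and periods only.

```python
import re
from typing import List

# Assumed pattern: digit runs joined by commas or periods, e.g. 1,87,500 or 910.74.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def missing_numbers(source: str, translation: str) -> List[str]:
    """Return the numbers that occur in `source` but not in `translation`."""
    translated = set(NUMBER_RE.findall(translation))
    return [n for n in NUMBER_RE.findall(source) if n not in translated]


source = "A reduction of 20% from the existing liability of Rs. 1,87,500,"
translation = "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।"
print(missing_numbers(source, translation))  # ['1,87,500']
```

The check is an exact string comparison, so it would also flag a number that was translated with different spacing or script-specific digits.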

Incorrect Sentence Splitting with Numerical Values:
Sentences are split incorrectly around numerical values when "Rs." is followed by a space. For example, for the sentence "The company reported a revenue increase from Rs. 12,34,567 in 2019-20 to Rs. 56,78,910 in 2020-21," preprocessing produces: ['The company reported a revenue increase from Rs.', '12,34,567 in 2019-20 to Rs.', '56,78,910 in 2020-21.']. The resulting Hindi output is: 'कंपनी ने रुपये से राजस्व वृद्धि की सूचना दी। 12,34,567 से 2019-20 में रु। 2020-21 में 56,78,910।' This incorrect splitting degrades translation quality.
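One possible workaround is to mask the period in "Rs." before sentence splitting and restore it afterwards. The sketch below is an assumption-laden illustration, not the pipeline's code: `naive_split` is a stand-in for the real sentence splitter, and `DOT_MASK`, `RS_RE`, and `split_protecting_rs` are hypothetical names. Whether a given splitting library leaves the mask string intact would need to be verified.

```python
import re
from typing import Callable, List

# Hypothetical mask; any string unlikely to occur in the input would do.
DOT_MASK = "<DOT>"
# "Rs." followed by optional whitespace and a digit.
RS_RE = re.compile(r"\bRs\.(?=\s*\d)")


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_protecting_rs(
    text: str, splitter: Callable[[str], List[str]] = naive_split
) -> List[str]:
    """Mask the period in 'Rs.', split, then restore the period."""
    masked = RS_RE.sub("Rs" + DOT_MASK, text)
    return [s.replace(DOT_MASK, ".") for s in splitter(masked)]


text = (
    "The company reported a revenue increase from Rs. 12,34,567 in 2019-20 "
    "to Rs. 56,78,910 in 2020-21."
)
print(len(naive_split(text)))          # 3
print(len(split_protecting_rs(text)))  # 1
```

The stand-in splitter reproduces the three-way split reported above, while the masked version keeps the sentence whole.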

Regex Pattern Limitations:
The regex pattern defined for handling numbers does not correctly process certain number formats. For example, in the sentence "The company reported a revenue increase from 12,34,567.74 in 2019-20 to Rs. 56,78,910.74 in 2020-21," preprocessing yields: ['The company reported a revenue increase from 12,34 , in to Rs.', '56,78 , in .']. The resulting translation is: 'कंपनी ने राजस्व में वृद्धि दर्ज की जो 12,34,567.74 से 2019-20 में रु। 56, 78, 2020-21 में 910.74.'
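For comparison, a pattern that treats comma-grouped numbers with a decimal part, and ranges such as 2019-20, as single tokens could look like the sketch below. This is not the pattern used in the repository: `NUM_RE`, `mask_numbers`, `unmask_numbers`, and the `<ID1>`-style placeholder format are all assumptions made for illustration.

```python
import re
from typing import Dict, Tuple

# Assumed pattern: digits, optional comma groups, optional decimal part,
# optional "-NN" suffix for ranges such as 2019-20.
NUM_RE = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?(?:-\d+)?")


def mask_numbers(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each numeric token with a placeholder and return the mapping."""
    mapping: Dict[str, str] = {}

    def repl(match: "re.Match[str]") -> str:
        key = f"<ID{len(mapping) + 1}>"
        mapping[key] = match.group(0)
        return key

    return NUM_RE.sub(repl, text), mapping


def unmask_numbers(text: str, mapping: Dict[str, str]) -> str:
    """Put the original numeric tokens back in place of the placeholders."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


text = (
    "The company reported a revenue increase from 12,34,567.74 in 2019-20 "
    "to Rs. 56,78,910.74 in 2020-21."
)
masked, mapping = mask_numbers(text)
print(masked)
print(list(mapping.values()))
```

Restoring the numbers only works if the model copies the placeholders through unchanged, which is not guaranteed for a model that was not trained to retain them.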

Extra Spaces in Numerical Values:
The model is generating extra spaces in numerical values. For instance, in the sentence "when the highest basic pay in the government was only Rs. 30,000 per month," the translation is: "जब सरकार में सबसे अधिक मूल वेतन केवल रु। 30, 000 प्रति माह।" The inclusion of an extra space in the number "30, 000" affects the translation quality.

Thank you.

@PranjalChitale
Collaborator

The inference pipeline is designed to be broad-spectrum, handling texts from a wide array of domains. However, it is not foolproof.

This regex-placeholder method is applied post-hoc as we found it effective in most cases through empirical testing.

Note that the models weren't specifically trained to retain these placeholders and you can go ahead and fine-tune the models to do so.

Sentence splitting is performed using the best open-source libraries available.

The regex pattern was developed by analyzing encountered cases and covers most general-purpose use cases.

If you have any recommendations for other libraries or improved regex patterns, please let us know.

Additionally, you can choose to bypass the inference pipeline when sentence splitting is not necessary (if you are confident about the sequence length).

Below are the results when using the Fairseq model without the inference pipeline.

मौजूदा 1,87,500 रुपये की देनदारी से 20% की कमी।
कंपनी का राजस्व 2019-20 में 12,34,567 रुपये से बढ़कर 2020-21 में 56,78,910 रुपये हो गया।
जब सरकार में सबसे अधिक मूल वेतन केवल 30,000 रुपये प्रति माह था।

@Sab8605
Author

Sab8605 commented Jul 29, 2024

Thank you so much for your detailed response and for explaining the current approach and its limitations.

Could you kindly guide me on how these results were generated? Were they produced using the model.batch_translate() function?

Thank you once again for your support and assistance.

@PranjalChitale
Collaborator

These were generated using joint_translate.sh, but batch_translate can also be modified to disable the regex-based preprocessing and sentence splitting.

@Sab8605
Author

Sab8605 commented Jul 31, 2024

Thank you for the prompt response.

I have also tested joint_translate.sh on several examples and noticed that the model occasionally inserts extra spaces within the numbers it generates. The issue is intermittent rather than consistent.

For example, in the translation from English to Hindi:
with an investment of 2,560 crores. --> 2, 560 करोड़ के निवेश के साथ।
an amount of 15,000 crores will be made available. --> 15, 000 करोड़ रुपये की राशि उपलब्ध कराई जाएगी।
Central assistance of 5,300 crore will be given. --> 5, 300 करोड़ रुपये की केंद्रीय सहायता दी जाएगी।
79,000 crores. --> 79, 000 करोड़ रु.
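Until the root cause is addressed, the stray spaces in the examples above could be removed in a post-processing step. The following is a rough sketch under stated assumptions: `SPACED_GROUP_RE` and `fix_number_spacing` are hypothetical names, and the rule assumes that a comma followed by whitespace and a group of exactly two or three digits always belongs to a single number. That assumption would wrongly merge a genuine list such as "1, 20".

```python
import re

# A comma plus whitespace, sitting between a digit and a group of
# exactly two or three digits (the group sizes used in Indian numbering).
SPACED_GROUP_RE = re.compile(r"(?<=\d),\s+(?=\d{2,3}(?!\d))")


def fix_number_spacing(text: str) -> str:
    """Remove whitespace after a comma inside a digit-grouped number."""
    return SPACED_GROUP_RE.sub(",", text)


print(fix_number_spacing("2, 560 करोड़ के निवेश के साथ।"))
# 2,560 करोड़ के निवेश के साथ।
```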
