Translations are incorrect when the source contains different number formats. #87

Sab8605 · 2024-07-29T08:35:38Z

I have set up the models and am using the En-to-Indic model for translation, following the README file.
I observed some issues with numbers.

Issues with Numerical Handling in English-to-Indic Translations:

Missing Numbers in Translation:
The model drops some numerical values from translations. For instance, for the sentence "A reduction of 20% from the existing liability of Rs. 1,87,500," the Hindi translation is: "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।" The amount 1,87,500 is missing from the output. Debugging shows that the model itself never generates the token; its raw output is: ['▁रुपये ▁की ▁मौजूदा ▁देन दारी ▁से ▁20 ▁प्रतिशत ▁की ▁कमी ▁।'].
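A dropped number like this can be caught automatically after translation. The following is a minimal sketch, not part of the project's pipeline: `missing_numbers` is a hypothetical helper that extracts ASCII-digit numbers from the source and reports those that do not appear verbatim in the translation. It assumes the translation keeps the source's digits and separators unchanged.

```python
import re

# Hypothetical helper, not part of the translation pipeline.
# Matches ASCII-digit numbers with optional comma/period groups, e.g. 1,87,500 or 910.74.
_NUMBER = re.compile(r"[0-9]+(?:[,.][0-9]+)*")

def missing_numbers(source, translation):
    """Return the numbers found in `source` that do not occur verbatim in `translation`."""
    return [num for num in _NUMBER.findall(source) if num not in translation]

src = "A reduction of 20% from the existing liability of Rs. 1,87,500,"
hyp = "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।"
print(missing_numbers(src, hyp))
```

The check is a plain substring test, so a short number such as 20 would also count as present if it only appears inside a longer one such as 2020.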

Incorrect Sentence Splitting with Numerical Values:
Sentences are split incorrectly after "Rs." when it is followed by a space and a number. For example, for the sentence "The company reported a revenue increase from Rs. 12,34,567 in 2019-20 to Rs. 56,78,910 in 2020-21," preprocessing produces: ['The company reported a revenue increase from Rs.', '12,34,567 in 2019-20 to Rs.', '56,78,910 in 2020-21.']. The Hindi output is then: 'कंपनी ने रुपये से राजस्व वृद्धि की सूचना दी। 12,34,567 से 2019-20 में रु। 2020-21 में 56,78,910।' The incorrect splitting degrades translation quality.
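One possible workaround is to re-join the fragments after splitting. The sketch below is an assumption, not how the pipeline works: `merge_currency_splits` is a hypothetical helper that glues a fragment ending in "Rs." onto the next fragment when that fragment starts with a digit.

```python
import re

# Hypothetical post-splitting fix-up, not part of the translation pipeline.
_ENDS_WITH_RS = re.compile(r"\bRs\.$")
_STARTS_WITH_DIGIT = re.compile(r"^[0-9]")

def merge_currency_splits(sentences):
    """Re-join fragments that a sentence splitter broke after 'Rs.' before a number."""
    merged = []
    for sent in sentences:
        sent = sent.strip()
        if merged and _ENDS_WITH_RS.search(merged[-1]) and _STARTS_WITH_DIGIT.match(sent):
            # The previous fragment ended on the abbreviation: same sentence.
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
    return merged

fragments = [
    "The company reported a revenue increase from Rs.",
    "12,34,567 in 2019-20 to Rs.",
    "56,78,910 in 2020-21.",
]
print(merge_currency_splits(fragments))
```

A sentence that genuinely ends in "Rs." and is followed by one starting with a digit would be merged too, so the rule is a heuristic.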

Regex Pattern Limitations:
The regex pattern defined for handling numbers does not correctly process certain number formats. For example, in the sentence "The company reported a revenue increase from 12,34,567.74 in 2019-20 to Rs. 56,78,910.74 in 2020-21," preprocessing yields: ['The company reported a revenue increase from 12,34 , in to Rs.', '56,78 , in .']. The resulting translation is: 'कंपनी ने राजस्व में वृद्धि दर्ज की जो 12,34,567.74 से 2019-20 में रु। 56, 78, 2020-21 में 910.74.'
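The pipeline's actual pattern is not shown in this thread, so the following is only a sketch of a broader alternative. It treats digit groups joined by commas, periods, or hyphens as one token, which keeps values such as 12,34,567.74 and 2019-20 intact. The `<NUM0>` placeholder format and the helper names are hypothetical.

```python
import re

# Assumed pattern, not the project's: digit groups joined by ',', '.' or '-'.
NUMBER = re.compile(r"[0-9]+(?:[.,-][0-9]+)*")

def mask_numbers(text):
    """Replace each number with a placeholder; return (masked_text, mapping)."""
    mapping = {}

    def _sub(match):
        key = f"<NUM{len(mapping)}>"  # hypothetical placeholder format
        mapping[key] = match.group(0)
        return key

    return NUMBER.sub(_sub, text), mapping

def unmask_numbers(text, mapping):
    """Put the original numbers back in place of their placeholders."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

sentence = ("The company reported a revenue increase from 12,34,567.74 "
            "in 2019-20 to Rs. 56,78,910.74 in 2020-21,")
masked, mapping = mask_numbers(sentence)
print(masked)
print(list(mapping.values()))
```

Restoring the numbers only works if the placeholders survive translation unchanged, which is not guaranteed for a model that was not trained to retain them.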

Extra Spaces in Numerical Values:
The model is generating extra spaces in numerical values. For instance, in the sentence "when the highest basic pay in the government was only Rs. 30,000 per month," the translation is: "जब सरकार में सबसे अधिक मूल वेतन केवल रु। 30, 000 प्रति माह।" The inclusion of an extra space in the number "30, 000" affects the translation quality.
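As a stopgap, the stray space can be removed in post-processing. This is a minimal sketch with a hypothetical helper, `collapse_number_spaces`, that deletes whitespace after a comma only when the comma sits between a digit and a group of exactly three digits, which is the shape of the cases reported here.

```python
import re

# Hypothetical post-processing step, not part of the translation pipeline.
# Matches ", " between a digit and a three-digit group, e.g. "30, 000".
_SPLIT_GROUP = re.compile(r"(?<=[0-9]),\s+(?=[0-9]{3}\b)")

def collapse_number_spaces(text):
    """Turn '30, 000' back into '30,000' without touching other commas."""
    return _SPLIT_GROUP.sub(",", text)

print(collapse_number_spaces("जब सरकार में सबसे अधिक मूल वेतन केवल रु। 30, 000 प्रति माह।"))
```

The rule cannot distinguish a split number from a real list such as "1, 200 and 3", and it does not repair two-digit Indian groups that were split (for example "12, 34,567"), so it should be applied with care.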

Thank you.

PranjalChitale · 2024-07-29T11:28:49Z

The inference pipeline is designed to be broad-spectrum, handling texts from a wide array of domains. However, it is not foolproof.

This regex-placeholder method is applied post-hoc as we found it effective in most cases through empirical testing.

Note that the models weren't specifically trained to retain these placeholders and you can go ahead and fine-tune the models to do so.

Sentence splitting is performed using the best open-source libraries available.

The regex pattern was developed by analyzing encountered cases and covers most general-purpose use cases.

If you have any recommendations for other libraries or improved regex patterns, please let us know.

Additionally, you can choose to bypass the inference pipeline when sentence splitting is not necessary (if you are confident about the sequence length).

Below are the results when using the Fairseq model without the inference pipeline.

मौजूदा 1,87,500 रुपये की देनदारी से 20% की कमी।
कंपनी का राजस्व 2019-20 में 12,34,567 रुपये से बढ़कर 2020-21 में 56,78,910 रुपये हो गया।
जब सरकार में सबसे अधिक मूल वेतन केवल 30,000 रुपये प्रति माह था।

Sab8605 · 2024-07-29T12:16:01Z

Thank you so much for your detailed response and for explaining the current approach and its limitations.

Could you kindly guide me on how these results were generated? Were they produced using the model.batch_translate() function?

Thank you once again for your support and assistance.

PranjalChitale · 2024-07-29T12:37:15Z

These were generated using joint_translate.sh, but batch_translate can also be modified to disable the regex-based preprocessing and sentence splitting.

Sab8605 · 2024-07-31T07:18:41Z

Thank you for the prompt response.

I have also tested joint_translate.sh on several examples and noticed that the model occasionally inserts extra spaces within the numbers it generates. The issue is intermittent.

For example, in the translation from English to Hindi:
with an investment of 2,560 crores. --> 2, 560 करोड़ के निवेश के साथ।
an amount of 15,000 crores will be made available. --> 15, 000 करोड़ रुपये की राशि उपलब्ध कराई जाएगी।
Central assistance of 5,300 crore will be given. --> 5, 300 करोड़ रुपये की केंद्रीय सहायता दी जाएगी।
79,000 crores. --> 79, 000 करोड़ रु.

PranjalChitale closed this as completed Aug 7, 2024
