Translations are incorrect when the source contains numbers in different formats. #87
Comments
The inference pipeline is designed to be broad-spectrum, handling texts from a wide array of domains. However, it is not foolproof. This regex-placeholder method is applied post-hoc as we found it effective in most cases through empirical testing. Note that the models weren't specifically trained to retain these placeholders and you can go ahead and fine-tune the models to do so. Sentence splitting is performed using the best open-source libraries available. The regex pattern was developed by analyzing encountered cases and covers most general-purpose use cases. If you have any recommendations for other libraries or improved regex patterns, please let us know. Additionally, you can choose to bypass the inference pipeline when sentence splitting is not necessary (if you are confident about the sequence length). Below are the results when using the Fairseq model without the inference pipeline.
Thank you so much for your detailed response and for explaining the current approach and its limitations. Could you kindly guide me on how these results were generated? Were they produced using the model.batch_translate() function? Thank you once again for your support and assistance.
These are using
Thank you for the prompt response. I have also tested joint_translate.sh on several examples and noticed that it occasionally inserts extra spaces within numbers generated by the model. The issue is intermittent rather than consistent. For example, in the translation from English to Hindi:
I have set up the models and am using the En-to-Indic model for translation, following the README file.
I observed some issues with numbers.
Issues with Numerical Handling in English-to-Indic Translations:
Missing Numbers in Translation:
The model fails to correctly output numerical values in translations. For instance, in the sentence "A reduction of 20% from the existing liability of Rs. 1,87,500," the Hindi translation is: "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।" The numerical value is missing in the output. Upon debugging, it was found that the model itself does not return the token, resulting in a translated output: ['▁रुपये ▁की ▁मौजूदा ▁देन दारी ▁से ▁20 ▁प्रतिशत ▁की ▁कमी ▁।'].
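A quick way to flag this kind of omission automatically is to compare the numbers in the source with those in the translation. The snippet below is only a minimal sketch: `missing_numbers` is a hypothetical helper, not part of the project, and it assumes the translation keeps Western digits, as in the example above.

```python
import re

# Digits optionally grouped by commas or a decimal point, e.g. 20, 1,87,500, 910.74
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def missing_numbers(source, translation):
    """Return the numbers found in `source` that do not appear in `translation`."""
    found = set(NUMBER.findall(translation))
    return [n for n in NUMBER.findall(source) if n not in found]

source = "A reduction of 20% from the existing liability of Rs. 1,87,500"
translation = "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।"
print(missing_numbers(source, translation))
```

A non-empty result could be used to mark a sentence for manual review or re-translation.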
Incorrect Sentence Splitting with Numerical Values:
Sentences are split incorrectly around numerical values when "Rs." is followed by a space. For example, in the sentence "The company reported a revenue increase from Rs. 12,34,567 in 2019-20 to Rs. 56,78,910 in 2020-21," preprocessing results in: ['The company reported a revenue increase from Rs.', '12,34,567 in 2019-20 to Rs.', '56,78,910 in 2020-21.']. This results in the Hindi output: 'कंपनी ने रुपये से राजस्व वृद्धि की सूचना दी। 12,34,567 से 2019-20 में रु। 2020-21 में 56,78,910।' This incorrect splitting affects the translation quality.
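One possible workaround is to keep known abbreviations from ending a sentence. The following is a rough sketch of that idea, not the splitter the pipeline actually uses; `ABBREVIATIONS` and `split_sentences` are hypothetical names, and the abbreviation list is only illustrative.

```python
import re

# Hypothetical, illustrative list of abbreviations that should not end a sentence.
ABBREVIATIONS = ("Rs.", "No.", "Dr.", "Mr.")

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace,
    except when the text so far ends with a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # e.g. "Rs. 12,34,567" stays in one sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("The company reported a revenue increase from Rs. 12,34,567 "
        "in 2019-20 to Rs. 56,78,910 in 2020-21.")
print(split_sentences(text))
```

With this guard the example sentence comes back as a single unit, while ordinary sentence boundaries are still split.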
Regex Pattern Limitations:
The regex pattern defined for handling numbers does not correctly process certain number formats. For example, in the sentence "The company reported a revenue increase from 12,34,567.74 in 2019-20 to Rs. 56,78,910.74 in 2020-21," preprocessing yields: ['The company reported a revenue increase from 12,34 , in to Rs.', '56,78 , in .']. The resulting translation is: 'कंपनी ने राजस्व में वृद्धि दर्ज की जो 12,34,567.74 से 2019-20 में रु। 56, 78, 2020-21 में 910.74.'
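For illustration, a pattern that treats comma groups, a decimal part, and hyphenated ranges as one token might look like the sketch below. This is an assumption about what a fix could look like, not the project's actual regex; the `<ID1>`-style placeholder format and the `mask_numbers` / `unmask_numbers` helpers are hypothetical.

```python
import re

# Digits, optional comma/decimal groups, optional hyphenated ranges such as 2019-20.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*(?:-\d+(?:[.,]\d+)*)*")

def mask_numbers(text):
    """Replace each number with a placeholder and return (masked_text, mapping)."""
    mapping = {}

    def repl(match):
        key = f"<ID{len(mapping) + 1}>"
        mapping[key] = match.group(0)
        return key

    return NUMBER.sub(repl, text), mapping

def unmask_numbers(text, mapping):
    """Put the original numbers back in place of their placeholders."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

src = ("The company reported a revenue increase from 12,34,567.74 "
       "in 2019-20 to Rs. 56,78,910.74 in 2020-21")
masked, mapping = mask_numbers(src)
print(masked)
print(unmask_numbers(masked, mapping) == src)
```

Because the decimal part is matched together with the comma groups, "12,34,567.74" becomes a single placeholder instead of being broken apart.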
Extra Spaces in Numerical Values:
The model is generating extra spaces in numerical values. For instance, in the sentence "when the highest basic pay in the government was only Rs. 30,000 per month," the translation is: "जब सरकार में सबसे अधिक मूल वेतन केवल रु। 30, 000 प्रति माह।" The inclusion of an extra space in the number "30, 000" affects the translation quality.
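As a stopgap, the stray spaces could be removed in post-processing by using the source sentence as a reference. The sketch below is one possible approach under two assumptions: the translation keeps Western digits, and `repair_split_numbers` is a hypothetical helper rather than anything provided by the project.

```python
import re

def repair_split_numbers(source, translation):
    """For every comma/decimal-grouped number in `source`, collapse any
    whitespace the translation inserted after its separators."""
    for number in set(re.findall(r"\d+(?:[.,]\d+)+", source)):
        parts = re.split(r"([.,])", number)  # e.g. ['30', ',', '000']
        loose = "".join(
            re.escape(p) + r"\s*" if p in (".", ",") else p for p in parts
        )
        # Lookarounds keep the match from starting or ending inside a longer number.
        pattern = r"(?<!\d)" + loose + r"(?!\d)"
        translation = re.sub(pattern, lambda m, n=number: n, translation)
    return translation

source = "when the highest basic pay in the government was only Rs. 30,000 per month"
translation = "जब सरकार में सबसे अधिक मूल वेतन केवल रु। 30, 000 प्रति माह।"
print(repair_split_numbers(source, translation))
```

Only numbers that appear in the source are touched, so a genuine list such as "2019, 2020" in the output is left alone unless the source contains "2019,2020".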
Thank you.