Handle Asian scripts better #8
Comments
Thanks for the detailed report! I'll check it out.
I have a fix for this almost ready. I should have it posted soon.
Awesome, thank you for the speedy response!
Fix #8 - text getting dropped in wikipedia articles
Give v0.4.0 a shot. It won't fix the Japanese characters, but it should fix the missing text. New output:
Great, thanks! Anything you can do about the two newline characters added after
I'm using unfluff as an easy way to grab the first few paragraphs of Wikipedia articles to describe media. When I print the text returned from https://en.wikipedia.org/wiki/Now_and_Then,_Here_and_There I get:
At the start where the actual article gives:
The problem is almost certainly the 今 character. I understand that Asian text is already known not to work very well, but in this instance I'm losing a massive portion of the English text. A simple fix for now would be to remove the offending character from the output, or to replace it with the Unicode replacement character (U+FFFD).
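To make the suggested stopgap concrete, here is a minimal sketch of what "remove or replace the offending character" could look like as a post-processing step on the caller's side. It is not unfluff's code and not the fix that shipped in v0.4.0; the function name `maskCjk` and the choice of scripts (Han, Hiragana, Katakana) are assumptions made for illustration. It relies on Unicode property escapes in regular expressions, which need a reasonably recent Node.js.

```javascript
// Hypothetical helper, not part of unfluff: masks runs of Japanese/Chinese
// script characters in a string that has already been extracted.
// Assumption: only Han, Hiragana and Katakana are treated as "offending".
const CJK_RUN = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]+/gu;

// Replace each run of matching characters with `replacement`.
// The default is U+FFFD, the Unicode replacement character.
// Pass an empty string to drop the characters instead.
function maskCjk(text, replacement = "\uFFFD") {
  return text.replace(CJK_RUN, replacement);
}

// Example call with an illustrative string (not actual unfluff output).
const sample = "Now and Then, Here and There (今) is an anime series.";
console.log(maskCjk(sample));
console.log(maskCjk(sample, ""));
```

This only hides the characters after extraction. It would not bring back English text that the extractor has already dropped, which is the more serious problem described above.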