Handle Asian scripts better #8

Closed
thethomaseffect opened this issue Jul 19, 2014 · 5 comments · Fixed by #9

Comments

@thethomaseffect

I'm using unfluff as an easy way to grab the first few paragraphs of Wikipedia articles to describe media. When I print the text returned for https://en.wikipedia.org/wiki/Now_and_Then,_Here_and_There I get:

Now and Then, Here and There (

Now and Then, Here and There follows a young boy named Shuzo "Shu" Matsutani who, in an attempt to save an unknown girl, is transported to another world which is possibly the Earth in the far future. The world is desolate and militarized, and water is a scarce commodity.

whereas the actual article begins:

Now and Then, Here and There (今、そこにいる僕 Ima, Soko ni Iru Boku?) is a thirteen episode anime series directed by Akitaro Daichi and written by Hideyuki Kurata. The story was originally conceived by director Daichi. It premiered in Japan on the WOWOW television station on October 14, 1999 and ran until January 20, 2000. It was licensed for Region 1 DVD English language release by Central Park Media under the US Manga Corps banner. Following the 2009 bankruptcy and liquidation of Central Park Media, ADV Films picked up the series for a release on July 7, 2009.[1] As of Sept. 1, 2009, the series is licensed by ADV's successor, AEsir Holdings, with distribution from Section23 Films.[2]

Now and Then, Here and There follows a young boy named Shuzo "Shu" Matsutani who, in an attempt to save an unknown girl, is transported to another world which is possibly the Earth in the far future. The world is desolate and militarized, and water is a scarce commodity.

The problem is almost certainly with the 今 character. I understand that Asian text is known not to work very well. However, in this instance I'm losing a massive portion of English text. A simple fix for now would be to just remove the offending character from the output or replace it with the Unicode replacement character (U+FFFD).
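
To illustrate the kind of stopgap I mean, here is a minimal sketch of a post-processing step that swaps Japanese/CJK characters for U+FFFD (or drops them). The helper name `replaceCjk` and the chosen Unicode ranges are my own assumptions for illustration; this is not how unfluff works internally, and the ranges are not an exhaustive list of Asian scripts.

```javascript
// Hypothetical helper, not part of unfluff.
// Assumed ranges: CJK punctuation (U+3000-303F), Hiragana and Katakana
// (U+3040-30FF), CJK Extension A (U+3400-4DBF), CJK Unified Ideographs
// (U+4E00-9FFF).
const CJK_PATTERN = /[\u3000-\u303f\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/g;

// Replace every matched character with `replacement`
// (defaults to the Unicode replacement character U+FFFD).
function replaceCjk(text, replacement = '\uFFFD') {
  return text.replace(CJK_PATTERN, replacement);
}

// Usage: pass '' as the second argument to strip the characters instead.
const sample = 'Now and Then, Here and There (今、そこにいる僕 Ima, Soko ni Iru Boku?)';
console.log(replaceCjk(sample));
console.log(replaceCjk(sample, ''));
```

Either variant keeps the surrounding English text intact, which is the part I actually care about.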

@ageitgey
Owner

Thanks for the detailed report! I'll check it out.

@ageitgey
Owner

I have a fix for this almost ready. I should have it posted soon.

@thethomaseffect
Author

Awesome, thank you for the speedy response!

ageitgey added a commit that referenced this issue Jul 22, 2014
Fix #8 - text getting dropped in wikipedia articles
@ageitgey
Owner

Give v0.4.0 a shot. It won't fix the Japanese characters, but it should fix the missing text.

New output:

$ curl -s "https://en.wikipedia.org/wiki/Now_and_Then,_Here_and_There" | unfluff | jq -r .text
Now and Then, Here and There (

) is a thirteen episode anime series directed by Akitaro Daichi and written by Hideyuki Kurata. The story was originally conceived by director Daichi. It premiered in Japan on the WOWOW television station on October 14, 1999 and ran until January 20, 2000. It was licensed for Region 1 DVD English language release by Central Park Media under the US Manga Corps banner. Following the 2009 bankruptcy and liquidation of Central Park Media, ADV Films picked up the series for a release on July 7, 2009. As of Sept. 1, 2009, the series is licensed by ADV's successor, AEsir Holdings, with distribution from Section23 Films.

Now and Then, Here and There follows a young boy named Shuzo "Shu" Matsutani who, in an attempt to save an unknown girl, is transported to another world which is possibly the Earth in the far future. The world is desolate and militarized, and water is a scarce commodity.

While walking home from a somewhat bad, but regular day of school, "Shu", the main protagonist, spots a girl on top of a smoke stack in an industrial park where he used to hang out as a young child. Shuzo tries numerous attempts to communicate with the young girl but she acts emotionless and quiet, and hardly acknowledges his presence. After decoding her name from her lips (Lala-Ru) the only other piece of information he finds out about her is her love of watching sunsets. There is a sudden explosion and time stops; Shu finds himself defending Lala-Ru from abductors in mechanized snakes. After attempting to defend the girl, he is caught in a transportation to the world from which the strangers hail, a wasteland devoid of water and dominated by a red giant star. Lala-Ru possesses a pendant containing a vast reservoir of water, and has the ability to control that water.

Shu is trapped in this new, harsh reality, and he is beaten and interrogated repeatedly inside the warship commanded by the ruthless, manic dictator, Hamdo. While locked in a cell he meets an abducted girl who introduces herself as Sara Ringwalt of America. Sara's reason for her capture was being mistaken for Lala-Ru by Hamdo's minions. Sara goes through extremely horrific experiences and eventually becomes emotionally scarred. After an assault by an unknown enemy landship, Shu is forced to join an army of child soldiers; children trained to for the looting of villages, in which they kidnap female villagers for breeding, and conscript orphaned male children into the ever dwindling ranks of Hamdo's army.

From the start, the series may seem lighthearted in nature, but this is far from the truth. Much of the series deals with serious moral issues relating to war, the consequences of war, slavery, and the exploitation of children.

@thethomaseffect
Author

Great, thanks! Is there anything you can do about the two newline characters added after the opening parenthesis? I've worked around it easily enough by making one of my regexes match across multiple lines, so it's not a huge priority.
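
For anyone hitting the same thing, a workaround along these lines can be sketched as follows. The helper name `dropEmptyParens` is hypothetical, and this version simply removes a parenthetical that is left with nothing but whitespace inside it, which may not suit every use case.

```javascript
// Hypothetical cleanup helper, not part of unfluff.
// In JavaScript, \s also matches newlines, so "(\n\n)" counts as an
// empty parenthetical. The leading \s* removes the space before "(".
function dropEmptyParens(text) {
  return text.replace(/\s*\(\s*\)/g, '');
}

// Usage with the shape of the v0.4.0 output shown above.
const extracted = 'Now and Then, Here and There (\n\n) is a thirteen episode anime series';
console.log(dropEmptyParens(extracted));
```

Parentheses that still contain text, such as `f(x)`, are left alone because the pattern only allows whitespace between the brackets.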
