korean encoding issue #10

Closed
mrchypark opened this issue Jan 18, 2018 · 8 comments

Comments

@mrchypark

#9

new start.

@jwijffels
Contributor

jwijffels commented Jan 19, 2018

I don't know of a clear solution if you really want to incorporate the code you provided inside the function. But the docs of udpipe_annotate are pretty clear on the input needed: x should be a character vector in UTF-8 encoding.
If x is not in UTF-8 encoding, you need to convert it to UTF-8 yourself before passing it in. This can be done with iconv, as in iconv(x, from = "CP949", to = "UTF-8"), where the list of available encodings is given by iconvlist(). For example:

Encoding("Je n'aime pas ça")
[1] "latin1"
Encoding(iconv("Je n'aime pas ça", from = "latin1", to = "UTF-8"))
[1] "UTF-8"

But the default encoding of text you type into R depends on your locale. Mine is as follows:

Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
localeToCharset()
[1] "ISO8859-1"

So in my case I would need to do:

library(udpipe)
ud_model <- udpipe_download_model("french")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = iconv("Je n'aime pas ça", from = "latin1", to = "UTF-8"))
as.data.frame(x)
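
The conversion step above can be wrapped in a small helper. This is only a sketch: `as_utf8` is a hypothetical name, not part of udpipe. It assumes that when no source encoding is given, the declared encoding of each string can be trusted, which is what base R's enc2utf8 relies on (unmarked non-ASCII strings are treated as being in the native encoding of the session).

```r
# Hypothetical helper (not part of udpipe): return x converted to UTF-8.
# - from = NULL: rely on the declared encoding via enc2utf8()
# - from = "CP949", "latin1", ...: convert explicitly with iconv()
as_utf8 <- function(x, from = NULL) {
  stopifnot(is.character(x))
  out <- if (is.null(from)) {
    enc2utf8(x)
  } else {
    iconv(x, from = from, to = "UTF-8")
  }
  # iconv() returns NA for elements it could not convert
  failed <- is.na(out) & !is.na(x)
  if (any(failed)) {
    stop(sprintf("%d element(s) could not be converted to UTF-8", sum(failed)))
  }
  out
}

# Build a latin1-marked string without depending on the session locale
txt <- iconv("Je n'aime pas \u00e7a", from = "UTF-8", to = "latin1")
Encoding(txt)
Encoding(as_utf8(txt))
Encoding(as_utf8(txt, from = "latin1"))
```

The result could then be passed on as udpipe_annotate(ud_model, x = as_utf8(txt)). Giving `from` explicitly is the safer choice whenever the text comes from a file or another machine, since the declared encoding may not match the actual bytes.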

Testing inside udpipe_annotate whether x is in UTF-8 runs into the following complications:

  • Pure ASCII text always has encoding "unknown", so "unknown" can simply indicate ASCII, which is fine
  • You can never be sure that text data on your machine is in the default locale of your machine, as shown below

As a result, I'm reluctant to do any fixes inside the udpipe_annotate function. The user just needs to make sure his input is in UTF-8.

> ## ASCII is always Encoding unknown
> Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> x <- "I drink milk in the morning"
> Encoding(x)
[1] "unknown"
> Encoding(iconv(x, to = "UTF-8"))
[1] "unknown"
> Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
> Encoding(x)
[1] "UTF-8"
> out <- iconv(x, from = "UTF-8", to = "CP949")
> iconv(out, from = "CP949", to = "UTF-8")
[1] "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
> result <- iconv(out, to = "UTF-8")
> Sys.setlocale("LC_ALL", locale = "Korean_Korea.949")
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
> result
[1] "¾È³çÇϼ¼¿ä. Àú´Â ¹ÚÂù¿±ÀÔ´Ï´Ù. ÇѱÛÀÇ ÀÎÄÚµù ¹®Á¦¸¦ ÀçÇöÇÏ·Á°í ÇÕ´Ï´Ù."
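
In the transcript above, `result <- iconv(out, to = "UTF-8")` has no `from` argument, so iconv assumes the bytes are in the current locale's encoding (Windows-1252 at that point) although they are really CP949. The same effect can be sketched without switching locales by passing the wrong `from` explicitly. This assumes the iconv build on your machine knows both "CP949" and "CP1252"; the Korean greeting is written with \u escapes so the code does not depend on how the script file itself is encoded.

```r
# "안녕하세요" written with Unicode escapes, so x is marked as UTF-8
x <- "\uc548\ub155\ud558\uc138\uc694"

# Re-encode to CP949, as text read from a Korean Windows file would be
out <- iconv(x, from = "UTF-8", to = "CP949")

# Correct source encoding: the round trip restores the original text
right <- iconv(out, from = "CP949", to = "UTF-8")

# Wrong source encoding: every CP949 byte is read as a Windows-1252 character
wrong <- iconv(out, from = "CP1252", to = "UTF-8")

identical(right, x)   # round trip with the correct encoding
identical(wrong, x)   # round trip with the wrong encoding
Encoding(wrong)       # still declared as UTF-8
validUTF8(wrong)      # and the bytes are valid UTF-8
```

Note that the mis-converted string is declared as UTF-8 and consists of valid UTF-8 bytes, so neither Encoding() nor validUTF8() can tell it apart from correctly converted text.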

@dselivanov
Contributor

Absolutely agree with @jwijffels

@mrchypark
Author

I agree that "the user just needs to make sure his input is in UTF-8".

How about adding a warning message if Encoding(x) != "UTF-8"?

@dselivanov
Contributor

dselivanov commented Jan 19, 2018 via email

@dselivanov
Contributor

dselivanov commented Jan 19, 2018

I mean it would be nice to have such a check/warning, but it seems it would be tricky to implement.

@mrchypark
Author

OK, that was just my opinion. Thank you both for the support and discussion. Is it OK to close the issue?

@jwijffels
Contributor

Completely agree that it would be nice to have such a check/warning, but due to the two points I enumerated above, I don't know of any valid way to implement this.
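
For readers who still want a guard in their own code, here is a minimal sketch of a partial check that sidesteps the "unknown" problem for ASCII. It is not part of udpipe and `warn_if_not_utf8` is a hypothetical name. It uses base R's validUTF8(), which looks at the bytes of each string and ignores the declared encoding, so pure ASCII passes. It cannot detect text that was already converted from the wrong source encoding, which is exactly the second point above.

```r
# Hypothetical user-side guard (not part of udpipe): warn when some elements
# of x are not valid UTF-8 at the byte level. NA elements are skipped.
warn_if_not_utf8 <- function(x) {
  stopifnot(is.character(x))
  bad <- !is.na(x) & !validUTF8(x)
  if (any(bad)) {
    warning(
      sprintf("%d element(s) of x are not valid UTF-8; convert them with iconv() first",
              sum(bad)),
      call. = FALSE
    )
  }
  # Return, invisibly, which elements passed the check
  invisible(!bad)
}

# ASCII and UTF-8 input pass silently
warn_if_not_utf8(c("I drink milk in the morning", "Je n'aime pas \u00e7a"))

# A latin1-encoded string triggers the warning
latin1_txt <- iconv("Je n'aime pas \u00e7a", from = "UTF-8", to = "latin1")
res <- withCallingHandlers(
  warn_if_not_utf8(latin1_txt),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

A check like this only reduces one class of mistakes; the requirement that the caller knows the real source encoding stays the same.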

@jwijffels
Contributor

Closing this. Feel free to re-open if needed.
