korean encoding issue #9

Closed
wants to merge 1 commit into from

mrchypark commented Jan 17, 2018

When I tried to annotate Korean text, the encoding of the result was broken.
I fixed this by adding the code below.
I tested it on Windows and Ubuntu 16.04.

windows

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] udpipe_0.3 RevoUtils_10.0.6 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 Matrix_1.2-11 tools_3.4.2 yaml_2.1.14
[5] Rcpp_0.12.13 grid_3.4.2 data.table_1.10.4-2 lattice_0.20-35

ubuntu

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] udpipe_0.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 lattice_0.20-35 digest_0.6.13 withr_2.1.1
[5] grid_3.4.3 R6_2.2.2 git2r_0.20.0 httr_1.3.1
[9] curl_3.0 data.table_1.10.4-3 Matrix_1.2-12 devtools_1.13.4
[13] tools_3.4.3 yaml_2.1.16 compiler_3.4.3 memoise_1.1.0
[17] knitr_1.17

@jwijffels
Contributor

jwijffels commented Jan 17, 2018

It does not make sense to add this in the function.
Make sure x is in UTF-8 encoding, as the doc indicates. Closing.

@jwijffels jwijffels closed this Jan 17, 2018
@mrchypark
Author

@jwijffels
Contributor

jwijffels commented Jan 17, 2018

Yes, that's correct: you need to make sure x is in UTF-8 encoding, as the doc of udpipe_annotate indicates. So the second example is how you should do it.
Let me show the output of your first example on my computer. If I type it into my console, the string is already UTF-8, which is what udpipe_annotate asks for. If you have data in another encoding, you just need to convert it to UTF-8 before giving it to udpipe_annotate, as you showed.
For this reason, incorporating the pull request would give errors on computers whose default locale differs from yours, as shown below.

> x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다." 
> Encoding(x) 
[1] "UTF-8" 
> iconv(x, to = "UTF-8") 
[1] "안녕하세요. 저는 박찬엽입니다. 한글ì\u009d˜ ì\u009d¸ì½”딩 문제를 재현하려고 합니다."
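The broken output above is what double encoding looks like: when `from` is omitted, `iconv()` treats the bytes as being in the native locale encoding, even if the string is already UTF-8. A minimal sketch of the effect and of one possible alternative follows. It assumes nothing about udpipe itself; passing `from = "latin1"` only simulates a Western European locale, and `enc2utf8()` is base R, not something the pull request proposed.

```r
# A Korean character written as a Unicode escape, so the example does not
# depend on how this file is encoded. R marks such strings as UTF-8.
x <- "\uc758"
Encoding(x)

# Simulate iconv(x, to = "UTF-8") on a machine whose native encoding is
# Latin-1-like: the three UTF-8 bytes are each re-read as one Latin-1
# character and encoded again.
double_encoded <- iconv(x, from = "latin1", to = "UTF-8")
nchar(x)
nchar(double_encoded)

# enc2utf8() respects the declared encoding: strings already marked UTF-8
# are left alone, and unmarked strings are converted from the native encoding.
safe <- enc2utf8(x)
identical(safe, x)
```

Because `enc2utf8()` only converts strings that are not already marked as UTF-8, it behaves the same way regardless of which locale the caller is in, which is the property the unconditional `iconv()` call lacks.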

 

@mrchypark
Author

@jwijffels Then how about checking whether Encoding(x) != "UTF-8" and, if so, printing a warning message such as "Make sure Encoding(x) is UTF-8. If not, try x <- iconv(x, to = "UTF-8") first."?

@mrchypark
Author

@jwijffels Anyway, can you show me your sessionInfo()? I tried assigning the text on Windows 10, Ubuntu 16.04, and macOS 10.13.2, and on every OS Encoding(x) returns "unknown".

@jwijffels
Contributor

Checking for 'unknown' encoding is not a good solution: ASCII text always gives encoding 'unknown', so that would generate warnings for every call in all European languages, even on CRAN.
If you want to reproduce my environment, my locale is Dutch_Netherlands.1252:

Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"

Change the locale as follows:

Sys.setlocale("LC_ALL", locale = "Korean_Korea.949")
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
localeToCharset()
[1] "CP949"
x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
Encoding(x)
[1] "unknown"
Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
localeToCharset()
[1] "ISO8859-1"
x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
Encoding(x)
[1] "UTF-8"

I think we should move this type of discussion to Issues, as the pull request would give errors on all European Windows machines.
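To illustrate the point about ASCII strings and the 'unknown' mark, here is a hedged sketch of a check that looks at the bytes instead of the encoding mark. The helper name `warn_if_not_utf8` is hypothetical and is not part of udpipe; it relies on base R's `validUTF8()`, which reports whether the bytes form valid UTF-8 regardless of how the string is marked.

```r
# ASCII-only strings are never marked, so a test on Encoding(x) != "UTF-8"
# would fire for them even though ASCII is valid UTF-8.
Encoding("hello world")  # "unknown"

# Hypothetical helper (not in udpipe): warn only when the bytes themselves
# are not valid UTF-8.
warn_if_not_utf8 <- function(x) {
  ok <- validUTF8(x)
  if (!all(ok)) {
    warning("Some elements of x are not valid UTF-8; ",
            "convert them from their source encoding first.")
  }
  invisible(all(ok))
}

# Bytes of "cafe" with a Latin-1 e-acute (0xe9) at the end: not valid UTF-8.
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

warn_if_not_utf8("hello world")                   # no warning
warn_if_not_utf8("\uc548\ub155")                  # no warning
suppressWarnings(warn_if_not_utf8(latin1_bytes))  # would warn
```

This kind of check cannot tell which encoding the invalid input was in, so it can only report the problem; the caller still has to supply the right `from` encoding when converting.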

mrchypark mentioned this pull request Jan 18, 2018