korean encoding issue #9

mrchypark · 2018-01-17T12:40:27Z

When I tried to annotate Korean text, the text encoding of the result was broken.
I fixed it by adding the code below.
I checked on Windows and Ubuntu 16.04.

windows

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] udpipe_0.3 RevoUtils_10.0.6 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 Matrix_1.2-11 tools_3.4.2 yaml_2.1.14
[5] Rcpp_0.12.13 grid_3.4.2 data.table_1.10.4-2 lattice_0.20-35

ubuntu

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] udpipe_0.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 lattice_0.20-35 digest_0.6.13 withr_2.1.1
[5] grid_3.4.3 R6_2.2.2 git2r_0.20.0 httr_1.3.1
[9] curl_3.0 data.table_1.10.4-3 Matrix_1.2-12 devtools_1.13.4
[13] tools_3.4.3 yaml_2.1.16 compiler_3.4.3 memoise_1.1.0
[17] knitr_1.17

jwijffels · 2018-01-17T13:48:51Z

It does not make sense to add this in the function.
Make sure x is in UTF-8 encoding, as the doc indicates. Closing.
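As an illustration of putting CP949 data into UTF-8 before annotation, here is a minimal base R sketch. It assumes the input bytes really are CP949 and that the platform's iconv knows the name "CP949"; the variable names are only for illustration, and the call to udpipe_annotate itself is left out.

```r
# Bytes of "안녕" as they would arrive from a CP949 (Korean Windows) source.
cp949_text <- rawToChar(as.raw(c(0xBE, 0xC8, 0xB3, 0xE7)))
Encoding(cp949_text)  # no declared encoding, R only sees raw bytes

# Name the source encoding explicitly instead of relying on the session locale.
utf8_text <- iconv(cp949_text, from = "CP949", to = "UTF-8")
Encoding(utf8_text)

# The result matches the same text written with Unicode escapes.
identical(utf8_text, "\uc548\ub155")
```

Stating `from` explicitly keeps the conversion independent of whichever locale the script happens to run in.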

mrchypark · 2018-01-17T20:44:19Z

https://mrchypark.github.io/udpipe_korean_error/

jwijffels · 2018-01-17T20:58:39Z

Yes, that's correct, you need to make sure x is in UTF-8 encoding, that's what the doc of udpipe_annotate indicates. So the second example is how you should do it.
Let me show the output of your first example on my computer. If I type this into my console, it is immediately UTF-8, which is what udpipe_annotate asks me to provide. If you have data in another encoding, you just need to make sure that you convert it to UTF-8 before giving it to udpipe_annotate, as you showed.
For this reason, incorporating the pull request would give errors, as shown below, on other computers where the default locale is something other than yours.

> x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다." 
> Encoding(x) 
[1] "UTF-8" 
> iconv(x, to = "UTF-8") 
[1] "ì•ˆë…•í•˜ì„¸ìš”. ì €ëŠ” ë°•ì°¬ì—½ìž…ë‹ˆë‹¤. í•œê¸€ì\u009d˜ ì\u009d¸ì½”ë”© ë¬¸ì œë¥¼ ìž¬í˜„í•˜ë ¤ê³  í•©ë‹ˆë‹¤."
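The garbling above can be reproduced on any machine with a small sketch. It assumes that `from = "latin1"` is a fair stand-in for the implicit native encoding of a Western European Windows session, and it uses Unicode escapes so the literal is stored as UTF-8 regardless of locale. `enc2utf8()` is base R and is shown here only as one possible alternative, not as what the package does.

```r
# "안녕" written with Unicode escapes, so R stores and marks it as UTF-8.
x <- "\uc548\ub155"
Encoding(x)

# iconv() reinterprets the bytes according to `from`; treating UTF-8 bytes as
# Latin-1 encodes every byte a second time.
garbled <- iconv(x, from = "latin1", to = "UTF-8")
nchar(x, type = "chars")        # 2 characters
nchar(garbled, type = "chars")  # 6 characters, one per original byte

# enc2utf8() takes the declared encoding into account and leaves x untouched.
safe <- enc2utf8(x)
identical(safe, x)
```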

mrchypark · 2018-01-17T23:58:32Z

@jwijffels Then how about checking whether Encoding(x) != "UTF-8" and printing a warning message such as "Make sure Encoding(x) is UTF-8. If not, try x <- iconv(x, to = "UTF-8") first."?

mrchypark · 2018-01-18T01:10:52Z

@jwijffels Anyway, can you show me your sessionInfo()? I tried assigning the text on Windows 10, Ubuntu 16.04, and macOS 10.13.2, and on all of them Encoding(x) returns "unknown".

jwijffels · 2018-01-18T09:25:50Z

Checking for 'unknown' encoding is not a good solution: ASCII text always has encoding 'unknown', so the check would generate warnings for every call in all European languages, even on CRAN.
To reproduce my environment, note that my locale is Dutch_Netherlands.1252:

Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"

Change the locale as follows:

Sys.setlocale("LC_ALL", locale = "Korean_Korea.949")
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
localeToCharset()
[1] "CP949"
x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
Encoding(x)
[1] "unknown"
Sys.setlocale("LC_ALL", locale = "Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
localeToCharset()
[1] "ISO8859-1"
x <- "안녕하세요. 저는 박찬엽입니다. 한글의 인코딩 문제를 재현하려고 합니다."
Encoding(x)
[1] "UTF-8"
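Since the declared encoding depends on the session locale and plain ASCII always reports "unknown", a check on the bytes themselves may be more robust than a check on Encoding(x). Below is a rough sketch using base R's `validUTF8()`; the helper `non_utf8()` is hypothetical and not part of udpipe.

```r
ascii  <- "Hello world"
korean <- "\uc548\ub155"                                 # "안녕", marked UTF-8
cp949  <- rawToChar(as.raw(c(0xBE, 0xC8, 0xB3, 0xE7)))   # "안녕" as CP949 bytes

# The declared encoding is a poor test: valid ASCII input reports "unknown".
Encoding(c(ascii, korean))

# Hypothetical helper: positions of elements that are not valid UTF-8.
non_utf8 <- function(x) which(!validUTF8(x))

validUTF8(c(ascii, korean, cp949))
non_utf8(c(ascii, korean, cp949))
```

A byte-level check like this cannot tell which encoding the invalid elements are in, so the caller still has to supply the source encoding when converting.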

I think we should move this type of discussion to Issues, as the pull request will give errors on all European Windows machines.

korean encoding issue done

0d5eff8

jwijffels closed this Jan 17, 2018

mrchypark mentioned this pull request Jan 18, 2018

korean encoding issue #10

Closed
