[SPARK-30645][SPARKR][TESTS][WINDOWS] Move Unicode test data to external file #27362

zero323 · 2020-01-26T01:07:46Z

What changes were proposed in this pull request?

Reference data for the "collect() support Unicode characters" test has been moved to an external file, to make the test independent of OS and locale.
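A minimal sketch of the idea, not the actual patch: the reference lines live in a UTF-8 file and are read back with an explicit encoding, so nothing depends on how the R source file itself is decoded. The file path, the variable names, and the fixture-writing step are assumptions made only to keep the example self-contained; in the real test the file would be checked in rather than generated.

```r
# Hypothetical stand-in for a checked-in data file (path and name are assumptions).
data_path <- tempfile(pattern = "unicode-reference", fileext = ".json")

# Build the fixture from \u escapes so this script stays ASCII-only,
# then write the UTF-8 bytes untouched.
fixture <- c(
  "{\"name\": \"\u60a8\u597d\", \"age\": 30}",
  "{\"name\": \"Xin ch\u00e0o\"}"
)
writeLines(enc2utf8(fixture), data_path, useBytes = TRUE)

# Test side: read the file and declare its encoding explicitly.
lines <- readLines(data_path, encoding = "UTF-8")

# Pull the expected names out of the reference lines.
expected <- regmatches(lines, regexpr('(?<="name": ").*?(?=")', lines, perl = TRUE))

print(Encoding(expected))
print(utf8ToInt(expected[1]))
```

The same path can then be handed to `read.df(path, "json")`, so Spark and R both see identical bytes.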

Why are the changes needed?

As-is, embedded data is not properly encoded on Windows:

library(SparkR)
SparkR::sparkR.session()
Sys.info()
#           sysname           release           version 
#         "Windows"      "Server x64"     "build 17763" 
#          nodename           machine             login 
# "WIN-5BLT6Q610KH"          "x86-64"   "Administrator" 
#              user    effective_user 
#   "Administrator"   "Administrator" 

Sys.getlocale()

# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)

system(paste0("cat ", jsonPath))
# {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
# {"name":"<U+60A8><U+597D>", "age":30}
# {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
# {"name":"Xin chào"}
# [1] 0

df <- read.df(jsonPath, "json")


printSchema(df)
# root
#  |-- _corrupt_record: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

head(df)
#              _corrupt_record age                                     name
# 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
# 2                       <NA>  30                         <U+60A8><U+597D>
# 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
# 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>

This can be reproduced outside the tests (Windows Server 2019, English locale), and it causes test failures once testthat is updated to 2.x (#27359). For some reason the problem is not picked up when the test is executed on testthat 1.0.2.
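One plausible reading of the `Xin ch<U+FFFD>o` record above, offered as an assumption rather than a confirmed diagnosis: in a 1252 locale, `writeLines()` re-encodes the string to the native code page, where "à" is the single byte `e0`, and that byte sequence is not valid UTF-8 when the file is later decoded as UTF-8. The sketch below uses `latin1` as a portable stand-in for Windows-1252 (the two agree for this character).

```r
word <- "Xin ch\u00e0o"

# Bytes of the same text in UTF-8 and in a single-byte Western encoding.
utf8_bytes   <- charToRaw(enc2utf8(word))
latin1_bytes <- iconv(word, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

print(utf8_bytes)    # "a with grave" occupies two bytes here
print(latin1_bytes)  # and a single byte here

# Check whether each byte sequence is valid UTF-8.
print(validUTF8(rawToChar(utf8_bytes)))
print(validUTF8(rawToChar(latin1_bytes)))
```

The CJK names have no representation in that code page at all, which would explain why they reach the file as literal `<U+XXXX>` text instead.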

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Running modified test, manual testing.

Note

An alternative seems to be to use bytes, but it hasn't been properly tested.

test_that("collect() support Unicode characters", {

  lines <- markUtf8(c(
    '{"name": "안녕하세요"}',
    '{"name": "您好", "age": 30}',
    '{"name": "こんにちは", "age": 19}',
    '{"name": "Xin ch\xc3\xa0o"}'
  ))

  jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
  writeLines(lines, jsonPath, useBytes = TRUE)

  expected <- regmatches(lines, regexec('(?<="name": ").*?(?=")', lines, perl = TRUE))

  df <- read.df(jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))

  rdf$name <- markUtf8(rdf$name)
  expect_equal(rdf$name[1], expected[[1]])
  expect_equal(rdf$name[2], expected[[2]])
  expect_equal(rdf$name[3], expected[[3]])
  expect_equal(rdf$name[4], expected[[4]])

  df1 <- createDataFrame(rdf)
  expect_equal(
    collect(
      where(df1, df1$name == expected[[2]])
    )$name,
    expected[[2]]
  )
})
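To make the bytes-based alternative easier to evaluate without a Spark session, here is a small stand-alone sketch of its write step. `mark_utf8` is a hypothetical equivalent of the `markUtf8` helper used above (its real definition is not shown in this thread), and the check only confirms that `useBytes = TRUE` writes the escaped bytes unchanged.

```r
# Hypothetical equivalent of the markUtf8 test helper (an assumption).
mark_utf8 <- function(s) {
  Encoding(s) <- "UTF-8"
  s
}

# Non-ASCII characters are spelled as UTF-8 byte escapes, so the source stays ASCII.
raw_line <- '{"name": "Xin ch\xc3\xa0o"}'
line <- mark_utf8(raw_line)

path <- tempfile(pattern = "use-bytes", fileext = ".json")
# useBytes = TRUE skips re-encoding to the native locale on write.
writeLines(line, path, useBytes = TRUE)

# Inspect the file as raw bytes, ignoring the line terminator.
written <- readBin(path, what = "raw", n = file.size(path))
written <- written[!written %in% as.raw(c(0x0d, 0x0a))]

print(identical(written, charToRaw(raw_line)))
```

This covers only the file contents; whether the values collected back from Spark compare equal on Windows is the part that still needs testing.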

HyukjinKwon

LGTM if tests pass on AppVeyor.

SparkQA · 2020-01-26T02:03:39Z

Test build #117405 has finished for PR 27362 at commit 87bf625.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-01-26T03:57:39Z

Merged to master and branch-2.4.

zero323 · 2020-01-26T03:58:19Z

Thanks @HyukjinKwon

Closes #27362 from zero323/SPARK-30645.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 40b1f4d)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

dongjoon-hyun · 2020-01-26T04:51:14Z

+1, late LGTM. Thanks!

Move Unicode test data to external file

87bf625

zero323 mentioned this pull request Jan 26, 2020

[SPARK-23435][SPARKR][TESTS] Update testthat to >= 2.0.0 #27359

Closed

HyukjinKwon approved these changes Jan 26, 2020

HyukjinKwon closed this in 40b1f4d Jan 26, 2020

zero323 deleted the SPARK-30645 branch January 26, 2020 04:00

dongjoon-hyun added SPARKR TESTS labels Jan 26, 2020
