diffobj doesn't ignore encodings #144

hadley · 2020-03-29T14:36:05Z

x <- c("fa\xE7ile", "fa\ue7ile")
Encoding(x) <- c("latin1", "UTF-8")
y <- rev(x)

x == y
#> [1] TRUE TRUE
identical(x, y)
#> [1] TRUE

diffobj::ses(x, y)
#> [1] "1d0" "2a2"

Created on 2020-03-29 by the reprex package (v0.3.0)

brodieG · 2020-03-29T18:38:31Z

Thanks for reporting. Comparisons are done on the string pool memory addresses, so anything that gets a new address will be considered different. We could translate to a common encoding, although that has a cost. Since diffobj claims to do diffs on the display representation of objects, at a minimum this should be documented.
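To illustrate why the two strings end up as distinct entries, here is a minimal sketch using only base R. It reuses `x` from the report and inspects the underlying bytes with `charToRaw()`; it does not look at diffobj internals or at actual memory addresses, so treat it as an illustration of the byte-level difference rather than of the comparison diffobj performs.

```r
# Same text, two encodings (taken from the original report)
x <- c("fa\xE7ile", "fa\ue7ile")
Encoding(x) <- c("latin1", "UTF-8")

# R translates before comparing, so the strings test equal
x[1] == x[2]

# The stored bytes differ: latin1 uses one byte (e7) for the cedilla,
# UTF-8 uses two (c3 a7)
charToRaw(x[1])
charToRaw(x[2])
identical(charToRaw(x[1]), charToRaw(x[2]))
```

Because the byte sequences differ, the two elements cannot be the same cached string object, even though `==` and `identical()` treat them as equal.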

hadley · 2020-03-29T18:42:32Z

I think you should consider always re-encoding to UTF-8 (or some other matching encoding), since it is very rare for R to distinguish between the same string in different encodings (i.e. see the identical() result above).

brodieG · 2020-03-29T18:49:31Z

Certainly, the fact that identical() does the re-encoding is a strong argument in favor of doing the same. I'll look into it next time I update the package; if you have a pressing need for this change let me know (I guess worst case you can re-encode yourself first in the meantime).
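The interim workaround of re-encoding first could look roughly like the sketch below. It assumes `enc2utf8()` from base R is an acceptable common encoding; the names `x_utf8` and `y_utf8` are introduced here for illustration, and the diffobj call is guarded so the snippet still runs where the package is not installed.

```r
x <- c("fa\xE7ile", "fa\ue7ile")
Encoding(x) <- c("latin1", "UTF-8")
y <- rev(x)

# Translate both inputs to UTF-8 before diffing
x_utf8 <- enc2utf8(x)
y_utf8 <- enc2utf8(y)

# Every element is now marked UTF-8 and the bytes agree
Encoding(x_utf8)
identical(charToRaw(x_utf8[1]), charToRaw(y_utf8[1]))

# Run the diff on the re-encoded vectors, if diffobj is available
if (requireNamespace("diffobj", quietly = TRUE)) {
  print(diffobj::ses(x_utf8, y_utf8))
}
```

With both vectors in one encoding, one would expect the edit script to report no differences, though that has not been verified here against any particular diffobj version.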

brodieG · 2020-05-08T23:20:20Z

I don't think we can avoid enc2utf8 b/c there is no cheap way to distinguish between "unknown" and ASCII encoding, and if we have strings with both "latin1" and ASCII in a non-latin1/UTF-8 locale we must assume there could be some non-ASCII in the "unknown". However, the cost seems minimal:

> x <- rawToChar((as.raw(160:255)), multiple=TRUE)
> xx <- do.call(paste0, expand.grid(x, x, x))
> Encoding(xx) <- "latin1"
> zz <- paste0(xx, rev(xx), xx, rev(xx))
> length(zz)
[1] 884736
> system.time(ww <- enc2utf8(zz))
   user  system elapsed 
  0.009   0.000   0.008 
>
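The "unknown" versus ASCII ambiguity mentioned above can be seen from R code with a small sketch. `is_ascii()` is a hypothetical helper written for this example, not a base R or diffobj function; it scans the bytes of a single string, which is exactly the kind of per-string work that `Encoding()` alone does not do.

```r
ascii  <- "facile"
# A string with a non-ASCII byte (0xe7) and no declared encoding
native <- rawToChar(as.raw(c(0x66, 0x61, 0xe7)))

# Encoding() reports the same mark for both strings
Encoding(c(ascii, native))

# Declaring an encoding on an ASCII string has no effect on its mark
Encoding(ascii) <- "latin1"
Encoding(ascii)

# Hypothetical helper: TRUE if every byte of one string is below 0x80
is_ascii <- function(s) all(as.integer(charToRaw(s)) < 128L)
is_ascii(ascii)
is_ascii(native)
```

Since the encoding mark alone cannot separate the two cases, translating everything with `enc2utf8()` avoids having to inspect each string individually.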

brodieG added this to the 0.2.5 milestone Mar 29, 2020

brodieG added the bug label Apr 6, 2020

brodieG added the fixed in dev label May 9, 2020

brodieG closed this as completed in 7a5556c May 11, 2020
