Fragility of serialize() for digest() use? #95

Closed
ilarischeinin opened this issue Jan 17, 2019 · 5 comments

Comments

@ilarischeinin

First a little background: I use testthat and its expect_known_hash() (which in turn uses digest::digest()) for a package that I'm developing. I noticed that my laptop (Mac) and our CI (Linux) were returning different hashes for one object, which was causing my tests to fail.

As part of debugging, I used dput() to inspect the objects and, not surprisingly, there were some differences in floating point values. I figured that was probably the cause (pull request r-lib/testthat#822 references your digest() vs. sha1() vignette), but decided to dig a bit further anyway.

I saved the objects as rds files, transferred them to the same machine, and loaded them into the same R session. To my surprise, they now passed both all.equal() and identical() but still had different hashes with digest::digest().

Looking at the output of dput() for the two objects, they were identical (including the floating point values) except that the attributes class and row.names appeared in a different order (the objects are data.frames). The output of serialize() also differs, as is to be expected given the different digest::digest() hashes. I don't know what determines the attribute order in the output of dput(), nor whether the same underlying reason is at play with serialize(). But whatever the cause, doesn't this behavior of serialize() seem a bit fragile for digest::digest()? I know the package has been around for quite some time, so I guess this must be some kind of rare edge case. Anyway, digest::sha1() gives identical hashes.

Below is a simple reprex that shows this behavior:

a <-
  structure(
    list(a = "example"),
    class = "data.frame",
    row.names = c(NA, -1L)
  )

b <-
  structure(
    list(a = "example"),
    row.names = c(NA, -1L),
    class = "data.frame"
  )

all.equal(a, b)
#> [1] TRUE
identical(a, b)
#> [1] TRUE

digest::digest(a)
#> [1] "64026fb88a58c424353ad931698acbb3"
digest::digest(b)
#> [1] "8335977c807d32d87b4c39bdf0c1c6b1"

digest::digest(a, algo = "sha1")
#> [1] "3cc1ed15c94980d4890179401b78e017f499a4c5"
digest::digest(b, algo = "sha1")
#> [1] "8464f91957b9587c3205b4ed888ebfc90abe4d12"

digest::sha1(a)
#> [1] "8a98077f38de43dd1e716e69e6ce1d58712f75af"
digest::sha1(b)
#> [1] "8a98077f38de43dd1e716e69e6ce1d58712f75af"

serialize(a, connection = NULL)
#>   [1] 58 0a 00 00 00 02 00 03 05 02 00 02 03 00 00 00 03 13 00 00 00 01 00
#>  [24] 00 00 10 00 00 00 01 00 04 00 09 00 00 00 07 65 78 61 6d 70 6c 65 00
#>  [47] 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00
#>  [70] 10 00 00 00 01 00 04 00 09 00 00 00 01 61 00 00 04 02 00 00 00 01 00
#>  [93] 04 00 09 00 00 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00
#> [116] 09 00 00 00 0a 64 61 74 61 2e 66 72 61 6d 65 00 00 04 02 00 00 00 01
#> [139] 00 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00
#> [162] 00 02 80 00 00 00 ff ff ff ff 00 00 00 fe
serialize(b, connection = NULL)
#>   [1] 58 0a 00 00 00 02 00 03 05 02 00 02 03 00 00 00 03 13 00 00 00 01 00
#>  [24] 00 00 10 00 00 00 01 00 04 00 09 00 00 00 07 65 78 61 6d 70 6c 65 00
#>  [47] 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00
#>  [70] 10 00 00 00 01 00 04 00 09 00 00 00 01 61 00 00 04 02 00 00 00 01 00
#>  [93] 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00 00
#> [116] 02 80 00 00 00 ff ff ff ff 00 00 04 02 00 00 00 01 00 04 00 09 00 00
#> [139] 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 0a
#> [162] 64 61 74 61 2e 66 72 61 6d 65 00 00 00 fe

Created on 2019-01-17 by the reprex package (v0.2.1)
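
The attribute-order difference described above can also be checked directly, without reading dput() or serialize() output. The following is a minimal base R sketch that reuses the two objects from the reprex; it only inspects the objects and makes no claim about why a given platform produces one order or the other.

```r
# Same two objects as in the reprex: identical content,
# attributes supplied to structure() in a different order.
a <- structure(
  list(a = "example"),
  class = "data.frame",
  row.names = c(NA, -1L)
)
b <- structure(
  list(a = "example"),
  row.names = c(NA, -1L),
  class = "data.frame"
)

# attributes() reports the attributes in their stored order.
names(attributes(a))
#> [1] "names"     "class"     "row.names"
names(attributes(b))
#> [1] "names"     "row.names" "class"

# Same set of attribute names, different order.
same_set   <- setequal(names(attributes(a)), names(attributes(b)))
same_order <- identical(names(attributes(a)), names(attributes(b)))
```

Here same_set is TRUE while same_order is FALSE, which matches the byte-level difference visible in the two serialize() dumps.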

@eddelbuettel
Owner

I think you just demonstrated (quite nicely) that reordering in structure() can lead to different results in serialize().

Both are base R functions I use as is. There is nothing wrong with the digest package or code.

@eddelbuettel
Owner

In a nutshell we now have:
sha1(serialize(a)) == sha1(serialize(b))
as more stringent than
identical(a,b)
which is more stringent than
all.equal(a,b)

No more, no less.
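
The three levels of strictness can be illustrated with base R alone. This is a sketch under the assumption that comparing the raw serialize() output stands in for comparing hashes of it (identical bytes give identical hashes); the numeric values x and y are made up for illustration and do not come from the original report.

```r
# Level 1 vs. level 2: all.equal() tolerates tiny numeric differences,
# identical() does not.
x <- 1
y <- 1 + 1e-10
loose_equal  <- isTRUE(all.equal(x, y))  # TRUE: within default tolerance
strict_equal <- identical(x, y)          # FALSE: the doubles differ

# Level 2 vs. level 3: identical() ignores attribute order by default,
# the serialized bytes do not.
a <- structure(
  list(a = "example"),
  class = "data.frame",
  row.names = c(NA, -1L)
)
b <- structure(
  list(a = "example"),
  row.names = c(NA, -1L),
  class = "data.frame"
)
objects_identical <- identical(a, b)     # TRUE
bytes_identical <- identical(
  serialize(a, connection = NULL),
  serialize(b, connection = NULL)
)                                        # FALSE
```

Each comparison accepts strictly fewer pairs than the one before it, so a hash of the serialized bytes is the most demanding check of the three.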

@ilarischeinin
Author

So, would you say that it's better if use cases such as testthat::expect_known_hash() that should give identical results across platforms used digest::sha1() instead of digest::digest() because of possible differences in serialization?

@eddelbuettel
Owner

This is not the testthat repo either. Nowhere do I talk about testthat.

I have sympathy for your concern but you are talking to the wrong entity. Either take it up with R Core for base R (though I doubt they promised reordering would lead to identical serialization) or with the testthat team.

@ilarischeinin
Author

Sorry, I didn't phrase my question very well. I just wanted to ask: since digest provides two functions that compute hashes, digest() and sha1(), do the package authors have a recommendation on which one to choose in a specific use case? The vignette linked above discusses this from the point of view of floating point numbers (which doesn't directly apply here), but the section "Choosing digest() or sha1()" simply says "TBD". I mentioned testthat just as an example of a case that should work (that is, return identical hashes) across different operating systems, etc.

"It depends" and "TBD" are valid answers to that question.

Anyways, thanks for your time (and the super quick replies).
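
For readers who hit the same problem, one possible workaround is to put the attributes into a fixed order before serializing. The sketch below is an assumption-laden illustration, not something proposed in this thread: normalize_attr_order() is a hypothetical helper that exists in neither digest nor base R, it only touches top-level attributes (nested list elements are left alone), and it has only been reasoned through for the small data.frame from the reprex.

```r
# Hypothetical helper (not part of digest or base R): rewrite an object's
# top-level attributes in a fixed, locale-independent name order.
normalize_attr_order <- function(x) {
  at <- attributes(x)
  if (is.null(at)) {
    return(x)
  }
  # method = "radix" orders character vectors as in the C locale,
  # so the result does not depend on the platform's collation rules.
  attributes(x) <- at[order(names(at), method = "radix")]
  x
}

a <- structure(
  list(a = "example"),
  class = "data.frame",
  row.names = c(NA, -1L)
)
b <- structure(
  list(a = "example"),
  row.names = c(NA, -1L),
  class = "data.frame"
)

a_norm <- normalize_attr_order(a)
b_norm <- normalize_attr_order(b)

# After normalization the serialized bytes agree, so any hash computed
# from them would agree as well.
bytes_match <- identical(
  serialize(a_norm, connection = NULL),
  serialize(b_norm, connection = NULL)
)
```

With the normalized objects, bytes_match is TRUE, and the objects still compare as identical() to the originals. Whether this is preferable to calling digest::sha1() directly depends on the use case; the thread itself leaves that question open.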
