Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove source reference before digesting #200

Closed
dipterix opened this issue Jan 12, 2024 · 20 comments
Closed

Remove source reference before digesting #200

dipterix opened this issue Jan 12, 2024 · 20 comments

Comments

@dipterix
Copy link

dipterix commented Jan 12, 2024

Function and language objects usually have attribute srcref that changes when users choose to run the code differently. For example

a <- function(){}
digest::digest(a)

Changes every time when I use command/ctrl+shift+return/enter in RStudio. If you select the script and command+return, the results could also change even if you add blank lines in the selection.

I wonder if it's possible to remove srcref when digesting objects?

@eddelbuettel
Copy link
Owner

Possibly via an option or different entry point, or a wrapper around digest.

As you know, digest takes and (ahem) "digests" what serialize gives it, so you would have to look into what serialize lets you do. I am not too concerned with your environment and history changing when you do whatever you do there in an IDE which may have side effects. They key is that digest does what it does reliably and repeatedly, also in a normal shell session:

> a <- function(){}
> digest::digest(a)
[1] "b6ce3bd7555610d97bf1b73331c4e5d8"
> digest::digest(a)
[1] "b6ce3bd7555610d97bf1b73331c4e5d8"
> digest::digest(a)
[1] "b6ce3bd7555610d97bf1b73331c4e5d8"
> digest::digest(a)
[1] "b6ce3bd7555610d97bf1b73331c4e5d8"
> 

@dipterix
Copy link
Author

dipterix commented Jan 12, 2024

How about this?

> x <- quote({print(a)})
> digest::digest(x)
[1] "1d39e309d1c74c5f775ba93fb31945fc"
> 
> y <- quote({print( a )})
> digest::digest(y)
[1] "cfff128b7af1f7d8e4a96a017c666951"
> 
> print(x)
{
    print(a)
}
> print(y)
{
    print(a)
}
> 
> identical(x, y)
[1] FALSE
> identical(removeSource(x), removeSource(y))
[1] TRUE

I think R keeps source in language & functions. attr(y, "srcref") shows the original text generating the object, though the underlying R objects should be identical.

For a simple language or function, I can use removeSource or rlang::zap_srcref to get rid of source reference. However, I feel it could be very difficult when x is an nested environment that contains functions and languages.

@eddelbuettel
Copy link
Owner

Sure but I still do not see this as a problem with the scope and use of digest.

The R language is clear about functions and enclosing environments. If you want a 'stripped down' serialize() you seem to now know what to take out. digest should, methinks, be more general and digest the actual object as it is .

Or maybe I am still missing what you are trying to explain to me...

@dipterix
Copy link
Author

dipterix commented Jan 12, 2024

I believe when users use digest, they expect the same digest result if the underlying R objects are equivalent, rather than the code generating them to be exactly the same. Maybe the following example is a better demonstration:

> options(keep.source = TRUE)
> a <- function(){}
> digest::digest(a)
[1] "b6ce3bd7555610d97bf1b73331c4e5d8"
> a <- function(){}
> digest::digest(a)
[1] "b6ce3bd7555610d97bf1b73331c4e5d8"
> a  <-        function(){}
> digest::digest(a)
[1] "7e8ebbd63410116d67e85ebdbd01bec0"
> b <- function(){}
> digest::digest(b)
[1] "00ddb1241c5b90837b64e8656c26a540"

> options(keep.source = FALSE)
> a <- function(){}
> digest::digest(a)
[1] "c82d22ecfd038cae8f95bb2099fe84c7"
> a  <-        function(){}
> digest::digest(a)
[1] "c82d22ecfd038cae8f95bb2099fe84c7"
> b <- function(){}
> digest::digest(b)
[1] "c82d22ecfd038cae8f95bb2099fe84c7"

All a and b have the same formals, bodies, and environments. In fact, the first two as generate the identical digests. However, I don't see any reasons why the third a and the first b is any different than the first two as simply there are extra spacing around <- or the variable has different name.

If I turn off options(keep.source = FALSE), all a and b are the same. Of course I can kindly ask other people to turn off this option, but there should be a more elegant way?

@eddelbuettel
Copy link
Owner

I still see nothing wrong R or serialize() and as you demonstrate, it is documented behaviour. digest is a package with a 20-year history and millions of deployments. We have to think very hard about changes in default behaviour.

What you show seems like a perfect illustration of an available change that can be opted into locally via an existing variable. So I think we are good here.

@eddelbuettel
Copy link
Owner

PS I could add to the help page that setting the options removes white space. That may be useful for some.

@dipterix
Copy link
Author

Adding a help page would be helpful.

PS: I was not asking to change the default behavior of a 20+ year package, instead I was just asking if we could add this option to remove srcref. But if it's out of scope for digest, maybe I should ask the R core devs to have that option when serialize() objects.

@eddelbuettel
Copy link
Owner

Gotcha. That is better. And yes, given that digest() is essential the one and sole (with minor exceptions) entry point we could add an option to remove the attribute. That is within scope.

In fact, sometimes attributes get added (I add a query status when returning tiledb objects) and users may want to remove them so you get propose a PR which strips attributes (with default not to for backwards comp) but allows to remove srcref as well as other listed attributes. Would that make sense?

@eddelbuettel eddelbuettel reopened this Jan 12, 2024
@dipterix
Copy link
Author

That makes sense. I see in digest():

   if (serialize && !file) {
        if (!is_streaming_algo) {
            object <- if (.hasNoSharing()) 
                serialize(object, connection = NULL, ascii = ascii, 
                  nosharing = TRUE, version = serializeVersion)
            else serialize(object, connection = NULL, ascii = ascii, 
                version = serializeVersion)
        }
        if (is.character(skip) && skip == "auto") 
            skip <- set_skip(object, ascii)
    }

It's calling serialize directly. It seems that I do need to ask R devs to add that feature.

Ideally it could be serialize(object, connection = NULL, ..., keep.source = keep.source), then digest(..., keep.source = getOption("keep.source", TRUE))

@eddelbuettel
Copy link
Owner

Isn't what you want about (pseudo-code, not tests)

if (stripAttrArgumentTrue)
   obj <- stripAttribute(obj)
# carry on as before

ie by altering the object to what you wish it were (eg no source ref) you get your behaviour? No need to involve R and create a different, versioned, dependency methinks.

@dipterix
Copy link
Author

dipterix commented Jan 16, 2024

I'm implementing the PR and encountered something that beyond me knowledge. May I ask for your insight? @eddelbuettel

The first case is as below: a keeps changing its digest once called, but its digest becomes stable after 3 calls (the 3rd and 4th digests are the same). This behavior has nothing to do with keep.source option. I turn it off simply to avoid srcref to cause any artifacts.

> library(digest)
> 
> options(keep.source = FALSE)
> a <- function(){}
> digest(a)
[1] "c82d22ecfd038cae8f95bb2099fe84c7"
> 
> a()
NULL
> digest(a)
[1] "86058e413c7f8b2d6cef3d5a019486f7"
> 
> a()
NULL
> digest(a)
[1] "48717494d87c56e418ce1287ad31ec06"
> 
> a()
NULL
> digest(a)
[1] "48717494d87c56e418ce1287ad31ec06"

The second case is removeSource from utils package will create a copy of a, but then the digest of the new copy differs from a, but identical shows a and b are identical.

> b <- removeSource(a)
> identical(a, b)
[1] TRUE
> digest(b)
[1] "c82d22ecfd038cae8f95bb2099fe84c7"
> 

I think this has something to do with sexpinfo.gp in the function. It seems gp is changed when I call the function... It seems that I should use digest to find out whether two objects are different instead of "not" different.

@eddelbuettel
Copy link
Owner

eddelbuettel commented Jan 16, 2024

As was touched upon previously, both here and on the mailing list where you chose to also post, what digest() transforms comes from serialize(). How serialize() behaves is something outside the scope of R.

I have also learned over the years that it is much preferable to first discuss what a PR is supposed to do before handing it over. It may save you disappointment when the PR may end up being declined. Last we spoke here, I believe my recommendation to you was to alter the object before handing it to digest(). (Or maybe we said we could do that optionally inside the R-level digest() function but function it is the same methinks.

@eddelbuettel
Copy link
Owner

See below. Note that the digest package is neither loaded nor attached or loaded.

> options(keep.source = FALSE)
> a <- function(){}
> serialize(a, NULL, ascii=TRUE)
 [1] 41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 30 32 37 0a 32 35 33 0a 32 35 34 0a 36 0a 31 0a 32 36 32 31 35 33 0a 31 0a 7b 0a 32 35 34 0a
> a()
NULL
> serialize(a, NULL, ascii=TRUE)
 [1] 41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 32 36 33 31 37 31 0a 32 35 33 0a 32 35 34 0a 36 0a 31 0a 32 36 32 31 35 33 0a 31 0a 7b 0a 32 35 34 0a
> a()
NULL
> serialize(a, NULL, ascii=TRUE)
  [1] 41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 30 32 37 0a 32 35 33 0a 32 35 34 0a 32 31 0a 31 0a 31 33 0a 33 0a 31 32 0a 31 37 0a 31 0a 33 0a 36 0a 32 35 34 0a
 [66] 30 0a 31 0a 32 36 32 31 35 33 0a 31 0a 7b 0a 30 0a 32 35 34 0a 30 0a 32 35 34 0a 31 33 0a 37 38 31 0a 33 0a 4e 41 0a 31 0a 31 0a 31 30 32 36 0a 31 0a 32 36 32 31 35 33 0a 35 0a 63 6c 61 73 73 0a
[131] 31 36 0a 31 0a 32 36 32 31 35 33 0a 31 36 0a 65 78 70 72 65 73 73 69 6f 6e 73 49 6e 64 65 78 0a 32 35 34 0a
> a()
NULL
> serialize(a, NULL, ascii=TRUE)
  [1] 41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 30 32 37 0a 32 35 33 0a 32 35 34 0a 32 31 0a 31 0a 31 33 0a 33 0a 31 32 0a 31 37 0a 31 0a 33 0a 36 0a 32 35 34 0a
 [66] 30 0a 31 0a 32 36 32 31 35 33 0a 31 0a 7b 0a 30 0a 32 35 34 0a 30 0a 32 35 34 0a 31 33 0a 37 38 31 0a 33 0a 4e 41 0a 31 0a 31 0a 31 30 32 36 0a 31 0a 32 36 32 31 35 33 0a 35 0a 63 6c 61 73 73 0a
[131] 31 36 0a 31 0a 32 36 32 31 35 33 0a 31 36 0a 65 78 70 72 65 73 73 69 6f 6e 73 49 6e 64 65 78 0a 32 35 34 0a
> 

There is apparently a difference to R between calling a() once or twice. I have no idea why.

@dipterix
Copy link
Author

Haha it's good to know the scope before going too far. And I really appreciate that you could answer me : )

After digging into the R internal source code, I found that PackFlags encodes some extra SEXP information (LEVELS(s) and OBJECT(s)) when serializing objects (see code below or source code).

        case CLOSXP:
...
        flags = PackFlags(TYPEOF(s), LEVELS(s), OBJECT(s),
                          hasattr, hastag);
        OutInteger(stream, flags);

I sent an email to r-devel mailing list and they told me that serialize is not good for comparing objects in this case, and asked me to use identical instead. Maybe I should "serialize" environment/language/function myself and use digest to compare atomic objects.

@eddelbuettel
Copy link
Owner

It can be annoying when it gets to this detail :-/

Can you briefly describe your use case again and why you seem to need source references?

And yes, using digest for what it is pretty good and reliable (comparing atomic objects of whatever structure) seems a good move if you can handle what makes your case 'special' in another way?

@dipterix
Copy link
Author

dipterix commented Jan 16, 2024

Currently pipeline workflow packages such as targets depend on digest to cache the results. The basic idea is that if the input variables and the functions or expressions have the same digest, then users can instruct to skip the recalculation. This is very helpful when analyses take long to run, or if someone wants to validate the calculation without re-run the entire analysis.

The framework works under the assumption:

If object a and b are identical, then they have the same digest

This assumption works great on atomic results. However, when the results are (or contain) functions/expressions, the digest results become unstable. It gives us lots of false-positives: identical objects may still result in different digest.

Initially I thought it was because of the source reference. Since R languages and functions preserve the original code generating the objects, depending on how you run the code, the source references will be different. Indeed after running options(keep.source = FALSE), I was able to obtain the same digest for functions. Therefore I posted this issue asking to have the options to remove source reference.

However, it seems source reference is not the only cause that fails the assumption. Even the same function/expression (with the same memory address), its digest changes once evaluated.

Also the reason I posted this issue was because I thought if users use digest package, they might have the same assumption. I might have to rethink the implementation.

Please feel free to close this issue. I think the cause is serialize, which is out of scope for digest.

@eddelbuettel
Copy link
Owner

Sure. digest has long been used to monitor cacheing, all the way from (more locally in) memoize to more broadly in knitr. I am not a user of targets but in all we have discussed here I do not think we have demonstrated a shortcoming in digest so I think closing this is indeed appropriate.

As this is however what digest is for, if you should identify something that needs changing or extending I am all for it. A little example demonstrating shortcomings and beginnings of ideas about what/where to change are always welcome please feel free to reopen as needed.

@dipterix
Copy link
Author

dipterix commented Jan 16, 2024

Thanks! I'm so grateful that you could help me on this.

Also attach the reply from Tomas (r-devel mailing list) just for future reference.

I don't think such functionality would belong to serialize(). This function is not meant to produce stable results based on the input, the serialized representation may even differ based on properties not seen by users.

I think an option to ignore source code would belong to a function that computes the hash, as other options of identical().

Tomas

Unfortunately memorise does not solve it since it use rlang::hash, which also suffers from this issue:

  • memorise case
> memF <- memoise::memoise(function(f){ f() })
> a <- function(){
+   message("a is evaluated")
+ }
> memF(a)
a is evaluated
> memF(a)
a is evaluated
> memF(a)
a is evaluated
> memF(a)
> memF(a)
  • rlang case
> a <-   function(){}
> rlang::hash(a)
[1] "badd918f7a8d088de7ce5c4e817d8dd2"
> a <- function(){}
> rlang::hash(a)
[1] "e2a98ccbe019303395180640cf6959b7"
> a()
NULL
> rlang::hash(a)
[1] "f6563a8b0972d9d0ddd2e3fc1d091ef6"

@eddelbuettel
Copy link
Owner

(memoize used to use digest, so I guess that switched as a certain firm prefers all dependencies in house.)

I am a reader of r-devel and have been for decades so no need to copy from there.

@eddelbuettel
Copy link
Owner

(And of course I'd wish this were simpler! digest is mature and has helped with just this for over 20 years at CRAN now. I am a little puzzled why it doesn't here ...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants