
Closures capture storage in extras #252

Closed
timholy opened this issue May 10, 2024 · 18 comments
Labels: core (Related to the core utilities of the package), wontfix (This will not be worked on)

Comments

timholy commented May 10, 2024

If you prepare a project with both DifferentiationInterface and ForwardDiff:

using DifferentiationInterface
using ForwardDiff

struct FC{F,C,LT}
    f::F
    c::C
    Lx::LT

    function FC(f, c)
        # Define the Lagrange function.
        # Compared to defining it locally in downstream "consumer" functions,
        # this ensures it's the same function-type each time. Aside from
        # reducing useless re-compilation, the main impact is that we can use
        # DifferentiationInterface's `prepare_*` functions to get better autodiff
        # performance.
        function Lx(λ)
            x -> f(x) - λ'*c(x)
        end
        new{typeof(f),typeof(c),typeof(Lx)}(f, c, Lx)
    end
end

# Powell's Maratos-effect test problem
f(x) = 2 * (x[1]^2 + x[2]^2 - 1) - x[1]
c(x) = [x[1]^2 + x[2]^2 - 1]
fc = FC(f, c)

θ = π/12
x = [cos(θ), sin(θ)]
λ = [3/2]
λprep = [0.2]

backend = SecondOrder(AutoForwardDiff(), AutoForwardDiff())
extrasL = prepare_hessian(fc.Lx(λprep), backend, x)
H1 = hessian(fc.Lx(λ), backend, x)
H2 = hessian(fc.Lx(λ), backend, x, extrasL)
copyto!(λprep, λ)
H3 = hessian(fc.Lx(λ), backend, x, extrasL)
@assert H1 ≈ H2
@assert H1 ≈ H3

yields


julia> H1 = hessian(fc.Lx(λ), backend, x)
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0

julia> H2 = hessian(fc.Lx(λ), backend, x, extrasL)
2×2 Matrix{Float64}:
 3.6  0.0
 0.0  3.6

julia> copyto!(λprep, λ)
1-element Vector{Float64}:
 1.5

julia> H3 = hessian(fc.Lx(λ), backend, x, extrasL)
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0

julia> @assert H1 ≈ H2
ERROR: AssertionError: H1 ≈ H2
Stacktrace:
 [1] top-level scope
   @ REPL[55]:1

julia> @assert H1 ≈ H3

The issue seems to be that the specific storage in λprep somehow gets captured in extrasL, corrupting the answer.
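
A minimal check of that, reusing the definitions above (Julia closures store their captured variables as fields, so the field access below is purely illustrative):

g = fc.Lx(λprep)
g.λ === λprep        # true: the closure holds a reference to λprep, not a copy
copyto!(λprep, λ)
g(x)                 # now evaluates f(x) - λ'*c(x), even though g was built from λprep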

I've also tried the following:

    # backend = SecondOrder(AutoEnzyme(Enzyme.Forward), AutoEnzyme(Enzyme.Reverse))
    # backend = SecondOrder(AutoForwardDiff(), AutoEnzyme(Enzyme.Reverse))
    # backend = SecondOrder(AutoForwardDiff(), AutoZygote())

Enzyme fails outright (ERROR: Attempting to call an indirect active function whose runtime value is inactive...); Zygote gives the same wrong numerical result.

gdalle (Owner) commented May 10, 2024

I think that happens because the HVP preparation (used in the extras) itself defines a closure using f:

function prepare_hvp_aux(f::F, backend::SecondOrder, x, v, ::ForwardOverForward) where {F}
    # pushforward of many pushforwards in theory, but pushforward of gradient in practice
    inner_gradient_closure(z) = gradient(f, inner(backend), z)
    outer_pushforward_extras = prepare_pushforward(
        inner_gradient_closure, outer(backend), x, v
    )
    return ForwardOverForwardHVPExtras(inner_gradient_closure, outer_pushforward_extras)
end

I'm not sure there is anything I can do about it. The extras are specific to a given pair (f, backend), so if you modify the function (even through a sneaky closure over λprep), I feel like you deserve your fate?

As for the other backends, not every pair is compatible with SecondOrder. We already knew that for a few pairs, but DI allows more systematic testing, and thus more systematic disappointment.

timholy (Author) commented May 10, 2024

Presumably it should be documented in prepare_hessian and any similarly-affected functions? "I feel like you deserve your fate" applies only to people who understand enough about the internals to realize this is a predictable problem.

timholy (Author) commented May 10, 2024

It's perhaps also worth pointing out that this interacts with #206. If you can't efficiently differentiate L(x), then a good alternative is to compute the "hessian" of both f and c (note that c is vector-valued, because usually you have many constraints), and from that you can contract with λ.
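
Concretely, something like this sketch (reusing f, c, λ, x, and backend from above; the per-component closures are just for illustration and wastefully recompute c):

cs = [x -> c(x)[i] for i in eachindex(λ)]            # scalar components of c
prep_f = prepare_hessian(f, backend, x)
prep_cs = [prepare_hessian(cs[i], backend, x) for i in eachindex(λ)]

function hessian_of_lagrangian(λ, x)
    H = hessian(f, backend, x, prep_f)               # none of these preparations depend on λ
    for i in eachindex(λ)
        H -= λ[i] * hessian(cs[i], backend, x, prep_cs[i])
    end
    return H
end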

gdalle (Owner) commented May 10, 2024

I might have been a bit unfair to you, but I had also never imagined someone would modify the function between runs and expect the result to stay the same.
How would you phrase it more precisely than the current doc?

The preparation prepare_operator(f, backend, x) will create an object called extras containing the necessary information to speed up operator and its variants. This information is specific to backend and f, as well as the type and size of the input x and the control flow within the function, but it should work with different values of x.
The extras object is nearly always mutated when given to an operator, even when said operator does not have a bang ! in its name.

timholy (Author) commented May 10, 2024

julia> using DifferentiationInterface

help?> prepare_hessian
search: prepare_hessian

  prepare_hessian(f, backend, x) -> extras

  Create an extras object subtyping HessianExtras that can be given to Hessian operators.

No hint of danger there.

help?> DifferentiationInterface.HessianExtras
  HessianExtras

  Abstract type for additional information needed by Hessian operators.

Nor there.

gdalle (Owner) commented May 10, 2024

Basically all of DI treats the function f as a fixed object. For instance we don't differentiate with respect to its internal fields, only with respect to x.

gdalle (Owner) commented May 10, 2024

The preparation prepare_operator(f, backend, x) will create an object called extras containing the necessary information to speed up operator and its variants. This information is specific to backend and f, as well as the type and size of the input x and the control flow within the function, but it should work with different values of x.

Should I copy this warning into each preparation function? Do you have a better wording?

gdalle (Owner) commented May 10, 2024

How about

"If the function f changes in any way, the result of preparation will be invalidated, and you will need to run it again"

timholy (Author) commented May 10, 2024

julia> hc = ForwardDiff.HessianConfig(fc.Lx(λprep), x);

julia> ForwardDiff.hessian(fc.Lx(λ), x, hc)
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0

julia> ForwardDiff.hessian(fc.Lx(λ), x)
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0

julia> ForwardDiff.hessian(fc.Lx(λprep), x)
2×2 Matrix{Float64}:
 3.6  0.0
 0.0  3.6

Works just fine.

timholy (Author) commented May 10, 2024

Also relevant:

julia> typeof(fc.Lx(λ))
var"#3#5"{Vector{Float64}, typeof(f), typeof(c)}

julia> typeof(fc.Lx(λprep))
var"#3#5"{Vector{Float64}, typeof(f), typeof(c)}

There is literally nothing different about the type.

gdalle (Owner) commented May 10, 2024

It's not about the type: preparation is specific to the actual function object. And when it's a closure or functor, the contract is that if you change it, you re-prepare the operator.
I thought that was natural, I was wrong, so I'll add it to the docstrings.
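
E.g., with the example from the top of this issue, the supported pattern would be (sketch, same names as above):

Lλ = fc.Lx(λ)                              # build a fresh closure for the current λ
extrasL = prepare_hessian(Lλ, backend, x)  # ...and re-prepare for that exact object
H = hessian(Lλ, backend, x, extrasL)       # consistent, because the extras match Lλ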

timholy (Author) commented May 10, 2024

It is a little unfortunate: if you're doing constrained optimization, λ is likely to change on every iteration. Presumably there is no point in preparation?

adrhill (Collaborator) commented May 10, 2024

While this is about forward-mode, changing memory used by functions is more generally an issue in reverse-mode AD, not just with the preparation of extras:

julia> using DifferentiationInterface

julia> import Zygote

julia> A = [1.0 2.0; 3.0 4.0]
2×2 Matrix{Float64}:
 1.0  2.0
 3.0  4.0

julia> f(x) = sum(A * x);

julia> x = [1.0, 1.0];

julia> gradient(f, AutoZygote(), x)
2-element Vector{Float64}:
 4.0
 6.0

julia> fill!(A, 1.0)
2×2 Matrix{Float64}:
 1.0  1.0
 1.0  1.0

julia> gradient(f, AutoZygote(), x)
2-element Vector{Float64}:
 2.0
 2.0

or in pure Zygote:

julia> A = [1.0 2.0; 3.0 4.0]
2×2 Matrix{Float64}:
 1.0  2.0
 3.0  4.0

julia> y, pb = Zygote.pullback(f, x)
(10.0, Zygote.var"#75#76"{Zygote.Pullback{Tuple{typeof(f), Vector{Float64}}, Tuple{Zygote.ZBack{ChainRules.var"#times_pullback#1476"{Matrix{Float64}, Vector{Float64}}}, Zygote.var"#2989#back#768"{Zygote.var"#762#766"{Vector{Float64}}}, Zygote.var"#1986#back#194"{Zygote.var"#190#193"{Zygote.Context{false}, GlobalRef, Matrix{Float64}}}}}}((f)))

julia> pb(1.0) # evaluate pullback function to compute gradient
([4.0, 6.0],)

julia> fill!(A, 1.0)
2×2 Matrix{Float64}:
 1.0  1.0
 1.0  1.0

julia> pb(1.0)
([2.0, 2.0],)

timholy (Author) commented May 10, 2024

Yep. There's an obvious hack to get around this (copy λ into a constant buffer), but of course that introduces new concerns about thread-safety.
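
For the record, that hack looks roughly like this (a sketch; λbuf is shared mutable state, hence the thread-safety concern, and while it works with the ForwardDiff setup above, which is what the H3 check shows, nothing in the contract being discussed guarantees it):

const λbuf = zeros(length(λ))              # one shared buffer, captured once
Lbuf = x -> f(x) - λbuf' * c(x)
extras_buf = prepare_hessian(Lbuf, backend, x)

copyto!(λbuf, λ)                           # update the multipliers in place
H = hessian(Lbuf, backend, x, extras_buf)  # the prepared closure sees the updated λbuf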

adrhill (Collaborator) commented May 10, 2024

Unfortunately, there is nothing DI can do about this when reverse-mode AD backends use mutable Wengert lists.

timholy (Author) commented May 10, 2024

TBH this is the motivating issue for #206. I suspect that constrained optimization is far and away the most common application of second-order expansions of vector-valued functions, so a good alternative would be to address that issue.

gdalle (Owner) commented May 10, 2024

If λ changes at each iteration, the only solution I see is to make it part of the variable x. Allowing kwargs would not be supported by every backend, far from it.
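
A sketch of that, reusing the names from above (n = length(x); wasteful because the λ-λ block of the Hessian is identically zero):

n = length(x)
Lfull(xλ) = f(xλ[1:n]) - xλ[n+1:end]' * c(xλ[1:n])

xλ = vcat(x, λ)
extras_full = prepare_hessian(Lfull, backend, xλ)
Hfull = hessian(Lfull, backend, xλ, extras_full)
Hxx = Hfull[1:n, 1:n]                      # the ∇²ₓₓL block

Since λ now lives in the differentiated input, extras_full stays valid as λ changes.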

timholy (Author) commented May 10, 2024

That might be workable. What you genuinely need is $\nabla^2_{x,\lambda} L$ (see https://www.math.uci.edu/~qnie/Publications/NumericalOptimization.pdf#page=551), so you don't actually need keywords. But it's very wasteful to compute the second derivatives in $\lambda$, because $L$ is linear in $\lambda$.
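
For reference, with $L(x,\lambda) = f(x) - \lambda^\top c(x)$ the full Hessian has the block structure

$$\nabla^2_{(x,\lambda)} L = \begin{bmatrix} \nabla^2 f(x) - \sum_i \lambda_i \nabla^2 c_i(x) & -J_c(x)^\top \\ -J_c(x) & 0 \end{bmatrix},$$

so the only $\lambda$-dependence is linear (upper-left block), and the $\lambda\lambda$ block is identically zero.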

#206 really is more fundamental.
