Investigate possible optimizations in PassiveLogic internal Swift AD benchmark #69967

Open · jkshtj opened this issue Nov 17, 2023 · 10 comments
Labels: bug, triage needed

jkshtj (Contributor) commented Nov 17, 2023

We want to investigate possible optimizations in one of our key internal benchmarks for Swift AD, through a combination of inlining and closure specialization. We single out these two optimizations because they can eliminate the memory allocations Swift AD makes to create pullback closures.

The following code is representative of the structure of the benchmark. Apart from the code shown, the functions can be assumed to be the same size, i.e., think of // ... as representing a constant number of inline differentiable operations. No control flow is involved.

import _Differentiation

@differentiable(reverse)
func CR(...) -> Float
{
    // ...
}

@differentiable(reverse)
func CL(...) -> Float
{
    let r = CR(a: a, b: b, c: c)
    // ...
}

@differentiable(reverse)
func UQ(...) -> Float
{
    // ...
}

@differentiable(reverse)
func UB(...) -> Float
{
    // ...
}

@differentiable(reverse)
func US(...) -> Float
{
    // ...
}

@differentiable(reverse)
@inlinable func abs(...) -> Float {
    // ...
}

func LC(...) -> Float {
    // ...
    return abs(diff)
}

@differentiable(reverse)
func SM(params: Params) -> Float {
    // ...
    let x = US(...)
    
    let y = UQ(...)
    
    let z = CL(...)
    
    let f = UQ(...)
    
    let g = UB(...)

    // ...
}

@differentiable(reverse)
func FP(params: Params) -> Float {
    let a = SM(params: params)
    let b = LC(...)
    // ...
}

func foo() -> Float {
    let params = Params(...)
    let (_, pb) = valueWithPullback(at: params, of: FP)
    let grad = pb(1)
    // ...
}
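For context on where the allocations come from, here is a minimal, self-contained example (separate from the benchmark) of how reverse-mode AD materializes a pullback closure; the values the pullback closes over live in a heap-allocated closure context:

import _Differentiation

@differentiable(reverse)
func square(_ x: Float) -> Float {
    return x * x
}

// The pullback returned here is a closure that has captured x.
let (value, pullback) = valueWithPullback(at: 3, of: square)
let gradient = pullback(1)   // 6.0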

Using the existing compiler optimizations, the top-level function foo ends up looking something like:

func foo() -> Float {
    // ...
    // Reverse-mode derivative and pullback of FP are fully inlined
    // Reverse-mode derivative and pullback of LC are fully inlined


    let (_, SM_PB) = SM_VJP() // reverse-mode derivative of SM
    let grad = SM_PB(1) // pullback of SM
    // ...
}

What we want instead is one of the following outcomes.

  1. The VJP and pullback of SM are fully inlined into foo. This way, the partial_applys of the intermediate pullbacks in SM_VJP should get constant-folded into the corresponding applys of the pullbacks in SM_PB.
    1. Based on some experimentation, inlining benefits of 70 (currently 30) for VJP-like functions returning closures and 170 (currently 70) for PB-like functions receiving closures achieve this goal.
    2. However, simply tweaking the cost-benefit analysis might be an over-fitted solution that does not work in the general case.

  2. SM_VJP is fully inlined into foo. This way, even if we cannot inline SM_PB into foo, we can specialize it to take the values closed over by the intermediate pullback closures, instead of the intermediate pullback closures themselves (see the sketch after this list).
    1. This should be a more generally useful optimization that works on a larger number of cases. It can be enabled by modifying the existing closure-specialization optimization to handle call sites taking multiple closures.
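As a rough source-level illustration of outcome (2), here is a minimal sketch. All names are hypothetical, and the actual transformation operates on SIL partial_apply/apply instructions rather than on Swift source:

// Unspecialized shape: the pullback receives the intermediate pullbacks as
// closures, each requiring its own heap-allocated context.
func pbUnspecialized(_ seed: Float,
                     _ pb1: (Float) -> Float,
                     _ pb2: (Float) -> Float) -> Float {
    return pb2(pb1(seed))
}

// Specialized shape: the pullback receives the values the intermediate
// pullbacks closed over, so their closure contexts are never allocated.
func pbSpecialized(_ seed: Float,
                   _ capture1: Float,                 // value pb1 had closed over
                   _ capture2: Float) -> Float {      // value pb2 had closed over
    return (seed * capture1) * capture2               // pullback bodies applied directly
}

The specialized shape is what the closure-specialization pass would need to produce for a call site that takes multiple closure arguments.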

What's clear?

  1. Modifying the existing closure-specialization optimization to handle call sites taking multiple closures. We have confirmed that there is no technical blocker for this change.

What's unclear?

  1. How much can we tweak the existing cost-benefit analysis? Can we start out by tweaking it to the values we need to achieve outcome (2) specified above (or even outcome (1))?
  2. If we cannot, what other options do we have to maximally get rid of the pullback-closure allocations?
jkshtj added the bug and triage needed labels Nov 17, 2023
jkshtj (Contributor, Author) commented Nov 17, 2023

@BradLarson @asl Could you guys take a look at this?

asl (Contributor) commented Nov 17, 2023

It would be helpful if the test case could be self-contained.

asl (Contributor) commented Nov 17, 2023

But judging from the rough sketch presented here: I wouldn't worry about whether the top-level (SM) VJP and pullback are inlined or not. We actually do not need this. What is important are the internals of SM:

  • Are nested VJPs inlined into the VJP of SM?
  • Is the returned value of SM's VJP (the pullback) properly specialized?
    And if not – what prevents the optimization of SM_VJP itself?

asl (Contributor) commented Nov 17, 2023

In any case, it seems you're looking at this from the wrong direction. We do not want SM to be inlined in the first place; only if it's beneficial/profitable. For now it does not look profitable, as the function itself is huge. What we need is to optimize SM by itself (it may certainly turn out that inlining becomes profitable afterwards, but not vice versa).

asl (Contributor) commented Nov 17, 2023

In particular, it seems we can think about the following optimization: consider a partial_apply whose argument is another partial_apply. Then we can specialize the callee so that the inner partial_apply can be eliminated; the outer one will capture the arguments of the inner ones. This way, instead of lots of small context memory allocations/releases, we'd simply have a single larger one.

Essentially, this would look as if the autodiff code had been run on the function with its nested function calls inlined.
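A minimal source-level sketch of this idea, with hypothetical names (on SIL this would operate on partial_apply instructions rather than on Swift closures):

// Before: the outer closure captures an inner closure, so forming it costs
// two context allocations (one per partial_apply).
func makeInner(_ a: Float) -> (Float) -> Float {
    return { x in x * a }        // inner partial_apply: captures a
}
func makeOuter(_ inner: @escaping (Float) -> Float, _ b: Float) -> (Float) -> Float {
    return { x in inner(x) + b } // outer partial_apply: captures inner and b
}

// After specializing makeOuter against makeInner: the outer closure captures
// a and b directly, in a single (slightly larger) context allocation.
func makeOuterSpecialized(_ a: Float, _ b: Float) -> (Float) -> Float {
    return { x in x * a + b }
}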

jkshtj (Contributor, Author) commented Nov 17, 2023

Based on an offline discussion with @asl, we have the following work items.

  1. Extend the existing closure-specialization optimization to work not just on closures that are explicitly called via applys, but also on closures that are returned from other functions (see the sketch after this list).
    1. This is similar to the "top-level" closure idea we have discussed in the design for the AD-specific closure-specialization optimization.

  2. Modify (if need be) the inlining cost-benefit analysis to award a larger benefit to VJPs whose callers are other VJPs.
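A rough sketch of work item (1) at the source level, with hypothetical names (the actual pass works on SIL; note that every VJP returns its pullback as a closure in exactly this shape):

// A VJP-like function: returns its pullback as a closure that captures scale.
func vjpLike(_ x: Float, _ scale: Float) -> (value: Float, pullback: (Float) -> Float) {
    return (x * scale, { v in v * scale })
}

// After specialization, the captured value is returned explicitly instead of
// being boxed into a closure context, and the pullback becomes a plain function.
func vjpLikeSpecialized(_ x: Float, _ scale: Float) -> (value: Float, pullbackCapture: Float) {
    return (x * scale, scale)
}
func specializedPullback(_ v: Float, _ capture: Float) -> Float {
    return v * capture
}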

jkshtj (Contributor, Author) commented Nov 28, 2023

@asl here's the full code for the benchmark we have been using internally.

import _Differentiation
import Foundation

let dt: Float = 0.1
let π = Float.pi

struct SP: Differentiable {
    var t: TT = TT()
    var s: ST = ST()
    var q: QT = QT()
    var tn: TnT = TnT()
    var stt: Float
}

struct TT: Differentiable
{
    //@noDerivative
    var sp: Float = 0.50292    
    //@noDerivative
    var dm: Float = 0.019      
    //@noDerivative
    var tk: Float = 0.001588   
    //@noDerivative
    var rt: Float = 2.43       
}

struct ST: Differentiable
{
    var tp: Float = 21.1111111       
    //@noDerivative
    var ar: Float = 100.0            
    //@noDerivative
    var Cp: Float = 0.2
    //@noDerivative
    var dt: Float = 2242.58       
    //@noDerivative
    var tk: Float = 0.101       
    
    var nuuu: Float = 10.0
    var nuuuu: Float = 10.0
}

struct QT: Differentiable
{
    var pw: Float = 0.0             
    
    var tp: Float = 60.0             
    var fw: Float = 0.0006309      
    var dt: Float = 1000.0        
    var Cp: Float = 4180.0        
    
    
    var nuuu: Float = 10.0
    var nuuuu: Float = 10.0
    var nuuuuu: Float = 10.0
}

struct TnT: Differentiable
{
    var tp: Float = 70.0
    //@noDerivative
    var vl: Float = 0.0757082
    //@noDerivative
    var Cp: Float = 4180.000
    //@noDerivative
    var dt: Float = 1000.000
    //@noDerivative
    var ms: Float = 75.708
    
    var nuuu: Float = 10.0
    var nuuuu: Float = 10.0
    var nuuuuu: Float = 10.0
    var nuuuuuu: Float = 10.0
}

////////////// Computations  ///////////////////////////////////////////////////////
@differentiable(reverse)
func CRT(fr: ST, tu: TT, qt: QT) -> Float
{
    let gc: Float = 10.0
    let ts = (fr.ar/tu.sp) * π * tu.dm
    let rb = tu.rt * tu.tk / ts
    let rc = rb * gc
    return rc
}

struct QP: Differentiable {
    var qt: QT
    var pw: Float
}

@differentiable(reverse)
func CLP(f: ST, t: TT, qt: QT) -> QP
{
    let rb = CRT(fr: f, tu: t, qt: qt)
    
    let ct: Float = 1/rb
    let d = f.tp - qt.tp
    let pw = d * ct
    
    var u = qt
    u.pw = pw
    let l = -pw
    
    return QP(qt: u, pw: l)
}

@differentiable(reverse)
func UQ(q: QT) -> QT
{
    let w = (q.fw * dt)
    let m = (w * q.dt)
    let e = q.pw * dt
    let t = e / q.Cp / m
    var u = q
    u.tp = q.tp + t
    
    u.pw = 0
    return u
}

@differentiable(reverse)
func UB(pw: Float, f: ST) -> ST
{
    var u = f
    
    let v = f.ar * f.tk
    let m = v * f.dt
    
    u.tp = f.tp + ((pw * dt) / f.Cp / m)
    return u
}

struct TQ: Differentiable {
    var t: TnT
    var q: QT
}

@differentiable(reverse)
func UST(s: TnT, q: QT) -> TQ
{
    var u = s
    var uq = q
    
    let m = q.fw * q.dt
    let dt = s.tp - q.tp
    let p = dt * m * q.Cp
    
    uq.pw = p
    
    let tm = s.vl * s.dt
    let r = (p * dt) / s.Cp / tm
    u.tp = s.tp + r
    
    return TQ(t: u, q: uq)
}

//-----------------------------------------------------------------------

@differentiable(reverse)
@inlinable func absDifferentiable(_ value: Float) -> Float {
    if value < 0 {
        return -value
    }
    return value
}

func LC(p: Float, gt: Float) -> Float {
    let diff = p - gt
    return absDifferentiable(diff)
}

@differentiable(reverse)
func SM(s: SP) -> Float {
    let pt = s.t
    var sb = s.s
    var tn = s.tn
    var q = s.q
    
    sb.tp = s.stt //.scalarized
    let taq = UST(s: tn, q: q)
    tn = taq.t
    q = taq.q
    
    q = UQ(q: q)
    
    let p = CLP(f: sb, t: pt, qt: q)
    q = p.qt
    let b = p.pw
    q = UQ(q: q)
    
    sb = UB(pw: b, f: sb)

    return sb.tp
}

@differentiable(reverse)
func FP(s: SP) -> Float {
    let p = SM(s: s)
    let r = LC(p: p, gt: 27.344767)
    return r
}

@inline(never)
@_silgen_name("foo")
func foo() -> SP.TangentVector {
    let s = SP(stt: 33.3)
    let (_, pb) = valueWithPullback(at: s, of: FP)
    let grad = pb(1)
    return grad
}

jkshtj (Contributor, Author) commented Nov 28, 2023

As we discussed, I modified the current closure-specialization optimization to handle "returned" closures, and the performance of the benchmark has improved: the reverse-to-forward ratio has been cut in half (a little more, actually). I think that with better placement of the optimization in the pipeline, performance can be improved further.

asl (Contributor) commented Nov 28, 2023

@jkshtj do you have a PR to look at?

jkshtj (Contributor, Author) commented Nov 28, 2023

I haven't been able to send out a PR yet, but I pushed the changes to my fork of the Swift repo, here.

Please note that I added the optimization in a new pass just for prototyping purposes. This is not what I intend to do in the final change.
