# Indexing and other woes #16

Closed
opened this issue Apr 30, 2018 · 21 comments


### cortner commented Apr 30, 2018

I am opening this to ask for help with some issues that I have before I can go back to the Jacobians (#15). Here is a very simple example of the kind of function I am trying to differentiate:

```julia
f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
```

(I put in the brackets so I don't have to declare new differentiation rules here; normally I would do that.) My first problem is:

```julia
using StaticArrays
import XGrad

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
XGrad.xdiff(f, x=r)   # or x = rand(6)
```

which gives

```
ERROR: AssertionError: Currently only indexing of tuples is supported, but got Array{Float64,1}
```

Is there a work-around?
Owner

### dfdx commented Apr 30, 2018

Yeah, indexing is a relatively new part of XGrad, so some woes are expected. For normal arrays (e.g. `x = rand(6)`) indexing already works on the `jacobian` branch:

```julia
f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
df = xdiff(f, x=rand(6))
df(rand(6))
# ==> (1.6760141328514941, [0.893547, 1.17716, 1.3425, 1.32344, 1.17759, 0.896482])
```

Static arrays aren't that easy, though: the current implementation of `getindex`'s gradient involves `setindex!`, which is not supported for static arrays, so we need to make it a special case. Technically it's easy, but it also brings up a number of design questions (e.g. should we depend on `StaticArrays`?).
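(Editor's note: the immutability issue mentioned above can be seen directly; this is a minimal illustration, assuming `StaticArrays` is installed, and is not part of XGrad itself.)

```julia
using StaticArrays

v = rand(6)           # ordinary Vector: mutable
v[1] = 0.0            # fine, setindex! is defined

s = @SVector rand(6)  # SVector: immutable
try
    s[1] = 0.0        # setindex! is not supported for static arrays
catch err
    println("mutation failed: ", typeof(err))
end
```

This is why a gradient rule that writes into a zero buffer with `setindex!` cannot be reused verbatim for `SVector`.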
Owner

### dfdx commented Apr 30, 2018

Just pushed another commit adding experimental support for `StaticArrays`. Now the following works as well (again, on the `jacobian` branch):

```julia
using StaticArrays
using XGrad

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
df(r)
```

`codegen=VectorCodeGen()` is optional in this specific case, but other code may fail on static arrays without it. The difference between `VectorCodeGen()` and the default `BufCodeGen()` is that the latter uses a number of optimizations, including memory pre-allocation and in-place operations, which are not available for `StaticArrays`. Later we may add a separate codegen specifically for static arrays so you don't have to specify it explicitly, but let's first make sure that this type of array is amenable to full differentiation at all.
Author

### cortner commented Apr 30, 2018

I see. Since I have only a few variables, would it then be better to just write them out explicitly?
Owner

### dfdx commented Apr 30, 2018 • edited

What do you mean by writing the variables explicitly?

UPD. I just realized you use static arrays to group several variables. In this case you may want to use structs instead, e.g.:

```julia
julia> using XGrad

julia> struct S
           a::Float64
           b::Float64
           c::Float64
       end

julia> f(s::S) = s.a + s.b * s.c
f (generic function with 1 method)

julia> df = xdiff(f, s=S(1, 2, 3))
f_deriv_666 (generic function with 2 methods)

julia> df(S(3, 2, 1))
(5.0, S(1.0, 1.0, 2.0))
```

The result is another struct, with each field representing the derivative w.r.t. the corresponding field of the input struct. I use this feature to pass ML models, which normally have 5-20 parameters, so passing all of them around individually would be quite tedious.
Author

### cortner commented Apr 30, 2018 • edited

EDIT: this looks like a problem at my end, maybe ignore it for the moment.
EDIT: I get the error message below consistently.

Hm, if I run the example

```julia
using StaticArrays
using XGrad

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
```

I get

```
julia> df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
ERROR: UndefVarError: tmp677 not defined
Stacktrace:
 [1] eval(::Module, ::Any) at ./boot.jl:235
 [2] (::XGrad.##21#22{Espresso.ExGraph})(::Symbol) at ./:0
 [3] collect(::Base.Generator{Array{Symbol,1},XGrad.##21#22{Espresso.ExGraph}}) at ./array.jl:475
 [4] rev_step!(::Espresso.ExGraph, ::Espresso.ExGraph, ::Espresso.ExNode{:call}) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:166
 [5] reverse_pass!(::Espresso.ExGraph) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:230
 [6] _xdiff(::Espresso.ExGraph) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:239
 [7] #xdiff#29(::Dict{Any,Any}, ::Array{Any,1}, ::Function, ::Expr) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:277
 [8] (::XGrad.#kw##xdiff)(::Array{Any,1}, ::XGrad.#xdiff, ::Expr) at ./:0
 [9] #xdiff#32(::Dict{Symbol,Espresso.VectorCodeGen}, ::Array{Any,1}, ::Function, ::Function) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:307
 [10] (::XGrad.#kw##xdiff)(::Array{Any,1}, ::XGrad.#xdiff, ::Function) at ./:0
```

This is with Julia v0.6.2:

```
julia> Pkg.status("StaticArrays")
 - StaticArrays    0.7.0
julia> Pkg.status("XGrad")
 - XGrad           0.1.0+    jacobian
```
Author

### cortner commented Apr 30, 2018

To be honest, I'd rather avoid the struct trick; it is actually much more natural to work with SVectors. But I may come back to it if nothing else works.
Owner

### dfdx commented Apr 30, 2018

Oops, I forgot to push the changes in Espresso.jl. Please check out its latest master. I also recommend adding at least one empty line before a function to be differentiated, as described in the Code Discovery section of the docs. Normally it's not needed, but extracting the sources of functions in Julia is quite fragile, so it's better to be on the safe side.
Author

### cortner commented May 1, 2018

This all works now, thank you! But it is strangely slow, due to many allocations?

```julia
@btime f($r)
@btime df($r)
#  2.265 ns (0 allocations: 0 bytes)
#  21.801 μs (283 allocations: 11.61 KiB)
```
Owner

### dfdx commented May 1, 2018

Yes, and also due to lots of unnecessary work. Let me explain in a bit more detail. In most ML applications (which I originally designed XGrad for) you only call `getindex` once, e.g.:

```julia
A = ...
loss = A[1]
```

The derivative of the scalar loss w.r.t. the array `A` in this case is (roughly):

```julia
dloss!dA = zeros(size(A))
dloss!dA[1] = loss
```

So all but the first element of `dloss!dA` are zero. However, if the loss (i.e. the output variable) is constructed from several elements of `A`, the generated code becomes quite suboptimal:

```julia
A = ...
loss = A[1] + A[2] * A[3]

dloss!dA__1 = zeros(size(A))
dloss!dA__1[1] = ...
dloss!dA__2 = zeros(size(A))
dloss!dA__2[2] = ...
dloss!dA__3 = zeros(size(A))
dloss!dA__3[3] = ...
# sum the individual derivatives to obtain the whole dloss!dA
dloss!dA = dloss!dA__1 .+ dloss!dA__2 .+ dloss!dA__3
```

So for 3 components of `A` we first allocate 3 zero arrays, fill in only one element of each, and then combine the arrays together. Needless to say, this is highly suboptimal, especially for static arrays. One way to overcome it is to use plain tuples instead, which generate code similar to this:

```julia
A = ...
loss = A[1] + A[2] * A[3]

dloss!dA__1 = ...
dloss!dA__2 = ...
dloss!dA__3 = ...
dloss!dA = (dloss!dA__1, dloss!dA__2, dloss!dA__3)
```

and run significantly faster than the static-array version:

```julia
using XGrad
using BenchmarkTools

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = tuple(rand(6)...)
df = XGrad.xdiff(f; x=r)

@btime f($r)
@btime df($r)
# 2.026 ns (0 allocations: 0 bytes)
# 131.121 ns (8 allocations: 864 bytes)
```

So this is ~170x faster than the version with static arrays, although still 65x slower than the original function.
With a bit of semi-manual optimization I can get the following expression:

```julia
function df_manual(x)
    tmp917 = 3
    dtmp946!dtmp927 = 1.0
    tmp920 = 1
    tmp931 = 4
    tmp932 = x[tmp931]
    tmp918 = x[tmp917]
    tmp925 = 5
    tmp926 = x[tmp925]
    tmp929 = 2
    tmp930 = x[tmp929]
    tmp933 = tmp930 * tmp932
    tmp913 = x[tmp920]
    tmp934 = 6
    tmp935 = x[tmp934]
    tmp936 = tmp933 * tmp935
    dtmp946!dtmp916 = tmp918 * dtmp946!dtmp927
    dtmp946!dtmp915 = tmp913 * dtmp946!dtmp916
    dtmp946!dtmp935 = tmp933 * dtmp946!dtmp927
    dtmp946!dtmp913 = tmp930 * dtmp946!dtmp916
    tmp916 = tmp913 * tmp930
    dtmp946!dtmp918 = tmp916 * dtmp946!dtmp927
    tmp919 = tmp916 * tmp918
    tmp924 = tmp913 * tmp932
    tmp927 = tmp924 * tmp926
    tmp928 = tmp919 + tmp927
    dtmp946!dtmp926 = tmp924 * dtmp946!dtmp927
    dtmp946!dtmp924 = tmp926 * dtmp946!dtmp927
    dtmp946!dtmp923 = tmp913 * dtmp946!dtmp924
    dtmp946!dx = (dtmp946!dtmp913, dtmp946!dtmp915, dtmp946!dtmp918,
                  dtmp946!dtmp923, dtmp946!dtmp926, dtmp946!dtmp935)
    tmp941 = tmp918 * tmp926
    tmp944 = tmp941 * tmp935
    tmp945 = tmp936 + tmp944
    tmp946 = tmp928 + tmp945
    tmp964 = (tmp946, dtmp946!dx)
end
```

which runs in:

```julia
@btime df_manual($r)
# 9.744 ns (0 allocations: 0 bytes)
```

I think I can make it work that fast automatically for tuples. However, I'm really not sure about static arrays: they sit somewhere in between normal arrays and tuples, and I don't yet see the best way to handle them. Does it make sense to you to use tuples instead of static arrays?
Author

### cortner commented May 1, 2018 • edited

Hi, for my application I am perfectly happy with tuples; I'll give this a try. You probably know this, but static arrays are basically just tuples: `x.data` is a tuple. Thanks for putting so much effort into this. I will try it for a while and report back if I run into serious problems.
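(Editor's note: the SVector-to-tuple relationship mentioned above makes the switch cheap in both directions; a small sketch, assuming `StaticArrays` is installed.)

```julia
using StaticArrays

s = SVector(1.0, 2.0, 3.0)
t = s.data               # the underlying NTuple{3,Float64}
@assert t === (1.0, 2.0, 3.0)

s2 = SVector(t)          # an SVector can be rebuilt directly from a tuple
@assert s2 === s
```

So converting inputs with `tuple(r...)` (as in the benchmarks above) and wrapping results back into an `SVector` carries essentially no runtime cost.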
Owner

### dfdx commented May 1, 2018 • edited

> You probably know this, but static arrays are basically just tuples: `x.data` is a tuple.

Yes, from the implementation point of view, but not from the point of view of code generation in XGrad, as you can see :) BTW, please update the `jacobian` branch; I've just fixed some code for tuples there.
Author

### cortner commented May 1, 2018

Seems to be working quite well so far. And if this last performance problem is really fixable, that would be great. Anything I can help with?
Author

### cortner commented May 1, 2018 • edited

I notice that the exponents in expressions like `x[n]^p` are all turned into variables. Doesn't this mean that you will call `pow` instead of doing a fast integer exponentiation? It isn't really a problem for me, since I resolve the exponentiation manually; I was just wondering.
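(Editor's note: for context on the `pow` question above, Julia lowers a call with a *literal* integer exponent to `Base.literal_pow`, which the compiler can unroll into repeated multiplication; once the exponent becomes a runtime variable, that rewrite no longer applies. A small illustration; whether the timing difference is measurable depends on the Julia version.)

```julia
# literal exponent: lowered to Base.literal_pow(^, x, Val(3)),
# which the compiler can turn into x*x*x
f_lit(x) = x^3

# variable exponent: a generic integer power routine is called at runtime
f_var(x, p) = x^p

@assert f_lit(2.0) == 8.0
@assert f_var(2.0, 3) == 8.0
```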
Author

### cortner commented May 1, 2018

Here is an observation I thought was interesting:

```julia
function ff(x)
    x1 = x[1]
    x2 = x[2]
    x3 = x[3]
    x4 = x[4]
    x5 = x[5]
    x6 = x[6]
    ((x1*x2)*x3 + (x1*x4)*x5) + ((x2*x4)*x6 + (x3*x5)*x6)
end

dff = XGrad.xdiff(ff, x=tuple(r...))
@btime dff($(tuple(r...)))
```

This times at ca. 200ms instead of 380ms (original).
Author

### cortner commented May 1, 2018

I don't mind at all converting my functions like that, if it would help. So for me it's now a matter of finding out where the allocation comes from?
Owner

### dfdx commented May 1, 2018

> I notice that the exponents in expressions like `x[n]^p` are all turned into variables. Does this not mean that you will call `pow` instead of doing a fast integer exponentiation?

The Julia compiler seems to be smart enough to handle both cases efficiently; at least I haven't noticed any difference in my experiments. If your results differ, please report and I'll incorporate a fix.

> this times at ca. 200ms instead of 380ms (original).

Hm, for me the difference is negligible: 131 vs. 129ns, on both Julia 0.6.0 and 0.6.2. Are your benchmarks reproducible on your machine? Also, did you mean `200ns` instead of `200ms` (ns vs. ms)?

> I don't mind at all converting my functions like that if it would help. So it's now a matter for me of finding out where the allocation is coming from?

It turns out most of this time comes from an unnecessary optional parameter in the generated function (recall that XGrad was designed for ML tasks, where constant factors like optional parameters don't really matter). To overcome it, I've added a (perhaps temporary) flag for skipping this optional argument. Please check out the `jacobian` branch and run:

```julia
ctx = Dict{Any,Any}(:nomem => true)
df = XGrad.xdiff(f, ctx=ctx, x=tuple(r...))
@btime df($r)
# 22.954 ns (2 allocations: 128 bytes)
```
Author

### cortner commented May 1, 2018 • edited

I get consistent timings: `32.384 ns (2 allocations: 128 bytes)`; just a slower machine, I guess. Fantastic! I do run into trouble with a bigger system, though. Let me double-check that and then post it here.
Author

### cortner commented May 1, 2018

So for scalar outputs, I think this is already very nice. Do you want to keep the issue open to discuss the last allocation that still hurts performance a bit?
Owner

### dfdx commented May 1, 2018

Yeah, let's keep it open for now. I also don't want to merge this into master until a more future-proof way to specify code-generation options (e.g. `:nomem`) is found.
Author

### cortner commented May 1, 2018

Thank you for your efforts. I'll report back here if I find anything else.
Owner

### dfdx commented May 16, 2018

I've just got back to this issue and found out that the remaining 2x difference came from incorrect measurements. When I changed this:

```julia
@btime df($r)
```

to this (interpolating the function itself):

```julia
@btime $df($r)
```

I got the same 9.7 ns as in the completely manual version. Closing this issue and moving on to the Jacobians :)