# INTRODUCTION

This notebook takes what was learned from the MGO and applies it to the ProAnti model. The first step is to find parameters equivalent to Marino's that work with a large tau and dt.

Currently (20-July-2017), the work instead starts from Marino's model_gradient_kwargs_fast.jl. See "Marino Fast Model.ipynb".

# ==========


# Preliminaries

In [None]:
using PyCall
using PyPlot
using ForwardDiff
using DiffBase

pygui(true)

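import Base.convert
# Strip the derivative information from a ForwardDiff Dual, keeping only its value.
convert(::Type{Float64}, x::ForwardDiff.Dual) = Float64(x.value)
# Elementwise version for arrays of Duals. The first argument must be the target
# *type* (not an array instance), matching the scalar method above.
function convert(::Type{Array{Float64}}, x::Array{<:ForwardDiff.Dual})
    y = zeros(size(x))
    for i in 1:length(x)
        y[i] = convert(Float64, x[i])
    end
    return y
end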

include("hessian_utils.jl")

"""
We define functions to convert Duals, the variable type used by ForwardDiff, 
to Floats. This is useful when we want to print the value of a variable 
(since print doesn't know how to handle Duals). Note that once a value has been 
converted to a Float, ForwardDiff can no longer differentiate through it!  e.g. after
    x = convert(Float64, y)
ForwardDiff can still differentiate y, but it can't differentiate x
"""





# The main model dynamics function

Note that the documentation below lists default values for the optional parameters that are out of date; they still need to be updated to match the defaults in the function signature. The actual defaults are much closer to those of Marino's May 2017 model.
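
For reference, the per-time-step update that the code below implements can be written out as follows. The symbols $D_i$ (the quantity the code calls `dUdt`), $I_i$ (the period-dependent inputs plus `const_pro_bias`) and $\xi_i$ (a vector of four independent standard normal draws) are notation introduced here for illustration; they are not names used in the notebook.

```latex
\begin{aligned}
V_i &= \tfrac{1}{2}\tanh\!\left(\frac{U_i-\theta}{\beta}\right) + \tfrac{1}{2} \\[4pt]
\texttt{marino\_tau = false:}\quad
D_i &= E + W V_{i-1} + g_{\mathrm{leak}}\,(U_{\mathrm{rest}} - U_{i-1}) + I_i, \\
U_i &= U_{i-1} + \frac{dt}{\tau}\,D_i + \sigma\sqrt{dt}\;\xi_i \\[4pt]
\texttt{marino\_tau = true:}\quad
D_i &= E + W V_{i-1} + \frac{g_{\mathrm{leak}}}{\tau}\,(U_{\mathrm{rest}} - U_{i-1}) + I_i, \\
U_i &= U_{i-1} + dt\,D_i + \sigma\sqrt{dt}\;\xi_i
\end{aligned}
```

In both cases the noise term is scaled by $\sqrt{dt}$ only and is not divided by $\tau$.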

In [None]:
""" 
function response, t, U, V, W = run_dynamics(trial_type; kw_params)

    Runs the 4-way mutual inhibition model
    
    OBLIGATORY PARAMS:
    ------------------

        trial_type    Must be either "pro" or "anti"


    OPTIONAL PARAMS:
    ----------------
    
        vwi = 1.5       Weight between ProContra and AntiIpsi units (on each side of the brain)
        hwi = 1.5       Weight between ProContra units across the brain; also between AntiContra
        const_pro_bias = 0.03     Extra positive input to the two Pro units
        const_E = 0.19                Constant positive input to all four units
        U_rest = -1     Resting point for U in the absence of other inputs
        g_leak = 0.5    Multiplies (U_rest - U) for the dynamics
        theta = 1       Threshold on U for sigmoidal transform from U to V 
        beta  = 1       Scaling on sigmoid going from U to V:   V = 0.5*tanh((U-theta)/beta) + 0.5
        dt=0.02         Timestep
        sigma=0.1,      added standard deviation on U per unit time
        rule_period = 0.5       in seconds
        delay_period = 0.5      in seconds. Opto will happen during this period
        target_period = 0.1     right_light_input and target_extra_E will happen during this period; 
                                Pro v Anti input will be turned off; 
        post_target_period = 0.5  target_extra_E will still be on, but right_side_input won't
        tau=4.4         Time constant of dynamics, in millisecs
        marino_tau = true   If True, tau applies only to leak term; if False, applies to entire dUdt
        start_U = [-7, -7, -7, -7]
        const_E = 0.15  Constant excitation added to all units
        right_light_input=1     Extra excitation to right side of the brain units during the target period
        right_light_pro_extra   Even further excitation added to pro Right side units during target
        pro_self_ex = 0   Self excitation weight of Pro units
        anti_self_ex = 0  Self excitation weight of Anti units
        pfc_anti_input = 0.05    Input to Anti units during rule and delay periods
        pfc_pro_input = np.nan   Input to Pro units during rule and delay periods (default means same as Anti)
        const_pro_bias = 0       A constant extra input to the Pro units
        target_extra_E = 0.25    Extra excitation added to all units during target and post_target periods
        opto='off'    Whether there is optogenetic-induced scaling of outputs.
            ='on'     Opto will be done during the delay period only
            ='dt'     Opto will be done during the delay plus the target period
           opto_scaling=0.5      Factor by which to scale the weight matrix during opto 
           opto_scale_on_E=1     Factor by which to scale the constant excitation during opto
           opto_conductance=0    How much conductance to add to gleak during opto
           opto_current=0        Added to dUdt at each time step during opto
        unilateral_opto = False If True, then opto_scale_on_E will be forced to 0, and opto_scaling will apply
                                to only one side
        do_plot = True  whether or not to plot the results
        fignum=1        figure on which to plot
        decision_threshold   If |V(Pro_R) - V(Pro_L)| >= this number, a proper answer is produced. 
                The target light is presented to the right, so Right means "pro"


    RETURNS:
    --------

        response     +1 for Pro, -1 for Anti, 0 for undefined 
                        if |V(Pro_R) - V(Pro_L)| < decision_threshold
        t    Time vector
        U   U matrix, size 4-by-length(t). Order is ProContra on right side, AntiIpsi on right
                    side, ProContra on left side, AntiIpsi on left.
        V   V matrix = 0.5*tanh((U-theta)/beta) + 0.5
        W   Weight matrix between units
        
"""
function run_dynamics(trial_type; opto="off", opto_scaling=0.8, opto_scale_on_E=1,
    vwi = 1.5, hwi = 1.5, const_pro_bias=0.03, const_E=0.19,
    opto_conductance = 0, opto_current=0, 
    right_light_pro_extra = 0, right_light_input=1, 
    pro_self_ex = 0.7, anti_self_ex = 0.7, 
    tau=0.1, marino_tau = false, dt=0.02, target_extra_E = 0.25,
    pfc_anti_input = 0.15, pfc_pro_input = 0.15,
    sigma=0.1, start_U = [-1, -1, -1, -1], do_plot = false, fignum=1, plot_V_only=false,
    g_leak = 0.25, U_rest = -1, theta = 1, beta = 1,
    rule_period = 0.1, delay_period = 1, target_period = 0.1,
    post_target_period = 1, decision_threshold = 0.3, nderivs=0, difforder=0)
    

    t = [0 : dt : rule_period + delay_period + target_period + post_target_period;] 

    V = ForwardDiffZeros(4, length(t), nderivs=nderivs, difforder=difforder)
    U = ForwardDiffZeros(4, length(t), nderivs=nderivs, difforder=difforder)

    U[:,1] = start_U
    V[:,1] = 0.5*tanh((U[:,1]-theta)/beta) + 0.5

    W = [pro_self_ex -vwi -hwi 0; -vwi anti_self_ex 0 -hwi; 
        -hwi 0 pro_self_ex -vwi; 0 -hwi -vwi anti_self_ex]

    E = const_E
    
    for i in 2:length(t)  # a plain range works here; collecting it with [ ... ;] is not required
        if marino_tau
            dUdt = E + W * V[:,i-1] + g_leak*(U_rest - U[:,i-1])/tau
        else
            dUdt = E + W * V[:,i-1] + g_leak*(U_rest - U[:,i-1])
        end
    
        if t[i] < rule_period + delay_period
            if trial_type=="anti"
                dUdt[[2,4]] += pfc_anti_input
            elseif trial_type == "pro"
                dUdt[[1,3]] += pfc_pro_input
            else
                error("trial_type must be either \"pro\" or \"anti\"")
            end
            
        elseif t[i] < rule_period + delay_period + target_period
            dUdt[[1,2]] += right_light_input
            dUdt[1]     += right_light_pro_extra
            dUdt        += target_extra_E
        else
            dUdt        += target_extra_E
        end
    
        dUdt[[1,3]] += const_pro_bias
        
        if marino_tau
            U[:,i] = U[:,i-1] +       dt*dUdt + sigma*randn(4)*sqrt(dt)
        else
            U[:,i] = U[:,i-1] + (dt/tau)*dUdt + sigma*randn(4)*sqrt(dt)
        end
    
        V[:,i] = 0.5*tanh((U[:,i]-theta)/beta) + 0.5
    end    

    if do_plot
        figure(fignum); 
        if !plot_V_only
            subplot(3,1,1)
        end
        h = plot(t, V'); 
        setp(h[1], color=[0, 0, 1])
        setp(h[2], color=[1, 0, 0])
        setp(h[3], color=[1, 0.5, 0.5])
        setp(h[4], color=[0, 1, 1])
        ylabel("V")

        ax = gca()
        yl = [ylim()[1], ylim()[2]]
        vlines([rule_period, rule_period+delay_period, 
            rule_period+delay_period+target_period], 
            -0.05, 1.05, linewidth=2)
        if yl[1]<0.02
            yl[1] = -0.02
        end
        if yl[2]>0.98
            yl[2] = 1.02
        end
        ylim(yl)
        grid(true)
        
        if !plot_V_only
            subplot(3,1,2)
            hu = plot(t, U')
            setp(hu[1], color=[0, 0, 1])
            setp(hu[2], color=[1, 0, 0])
            setp(hu[3], color=[1, 0.5, 0.5])
            setp(hu[4], color=[0, 1, 1])
            ylabel("U"); ylim(-100, 100)
            vlines([rule_period, rule_period+delay_period, 
                rule_period+delay_period+target_period], 
                ylim()[1], ylim()[2], linewidth=2)

            grid(true)

            subplot(3,1,3)
            hr = plot(t, V[1,:] - V[3,:])
            ylim([-1.1, 1.1])
            vlines([rule_period, rule_period+delay_period, 
                rule_period+delay_period+target_period], 
                ylim()[1], ylim()[2], linewidth=2)
            ylabel("Pro R - Pro L")
            grid(true)
        end

        xlabel("t"); 
    end
    
#    if V[1,end] - V[3,end] > decision_threshold
#        answer = 1
#    elseif V[1,end] - V[3,end] < -decision_threshold
#        answer = -1
#    else
#        answer = 0
#    end

    answer1 = 0.5 + 0.5*tanh(((V[1,end] - V[3,end]) - decision_threshold)/0.1)
    answer2 = 0.5 + 0.5*tanh(((V[3,end] - V[1,end]) - decision_threshold)/0.1)
    answer  = answer1 - answer2
    
    return answer, t, U, V, W 
end





# Trying to find equivalent parameters with reasonable tau and dt

In [None]:
ntrys = 10



mypars = Dict(:do_plot=>true, :sigma=>0.1, :pfc_anti_input=>0.15, :pfc_pro_input=>0.15,
:tau=>0.1, :dt=>0.02, :const_E=>0.19, :const_pro_bias=>0.03, :vwi=>1.5, :hwi=>1.5, :marino_tau=>false,
:right_light_input=>1, :right_light_pro_extra=>0.1, :target_extra_E=>0.25, 
:start_U=>[-1,-1,-1,-1], :target_period=>0.1, :delay_period=>1, :rule_period=>0.1, :post_target_period=>1,
:g_leak=>0.25, :theta=>1, :beta=>1, :U_rest=>-1, :pro_self_ex=>0.7, :anti_self_ex=>0.7)

# FROM PYTHON:
# "Correct target selection"
#                       right_light_pro_extra = 0, right_light_input=1,
#                       sigma=0.1,
#                       vwi = 1.5, hwi = 1.5, const_E = 0.19, target_extra_E=1,
#                       pfc_anti_input = 0.1,
#                       pro_self_ex=0.7, anti_self_ex=0.7,
#                       start_U = [-1, -1, -1, -1], g_leak = 0.25, theta=1, fignum=fig1)

# Function defaults:
#               opto_conductance = 0, opto_current=0, 
#                right_light_pro_extra = 0.25, right_light_input=1, 
#                vwi = 1.5, hwi = 1.5, pro_self_ex = 0, anti_self_ex = 0, tau=0.1, marino_tau = False,
#                dt=0.02, const_E = 0.15, target_extra_E = 0.25,
#                pfc_anti_input = 0.05, pfc_pro_input = np.nan, const_pro_bias = 0,
#                sigma=0.1,
#                start_U = [-7, -7, -7, -7], do_plot = True,
#                g_leak = 0.5, U_rest = -1,
#                fignum=1, theta = 1, beta = 1,
#                rule_period = 0.5, delay_period = 0.5, target_period = 0.1,
#                post_target_period = 0.5):


figure(3); clf();
for i in [1:ntrys;]
    answer, t, U, V, W = run_dynamics("pro"; fignum=3, do_plot=true)
    # println(answer)
end

figure(4); clf();
for i in [1:ntrys;]
    answer, t, U, V, W = run_dynamics("anti"; fignum=4, do_plot=true)
    # println(answer)
end


In [None]:
ntrys = 1000

function perf(ntrys)
    mypars = Dict(:do_plot=>false, :sigma=>0.1, :pfc_anti_input=>0.15, :pfc_pro_input=>0.15,
    :tau=>0.1, :dt=>0.02, :const_E=>0.19, :const_pro_bias=>0.03, :vwi=>1.5, :hwi=>1.5, :marino_tau=>false,
    :right_light_input=>1, :right_light_pro_extra=>0.0, :target_extra_E=>0.25, 
    :start_U=>[-1,-1,-1,-1], :target_period=>0.1, :delay_period=>1, :rule_period=>0.1, :post_target_period=>1,
    :g_leak=>0.25, :theta=>1, :beta=>1, :U_rest=>-1, :pro_self_ex=>0.7, :anti_self_ex=>0.7)

    results = zeros(2,ntrys)
    for i in [1:ntrys;]
        answer, t, U, V, W = run_dynamics("pro"; fignum=3, mypars...)
        results[1,i] = sign(V[1,end]-V[3,end])

        answer, t, U, V, W = run_dynamics("anti"; fignum=3, mypars...)
        results[2,i] = sign(V[3,end]-V[1,end])
    end
    return results
end

results = @time(perf(ntrys))

@printf "Pro perf = %g , Anti perf = %g\n"  mean(results[1,:]) mean(results[2,:])

# Defining the cost function Jcost()
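
The cost defined in this section combines a soft hit-rate term with a term that rewards a large separation between the two Pro units at the end of each trial. A minimal, self-contained sketch of that combination is given here. The names `soft_cost` and `avg` are hypothetical and used only for illustration, and the function takes the end-of-trial differences V(Pro_R) - V(Pro_L) directly instead of running the dynamics.

```julia
# Sketch only: `soft_cost` and `avg` are hypothetical names, not part of the notebook.
avg(x) = sum(x)/length(x)

# dpro[i] = V(Pro_R) - V(Pro_L) at the end of pro trial i; dant likewise for anti trials.
function soft_cost(dpro, dant; targets=[0.8, 0.7], theta1=0.15, theta2=0.2, beta=0.05)
    hpro  = [0.5*(1 + tanh( d/theta1)) for d in dpro]      # soft "hit" on pro:  wants d > 0
    hant  = [0.5*(1 + tanh(-d/theta1)) for d in dant]      # soft "hit" on anti: wants d < 0
    diffs = [tanh(d/theta2)^2 for d in vcat(dpro, dant)]   # near 1 when the units are well separated

    cost1 = 0.5*(avg(hpro) - targets[1])^2 + 0.5*(avg(hant) - targets[2])^2
    cost2 = -avg(diffs)
    return cost1 + beta*cost2
end

# Fully decided, always-correct trials overshoot both target hit rates,
# so cost1 is nonzero even though cost2 is close to its minimum of -1.
println(soft_cost([1.0], [-1.0]))
```

Because the first term penalizes the squared distance from the target hit rates, performance above target is penalized just as much as performance below it.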

In [None]:
function Jcost(;nderivs=0, difforder=0, ntrials=100, seedrand=NaN, targets=[0.8, 0.7],
    theta1=0.15, theta2=0.2, beta=0.05, verbose=false, do_plot=false, plot_trials=[1], params...)

    if !isnan(seedrand)
        srand(seedrand)
    else  # if the random seed is passed as NaN, use the system time in milliseconds
        srand(convert(Int64, round(1000*time())))
    end
    
    Vpro = ForwardDiffZeros(4, ntrials, nderivs=nderivs, difforder=difforder)
    Vant = ForwardDiffZeros(4, ntrials, nderivs=nderivs, difforder=difforder)

    
    if do_plot; figure(1); clf(); subplot(2,1,1); end
    # run_dynamics returns (answer, t, U, V, W), so V is the 4th output
    for i in 1:ntrials
        plot_this = do_plot && any(plot_trials .== i)
        Vpro[:,i] = run_dynamics("pro"; do_plot=plot_this, plot_V_only=true, fignum=1,
            nderivs=nderivs, difforder=difforder, params...)[4][:,end]
    end
    if do_plot; subplot(2,1,2); end
    for i in 1:ntrials
        plot_this = do_plot && any(plot_trials .== i)
        Vant[:,i] = run_dynamics("anti"; do_plot=plot_this, plot_V_only=true, fignum=1,
            nderivs=nderivs, difforder=difforder, params...)[4][:,end]
    end
    
    hpro = 0.5*(1 + tanh.((Vpro[1,:]  - Vpro[3,:])/theta1))
    hant = 0.5*(1 + tanh.((Vant[3,:]  - Vant[1,:])/theta1))
    hits = [hpro ; hant]
    dpro   = tanh.((Vpro[1,:]  - Vpro[3,:]) /theta2).^2
    dant   = tanh.((Vant[1,:]  - Vant[3,:]) /theta2).^2
    diffs = [dpro ; dant]
        
    cost1 = 0.5*(mean(hpro) - targets[1])^2 + 0.5*(mean(hant) - targets[2])^2
    cost2 = -mean(diffs) 
    
    if verbose
        @printf("        cost1=%g, cost2=%g, mean(hpro)=%.3f, mean(hant)=%.3f, mean(diffs)=%.3f\n", 
            convert(Float64, cost1), beta*convert(Float64, cost2), 
            convert(Float64, mean(hpro)), convert(Float64, mean(hant)), convert(Float64, mean(diffs)))
    end
    
    if do_plot
        figure(2); clf();
        subplot(2,1,1)
        plot(1:2*ntrials, hits, "."); vlines(ntrials+0.5, ylim()[1], ylim()[2]) 
        xlim(-1, ntrials*2+1)
        title("hits")

        subplot(2,1,2)
        plot(1:2*ntrials, diffs, "."); vlines(ntrials+0.5, ylim()[1], ylim()[2]) 
        xlim(-1, ntrials*2+1)
        title("diffs")
    end
    
    return cost1 + beta*cost2, hits, diffs    
end


val, grad, hess = keyword_vgh((;pars...)->Jcost(;do_plot=true, ntrials=100, seedrand=10, pars...)[1], 
    ["const_E", "sigma", "theta1"], [0.19, 0.1, 0.15])

hess

In [None]:
args = ["const_E", "vwi", "hwi", "const_pro_bias", "right_light_input", "sigma"]
goods = [0.19, 1.5, 1.5, 0.03, 1, 0.05]
bbox = [0 3 ;
        0 6 ; 
        0 6 ; 
        0.01 1 ; 
        0.01 4 ; 
        0.005 0.4]

# run_dynamics()
func = (;pars...) -> Jcost(; do_plot=true, theta1=1.5, theta2=2,
ntrials=200, verbose=true, seedrand=3, post_target_period=0.4, pars...)[1]

params, trajectory = bbox_Hessian_keyword_minimization(goods, args, bbox, func, verbose=true, 
start_eta=0.01, tol=1e-12)

In [None]:
figure(1); clf();
func(; do_plot=true, ntrials=200, plot_trials=1:10, verbose=true, make_dict(args, trajectory[3:end,end])...)


In [None]:
trajectory[:,end]'

In [None]:
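"""
kwargs = make_dict(args, x)

Pairs each parameter name in args (strings) with the corresponding value in x and
returns a Dict with Symbol keys, suitable for splatting as keyword arguments.
"""
function make_dict(args, x::Vector)
    kwargs = Dict()
    for i in 1:length(args)
        kwargs[Symbol(args[i])] = x[i]
    end
    return kwargs
end 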


In [None]:
println(args)
trajectory[:, [1 ;end]]

In [None]:
ps = [0.222, 1.543, 1.458, 0.065, 1.043, 0.091]  # these are from training with ntrials=4000 and interrupting kernel. Looked good.
ntrials = 4000

# ps = [0.222, 0.1543, 0.1458, 0.065, 1.043, 0.091]  # Exptl.
# ntrials = 2000

# ps = trajectory[3:end,end]  # these are the old ones, from training with ntrials=100 and getting stuck at sigma=0.133

value, grad = keyword_vgh((;pars...) -> Jcost(; do_plot=true, theta1=0.15, theta2=0.2, 
    ntrials=100, verbose=true, seedrand=3, pars...)[1], 
    ["const_E", "vwi", "hwi", "const_pro_bias", "right_light_input", "sigma"], [ps[1:end-1]; [0.133]])
# [0.19, 1.5, 1.5, 0.03, 1.0, 0.05])

#
#   RIGHT HERE: with theta1=0.15 and theta2=0.2, and ntrials=100,
#          SETTING SIGMA=0.133 above makes gradient minute; 0.15 makes it non-negligible. 
#
#       If  we set theta1=1.5 and theta2=2  training stops at 0.133 because we're doing Newton jumps. A genuine local minimum?
#
#  Adding trials increases odds of some trial not hitting total ceiling, so gradient is non-negligible again
#  at ntrials=4000.  Training with ntrials=4000 was slow, but worked on getting a small sigma to grow to its
#  optimal point. Training was done with theta1=1.5 and theta2=2

@printf "cost=%g\n" value
grad

In [None]:
trajectory[2:end,[1 end]]

In [None]:
args = ["const_E", "vwi", "hwi", "const_pro_bias"]
goods = [0.19, 1.5, 1.5, 0.03]
bbox = [0 3 ;
        0 6 ; 
        0 6 ; 
        0.01 1]


ntrials =  200
max_attempts = 50
figure(4); clf();
nruns = 20
attempts_needed = zeros(1, nruns)
final_costs     = zeros(1, nruns)
randseeds       = zeros(Int64, 1, nruns)
final_params    = zeros(length(args), nruns)

for k in [1:nruns;]            
    local i
    local params
    local trajectory
    
    for i in [1:max_attempts;]                        
        randseeds[k] = convert(Int64, rem(round(time()*1e12), 1e6))
        
        
        func = (;pars...) -> Jcost(;beta=0.01, min_sigma=0.01, max_sigma=20, 
        rulestrength=2, ntrials=ntrials, seedrand=randseeds[k],    
            do_plot=false, plot_trials=[5,6,7,8,9], verbose=false, dt=25, 
            cue_period=200, delay_period=200, response_period=300,  
            pars...)[1]

        seed = rand(length(args),1)[:].*(bbox[:,2]-bbox[:,1]) + bbox[:,1]

        params, trajectory = bbox_Hessian_keyword_minimization(seed, args, bbox, func, hardbox=false,
            verbose=false, tol=1e-12, start_eta=1, wallwidth_factor=0.18);

        @printf "Attempt %d with seed %d ended with cost %g\n" i randseeds[k] trajectory[2,end]

        if trajectory[2,end] <= -0.005
            @printf "        ---success!\n"        
            break    
        end
    end

    
    final_params[:,k] = params
    final_costs[k] = trajectory[2,end]
    attempts_needed[k] = i
    @printf "\n---------- Finished run %d --------\n" k
end

using MAT   # provides matwrite
matwrite("attempt_distribution.mat", Dict("attempts_needed" => attempts_needed, "ntrials"=>ntrials, 
"max_attempts" => max_attempts, "randseeds"=>randseeds, "final_costs"=>final_costs, "final_params"=>final_params,
"args"=>args, "bbox"=>bbox))

figure(4); clf();
ax = gca();
ax[:hist](attempts_needed');
xlabel("no. of attempts needed")
ylabel("no. of runs")

# Old-- run the dynamics with Marino's parameters just to test it

Param values may be off since we changed the defaults in run_dynamics()

In [None]:
ntrys = 4

figure(1); clf();
for i in [1:ntrys;]
    answer, t, U, V, W = run_dynamics("pro"; do_plot=true, sigma=0.4, 
    pfc_anti_input = 1.6, pfc_pro_input = 0.05, fignum=1)
    # println(answer)
end

figure(2); clf();
for i in [1:ntrys;]
    answer, t, U, V, W = run_dynamics("anti"; do_plot=true, sigma=0.4,
    pfc_anti_input = 1.6, pfc_pro_input = 0.05, fignum=2)
    # println(answer)
end


# ========================================
# .
#              Thinking about the cost function  
# .
# ========================================


In [None]:
npoints = 100
data_sigma = 10


function make_data(;npoints=100, data_sigma=10, seedrand=NaN)
    if ~isnan(seedrand)
        srand(seedrand)
    end
    return data_sigma*randn(npoints,1)
end

data1 = make_data(npoints=100, data_sigma=10, seedrand=10);


In [None]:
function J(data1; threshold=0.5, inv_slope=4, theta1 = 0.15, theta2=0.2, beta=0.05,
    do_plot=true, nderivs=0, difforder=0, verbose=true)
    npoints = length(data1)

    d1 = tanh.((data1 .- threshold)./inv_slope)./2 .+ 0.5

    hits = 0.5*(1 .+ tanh.((d1 .- 0.5)./theta1))
    difs = tanh.((d1 .- 0.5)./theta2).^2
    
    if do_plot
        figure(1); clf();
        subplot(3,1,1)
        plot(1:npoints, d1, "b.")
        ylabel("d1"); 
        title(@sprintf("threshold=%.3f inv_slope=%.3f", convert(Float64, threshold), convert(Float64, inv_slope)))
        subplot(3,1,2)
        plot(1:npoints, hits, ".")
        ylabel("hits")
        subplot(3,1,3)
        plot(1:npoints, difs, ".")
        ylabel("difs")        
        title(@sprintf("<hits>=%.3f <difs>=%.3f", convert(Float64, mean(hits)), convert(Float64, mean(difs))))
    end
    cost1 = (mean(hits) - 0.75)^2
    cost2 = -mean(difs) 

    cost = cost1 + beta*cost2

    if verbose
        @printf("        cost1=%g, cost2=%g, mean(hits)=%.4f, mean(difs)=%.4f\n", 
            convert(Float64, cost1), beta*convert(Float64, cost2), 
            convert(Float64, mean(hits)), convert(Float64, mean(difs)))
    end

    return cost
end

J(data1, do_plot=true)


func = (;pars...) -> J(data1; do_plot=true, theta1=0.15, theta2=0.2, pars...)

val, grad, hess = keyword_vgh(func, ["threshold", "inv_slope"], [0.5, 4])



### Exploring whether beta and the number of points matter

They do. Starting with a very small beta and using lots of points both help the minimization avoid getting stuck.
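
One possible reading of why the number of points matters is that, when inv_slope is small, the sigmoid is saturated for almost every data point, so only points very close to the threshold contribute any gradient. The sketch below illustrates this under stated assumptions: it uses evenly spaced stand-in data instead of the random data of this notebook, it looks only at the first sigmoid (the one that produces d1), and `grid_data` and `mean_abs_grad` are hypothetical helper names.

```julia
# Sketch only: `grid_data` and `mean_abs_grad` are hypothetical helpers, not notebook functions.

# Evenly spaced stand-in for the random data, so that the result is reproducible.
grid_data(n; lo=-30.0, hi=30.0) = [lo + (hi - lo)*(k - 1)/(n - 1) for k in 1:n]

# Mean over data points of |d(d1)/d(threshold)|, where
#   d1 = 0.5*tanh((x - threshold)/inv_slope) + 0.5,
# so that |d(d1)/d(threshold)| = 0.5*sech(z)^2/inv_slope with z = (x - threshold)/inv_slope.
function mean_abs_grad(data; threshold=-4.0, inv_slope=0.01)
    total = 0.0
    for x in data
        z = (x - threshold)/inv_slope
        total += 0.5*sech(z)^2/inv_slope
    end
    return total/length(data)
end

few  = mean_abs_grad(grid_data(101))     # nearest point is many inv_slopes away from the threshold
many = mean_abs_grad(grid_data(10001))   # several points fall inside the narrow transition band
println("few points: ", few, "   many points: ", many)
```

With the coarse grid the gradient is negligible, while with the fine grid it is many orders of magnitude larger, which is consistent with the observation that adding points can free a minimization that is stuck at a small inv_slope.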

In [None]:
# This gets us totally stuck: data1 = make_data(npoints=100, data_sigma=10, seedrand=10); 
# seed = [0.959544, 0.01]  
# args = ["threshold", "inv_slope"]
# beta=0, theta1=0.15, theta2=0.2, bbox = [-20.1 20.1 ; 0.01 20]
#
# Even npoints=1000 got stuck
# Even npoints=10000 got stuck!   At [-3.64425,0.00998397]
#     And making beta=0 removes the stickpoint
#
# Lesson seems to be start with a really small beta -- otherwise the inv_slope parameter gets pushed towards being 
# small way too fast.
#
# Starting with big inv_slope (0.5) but a beta=0.005 worked out with 10,000 points. 
# Doesn't work with 1,000, *does* work with 2500 and a couple of random seeds. 
# Starting with inv_slope of 0.5 or of 4 were both fine.
#
# Now what if we start with a small inv_slope (0.01)? Would a beta of 0.005 still work? Nope, got stuck.
# what about theta1 -- does it matter?

data1 = make_data(npoints=2500, data_sigma=10, seedrand=20);

seed = [-10, 0.2]
args = ["threshold", "inv_slope"]
bbox = [-20.1 20.1 ; 0.01 20]

seed = [-6.779, 3]
seed = [-4, 0.01]

params, traj, cost = bbox_Hessian_keyword_minimization(seed, args, bbox, 
(;pars...)->J(data1;do_plot=true, beta=0.005, theta1=0.15, theta2=0.2, pars...), 
verbose=true, verbose_every=10, start_eta=0.001, tol=1e-15, wallwidth_factor=0.01, maxiter=4000)


### Exploring whether theta matters

In [None]:

data1 = make_data(npoints=100, data_sigma=10, seedrand=10);

args = ["threshold", "slope"]
bbox = [-20.1 20.1 ; 0.01 200]

# This gets stuck at [-4.404, 12.263]
seed = [0.5, 10.1]

params, traj, cost = bbox_Hessian_keyword_minimization(seed, args, bbox, 
(;pars...)->J2(data1;do_plot=true, beta=0.005, theta1=0.15, theta2=0.2, pars...), 
verbose=true, verbose_every=10, start_eta=0.1, tol=1e-9, wallwidth_factor=0.01, maxiter=400)


# Trust region method for Hessian minimization
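
The step used by the minimization in this section replaces the Hessian H with H + I/eta before solving for the jump. A minimal sketch of the two limits of that step on a fixed quadratic cost is given here. The names `identity_matrix` and `damped_newton_step` are hypothetical, and the matrix H and starting point x0 are made-up values chosen only for illustration.

```julia
# Sketch only: `identity_matrix` and `damped_newton_step` are hypothetical helper names.
identity_matrix(n) = [i == j ? 1.0 : 0.0 for i in 1:n, j in 1:n]

# Proposed jump: delta_x = -(H + I/eta) \ grad
function damped_newton_step(hess, grad, eta)
    hathess = hess + identity_matrix(length(grad))/eta
    return -(hathess \ grad)
end

# Quadratic cost 0.5*x'*H*x with its minimum at the origin (illustrative values).
H  = [2.0 0.0; 0.0 10.0]
x0 = [1.0, 1.0]
g  = H*x0                                   # gradient of the cost at x0

big_eta   = damped_newton_step(H, g, 1e8)   # close to the full Newton jump, -x0
small_eta = damped_newton_step(H, g, 1e-6)  # close to a gradient descent step, -eta*g

println("large eta step: ", big_eta)
println("small eta step: ", small_eta)
```

Growing eta after a successful step and shrinking it after a failed one therefore moves the method smoothly between cautious gradient descent and full Newton jumps.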

In [None]:
"""
function trust_region_Hessian_minimization(seed, func; start_eta=10, tol=1e-6, maxiter=400,
    verbose=false)

(below, x stands for delta_x, the step from the current position x0, at which the cost = const)

cost = 0.5*x'*H*x + grad'*x + const

dcost/dx = H*x + grad  ;   dcost/dx = 0  ==> x =  - inv(H)*grad

Trust-region says have a parameter eta, and replace H with hat{H} = H + I/eta.  
When eta is very large, this is equivalent to a straight Newton method jump, 
because hat{H} ~= H.  But when eta is small, this is more like a small gradient
descent step, because for small eta inv(hat{H}) ~= eta*I and therefore the delta x is like 
-eta*grad.  So, if the cost function is going down, make eta larger, and if it is going
up, make eta a lot smaller, just as we do in other adaptive methods.

PARAMETERS:
===========

seed        column vector, representing the starting value of the parameters.

func        Function that takes a vector and returns a scalar.  If you want to
            work with a function that takes more parameters and returns more than one 
            output, you can use something like

                    x -> orig_func(x, other_params)[1]

            You only need the "[1]" part if the orig_func returns more outputs than a scalar. 

OPTIONAL PARAMETERS:
====================

start_eta=10   Starting value of eta.  It's good to start with something biggish; if it is
               too much, it'll quickly get cut down.

tol=1e-6       Numerical tolerance. If a proposed jump produces a change in func that is less than
               this, the minimization stops.

maxiter=400    Maximum number of iterations to do before stopping

verbose=false   If true, print out a report on each iteration with the iteration number, eta,
                the current cost, the cosine of the angle between the accepted jump and the gradient
                (NaN when the proposed jump was rejected), and the current parameter values.

RETURNS:
========

params       A vector the size of seed that has the last values of the minimizing parameters for func

"""
function trust_region_Hessian_minimization(seed, func; start_eta=10, tol=1e-6, maxiter=400,
    verbose=false)

    params = seed
    eta = start_eta

    cost, grad, hess = vgh(func, params)


    for i in [1:maxiter;]
        hathess    = hess + eye(length(grad), length(grad))/eta        
        new_params = params - inv(hathess)*grad
        new_cost, new_grad, new_hess = vgh(func, new_params)
            
        if abs(new_cost - cost) < tol
            break
        end

        if new_cost >= cost
            eta = eta/2
            costheta = NaN
        else
            eta = eta*1.1
            costheta = dot(new_params-params, grad)/(norm(new_params-params)*norm(grad))

            params = new_params
            cost = new_cost
            grad = new_grad
            hess = new_hess
        end

        if verbose
            @printf "%d: eta=%.3f cost=%.4f costheta=%.3f ps=" i eta cost  costheta
            print_vector(params)
            @printf "\n"
        end
    end
    
    return params
end


In [None]:
npoints = 1000
args = ["baseline", "amplitude", "threshold", "slope"]

params = [1 5 0.5 0.8]
x = rand(npoints, 1)*6-3
y = params[1] + params[2]*0.5*(tanh((x-params[3])/params[4])+1) + randn(npoints,1)*2


figure(1); clf();
subplot(3,1,1);
plot(x, y, ".")

seed = [8, 3.1, 0, -4]
xx = -3:0.01:3

plot(xx, seed[1] + seed[2]*0.5*(tanh((xx-seed[3])/seed[4])+1), "g-")

function JJ(x, y; baseline=0, amplitude=1, threshold=0, slope=1)
    yhat =  baseline + amplitude*0.5*(tanh((x-threshold)/slope)+1) 
    err = yhat - y
    return sum(err.*err)
end

opars = trust_region_Hessian_minimization(seed, (w) -> JJ(x, y; make_dict(args, w)...), 
verbose=false, start_eta=0.001)

plot(xx, opars[1] + opars[2]*0.5*(tanh((xx-opars[3])/opars[4])+1), "r-")

In [None]:
"""
function value, gradient, hessian = vgh(func, x0)

Wrapper for ForwardDiff.hessian!() that computes and returns all three of a function's value, gradient, and hessian.

EXAMPLE:
========

function tester(x::Vector)

    return sum(x.*x)
end

value, grad, hess = vgh(tester, [10, 3.1])
"""
function vgh(func, x0)
    out = DiffBase.HessianResult(x0)             
    ForwardDiff.hessian!(out, func, x0)
    value    = DiffBase.value(out)
    gradient = DiffBase.gradient(out)
    hessian  = DiffBase.hessian(out)
    
    return value, gradient, hessian    
end


function tester(x)
    return sum(x.*x)
end

value, grad, hess = vgh(tester, [1.1, 2.2, 3.3])

In [None]:
data1 = make_data(npoints=100, data_sigma=10, seedrand=11);

args = ["threshold", "slope"]
seed = [0.5, 10.1]

params = seed; 

n_noise  = 5
n_params = length(seed)


noiseval, gradmag, J2noisegrad, J2parmsgrad, init_J2noisegrad= 
    bring_the_noise((;pars...) -> J2(data1;do_plot=false, verbose=false, pars...), 
    verbose=true, args, params, n_noise)        

@printf("|J2_noisegrad|^2 = %g, |J2_parmsgrad|^2 = %g, |grad|^2 = %g, |init_J2_noisegrad|^2 = %g\n", 
sum(J2noisegrad.*J2noisegrad), sum(J2parmsgrad.*J2parmsgrad), gradmag, sum(init_J2noisegrad.*init_J2noisegrad))


# Minimizing with noise to maximize parameter gradient

In [None]:
data1 = make_data(npoints=100, data_sigma=10, seedrand=15);

args = ["threshold", "slope"]
bbox = [-20.1 20.1 ; 0.01 200]

seed = [0.5, 10.1]
seed = [-4.758, 1.1]

params = seed; new_params = 0; new_cost = 0;
eta = 0.1
beta = 0.0005

n_noise  = 5
n_params = length(seed)
maxiter  = 200
tol      = 1e-9
verbose  = true
verbose_level = 2

func       = (;pars...)     -> J2(data1; beta=beta, do_plot=false, verbose=false, pars...)
func_noisy = (nval;pars...) -> func(;nnoise=n_noise, noisy=nval, pars...)

trajectory = zeros(2 + n_params + n_noise, 0)

noiseval, gradmag, J2noisegrad, J2parmsgrad, init_J2noisegrad =  bring_the_noise(func, args, seed, n_noise)

cost, grad, hess = vgh( (w) -> func_noisy(noiseval; do_plot=true, verbose=true, make_dict(args, w)...), params)

#bring_the_noise((;pars...) -> J2(data1;do_plot=false, verbose=false, beta=beta, pars...), 
#        verbose=false, args, params, n_noise)        
# cost, grad, hess = vgh( (w) -> J2([data1;noiseval]; do_plot=true, verbose=true, beta=beta, make_dict(args, w)...), params)

@printf("|J2_noisegrad|^2 = %g, |J2_parmsgrad|^2 = %g, |grad|^2 = %g, |init_J2noisegrad|^2 = %g\n", 
    sum(J2noisegrad.*J2noisegrad), sum(J2parmsgrad.*J2parmsgrad), 
    sum(grad.*grad), sum(init_J2noisegrad.*init_J2noisegrad))

for i in 1:maxiter;
    hathess    = hess + eye(length(grad), length(grad))/eta        
    new_params = params - inv(hathess)*grad
    new_cost, new_grad, new_hess = vgh((w)->func_noisy(noiseval;do_plot=true, verbose=true, make_dict(args, w)...), new_params)
    # vgh((w) -> J2([data1;noiseval]; do_plot=true, verbose=true, beta=beta, make_dict(args, w)...), new_params)
            
    delta_cost = new_cost - cost
    if abs(delta_cost) < tol
        break
    end

    new_noiseval, new_gradmag, new_J2mag_noisegrad, new_J2mag_parmsgrad, new_init_J2mag_noisegrad = 
    bring_the_noise(func, args, new_params, n_noise; init_noise=noiseval)
#    bring_the_noise((;pars...) -> J2(data1;do_plot=false, verbose=false, beta=beta, pars...), 
#        args, new_params, n_noise, init_noise=noiseval, ncycles=150, growth_factor=1.5)
    
    iJ2m_n = sum(new_init_J2mag_noisegrad.*new_init_J2mag_noisegrad)

    if new_cost >= cost || new_gradmag < 1e-8 || iJ2m_n < 1e-15
        if verbose
            if new_cost >= cost
                @printf("--- cost went up\n")
            elseif new_gradmag < 1e-8
                @printf("--- new_gradmag was too small, it was %g\n", new_gradmag)
            else
                @printf("--- initial grad of J2 = |dJ/dw| w.r.t. noise was too small, it was %g\n", iJ2m_n)
            end
        end
        eta = eta/2
        costheta = NaN
    else
        eta = eta*1.1
        costheta = dot(new_params-params, grad)/(norm(new_params-params)*norm(grad))
        
        params = new_params
        noiseval = new_noiseval
        gradmag  = new_gradmag

        cost, grad, hess = vgh( (w) -> func_noisy(noiseval; do_plot=true, verbose=true, make_dict(args, w)...), params)

#        cost, grad, hess = vgh( (w) -> J2([data1;noiseval]; do_plot=true, verbose=true, beta=beta, make_dict(args, w)...), params)
    end

    if verbose
        @printf("%d: eta=%g cost=%.4f Dcost=%g costheta=%.3f gradmag=%g, ps=", 
            i, eta, cost, delta_cost, costheta, gradmag)
        print_vector(params)
        @printf "\n"
        if verbose_level >= 2
            @printf("    noiseval="); print_vector(noiseval); @printf("\n")
            J2m_n  = sum(new_J2mag_noisegrad.*new_J2mag_noisegrad)
            J2m_p  = sum(new_J2mag_parmsgrad.*new_J2mag_parmsgrad)
            iJ2m_n = sum(new_init_J2mag_noisegrad.*new_init_J2mag_noisegrad)
            
            @printf("    init_J2mag_noisegrad = %g, J2mag_noisegrad = %g,  J2mag_parmsgrad = %g\n",
                iJ2m_n, J2m_n, J2m_p)
        end
    end

    trajectory = [trajectory [i; eta; params; noiseval]]
end



In [None]:
data1 = make_data(npoints=100, data_sigma=10, seedrand=15);

args = ["threshold", "slope"]
bbox = [-20.1 20.1 ; 0.01 200]

seed = [0.5, 10.1]
# seed = [-7.758, 1.5]

params = seed; new_params = 0; new_cost = 0;
eta = 0.1
beta = 0.005

n_noise  = 1
n_params = length(seed)
maxiter  = 200
tol      = 1e-9
verbose  = true
verbose_level = 2

func       = (;pars...)     -> J2(data1; beta=beta, do_plot=false, verbose=false, pars...)
func_noisy = (nval;pars...) -> func(;nnoise=n_noise, noisy=nval, pars...)

trajectory = zeros(2 + n_params + n_noise, 0)

# Named init_J2noisegrad (with underscore) to match the variable printed in the minimization cell
noiseval, gradmag, J2noisegrad, J2parmsgrad, init_J2noisegrad =  bring_the_noise(func, args, seed, n_noise)

cost, grad, hess = vgh( (w) -> func_noisy(noiseval; do_plot=true, verbose=true, make_dict(args, w)...), params)


In [None]:
size(trajectory)

In [None]:
J2(data1; do_plot=true, verbose=true, beta=beta, make_dict(args, params)...)

In [None]:
i = 71; eta = trajectory[2,i]; params = trajectory[3:4,i]; noiseval = trajectory[5:end,i]
trajectory[:,i]

In [None]:
[params  new_params]

In [None]:
dd = data1[1:100]
new_params = [-8.033, 1.106]

#dd = []
cost, grad, hess = vgh( (w) -> J2([dd ; noiseval]; do_plot=true, verbose=true, beta=beta, make_dict(args, w)...), new_params)
    hathess    = hess + eye(length(grad), length(grad))/eta        
    new_params = params - inv(hathess)*grad


# Real-life testing of bring_the_noise()

In [None]:
dd = data1[1:99]
dd = [data1[1:10] ; data1[91:100]]
dd = -8.01*ones(30,1)
#dd = []
cost, grad, hess = vgh( (w) -> J2([dd ; noiseval]; do_plot=true, verbose=true, beta=beta, make_dict(args, w)...), params)

    hathess    = hess + eye(length(grad), length(grad))/eta        
    new_params = params - inv(hathess)*grad
    new_params = [-8.033, 1.106]

    new_cost, new_grad, new_hess = 
vgh((w) -> J2([dd ; noiseval]; do_plot=true, verbose=true, beta=beta, make_dict(args, w)...), new_params)
            
    new_noiseval, new_gradmag, new_J2mag_noisegrad, new_J2mag_parmsgrad, new_init_J2mag_noisegrad = 
bring_the_noise((;pars...) -> J2(dd;do_plot=true, verbose=false, beta=beta, pars...), 
args, new_params, n_noise, init_noise=noiseval, verbose=true, ncycles=100, growth_factor=1.5)
    
    iJ2m_n = sum(new_init_J2mag_noisegrad.*new_init_J2mag_noisegrad)
    J2m_n = sum(new_J2mag_noisegrad.*new_J2mag_noisegrad)
    @printf("iJ2m_n = %g, J2m_n = %g\n", iJ2m_n, J2m_n)

new_noiseval
    

In [None]:
bring_the_noise((;pars...) -> J2(data1;do_plot=true, verbose=true, pars...), 
args, new_params, n_noise, init_noise=[-7.98], verbose=true, ncycles=1, start_eta=6)

In [None]:
function compute_noise_grad(func, args, seed, n_noise_params, init_noise) 

    nparams = length(seed)
    n_noise_params = length(init_noise)
    
    myargs    = Array{Any, 1}(nparams+1)
    for i=1:nparams
        myargs[i] = args[i]
    end
    myargs[nparams+1] = ["noisy", n_noise_params]

    value, grad, hess = keyword_vgh((;pars...) -> func(; nnoise=n_noise_params, pars...), 
        myargs, [seed ; init_noise])
    pgrad = grad[1:nparams]

    noise_grad = hess[nparams+1:end,1:nparams]*grad[1:nparams]
    current_grad_mag = sum(pgrad.*pgrad)

    return noise_grad, pgrad, current_grad_mag
end

noise_grad, pgrad, current_grad_mag = compute_noise_grad((;pars...) -> J2(data1;do_plot=true, verbose=false, beta=0, pars...), 
    args, new_params, 1, -8.0);
noise_grad

In [None]:
epsilon = 0.0000001;
funny = (epsilon) -> 0.5*compute_noise_grad((;pars...) -> J2(data1;do_plot=true, verbose=false, pars...), 
args, new_params, 1, -7.98+epsilon)[3]
[funny(epsilon) - funny(0)]/epsilon

In [None]:
pgrad = compute_noise_grad((;pars...) -> J2(data1;do_plot=true, verbose=false, pars...), args, new_params, 1, -8.00)[2];
sum(pgrad.*pgrad)

# Function BRING_THE_NOISE()

**Lessons I may have learnt:**

1. The zone of success for bring_the_noise() may be narrow; therefore, for the adaptive eta, it pays to make the growth factor small. If a big jump takes us too far, we can end up in a zero-gradient region.
2. It also pays to have enough ncycles that we really finish.
3. Finally, in J(), the difs factor (which applies through theta2, when beta > 0) can actually make for local maxima. There can be a zone where |dJ/dw|^2 goes through a narrow maximum that is actually the one we want; further off, there might be a local minimum and then further positive-gradient zones. If we have large jumps and skip over the narrow maximum and the local minimum, then we go all the way off into badland, without a hope of returning. This reinforces the importance of a small growth factor.
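
The lessons above can be illustrated on a toy problem. This is only a minimal sketch, not the notebook's bring_the_noise(): `peak`, `dpeak`, and `adaptive_ascent` are hypothetical names, and the 1-D Gaussian bump merely stands in for |dJ/dw|^2 as a function of a single noise value. The step has a fixed length along the gradient direction, grows by `growth_factor` after a success, and shrinks by 5 after a failure.

```julia
# Hypothetical toy objective: a narrow bump centered at x = 2
peak(x)  = exp(-(x - 2.0)^2 / (2 * 0.1^2))
dpeak(x) = -(x - 2.0) / 0.1^2 * peak(x)

# Adaptive ascent: step a fixed length along the sign of the gradient,
# grow the step after an improvement, shrink it by 5 otherwise
function adaptive_ascent(f, df, x0; step_size=0.1, growth_factor=1.2, ncycles=100)
    x, fx = x0, f(x0)
    for i in 1:ncycles
        g = df(x)
        g == 0 && break                  # zero-gradient region: no direction to follow
        x_new = x + step_size * sign(g)  # step length does not depend on |g|
        f_new = f(x_new)
        if f_new > fx
            x, fx = x_new, f_new         # accept the step
            step_size *= growth_factor
        else
            step_size /= 5               # reject the step and stay where we were
        end
    end
    return x, fx
end

# Starting close to the bump, the search climbs to the maximum
x_near, f_near = adaptive_ascent(peak, dpeak, 1.7)

# Starting far away, the gradient underflows to zero and the search cannot move
x_far, f_far = adaptive_ascent(peak, dpeak, 10.0)
```

The second call shows why ending up in a zero-gradient region is fatal: once the gradient vanishes there is no information left to climb back with.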

**New issue revealed:**  FIXED:   ~~If we suddenly move into a region where the gradient is much larger than at the previous step, we may take an unfortunate big jump. It's not all about |eta|.  We have to fix that.~~

**Another thing:** In our current cost function (I'll call it J() here although above it is defined as J2, because I'm reserving J2 for J2 = |dJ/dw|^2), dJ/dw and therefore J2 *do* depend on non-noise data points, and the gradient can therefore interfere there. This may be a particularly strong problem when we get close to the optimum (wishful thinking, this, trying to just wish it away?)
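
For reference, here is a sketch of where the noise gradient comes from. The notation is an assumption made for this note: $w$ are the parameters, $n$ the noise values, $g = \partial J/\partial w$, and $H_{nw}$ is the noise-by-parameter block of the Hessian of $J$ with respect to $[w ; n]$, which is what `hess[nparams+1:end,1:nparams]` extracts.

$$
J_2(w,n) = \sum_i \left(\frac{\partial J}{\partial w_i}\right)^2,
\qquad
\frac{\partial J_2}{\partial n_k}
= 2\sum_i \frac{\partial^2 J}{\partial n_k\,\partial w_i}\,\frac{\partial J}{\partial w_i}
= 2\left(H_{nw}\, g\right)_k
$$

The code computes `hess[nparams+1:end,1:nparams]*grad[1:nparams]`, i.e. $H_{nw}\,g$ without the factor of 2. Because the search only uses the direction of this vector, the missing factor should not change the steps taken. Note that $g$ is built from all the points that enter $J$, data and noise alike, which is how the non-noise data points get into $\partial J_2/\partial n$.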

In [None]:
"""

noise_val, paramgrad_mag, noise_grad, param_grad, init_noise_grad = 
    bring_the_noise(func, args, seed, n_noise_params; 
            init_noise=NaN, verbose=false, ncycles=100, start_eta=1, growth_factor=1.2)

Given a scalar function that takes some keyword-value arguments, as well as a "noise" vector 
(it doesn't really have to represent noise, it could be anything), finds the value of the "noise" vector 
that would maximize the magnitude of the gradient of func w.r.t. the keyword-value parameters.

PARAMETERS:
===========

func       A scalar function, with keyword-value parameters.  These MUST include nnoise=0, which 
           will be used to indicate the length of the noise vector, and noisy=[], which will be used to indicate
           the value of the noise vector itself.  They MUST also include nderivs=0 and difforder=0, used internally
           together with ForwardDiffZeros in order to make sure new arrays and vectors are differentiable.

args       A list of strings, indicating the keyword parameters for which differentiation is desired.

seed       A list of the initial values (all scalars) of those keyword parameters

n_noise_params    The desired length of the "noise" vector


OPTIONAL PARAMETERS:
====================

init_noise  Default NaN, in which case it is ignored and noise is initialized randomly. If not NaN, it 
            should be a column vector of length n_noise_params, representing the initial value of the "noise"

verbose     Default false. If true, prints out debugging information at each cycle of the iterative search for the
            best noise vector value
      
ncycles     Default 100. Number of iterations of the adaptive gradient ascent that will be used to find the 
            best noise vector value.

start_eta   Default 1. Starting value of the learning rate; the first step has length start_eta*|noise_grad|.

growth_factor    Default 1.2.  Factor by which the step size gets multiplied every time a step successfully leads to an
            increase in J2 = |d(func)/d(params)|^2

RETURNS:
========

noise_val   The value of the noise at the end of the iterations seeking to maximize |d(func)/d(params)|^2

paramgrad_mag  J2 = |d(func)/d(params)|^2

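noise_grad    d(J2)/dnoise, up to a factor of 2 (computed as hess[noise,params]*grad[params]). Since we're 
              trying to find the noise that maximizes J2, if we were successful this will be very small 
              at the end of the iterations

param_grad    d(J2)/dparams, up to a factor of 2 (computed as hess[params,params]*grad[params])

init_noise_grad   d(J2)/dnoise (same factor of 2) at the beginning (not end) of the iterations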

"""
function bring_the_noise(func, args, seed, n_noise_params; init_noise = NaN, verbose=false, 
    ncycles=100, start_eta=1, growth_factor = 1.2)

    function unit_vector(vec)
        return vec/sqrt(sum(vec.*vec))
    end
    
    if length(init_noise)==1 && isnan(init_noise[1])
        noise_val = randn(n_noise_params, 1)
    else
        noise_val = init_noise
    end
    nparams   = length(seed)
    myargs    = Array{Any, 1}(nparams+1)
    
    for i=1:nparams
        myargs[i] = args[i]
    end
    myargs[nparams+1] = ["noisy", n_noise_params]
    
    eta = start_eta
    
    value, grad, hess = keyword_vgh((;pars...) -> func(; nnoise=n_noise_params, pars...), 
        myargs, [seed ; noise_val])
    pgrad = grad[1:nparams]

    noise_grad = hess[nparams+1:end,1:nparams]*grad[1:nparams]
    current_grad_mag = sum(pgrad.*pgrad)
    
    starting_step_size = start_eta*sqrt(sum(noise_grad.*noise_grad))
    step_size = starting_step_size
    
    if verbose
        @printf("0: eta is %g, noise_grad is ", eta); print_vector_g(noise_grad);
        @printf("  |pgrad|^2 is %g\n", current_grad_mag)
    end
    
    init_noise_grad = noise_grad
    # hess*grad over the params block, tracked at the last *accepted* noise value
    param_grad = hess[1:nparams,1:nparams]*grad[1:nparams]

    for i=1:ncycles
        new_noise_val = noise_val + step_size*unit_vector(noise_grad)

        value, grad, hess = keyword_vgh((;pars...) -> func(; nnoise=n_noise_params, pars...), 
            myargs, [seed ; new_noise_val])
        new_noise_grad = hess[nparams+1:end,1:nparams]*grad[1:nparams]
        pgrad = grad[1:nparams]
        new_grad_mag = sum(pgrad.*pgrad)
        
        if verbose
            @printf("%d: step_size is %g, |pgrad|^2 is %g, delta in |pgrad|^2 is %g\n", i, step_size, 
            new_grad_mag, new_grad_mag-current_grad_mag)
            @printf("new_noise_val: ");  print_vector(new_noise_val);        @printf("\n")
            @printf("new_noise_grad: "); print_vector_g(new_noise_grad); @printf("\n")
        end
        
        if new_grad_mag > current_grad_mag
            # Success: accept the step and lengthen the next one
            step_size *= growth_factor
            noise_val  = new_noise_val
            noise_grad = new_noise_grad 
            current_grad_mag = new_grad_mag
            param_grad = hess[1:nparams,1:nparams]*grad[1:nparams]
        elseif new_grad_mag == current_grad_mag
            break
        else
            # Failure: keep the old noise_val and shorten the next step
            step_size /=5
            if verbose
                @printf("   Going back to noise_val: "); print_vector(noise_val); @printf("\n")
            end
        end
    end

    # Everything returned is evaluated at the accepted noise_val, not at the last
    # (possibly rejected) trial point
    return noise_val, current_grad_mag, noise_grad, param_grad, init_noise_grad
end


In [None]:
function J2(data1; nnoise=0, noisy=[], threshold=0.5, slope=0.25, theta1 = 0.15, theta2=0.2, beta=0.005,
    do_plot=true, nderivs=0, difforder=0, verbose=true)

    if nnoise > 0
        data1 = [data1 ; noisy]
    end
    npoints = length(data1)

    d1 = tanh.((data1 .- threshold)*slope)/2 .+ 0.5

    hits = 0.5*(1 .+ tanh.((d1 .- 0.5)/theta1))
    difs = tanh.((d1 .- 0.5)/theta2).^2
    
    if do_plot
        figure(1); clf();
        subplot(3,1,1)
        plot(1:npoints, d1, "b."); # @printf("Plotted %d points\n", npoints)
        ylabel("d1"); 
        title(@sprintf("threshold=%.3f slope=%.3f", convert(Float64, threshold), convert(Float64, slope)))
        subplot(3,1,2)
        plot(1:npoints, hits, ".")
        ylabel("hits")
        subplot(3,1,3)
        plot(1:npoints, difs, ".")
        ylabel("difs")        
        title(@sprintf("<hits>=%.3f <difs>=%.3f", convert(Float64, mean(hits)), convert(Float64, mean(difs))))
    end
    cost1 = (mean(hits) - 0.75)^2
    cost2 = -mean(difs) 

    cost = cost1 + beta*cost2

    if verbose
        @printf("        cost1=%g, beta*cost2=%g, mean(hits)=%.4f, mean(difs)=%.4f\n", 
            convert(Float64, cost1), beta*convert(Float64, cost2), 
            convert(Float64, mean(hits)), convert(Float64, mean(difs)))
    end

    return cost
end



In [None]:
ax = subplot(3,1,1)


In [None]:
new_noiseval, new_gradmag = 
    bring_the_noise((;pars...) -> J2(data1;do_plot=false, verbose=false, pars...), 
    args, new_params, n_noise, init_noise=noiseval, verbose=true)


In [None]:
grad


In [None]:
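# J2 must actually depend on w, otherwise vgh() returns a zero gradient and Hessian
func = (w) -> J2(data1; do_plot=false, verbose=false, make_dict(args, w)...)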

cost, grad, hess = vgh(func, params)


    for i in 1:maxiter
        hathess    = hess + eye(length(grad), length(grad))/eta        
        new_params = params - inv(hathess)*grad
        new_cost, new_grad, new_hess = vgh(func, new_params)
            
        if abs(new_cost - cost) < tol
            break
        end

        if new_cost >= cost
            eta = eta/2
            costheta = NaN
        else
            eta = eta*1.1
            costheta = dot(new_params-params, grad)/(norm(new_params-params)*norm(grad))

            params = new_params
            cost = new_cost
            grad = new_grad
            hess = new_hess
        end

        if verbose
            @printf "%d: eta=%.3f cost=%.4f costheta=%.3f ps=" i eta cost  costheta
            print_vector(params)
            @printf "\n"
        end
    end
    
params

In [None]:
J2(data1, do_plot=true, beta=0.005, theta1=0.6, theta2=0.2, threshold=-4, slope=10.1)

In [None]:
rem(12,3)

In [None]:
val, grad, hess = keyword_vgh((;pars...)->J(data1;do_plot=true, beta=0.05, theta1=0.15, theta2=0.2, pars...), 
    ["threshold", "inv_slope"], [-3.64425,0.00998397])

# Weird gradient idea


In [None]:
function vgh(func, pars)
    
    out = DiffBase.HessianResult(pars)
    ForwardDiff.hessian!(out, func, pars)
    value = DiffBase.value(out)
    grad  = DiffBase.gradient(out)
    hess  = DiffBase.hessian(out)

    return value, grad, hess
end

In [None]:
function googaa(x::Vector)
    return sum(x.*x)
end

googaa(randn(1,100)[:])

vgh(googaa, randn(1,10)[:])


In [None]:
npoints = 100
nnoise  = 2
data_sigma = 10


function make_data(;npoints=100, data_sigma=10, seedrand=NaN)
    if ~isnan(seedrand)
        srand(seedrand)
    end
    return data_sigma*randn(npoints,1)
end

data1 = make_data(npoints=100, data_sigma=10, seedrand=10);


function J2(data1; nnoise=0, noisy=[], threshold=0.5, slope=0.25, theta1 = 0.15, theta2=0.2, beta=0.05,
    do_plot=true, nderivs=0, difforder=0, verbose=true)

    if nnoise > 0
        data1 = [data1 ; noisy]
    end
    npoints = length(data1)

    d1 = tanh.((data1 .- threshold)*slope)/2 .+ 0.5

    hits = 0.5*(1 .+ tanh.((d1 .- 0.5)/theta1))
    difs = tanh.((d1 .- 0.5)/theta2).^2
    
    if do_plot
        figure(1); clf();
        subplot(3,1,1)
        plot(1:npoints, d1, "b.")
        ylabel("d1"); 
        title(@sprintf("threshold=%.3f slope=%.3f", convert(Float64, threshold), convert(Float64, slope)))
        subplot(3,1,2)
        plot(1:npoints, hits, ".")
        ylabel("hits")
        subplot(3,1,3)
        plot(1:npoints, difs, ".")
        ylabel("difs")        
        title(@sprintf("<hits>=%.3f <difs>=%.3f", convert(Float64, mean(hits)), convert(Float64, mean(difs))))
    end
    cost1 = (mean(hits) - 0.75)^2
    cost2 = -mean(difs) 

    cost = cost1 + beta*cost2

    if verbose
        @printf("        cost1=%g, beta*cost2=%g, mean(hits)=%.4f, mean(difs)=%.4f\n", 
            convert(Float64, cost1), beta*convert(Float64, cost2), 
            convert(Float64, mean(hits)), convert(Float64, mean(difs)))
    end

    return cost
end

value, grad, hess = keyword_vgh((;pars...)->J2(data1, do_plot=true, nnoise=2;pars...), 
["threshold", "slope", ["noisy", 2]], [0.5, 0.25, 0.1, 0.2])

hess

In [None]:
func = (;pars...)->J2(data1; do_plot=false, verbose=false, pars...)
args = ["threshold", "slope"]
seed = [0.05, 10]

n_noise_params = 5

    noise_val = randn(n_noise_params, 1)
    nparams   = length(seed)
    myargs    = Array{Any, 1}(nparams+1)

    for i=1:nparams
        myargs[i] = args[i]
    end
    myargs[nparams+1] = ["noisy", n_noise_params]
    
    eta = 1

myargs
keyword_vgh((;pars...) -> func(; nnoise=n_noise_params, pars...), 
        myargs, [seed ; noise_val])

In [None]:
noiseval, grad_mag = bring_the_noise( (;pars...)->J2(data1; do_plot=false, verbose=false, pars...), 
["threshold", "slope"], [0.5, 10], 5, verbose=false)

J2([data1;noiseval]; do_plot=true, threshold=0.5, slope=10)

In [None]:
a = Array{Any, 1}(3)
a[1] = "a"
a[2] = "bee"
a[3] = ["hmm" 3]
a

In [None]:
nnoise = 1
noise_val = data_sigma*randn(nnoise,1); orig_noise_val = noise_val

args = ["threshold", "slope", ["noisy", nnoise]]
seed = [0.5, 20]
nparams = length(seed)

eta = 1

value, grad, hess = keyword_vgh((;pars...) -> J2(data1, do_plot=true, nnoise=nnoise;pars...), args, [seed;noise_val])
noise_grad = hess[nparams+1:end,1:nparams]*grad[1:nparams]
current_grad_mag = sum(grad.*grad)
@printf("|grad|^2 is %g\n\n", current_grad_mag)

new_noise_val=0; new_noise_grad=0
for i=1:100
    new_noise_val = noise_val + eta*noise_grad

    value, grad, hess = keyword_vgh((;pars...) -> J2(data1, do_plot=false, verbose=false, nnoise=nnoise;pars...), 
        args, [seed;new_noise_val])
    new_noise_grad = hess[nparams+1:end,1:nparams]*grad[1:nparams]
    if i<0
        @printf("%d: eta is %g, |grad|^2 is %g, delta in |grad|^2 is %g\n", i, eta, 
        sum(grad.*grad), sum(grad.*grad)-current_grad_mag)
        @printf("noise_val: ");      print_vector(noise_val);        @printf("\n")
        @printf("new_noise_grad: "); print_vector_g(new_noise_grad); @printf("\n")
    end
    if sum(grad.*grad)-current_grad_mag >= 0
        eta *= 2
        noise_val  = new_noise_val
        noise_grad = new_noise_grad 
        current_grad_mag = sum(grad.*grad)
    else
        eta /=10
    end
end

@printf("|grad|^2 is %g\n\n", sum(grad.*grad))
# The loop index i is not available after the loop, so report the final state without it
@printf("final: eta is %g, |grad|^2 is %g, delta in |grad|^2 is %g\n", eta, 
sum(grad.*grad), sum(grad.*grad)-current_grad_mag)
@printf("noise_val: ");      print_vector(noise_val);        @printf("\n")
@printf("new_noise_grad: "); print_vector_g(new_noise_grad); @printf("\n")

# J2(data1, do_plot=true, verbose=true, nnoise=nnoise; make_dict(args, [seed;orig_noise_val])...)

J2(data1, do_plot=true, verbose=true, nnoise=nnoise; make_dict(args, [seed;new_noise_val])...)


In [None]:
new_noise_val

In [None]:
noise_val = [0.2, 0.1]
args = ["threshold", "slope", ["noisy", 2]]
seed = [0.5, 10]
nnoise = length(noise_val)

eta = 1
nparams = length(seed)   # do not rely on a stale grad left over from an earlier cell

value, grad, hess = keyword_vgh((;pars...) -> J2(data1, do_plot=true, nnoise=nnoise;pars...), args, [seed;noise_val])
noise_grad = hess[nparams+1:end,1:nparams]*grad[1:nparams]

In [None]:
hess

In [None]:
grad'

In [None]:
hess

In [None]:
grad'

In [None]:
hess[nparams+1:end, 1:nparams]*grad[1:nparams]

In [None]:
J2(data1, do_plot=true, verbose=true, nnoise=nnoise; make_dict(args, [seed;new_noise_val])...)


In [None]:
value, grad, hess = keyword_vgh((;pars...)->J2(data1, do_plot=true;pars...), ["threshold", "slope", "noisy"], 
[0.5, 100, 0.5])

@printf("1000*J2 = %g\n", 1000*0.5*sum(grad[1:2].*grad[1:2]))

grad[1]*hess[3,1] + grad[2]*hess[3,2]

In [None]:

a=[1,2]
b = [3,4]
[a;b]

In [None]:
glu = "%.4f\n"

# @eval @printf($fmt,1,2,3)

@eval @printf($glu, pi)

In [None]:
include("hessian_utils.jl")

# Next cell tries out Marino's idea of doing a new random seed if the minimization gets stuck at a value of the cost function that we don't like.

**2017-08-16 11:11am** : Works, but it's not clear to me that it is better than just restarting the search entirely. Too many searches get stuck at too small a value of inv_slope.

**Still to do:** work on scaling theta1 and theta2.  Unclear whether that will help.

In [None]:
seed = [0.5, 0.1]
args = ["threshold", "inv_slope"]
bbox = [-20.1 20.1 ; 0.001 20]

func2 = (;pars...) -> func(;beta=0*0.05, pars...)

niters = 0
cost = 10
while cost > 0
    params, traj, cost = bbox_Hessian_keyword_minimization(seed, args, bbox, (;pars...)->func(;beta=0.001, pars...), verbose=true)
    data1 = make_data(npoints=100, data_sigma=10)
    func = (;pars...) -> J(data1; do_plot=true, theta1=0.15, theta2=0.2, pars...)
    seed = params
    niters = niters + 1
    @printf("\n\n\n========== WILL CONSIDER NEXT ITER AT COST = %g ========\n\n\n", cost)
end

@printf("\n\n   niter = %g\n\n", niters)

In [None]:
func2(;make_dict(["threshold", "inv_slope"], [-9, 1])...)

# Sandlot -- trash from here on

In [None]:
theta2 = 0.2
mudata = 0.9
sigmadata = 0.2

ntrials = 100

u1 = randn(ntrials, 1)*sigmadata + mudata
u2 = randn(ntrials, 1)*sigmadata - mudata

v1 = tanh(u1)
v2 = tanh(u2)

diffs = (v1 - v2)/theta2
sampsigma = sqrt(mean((v1-v2).^2))
theta2 = sampsigma
@printf "sampsigma=%g, theta2=%g\n" sampsigma theta2

tdiffs = tanh((v1-v2)/theta2).^2

figure(1); clf(); 
subplot(4,1,1)
plot(u1, "b.", u2, "r."); ylabel("U"); 
subplot(4,1,2)
plot(v1, "b.", v2, "r."); ylabel("V")
subplot(4,1,3)
plot((v1-v2), "."); ylabel("V1-V2")
subplot(4,1,4)
plot(tdiffs, "."); ylabel("TANH OF DIFFS")