# GridWorld Example

Let's see how to make this API work with GridWorld! This reinforcement learning API requires three things to be defined before we can start running algorithms:

+ BlackBoxModel: defines the problem--see below for an example!
+ Policy: this is where your domain knowledge comes in--define action space and feature functions
+ Solver: This is where the API takes over and you just specify what you want to use

In [None]:
include(joinpath("..","src","ReinforcementLearning.jl"))
using ReinforcementLearning

## Define Black Box Model Functions

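The BlackBoxModel type requires the following things to be defined (the signatures listed here are schematic; the functions defined later in this notebook take `rng` as their first argument, except for the initializer):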
+ `model`: a generic type that holds all your model parameters for a specific instance of your problem
+ `init(model,rng)`: generate an initial state
+ `observe(model,rng,state,action=None)`: return an observation based on your state (and action--this isn't quite ironed out yet)
+ `next_state(model,rng,state,action)`: generate a next state given your state, action and problem parameterization
+ `reward(model,rng,state,action)`: generate a reward based on your state and action and problem parameterization
+ `isterminal(model,state,action)`: return a boolean of whether a state (and action) is terminal or not
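
To make the roles of these functions concrete, here is a minimal sketch of how a rollout loop might string them together. It does not use the ReinforcementLearning API: `CorridorModel`, `init_state`, `step_reward`, `is_terminal`, and `rollout` are hypothetical names made up for illustration, the order in which reward and termination are checked is an assumption, and the syntax targets Julia 1.x rather than the older Julia version this notebook was written for.

```julia
using Random

# Hypothetical stand-in model: a 1-D corridor with cells 1..N.
struct CorridorModel
    N::Int
end

# Illustrative versions of the black box functions (names are assumptions).
init_state(m::CorridorModel, rng::AbstractRNG) = 1
next_state(rng::AbstractRNG, m::CorridorModel, s::Int, a::Int) = clamp(s + a, 1, m.N)
step_reward(rng::AbstractRNG, m::CorridorModel, s::Int, a::Int) = s == m.N ? 1.0 : -0.1
is_terminal(rng::AbstractRNG, m::CorridorModel, s::Int, a::Int) = s == m.N

# Roll out a fixed "always move right" policy.
# Returns the visited states and the undiscounted sum of rewards.
function rollout(m::CorridorModel, rng::AbstractRNG; max_steps::Int=100)
    s = init_state(m, rng)
    states = [s]
    total = 0.0
    for _ in 1:max_steps
        a = 1                               # placeholder policy
        total += step_reward(rng, m, s, a)  # reward for (state, action)
        is_terminal(rng, m, s, a) && break  # stop once a terminal state is reached
        s = next_state(rng, m, s, a)
        push!(states, s)
    end
    return states, total
end

states, total = rollout(CorridorModel(4), MersenneTwister(0))
```

With a corridor of length 4, `states` is `[1, 2, 3, 4]` and `total` is three living costs plus the goal reward, roughly 0.7.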

In [None]:
using PyPlot
using Interact

typealias State Tuple{Int,Int}
typealias Action Tuple{Int,Int}

type GridWorldModel <: Model
  W::Int
  H::Int
  p_other::Float64
  reward_locs::Dict{State,Float64}
  collide_cost::Float64
    A::Array{Action,1}
end

In [None]:
init2(m::GridWorldModel,rng::AbstractRNG) = (rand(rng,1:m.W),rand(rng,1:m.H))
init1(m::GridWorldModel,rng::AbstractRNG) = (1,1)

isend1(rng::AbstractRNG,m::GridWorldModel,s::State,a::Action) = s == (m.W,m.H)
isend2(rng::AbstractRNG,m::GridWorldModel,s::State,a::Action) = false

function reward(rng::AbstractRNG,m::GridWorldModel,s::State,a::Action)
  r = get(m.reward_locs,s,0.) #reward attached to the current cell, if any
  x_ = s[1] + a[1]
  y_ = s[2] + a[2]

  #penalize moves that would leave the grid
  if (x_ < 1) || (x_ > m.W)
    r += m.collide_cost
  elseif (y_ < 1) || (y_ > m.H)
    r += m.collide_cost
  end

  r -= 0.1 #cost of living

  return r

end

function next(rng::AbstractRNG,m::GridWorldModel,s::State,a::Action)
  #actions other than the intended one and its opposite
  A_other = setdiff(m.A,[a,(-1*a[1],-1*a[2])])

  if rand(rng) < m.p_other
    _a = A_other[rand(rng,1:length(A_other))]
  else
    _a = a
  end
  x_ = s[1] + _a[1]
  y_ = s[2] + _a[2]

  x_ = max(min(x_,m.W),1)
  y_ = max(min(y_,m.H),1)

  return (x_,y_)
end
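
The slip behavior in `next` can be checked in isolation. The sketch below re-implements only the action-noise step on plain tuples, outside the notebook's types, and estimates how often the intended move gets replaced. `noisy_move` is a hypothetical helper name, and the code assumes Julia 1.x.

```julia
using Random

# With probability p_other, swap the intended move for one of the remaining
# moves (everything except the intended move and its opposite).
function noisy_move(rng::AbstractRNG, moves::Vector{Tuple{Int,Int}},
                    a::Tuple{Int,Int}, p_other::Float64)
    others = setdiff(moves, [a, (-a[1], -a[2])])
    return rand(rng) < p_other ? rand(rng, others) : a
end

moves = Tuple{Int,Int}[(0,0), (1,0), (-1,0), (0,1), (0,-1)]
rng = MersenneTwister(1)

# Empirical slip frequency for the intended move (1,0) with p_other = 0.2.
n = 10_000
slips = count(_ -> noisy_move(rng, moves, (1,0), 0.2) != (1,0), 1:n)
slip_rate = slips / n

# Moves the agent can slip into when it intends (1,0).
slip_targets = setdiff(moves, [(1,0), (-1,0)])
```

Note that the stay-in-place move `(0,0)` is among the slip targets, so a slip can also leave the agent where it is.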

Here we also implement some quality-of-life functions: an explicit one-hot feature function over state-action pairs, and a visualization function.

In [None]:
function generate_featurefunction(m::GridWorldModel,A::Array{Action,1})

  nb_feat = m.W*m.H*length(A)
  A_indices = [a=>i for (i,a) in enumerate(A)]
  function feature_function(s::State,a::Action)
    active_indices = [s[1]+m.W*(s[2]-1)+m.W*m.H*(A_indices[a]-1)]
    phi = sparsevec(active_indices,ones(length(active_indices)),nb_feat)
    return phi
  end

  return feature_function

end

function visualize(m::GridWorldModel,s::State,a::Action)
  #base grid
  for i = 1:m.W
    for j = 1:m.H
      val = get(m.reward_locs,(i,j),0.)
      if val > 0
        color = "#31B404"
      elseif val < 0
        color = "#FF0000"
      else
        color = "#A4A4A4"
      end
      fill([i;i+1;i+1;i],[j;j;j+1;j+1],color=color,edgecolor="#FFFFFF")
    end #j
  end #i

  #draw agent
  agent_color = "#0101DF"
  x = s[1] + 0.5
  y = s[2] + 0.5
  fill([x-0.5;x;x+0.5;x],[y;y-0.5;y;y+0.5],color=agent_color)
  #draw direction
  arrow(x,y,0.5*a[1],0.5*a[2],width=0.1,head_width=0.15,head_length=0.5)

end

function visualize(m::GridWorldModel,S::Array{State,1},A::Array{Action,1})
  assert(length(S) == length(A))
  f = figure()
  @manipulate for i = 1:length(S); withfig(f) do
    visualize(m,S[i],A[i]) end
  end
end
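
The index arithmetic inside `generate_featurefunction` is easy to get off by one, so here is a small standalone check of the same formula. `onehot_index` is a hypothetical helper name, and the code assumes Julia 1.x (where `sparsevec` lives in the `SparseArrays` standard library).

```julia
using SparseArrays

# Index of the single active feature for grid cell (x, y) and the k-th action
# on a W-by-H grid (everything 1-based).
onehot_index(x::Int, y::Int, k::Int, W::Int, H::Int) = x + W*(y - 1) + W*H*(k - 1)

W, H, nb_actions = 20, 20, 5
nb_feat = W * H * nb_actions

first_idx = onehot_index(1, 1, 1, W, H)            # first cell, first action
last_idx  = onehot_index(W, H, nb_actions, W, H)   # last cell, last action

# One-hot feature vector for cell (3, 2) and the 4th action.
phi = sparsevec([onehot_index(3, 2, 4, W, H)], [1.0], nb_feat)
```

The first and last indices land exactly on 1 and `nb_feat`, so every state-action pair gets its own slot and no slot is wasted.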

In [None]:
_A = Action[(0,0),(1,0),(-1,0),(0,1),(0,-1)]

In [None]:
W = 20
H = 20
p_other = 0.2
reward_locs = Dict{State,Float64}((W,H)=>10.)
collide_cost = -1.
m = GridWorldModel(W,H,p_other,reward_locs,collide_cost,_A)

We now construct the BlackBoxModel. Note that we do not pass an observation function to the constructor, so it falls back to a default identity observation model.

In [None]:
bbm = BlackBoxModel(m,init1,next,reward,isend1) 

## Setting Up the Policy

For a policy, we have to define an ActionSpace (which must be the true action space or a subset of it) and a feature function, which maps a state-action pair to a feature vector.

Tile coding is provided as a quick and dirty function approximator for continuous domains (the tile coding API still needs work). For generality, we also include a function `cast_mc_state`, which in the most general case converts whatever state representation you have into an array of numbers.

In [None]:
feature_function = generate_featurefunction(m,_A)
A = DiscreteActionSpace(_A)

In [None]:
policy = EpsilonGreedyPolicy(feature_function,A,rng=MersenneTwister(3234),eps=0.1)
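
For readers unfamiliar with epsilon-greedy selection, the following is a minimal sketch of the idea with a linear value estimate. It is not the package's `EpsilonGreedyPolicy` implementation: `epsilon_greedy` and the toy `phi` are hypothetical, and the code assumes Julia 1.x.

```julia
using Random, LinearAlgebra

# Pick an action index: uniformly at random with probability eps,
# otherwise the index with the largest estimated value dot(w, phi(k)).
function epsilon_greedy(rng::AbstractRNG, w::Vector{Float64}, phi::Function,
                        nb_actions::Int, eps::Float64)
    if rand(rng) < eps
        return rand(rng, 1:nb_actions)      # explore
    end
    q = [dot(w, phi(k)) for k in 1:nb_actions]
    return argmax(q)                        # exploit
end

# Toy setup: one feature per action, so the value of action k is just w[k].
phi(k) = [i == k ? 1.0 : 0.0 for i in 1:3]
w = [0.1, 0.9, 0.3]

greedy_choice = epsilon_greedy(MersenneTwister(3234), w, phi, 3, 0.0)
random_choice = epsilon_greedy(MersenneTwister(3234), w, phi, 3, 1.0)
```

With `eps=0.0` the choice is always the highest-valued action (index 2 here); with `eps=1.0` it is always a uniform draw.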

## Choose and Set up your Solver

Currently, the following solvers are supported:
+ Forgetful LSTD(\lambda) / LS-SARSA (untested)
+ SARSA(\lambda)
+ Q(\lambda) (unimplemented)
+ GQ(\lambda) (unimplemented)
+ Double Q learning (untested)
+ Deterministic Policy Gradient (unimplemented)
+ (Natural) Actor-Critic (unimplemented)
+ LSPI/Batch TD (untested)
+ True Online TD

We just ask that you know a priori how big your feature vectors are, to make initialization easy.

In [None]:
#there might be a smart way to stick this into a constructor, but for now...
nb_features = length(policy.feature_function(bbm.state,domain(A)[1]))
#updater = ForgetfulLSTDParam(nb_features,alpha=0.001/3)
#updater = SARSAParam(nb_features,lambda=0.7,init_method="unif_rand",trace_type="replacing")
updater = TrueOnlineTDParam(nb_features,lambda=0.95,init_method="unif_rand")
#mem_size = 50
#updater = LSPIParam(nb_features,mem_size,del=0.01,discount=0.99)
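
As a rough picture of what a true online TD(\lambda) updater does on each step, here is a sketch of the textbook update with linear function approximation and a dutch trace. It is not taken from the package source: `true_online_step!` and its argument names are hypothetical, the hyperparameter defaults simply mirror the values used in this notebook, and the code assumes Julia 1.x.

```julia
using LinearAlgebra

# One true online TD(lambda) update with linear function approximation.
#   w, z      : weight vector and dutch eligibility trace (both updated in place)
#   x, x_next : feature vectors of the current and next state(-action)
#   r         : reward observed on this transition
#   q_old     : value returned by the previous call (0.0 at the start of an episode)
# Returns the value to pass as q_old on the next step.
function true_online_step!(w, z, x, x_next, r, q_old;
                           alpha=0.01, gamma=0.99, lambda=0.95)
    q      = dot(w, x)
    q_next = dot(w, x_next)          # computed before w changes
    delta  = r + gamma*q_next - q    # TD error
    # Dutch trace update (dot(z, x) uses the trace from before this step).
    z .= gamma*lambda .* z .+ (1 - alpha*gamma*lambda*dot(z, x)) .* x
    w .+= alpha*(delta + q - q_old) .* z .- alpha*(q - q_old) .* x
    return q_next
end

# Single step from zero weights on a two-feature toy problem.
w = [0.0, 0.0]
z = [0.0, 0.0]
q_old = true_online_step!(w, z, [1.0, 0.0], [0.0, 1.0], 1.0, 0.0;
                          alpha=0.5, gamma=0.9, lambda=0.5)
```

On a terminal transition, `x_next` would be the zero vector, and `z` and `q_old` would be reset before the next episode.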

## Actually set up the real solver

Some random cool things supported include:
+ minibatching
+ experience replay
+ adaptive learning rates, e.g.:
    * momentum
    * nesterov momentum
    * rmsprop
    * adagrad
    * adadelta
    * adam
+ simulated annealing (probably shouldn't support this)


In [None]:
solver = Solver(updater,
                lr=0.01,
                nb_episodes=2000,
                nb_timesteps=1000,
                discount=0.99,
                annealer=NullAnnealer(),
                mb=NullMinibatcher(),
                er=NullExperienceReplayer(),
                display_interval=10)

In [None]:
trained_policy = solve(solver,bbm,policy)

## Evaluate Policy
Basically, just run a batch of simulations with the trained policy -- the Simulator API takes a subset of the options you see in Solver.

In [None]:
sim = Simulator(discount=1.,nb_sim=50,nb_timesteps=1000,visualizer=visualize) #stuff...

In [None]:
#returns average reward for now...
bbm.init = init1
R_avg = simulate(sim,bbm,trained_policy)
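
To spell out what an average return over simulations means, here is a small sketch that computes discounted returns for a couple of made-up reward sequences and averages them. It does not call the package's `simulate`; `discounted_return`, `average_return`, and the reward numbers are hypothetical, and the code assumes Julia 1.x.

```julia
# Discounted return of one episode, accumulated backwards through the rewards.
function discounted_return(rewards, discount)
    g = 0.0
    for r in reverse(rewards)
        g = r + discount * g
    end
    return g
end

# Mean discounted return over a collection of episodes.
average_return(episodes, discount) =
    sum(discounted_return(ep, discount) for ep in episodes) / length(episodes)

# Two made-up episodes: living costs followed by a goal reward.
episodes = [[-0.1, -0.1, 10.0], [-0.1, 10.0]]

R_undiscounted = average_return(episodes, 1.0)
R_discounted   = average_return(episodes, 0.5)
```

With `discount=1.`, as in the `Simulator` above, this reduces to the plain sum of rewards per episode.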

In [None]:
visualize(m,sim.hist.S,sim.hist.A)

Note that we have to call visualize externally: getting the visualization to display from two or three function calls deep doesn't work yet.