# Module 4 - Programming Assignment

## Directions

There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at `EN605.445@gmail.com` if you do not understand something.

<div style="background: mistyrose; color: firebrick; border: 2px solid darkred; padding: 5px; margin: 10px;">
You must follow the directions *exactly* or you will get a 0 on the assignment.
</div>

You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after you JHED ID: `<jhed_id>.zip`. It will not include any other information. Inside this zip file should be the following directory structure:

```
<jhed_id>
    |
    +--module-04-programming.ipynb
    +--module-04-programming.html
    +--world.txt
    +--test01.txt
    +--(any other files)
```

For example, do not name  your directory `programming_assignment_01` and do not name your directory `smith122_pr1` or any else. It must be only your JHED ID.

In [1]:
from IPython.core.display import *
from StringIO import StringIO
import copy, sys, random, math

# add whatever else you need from the Anaconda packages

## Reinforcement Learning with Q-Learning

The world for this problem is very similar to the one from Module 1 that we solved with A\* search but this time we're going to use a different approach.

We're replacing the deterministic movement with stochastic movement. This means, when the agent moves "south" instead of always going "south", there is a probability distribution of possible successor states "south", "east", "north" and "west". Thus we may not end up in the state we planned!

There are a variety of ways to handle this problem. For example, if using A\* search, if the agent finds itself off the solution, you can simply calculate a new solution from where the agent ended up. Although this sounds like a really bad idea, it has actually been shown to work really well in Video Games that use formal Planning algorithms (which we will cover later). When these algorithms were first designed, this was unthinkable. Thank you, Moore's Law!

Another approach is to use Reinforcement Learning which covers problems where there is some kind of general uncertainty. We're going to model that uncertainty a bit unrealistically here but it'll show you how the algorithm works.

As far as RL is concerned, there are a variety of options there: model-based and model-free, Value Iteration, Q-Learning and SARSA. You are going to use Value Iteration and Q-Learning.

## The World Representation

As before, we're going to simplify the problem by working in a grid world. The symbols that form the grid have a special meaning as they specify the type of the terrain and the cost to enter a grid cell with that type of terrain:

```
token   terrain    cost 
.       plains     1
*       forest     3
^       hills      5
~       swamp      7
x       mountains  impassible
```

When you go from a plains node to a forest node it costs 3. When you go from a forest node to a plains node, it costs 1. You can think of the grid as a big graph. Each grid cell (terrain symbol) is a node and there are edges to the north, south, east and west (except at the edges).

There are quite a few differences between A\* Search and Reinforcement Learning but one of the most salient is that A\* Search returns a plan of N steps that gets us from A to Z, for example, A->C->E->G.... Reinforcement Learning, on the other hand, returns  a *policy* that tells us the best thing to do **for every and any state.**

For example, the policy might say that the best thing to do in A is go to C. However, we might find ourselves in D instead. But the policy covers this possibility, it might say, D->E. Trying this action might land us in C and the policy will say, C->E, etc. At least with offline learning, everything will be learned in advance (in online learning, you can only learn by doing and so you may act according to a known but suboptimal policy).

Nevertheless, if you were asked for a "best case" plan from (0, 0) to (n-1, n-1), you could (and will) be able to read it off the policy because there is a best action for every state. You will be asked to provide this in your assignment.

## Reading the Map

To avoid global variables, we have a <code>read_world()</code> function that takes a filename and returns the world as `List` of `List`s. **The same coordinates reversal applies: (x, y) is world[ y][ x] as from PR01.**

In [2]:
def read_world( filename):
    with open( filename, 'r') as f:
        world_data = [x for x in f.readlines()]
    f.closed
    world = []
    for line in world_data:
        line = line.strip()
        if line == "": continue
        world.append([x for x in line])
    return world

Next we create a dict of movement costs. Note that we've negated them this time because RL requires negative costs and positive rewards:

In [3]:
costs = { '.': -1, '*': -3, '^': -5, '~': -7}
costs

{'*': -3, '.': -1, '^': -5, '~': -7}

and a list of offsets for `cardinal_moves`. You'll need to work this into your actions, A, parameter.

In [4]:
cardinal_moves = [(0,-1), (1,0), (0,1), (-1,0)]

And now the confusing bits begin. We must program both the Q-Learning algorithm and a *simulator*. The Q-Learning algorithm doesn't know T but the simulator *must*. Essentially the *simulator* is any time you apply a move and check to see what state you actually end up in (rather than the state you planned to end up in).

The transition function your *simulation* should use, T, is 0.70 for the desired direction, and 0.10 each for the other possible directions. That is, if I select "up" then 70% of the time, I go up but 10% of the time I go left, 10% of the time I go right and 10% of the time I go down. If you're at the edge of the map, you simply bounce back to the current state.

You need to implement `q_learning()` with the following parameters:

+ world: a `List` of `List`s of terrain (this is S from S, A, T, gamma, R)
+ costs: a `Dict` of costs by terrain (this is part of R)
+ goal: A `Tuple` of (x, y) stating the goal state.
+ reward: The reward for achieving the goal state.
+ actions: a `List` of possible actions, A, as offsets.
+ gamma: the discount rate
+ alpha: the learning rate

you will return a policy: 

`{(x1, y1): action1, (x2, y2): action2, ...}`

Remember...a policy is what to do in any state for all the states. Notice how this is different that A\* search which only returns actions to take from the start to the goal. This also explains why `q_learning` doesn't take a `start` state!

You should also define a function `pretty_print_policy( cols, rows, policy)` that takes a policy and prints it out as a grid using "^" for up, "<" for left, "v" for down and ">" for right. Note that it doesn't need the `world` because the policy has a move for every state. However, you do need to know how big the grid is so you can pull the values out of the `Dict` that is returned.

```
vvvvvvv
vvvvvvv
vvvvvvv
>>>>>>v
^^^>>>v
^^^>>>v
^^^>>>G
```

(Note that that policy is completely made up and only illustrative of the desired output).

There are a lot of details that I have left up to you. For example, when do you stop? Is there a strategy for learning the policy? Watch and re-watch the lecture on Q-Learning. Ask questions. You need to implement a way to pick initial states for each iteration and you need a way to balance exploration and exploitation while learning. You may have to experiment with different gamma and alpha values. Be careful with your reward...the best reward is related to the discount rate and the approxmiate number of actions you need to reach the goal.

* If everything is otherwise the same, do you think that the path from (0,0) to the goal would be the same for both A\* Search and Q-Learning?
* What do you think if you have a map that looks like:

```
><>>^
>>>>v
>>>>v
>>>>v
>>>>G
```

has this converged? Is this a "correct" policy?

**I strongly suggest that you implement Value Iteration on your own to solve these problems.** Why? Because Value Iteration will find the policy to which Q-Learning should coverge in the limit. If you include your Value Iteration implementation and output, I will count it towards your submission grade.

Remember that you should follow the general style guidelines for this course: well-named, short, focused functions with limited indentation using Markdown documentation that explains their implementation and the AI concepts behind them.

This assignment sometimes wrecks havoc with IPython notebook, save often. Put your helper functions here, along with their documentation. There should be one Markdown cell for the documentation, followed by one Codecell for the implementation.  Additionally, you may want to start your programming in a regular text editor/IDE because RL **takes a long time to run**.

----

### Common Helper Functions ###
The following functions are used by both Q-learning and Value Iteration.

** transCosts(world, moves, costs, goal, reward) **  
The function produces all valid transitions from all locations/states as well as the cost/reward for said transitions.  

The input to the functions are: world represented as list of lists of terrains as represented by strings, the entire moveset as cartesian offsets, the costs as a dictionary where the keys are the terrains as strings, the goal as a tuple of list indices, and reward as a int 

The function loops through all states, and for every valid states (i.e. passible and non-goal states), attemps all actions in the `moves` input from the state. Only valid transitions are added to the output, meaning that a transition onto impassible terrain or out of bound are not returned. For all transition, a corresponding cost/reward is attached to all transitions either as the cost of the destination terrain, or the reward if transition from the state results in the goal.

The output is a list of list, where the indices represent the indices of the world. For example, `R[0][4] = [((0, 1), -1), ((-1, 0), 9)]` means that from the state of `[0][4]`, one could either move up (cartesian offset of (0,1)) incurring a cost of 1, or move left, earning a reward of 9. 

** getPolicyFromQ(Q, Rs) **  
The function takes a set Q-values and the corresponding transitions, and returns both the maximum Q-value and the corresponding transition. The input could be thought of as set of transitions from a specific state, and the Q-value of these transitions. The function finds the maximum Q-value and the move that produce this.

In [5]:
def transCosts(world, moves, costs, goal, reward):
    R = [[-9 for c in r] for r in world] 

    for r,row in enumerate(world):
        for c,terr in enumerate(row):
            if (world[r][c]=='x') or ((r,c) == goal):
                continue # skip impassible or goal state

            acts = list()            
            for m,offset in enumerate(moves):
                x,y = (r+offset[1], c+offset[0]) # new coordinates
                if (0<=x<len(world) and 0<=y<len(row)) and (world[x][y] != 'x'):
                    #acts.append( (offset, costs[world[r][c]]) )
                    if (x,y) == goal:
                        acts.append((offset, reward ))
                    else:
                        acts.append((offset, costs[world[x][y]] ))
            R[r][c] = acts

    R[goal[0]][goal[1]] = [0]
    return R

def getPolicyFromQ(Q, Rs):
    moves = [m for m,r in Rs] # all moves
    V = max(Q) # max reward of all moves
    policy = moves[Q.index(V)] # mark policy of the max reward
    return (V, policy)

** pretty_print_policy(world, goal, policy) **  
The function takes a policy as a dictionary and prints the policy as a rectangular representation where each location represent the "best move" from the location. The dictionary representing the policy has keys that are indices to the list of list representing the world, and the values are the cartesian offset to these indices. 

The function loops over every location in the world and printing the directional symbol as indicated by the policy. For impassible and goal locations, the functions prints `x` and `G` respectively

In [6]:
def pretty_print_policy(world, goal, policy):
    m = {(0,1):'v', (1,0):'>', (0,-1):'^', (-1,0):'<', -1:'x'}
    out = copy.deepcopy(world) # pre-allocate
    
    for r,row in enumerate(world):
        for c,terrain in enumerate(row):
            out[r][c] = m[policy.get((r,c),-1)] # if cannot find mark as impassible
    out[goal[0]][goal[1]] = 'G' # mark the goal

    for row in out:
        print('\t' + ''.join(row))

-----

### Q Learning Helper Functions ###
These functions are only used by Q-Learning Program

** pickStartingState(world, goal) **  
The function picks a random starting point, and returns the first valid one found (i.e. passible and non-goal state).

** initializeQs(Ts) **  
Given a set of transitions for every state, the function initializes a corresponding Q-value set for all states and all transitions for each of the state, setting them all to 0. This is used to initialize Q for the Q-Learning program.

** randomizeAction(nActs, desired, unplanned) **  
Given the number of valid actions, the index of the desired action, the probability of resulting in an unplanned transition, the function probabilistically return the index of the action taken.

The `unplanned` represents the probability of taking one singular undesired action. Depending on the number of valid action, the probability of picking the desired action is (1 - Pr(unplanned) * (number of valid actions - 1)). If only one valid action is present, the function always take the singular action.

** getDesiredAction(Q, Ts, visits) **  
The E-Greedy algorithm that determines which action is taken from a specific state. Hard-coded 50% probability of picking a random action, with 50% of picking based on the current Q-value.

In [7]:
def pickStartingState(world, goal): # pick start randomly
    while True:
        r = random.randrange(len(world))
        c = random.randrange(len(world[r]))
        if (world[r][c]!='x') and ((r,c)!=goal):
            return (r,c)
        
def initializeQs(Ts):
    Qs = copy.deepcopy(Ts)
    for r,row in enumerate(Ts):
        for c,acts in enumerate(row):
            if type(acts) is int:
                continue
            else:
                Qs[r][c] = [0 for a in acts]
    return Qs

def randomizeAction(nActs, desired, unplanned):
    if nActs == 1:
        return desired
    
    undesired = [x for x in xrange(nActs) if x!=desired]
    r = random.random()
    thresh = unplanned * (nActs-1)
    if r > thresh:
        return desired
    else:
        ind = int(math.floor(r / thresh * (nActs-1)))
        return undesired[ind]

def getDesiredAction(Q, Ts, visits):
    if random.random() < 0.5:
        return random.randrange(len(Q)) # randomly pick desired action
    else:
        return Q.index(max(Q)) # pick based on largest Q

### Q Learning Program ###

The program is split into two functions, where `q_learning(...)` is the overall Q-Learning program, and `q_episode(...)` simulates a singular episode of Q-Learning.

** q_episode(Qs, Ts, a, g, pr, world, goal) **  
Runs one episode of Q-Learning. The function simulates one trip from a randomized starting state to a goal state.

From the starting state, the function uses E-Greedy algorithm with 50% chance of picking based by current Q-value, and 50% randomly. For the chosen state, the function then updates the `Q[s,a]` according to the algorithm, using the set of transition generated by `transCosts()` function. The function stops when the current state is a goal state.

** q_learning( world, costs, goal, reward, actions, gamma, alpha) **  
This is the main function of Q-Learning program. The function starts off by generating all valid transitions for all states in the world. It then initializes set of Qs for these state-transition combinations. With a hard-coded unplanned action probability of 0.1, the program performs 1000 episodes of Q-learning using the `q_episode(...)` function. With the generated Q, the function then determines the policy based on these Q-values by looping through all valid states using `getPolicyFromQ(...)`.

In [8]:
def q_episode(Qs, Ts, a, g, pr, world, goal):
    visits = [[0 for x in row] for row in world] # visit count
    (r, c) = pickStartingState(world, goal) # pick starting poisition
    
    while (r,c) != goal:
        visits[r][c] += 1 # increment visit count for start position
        nActions = len(Ts[r][c])
        desired = getDesiredAction(Qs[r][c], Ts[r][c], visits)
        actual = randomizeAction(nActions, desired, pr)
        offset,reward = Ts[r][c][actual]

        r_new, c_new = (r+offset[1], c+offset[0])
        maxQ = max(Qs[r_new][c_new])
        Qs[r][c][desired] = (1-a)*Qs[r][c][desired] + a * (reward + g*maxQ)
        r,c = (r_new, c_new) # update the state
        
    return Qs
    
def q_learning( world, costs, goal, reward, actions, gamma, alpha):
    Ts = transCosts(world, actions, costs, goal, reward) # unk. transitions
    Qs = initializeQs(Ts)
    
    P_unplan = 0.1
    
    t = 0
    while t < 1000:
        Qs = q_episode( Qs, Ts, alpha, gamma, P_unplan, world, goal)
        t += 1
    
    policy = dict()
    for r,row in enumerate(world):
        for c,terr in enumerate(row):
            if (world[r][c]=='x') or ((r,c) == goal):
                continue # skip impassible or goal state
            tmp, policy[(r,c)] = getPolicyFromQ(Qs[r][c], Ts[r][c])
    
    return policy


## Small World

In [9]:
test_world = read_world( "small.txt")
test_world

[['.', 'x', '.', '.', '.', 'x'],
 ['.', '^', '.', '*', '.', '^'],
 ['.', '^', '.', '^', '.', '*'],
 ['.', '^', '.', '^', '.', '*'],
 ['.', '.', '.', 'x', '.', '.']]

In [10]:
reward = 100
goal = (4,5)
disc = 0.85
alpha = 0.25

test_policy = q_learning(test_world, costs, goal, reward, cardinal_moves, disc, alpha)

In [11]:
pretty_print_policy(test_world, goal, test_policy)

	vx>>vx
	v>v>vv
	v>>v>v
	v>>>vv
	>>^x>G


## Full World

In [12]:
full_world = read_world( "world.txt")

In [13]:
reward = 2000
goal = (26,26)
disc = 0.7
alpha = 0.25

full_policy = q_learning(full_world, costs, goal, reward, cardinal_moves, disc, alpha)

In [14]:
pretty_print_policy(full_world, goal, full_policy)

	vvv<vvv<<<<>>>>>>>>>><<<<><
	^<^<<<<<<vvv>>>>^^xxxxxxx>^
	^^^<xx^<<v<<<>>^^xxxv<vxx^^
	^>^^<xxx>vv<<^^>>>vvv<<xx>v
	v><^^xx>>vv<<>>^^vvvvvxxx>v
	v<^^xx>v>vv<<<>^>>^><<<x>>v
	^<<xx>>>v<v<xxx^^>>>^<<>>>v
	v<v<>>>^^<^<<<xxxv^v><<v>vv
	v<<<<>>>^<><<^xxv<>v<^vv>>v
	^^<^^>>v^^^xxxxv<v>^<^>>vvv
	^^^<>^^^<^xxx>>>^<vvxxx>vvv
	^^<<vv>^<xx>v<^^<>><<xxv^>v
	^<vv>><<<xxv^v^^>>>v<x>v>vv
	v<vv>>v<vv>><<<<^v>^<<><vvv
	v<<<x>^<<<>^^^^<vv>v<<x>vvv
	v><xxx^<<<^vxxx^>v<^^xx>>>v
	vvxx>>^><^>vvvxxx><xxxvv>vv
	><<xx>^^^>v>>>vvxxxx>>>>>>v
	^^^xxx^<^v>>v>v<<<>>>>>>>vv
	v<^<xxx>>v>>>>>vv>vv>^^vvvv
	^^<v<vxx>>>>>^x>>>vv<>>vv>v
	^^vvvvvxxx^^xx>>^^<^vv>vv>v
	^<<<<<<vvxxxxv<^^^>>vv>vv>v
	^^^^^<vvv>v>><^^^xx>>>>vvvv
	^x^^^vvv><<xxx^<xxvxx>>v>vv
	^xxx>v>><v^vxxxx>vvvxxx>>vv
	>>><<<><<<<<<<>>>>>>>>>>>>G


-----

## Value Iteration (if submitting)

Provide implementation and output of policy.

### Value Iteration Helper Functions ##

** getQs(r, c, Rs, V_last, T_unplanned, gamma) **  
This function calculates the Q-value for the value iteration algorithm for a specific state. 

The function uses a doubly-nested loop of each transitions from the state, and additively calculates the Q by using gamma, the probability of planned move, and the corresponding V-value from the last iteration. 

** maxDiffInV(v1, v2) **  
Given two set of V values, this function calculates the largest difference for all states. This is used to derive the convergence creteria.

In [15]:
def getQs(r, c, Rs, V_last, T_unplanned, gamma):
    Q = [0 for tmp in Rs]
    T_planned = 1 - (3-len(Rs))*T_unplanned # probability of planned move

    for m,(tmp,reward) in enumerate(Rs):
        Q[m] += reward
        for n,(xy,tmp2) in enumerate(Rs):
            pr = T_planned if m==n else T_unplanned
            x,y = (r+xy[1], c+xy[0])
            Q[m] += gamma * (pr * V_last[x][y])
    return Q

def maxDiffInV(v1, v2):
    tmp = zip([x for v in v1 for x in v], [x for v in v2 for x in v])
    return max([abs(x-y) for x,y in tmp])

### Value Iteration Program ###

The function starts off by pre-calculating all valid transitions for all states. It uses a hard-coded convergence criteria of 1E-10 of largest utility differences and the probability of taking an unplanned move of 10%. 

Based on the Value Iteration algorithm, the program loops through all states, and for all valid transition of each states, it updates the Q-value based on the reward associated for each transition. 

At the end of one iteration, it calculates the maximum utility differences of all states, and stops the algorithm when the convergence criteria is reached. The convergence criteria is that the maximum difference be lower than 1E-10, or 1000 iterations, whichever is reached first.

In [16]:
def value_iteration(world, costs, goal, reward, actions, gamma):
    Rs = transCosts(world, actions, costs, goal, reward)
    eps = 1E-10
    T = 0.1
    
    V = [[0 for c in r] for r in world] # pre-allocate with zeroes
    #V[goal[0]][goal[1]] = reward # define reward
    Q = [[0 for c in r] for r in world] # pre-allocate with zeroes
    pols = dict() # dict for storing policy moves
    
    convg, t = (False, 0) # loop stoppage criteria
    while (not convg) and (t < 1000): # max 1000 iterations if not converge
        V_last = copy.deepcopy(V) # copy V_last
        for r,row in enumerate(world):
            for c,terr in enumerate(row):
                if (world[r][c]=='x') or ((r,c)==goal):
                    continue # skip impassible or goal states
                Q[r][c] = getQs(r, c, Rs[r][c], V_last, T, gamma)
                V[r][c], pols[(r,c)] = getPolicyFromQ(Q[r][c], Rs[r][c])
                
        maxDiff = maxDiffInV(V, V_last)
        convg,t = (maxDiff < eps),t+1
        # print('Iter #' + repr(t) + ', max utility diff = ' + repr(maxDiff) )
    print('Total iterations: '+repr(t)+', max util. diff = '+ repr(maxDiff) )
    return pols

In [17]:
reward = 100
goal = (4,5)
disc = 0.7

a = value_iteration( test_world, costs, goal, reward, cardinal_moves, disc)
pretty_print_policy( test_world, goal, a)

Total iterations: 75, max util. diff = 9.526246458335663e-11
	vxvvvx
	v>v>v<
	>>>>vv
	>>^>>v
	^^^x>G


In [18]:
reward = 1000
goal = (26,26)
disc = 0.7

b = value_iteration( full_world, costs, goal, reward, cardinal_moves, disc)
pretty_print_policy( full_world, goal, b)

Total iterations: 175, max util. diff = 7.567280135845067e-11
	vvvvvvvvvvvvvvvvv<<<<<>>>vv
	>vv<>>>>>vvvvvvv<<xxxxxxxvv
	>vv<xx^>>>>>>>vvvxxxvv<xxvv
	>vv<<xxx>>>>>>>vvvvvv<<xxvv
	>v<<<xxvv>>>>>>>>vvvvvxxxvv
	>v<<xxvvvv^^^^>>>>>vvvvxvv<
	>vvxxv>vvv<^xxx^>>>>vvvvvv<
	>vvvvvvvv<<<<<xxx>>>>>>vvv<
	>>v>>>vv<<<<<<xx>>>>>>>>vv<
	>>>>>>vv<<<xxxx>>>>^^^>>vv<
	>>>>>vvv<<xxxv>>>^^^xxx>vv<
	>>>>>vvv<xxvv>>>^^^<<xx>vv<
	>>>>>>vvvxxvvv^^>>^vvx>>vv<
	>>^^>>>vvvvv<<<>>>>>>>>>vv<
	>>^^x>>>vvv<<^^^>>>^^^x>vv<
	>^^xxx>>vvvvxxx^>^^^^xxvvv<
	^^xx>>>>>vvvvvxxx^^xxxvvv<<
	>^<xx>>>>>>vvvvvxxxxvvvvv<<
	>^<xxx>>>>>>>>>vvvvvvvvvvv<
	>^<<xxx>>>>^^^>>>v>>>>vvvv<
	>^<<<vxx^^^^^^x>>>>v>>vvvv<
	>^<<<vvxxx^^xx>>>>>>>>>vvv<
	>^^<<<vvvxxxx>>>^^>>>>>>vv<
	^^^^^>vv>>>>>>^^^xx^>>>>vv<
	^x^^>>>>>>^xxx^^xxvxx^>>>vv
	^xxx>>^^^^^<xxxx>>vvxxx>>vv
	^>>>>^^^^^^<<<>>>>>>>>>>>>G
