# Lecture worksheet 23 solutions

## Question 1

### Q1.1 True/False

(a) In reinforcement learning, the rewards must be the same for all states.


**SOLUTION**: False: we set the rewards higher for states that we want to get to, and lower for states we want to avoid.




(b) Q-learning is used when the transition probabilities are unknown.



**SOLUTION**: True: Q-learning is used when we don't know the transition probabilities and instead only observe trajectories.
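To make this concrete, here is a minimal sketch of the standard tabular Q-learning update, which uses only an observed transition $(s, a, r, s')$ and never touches $T$. The function name `q_learning_update`, the learning rate `alpha`, and the table sizes are illustrative assumptions, not code from lecture, and the sketch ignores special handling of terminal states.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_new, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step from an observed transition (s, a, r, s_new)."""
    # Target: observed reward plus discounted best Q-value at the next state
    target = r + gamma * np.max(Q[s_new])
    # Move the current estimate a fraction alpha toward the target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical table with 3 states and 2 actions, updated with one observation
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_new=2)
```

Note that the update needs no transition probabilities: the sampled next state `s_new` stands in for the sum over $s'$.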


### Q1.2 Fill-in-the-blank and short-answer

(a) (*Choose one of s, a, or s' to fill in both blanks*) T(s, a, s') defines a probability distribution over ______ (in other words, if we fix the other two inputs and sum across all values of ______, T should sum to 1).

**SOLUTION**: $s'$: $T$ is a conditional distribution over new states, conditioned on the previous state and action.
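As a quick sanity check of this property, the sketch below stores a made-up transition model as a NumPy array and verifies that it sums to 1 over $s'$ for every fixed $(s, a)$. The array `T_table` and its sizes are hypothetical and not part of the worksheet.

```python
import numpy as np

# Hypothetical example: 3 states, 2 actions; T_table[s, a, s_new] = T(s, a, s')
rng = np.random.default_rng(0)
T_table = rng.random((3, 2, 3))
# Normalize over the last axis so each (s, a) pair gives a distribution over s'
T_table /= T_table.sum(axis=2, keepdims=True)

# Summing over s' (the last axis) gives 1 for every fixed (s, a)
sums_over_s_new = T_table.sum(axis=2)
print(np.allclose(sums_over_s_new, 1.0))

# Summing over s (the first axis) instead has no such guarantee
sums_over_s = T_table.sum(axis=0)
print(sums_over_s)
```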



(b) What is the benefit of using the dynamic programming algorithm for value iteration?

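**SOLUTION**: The dynamic programming algorithm is faster (more efficient): it stores the value of each state at each time step and reuses it, instead of recomputing the same values repeatedly as a naive recursive implementation would.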

## Question 2: Q-iteration

The following cell contains the code used in lecture for value iteration. Modify it to also save the best value of the Q-function $Q(s, a)$ in addition to the value function $V^*(s)$.

**SOLUTION**: Along with `V_arr`, we need to keep track of an array with the Q-values. We need an additional dimension, to keep track of the action as well as the state:

In [None]:
# Defined as part of the setup
import numpy as np

states = [...]  # all states (used as integer indices into the arrays below)
actions = [...]  # all actions (used as integer indices into the arrays below)
γ = 0.9  # Discount factor
T_max = 1000000  # max time
def T(s, a, s_new):
    """
    Probability of ending in state s_new when starting in state s and
    taking action a
    """
    pass
    
def R(s, a, s_new):
    """
    Reward for going from state s to state s_new by taking action a
    """
    pass
    
## FAST VERSION
# One extra column so that V_arr[:, T_max] = 0 is the value after the last step
V_arr = np.zeros([len(states), T_max + 1])
### SOLUTION: ADDED LINE
Q_arr = np.zeros([len(states), len(actions), T_max])

# Work backward in time, so V_arr[:, t+1] is already filled in when we need it
for t in range(T_max-1, -1, -1):
    for s in states:
        best_so_far = -np.inf
        for a in actions:
            q = 0  # Q(s, a)
            for s_new in states:
                q += (
                    T(s, a, s_new) * 
                    (R(s, a, s_new) + γ * V_arr[s_new, t+1])
                )
            ### SOLUTION: ADDED LINE
            Q_arr[s, a, t] = q
            best_so_far = max(best_so_far, q)
        V_arr[s, t] = best_so_far
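
One reason to keep the Q-values around is that the optimal action can be read off directly with an argmax over the action dimension. The sketch below illustrates this on a small random array `Q_demo`, a hypothetical stand-in with the same (state, action, time) layout as `Q_arr` above; the sizes are made up.

```python
import numpy as np

# Hypothetical stand-in for Q_arr: 4 states, 3 actions, 5 time steps
rng = np.random.default_rng(0)
Q_demo = rng.random((4, 3, 5))

# Index of the best action for each state and time step, shape (4, 5)
policy = np.argmax(Q_demo, axis=1)

# Taking the max over actions recovers the value function, shape (4, 5)
V_demo = np.max(Q_demo, axis=1)
```

Here `policy[s, t]` is the index of the action that maximizes `Q_demo[s, :, t]`, and `V_demo[s, t]` equals the Q-value of that action.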

## Question 3: GridWorld


Consider the following GridWorld environment:

![](grid_world.png)

where `start` represents the initial state, $\times$ represents an inaccessible state (like the example in lecture), and the $-100$ and $1$ states are terminal states with corresponding rewards. All other states have a reward of $0$.

### Q3.1

Assume state transitions are deterministic. In other words, if our action is to move in a particular direction, we always move in that direction; unless there is a wall, in which case we stay in that same state.

If $\gamma = 0.9$, compute the optimal value $V^*$ for each state.



**SOLUTION**: 

1. The two terminal states have undefined $V^*$ values, since there are no actions that can be taken from those states. 
2. The computation is easiest to do if we start with the state right next to the goal (reward=1) state. We know that the best possible expected sum of rewards comes from moving directly into the goal state. In this case, the reward for the transition is 1, and the future reward is 0 (since the process ends at a terminal state). So, the optimal value of that state is 1.
3. Next, we can do the two states next to it: for each of those, the optimal action is to move toward the goal. The reward for that transition is 0, and the value of that new state (the one next to the goal) is just as we computed in step 2 above, 1. So, we multiply that by our discount factor 0.9 to obtain a value of 0.9.
4. For the states next to the ones we just did, we can follow a similar process to obtain values of $0.9^2$, and then $0.9^3$ for the states next to those, and then finally $0.9^4$ for the state in the bottom left.
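
The same backward reasoning can be checked numerically. The sketch below runs value iteration on a hypothetical one-dimensional corridor (not the grid in the figure, whose layout is not reproduced here): five ordinary states in a row with a terminal goal of reward 1 at the right end, deterministic moves, and walls that keep the agent in place. All names in it are illustrative assumptions.

```python
gamma = 0.9
n = 5        # non-terminal states 0..4; state 5 is the terminal goal
goal = n

def step(s, a):
    """Deterministic move: a is -1 (left) or +1 (right); a wall keeps us in place."""
    s_new = min(max(s + a, 0), goal)
    reward = 1.0 if s_new == goal else 0.0
    return s_new, reward

# V[goal] stays 0: there is no future reward after reaching a terminal state
V = [0.0] * (n + 1)
for _ in range(50):  # far more sweeps than this tiny example needs to converge
    for s in range(n):
        V[s] = max(
            r + gamma * V[s_new]
            for s_new, r in (step(s, a) for a in (-1, +1))
        )

print(V[:n])
```

In this corridor, the state next to the goal ends up with value 1, and each step farther away multiplies the value by another factor of $0.9$, the same pattern as in steps 2 to 4.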


### Q3.2

Compute the optimal Q-function at the `start` state for each of the four actions. Based on your answer, what would the optimal policy be for this state (in other words, what is $\pi($ `start` $)$)?



**SOLUTION**:

Recall the definition of $Q$:
$$
Q(s, a) = \sum_{s'} T(s, a, s') \left[R(s, a, s') + \gamma V^*(s')\right]
$$

In this case:
* The transition probabilities are all 1 or 0, so the sum above has only one nonzero term.
* For all actions other than `right`, the reward is $0$; for `right`, the reward is $-100$.
* The value of the next state $s'$ is as we computed above in Q3.1.

Using this information, we can compute the Q-values as follows:
* Q(`start`, `right`) = $-100 + 0 = -100$
* Q(`start`, `up`) = $0 + \gamma \times 1 = 0.9$
* Q(`start`, `left`) = $0 + \gamma \times 0.9 = 0.9^2$
* Q(`start`, `down`) = $0 + \gamma \times 0.9^2 = 0.9^3$

The optimal action (i.e., the one corresponding to the largest value out of these four) is `up`.
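
The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the dictionary `outcomes` is a hypothetical encoding of the (reward, next-state value) pairs used in the solution, with the terminal next state contributing a future value of 0.

```python
gamma = 0.9

# (immediate reward, V* of the next state) for each action taken at `start`
outcomes = {
    "right": (-100.0, 0.0),   # enters the terminal -100 state
    "up": (0.0, 1.0),
    "left": (0.0, 0.9),
    "down": (0.0, 0.9 ** 2),
}

# With deterministic transitions, Q(s, a) = R + gamma * V*(s') for the single s'
q_values = {a: r + gamma * v for a, (r, v) in outcomes.items()}

# The optimal policy picks the action with the largest Q-value
best_action = max(q_values, key=q_values.get)
print(q_values, best_action)
```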