## Name: Andrew Caide
### CSCI S-89C Deep Reinforcement Learning      
### Part II of Assignment 6      

## Problem 1 (10 points)

Consider Environment that has five states: 1, 2, 3, 4, and 5. Possible transitions are: (1) 1->1, 1->2; (2) 2->1, 2->2, 2->3; (3) 3->2, 3->3, 3->4; (4) 4->3, 4->4, 4->5; (5) 5->4, 5->5.

Actions of the Agent are decoded by -1, 0, and +1, which correspond to its intention to move left, stay, and move right, respectively. The Environment, however, does not always respond to these intentions exactly, and there is 10% chance that action 0 will result in moving to the left (if moving to the left is admissible), and +1 action will result in staying - in other words, there is an "east wind." More specifically, the non-zero transition probabilities $p(s^\prime,r|s,a)$ are<br>

$p(s^\prime=1,r=0|s=1,a=0)=1$,<br>
$p(s^\prime=1,r=0|s=1,a=+1)=0.1,p(s^\prime=2,r=0|s=1,a=+1)=0.9$,<br>

$p(s^\prime=1,r=0|s=2,a=-1)=1$,<br>
$p(s^\prime=1,r=0|s=2,a=0)=0.1,p(s^\prime=2,r=0|s=2,a=0)=0.9$,<br>
$p(s^\prime=2,r=0|s=2,a=+1)=0.1,p(s^\prime=3,r=1|s=2,a=+1)=0.9$,<br>

$p(s^\prime=2,r=0|s=3,a=-1)=1$,<br>
$p(s^\prime=2,r=0|s=3,a=0)=0.1,p(s^\prime=3,r=1|s=3,a=0)=0.9$,<br>
$p(s^\prime=3,r=1|s=3,a=+1)=0.1,p(s^\prime=4,r=0|s=3,a=+1)=0.9$,<br>

etc.

Further, we assume that whenever the process enters state 3, the Environment generates reward = 1. In all other cases the reward is 0. For example, transition 2->3 will result in reward 1, transition 3->3 will result in reward 1, transition 3->2 will result in reward 0, transition 2->2 will result in reward 0, etc.



Further, assume that the agent does not know about the wind or what rewards to expect. It chooses to stay in all states, i.e. the policy is
$\pi(-1|1)=0, \pi(0|1)=1, \pi(+1|1)=0$,<br>
$\pi(-1|2)=0, \pi(0|2)=1, \pi(+1|2)=0$,<br>
$\pi(-1|3)=0, \pi(0|3)=1, \pi(+1|3)=0$,<br>
$\pi(-1|4)=0, \pi(0|4)=1, \pi(+1|4)=0$,<br>
etc.

# GOAL:
Please estimate the state-value function using one-step Temporal Difference (TD) prediction. Let’s use $\gamma=0.9$ and run the episodes for $T=100$.

---

### States:
$S \in\{S_{1},S_{2},S_{3},S_{4},S_{5} \} $;    



#### Actions:
$A \in\{-1, 0, 1\} if S  \in\{S_{2},S_{3},S_{4}\}$;     
$A \in\{0, 1\} if S == S_{1}$;  
$A \in\{-1, 0\} if S == S_{5}$;  

In [1]:
import random
import numpy as np
import pandas as pd

class Environment:
    def __init__(self, S0 = 1):
        self.time = 0
        self.state = S0

    def admissible_actions(self):
        A = list((-1,0,1))
        if self.state == 1: A.remove(-1)
        if self.state == 5: A.remove(1) 
        return A
    
    def check_state(self):
        return self.state
    
    def set_state(self, new_state):
        self.state = new_state
        
    def get_stateValue(state):
        statValues = {1:0, 2:0, 3:1, 4:0, 5:1}
        return stateValues[state]

    def get_reward(self, action):
        self.time += 1
        move = action
        # If (we're not in s1 and moving to the right) or (if we're on the right and move == 1)
        if (self.state > 1 and move > -1) or (self.state == 1 and move > 0):
            move = np.random.choice([move-1, move],p=[0.1,0.9])
        self.state += move
        
        if self.state == 3:
            reward = 1
        else:
            reward = 0
        return reward
        
#########
    
class Agent:
    def __init__(self):
        self.current_reward = 0.0

    def step(self, env, take_action = False, policy = 1):
        
        # If taking action: policy 1 = random movement, policy 2 = converge to S3
        if take_action:
            action_selected = random.choice(env.admissible_actions())
        else:
            action_selected = 0 # Stay put
        
        reward = env.get_reward(action_selected)            
        self.current_reward = reward

In [2]:
def one_step_td(n=1, State0=1, gamma=0.9, alpha = 0.1, act = False, episodes = 100):
    
    if not State0:
        S0 = random.randint(2,5)
    
    # Initialize V(s) arbitrarily for all S
    V = pd.DataFrame(0, index=range(1,6), columns={"V"})
    

    # Loop for each episode
    for i in range(episodes):
        
        # Set up Env and conditions
        env = Environment(State0)
        agent = Agent()
        T = float('inf')
        t=0
        states = [State0]
        rewards = [0]
        
        while True:
            t +=1
            if t<T:
                # Take an action according to policy (dont't act!)
                agent.step(env, act)
                # Observe and store nexrt reward and states
                states.append(env.check_state())
                rewards.append(agent.current_reward)
              
                #1 is a terminal state; we can't move once we hit S1
                if env.check_state() == 1: 
                    T = t + 1
                
            tao = t-n+1
            if tao>= 0:
                returns = 0
                # Compute G
                for t in range(tao + 1, min(T, tao+n)):
                    returns += gamma**(t-tao-i) * rewards[t]
                if tao + n-1 < T:
                    # G <- G + gamma**V(S_{tao+n})
                    # Having issues indexing, I think something weird is going on here.
                    returns += gamma**n + V.loc[states[tao+n-1]].values[0]
                
                # Find the state to update from list, update it
                state_to_update = states[tao-1]
                if state_to_update != 1:
                    V.loc[state_to_update] += alpha * (returns-V.loc[state_to_update])
            if tao == T-1:
                break
    return V

---

Let's take a look at the values if we set the starting point to be stage 5:

In [3]:
one_step_td(State0 = 5,n = 1, alpha = 0.5, episodes = 100) 

Unnamed: 0,V
1,0.0
2,2.581481
3,8.357125
4,13.39557
5,15.941986


Weird, I'm expecting 0's at 5 if we're doing n==1...

Below is a bootstrap result from 1 to 5

In [4]:
V = [one_step_td(State0 = n,n = 1, alpha = 0.5, episodes = 100) for n in range(1,6)]
pd.concat(V, axis=1).mean()

V    0.000000
V    0.599145
V    3.148424
V    6.154287
V    7.926816
dtype: float64

## Problem 2 (10 points)

In this problem, the agent will obtain the optimal policy via Double Q-learning. Please run the Double Q-learning algorithm - make sure to generate each pair $(S_t,A_t)$ using $\varepsilon$-soft policy with respect to curret action-value function $(Q_1+Q_2)/2$. Use same $\gamma=0.9$ and $T=100$.

Does the final policy appear to be optimal?

In [11]:
class Agent:
    def __init__(self):
        self.current_reward = 0.0
        self.scan = 1
        self.scan_done = False

    def step(self, env, take_action = False, policy = 1):
        
        # If taking action: policy 1 = random movement, policy 2 = converge to S3
        if take_action:
            action_selected = take_action
        else:
            action_selected = 0 # Stay put
        
        reward = env.get_reward(action_selected)            
        self.current_reward = reward

In [12]:
def get_policy_from_q(q, eps):
    # Get the optimal policy given a specific state
    # Identify max value, set it as the e-greedy value
    if any(q ==0):
        policy = pd.DataFrame(eps/2, index=range(1,2), columns={-1,0,1})
        policy.loc[:,q.idxmax(1)] = 1-eps + eps/2
        policy.loc[:,q.idxmin(1)] = 0
    else:
        policy = pd.DataFrame(eps/3, index=range(1,2), columns={-1,0,1})
        policy.loc[:,q.idxmax(1)] = 1-eps + eps/3
    
    return(policy.values.tolist()[0])

In [13]:
def double_q(alpha = 0.1, gamma = 0.9, S0 = 5, episodes = 100, eps = .3):
    # initiate Q1, Q2 for all S and A, such that terminal = 0
    Q1 = pd.DataFrame(1, index=range(1,6), columns={-1,0,1})
    Q2 = pd.DataFrame(1, index=range(1,6), columns={-1,0,1})
    Q1.loc[1,:] = 0
    Q2.loc[1,:] = 0
    Q1.loc[5,1] = 0
    Q2.loc[5,1] = 0
    
    for episode in range(episodes):
        if not S0:
            S0 = random.randint(2,5)
        env = Environment(S0)
        agent = Agent()
        S = env.check_state()
        
        while True:
            action_vals = (Q1+Q2)/2
            Q = action_vals.loc[S]
            
            # Make greedy policy
            policy = get_policy_from_q(Q, eps)
            A = np.random.choice([0, 1,-1], 1, p=policy)[0] #This might be the problem
            
            # Take action A, observe R, S'
            agent.step(env, A)
            reward = agent.current_reward # R
            S1 = env.check_state()        # S'
            
            # With a 0.5 probability:
            if random.choice([True,False]):
                # Argmax Q1(S',*)
                best_q1_action = Q1.loc[S1].idxmax(1)
                
                Q1.loc[S,A] = Q1.loc[S,A] +\
                    alpha*(reward +\
                           gamma*Q2.loc[S1,best_q1_action] - Q1.loc[S,A])
                
            else:
                # Argmax Q2(S',*)
                best_q2_action = Q2.loc[S1].idxmax(1)
                
                Q2.loc[S,A] = Q2.loc[S,A] +\
                    alpha*(reward +\
                           gamma*Q1.loc[S1,best_q2_action] - Q2.loc[S,A])
            
            S = S1
            # If S is terminal:
            if S == 1:
                break
    Q_final = (Q1+ Q2)/2
    return Q_final.divide(Q_final.max(axis=1), axis=0)

In [16]:
qs = double_q(S0 = False)
qs

Unnamed: 0,0,1,-1
1,,,
2,0.782179,1.0,0.001219
3,1.0,0.934627,0.908086
4,0.861481,0.720623,1.0
5,0.701426,0.0,1.0


---

Yes - the action-values derived from the algorithm hone-in on $S_{3}$. Highly unoptimal state-action values have been identified as Q(2,-1), Q(4, {0,1}), etc. In most cases, moving to $S_{3}$ is prefered by an order of 2-3x the other action values in their respective states.
However, what's interesting is that $Q(3, {1,-1})$ aren't as low as I'd expect them to be.