"""Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

https://en.wikipedia.org/wiki/Q-learning"""

In [165]:
pip install torchvision

Collecting torchvision
  Downloading torchvision-0.9.1-cp38-cp38-macosx_10_9_x86_64.whl (13.2 MB)
[K     |████████████████████████████████| 13.2 MB 8.7 MB/s 
Collecting torch==1.8.1
  Downloading torch-1.8.1-cp38-none-macosx_10_9_x86_64.whl (119.6 MB)
[K     |████████████████████████████████| 119.6 MB 49.4 MB/s 
Installing collected packages: torch, torchvision
Successfully installed torch-1.8.1 torchvision-0.9.1
Note: you may need to restart the kernel to use updated packages.


In [175]:
#imports used for task2 
import torch
is_cuda_available = torch.cuda.is_available()
device = torch.device("cuda" if is_cuda_available else "cpu")
print(device)
from typing import Tuple
from q_maze import QMaze, Action
import numpy as np
import pandas as pd 
from e_greedy_pol import E_greedy_policy
import random as random

cpu


### Computing action value functions using E_greedy_policy

For this method we will use the E_greedy_policy to help us compute and estimate the **q-value** of each state.

The **q-value** is the **mean** expected future reward following an action from a given state. Rather than storing all of our experience and taking the mean over them, we can use each experience to update an exponentially weighted average forget that exprience.







In [167]:
#Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly
class E_greedy_policy:
    def __init__(self, epsilon, decay):

        self.epsilon = epsilon #initial value of epsilon
        self.epsilon_start = epsilon
        self.decay = decay #parameter used to control how much the agent should explore and exploit when using epislon-greedy policy.

    # This function is used to select the action with max values 
    #For us to be able to select max values we need to know the state and the q_values
    def __call__(self, state, q_values): 

        is_greedy = random.random() > self.epsilon

        if is_greedy:
            # we select a greedy action by getting the max q_values from the grid
            action_index = np.argmax(q_values[state])
        else:
            # else we get a random choice from action
            action_index = random.choice(list(Action)).value.index
        #while selected_action = None
        selected_action = None
        #we pick an action from our possible action moves
        for a in list(Action):
            if a.value.index == action_index:
                selected_action = a
        return selected_action

    #TODO understand what this is doing
    def update_epsilon(self):
        self.epsilon = self.epsilon * self.decay

    #TODO do we need this?
    def reset(self):
        self.epsilon = self.epsilon_start

In [169]:
class Qlearning:
    """Instant diff parameters for calc Q"""
    def __init__(self, policy, env, gamma, alpha):
        self.policy = policy
        self.gamma = gamma
        self.alpha = alpha

        self.env = env.size
        self.coord_to_index_state = env.coord_to_index_state

        self.q_values = np.zeros( (self.env * self.env,(len(list(Action) ))))


    #We are updating the values from our q.values table after each step
    def update_values(self, s_current, action_next, r_next, s_next, action_next_next):

        self.q_values[s_current, action_next] = self.q_values[s_current, action_next] + self.alpha * (
        r_next+ self.gamma * self.q_values[s_next, action_next_next]- self.q_values)



    #we are assigning the maximum qvalues to the grid in the maze so we can calculate the optimum route the agent must take in order to maximise reward
    def new_values(self):

        value_grid = np.zeros((self.env, self.env))

        for i in range(self.env):
            for j in range(self.env):

                s = self.coord_to_index_state[i, j]

                value_grid[i, j] = max(self.q_values[s])

        return value_grid




In [170]:
maze = QMaze(20)
maze.reset()
maze.display()

X X X X X X X X X X X X X X X X X X X X 
X X . X . X X . A . . . X . X X . X . X 
X . . . . . . . X X X X X . X X . . . X 
X . X X X X X . X X . . . . . X . X X X 
X X X X X X X . X . . X X X X X . . . X 
X X . . . . . . X . X X . X . X . X X X 
X . . X X X X . X . . . . . . X . X . X 
X . X X . . . . . . X X . X . . . . . X 
X X X X X X X . X X X . . X . X . X . X 
X X . X X . . . . . X X X X . X . X . X 
X . . . X X X . X X X . X X X X X X . X 
X X X . X . . . . X X . . . X . X X . X 
X . . . X X X X . X . . X . . . X X . X 
X . X . . X X X . . . X X X X X X . . X 
X . X X . . . . . X . . X X . . X X . X 
X . . X X . X . X X X . . . . X X . . X 
X . X X X . X . . X X . X X X X X X X X 
X X X . X X X X . . X . . . X . X . . X 
X . . . . . . . . X X . X . . . . . X X 
X X X X X X X X X X X X X X X X X O X X 



## An epsilon-greedy policy
We can combine our random policy and our greedy policy to make an improved policy that both explores its environment and exploits its current knowledge. An $\epsilon$-greedy (epsilon-greedy) policy is one which exploits what it knows most of the time, but with probability $\epsilon$ will instead select a random action to try.

## Do we need to keep exploring once we are confident in the values of states?

As our agent explores more, it becomes more confident in predicting how valuable any state is. Once it knows a lot, it should start to explore less and exploit what it knows more. That means that we should decrease epsilon over time.

Let's implement it

In [171]:

epolicy = E_greedy_policy(1, 0.999)
epolicy.reset()
qlearning = Qlearning(epolicy,maze, 0.9, 0.1)

s = maze.reset()
done = False

In [172]:
#we are trying to use epsilon after each step 
turns_elapsed = []
while not done:
    action = epolicy(s,qlearning.q_values)
    s_next,r, done = maze.step(action)
    next_action = epolicy(s_next, qlearning.q_values)

    qlearning.update_values(s, action.value.index,r,s_next,next_action.value.index)
    eps_policy.update_epsilon()

    s = s_next
   # self, s_current, action_next, r_next, s_next, action_next_next


   values = qlearning.new_values()


    

ValueError: setting an array element with a sequence.