## Introduction
This notebook is based on the video [An Introduction to Q-Learning](https://youtu.be/wN3rxIKmMgE?list=PLIfPjWrv526bMF8_vx9BqWjec-F-g-lQO) by TheComputerScientist.

After introducing the [Q-Learning](https://en.wikipedia.org/wiki/Q-learning) algorithm, the notebook implements it and tries it out in the **FrozenLake-v1** environment, in its **non-slippery** version (for now, the case of a stochastic environment is left aside).

***

## Imports and environment setup

In [8]:
import gym
import random
import time
import numpy as np
from IPython.display import clear_output

Next, **FrozenLakeNoSlip-v1** is added to the set of registered environments:

In [9]:
from gym.envs.registration import register

try:
    register(
        id="FrozenLakeNoSlip-v1",
        entry_point="gym.envs.toy_text.frozen_lake:FrozenLakeEnv",
        kwargs={"map_name": "4x4", "is_slippery": False },
        max_episode_steps=100,
        reward_threshold=0.70,  # optimum = 0.74
    )
except Exception:
    # Some Gym versions raise if the id is already registered; ignore that case
    pass



***

## Creating the environment and a naive agent

In [10]:
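# Note: "FrozenLake-v1" is Gym's default (slippery) variant; the deterministic
# one registered above is "FrozenLakeNoSlip-v1".
env_name = "FrozenLake-v1"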
env = gym.make(env_name)

In [11]:
print("Observation space of", env_name, "environment\n", env.observation_space)
print("Action space of", env_name, "environment\n", env.action_space)

Observation space of FrozenLake-v1 environment
 Discrete(16)
Action space of FrozenLake-v1 environment
 Discrete(4)
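
The two spaces can be read as follows: the 16 observations index the tiles of the 4x4 grid row by row, and the 4 actions are the move directions. Below is a minimal sketch of that encoding, assuming the default 4x4 map layout and the action order LEFT, DOWN, RIGHT, UP used by Gym's FrozenLake. The helper names are illustrative and not part of Gym.

```python
# Default 4x4 FrozenLake layout: S = start, F = frozen, H = hole, G = goal.
MAP_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(MAP_4X4[0])
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def state_to_cell(state):
    """Convert a flat state index into (row, col) grid coordinates."""
    return divmod(state, N_COLS)


def cell_to_state(row, col):
    """Convert (row, col) grid coordinates into a flat state index."""
    return row * N_COLS + col


def tile_at(state):
    """Return the map character of the tile for a given state index."""
    row, col = state_to_cell(state)
    return MAP_4X4[row][col]


holes = [s for s in range(16) if tile_at(s) == "H"]
print(state_to_cell(5), tile_at(5))  # (1, 1) H
print(holes)                         # [5, 7, 11, 12]
print(tile_at(15))                   # G
```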


The `Agent` defined below is the version built in the [OpenAIGym-02](https://github.com/dgdeleonardis/gym-notebooks/blob/main/notebooks/OpenAIGym-02.ipynb) notebook, designed to work with both discrete and continuous action spaces:

In [12]:
class Agent:
    """Baseline agent that picks actions uniformly at random."""

    def __init__(self, env):
        self.is_discrete = isinstance(env.action_space, gym.spaces.Discrete)

        if self.is_discrete:
            self.action_size = env.action_space.n
        else:
            self.action_low = env.action_space.low
            self.action_high = env.action_space.high
            self.action_shape = env.action_space.shape

    def get_action(self, state):
        if self.is_discrete:
            # Uniform choice among the discrete actions
            action = random.choice(range(self.action_size))
        else:
            # Uniform sample inside the continuous action bounds
            action = np.random.uniform(self.action_low,
                                       self.action_high,
                                       self.action_shape)
        return action

A quick test follows:

In [13]:
agent = Agent(env)
state = env.reset()

done = False
while not done:
    action = agent.get_action(state)
    state, reward, done, info = env.step(action)
    print("state:", state, "action:", action)
    env.render()
    time.sleep(0.5)
        
env.close()

state: 0 action: 3
state: 1 action: 1
state: 1 action: 0
state: 5 action: 1

Unlike the earlier examples, here the `agent` does not perform a fixed number of actions: it keeps acting until the episode ends, that is, until it reaches the goal, falls into a hole, or hits the step limit.
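
In FrozenLake, `done` becomes true when the agent steps onto the goal or into a hole, or when the step limit of the `TimeLimit` wrapper is reached. The following is a minimal, Gym-free sketch of that termination logic, assuming the default 4x4 map, deterministic (non-slippery) moves, and a limit of 100 steps. `move` and `run_episode` are hypothetical helpers, not Gym functions.

```python
import random

MAP_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]
SIZE = 4
# Action encoding: 0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP
DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(state, action):
    """Deterministic move; bumping into the border leaves the state unchanged."""
    row, col = divmod(state, SIZE)
    d_row, d_col = DELTAS[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    return row * SIZE + col


def run_episode(policy, max_steps=100):
    """Act until the goal, a hole, or the step limit ends the episode."""
    state, steps, done, reward = 0, 0, False, 0.0
    while not done:
        state = move(state, policy(state))
        steps += 1
        tile = MAP_4X4[state // SIZE][state % SIZE]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "GH" or steps >= max_steps
    return state, reward, steps


rng = random.Random(0)
final_state, reward, steps = run_episode(lambda s: rng.randrange(4))
print(final_state, reward, steps)
```

With a random policy most episodes end in a hole with reward 0, which is why a learning agent is needed.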
***

## Implementing a Q-Learning agent

To implement the **Q-Learning** algorithm, we create a new agent, `QAgent`, that extends `Agent`:

In [14]:
class QAgent(Agent):
    def __init__(self, env, discount_rate=0.6, learning_rate=0.01):
        super().__init__(env)
        self.state_size = env.observation_space.n
        print("State size:", self.state_size)
        
        self.epsilon = 1.0
        self.discount_rate = discount_rate
        self.learning_rate = learning_rate
        
        self.build_model()
        
    def build_model(self):
        self.q_table = 1e-5 * np.random.random([self.state_size, self.action_size])
        print(self.q_table)
    
    # given a state, look up its row in the Q-table and take the action with the highest Q-value
    def get_action(self, state):
        q_state = self.q_table[state]
        action_greedy = np.argmax(q_state)
        action_random = super().get_action(state)
        return action_random if random.random() < self.epsilon else action_greedy
    
    def train(self, experience):
        state, action, next_state, reward, done = experience

        # A finished episode has no future value, so its Q-values count as zero
        q_next = self.q_table[next_state]
        q_next = np.zeros([self.action_size]) if done else q_next
        # Temporal-difference target and error for the visited (state, action) pair
        q_target = reward + self.discount_rate * np.max(q_next)
        q_update = q_target - self.q_table[state, action]
        self.q_table[state, action] += self.learning_rate * q_update
        print(self.q_table)
        # Decay exploration once per finished episode
        if done:
            self.epsilon *= 0.99
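
`get_action` follows an epsilon-greedy scheme: with probability `epsilon` it explores with a random action, otherwise it exploits the action with the highest Q-value, and `epsilon` shrinks by a factor of 0.99 at the end of every episode. Below is a standalone sketch of the same idea on a plain list of Q-values. The function name `epsilon_greedy` and the sample values are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.5, 0.2, 0.0]  # made-up Q-values for a single state
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy: 1

# Starting from 1.0, epsilon after n finished episodes is 0.99 ** n.
for n in (0, 100, 500, 1000):
    print(n, 0.99 ** n)
```

After 1000 episodes `epsilon` is below 1e-4, so by then the agent acts almost purely greedily.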
        

The `train(...)` method implements the following update rule:

![q-learning formula](https://miro.medium.com/max/1400/1*EQ-tDj-iMdsHlGKUR81Xgw.png)
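
Written with the names used in the code, the update applied by `train(...)` is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is `learning_rate`, $\gamma$ is `discount_rate`, and the max term is taken as 0 when the episode is done. Below is a small worked example with the default hyperparameters of `QAgent`. The helper `q_update` and the input Q-values are made up for illustration.

```python
def q_update(q_sa, reward, q_next_max, done,
             learning_rate=0.01, discount_rate=0.6):
    """Return the new Q(s, a) after one Q-learning update."""
    future = 0.0 if done else q_next_max
    q_target = reward + discount_rate * future
    return q_sa + learning_rate * (q_target - q_sa)


# Terminal transition into the goal: reward 1, no future value.
print(q_update(q_sa=0.0, reward=1.0, q_next_max=0.0, done=True))  # 0.01

# Non-terminal transition: only the discounted future value is propagated,
# giving roughly 0.01 * (0.6 * 0.5).
print(q_update(q_sa=0.0, reward=0.0, q_next_max=0.5, done=False))
```

A single rewarded update with `learning_rate=0.01` moves a near-zero entry to about 0.01, so the Q-values grow slowly and many episodes are needed before they become informative.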

Now the new agent can be tested:

In [None]:
agent = QAgent(env)

total_reward = 0
for ep in range(1000):
    state = env.reset()

    done = False
    while not done:
        action = agent.get_action(state)
        next_state, reward, done, info = env.step(action)
        agent.train((state, action, next_state, reward, done))
        state = next_state
        total_reward += reward
        print("state:", state, "action:", action)
        print("Episode: {}, Total reward: {}, epsilon {}".format(ep, total_reward, agent.epsilon))
        env.render()
        #time.sleep(0.0001)
        clear_output(wait = True)

env.close()

[[8.28379570e-06 4.50272582e-06 7.44787844e-06 6.82813507e-06]
 [2.17667919e-06 8.13121352e-06 2.09619310e-06 4.48213576e-06]
 [5.51084558e-06 9.19768562e-06 1.02553119e-06 3.04250602e-06]
 [8.03665189e-06 8.92141417e-07 1.07446302e-06 1.44028841e-06]
 [9.40677928e-06 4.67422556e-06 5.26118803e-06 5.88385723e-06]
 [7.77638916e-06 5.11463809e-06 6.45733761e-06 6.07331566e-07]
 [3.84390417e-06 7.16529172e-06 9.19509046e-06 1.35398728e-06]
 [3.87080046e-06 8.61109049e-06 1.36064747e-06 3.26437854e-06]
 [2.18511418e-06 1.87583577e-06 3.64380486e-06 5.83792384e-06]
 [6.30063357e-06 9.53502547e-06 9.07337616e-06 1.29948537e-06]
 [6.37606104e-06 6.52284540e-05 7.38389623e-06 4.72325143e-06]
 [2.32556098e-06 7.14568561e-06 2.57643767e-06 7.59460327e-07]
 [6.76533977e-06 6.26634667e-07 3.23307920e-06 3.54985409e-06]
 [6.30771098e-06 2.37262449e-06 3.14847624e-06 8.65089781e-06]
 [1.79473552e-06 9.37102920e-06 9.36764030e-06 1.00060167e-02]
 [3.25885762e-06 5.27699687e-06 5.00109509e-06 2.552723