# <center>WSI Exercise No. 6 - Q-learning Algorithm</center>

### <center>Adam Wróblewski</center>


### Goal of the exercise and experiments:
The goal of this exercise is to implement the Q-learning algorithm, then build an agent that solves the *Taxi* problem from the *gym* library, and examine how the individual hyperparameters affect the algorithm's behaviour.


In [2]:
# import necessary packages
import gym
import numpy as np
import random

In [3]:
# initialize the Taxi environment
env = gym.make('Taxi-v3', render_mode='ansi')
state = env.reset()
# print(env.render())
# taxi_row, taxi_col, passenger_index, destination_index = env.decode(state[0])
# print(taxi_row, taxi_col, passenger_index, destination_index)

In [1]:
# Agent class with methods for learning (building the Q table) and for running simulations based on the learned Q table
class Agent:
    def __init__(self, env, learning_rate, discount, epsilon):
        self.env = env
        self.learning_rate = learning_rate
        self.discount = discount
        self.epsilon = epsilon
    
    def get_parameters(self):
        # use a name that does not shadow the built-in dict
        params = {
            "env": self.env,
            "learning_rate": self.learning_rate,
            "discount": self.discount,
            "epsilon": self.epsilon
        }
        return params

    def q_learnning(self, epochs, period_of_evaluation = 5000, num_of_evaluations = 100):
        env = self.env
        learning_rate = self.learning_rate
        discount = self.discount
        epsilon = self.epsilon
        
        q_table = np.zeros((env.observation_space.n, env.action_space.n))
        for i in range(epochs): 
            state = env.reset()
            state = state[0]
            terminated = False
            while not terminated:
                if np.random.random() < epsilon: #exploration
                    action = env.action_space.sample()
                else:
                    action = np.argmax(q_table[state])

                action_result = env.step(action)
                next_state = action_result[0]
                reward = action_result[1]
                terminated = action_result[2]

                # Q-learning update: move the old estimate toward reward + discounted best next value
                q_prev = q_table[state, action]
                new_q = q_prev + learning_rate*(reward + discount*np.max(q_table[next_state]) - q_prev)
                q_table[state, action] = new_q

                state = next_state
                
            if ((i+1) % period_of_evaluation == 0):
                avg_score = self.evaluate(q_table, num_of_evaluations)
                print(f"Episode {i+1}/{epochs}, avg score of {num_of_evaluations} evaluation = {avg_score}")
                
        print("learning ended")
        return q_table
    
    
    def evaluate(self, q_table, num_of_evaluations):
        scores_table = []
        for i in range(num_of_evaluations):
            score = self.run_simulation(q_table)
            scores_table.append(score)
        avg_score = sum(scores_table)/len(scores_table)
        return avg_score
        
    
    def run_simulation(self, q_table, visual = False):
        env = self.env
                
        state = env.reset()
        state = state[0]
        if visual: print(env.render())
        
        terminated = False
        rewards = 0
        for i in range(100):
            if visual: print("Step: ", i)
            action = np.argmax(q_table[state])
            action_result = env.step(action)
            next_state = action_result[0]
            reward = action_result[1]
            terminated = action_result[2]

            rewards += reward
            if visual:
                print(env.render())
                print("score: ", rewards)
            state = next_state

            if terminated:
                return rewards
        return rewards
    

Creating the baseline agent, which serves as the reference for examining how the individual parameters affect the Q-learning algorithm:

In [4]:
agent1 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.1)
my_q_table1 = agent1.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)
# agent.run_simulation(q_table=my_q_table, visual=True)

Episode 2500/100000, avg score of 100 evaluation = -28.84
Episode 5000/100000, avg score of 100 evaluation = -2.44
Episode 7500/100000, avg score of 100 evaluation = -0.41
Episode 10000/100000, avg score of 100 evaluation = 2.56
Episode 12500/100000, avg score of 100 evaluation = 8.28
Episode 15000/100000, avg score of 100 evaluation = 4.87
Episode 17500/100000, avg score of 100 evaluation = 6.69
Episode 20000/100000, avg score of 100 evaluation = 7.82
Episode 22500/100000, avg score of 100 evaluation = 7.83
Episode 25000/100000, avg score of 100 evaluation = 8.03
Episode 27500/100000, avg score of 100 evaluation = 7.96
Episode 30000/100000, avg score of 100 evaluation = 8.35
Episode 32500/100000, avg score of 100 evaluation = 7.99
Episode 35000/100000, avg score of 100 evaluation = 7.85
Episode 37500/100000, avg score of 100 evaluation = 7.57
Episode 40000/100000, avg score of 100 evaluation = 7.91
Episode 42500/100000, avg score of 100 evaluation = 8.13
Episode 45000/100000, avg scor

### Examining the effect of the learning rate parameter:
This parameter is often denoted $\beta$ or $\alpha$. It determines how quickly the algorithm changes the values in the Q table, in other words, how important newly acquired information about the environment is relative to what the agent already knows.<br>
Increasing it makes the values in the Q table change faster (by larger steps), which should make the agent learn faster and thus reach a satisfactory score in fewer iterations.
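
The effect of the learning rate on a single update can be sketched as below. This is only an illustration, not part of the experiments: the helper name `q_update` and the numeric values are assumptions made up for the example.

```python
def q_update(q_prev, reward, max_next_q, learning_rate, discount):
    """One tabular Q-learning update for a single (state, action) pair."""
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + discount * max_next_q
    # Move the old estimate toward the target by a fraction learning_rate.
    return q_prev + learning_rate * (target - q_prev)


# Hypothetical values: current estimate 0, reward -1, best next-state value 10.
for lr in (0.1, 0.5, 0.9):
    new_q = q_update(q_prev=0.0, reward=-1.0, max_next_q=10.0,
                     learning_rate=lr, discount=0.7)
    print(f"learning_rate={lr}: new Q = {new_q:.2f}")
```

With a learning rate of 1 the old estimate is replaced by the target entirely, and with 0 it never changes; values in between blend the two.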

In [5]:
agent2 = Agent(env=env, learning_rate=0.3, discount=0.7, epsilon=0.1)
my_q_table2 = agent2.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = 0.55
Episode 5000/100000, avg score of 100 evaluation = 7.99
Episode 7500/100000, avg score of 100 evaluation = 7.22
Episode 10000/100000, avg score of 100 evaluation = 8.04
Episode 12500/100000, avg score of 100 evaluation = 7.79
Episode 15000/100000, avg score of 100 evaluation = 7.67
Episode 17500/100000, avg score of 100 evaluation = 8.12
Episode 20000/100000, avg score of 100 evaluation = 7.87
Episode 22500/100000, avg score of 100 evaluation = 7.81
Episode 25000/100000, avg score of 100 evaluation = 7.86
Episode 27500/100000, avg score of 100 evaluation = 8.0
Episode 30000/100000, avg score of 100 evaluation = 7.52
Episode 32500/100000, avg score of 100 evaluation = 8.06
Episode 35000/100000, avg score of 100 evaluation = 8.04
Episode 37500/100000, avg score of 100 evaluation = 8.09
Episode 40000/100000, avg score of 100 evaluation = 8.04
Episode 42500/100000, avg score of 100 evaluation = 8.09
Episode 45000/100000, avg score of 

In [6]:
agent3 = Agent(env=env, learning_rate=0.5, discount=0.7, epsilon=0.1)
my_q_table3 = agent3.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = 5.62
Episode 5000/100000, avg score of 100 evaluation = 7.9
Episode 7500/100000, avg score of 100 evaluation = 7.74
Episode 10000/100000, avg score of 100 evaluation = 7.63
Episode 12500/100000, avg score of 100 evaluation = 7.97
Episode 15000/100000, avg score of 100 evaluation = 7.5
Episode 17500/100000, avg score of 100 evaluation = 7.75
Episode 20000/100000, avg score of 100 evaluation = 7.57
Episode 22500/100000, avg score of 100 evaluation = 7.69
Episode 25000/100000, avg score of 100 evaluation = 8.03
Episode 27500/100000, avg score of 100 evaluation = 7.98
Episode 30000/100000, avg score of 100 evaluation = 8.4
Episode 32500/100000, avg score of 100 evaluation = 7.09
Episode 35000/100000, avg score of 100 evaluation = 8.16
Episode 37500/100000, avg score of 100 evaluation = 8.16
Episode 40000/100000, avg score of 100 evaluation = 7.94
Episode 42500/100000, avg score of 100 evaluation = 8.04
Episode 45000/100000, avg score of 10

In [7]:
agent4 = Agent(env=env, learning_rate=0.7, discount=0.7, epsilon=0.1)
my_q_table4 = agent4.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = 7.7
Episode 5000/100000, avg score of 100 evaluation = 7.87
Episode 7500/100000, avg score of 100 evaluation = 7.66
Episode 10000/100000, avg score of 100 evaluation = 7.81
Episode 12500/100000, avg score of 100 evaluation = 7.86
Episode 15000/100000, avg score of 100 evaluation = 7.98
Episode 17500/100000, avg score of 100 evaluation = 8.09
Episode 20000/100000, avg score of 100 evaluation = 8.21
Episode 22500/100000, avg score of 100 evaluation = 8.17
Episode 25000/100000, avg score of 100 evaluation = 7.99
Episode 27500/100000, avg score of 100 evaluation = 8.17
Episode 30000/100000, avg score of 100 evaluation = 7.93
Episode 32500/100000, avg score of 100 evaluation = 7.8
Episode 35000/100000, avg score of 100 evaluation = 7.95
Episode 37500/100000, avg score of 100 evaluation = 7.95
Episode 40000/100000, avg score of 100 evaluation = 8.25
Episode 42500/100000, avg score of 100 evaluation = 8.11
Episode 45000/100000, avg score of 1

In [8]:
agent5 = Agent(env=env, learning_rate=0.9, discount=0.7, epsilon=0.1)
my_q_table5 = agent5.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = 7.76
Episode 5000/100000, avg score of 100 evaluation = 7.49
Episode 7500/100000, avg score of 100 evaluation = 7.59
Episode 10000/100000, avg score of 100 evaluation = 7.92
Episode 12500/100000, avg score of 100 evaluation = 7.59
Episode 15000/100000, avg score of 100 evaluation = 7.96
Episode 17500/100000, avg score of 100 evaluation = 7.77
Episode 20000/100000, avg score of 100 evaluation = 8.32
Episode 22500/100000, avg score of 100 evaluation = 8.19
Episode 25000/100000, avg score of 100 evaluation = 8.13
Episode 27500/100000, avg score of 100 evaluation = 8.31
Episode 30000/100000, avg score of 100 evaluation = 8.05
Episode 32500/100000, avg score of 100 evaluation = 7.58
Episode 35000/100000, avg score of 100 evaluation = 7.86
Episode 37500/100000, avg score of 100 evaluation = 8.53
Episode 40000/100000, avg score of 100 evaluation = 7.56
Episode 42500/100000, avg score of 100 evaluation = 7.98
Episode 45000/100000, avg score of

### Examining the effect of the discount parameter:
This parameter is often denoted $\gamma$. It determines how much weight the rewards the agent may receive in the future carry relative to those it receives "immediately", in the current step.<br> A large value of $\gamma$ means that future rewards are very important, whereas as the parameter decreases, future rewards have less influence and the rewards received in the current step become more significant.
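
As a rough illustration of this weighting, the sketch below computes the discounted sum of rewards for one made-up episode. The helper name `discounted_return` and the episode itself are assumptions; only the reward values (-1 per step, +20 for a successful drop-off) follow the Taxi reward scheme.

```python
def discounted_return(rewards, discount):
    """Sum of rewards where the reward at step t is weighted by discount**t."""
    return sum(discount ** t * r for t, r in enumerate(rewards))


# Hypothetical episode: four steps costing -1 each, then a final reward of +20.
rewards = [-1, -1, -1, -1, 20]
for gamma in (0.1, 0.5, 0.9):
    value = discounted_return(rewards, gamma)
    print(f"discount={gamma}: return = {value:.3f}")
```

For small $\gamma$ the final +20 is almost invisible from the first step, so the value of the starting state is dominated by the per-step penalties; this is one plausible reason why low discount values slow down learning in the runs below.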

In [17]:
agent10 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.1)
my_q_table11 = agent10.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -29.3
Episode 5000/100000, avg score of 100 evaluation = -6.9
Episode 7500/100000, avg score of 100 evaluation = 1.98
Episode 10000/100000, avg score of 100 evaluation = 3.83
Episode 12500/100000, avg score of 100 evaluation = 3.94
Episode 15000/100000, avg score of 100 evaluation = 4.62
Episode 17500/100000, avg score of 100 evaluation = 7.02
Episode 20000/100000, avg score of 100 evaluation = 5.86
Episode 22500/100000, avg score of 100 evaluation = 7.57
Episode 25000/100000, avg score of 100 evaluation = 7.49
Episode 27500/100000, avg score of 100 evaluation = 7.99
Episode 30000/100000, avg score of 100 evaluation = 7.82
Episode 32500/100000, avg score of 100 evaluation = 7.8
Episode 35000/100000, avg score of 100 evaluation = 7.7
Episode 37500/100000, avg score of 100 evaluation = 7.77
Episode 40000/100000, avg score of 100 evaluation = 7.87
Episode 42500/100000, avg score of 100 evaluation = 7.39
Episode 45000/100000, avg score of 

In [9]:
agent11 = Agent(env=env, learning_rate=0.1, discount=0.9, epsilon=0.1)
my_q_table11 = agent11.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = 2.72
Episode 5000/100000, avg score of 100 evaluation = 7.92
Episode 7500/100000, avg score of 100 evaluation = 7.83
Episode 10000/100000, avg score of 100 evaluation = 7.7
Episode 12500/100000, avg score of 100 evaluation = 7.98
Episode 15000/100000, avg score of 100 evaluation = 7.59
Episode 17500/100000, avg score of 100 evaluation = 8.19
Episode 20000/100000, avg score of 100 evaluation = 7.78
Episode 22500/100000, avg score of 100 evaluation = 8.2
Episode 25000/100000, avg score of 100 evaluation = 7.57
Episode 27500/100000, avg score of 100 evaluation = 7.74
Episode 30000/100000, avg score of 100 evaluation = 7.79
Episode 32500/100000, avg score of 100 evaluation = 8.03
Episode 35000/100000, avg score of 100 evaluation = 7.91
Episode 37500/100000, avg score of 100 evaluation = 7.91
Episode 40000/100000, avg score of 100 evaluation = 7.79
Episode 42500/100000, avg score of 100 evaluation = 7.86
Episode 45000/100000, avg score of 1

In [10]:
agent12 = Agent(env=env, learning_rate=0.1, discount=0.5, epsilon=0.1)
my_q_table12 = agent12.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -50.48
Episode 5000/100000, avg score of 100 evaluation = -30.19
Episode 7500/100000, avg score of 100 evaluation = -4.68
Episode 10000/100000, avg score of 100 evaluation = -5.73
Episode 12500/100000, avg score of 100 evaluation = 0.29
Episode 15000/100000, avg score of 100 evaluation = -1.13
Episode 17500/100000, avg score of 100 evaluation = 2.59
Episode 20000/100000, avg score of 100 evaluation = 4.9
Episode 22500/100000, avg score of 100 evaluation = 8.09
Episode 25000/100000, avg score of 100 evaluation = 7.25
Episode 27500/100000, avg score of 100 evaluation = 6.82
Episode 30000/100000, avg score of 100 evaluation = 8.0
Episode 32500/100000, avg score of 100 evaluation = 5.68
Episode 35000/100000, avg score of 100 evaluation = 7.87
Episode 37500/100000, avg score of 100 evaluation = 7.94
Episode 40000/100000, avg score of 100 evaluation = 7.79
Episode 42500/100000, avg score of 100 evaluation = 7.76
Episode 45000/100000, avg sco

In [11]:
agent13 = Agent(env=env, learning_rate=0.1, discount=0.3, epsilon=0.1)
my_q_table13 = agent13.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -53.99
Episode 5000/100000, avg score of 100 evaluation = -45.74
Episode 7500/100000, avg score of 100 evaluation = -23.48
Episode 10000/100000, avg score of 100 evaluation = -3.75
Episode 12500/100000, avg score of 100 evaluation = 1.54
Episode 15000/100000, avg score of 100 evaluation = -3.85
Episode 17500/100000, avg score of 100 evaluation = 1.56
Episode 20000/100000, avg score of 100 evaluation = 1.31
Episode 22500/100000, avg score of 100 evaluation = 6.27
Episode 25000/100000, avg score of 100 evaluation = 7.34
Episode 27500/100000, avg score of 100 evaluation = 7.74
Episode 30000/100000, avg score of 100 evaluation = 7.75
Episode 32500/100000, avg score of 100 evaluation = 6.85
Episode 35000/100000, avg score of 100 evaluation = 7.81
Episode 37500/100000, avg score of 100 evaluation = 7.83
Episode 40000/100000, avg score of 100 evaluation = 7.87
Episode 42500/100000, avg score of 100 evaluation = 7.19
Episode 45000/100000, avg 

In [12]:
agent14 = Agent(env=env, learning_rate=0.1, discount=0.1, epsilon=0.1)
my_q_table14 = agent14.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -70.18
Episode 5000/100000, avg score of 100 evaluation = -43.16
Episode 7500/100000, avg score of 100 evaluation = -22.79
Episode 10000/100000, avg score of 100 evaluation = -15.54
Episode 12500/100000, avg score of 100 evaluation = -4.01
Episode 15000/100000, avg score of 100 evaluation = 0.73
Episode 17500/100000, avg score of 100 evaluation = 2.58
Episode 20000/100000, avg score of 100 evaluation = -0.49
Episode 22500/100000, avg score of 100 evaluation = -0.71
Episode 25000/100000, avg score of 100 evaluation = 7.16
Episode 27500/100000, avg score of 100 evaluation = 4.56
Episode 30000/100000, avg score of 100 evaluation = 4.91
Episode 32500/100000, avg score of 100 evaluation = 5.57
Episode 35000/100000, avg score of 100 evaluation = 6.02
Episode 37500/100000, avg score of 100 evaluation = 5.89
Episode 40000/100000, avg score of 100 evaluation = 4.82
Episode 42500/100000, avg score of 100 evaluation = 5.94
Episode 45000/100000, a

### Examining the effect of the epsilon parameter:
This parameter is the probability with which a random action is chosen from all available actions instead of the best action according to the Q table. The larger the value of *epsilon*, the more the agent explores (it chooses actions at random and thereby learns about the environment), whereas a small value means exploitation: following the strategy stored in the Q table.
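
The selection rule described above can be sketched in isolation as follows. The function name `epsilon_greedy`, the `rng` argument and the Q-values are assumptions for illustration; the agent in this notebook does the equivalent inline with `np.random.random()` and `env.action_space.sample()`.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    # exploitation: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for one state with six actions (as in Taxi).
q_values = [0.0, 1.5, -0.3, 0.2, -10.0, -10.0]
rng = random.Random(0)  # seeded generator so the run is repeatable
for eps in (0.1, 0.5, 0.9):
    picks = [epsilon_greedy(q_values, eps, rng) for _ in range(10_000)]
    greedy_share = picks.count(1) / len(picks)
    print(f"epsilon={eps}: greedy action chosen in {greedy_share:.0%} of steps")
```

Note that a random draw can also land on the greedy action, so with $n$ actions the greedy action is chosen with probability $1 - \varepsilon + \varepsilon / n$.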

In [19]:
agent110 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.1)
my_q_table110 = agent110.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -24.95
Episode 5000/100000, avg score of 100 evaluation = -6.86
Episode 7500/100000, avg score of 100 evaluation = -3.68
Episode 10000/100000, avg score of 100 evaluation = 3.88
Episode 12500/100000, avg score of 100 evaluation = 1.31
Episode 15000/100000, avg score of 100 evaluation = 4.81
Episode 17500/100000, avg score of 100 evaluation = 7.2
Episode 20000/100000, avg score of 100 evaluation = 7.03
Episode 22500/100000, avg score of 100 evaluation = 6.59
Episode 25000/100000, avg score of 100 evaluation = 7.69
Episode 27500/100000, avg score of 100 evaluation = 7.4
Episode 30000/100000, avg score of 100 evaluation = 7.89
Episode 32500/100000, avg score of 100 evaluation = 8.1
Episode 35000/100000, avg score of 100 evaluation = 8.01
Episode 37500/100000, avg score of 100 evaluation = 7.74
Episode 40000/100000, avg score of 100 evaluation = 7.85
Episode 42500/100000, avg score of 100 evaluation = 8.13
Episode 45000/100000, avg score o

In [5]:
agent111 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.3)
my_q_table111 = agent111.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -18.41
Episode 5000/100000, avg score of 100 evaluation = -8.86
Episode 7500/100000, avg score of 100 evaluation = 1.83
Episode 10000/100000, avg score of 100 evaluation = 4.85
Episode 12500/100000, avg score of 100 evaluation = 3.03
Episode 15000/100000, avg score of 100 evaluation = 6.0
Episode 17500/100000, avg score of 100 evaluation = 8.22
Episode 20000/100000, avg score of 100 evaluation = 7.98
Episode 22500/100000, avg score of 100 evaluation = 8.02
Episode 25000/100000, avg score of 100 evaluation = 7.73
Episode 27500/100000, avg score of 100 evaluation = 7.94
Episode 30000/100000, avg score of 100 evaluation = 7.94
Episode 32500/100000, avg score of 100 evaluation = 7.74
Episode 35000/100000, avg score of 100 evaluation = 7.77
Episode 37500/100000, avg score of 100 evaluation = 8.16
Episode 40000/100000, avg score of 100 evaluation = 7.66
Episode 42500/100000, avg score of 100 evaluation = 7.82
Episode 45000/100000, avg score 

In [14]:
agent112 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.5)
my_q_table112 = agent112.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -21.88
Episode 5000/100000, avg score of 100 evaluation = -3.83
Episode 7500/100000, avg score of 100 evaluation = 5.88
Episode 10000/100000, avg score of 100 evaluation = 3.71
Episode 12500/100000, avg score of 100 evaluation = 7.94
Episode 15000/100000, avg score of 100 evaluation = 7.91
Episode 17500/100000, avg score of 100 evaluation = 8.39
Episode 20000/100000, avg score of 100 evaluation = 7.92
Episode 22500/100000, avg score of 100 evaluation = 7.83
Episode 25000/100000, avg score of 100 evaluation = 7.39
Episode 27500/100000, avg score of 100 evaluation = 7.85
Episode 30000/100000, avg score of 100 evaluation = 8.08
Episode 32500/100000, avg score of 100 evaluation = 8.1
Episode 35000/100000, avg score of 100 evaluation = 7.88
Episode 37500/100000, avg score of 100 evaluation = 7.98
Episode 40000/100000, avg score of 100 evaluation = 7.58
Episode 42500/100000, avg score of 100 evaluation = 8.37
Episode 45000/100000, avg score 

In [15]:
agent113 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.7)
my_q_table113 = agent113.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = -4.47
Episode 5000/100000, avg score of 100 evaluation = 6.28
Episode 7500/100000, avg score of 100 evaluation = 6.76
Episode 10000/100000, avg score of 100 evaluation = 8.12
Episode 12500/100000, avg score of 100 evaluation = 8.26
Episode 15000/100000, avg score of 100 evaluation = 7.87
Episode 17500/100000, avg score of 100 evaluation = 7.81
Episode 20000/100000, avg score of 100 evaluation = 8.34
Episode 22500/100000, avg score of 100 evaluation = 7.76
Episode 25000/100000, avg score of 100 evaluation = 7.9
Episode 27500/100000, avg score of 100 evaluation = 7.95
Episode 30000/100000, avg score of 100 evaluation = 8.24
Episode 32500/100000, avg score of 100 evaluation = 7.87
Episode 35000/100000, avg score of 100 evaluation = 8.09
Episode 37500/100000, avg score of 100 evaluation = 8.2
Episode 40000/100000, avg score of 100 evaluation = 7.83
Episode 42500/100000, avg score of 100 evaluation = 7.65
Episode 45000/100000, avg score of 

In [16]:
agent114 = Agent(env=env, learning_rate=0.1, discount=0.7, epsilon=0.9)
my_q_table114 = agent114.q_learnning(epochs=100000, period_of_evaluation=2500, num_of_evaluations=100)

Episode 2500/100000, avg score of 100 evaluation = 8.36
Episode 5000/100000, avg score of 100 evaluation = 7.7
Episode 7500/100000, avg score of 100 evaluation = 8.21
Episode 10000/100000, avg score of 100 evaluation = 7.87
Episode 12500/100000, avg score of 100 evaluation = 8.23
Episode 15000/100000, avg score of 100 evaluation = 7.77
Episode 17500/100000, avg score of 100 evaluation = 7.25
Episode 20000/100000, avg score of 100 evaluation = 7.77
Episode 22500/100000, avg score of 100 evaluation = 7.99
Episode 25000/100000, avg score of 100 evaluation = 7.57
Episode 27500/100000, avg score of 100 evaluation = 7.77
Episode 30000/100000, avg score of 100 evaluation = 7.82
Episode 32500/100000, avg score of 100 evaluation = 8.16
Episode 35000/100000, avg score of 100 evaluation = 7.83
Episode 37500/100000, avg score of 100 evaluation = 8.49
Episode 40000/100000, avg score of 100 evaluation = 7.75
Episode 42500/100000, avg score of 100 evaluation = 7.84
Episode 45000/100000, avg score of 

### Conclusions:
The experiments show that the Q-learning algorithm copes very well with exploring and learning a previously unknown environment, but the correct choice of its parameters is important. I observed that increasing each of the parameters *learning rate*, *discount* and *epsilon* reduces the number of iterations (epochs) needed to reach a satisfactory average score, although *epsilon* seems to have the smallest effect.<br><br> It is worth noting that the algorithm is not resource-intensive and does not need a large number of iterations to reach a good result: with well-chosen parameters, about 2500 training iterations are already enough to train an agent (for the Taxi problem). <br><br> The small fluctuations of the average score (between 7 and 8) after longer training stem from the randomness of the problem: the starting point, the destination and the passenger's pickup point are all chosen at random.

A drawback of the Q-learning algorithm is that it depends on a well-defined Q table; for complex problems and environments a very large number of training iterations may be needed to obtain optimal results. Moreover, the algorithm is fairly primitive: we simply evaluate every possible configuration of the environment and choose the action on that basis.
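
To give a sense of how quickly the Q table grows with the environment, here is a rough sketch. The helper names `taxi_like_states` and `q_table_entries` are hypothetical, and the state count assumes a Taxi-style layout (taxi position, passenger location including "in the taxi", destination); the larger grids are made-up examples, not environments from *gym*.

```python
def taxi_like_states(grid_size, num_stops):
    """Number of states in a Taxi-style problem on a square grid."""
    taxi_positions = grid_size * grid_size
    passenger_locations = num_stops + 1  # at one of the stops, or in the taxi
    destinations = num_stops
    return taxi_positions * passenger_locations * destinations


def q_table_entries(num_states, num_actions):
    """The Q table holds one value per (state, action) pair."""
    return num_states * num_actions


# A 5x5 grid with 4 stops and 6 actions corresponds to Taxi-v3.
for size in (5, 10, 50):
    states = taxi_like_states(size, num_stops=4)
    print(f"{size}x{size} grid: {states} states, "
          f"{q_table_entries(states, 6)} Q-table entries")
```

Every entry has to be visited many times before its estimate settles, which is why the number of training iterations grows with the size of the table.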