# Question 1:
The number of states depends on:
1. $\textbf{Grid Size}$: The size of the grid or environment in which the Pacman game is played affects the number of states. A larger grid will typically have more states since each grid cell can be considered a potential state.

2. $\textbf{Dot Locations}$: The locations of the dots (food) that Pacman needs to collect also affect the number of states. If the dots can be placed in any cell, each possible arrangement of dots represents a state.

3. $\textbf{Walls}$: The presence of walls restricts the agent's movement. The layout of walls and the agent's ability to navigate around them impact the number of states.

4. $\textbf{Ghost Locations}$ (if present): If there are ghosts in the environment, their positions also contribute to the number of states. Each combination of Pacman's position and ghost positions can be considered a state.

Summary: $\newline$
Any configuration of the environment can be a state (unless the wall positions are known, in which case there are fewer states).
In this environment the wall positions are known, and the grid contains 43 Dots, 2 Empty cells, and 17 interior Walls. This gives $2^{43}$ dot configurations (each of the 43 dot cells is either Dot or Empty), before the agent's position is even taken into account.
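
The cell counts quoted above can be checked with a minimal sketch. It assumes the given environment defined later in this notebook, copied here as strings for brevity; `GRID` and `count_cells` are hypothetical names, not part of the original code.

```python
# Rows of the given environment, one character per cell
# (assumed to be a faithful copy of main_environment1).
GRID = [
    "WWWWWWWWWWW",
    "WADDDDDDDDW",
    "WDWWWDWWWDW",
    "WDWDDDDDWDW",
    "WDDDWEWDDDW",
    "WDWDWEWDWDW",
    "WDWDDWDDWDW",
    "WDDDDDDDDDW",
    "WWWWWWWWWWW",
]

def count_cells(grid):
    """Count Dots, Empty cells, and Walls that are not on the outer border."""
    dots = sum(row.count('D') for row in grid)
    empties = sum(row.count('E') for row in grid)
    interior_walls = sum(row[1:-1].count('W') for row in grid[1:-1])
    return dots, empties, interior_walls

dots, empties, interior_walls = count_cells(GRID)
print(dots, empties, interior_walls)  # 43 2 17
print(2 ** dots)                      # number of Dot/Empty configurations
```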

An idea for state reduction: $\newline$
We can describe the state with a $3\times3$ window (any odd size works) centered on the agent. A larger window might overfit to the training environment and fail to generalize to other environments. The agent sits in the middle of the window (position (1, 1)), and each of the other 8 cells is either Dot, Wall, or Empty, so there are $3^{8}$ different windows. We use this reduction to shrink the Q_table from $2^{43}$ states to $3^8$.
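
As an illustration of the reduction, the sketch below extracts the $3\times3$ window around the agent from a small made-up grid. `local_window` and `TOY_GRID` are hypothetical names introduced here; the notebook's own code builds the same window inline.

```python
def local_window(grid, row, col):
    """Return the 3x3 neighbourhood centered on (row, col) as a list of lists."""
    return [[grid[r][c] for c in range(col - 1, col + 2)]
            for r in range(row - 1, row + 2)]

# A toy 5x5 grid (assumption for illustration only); the agent 'A' is at (1, 1).
TOY_GRID = [
    "WWWWW",
    "WADDW",
    "WDWDW",
    "WDDDW",
    "WWWWW",
]

window = local_window(TOY_GRID, 1, 1)
for line in window:
    print(line)

# 8 surrounding cells, 3 possible contents each.
n_local_states = 3 ** 8
print(n_local_states)  # 6561
```

The window only sees the immediate neighbourhood, so many different board positions map to the same local state. That is the price of the smaller table.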

# Question 2:


$\textbf{Actions}: \newline$
We have 4 actions in each state: $\textbf{right, left, up}$ and $\textbf{down}$ (without considering that we can't go through walls).

$\textbf{State}: \newline$
As mentioned in the first question, we have $3^8$ states for every Pacman environment. Each state has 4 actions, so the Q_table has $3^8$ rows and 4 columns.

$\textbf{Reward}: \newline$
1. If the action leads into a Wall: the reward is -2.
2. If the action leads to a Dot: the reward is +3.
3. If the action leads to an Empty cell: the reward is -1. This discourages loops.

$\textbf{Goal State}: \newline$
The goal is to collect all the Dots in the environment.

In [28]:
import numpy as np
import random
import pygame

# Hash function:

This function converts a state to an index so that we can access the Q_values of that state.

It works like converting a binary number to an integer, but in base 3: instead of 0 and 1, the digits are E(0), W(1), and D(2). The center cell (the agent) is skipped, and the first cell read is the least significant digit.

In [29]:
def hash_function(state):
    s = ""
    for i in range(len(state)):
        for j in range(len(state[i])):
            if not (i == 1 and j == 1):
                s += state[i][j]
    
    index = 0
    for i in range(len(s)):
        if s[i] == 'E':
            continue
        elif s[i] == 'W':
            index += 3**i
        elif s[i] == 'D':
            index += 2*(3**i)
            
    return index            
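
A worked example may make the base-3 encoding concrete. The sketch below uses `window_to_index`, a hypothetical compact equivalent of `hash_function` for windows that contain only E, W, and D around the agent; it is written out again so the block runs on its own.

```python
CELL_DIGIT = {'E': 0, 'W': 1, 'D': 2}  # same digit assignment as hash_function

def window_to_index(state):
    """Base-3 encoding of the 8 cells around the center, first cell least significant."""
    cells = [state[i][j] for i in range(3) for j in range(3) if (i, j) != (1, 1)]
    return sum(CELL_DIGIT[cell] * 3 ** k for k, cell in enumerate(cells))

example = [['W', 'W', 'W'],
           ['W', 'A', 'D'],
           ['W', 'D', 'D']]

# Digits in reading order: 1,1,1,1,2,1,2,2
# 1 + 3 + 9 + 27 + 2*81 + 243 + 2*729 + 2*2187 = 6277
print(window_to_index(example))  # 6277
```

The smallest index (all Empty) is 0 and the largest (all Dot) is $3^8 - 1 = 6560$, so every window fits in a table with $3^8$ rows.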

# Select action:

This function determines the next action:
1. With probability "random_value" (which increases as training progresses), it returns one of the actions with the largest q_value, breaking ties at random.
2. With probability 1-random_value, it returns a random action.

In [30]:
def choose_next(q_values, random_value):
    rnd = random.random()
    
    if (rnd < random_value):
        l = []
        m = max(q_values)
        for i in range(len(q_values)):
            if q_values[i] == m:
                l.append(i)
        return random.choice(l)
    
    else:
        return random.randint(0, 3)

# Reward:

This function determines the reward for taking "action" in "state".

In [31]:
def calc_reward(state, action):
    if action == 0: # Right
        if state[1][2] == 'W': return -2
        elif state[1][2] == 'E': return -1
        elif state[1][2] == 'D': return 3
        
    elif action == 1: # Left
        if state[1][0] == 'W': return -2
        elif state[1][0] == 'E': return -1
        elif state[1][0] == 'D': return 3
    
    elif action == 2: # Up
        if state[0][1] == 'W': return -2
        elif state[0][1] == 'E': return -1
        elif state[0][1] == 'D': return 3
        
    else: # Down
        if state[2][1] == 'W': return -2
        elif state[2][1] == 'E': return -1
        elif state[2][1] == 'D': return 3
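
The four branches above differ only in which neighbouring cell they inspect, so the same rule can be sketched with two lookup tables. This is an alternative formulation, not the notebook's implementation; `ACTION_OFFSETS`, `CELL_REWARD`, and `reward_for` are hypothetical names, and the sketch assumes actions are always 0 to 3 and target cells are always W, E, or D.

```python
# Row/column offset of the target cell for each action (0=Right, 1=Left, 2=Up, 3=Down).
ACTION_OFFSETS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}

# Reward for stepping onto (or bumping into) each kind of cell.
CELL_REWARD = {'W': -2, 'E': -1, 'D': 3}

def reward_for(state, action):
    """Reward for taking `action` from the center of a 3x3 window."""
    d_row, d_col = ACTION_OFFSETS[action]
    return CELL_REWARD[state[1 + d_row][1 + d_col]]

state = [['W', 'W', 'W'],
         ['W', 'A', 'D'],
         ['W', 'E', 'D']]

print([reward_for(state, a) for a in range(4)])  # [3, -2, -2, -1]
```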

# Update:
This is the update function. It moves the agent according to the given action, marks the cell it left as Empty, and returns the updated state and environment.

In [32]:
def update_state(i, j, environment, action, n):
    if action == 0: j += 1  
    elif action == 1: j -= 1
    elif action == 2: i -= 1
    else: i += 1
    
    state = [[environment[s][k] for k in range(j-1, j+2)] for s in range(i-1, i+2)]
    
    if action == 0:
        state[1][0] = 'E'
        state[1][1] = 'A'
        environment[i][j-1] = 'E'
        environment[i][j] = 'A'
         
    elif action == 1:
        state[1][2] = 'E'
        state[1][1] = 'A'
        environment[i][j+1] = 'E'
        environment[i][j] = 'A'
        
    elif action == 2:
        state[2][1] = 'E'
        state[1][1] = 'A'
        environment[i+1][j] = 'E'
        environment[i][j] = 'A'
        
    else:
        state[0][1] = 'E'
        state[1][1] = 'A'
        environment[i-1][j] = 'E'
        environment[i][j] = 'A'

    return state, i, j, environment

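This function checks whether the cell that the action leads to is Empty. Despite its name, it returns 0 for a Dot as well as for a Wall; moves onto a Dot are handled separately through the positive reward.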

In [33]:
def not_wall(state, action):
    if action == 0 and state[1][2] == 'E':
        return 1
    elif action == 1 and state[1][0] == 'E':
        return 1
    elif action == 2 and state[0][1] == 'E':
        return 1
    elif action == 3 and state[2][1] == 'E':
        return 1
    else:
        return 0

# Train:

1. First, we find the position of the agent.
2. Then we run the game for many episodes (at most 2000 steps per episode).
3. At each step, we first find the Q_table index of the current state.
4. Then we choose an action, either from the Q_table or at random (random actions during training help us visit every case that can happen).
5. The chosen action yields a reward.
6. After getting the reward, we update the state and the environment using the update_state function (a move into a Wall leaves them unchanged).
7. Then we update the Q_table entry for the state we were in and the action we took, using the Q-learning formula:
$$ Q(s, a) = (1-\alpha)\times Q(s, a) + \alpha \times (R + \gamma\times \max_{a'} Q(s', a'))$$
8. We repeat these steps until all Dots are eaten or the step limit is reached.
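
To see the update rule in numbers, here is a minimal sketch of a single Q-value being updated repeatedly. `q_update` is a hypothetical helper that restates the formula; the inputs (reward 3, next-state maximum 0, $\alpha = \gamma = 0.5$) are chosen for illustration only.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning update: blend the old estimate with the bootstrapped target."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * max_q_next)

q = 0.0
history = []
for _ in range(3):
    # Eat a Dot (reward 3) and land in a state whose best Q-value is still 0.
    q = q_update(q, reward=3, max_q_next=0.0, alpha=0.5, gamma=0.5)
    history.append(q)

print(history)  # [1.5, 2.25, 2.625]
```

Each update moves the estimate a fraction $\alpha$ of the way toward the target $R + \gamma \max_{a'} Q(s', a')$, which is 3 in this example.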

In [34]:
def Train(n_dots, o_environment, n, alpha, gamma, Q_table, inc_rate):
    for i in range(len(o_environment)):
        for j in range(len(o_environment[i])):
            if o_environment[i][j] == 'A':
                i_o = i
                j_o = j
                break

    rnd = 0.05
    while(rnd <= 0.9):
        i = i_o
        j = j_o
        dots = n_dots
        environment = [[] for i in range(len(o_environment))]
        state = [[o_environment[p][s] for s in range(j-1, j+2)] for p in range(i-1, i+2)]
        for s in range(len(o_environment)):
            environment[s] = o_environment[s].copy()
        
        # while(dots):
        for h in range(2000):
            if not dots:
                break

            index = hash_function(state)  # Find index using hash
            action = choose_next(Q_table[index], rnd) # Find action
            reward = calc_reward(state, action) # Getting the reward
                
            if reward > 0:
                state, i, j, environment = update_state(i, j, environment, action, n)
                dots -= 1
            
            elif not_wall(state, action):
                state, i, j, environment = update_state(i, j, environment, action, n)
                
            # Updating the Q_table
            Q_table[index][action] = (1-alpha)*Q_table[index][action] + alpha*(reward + gamma*max(Q_table[hash_function(state)]))
        rnd *= inc_rate
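
The outer loop of `Train` runs one episode per value of `rnd`, so the start value, the stop value, and `inc_rate` together fix the number of episodes. The sketch below replays only that schedule; `count_episodes` is a hypothetical helper, and its defaults mirror the values used in this notebook (0.05, 0.9, and an `inc_rate` of 1.0001).

```python
def count_episodes(start=0.05, stop=0.9, inc_rate=1.0001):
    """Count how many times the `while rnd <= stop` loop body would run."""
    rnd = start
    episodes = 0
    while rnd <= stop:
        episodes += 1
        rnd *= inc_rate
    return episodes

# Roughly ln(0.9 / 0.05) / ln(1.0001), i.e. on the order of 29,000 episodes.
print(count_episodes())
```

Because `rnd` is the probability of taking the greedy action, early episodes are almost entirely random and the last ones are about 90% greedy.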

# Visualization:

In [35]:
def draw_map(screen, map_data):    
    WIDTH, HEIGHT = 90, 70
    WHITE = (255, 255, 255)
    RED = (255, 0, 0)
    YELLOW = (255, 255, 0)
    
    for y, row in enumerate(map_data):
        for x, cell in enumerate(row):
            rect = pygame.Rect(x * WIDTH, y * HEIGHT, WIDTH, HEIGHT)
            if cell == 'W':
                pygame.draw.rect(screen, RED, rect)
            elif cell == 'A':
                pygame.draw.circle(screen, YELLOW, (x * WIDTH + WIDTH // 2, y * HEIGHT + HEIGHT // 2), 35)
            elif cell == 'D':
                pygame.draw.circle(screen, WHITE, (x * WIDTH + WIDTH // 2, y * HEIGHT + HEIGHT // 2), 10)

In [36]:
def Animation(environment, reward):
    WIDTH, HEIGHT = 90, 70
    WHITE = (255, 255, 255)
    RED = (255, 0, 0)
    YELLOW = (255, 255, 0)

    map_data = np.array(environment)

    window_size = (WIDTH * map_data.shape[1], HEIGHT * map_data.shape[0])
    screen = pygame.display.set_mode(window_size)
    font = pygame.font.Font(None, 40)

    pygame.display.set_caption("Pac-Man")

    screen.fill((0, 0, 0))
    draw_map(screen, map_data)
    screen.blit(font.render(f"Score: {reward}", True, YELLOW), (15, 15))
    pygame.display.flip()
    pygame.display.update()
    pygame.time.delay(300)

This function chooses one of the best actions for Test.

In [37]:
def choose_next_test(q_values):
    l = []
    m = max(q_values)
    for i in range(len(q_values)):
        if q_values[i] == m:
            l.append(i)
    return random.choice(l)

# Test:
Test is the same as Train, except that:
1. We break out of loops by taking a random non-Wall action when the agent hasn't eaten a Dot for a while (more than 20 steps). This is tracked by count_empty, which counts the steps since the last Dot.
2. We don't update the Q_table.

In [38]:
def Test(n_dots, o_environment, n, Q_table):       
    for i in range(len(o_environment)):
        for j in range(len(o_environment[i])):
            if o_environment[i][j] == 'A':
                i_o = i
                j_o = j
                break
            
    dots = n_dots
    environment = [[] for i in range(len(o_environment))]
    for s in range(len(o_environment)):
        environment[s] = o_environment[s].copy()
        
    state = [[o_environment[p][s] for s in range(j_o-1, j_o+2)] for p in range(i_o-1, i_o+2)]
    
    pygame.init()
    total_reward = 0
    Animation(environment, total_reward)
    ite = 0
    count_empty = 0
    while(dots):
        ite += 1
        
        index = hash_function(state)
        if count_empty > 20:
            l = []
            for ac in range(4):
                if calc_reward(state, ac) != -2:
                    l.append(ac)
            action = random.choice(l)
            
        else:  
            action = choose_next_test(Q_table[index])
        reward = calc_reward(state, action)
        total_reward += reward
            
        if reward > 0:
            state, i_o, j_o, environment = update_state(i_o, j_o, environment, action, n)
            dots -= 1
            count_empty = -1
        
        elif not_wall(state, action):
            state, i_o, j_o, environment = update_state(i_o, j_o, environment, action, n)
        Animation(environment, total_reward)
        if pygame.QUIT in [e.type for e in pygame.event.get()]:
            pygame.quit()
            pygame.display.quit()
            break
        count_empty += 1
    pygame.quit()
    print("Reward: ", total_reward, " Iteration: ", ite)

The given environment:

In [39]:
main_environment1 =[['W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W'],
                   ['W', 'A', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'W'],
                   ['W', 'D', 'W', 'W', 'W', 'D', 'W', 'W', 'W', 'D', 'W'],
                   ['W', 'D', 'W', 'D', 'D', 'D', 'D', 'D', 'W', 'D', 'W'],
                   ['W', 'D', 'D', 'D', 'W', 'E', 'W', 'D', 'D', 'D', 'W'],
                   ['W', 'D', 'W', 'D', 'W', 'E', 'W', 'D', 'W', 'D', 'W'],
                   ['W', 'D', 'W', 'D', 'D', 'W', 'D', 'D', 'W', 'D', 'W'],
                   ['W', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'W'],
                   ['W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W']]

The alternative environment:

In [40]:
main_environment2 =[['W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W'],
                    ['W', 'A', 'D', 'D', 'D', 'D', 'D', 'D', 'W'],
                    ['W', 'D', 'W', 'W', 'D', 'W', 'W', 'D', 'W'],
                    ['W', 'D', 'W', 'D', 'D', 'D', 'W', 'D', 'W'],
                    ['W', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'W'],
                    ['W', 'D', 'W', 'W', 'D', 'W', 'W', 'D', 'W'],
                    ['W', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'W'],
                    ['W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W']]

# Test and Train using:
We train with every combination of these parameters:
$$ \alpha = 0.25, 0.5, 0.75$$
$$ \gamma = 0.25, 0.5, 1$$

For all tests:
1. We construct a Q_table.
2. We train Pacman using the chosen parameters.
3. We test on the given environment.
4. We test on the alternative environment.

# Q_table:
1. Our Q_table has $3^8$ rows, one for each possible $3\times3$ state matrix.
2. For each state, we have 4 actions.
3. Thus the Q_table is a $3^8\times4$ matrix.

In [41]:
Q_table_1_1 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.25, 0.25, Q_table_1_1, 1.0001)

In [63]:
for q_value in Q_table_1_1:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[-1.33228998 -2.33283568 -2.33323475 -1.33254043]
[-1.3333321  -2.33297816 -1.18884313 -1.33333189]
[-1.02649647 -1.17306466  3.88496868 -1.29687928]
[-0.1609496  -2.0568498  -2.04115697 -1.3031442 ]
[-2.33286518 -1.32136464 -2.33319895 -1.33333208]
[-2.04414445 -0.11405198 -2.0335538  -1.31594638]
[-2.33249846 -1.33311795 -1.33275086 -1.33170775]
[-1.17429846 -1.04296792  3.21879385 -1.32107155]
[-2.33297468 -2.33306281 -1.33318733 -1.33333138]
[-2.00853382 -2.00815849 -0.01617746 -1.30581532]
[-1.27981842 -1.15530846  2.71686847 -0.95838598]
[-1.01217522 -1.01047168  3.97528361 -1.28985511]
[-2.00269596 -2.00343315 -0.00921795 -1.32240866]
[-2.00412429 -2.0027683  -0.01401403 -1.3023317 ]
[-1.00515056 -1.00427294  3.98989039 -1.31457916]
[-1.0033885  -1.00360325  3.98764288 -1.32820544]
[-1.32193296  2.68537649 -1.31220657 -1.04807385]
[-1.04182345  3.80925618 -1.03975582 -1.06641433]
[-1.33285021  2.67246491 -1.05008053 -1.07063815]
[-1.07900397  2.97957176  3.94710089 -1.06463217]


In [68]:
Test(43, main_environment1, 3, Q_table_1_1)

Reward:  52  Iteration:  120


In [69]:
Test(31, main_environment2, 3, Q_table_1_1)

Reward:  -275  Iteration:  365


In [43]:
Q_table_1_2 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.25, 0.5, Q_table_1_2, 1.0001)

In [59]:
for q_value in Q_table_1_2:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[-1.99934257 -2.99513327 -2.96748043 -1.99959431]
[-1.99997195 -2.99820651 -1.99985087 -1.99984582]
[-0.51536573 -0.35588263  3.9769     -1.79690244]
[ 1.5451556  -1.41077092 -1.36233715 -1.64917192]
[-2.9818842  -1.99972809 -2.97375527 -1.99970319]
[-1.66866687  0.95375186 -1.5302985  -1.84090921]
[-2.99126897 -1.99945209 -1.93790049 -1.99960686]
[ 0.20535479 -0.42075715  5.75642411 -1.88008684]
[-2.98175799 -2.95294817 -1.99924096 -1.99919162]
[-1.17616163 -1.17789912  1.64836581 -1.71339437]
[-1.06783298 -1.0667423   2.20924263 -1.18872986]
[ 0.75182736  0.75091913  5.65640033 -1.75487257]
[-1.12635853 -1.06490278  1.71883145 -1.77355754]
[-1.10331289 -1.08673996  1.86870529 -1.67402442]
[ 0.95025713  0.93770221  5.81729637 -1.90117264]
[ 0.94077635  0.90420741  5.97952168 -1.92698153]
[-0.8253578   2.11471206 -0.78804332 -0.65654665]
[ 0.52750883  4.83728773  0.51775569 -0.76924787]
[-0.98791224  2.03507502 -0.55922669 -0.78659919]
[ 0.72509093  3.73369004  4.84904896 -0.81715476]


In [71]:
Test(43, main_environment1, 3, Q_table_1_2)

Reward:  -206  Iteration:  378


In [72]:
Test(31, main_environment2, 3, Q_table_1_2)

Reward:  -48  Iteration:  141


In [45]:
Q_table_1_3 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.25, 1, Q_table_1_3, 1.0001)

In [46]:
for q_value in Q_table_1_3:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[3908.94793482 3902.59912266 3902.93396159 3902.83760283]
[3906.30830564 3903.49061191 3904.12572277 3903.23821892]
[3963.90220794 3963.63425924 3987.07498845 3960.06676076]
[3988.76386195 3958.14941604 3957.70516071 3931.09948443]
[3903.00606213 3905.69754278 3903.05319297 3907.69114002]
[3973.37353262 4002.47036903 3969.48962625 3939.83207106]
[3906.31327111 3903.37976593 3914.49103534 3902.90040694]
[3957.11716566 3975.75366055 3970.20046517 3934.09676993]
[3903.83796369 3903.54186353 3907.95987255 3904.13981252]
[3927.43842496 3927.26601549 3927.19934614 3925.80018453]
[ 621.45373667  868.53024374 1462.42547766 1188.70361035]
[2749.97433119 2077.81588137 2887.72576115 3941.19891383]
[3955.30024736 3955.03391498 3965.47228477 3943.01677046]
[3952.44086456 3952.5285646  3952.66184254 3930.74172215]
[3862.690986   3987.9958121  4111.64990605 3885.0693176 ]
[3959.3883027  3961.76806828 3966.34368929 3944.89426227]
[3988.5302306  3969.10925871 3988.63580186 3988.74367495]
[4012.92623697

In [74]:
Test(43, main_environment1, 3, Q_table_1_3)

Reward:  -476  Iteration:  463


In [75]:
Test(31, main_environment2, 3, Q_table_1_3)

Reward:  -468  Iteration:  433


In [47]:
Q_table_1_4 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.5, 0.25, Q_table_1_4, 1.0001)

In [48]:
for q_value in Q_table_1_4:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[-1.33333333 -2.33217676 -2.33320747 -1.33333333]
[-1.33333332 -2.33225237 -1.33333332 -1.33333332]
[-1.0407582  -1.10483138  3.45940479 -1.23658873]
[-0.00736    -2.02932851 -2.01655073 -1.29305962]
[-2.33328545 -1.33333045 -2.33332242 -1.33333332]
[-2.03796583 -0.23307847 -2.03893661 -1.29873914]
[-2.33332729 -1.33333332 -1.33333332 -1.33333333]
[-1.04999779 -1.00934745  2.99925654 -1.32476306]
[-2.3333151  -2.333324   -1.33333332 -1.33333332]
[-2.00459609 -2.00649358 -0.0171384  -1.27461518]
[-1.32404394 -1.32102825  2.69183943 -1.14041243]
[-1.01115909 -1.01777448  3.98541865 -1.32106823]
[-2.00731095e+00 -2.00109065e+00 -2.32938998e-04 -1.29542564e+00]
[-2.00248588 -2.00742168 -0.01488463 -1.31744682]
[-1.00331595 -1.0022392   3.95792534 -1.23785341]
[-1.0015167  -1.00152163  3.98544924 -1.32096691]
[-1.3171069   2.67009076 -1.31247057 -1.04458977]
[-1.01928525  3.7615935  -1.01307847 -1.07231225]
[-1.32999191  2.66735377 -1.03457867 -1.07462455]
[-1.03111604  2.96419931  3.989844

In [77]:
Test(43, main_environment1, 3, Q_table_1_4)

Reward:  21  Iteration:  151


In [78]:
Test(31, main_environment2, 3, Q_table_1_4)

Reward:  -119  Iteration:  208


In [49]:
Q_table_1_5 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.5, 0.5, Q_table_1_5, 1.0001)

In [50]:
for q_value in Q_table_1_5:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[-1.99996034 -2.99988212 -2.99973596 -1.9999532 ]
[-1.99737836 -2.99184076 -1.26146316 -1.99672949]
[-0.83009638 -0.57512463  4.23465626 -1.91353641]
[ 0.34867884 -1.20840552 -1.56865322 -1.40791695]
[-2.99444206 -1.94776687 -2.98780665 -1.99999791]
[-1.71602459  0.54448474 -1.71546707 -1.61634332]
[-2.99891508 -1.99995158 -1.24613972 -1.999949  ]
[ 0.05610912 -0.11351506  5.47592396 -1.98230516]
[-2.98896005 -2.99751141 -1.99999968 -1.71387864]
[-1.15939074 -1.28263145  0.90978611 -1.59609156]
[-0.83521613 -0.83319632  2.12031333 -1.51360175]
[ 0.76462029  0.7271584   5.65439795 -1.39531805]
[-1.06789881 -1.11657113  1.97750813 -1.60959545]
[-1.06016587 -1.09003148  1.69339946 -1.59249666]
[ 0.96514272  0.94940407  5.94706799 -1.50944089]
[ 0.91822652  0.95906985  5.99679067 -1.9647115 ]
[-0.89601765  2.02980024 -0.9300282  -0.7469382 ]
[ 0.32256563  4.61690417  0.37191016 -0.8900929 ]
[-0.99268389  2.01792426 -0.59576178 -0.66147008]
[ 0.64284446  3.67965027  5.0945537  -0.66440827]


In [24]:
Test(43, main_environment1, 3, Q_table_1_5)

Reward:  120  Iteration:  52


In [81]:
Test(31, main_environment2, 3, Q_table_1_5)

Reward:  -355  Iteration:  431


In [51]:
Q_table_1_6 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.5, 1, Q_table_1_6, 1.0001)

In [52]:
for q_value in Q_table_1_6:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[11531.82771782 11531.45277756 11530.79898841 11531.63790446]
[11533.31441165 11533.43780853 11533.34193326 11533.49241754]
[11615.31289789 11614.74730757 11585.68586181 11590.33741631]
[11556.99996433 11557.18304965 11557.00447094 11555.94833934]
[11532.70485072 11533.0759542  11532.7164161  11533.23836254]
[11632.31134492 11644.32807225 11614.79565881 11558.92266899]
[11531.53692594 11532.04225691 11532.11564792 11531.75991644]
[11557.17342481 11557.40546793 11590.43012662 11547.11978614]
[11528.57378499 11527.96817    11528.63424551 11533.96129088]
[11586.01540989 11585.73945182 11604.19777458 11586.83320851]
[ 7956.24728975  9202.89698193  7259.06826604 11149.38164725]
[10578.48799429 10580.94512233  9639.02949489 11550.80075018]
[11593.09128948 11593.48852372 11620.00099149 11586.29909422]
[10647.97722465 11588.68151695 11603.35376534 11060.86512887]
[11686.32854644 11611.44512308 11728.32210388 11637.96653708]
[11594.82041032 11594.63746356 11593.84132915 11574.36658513]
[11635.4

In [83]:
Test(43, main_environment1, 3, Q_table_1_6)

Reward:  -627  Iteration:  532


In [84]:
Test(31, main_environment2, 3, Q_table_1_6)

Reward:  -715  Iteration:  654


In [53]:
Q_table_1_7 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.75, 0.25, Q_table_1_7, 1.0001)

In [54]:
for q_value in Q_table_1_7:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[-1.33333333 -2.3333333  -2.33332994 -1.33333333]
[-1.33333333 -2.33333326 -1.33333333 -1.33333333]
[-1.00955711 -1.06778556  2.99903303 -1.28488133]
[-0.03695716 -2.06481013 -2.03695935 -1.28429973]
[-2.33321543 -1.33333333 -2.33322076 -1.33333333]
[-2.04205817 -0.01030147 -2.06635933 -1.31694262]
[-2.33333098 -1.33333333 -1.33333333 -1.33333333]
[-1.26631182 -1.07802612  2.72916669 -1.1613446 ]
[-2.3333333  -2.33333333 -1.33333333 -1.33333333]
[-2.00442046 -2.00383706 -0.0153758  -1.26885151]
[-1.28532353 -1.33305315  2.66724362 -1.27932565]
[-1.00839972 -1.05615829  3.98124821 -1.32939103]
[-2.00343834e+00 -2.00159921e+00 -5.80192324e-04 -1.26300995e+00]
[-2.00776549 -2.00172795 -0.01377638 -1.33101454]
[-1.0372333  -1.00087461  3.99869324 -1.23314818]
[-1.00118424 -1.00020996  3.98270706 -1.30699471]
[-1.3330047   2.66709124 -1.33302435 -1.07925682]
[-1.00673566  3.79445295 -1.06258386 -1.07295511]
[-1.32337574  2.66758712 -1.05241614 -1.0801188 ]
[-1.01261908  2.93280993  2.924152

In [86]:
Test(43, main_environment1, 3, Q_table_1_7)

Reward:  66  Iteration:  106


In [87]:
Test(31, main_environment2, 3, Q_table_1_7)

Reward:  -110  Iteration:  214


In [55]:
Q_table_1_8 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.75, 0.5, Q_table_1_8, 1.0001)

In [56]:
for q_value in Q_table_1_8:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[-1.99999998 -2.99998524 -2.99861149 -1.92052493]
[-1.99999445 -2.99943094 -1.99999789 -1.99999641]
[-0.22704492 -0.97732946  2.00219535 -1.92164858]
[ 0.78262485 -1.59645752 -1.29844377 -1.41829702]
[-2.99346287 -1.99999476 -2.93867324 -1.99999209]
[-1.84849723  0.21266665 -1.98875279 -1.55549267]
[-2.99999827 -1.99999997 -1.99999997 -1.99715575]
[ 0.81272875 -0.33065703  5.99649243 -1.86058421]
[-2.99724707 -2.99995288 -1.99999997 -1.89425552]
[-1.15628277 -1.19204216  1.58210695 -1.34755473]
[-0.7894801  -0.88072559  2.04402941 -1.83578757]
[ 0.82160821  0.81222247  5.3351547  -1.80476951]
[-1.09009501 -1.09608742  1.93695281 -1.80313308]
[-1.01804722 -1.31208368  1.93748633 -1.47719063]
[ 0.71353843  0.98169945  5.80110993 -1.98926463]
[ 0.98897042  0.91431825  5.84240744 -1.68620475]
[-0.42412008  2.01872445 -0.88363009 -0.95252987]
[ 0.85777272  4.35921074  0.68660121 -0.96940822]
[-0.99229887  2.06493336 -0.88530745 -0.82845844]
[ 0.52534386  3.3226814   4.27509093 -0.8264475 ]


In [89]:
Test(43, main_environment1, 3, Q_table_1_8)

Reward:  30  Iteration:  142


In [90]:
Test(31, main_environment2, 3, Q_table_1_8)

Reward:  -108  Iteration:  201


In [57]:
Q_table_1_9 = np.array([[0 for j in range(4)] for i in range(3**8)]).astype(float)

Train(43, main_environment1, 3, 0.75, 1, Q_table_1_9, 1.0001)

In [58]:
for q_value in Q_table_1_9:
    if max(q_value) > 0 or min(q_value) < 0:
        print(q_value)

[31764.30201184 31751.43158415 31751.67626909 31752.03891915]
[31763.56791366 31762.54886964 31763.01557412 31763.53435503]
[31815.77995901 31786.51849055 31774.71952173 31811.91108291]
[31782.25401186 31780.34903262 31780.67202287 31775.75433289]
[31764.00353619 31763.89607201 31763.14475302 31765.21063758]
[31814.86249853 31815.14990174 31811.3544451  31791.24472035]
[31751.42051403 31755.93537814 31777.34818885 31752.08727115]
[31774.65445362 31780.97948898 31779.23456344 31769.87735532]
[31760.1327171  31762.46066314 31763.1851424  31762.51910747]
[31763.71608708 31667.1135962  31810.76458852 31789.26734486]
[24844.25353244 24932.27309632 28328.85304919 26167.4347529 ]
[26537.95425648 23681.24119521 27542.65198204 30436.08785784]
[31268.34136107 28624.50850023 31818.14713828 29215.0154759 ]
[31751.07500382 30267.95154418 29954.53267399 31782.1063571 ]
[28423.40066178 27615.3133527  30677.53801295 31396.34439381]
[31798.06856604 31821.71456378 31882.37226343 31775.1180015 ]
[26666.3

In [92]:
Test(43, main_environment1, 3, Q_table_1_9)

Reward:  -579  Iteration:  584


In [93]:
Test(31, main_environment2, 3, Q_table_1_9)

Reward:  -397  Iteration:  392


As we can see, the best case for the given environment was $\alpha = 0.5$ and $\gamma = 0.5$ (reward 120 in 52 iterations).

For the alternative environment, the best case was $\alpha = 0.25$ and $\gamma = 0.5$ (reward -48 in 141 iterations).
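
The comparison can be made explicit by tabulating the rewards printed by the nine test runs above. The sketch below only copies those reported numbers into a dictionary; `RESULTS` is a hypothetical name, and since each configuration was tested once, with random tie-breaking, small differences between runs may well be noise.

```python
# (alpha, gamma) -> (reward on the given environment, reward on the alternative environment),
# copied from the Test outputs reported in this notebook.
RESULTS = {
    (0.25, 0.25): (52, -275),
    (0.25, 0.5): (-206, -48),
    (0.25, 1): (-476, -468),
    (0.5, 0.25): (21, -119),
    (0.5, 0.5): (120, -355),
    (0.5, 1): (-627, -715),
    (0.75, 0.25): (66, -110),
    (0.75, 0.5): (30, -108),
    (0.75, 1): (-579, -397),
}

best_given = max(RESULTS, key=lambda params: RESULTS[params][0])
best_alternative = max(RESULTS, key=lambda params: RESULTS[params][1])

print(best_given)        # (0.5, 0.5)
print(best_alternative)  # (0.25, 0.5)
```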


${\bullet}$ $\alpha:$ learning rate
$\newline$
${\bullet}$ $\gamma:$  discount

# Conclusion:

${\bullet}$ Increasing $\alpha$ alone did not improve the results in any consistent way.

${\bullet}$ On the other hand, $\gamma = 1$ gave the worst results for every $\alpha$ in both environments, while going from $\gamma = 0.25$ to $\gamma = 0.5$ had mixed effects.

${\bullet}$ Different combinations of $\alpha$ and $\gamma$ change the q_values. They stay small for $\gamma = 0.25$ and $\gamma = 0.5$, but grow into the thousands or tens of thousands for $\gamma = 1$.
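
A short derivation suggests why the printed q_values behave this way. It assumes only that no single reward exceeds the Dot reward, $R_{\max} = 3$, and that the Q_table starts at zero, as it does here:

$$ Q(s, a) \le R_{\max} + \gamma R_{\max} + \gamma^2 R_{\max} + \dots = \frac{R_{\max}}{1-\gamma} \qquad (\gamma < 1) $$

$$ \gamma = 0.25:\ \frac{3}{1-0.25} = 4, \qquad \gamma = 0.5:\ \frac{3}{1-0.5} = 6 $$

These bounds agree with the largest printed entries (about 3.99 and 5.99). For $\gamma = 1$ the series no longer converges, so nothing keeps the values from growing as training continues.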