# AI Exam

Consider the following environment:

<img src="images/road_env.jpg" style="zoom: 40%;"/>

The agent starts in cell $(0, 0)$ and must reach the goal in cell $(8, 6)$. It can move in the four directions (except into walls), and each step taken yields a negative reward of $-0.05$.
In cells representing road intersections, the agent must wait for the traffic light to turn green before proceeding. At busy intersections (indicated by two traffic lights in the same cell) the wait is long: an attempt to move to another cell succeeds with probability 20%, and in the remaining 80% of cases the agent stays in the intersection cell and must try again. Intersections with a single traffic light are less busy: there, an action fails only 20% of the time.
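
To get a feel for what these probabilities cost the agent, note that the number of attempts needed to leave an intersection follows a geometric distribution, so its mean is $1/p$, where $p$ is the per-attempt success probability. The sketch below works this out for both intersection types. The helper name `expected_wait` is illustrative and not part of the provided tools, and the sketch assumes that every failed attempt costs the same step reward of $-0.05$ and ignores discounting.

```python
STEP_REWARD = -0.05  # per-step reward from the problem statement

def expected_wait(p_success, step_reward=STEP_REWARD):
    """Return (expected attempts, expected undiscounted reward) to leave a cell.

    Attempts are assumed independent, so their count is geometric with mean 1/p.
    """
    attempts = 1.0 / p_success
    return attempts, attempts * step_reward

busy_attempts, busy_cost = expected_wait(0.2)    # two traffic lights
quiet_attempts, quiet_cost = expected_wait(0.8)  # one traffic light

print(f"busy:  {busy_attempts:.2f} attempts, reward {busy_cost:.4f}")
print(f"quiet: {quiet_attempts:.2f} attempts, reward {quiet_cost:.4f}")
```

Under these assumptions, leaving a busy intersection costs on average about as much as five moves along a plain road, while a less busy one costs only slightly more than a single move.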

Consider the problem of computing the optimal policy for the environment described above, and use the provided code to print it. How does the policy change with the discount factor? Analyse and explain this behaviour.
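
One way to build intuition for the discount-factor question is to try it on a much smaller problem first. The sketch below is a toy MDP, not the exam grid: a single busy intersection from which the agent can either keep trying to cross (20% success) or take a deterministic four-step detour. All names (`T`, `R`, `solve`) are illustrative, and the goal reward of $+1$ is an assumption, since the actual reward values live in the environment's `RS` array.

```python
import numpy as np

# Toy MDP (hypothetical, NOT the exam grid):
#   state 0: busy intersection, states 1-3: detour road cells, state 4: goal
#   action 0 ("cross"): reach the goal with probability 0.2, otherwise stay
#   action 1 ("detour"): move deterministically along 1 -> 2 -> 3 -> goal
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
T = np.zeros((N_STATES, N_ACTIONS, N_STATES))
T[0, 0, GOAL], T[0, 0, 0] = 0.2, 0.8
T[0, 1, 1] = 1.0
for s in (1, 2, 3):
    T[s, :, s + 1] = 1.0   # road cells: both actions just move forward
T[GOAL, :, GOAL] = 1.0     # the goal is absorbing

# Assumed rewards: -0.05 per step (from the text), +1 at the goal (an assumption)
R = np.full(N_STATES, -0.05)
R[GOAL] = 1.0

def solve(discount, sweeps=2000):
    """Plain value iteration; returns utilities and the greedy action per state."""
    U = np.zeros(N_STATES)
    for _ in range(sweeps):
        U_next = R + discount * (T @ U).max(axis=1)
        U_next[GOAL] = R[GOAL]   # terminal state keeps its own reward
        U = U_next
    return U, (T @ U).argmax(axis=1)

for discount in (0.5, 0.7, 0.9, 0.99):
    U, policy = solve(discount)
    choice = "cross" if policy[0] == 0 else "detour"
    print(f"discount={discount}: U(start)={U[0]:.4f}, action at intersection: {choice}")
```

In this toy, a small discount factor makes the goal reward shrink quickly with distance, so the agent prefers the gamble of crossing right away. With a discount factor close to 1, the agent mostly cares about the expected number of steps and takes the detour, which is shorter than the expected wait at the intersection. Whether and where the same switch happens in the exam environment depends on its actual rewards and layout.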

<span style="color:green">The implemented check function does not check the cells corresponding to walls.</span>

In [1]:
import os, sys 

module_path = os.path.abspath(os.path.join('tools'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gym, envs
from utils.ai_lab_functions import *
import numpy as np
from timeit import default_timer as timer
from tqdm import tqdm as tqdm

env_name = 'RoadEnv-v0'
env = gym.make(env_name)

env.render()

print("\nActions encoding: ", env.actions)

# Remember that you can look up the type of any cell through the environment's grid:
print("Cell type of start state: ",env.grid[env.startstate])
print("Cell type of goal state: ",env.grid[env.goalstate])
state = 15 # a very busy intersection
print(f"Cell type of cell {env.state_to_pos(state)}: ",env.grid[state])
state = 10 # a less busy intersection
print(f"Cell type of cell {env.state_to_pos(state)}: ",env.grid[state])
print(f"Probability of effectively performing action 'R' from cell (1,1) to cell (1,2): {env.T[state, 1, state+1]}")

[['S' 'R' 'W' 'W' 'W' 'W' 'R' 'W' 'W']
 ['W' 'Ts' 'R' 'R' 'R' 'R' 'Tl' 'R' 'R']
 ['W' 'R' 'W' 'W' 'W' 'W' 'R' 'W' 'W']
 ['R' 'Ts' 'R' 'Ts' 'R' 'R' 'Ts' 'W' 'W']
 ['W' 'W' 'W' 'R' 'W' 'W' 'R' 'Ts' 'R']
 ['W' 'R' 'R' 'Tl' 'W' 'W' 'W' 'R' 'W']
 ['W' 'R' 'W' 'R' 'Ts' 'R' 'R' 'Tl' 'R']
 ['W' 'R' 'W' 'W' 'R' 'W' 'W' 'R' 'W']
 ['R' 'Ts' 'R' 'R' 'Tl' 'R' 'G' 'Ts' 'R']]

Actions encoding:  {0: 'L', 1: 'R', 2: 'U', 3: 'D'}
Cell type of start state:  S
Cell type of goal state:  G
Cell type of cell (1, 6):  Tl
Cell type of cell (1, 1):  Ts
Probability of effectively performing action 'R' from cell (1,1) to cell (1,2): 0.8


In [2]:
def value_iteration(environment, maxiters=500, discount=0.9, max_error=1e-3):
    """Run value iteration and return the greedy policy for the computed utilities."""
    n_states = environment.observation_space.n
    n_actions = environment.action_space.n

    U_1 = [0 for _ in range(n_states)]  # utilities updated during the current sweep

    while True:
        maxiters -= 1
        U = U_1.copy()  # utilities from the previous sweep
        delta = 0       # largest utility change seen in this sweep
        for state in range(n_states):

            # Expected utility of the successor state for each action
            max_array = [0 for _ in range(n_actions)]
            for action in range(n_actions):
                for next_state in range(n_states):
                    max_array[action] += environment.T[state, action, next_state] * U[next_state]

            if environment.grid[state] == "G":
                # The goal is terminal: its utility is just its reward
                U_1[state] = environment.RS[state]
            else:
                U_1[state] = environment.RS[state] + discount * max(max_array)

            if abs(U_1[state] - U[state]) > delta:
                delta = abs(U_1[state] - U[state])

        # Stop when out of iterations or when the largest change is below the error bound
        if maxiters <= 0 or delta < (max_error * (1 - discount) / discount):
            break
    return values_to_policy(np.asarray(U), environment)
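
For reference, the loop above can be read as repeated application of the Bellman update, written here with $T$ for `environment.T`, $R$ for `environment.RS`, $\gamma$ for `discount` and $\epsilon$ for `max_error`. This is a summary of the code as written, not an additional requirement of the exercise:

$$
U_{i+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U_i(s'), \qquad U_{i+1}(s_{\text{goal}}) = R(s_{\text{goal}})
$$

The sweep stops once the largest change falls below the error bound, or when `maxiters` runs out:

$$
\delta = \max_{s} \lvert U_{i+1}(s) - U_i(s) \rvert < \epsilon \, \frac{1 - \gamma}{\gamma}
$$

The threshold shrinks towards zero as $\gamma$ approaches 1, so for discount factors very close to 1 the loop is more likely to end on the iteration limit, and for $\gamma = 1$ only `maxiters` can end it.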

In [3]:
t = timer()

solution = value_iteration(env)

print("\nEXECUTION TIME: \n{}\n".format(round(timer() - t, 4)))

solution_render = np.vectorize(env.actions.get)(solution.reshape(env.rows, env.cols))

print("Solution: \n{}\n".format(solution_render))

check_sol(solution_render)


EXECUTION TIME: 
0.2866

Solution: 
[['R' 'D' 'L' 'L' 'L' 'L' 'D' 'L' 'L']
 ['L' 'D' 'L' 'R' 'R' 'R' 'D' 'L' 'L']
 ['L' 'D' 'L' 'L' 'L' 'L' 'D' 'L' 'L']
 ['R' 'R' 'R' 'R' 'R' 'R' 'D' 'L' 'L']
 ['L' 'L' 'L' 'D' 'L' 'L' 'R' 'D' 'L']
 ['L' 'D' 'L' 'D' 'L' 'L' 'L' 'D' 'L']
 ['L' 'D' 'L' 'R' 'D' 'L' 'R' 'D' 'L']
 ['L' 'D' 'L' 'L' 'D' 'L' 'L' 'D' 'L']
 ['R' 'R' 'R' 'R' 'R' 'R' 'L' 'L' 'L']]

[1m[92mYour solution is correct!
[0m
