
**<h1><div style="text-align: center;"> INM707 Phonemes Environment </div></h1>**
**<h2><div style="text-align: center;"> Assignment: Creating a learning environment and policies </div></h2>**


**<h3><div> Game Description</div></h3>**

The agent must collect the words containing the short sound 'ʊ' and find the shortest path to the goal within a limited time. It receives a positive reward for collecting a word with the correct sound and a negative reward for collecting an incorrect word, hitting a wall, hitting a moving obstacle, or grabbing in an empty cell. Each time step carries a penalty of -1.

The environment contains the same number of words for each of the three sounds: 'ʊ' (short u), 'ʌ' (open-mid) and 'uː' (long u). Together with the moving obstacles, these make it harder for the agent to reach the goal.


**<h3><div> Environment</div></h3>**

The phoneme environment is a configurable $N\times M$ array of integers representing objects.
All objects except the wall are placed randomly in the environment. Each object is represented as follows:

- 0 : empty cell
- 1 : moving obstacle
- 2 : 'ʊ' word
- 3 : 'ʌ' word
- 4 : 'uː' word
- 5 : Agent
- 6 : Goal
- 7 : Boundaries/walls


The words are drawn at random from the phoneme lists. The grid can be adapted to collect any of the three sounds, or any combination of them, with minimal changes to the reward and policy functions. For a more advanced task, each word with the same sound could be encoded with its own number. In this work the mission is to collect (learn) the phonetic sound 'ʊ'.


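- The area available for placing objects is the total grid area minus the boundary area. The boundary has $2 \times (M + N) - 4$ cells, because each corner is shared by two sides:

$ a = M\times N - 2 \times (M + N) + 4$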


- The total number of words on the grid is given by the floor division of the area by 3 (3 phoneme sounds):

$ w  = \lfloor \frac{a}{3} \rfloor $


- The number of words per sound is given by:


$  w_i = \lfloor \frac{w}{3} \rfloor $


- The number of obstacles on the grid is:


$ o = \lfloor \frac{a}{9} \rfloor $


- There is only one goal (G) and one learner (A).

There are $n+1$ agents on the board, $a_0,…,a_n$, where $a_0$ is the learner agent and the rest are the movable obstacles. 
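
As a worked example of the counts above, the following minimal sketch evaluates them for a given grid size. The helper name `grid_counts` is hypothetical and not part of the notebook; the boundary is taken as $2 \times (M + N) - 4$ cells, as in the `Phonemes` class further down.

```python
def grid_counts(rows, cols):
    """Return (area, words per sound, obstacles) for a rows x cols grid."""
    boundary = 2 * (rows + cols) - 4   # wall cells; each corner is counted once
    area = rows * cols - boundary      # a: cells available for placing objects
    words = area // 3                  # w
    words_per_sound = words // 3       # w_i
    obstacles = area // 9              # o
    return area, words_per_sound, obstacles


print(grid_counts(7, 7))  # (25, 2, 2)
print(grid_counts(5, 5))  # (9, 1, 1)
```

On a 7x7 grid this leaves 25 free cells, of which 6 hold words (2 per sound), 2 hold obstacles and 2 hold the goal and the learner.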


**<h3><div> Actions</div></h3>**

The actions available at each time step are:
- up
- down
- left 
- right
- grab

After taking an action, the agent gets a reward and transitions to a new state. Then the environment sends a signal indicating whether the game is over or not.
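
The effect of a movement action can be sketched as follows. This is a simplified illustration that only handles walls, not obstacles or grabbing; `MOVES` and `try_move` are hypothetical names, while the `Action` fields mirror the namedtuple used later in the notebook.

```python
from collections import namedtuple

Action = namedtuple('Action', 'name index delta_x delta_y')
MOVES = {
    'up': Action('up', 0, -1, 0),
    'down': Action('down', 1, 1, 0),
    'left': Action('left', 2, 0, -1),
    'right': Action('right', 3, 0, 1),
}


def try_move(position, action_name, size):
    """Return (new_position, reward) for a move on a walled grid of shape `size`."""
    move = MOVES[action_name]
    row, col = position[0] + move.delta_x, position[1] + move.delta_y
    # Rows and columns 0 and size - 1 are walls: the agent stays put and is penalised
    if not (1 <= row <= size[0] - 2 and 1 <= col <= size[1] - 2):
        return position, -10
    return (row, col), -1


print(try_move((1, 1), 'up', (7, 7)))     # ((1, 1), -10)
print(try_move((1, 1), 'right', (7, 7)))  # ((1, 2), -1)
```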

**<h3><div> Observations</div></h3>**

The observation of the environment is a dictionary that contains:
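- relative coordinates of the task words left on the grid (`ʊ_pos`)
- absolute coordinates of the 'ʊ' words left on the grid (`ʊ_coords`)
- relative coordinates of the goal (`dist_goal`)
- relative coordinates of the moving obstacles (`obstacles`)
- a 3x3 neighbourhood array around the agent with the encoded values (`neigh`)
- a counter with the number of words left for each sound (`pho_left`)
- the current location of the agent (`agent_pos`)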
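
The relative coordinates and the 3x3 neighbourhood can be illustrated on a small hand-built grid. This is a sketch with made-up positions, not output from the environment class.

```python
import numpy as np

grid = np.zeros((5, 5), dtype=int)
grid[0, :] = grid[-1, :] = grid[:, 0] = grid[:, -1] = 7  # walls
grid[1, 3] = 2                                           # one 'ʊ' word
grid[3, 3] = 6                                           # goal

agent = (1, 1)
word, goal = (1, 3), (3, 3)

# Relative coordinates are target minus agent, as (row offset, column offset)
rel_word = np.array(word) - np.array(agent)
rel_goal = np.array(goal) - np.array(agent)

# 3x3 window centred on the agent; the walls guarantee the slice stays in bounds
neigh = grid[agent[0] - 1:agent[0] + 2, agent[1] - 1:agent[1] + 2]

print(rel_word, rel_goal)  # [0 2] [2 2]
print(neigh)
```

Here the agent sits in the top-left interior corner, so the first row and first column of `neigh` are walls (7) and the remaining cells are empty (0).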


**<h3><div>Policies</div></h3>**
- Goal-oriented "biased policy": the agent grabs only when it is on a cell holding a word with the task phoneme; otherwise it heads towards the goal.
- Random policy: the agent grabs when it is on a cell holding a word with the task phoneme; otherwise it takes a random action.
- Combined policy: with probability $p \le \epsilon$ the agent explores with a random action; otherwise it follows the biased policy.
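
The exploration rule of the combined policy can be sketched in isolation. The function `epsilon_choice` and the generator `demo_rng` are hypothetical names used only for this illustration, and the fixed "greedy" action stands in for whatever the biased policy would return.

```python
import numpy as np


def epsilon_choice(greedy_action, actions, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() <= epsilon:
        return actions[rng.integers(len(actions))]
    return greedy_action


demo_rng = np.random.default_rng(0)
actions = ['up', 'down', 'left', 'right', 'grab']

# With epsilon = 0.1 the greedy action should be returned in roughly 92% of the draws:
# 90% exploitation plus the 1-in-5 chance that exploration picks it anyway
picks = [epsilon_choice('right', actions, 0.1, demo_rng) for _ in range(1000)]
print(picks.count('right') / 1000)
```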


**<h3><div>Rewards</div></h3>**

- -1 per time step
- -20 for hitting a moving obstacle
- -10 for grabbing in an empty cell or hitting a wall
- -10 for grabbing a word whose sound is not the task sound (an 'ʌ' or 'uː' word when the task is 'ʊ')
- 100 for grabbing a word with the correct sound

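At the end of an episode, with $a$ the available area, $w_i$ the number of 'ʊ' words placed on the grid and $ʊ_l$ the number of 'ʊ' words left:

- reaching the goal after collecting all 'ʊ' words: $a\times w_i$
- reaching the goal with 'ʊ' words left: $ a \times (w_i - ʊ_l)$
- reaching the time limit without being on the goal: $ -a$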
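
The goal reward can be checked with a short calculation. The function name `goal_reward` is hypothetical; the numbers correspond to a 7x7 grid, where $a = 25$ and $w_i = 2$.

```python
def goal_reward(area, words_per_sound, words_left):
    """Reward for reaching the goal: area times the number of task words collected."""
    return area * (words_per_sound - words_left)


print(goal_reward(25, 2, 0))  # 50: both 'ʊ' words collected
print(goal_reward(25, 2, 1))  # 25: one 'ʊ' word left on the grid
print(goal_reward(25, 2, 2))  # 0: no 'ʊ' word collected
```

The first two bullets are the same expression, since $a \times (w_i - ʊ_l)$ reduces to $a \times w_i$ when $ʊ_l = 0$.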


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
%cd /content/drive/MyDrive/INM707/task_1

## Packages

In [None]:
import os
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from collections import namedtuple, defaultdict
from tqdm import tqdm

seed = 123
rng = np.random.default_rng(seed)


<h2><div2>Section 1: The Environment</div2></h2>

<h3><div3>1.1 Phonemes list</div3></h3>

In [None]:
spell = "book took drum luck hush brush who tool new jury blush true through sue ball suit knew " \
        "fool loose lose pull room good boot look wolf rug foot sugar put dune hook doom cook " \
        "June cushion one could shoe woods bookshelf blue during rural noodles hush bug woman " \
        "football full would do too soon hood food pool you threw Lou two supper plumber publish " \
        "cup come "

phonemes = "bʊk tʊk drʌm lʌk hʌʃ brʌʃ huː tuːl njuː ʤʊəri blʌʃ truː θruː sjuː bɔːl sjuːt njuː " \
           "fuːl luːs luːz pʊl ruːm gʊd buːt lʊk wʊlf rʌg fʊt ʃʊgə pʊt djuːn hʊk duːm kʊk ʤuːn " \
           "kʊʃən wʌn kʊd ʃuː wʊdz bʊkʃɛlf bluː djʊərɪŋ rʊərəl nuːdlz hʌʃ bʌg wʊmən fʊtbɔːl fʊl " \
           "wʊd duː tuː suːn hʊd fuːd puːl juː θruː luː tuː sʌpə plʌmə pʌblɪʃ kʌp kʌm "

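# str.split() with no argument ignores the trailing space, so no empty entries are created
spell_list = spell.split()
phonemes_list = phonemes.split()

# Maps a transcription to a spelling; homophones (e.g. "new"/"knew") keep the last spelling only
phonetic_dict = dict(zip(phonemes_list, spell_list))

# noinspection NonAsciiCharacters
ʊ_sound = list()
# noinspection NonAsciiCharacters
uː_sound = list()
# noinspection NonAsciiCharacters
ʌ_sound = list()

# Iterate over (transcription, spelling) pairs so that each word is listed under its own spelling
for pho, word in zip(phonemes_list, spell_list):
    if 'ʊ' in pho:
        ʊ_sound.append(word)
    elif 'uː' in pho:
        uː_sound.append(word)
    elif 'ʌ' in pho:
        ʌ_sound.append(word)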

<h3><div3>1.2 Phonemes environment class</div3></h3>

In [None]:
# noinspection NonAsciiCharacters
class Phonemes:

    def __init__(self, size, action, **env_inf):
        """
        param action: a namedtuple class describing the agent's actions
        param size: a tuple with the number of rows and columns
        param env_inf: a dictionary containing information about the seed and the task
        """
        
        self.size = size
        self.grid = np.zeros(size)
        self.up = action('up', 0, -1, 0)
        self.down = action('down', 1, 1, 0)
        self.left = action('left', 2, 0, -1)
        self.right = action('right', 3, 0, 1)
        self.grab = action('grab', 4, 0, 0)
        self.seed = env_inf['seed']
        self.task = env_inf['sound']
        self.bound = 2 * np.sum(self.size) - 4
        self.area = np.prod(self.size) - self.bound

        # Boundaries
        self.grid[0, :], self.grid[:, 0], self.grid[:, -1], self.grid[-1, :] = 7, 7, 7, 7

        # Words to place on the grid: only a third of the available area
        self.num_words = self.area // 3

        assert self.num_words >= 3, 'Increase the size of the environment'

        self.obstacles = self.area // 9

        # sum all objects in the environment, words, obstacles, goal and agent
        self.total_objects = self.num_words + self.obstacles + 2  # goal + agent
        self.total_agents = self.obstacles + 1  # learner agent

        # Randomly choosing words
        self.short_u = np.random.choice(ʊ_sound, self.num_words // 3)
        self.open_middle_a = np.random.choice(ʌ_sound, self.num_words // 3)
        self.long_u = np.random.choice(uː_sound, self.num_words // 3)
        
        self.agent_pos = None
        self.ʊ_pos = None
        self.ʌ_pos = None
        self.uː_pos = None
        self.obstacle_pos = None
        self.goal_pos = None
        self.time_step = 0
        self.time_limit = round((self.area + self.obstacles) * 3)
        self.dict_map_display = {0: '_', 1: '*', 2: 'ʊ', 3: 'ʌ', 4: 'u:', 5: 'A', 6: 'G', 7: 'X'}

    def env_step(self, action, prints=True):
        """
        Returns the observation, the reward, the boolean `done` and the time step.
        Transitions to another position are checked against walls and obstacles.
        If the agent grabs in a position holding a word, the word is removed
        from the list of words. After the agent's action the obstacles are moved
        and the observation is updated.
        """
        done = False
        
        (x, y) = self.agent_pos
        
        if prints: 
            print('Agent position: {} |  Agent action: {} | Goal: {}'.format(self.agent_pos, action,
                                                                        self.goal_pos))
        reward = -1
        self.time_step += 1

        #############################
        # Undertaking an action
        #############################

        if action == self.up.name:

            self.agent_pos = (x + self.up.delta_x, y)

            if self.agent_pos[0] < 1:
                self.agent_pos = (x, y)
                reward = -10

            elif self.agent_pos in self.obstacle_pos:
                self.agent_pos = (x, y)
                reward = -20

        elif action == self.down.name:
            self.agent_pos = (x + self.down.delta_x, y)

            if self.agent_pos[0] > self.size[0] - 2:
                self.agent_pos = (x, y)
                reward = -10

            elif self.agent_pos in self.obstacle_pos:
                self.agent_pos = (x, y)
                reward = -20

        elif action == self.left.name:

            self.agent_pos = (x, y + self.left.delta_y)
            if self.agent_pos[1] < 1:
                self.agent_pos = (x, y)
                reward = -10

            elif self.agent_pos in self.obstacle_pos:
                self.agent_pos = (x, y)
                reward = -20

        elif action == self.right.name:
            self.agent_pos = (x, y + self.right.delta_y)

            if self.agent_pos[1] > self.size[1] - 2:
                self.agent_pos = (x, y)
                reward = -10

            elif self.agent_pos in self.obstacle_pos:
                self.agent_pos = (x, y)
                reward = -20
                
        elif action == self.grab.name and self.agent_pos in self.ʊ_pos:
            # update list of items left
            self.ʊ_pos.remove(self.agent_pos)
            self.agent_pos = (x, y)

            if self.task == 'short_u':
                reward = 100
            else:
                reward = -10

        elif action == self.grab.name and self.agent_pos in self.ʌ_pos:
            # update list of items left
            self.ʌ_pos.remove(self.agent_pos)
            self.agent_pos = self.agent_pos

            if self.task == 'middle_open':
                reward = 100
            else:
                reward = -10

        elif action == self.grab.name and self.agent_pos in self.uː_pos:

            self.uː_pos.remove(self.agent_pos)
            self.agent_pos = self.agent_pos
            if self.task == 'long_u':
                reward = 100
            else:
                reward = -10

        elif action == self.grab.name and self.agent_pos not in (
                self.ʊ_pos + self.ʌ_pos + self.uː_pos):
            self.agent_pos = self.agent_pos
            reward = -10

        else:
            reward = -1

        #############################
        # Verifying terminal state
        #############################

        # Time limit reached
        w = self.num_words//3 - len(self.ʊ_pos)
        if self.time_step == self.time_limit and self.agent_pos != self.goal_pos:
            done = True
            w = self.num_words//3 - len(self.ʊ_pos)

            if prints:
                print('Episode done')
                print('Last reward: {}'.format(reward))
                print('Words with {} sound collected: {}'.format('ʊ', w))   

        elif self.agent_pos == self.goal_pos and self.time_step <= self.time_limit:
            done = True
            
            # Task words collected; equals num_words // 3 when none are left on the grid
            a = self.num_words // 3 - len(self.ʊ_pos)

            reward = self.area * a

            if prints:
                print('Episode done')
                print('Last reward: {}'.format(reward))
                print('Words with {} sound collected: {}'.format('ʊ', w))                                           
            
        else:
            obst_pos, obs_reward = self.move_obstacles()
            reward = reward - obs_reward
            if prints:
                print('Step reward: {} | Obstacles positions: {}'. format(reward, obst_pos))
            
        observation = self.observe()

        return observation, reward, done, self.time_step

    def move_obstacles(self):
        """
        Moves the obstacles randomly on the grid and updates the list
        of their positions, self.obstacle_pos, for displaying
        """

        directions = [self.up, self.down, self.left, self.right]
        
        obs_reward = 0

        for i in range(len(self.obstacle_pos)):

            (x, y) = self.obstacle_pos[i]

            # Draw a scalar index; int() on a one-element array is deprecated in recent NumPy
            action = directions[np.random.choice(4)]

            # delta_x acts on the row and delta_y on the column; one of them is always 0
            new_pos = (x + action.delta_x, y + action.delta_y)

            if not (1 <= new_pos[0] <= self.size[0] - 2 and 1 <= new_pos[1] <= self.size[1] - 2):
                # The move would hit a wall: stay in place
                new_pos = (x, y)

            elif new_pos in self.obstacle_pos:
                # The target cell is occupied by another obstacle: stay in place
                new_pos = (x, y)

            elif new_pos == self.agent_pos:
                # Collision with the learner: stay in place and record the penalty.
                # The penalty is not reset by later obstacles, so a collision is never lost.
                new_pos = (x, y)
                obs_reward = 20

            self.obstacle_pos[i] = new_pos

        return self.obstacle_pos, obs_reward

    @staticmethod
    def position_to_index(position, size):
        """
        param position: x,y coordinates
        return: coordinates index
        """
        return np.ravel_multi_index(position, size)

    def observe(self):
        """
        Returns a dictionary of the current observation of the environment
        including distance to the goal, to the obstacles and the words left
        in the environment. The agent cannot see a word or the goal if an obstacle is
        superimposed, but knows the location of the words.
        """
        o = dict()

        distance_to_obs = list()
        distance_to_task = list()

        # Distance to the obstacles
        for pos in self.obstacle_pos:
            distance_to_obs.append((np.array(pos) - np.array(self.agent_pos)))

        # Distance to ʊ words
        if self.task == 'short_u':
            for pos in self.ʊ_pos:
                distance_to_task.append((np.array(pos) - np.array(self.agent_pos)))
        elif self.task == 'middle_open':
            for pos in self.ʌ_pos:
                distance_to_task.append((np.array(pos) - np.array(self.agent_pos)))
        else:
            for pos in self.uː_pos:
                distance_to_task.append((np.array(pos) - np.array(self.agent_pos)))

        o['obstacles'] = distance_to_obs
        o['dist_goal'] = np.array(self.goal_pos) - np.array(self.agent_pos)
        o['ʊ_pos'] = distance_to_task
        o['ʊ_coords'] = self.ʊ_pos
        o['agent_pos'] = self.agent_pos
        o['pho_left'] = np.array((len(self.ʊ_pos), len(self.ʌ_pos), len(self.uː_pos)))

        ob_rep, env_ob, _ = self.display()

        # Agent surroundings
        o['neigh'] = env_ob[self.agent_pos[0] - 1:
                            self.agent_pos[0] + 2, self.agent_pos[1] - 1:
                            self.agent_pos[1] + 2]

        return o

    def display(self):
        """
        Displays the action of the agent and the location of the words, goal and obstacles
        :return: string of the environment, an array of the grid without the agent
        (used to build the 3x3 observation) and an array of the environment to render using sns.
        """

        envir_rend = self.grid.copy()

        envir_rend[self.goal_pos] = 6

        for pos in self.ʊ_pos:
            envir_rend[pos] = 2

        for pos in self.ʌ_pos:
            envir_rend[pos] = 3

        for pos in self.uː_pos:
            envir_rend[pos] = 4

        for obs in self.obstacle_pos:
            envir_rend[obs] = 1

        env_ob = envir_rend.copy()

        envir_rend[self.agent_pos] = 5

        rend_grid = ""

        for r in range(self.size[0]):

            line = ''

            for c in range(self.size[1]):
                string_rend = self.dict_map_display[envir_rend[r, c]]

                line += '{0:2}'.format(string_rend)

            rend_grid += line + '\n'

        return rend_grid, env_ob, envir_rend

    def reset(self):
        """
        Randomly places phonemes, obstacles, goal and agent
        :return: observation of the environment
        """

        self.time_step = 0

        coord = list()

        for r in range(1, self.size[0] - 1):
            for c in range(1, self.size[1] - 1):
                coord.append((r, c))

        if self.seed:
            rng.shuffle(coord)
            
        else:
            np.random.shuffle(coord)

        self.ʊ_pos = list()
        self.uː_pos = list()
        self.ʌ_pos = list()
        self.obstacle_pos = list()

        phonemes = self.num_words // 3

        for phoneme in range(phonemes):
            self.ʊ_pos.append(coord.pop())
            self.uː_pos.append(coord.pop())
            self.ʌ_pos.append(coord.pop())

        for obs in range(self.obstacles):
            self.obstacle_pos.append(coord.pop())

        self.goal_pos = coord.pop()

        self.agent_pos = coord.pop()

        observation = self.observe()

        return observation
    

In [None]:
def rend_sns(env_array):
    """
    Convert a numpy array to a sns heat map
    :param env_array: an array representing the environment/grid
    :return: None; shows a heat map of the array
    """
    
    fig,ax = plt.subplots(1, figsize=(6,4))

    # Colors for each of the unique items on the grid for the heatmap
    cmap = ['#ffffd9', '#202603', '#c2e699', '#7fcdbb', '#1d91c0', '#2ac01d', '#f1dc18',
            '#041f61']
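    # Fix the colour scale to the eight object codes so that colours and tick labels
    # stay aligned even when some object type is absent from the grid
    items = len(cmap)
    sns.heatmap(env_array, linewidth=0.5, cmap=ListedColormap(cmap), ax=ax,
                vmin=0, vmax=items - 1)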
    colorbar = ax.collections[0].colorbar
    m = colorbar.vmax - colorbar.vmin
    colorbar.set_ticks(
        [colorbar.vmin + 0.5 * m/ items + m * i / items for i in range(items)])
    colorbar.set_ticklabels(['empty', 'obstacle', 'ʊ', 'ʌ', 'u :', 'agent', 'goal', 'wall'])
    plt.show()


<h2><div>Section 2 : Policies</div></h2>

In [None]:
def biased_policy(observation, actions):
    """
    This is a goal-oriented function: it directs the agent towards the goal and
    only grabs when the agent is on a cell holding a word with the phoneme ʊ
    :param actions: list of actions to perform
    :param observation: a dictionary with the observation of the environment
    :return: an action to be executed by the agent
    """
    
    coord = observation['dist_goal']
    agent = observation['agent_pos']
    obs = observation['neigh']
    short_u = observation['ʊ_coords']

    if agent in short_u:

        action = actions[4]

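    # NOTE: `1 or 7 not in obs` parses as `1 or (7 not in obs)`, which is always true,
    # so the branches below are unreachable as written. If the intent was "no obstacle
    # and no wall nearby", the test would be `1 not in obs and 7 not in obs`.
    elif 1 or 7 not in obs: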

        action = np.random.choice(actions[0:-1])

    elif 7 not in obs[:, 1:] and 7 not in obs[1:, 1:]:
        action = np.random.choice([actions[1], actions[3]])

    elif 7 in obs[:, 2]:
        action = actions[2]

    elif 1 not in obs[:, 1:]:

        action = np.random.choice([actions[0], actions[1], actions[3]])

    elif coord[0] < 0 < coord[1]:

        action = np.random.choice([actions[0], actions[3]])

    # elif coord[0] > 0 and coord[1] > 0:
    elif coord[0] > 0 < coord[1]:

        action = np.random.choice([actions[1], actions[3]])

    elif coord[0] > 0 > coord[1]:

        action = np.random.choice([actions[1], actions[2]])

    elif coord[0] < 0 > coord[1]:
        # Goal is up and to the left of the agent
        action = np.random.choice([actions[0], actions[2]])

    elif coord[0] == 0 and coord[1] < 0:
        action = actions[2]

    elif coord[0] == 0 and coord[1] > 0:

        action = actions[3]

    elif coord[0] > 0 and coord[1] == 0:
        action = actions[1]

    elif coord[0] < 0 and coord[1] == 0:
        action = actions[0]

    return action

In [None]:
def random_policy(observation, actions):
    """
    Choose a random action from a list of actions, unless the agent is on a 'ʊ' word
    :param observation:
    :param actions:
    :return: an action to be executed by the agent
    """  
    agent = observation['agent_pos']
    short_u = observation['ʊ_coords']

    if agent in short_u:
        action = actions[4]
    else:
        action = np.random.choice(actions)    
    
    return action
    

In [None]:
def combined_policy(observation, actions):
    """
    Explore (epsilon) or follows a biased policy
    :param observation:  a dictionary with observation of the environment
    :param actions: list of action to perform
    :return: an action to be executed by the agent
    """

    epsilon  = 0.01
    
    prob = np.random.random()
    
    if prob <= epsilon:
        
        action = np.random.choice(actions)
              
    else:   
        
        action = biased_policy(observation, actions)

    return action

<h2><div>Section 3. Running and Plotting Functions</div></h2>

<h3><div> 3.1 Running the experiment </div></h3>

In [None]:
def run_experiment(environment, episode_stats, policy, pol_name='biased', number_of_episodes=100,
                   display=True, rend_str=True, prints=True):
    """
    Run the experiment
    :param environment: the environment instance created
    :param episode_stats: a namedtuple class to store the stats
    :param policy: a policy function that selects an action
    :param pol_name: a string with name of the policy
    :param number_of_episodes: number time the experiment runs
    :param display: boolean show or not the environment on a heat map
    :param rend_str: boolean show the environment in string characters
    :param prints: boolean provides information about the game
    :return: a namedtuple with statistics of the episode
    """

    assert pol_name in ['biased', 'random',
                        'combined'], 'Name of the policies: biased, random and combined'

    actions = ['up', 'down', 'left', 'right', 'grab']
    episode_reward = list()
    episode_length = list()
    episode_mean = list()
    reward_std = list()

    for _ in tqdm(range(number_of_episodes)):

        reward_list = list()

        # initialize state
        observation = environment.reset()

        # indicate terminal state
        done = False
        # log the accumulated reward
        rewards = 0

        # track time steps
        time_step = 0

        # repeat for each step of episode, until state is terminal
        while not done:

            # increase step counter - for display
            time_step += 1

            # choose action from state
            action = policy(observation, actions)

            # perform an action, observe, reward and done
            next_observation, reward, done, steps = environment.env_step(action, prints=prints)

            # observation <- next_observation
            observation = next_observation

            # accumulate reward
            rewards += reward

            reward_list.append(reward)

            str_env, _, env_rend = environment.display()

            if rend_str:
                print(str_env)

            if display:
                print("Display on")
                rend_sns(env_rend)

        reward_std.append(np.std(reward_list))
        episode_reward.append(rewards)
        episode_mean.append(np.mean(reward_list))
        episode_length.append(time_step)

    best_episode = np.argmax(episode_reward, axis=0)
    print('Max reward: {} | Episode: {} | Steps: {} | Policy: {} '.format(
        np.max(episode_reward), best_episode + 1, episode_length[best_episode], pol_name))

    stats = episode_stats(length_episodes=np.array(episode_length),
                          reward_episodes=np.array(episode_reward),
                          episode_mean_reward=episode_mean, episode_std=np.array(reward_std))

    return stats


<h3><div>3.2 Visualising Statistics</div></h3>

In [None]:
def plot_episodes_stats(stats, pol_name, episodes=None, smoothing_window=10, hideplot=False,
                       env_dim=None):
    """
    :param stats: a namedtuple containing the stats
    :param pol_name: policy to train the agent
    :param episodes: number of episodes run by the agent
    :param smoothing_window: integer, number of observations per window
    :param hideplot: boolean to display the plots
    :param env_dim: string with the environment dimensions
    :return: plots
    Note: This code was adapted from Microsoft, Introduction to Reinforcement Learning.
    """

    assert pol_name in ['biased', 'random',
                        'combined'], 'Policies allowed: biased, random and combined'

    figs_dir = os.path.join(os.getcwd(), 'plots', pol_name)
    os.makedirs(figs_dir, exist_ok=True)
    size = (7, 4)

    # Plot the episode length over time
    fig1 = plt.figure(figsize=size)
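    # Fall back to the number of logged episodes when `episodes` is not given
    if episodes is None:
        episodes = len(stats.length_episodes)
    x = np.arange(1, episodes + 1)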
    plt.plot(x, stats.length_episodes, color='#0000B3')
    plt.xlabel("Episode")
    plt.ylabel("Episode Length")
    plt.title("Episode Length")
    plt.savefig(os.path.join(figs_dir, 'episodes_{}_{}_{}.png'.format(episodes, pol_name, env_dim)))
    if hideplot:
        plt.close()
    else:
        plt.rcParams.update({'font.size': 10})
        plt.show()

    # Plot the episode reward over time
    fig2 = plt.figure(figsize=size)
    rewards_smoothed = pd.Series(stats.reward_episodes).rolling(smoothing_window,
                                                                min_periods=smoothing_window).mean()
    plt.plot(x, rewards_smoothed, color='#0000B3')
    plt.xlabel("Episode")
    plt.ylabel("Episode Reward (Smoothed)")
    plt.title("Episode Reward over Time (Smoothed over window size {})".format(smoothing_window))
    plt.savefig(os.path.join(figs_dir, 'reward_{}_{}_{}.png'.format(episodes, pol_name, env_dim)))
    if hideplot:
        plt.close(fig2)
    else:
        plt.rcParams.update({'font.size': 10})
        plt.show()

    # Plot the episode mean reward per episode
    fig3 = plt.figure(figsize=size)
    mean_smoothed = pd.Series(stats.episode_mean_reward). \
        rolling(smoothing_window, min_periods=smoothing_window).mean()
    plt.plot(x, mean_smoothed, color='#0000B3')
    plt.fill_between(x, mean_smoothed - stats.episode_std / 2,
                     mean_smoothed + stats.episode_std / 2,
                     color='#0000B3', alpha=0.2)
    plt.xlabel("Episode")
    plt.ylabel("Average Reward")
    plt.title("Average Reward per Episode and std (Smoothed over window size {})".format(
        smoothing_window))
    plt.savefig(
        os.path.join(figs_dir, 'average_{}_policy_{}_{}.png'.format(episodes, pol_name, env_dim)))
    if hideplot:
        plt.close(fig3)
    else:
        plt.rcParams.update({'font.size': 10})
        plt.show()

    return fig1, fig2, fig3


<h2><div>Section 4. Running the experiment</div></h2>

<h3><div> 4.1 Visualise the initial state of the environment </div></h3>

In [None]:
# Create the namedtuple that stores the actions
Action = namedtuple('Action', 'name index delta_x delta_y')

In [None]:
# Create tuple to store the statistics
episode_stats = namedtuple("Stats",["length_episodes", "reward_episodes", "episode_mean_reward", "episode_std"])

In [None]:
# Seed the environment and select the task
env_info = {'seed': True, 'sound': 'short_u'}

# Creates the environment
size = (7,7)
my_env= Phonemes(size, Action, **env_info)

obs = my_env.reset()
str_dis, _ , envir_rend= my_env.display()

# Information about the environment
print('Available area: {} | Total objects on the grid: {}'.format(my_env.area, my_env.total_objects))
print('Total number of agents on the grid: {}, {} - learning agent and {} moving obstacle(s)'.
      format(my_env.obstacles + 1, 1, my_env.obstacles))
print('Mission: learn {} sound'.format(my_env.task))
print('Words on the grid: ', my_env.short_u, my_env.open_middle_a, my_env.long_u, sep='\n')
print('Word(s) to find: ', my_env.short_u)

# Renders the environment on a heatmap
print(str_dis)
rend_sns(envir_rend)

In [None]:
num_episodes = 100
pol_name = 'combined'
stats_log = run_experiment(my_env, episode_stats, combined_policy, pol_name, number_of_episodes=num_episodes, 
                           display=False, prints=True)

In [None]:
# Label the saved plots with the actual grid size instead of a hard-coded string
plots = plot_episodes_stats(stats_log, pol_name, episodes=num_episodes, smoothing_window=1, hideplot=False,
                            env_dim='{}x{}'.format(*size))

<h2><div>5. Comparing policies with different sizes of environments</div></h2>

In [None]:
episodes_per_env = 50
envs_per_size = 5
start, end, step = 5, 30, 5


pol_names = ['biased', 'random', 'combined']

pol_dict = defaultdict(list)

for i, policy in tqdm(enumerate([biased_policy, random_policy, combined_policy])):
    
    pol_name = pol_names[i]
    
    aver_reward = list ()
    std_reward = list ()
    
    for size_env in range(start, end+1, step):

        cum_reward = list ()

        for _ in range(envs_per_size):

            my_env = Phonemes((size_env, size_env), Action, **env_info)
        
            exp_stats = run_experiment(my_env, episode_stats, policy, pol_name, episodes_per_env, 
                                       display=False, rend_str=False, prints=False)
            
            cum_reward.append(exp_stats.reward_episodes)

        # Environments stats    
        aver_reward.append(np.mean(cum_reward))
        std_reward.append(np.std(cum_reward))

    pol_dict[pol_name].append([np.asarray(aver_reward), np.asarray(std_reward)])
  

In [None]:
plot_params = {'biased': {'label': 'biased_policy', 'color':'r'}, 
              'random': {'label': 'random_policy', 'color':'b'},
              'combined': {'label': 'combined_policy', 'color':'g'}}

fig_dir = os.path.join(os.getcwd(), 'plots')
os.makedirs(fig_dir, exist_ok=True)

fig = plt.figure(figsize=(7,4))

for k, v in pol_dict.items():
    for m, s in v:
        plt.plot(range(start, end+1, step), m, 'o' + plot_params[k]['color'])
        plt.plot(range(start, end+1, step), m, color = plot_params[k]['color'], 
              label = plot_params[k]['label'] )
        plt.fill_between(range(start,end+1, step), m - s/2, m + s/2, color=plot_params[k]['color'], alpha=0.2)
plt.xlabel('Environment size')
plt.ylabel('Average reward')
plt.legend()
plt.savefig(os.path.join(fig_dir, 'env_comparison_{}_ep.png'.format(episodes_per_env)))
plt.show()