# ENVIRONMENT TUTORIAL

**ABOUT THE ENVIRONMENT**

**Software Requirements:** NumPy and OpenCV (`cv2`)

Fundamentals of the environment:

1. This environment is called Survival Gridworld.
2. Survival Gridworld is a two-dimensional grid environment in which obstacles and rewards are distributed between the start-point and the end-point.
3. The task of the agent is to navigate a path from the start-point to the end-point while keeping a positive Energy (also called Exploration Capacity).
4. A starting Energy value must be defined.
5. The agent is allowed to move in one of the four adjacent directions.
6. Each movement in the gridworld costs a fixed amount of Energy, defined by the state transition penalty δ (delta_s).
7. Path Rewards: Positive values in the grid matrix increase the Energy, while negative values decrease it.
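The Energy bookkeeping described in items 6 and 7 can be sketched as follows. This is a minimal illustration, not the environment's actual code: the helper name `update_energy` is hypothetical, and it assumes the transition penalty and the path reward are both applied in the same step.

```python
def update_energy(energy, path_reward, delta_s):
    """Return the Energy after one move (hypothetical helper, not part of the environment API)."""
    # Assumption: every move pays the fixed transition penalty delta_s,
    # and the value of the entered cell is then added (negative values subtract).
    return energy - delta_s + path_reward


energy = 20.0
for rho in [0, 5, -8]:  # path rewards of three consecutively visited cells
    energy = update_energy(energy, rho, delta_s=1.4)

print(round(energy, 2))
```

With `delta_s = 1.4`, the three moves cost 4.2 Energy in total, while the visited cells contribute a net -3.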

The state is composed of four (4) channels:

1. Current State - n x n (n should be odd) cut-out of the surroundings relative to the agent’s current position.
2. Previous State - n x n (n should be odd) cut-out of the surroundings relative to the agent’s previous position.
3. Location State - shows the positions of the start-point and the goal, together with the agent’s current position and its four previous positions.
4. E State - represents the current Energy or Exploration Capacity as a 2-D energy bar.
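To make the n x n cut-out of the first two channels concrete, here is a minimal NumPy sketch. The function `cut_out` and the zero padding outside the grid are assumptions made for illustration; the environment may represent out-of-bounds cells differently.

```python
import numpy as np


def cut_out(grid, position, n=5, pad_value=0):
    """Return the n x n window of `grid` centred on `position` = (row, col).

    Hypothetical helper; pad_value for cells outside the grid is an assumption.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the agent sits at the centre")
    half = n // 2
    # Pad all sides so that windows near the boundary are still n x n.
    padded = np.pad(np.asarray(grid), half, mode="constant", constant_values=pad_value)
    row, col = position
    # Original index (row, col) sits at (row + half, col + half) after padding,
    # so the window starts at (row, col) in padded coordinates.
    return padded[row:row + n, col:col + n]


grid = np.arange(36).reshape(6, 6)
window = cut_out(grid, (0, 0), n=3)
print(window)
```

The centre element of the returned window is always the cell the agent occupies, which is why n has to be odd.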

Allowed actions:

1. 0 - Up 
2. 1 - Down 
3. 2 - Left 
4. 3 - Right
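The action codes above could translate into grid moves roughly as in this sketch. It assumes (row, col) coordinates with row 0 at the top, and that a move into the boundary leaves the agent in place; `ACTION_DELTAS` and `move` are hypothetical names, not part of the environment.

```python
# Assumed mapping from action code to (row, col) offset, with row 0 at the top.
ACTION_DELTAS = {
    0: (-1, 0),  # Up
    1: (1, 0),   # Down
    2: (0, -1),  # Left
    3: (0, 1),   # Right
}


def move(position, action, grid_shape):
    """Return (new_position, bumped) for one action (hypothetical helper)."""
    row, col = position
    d_row, d_col = ACTION_DELTAS[action]
    new_row, new_col = row + d_row, col + d_col
    height, width = grid_shape
    bumped = not (0 <= new_row < height and 0 <= new_col < width)
    if bumped:
        # Assumption: the agent stays where it is when it hits the boundary.
        return (row, col), True
    return (new_row, new_col), False


print(move((9, 11), 2, (12, 12)))
print(move((0, 0), 0, (12, 12)))
```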

Reward function:

Let: ρ => Path rewards or Consumable rewards

1. r = +1: If Success
2. r = -1: If Failure
3. r = -0.3: If agent bumps gridworld boundary
4. r = ρ / (ρmax + σ); σ = 5: If ρ > 0
5. r = ρ / (|ρmin| + σ); σ = 5: If ρ < 0
6. (The constant σ is added to the denominator so that path rewards stay strictly between the terminal rewards: -1 < r(ρmin) and r(ρmax) < +1.)
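The path-reward scaling can be illustrated with a short sketch. It assumes that ρmin enters the denominator by its magnitude (otherwise the result for negative ρ would not stay above -1) and that ρ = 0 yields a reward of 0, which the rules above do not state explicitly. The function name `shaped_reward` is hypothetical.

```python
SIGMA = 5.0  # constant from the reward definition


def shaped_reward(rho, rho_max, rho_min, sigma=SIGMA):
    """Scale a path reward rho into the open interval (-1, 1) (hypothetical helper)."""
    if rho > 0:
        return rho / (rho_max + sigma)
    if rho < 0:
        # Assumption: the magnitude of rho_min is used so the result stays negative.
        return rho / (abs(rho_min) + sigma)
    # Assumption: an empty cell gives no shaped reward.
    return 0.0


# Example with rho_max = 10 and rho_min = -20, the extremes of the sample grid matrix.
print(round(shaped_reward(10, 10, -20), 3))
print(round(shaped_reward(-20, 10, -20), 3))
```

Even the largest and smallest path rewards map to values strictly inside (-1, 1), so reaching the goal (+1) or failing (-1) always outweighs any single cell.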

Score function:

1. If Success: score = max(1, (current_energy - initial_energy + 1))
2. If Failure: score = 0
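As a worked example of the Score function, the following sketch reproduces the two cases directly. The name `episode_score` is hypothetical; in the environment the score is obtained through `env.get_score()`.

```python
def episode_score(success, current_energy, initial_energy):
    """Score of a finished episode, following the two rules above (hypothetical helper)."""
    if not success:
        return 0
    # A successful episode scores at least 1, plus any net Energy gained.
    return max(1, current_energy - initial_energy + 1)


print(episode_score(True, 26.5, 20))   # net gain of 6.5 Energy
print(episode_score(True, 3.0, 20))    # net loss, score is floored at 1
print(episode_score(False, 50.0, 20))  # failure always scores 0
```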

Mechanics for Infinite Resource conversion:

1. The Energy or Exploration Capacity only determines the score (through the Score function); it does not trigger termination.
2. If Energy < 0, it is reset to 0.1.
3. The episode does not end until the Success condition is met or the agent consumes a path reward <= p_terminate (set by the user).
4. To avoid getting stuck indefinitely in an episode during training, the episode also ends after max_steps steps (set by the user).
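The four rules above can be summarized in a small sketch. Both helpers are hypothetical and only mirror the description; in particular, whether the step limit is checked with `>=` or `>` is an assumption.

```python
def clamp_energy(energy):
    """Rule 2: a negative Energy is reset to 0.1 instead of ending the episode."""
    return 0.1 if energy < 0 else energy


def should_terminate(reached_goal, path_reward, step, p_terminate, max_steps):
    """Rules 3 and 4: decide whether an infinite-resource episode ends (hypothetical helper)."""
    if reached_goal:
        return True
    if path_reward <= p_terminate:
        # The agent consumed a sufficiently bad cell.
        return True
    # Assumption: the episode ends once `step` reaches max_steps.
    return step >= max_steps


print(clamp_energy(-3.2))
print(should_terminate(False, -12, step=4, p_terminate=-10, max_steps=180))
print(should_terminate(False, -5, step=4, p_terminate=-10, max_steps=180))
```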

In [None]:
import environment as _env
from numpy import random

In [None]:
# Using DEFAULT environment
env = _env.Environment(is_default=True)

In [None]:
# Using RANDOM environment
env = _env.Environment(is_default=False, grid_size=[14, 14])

In [None]:
# Using CUSTOM environment

# Initialize default
env = _env.Environment(is_default=True)

# Sample custom gridworld matrix (12 x 12)
gmatrix = [[   0,   5,   0,  -8, -10, -12,  0,   0,   0,  -12,   5,   0],
           [   0,   0,   3, -13,   9,   0,   8,   0,  10,   0,   1,   1],
           [   7,   0,   0,   0,   6,  -5,   2,   2,   4,   0,   3,   0],
           [   0,   0,   0,   4, -17,   0, -17,  -7,   0,   0,   0,   0],
           [ -10,   5, -18,   0, -12,   0,   0,  10, -20,   5,   0, -20],
           [   0,  10,  -1,  -3,   0,   5,   0,   4,   0,   0,   8,   0],
           [  -9, -20,  -8,   0,   5, -12,   3,   0,   0, -10, -20,  10],
           [   0, -14,   9,   0,  -9, -20,   0,   6,   0, -20,   0,  -1],
           [ -14,   0,  -4,   1,  -4,   2,   5,  -4,  10, -18, -20, -13],
           [  -9,   0,   0,   2,   8, -12,   0, -14,   0, -20,   0,  -7],
           [   0,  -7, -20,   0,   0,  -6,   0, -17,  -2,   0,  -8,   0],
           [  -9,   0,  10,   0,   4,   0,   3,   0,   4,   0,   0,  10]]

# Define custom environment
env.custom_environment(gridmatrix=gmatrix, delta_s=1.4, start=[9, 11], end=[2, 0])

In [None]:
# Using INFINITE RESOURCE ENVIRONMENT for other experiment purposes

# Initialize default
env = _env.Environment(is_default=True)

# Switch to the infinite-resource gridworld and set the termination threshold
# (the episode ends when a consumed path reward is <= p_terminate).
# max_steps caps the number of steps per episode when no terminal state is reached;
# max_steps=None uses the default of 1.25 x grid_height x grid_width.
env.set_inf_resource(set_env=True, p_terminate=-10, max_steps=None)
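For orientation, the default step limit mentioned in the comment above works out as follows. This is only a sketch: `default_max_steps` is a hypothetical helper, and truncating to an integer is an assumption about how the environment rounds.

```python
def default_max_steps(grid_height, grid_width):
    """Default step limit of 1.25 x grid_height x grid_width (hypothetical helper)."""
    # Assumption: the product is truncated to an integer number of steps.
    return int(1.25 * grid_height * grid_width)


print(default_max_steps(12, 12))  # 12 x 12 grid, as in the custom example
print(default_max_steps(14, 14))  # 14 x 14 grid, as in the random example
```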

In [None]:
# Set the n x n cut-size for the Current State and the Previous State
# Default nxn cut-out size = [5, 5]
env.fstate_size = [7, 7]

In [None]:
# Set the initial energy or exploration capacity
initial_e = 20

In [None]:
# Agent interaction loop
def run(initial_e, path_render=False):
    
    # Reset all environment parameters
    state = env.reset(initial_e)
    
    step = 1
    total_rewards = 0.0
    
    while True:

        # Render the environment
        env.render(delay=500)

        # Generate a random action in {0, 1, 2, 3} (high is exclusive)
        action = random.randint(low=0, high=4)
        # If a policy model is used instead: action = policy_model(state)

        # path_reward is added to (or subtracted from) the energy;
        # reward is the signal for the reinforcement learning algorithm
        next_state, path_reward, reward, end_episode = env.step(action)
        total_rewards += reward
        
        print("@step", step, 
              ": path reward = ", path_reward, 
              " reward = ", round(reward, 2))
            
        if end_episode:
            print("episode score = ", env.get_score(),
                  " total rewards = ", round(total_rewards, 2))

            # Render the environment at the end of the episode
            env.render(delay=500)

            # Render the entire path of the agent
            if path_render:
                env.path_render(delay=7000)
            
            env.close_render()
            env.close_path_render()
            break
        else:
            state = next_state # Update current state
            step += 1

In [None]:
run(initial_e, path_render=True)