# Unity ML-Agents

This notebook demonstrates the use of the Unity ML-Agents environment.

### Set up an Environment

This notebook was tested with `Unity Hub Version 2.4.3`, `Unity Version 2020.3.4f1`, `Unity ML Agent Version 1.8.0-preview`, and the Python package `mlagents-envs` version `0.25.1` (`Release 16`).
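
To check whether the installed Python package matches the version listed above, the package metadata can be queried. The snippet below is a small sketch and not part of the original notebook: `installed_version` is a hypothetical helper, and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# the notebook was tested with mlagents-envs 0.25.1
print('mlagents-envs:', installed_version('mlagents-envs'))
```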

Follow the [Unity ML-Agents Documentation](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Readme.md) to set up a Unity ML-Agents environment from scratch.

### Get Images from Unity Environment for Rendering in Notebook

To get images of the Unity scene, add [these 4 scripts](https://i.imgur.com/nyKG4Yk.png) to the `Main Camera`.

In [None]:
import random
import pprint
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import animation
%matplotlib inline

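from IPython.display import display as Display      # IPython's display function (note the swapped capitalisation)
from IPython.display import HTML
from pyvirtualdisplay import Display as display     # pyvirtualdisplay's virtual display class
display = display(visible=0, size=(1400, 900))      # hidden virtual screen for rendering without a monitor
display.start()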

from mlagents_envs.base_env import ActionTuple
from mlagents_envs.environment import UnityEnvironment

### Method, Parameters and Style-sheets for Rendering the Scene inside the Notebook

In [None]:
%%html
<style>
.output_wrapper button.btn.btn-default,
.output_wrapper .ui-dialog-titlebar,
.output_wrapper .mpl-message {
    display: none;
}
.output_wrapper .ui-dialog-titlebar + div {
    border: none !important;
    overflow: hidden !important;
}
</style>

In [None]:
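def animate_frames(frames):
    """Animate a list of frames (H x W x 3 colour or H x W greyscale arrays) inside the notebook.

    Relies on the globals `display_width`, `display_height`, `IMAGE_SCALE`, `DPI`
    and `ANIMATION_INTERVAL`, which are defined in later cells.
    """
    def display_animation(anim):
        plt.close(anim._fig)                                  # close the static figure so only the animation is shown
        return HTML(anim.to_jshtml())
    plt.axis('off')
    cmap = None if len(frames[0].shape) == 3 else 'Greys'     # colour frames need no colormap
    patch = plt.imshow(frames[0], cmap=cmap, aspect='auto')
    plt.gcf().set_size_inches(display_width * IMAGE_SCALE / DPI, display_height * IMAGE_SCALE / DPI)
    fanim = animation.FuncAnimation(plt.gcf(), lambda x: patch.set_data(frames[x]), frames=len(frames), interval=ANIMATION_INTERVAL)
    Display(display_animation(fanim))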

In [None]:
DPI                                          = 96     # https://www.infobyip.com/detectmonitordpi.php
IMAGE_SCALE                                  = 1.4    # large scale == more buffer space == slow to render
matplotlib.rcParams['animation.embed_limit'] = 2**128 # max size of the embedded animation in MB (effectively unlimited)
ANIMATION_INTERVAL                           = 60     # delay between frames in ms
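
The figure size used for rendering follows from these parameters: the frame size in pixels is multiplied by `IMAGE_SCALE` and divided by `DPI` to get inches, which is the same arithmetic `animate_frames` applies. A minimal sketch of that conversion, where `figure_size_inches` and the 960x480 frame size are hypothetical values chosen for illustration:

```python
def figure_size_inches(width_px, height_px, scale, dpi):
    """Convert a frame size in pixels to a Matplotlib figure size in inches."""
    return (width_px * scale / dpi, height_px * scale / dpi)


# hypothetical 960x480 camera frame, no extra scaling, 96 DPI monitor
print(figure_size_inches(960, 480, 1.0, 96))  # (10.0, 5.0)
```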

---

Next, start the environment!

**Before running the code cell below**, change the `ENVIRONMENT` parameter to match the location of the Unity environment: `path/to/<UNITY-ML-APP>`

Also, the `CAMERA_AGENT_NAME_PREFIX` parameter should match the `Behavior Name` corresponding to the `Main Camera`.

In [None]:
ENVIRONMENT = './RollerBall/Linux_Headless/RollerBall.x86_64'

CAMERA_AGENT_NAME_PREFIX = 'Camera'

### Load the Environment

In [None]:
env = UnityEnvironment(file_name=ENVIRONMENT, seed=1, side_channels=[])

### Examine the State and Action Spaces

The simulation contains a single agent navigating through an environment. At each time step, it has five actions to choose from:
- `0` - no action
- `1` - move up 
- `2` - move down
- `3` - move left
- `4` - move right
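
For logging or debugging, the action indices listed above can be mapped to readable labels. This is only an illustrative sketch; `ACTION_LABELS` and `describe_action` are hypothetical names that do not appear elsewhere in the notebook.

```python
# labels copied from the list of actions above
ACTION_LABELS = {
    0: 'no action',
    1: 'move up',
    2: 'move down',
    3: 'move left',
    4: 'move right',
}


def describe_action(action_index):
    """Return the human-readable label for a discrete action index."""
    if action_index not in ACTION_LABELS:
        raise ValueError('unknown action index: {}'.format(action_index))
    return ACTION_LABELS[action_index]


print(describe_action(3))  # move left
```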

The state space has `8` dimensions and contains the agent's position, the target's position, and the agent's velocity along the x and y directions.

A reward of `+1.0` is provided for hitting the target, a penalty of `-2.0` is given for falling off the floor, and for each time step spent navigating, a penalty of `-0.01` is given.
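
As a worked example of this reward scheme, the total return of an episode can be estimated from its length and how it ended. The sketch below assumes that the `-0.01` penalty applies to every step, including the final one, and that a timed-out episode receives no terminal reward; both are assumptions rather than facts about the environment, and `episode_return` is a hypothetical helper.

```python
REWARD_TARGET = 1.0    # reward for hitting the target
PENALTY_FALL = -2.0    # penalty for falling off the floor
PENALTY_STEP = -0.01   # penalty for each time step


def episode_return(n_steps, outcome):
    """Total reward of an episode that lasted n_steps and ended with the given outcome.

    outcome is one of 'target', 'fall', or 'timeout'.
    """
    terminal_reward = {
        'target': REWARD_TARGET,
        'fall': PENALTY_FALL,
        'timeout': 0.0,  # assumption: no extra reward when the step limit is reached
    }[outcome]
    return n_steps * PENALTY_STEP + terminal_reward


print(episode_return(50, 'target'))  # roughly 0.5
print(episode_return(20, 'fall'))    # roughly -2.2
```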

There is also a `Behavior Name` that looks like `Camera?*`; it corresponds to the images of the scene.
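
The camera behavior can be told apart from the learning agents by its name prefix. A minimal sketch of that split, using hypothetical behavior names (the actual names depend on the Unity scene) and a hypothetical helper `split_behaviors`:

```python
def split_behaviors(behavior_names, camera_prefix):
    """Separate the camera behavior name from the learning-agent behavior names."""
    camera = None
    agents = []
    for name in behavior_names:
        if name.startswith(camera_prefix):
            camera = name
        else:
            agents.append(name)
    return agents, camera


# hypothetical names, for illustration only
names = ['RollerBall?team=0', 'Camera?team=0']
agents, camera = split_behaviors(names, 'Camera')
print('agents:', agents)
print('camera:', camera)
```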

In [None]:
env.reset()

agent_info = {}
camera_agent, display_width, display_height = None, None, None
for behavior_name in env.behavior_specs.keys():
    print('\nBehavior Name: {}'.format(behavior_name))
    print('State Space: {}'.format(env.behavior_specs[behavior_name].observation_specs))
    print('Action Space: {}'.format(env.behavior_specs[behavior_name].action_spec))
    if not behavior_name.startswith(CAMERA_AGENT_NAME_PREFIX):
        agent_info[behavior_name] = {
            'state_size': env.behavior_specs[behavior_name].observation_specs[0].shape[0],
            'continuous_action_size': env.behavior_specs[behavior_name].action_spec.continuous_size,
            'discrete_action_n_branches': env.behavior_specs[behavior_name].action_spec.discrete_size,
            'discrete_action_branches': env.behavior_specs[behavior_name].action_spec.discrete_branches
        }
    else:
        camera_agent = behavior_name
        display_height = env.behavior_specs[behavior_name].observation_specs[0].shape[0]
        display_width = env.behavior_specs[behavior_name].observation_specs[0].shape[1]

### IDs and State and Action Spaces of the Agent

In [None]:
pprint.pprint(agent_info)


num_agents = len(agent_info)
state_size = list(agent_info.values())[0]['state_size']
action_size = list(agent_info.values())[0]['discrete_action_branches'][0]

print('\nNumber of Agents: {}'.format(num_agents))
print('State Size: {}'.format(state_size))
print('Action Size: {}'.format(action_size))

### `Teams` and `Agents`

There can be multiple agent-groups (`teams`) in the environment, and each team can have multiple `agents` sharing the same behaviour (state, action, and reward space). Different teams usually follow different behaviours.
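
Behavior names of the form `Name?team=N` can be split into the behavior name and the team ID. The sketch below assumes that format, which is suggested by the `Camera?*` pattern mentioned earlier but is not guaranteed for every environment; `parse_behavior_name` is a hypothetical helper.

```python
def parse_behavior_name(full_name):
    """Split 'Name?team=N' into (name, team_id); team_id is None if no team is given."""
    name, _, query = full_name.partition('?')
    team_id = None
    for part in query.split('&'):
        key, _, value = part.partition('=')
        if key == 'team' and value.isdigit():
            team_id = int(value)
    return name, team_id


print(parse_behavior_name('RollerBall?team=0'))  # ('RollerBall', 0)
```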

To see all the available `teams` and the `agents` (IDs) in each team, execute the following cell.

In [None]:
env.reset()
for agent_group in agent_info:
    decision_steps, _ = env.get_steps(agent_group)
    print('Team Name: {}\tAgent IDs: {}'.format(agent_group, decision_steps.agent_id))

### Take random actions in the environment

In the next code cells, we will learn how to use the Python API to control the agent and receive feedback from the environment.

When the main loop runs, the agent selects an action uniformly at random at each time step. If `RENDER_REAL_TIME` is `True`, the scene is drawn live in the notebook as the agent moves through the environment; otherwise the frames are collected and played back as an animation at the end.

In [None]:
def random_action(agent_params):
    """Sample one random action (continuous and discrete parts) for a single agent."""
    n_branches = agent_params['discrete_action_n_branches']
    # continuous actions are drawn uniformly from [-1, 1)
    _continuous = np.random.uniform(low=-1.0, high=1.0, size=(1, agent_params['continuous_action_size']))
    # one integer per discrete branch, each drawn uniformly from [0, branch size)
    _discrete = np.zeros((1, n_branches), dtype=np.int32)
    for branch_idx in range(n_branches):
        _discrete[0, branch_idx] = np.random.randint(0, agent_params['discrete_action_branches'][branch_idx])
    return ActionTuple(continuous=_continuous, discrete=_discrete)
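
The discrete part of `random_action` draws one integer per branch, each uniformly from the valid range of that branch. The same idea can be sketched with the standard library alone; `sample_discrete_action` is a hypothetical helper, and the branch sizes are illustrative.

```python
import random


def sample_discrete_action(branch_sizes, rng=random):
    """Draw one integer per branch, uniformly from range(branch_size)."""
    return [rng.randrange(size) for size in branch_sizes]


# a single branch with five choices, as in the action list of this environment
print(sample_discrete_action((5,)))
```

Passing a seeded `random.Random` instance as `rng` makes the sampled actions reproducible.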

In [None]:
RENDER_REAL_TIME = False

In [None]:
if RENDER_REAL_TIME:
    %matplotlib notebook
    fig = plt.figure()
    plt.gcf().set_size_inches(display_width * IMAGE_SCALE / DPI, display_height * IMAGE_SCALE / DPI)
    plt.axis('off')
    img = plt.imshow(np.zeros((display_height, display_width, 3)))
else:
    %matplotlib inline
    frames = []

for episode_i in range(2):
    env.reset()                                                           # reset the environment
    score, terminated, experience = {}, {}, {}
    while True:
        for behavior_name, agent_params in agent_info.items():            # for every agent-group in the environment
            decision_steps, terminal_steps = env.get_steps(behavior_name) # get agents' status in the agent-group
            if behavior_name not in score:
                score[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    score[behavior_name][i] = 0
            if behavior_name not in terminated:
                terminated[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    terminated[behavior_name][i] = False
            if behavior_name not in experience:
                experience[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    experience[behavior_name][i] = {
                        'state': None,
                        'action': None,
                        'reward': None,
                        'next_state': None
                    }
            for agent_id in decision_steps.agent_id:                          # for every agent in the agent-group                
                state = decision_steps[agent_id].obs                          # get the initial state for the agent

                # select an action based on the policy
                # policy.choose(behavior_name, agent_id, state)
                action = random_action(agent_params)                          # select an action

                experience[behavior_name][agent_id]['state'] = state
                experience[behavior_name][agent_id]['action'] = action
                env.set_action_for_agent(behavior_name, agent_id, action)     # send the action to the agent
                # env.set_actions(behavior_name, action)                      # send the action to the agent-group
        if camera_agent:                                                      # render the screen
            decision_steps, _ = env.get_steps(camera_agent)
            if (len(decision_steps.agent_id)):
                camera_agent_id = decision_steps.agent_id[0]
                image = decision_steps[camera_agent_id].obs
                if RENDER_REAL_TIME:
                    img.set_data(image[0])
                    fig.canvas.draw()
                else:
                    frames.append(image[0])
                    

        env.step()                                                            # one step through the environment

        for behavior_name in agent_info:                                      # for every agent-group in the environment
            decision_steps, terminal_steps = env.get_steps(behavior_name)     # get agents' status in the agent-group    
            for agent_id in decision_steps.agent_id:
                reward = decision_steps[agent_id].reward                      # get the reward
                next_state = decision_steps[agent_id].obs                     # get the next state
                score[behavior_name][agent_id] += reward
                experience[behavior_name][agent_id]['reward'] = reward
                experience[behavior_name][agent_id]['next_state'] = next_state
            for agent_id in terminal_steps.agent_id:                          # see if episode finished for an agent
                reward = terminal_steps[agent_id].reward
                score[behavior_name][agent_id] += reward
                experience[behavior_name][agent_id]['reward'] = reward
                experience[behavior_name][agent_id]['next_state'] = None
                terminated[behavior_name][agent_id] = True
                if terminal_steps[agent_id].interrupted:
                    print('agent #{} in the agent-group "{}" has reached the maximum number of steps in the episode'.format(
                        agent_id, behavior_name))
                    
        if camera_agent:                                                      # render the screen
            decision_steps, _ = env.get_steps(camera_agent)
            if (len(decision_steps.agent_id)):
                camera_agent_id = decision_steps.agent_id[0]
                image = decision_steps[camera_agent_id].obs
                if RENDER_REAL_TIME:
                    img.set_data(image[0])
                    fig.canvas.draw()
                else:
                    frames.append(image[0])

        # train the RL agent with the experience tuple
        # agent.step(experience)

        # if EVERY agent in EVERY agent-group is done
        if np.asarray([np.asarray(list(agent_group.values())).all() for agent_group in terminated.values()]).all():
            break

    print(score)

if not RENDER_REAL_TIME: animate_frames(frames)
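
The termination check in the loop above stops an episode only when every agent in every agent-group is done. The same condition can be written with plain Python `all`, as in this sketch; `all_agents_done` is a hypothetical helper that takes a dictionary shaped like `terminated`.

```python
def all_agents_done(terminated):
    """Return True when every agent in every agent-group has terminated."""
    return all(done for group in terminated.values() for done in group.values())


# hypothetical bookkeeping for two agent-groups
print(all_agents_done({'RollerBall?team=0': {0: True, 1: True}}))   # True
print(all_agents_done({'RollerBall?team=0': {0: True, 1: False}}))  # False
```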

In [None]:
env.close()