Merged
Changes from all commits
30 commits
a4ae6c7
initial commit for LL-API
vincentpierre Dec 4, 2019
7acd227
fixing ml-agents-envs tests
vincentpierre Dec 4, 2019
c49be07
Implementing action masks
vincentpierre Dec 4, 2019
bcce783
training is fixed for 3DBall
vincentpierre Dec 4, 2019
942c0d2
Tests all fixed, gym is broken and missing documentation changes
vincentpierre Dec 4, 2019
7ccf5fc
adding case where no vector obs
vincentpierre Dec 4, 2019
89b1d34
Fixed Gym
vincentpierre Dec 5, 2019
48d675d
fixing tests of float64
vincentpierre Dec 5, 2019
9d2d70c
fixing float64
vincentpierre Dec 5, 2019
099f12b
reverting some of brain.py
vincentpierre Dec 5, 2019
02cae4d
removing old proto apis
vincentpierre Dec 5, 2019
782584b
comment type fixes
vincentpierre Dec 5, 2019
d0b6d7d
added properties to AgentGroupSpec and edited the notebooks.
vincentpierre Dec 5, 2019
6d09c91
clearing the notebook outputs
vincentpierre Dec 5, 2019
82e5e43
Update gym-unity/gym_unity/tests/test_gym.py
vincentpierre Dec 6, 2019
16988d8
Update gym-unity/gym_unity/tests/test_gym.py
vincentpierre Dec 6, 2019
0fbd170
Update ml-agents-envs/mlagents/envs/base_env.py
vincentpierre Dec 6, 2019
8106f74
Update ml-agents-envs/mlagents/envs/base_env.py
vincentpierre Dec 6, 2019
cc4456a
addressing first comments
vincentpierre Dec 6, 2019
5a45d7b
NaN checks for rewards are back
vincentpierre Dec 6, 2019
5b8a354
restoring Union[int, Tuple[int, ...]] for action_shape
vincentpierre Dec 6, 2019
3d0cad7
Made BatchdStepResult an object
vincentpierre Dec 6, 2019
1148f35
Made _agent_id_to_index private
vincentpierre Dec 6, 2019
d25a885
Update ml-agents-envs/mlagents/envs/base_env.py
vincentpierre Dec 6, 2019
329fa34
replacing np.array with np.ndarray in typing
vincentpierre Dec 6, 2019
98e10fe
adding a new type for AgentGroup and AgentId
vincentpierre Dec 6, 2019
6da142d
fixing brain_info when vec_obs == 0
vincentpierre Dec 6, 2019
617a768
Docs ll api (#3047)
vincentpierre Dec 9, 2019
d8b52c2
adding a period
vincentpierre Dec 9, 2019
81acfaf
removing change log
vincentpierre Dec 9, 2019
1 change: 1 addition & 0 deletions docs/Migrating.md
@@ -3,6 +3,7 @@
## Migrating from master to develop

### Important changes
* The low level Python API has changed. See the [Low Level Python API documentation](Python-API.md) for more information. This should only affect you if you are writing a custom trainer; if you use `mlagents-learn` for training, this should be a transparent change.
* `CustomResetParameters` are now removed.
* `reset()` on the Low-Level Python API no longer takes a `train_mode` argument. To modify the performance/speed of the engine, you must use an `EngineConfigurationChannel`
* `reset()` on the Low-Level Python API no longer takes a `config` argument. `UnityEnvironment` no longer has a `reset_parameters` field. To modify float properties in the environment, you must use a `FloatPropertiesChannel`. For more information, refer to the [Low Level Python API documentation](Python-API.md)
220 changes: 144 additions & 76 deletions docs/Python-API.md
@@ -1,19 +1,17 @@
# Unity ML-Agents Python Interface and Trainers

The `mlagents` Python package is part of the [ML-Agents
Toolkit](https://github.com/Unity-Technologies/ml-agents). `mlagents` provides a
Python API that allows direct interaction with the Unity game engine as well as
a collection of trainers and algorithms to train agents in Unity environments.
# Unity ML-Agents Python Low Level API

The `mlagents` Python package contains two components: a low level API which
allows you to interact directly with a Unity Environment (`mlagents.envs`) and
an entry point to train (`mlagents-learn`) which allows you to train agents in
Unity Environments using our implementations of reinforcement learning or
imitation learning.

You can use the Python Low Level API to interact directly with your learning
environment, and use it to develop new learning algorithms.

## mlagents.envs

The ML-Agents Toolkit provides a Python API for controlling the Agent simulation
The ML-Agents Toolkit Low Level API is a Python API for controlling the simulation
loop of an environment or game built with Unity. This API is used by the
training algorithms inside the ML-Agent Toolkit, but you can also write your own
Python programs using this API. Go [here](../notebooks/getting-started.ipynb)
@@ -24,25 +22,31 @@ The key objects in the Python API include:
- **UnityEnvironment** — the main interface between the Unity application and
your code. Use UnityEnvironment to start and control a simulation or training
session.
- **BrainInfo** — contains all the data from Agents in the simulation, such as
observations and rewards.
- **BrainParameters** — describes the data elements in a BrainInfo object. For
example, provides the array length of an observation in BrainInfo.
- **BatchedStepResult** — contains the data from Agents belonging to the same
"AgentGroup" in the simulation, such as observations and rewards.
- **AgentGroupSpec** — describes the shape of the data inside a BatchedStepResult.
For example, provides the dimensions of the observations of a group.

These classes are all defined in the `ml-agents/mlagents/envs` folder of
the ML-Agents SDK.
These classes are all defined in the [base_env](../ml-agents-envs/mlagents/envs/base_env.py)
script.

An Agent Group is a group of Agents identified by a string name that share the same
observations and action types. You can think of an Agent Group as a group of agents
that will share the same policy or behavior. All Agents in a group have the same goal
and reward signals.

To communicate with an Agent in a Unity environment from a Python program, the
Agent must use a LearningBrain.
Your code is expected to return
actions for Agents with LearningBrains.
Agent in the simulation must have `Behavior Parameters` set to communicate. You
must set the `Behavior Type` to `Default` and give it a `Behavior Name`.

__Note__: The `Behavior Name` corresponds to the Agent Group name on the Python side.

_Notice: Currently communication between Unity and Python takes place over an
open socket without authentication. As such, please make sure that the network
where training takes place is secure. This will be addressed in a future
release._

### Loading a Unity Environment
## Loading a Unity Environment

Python-side communication happens through `UnityEnvironment` which is located in
`ml-agents/mlagents/envs`. To load a Unity environment from a built binary
@@ -51,7 +55,7 @@ of your Unity environment is 3DBall.app, in python, run:

```python
from mlagents.envs.environment import UnityEnvironment
env = UnityEnvironment(file_name="3DBall", worker_id=0, seed=1)
env = UnityEnvironment(file_name="3DBall", base_port=5005, seed=1, side_channels=[])
```

- `file_name` is the name of the environment binary (located in the root
@@ -62,6 +66,9 @@ env = UnityEnvironment(file_name="3DBall", worker_id=0, seed=1)
training process. In environments which do not involve physics calculations,
setting the seed enables reproducible experimentation by ensuring that the
environment and trainers utilize the same random seed.
- `side_channels` provides a way to exchange data with the Unity simulation that
  is not related to the reinforcement learning loop, for example configurations
  or properties. More on them in the [Modifying the environment from Python](Python-API.md#modifying-the-environment-from-python) section; a minimal sketch of passing one at construction time is shown below.
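For instance, the lines below are a minimal sketch of passing a side channel at construction time in order to speed up the simulation. The import path of `EngineConfigurationChannel` and its `set_configuration_parameters` helper are assumptions about the current `ml-agents-envs` layout; verify them against your installed version.

```python
from mlagents.envs.environment import UnityEnvironment
# Assumed import path; check where EngineConfigurationChannel lives in your version.
from mlagents.envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

engine_channel = EngineConfigurationChannel()
env = UnityEnvironment(
    file_name="3DBall",
    base_port=5005,
    seed=1,
    side_channels=[engine_channel],  # data unrelated to the RL loop goes through here
)
# Assumed helper: run the simulation 20x faster than real time during training.
engine_channel.set_configuration_parameters(time_scale=20.0)
```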

If you want to directly interact with the Editor, you need to use
`file_name=None`, then press the :arrow_forward: button in the Editor when the
Expand All @@ -70,59 +77,125 @@ displayed on the screen

### Interacting with a Unity Environment

A BrainInfo object contains the following fields:

- **`visual_observations`** : A list of 4 dimensional numpy arrays. Matrix n of
the list corresponds to the n<sup>th</sup> observation of the Brain.
- **`vector_observations`** : A two dimensional numpy array of dimension `(batch
size, vector observation size)`.
- **`rewards`** : A list as long as the number of Agents using the Brain
containing the rewards they each obtained at the previous step.
- **`local_done`** : A list as long as the number of Agents using the Brain
containing `done` flags (whether or not the Agent is done).
- **`max_reached`** : A list as long as the number of Agents using the Brain
containing true if the Agents reached their max steps.
- **`agents`** : A list of the unique ids of the Agents using the Brain.

Once loaded, the UnityEnvironment object, referenced by a variable named `env`
in this example, can be used in the following way:

- **Print : `print(str(env))`**
Prints all parameters relevant to the loaded environment and the
Brains.
- **Reset : `env.reset()`**
Send a reset signal to the environment, and provides a dictionary mapping
Brain names to BrainInfo objects.
- **Step : `env.step(action)`**
Sends a step signal to the environment using the actions. For each Brain :
- `action` can be one dimensional arrays or two dimensional arrays if you have
multiple Agents per Brain.

Returns a dictionary mapping Brain names to BrainInfo objects.

For example, to access the BrainInfo belonging to a Brain called
'brain_name', and the BrainInfo field 'vector_observations':

```python
info = env.step()
brainInfo = info['brain_name']
observations = brainInfo.vector_observations
```

Note that if you have more than one LearningBrain in the scene, you
must provide dictionaries from Brain names to arrays for `action`, `memory`
and `value`. For example: If you have two Learning Brains named `brain1` and
`brain2` each with one Agent taking two continuous actions, then you can
have:

```python
action = {'brain1':[1.0, 2.0], 'brain2':[3.0,4.0]}
```

Returns a dictionary mapping Brain names to BrainInfo objects.
- **Close : `env.close()`**
Sends a shutdown signal to the environment and closes the communication
socket.
#### The BaseEnv interface

A `BaseEnv` has the following methods:

- **Reset : `env.reset()`** Sends a signal to reset the environment. Returns None.
- **Step : `env.step()`** Sends a signal to step the environment. Returns None.
Note that a "step" for Python does not correspond to either Unity `Update` nor
`FixedUpdate`. When `step()` or `reset()` is called, the Unity simulation will
move forward until an Agent in the simulation needs a input from Python to act.
- **Close : `env.close()`** Sends a shutdown signal to the environment and terminates
the communication.
- **Get Agent Group Names : `env.get_agent_groups()`** Returns a list of agent group ids.
Note that the number of groups can change over time if new agent groups are
created during the simulation.
- **Get Agent Group Spec : `env.get_agent_group_spec(agent_group: str)`** Returns
the `AgentGroupSpec` corresponding to the agent_group given as input. An
`AgentGroupSpec` contains information such as the observation shapes, the action
type (multi-discrete or continuous) and the action shape. Note that the `AgentGroupSpec`
for a specific group is fixed throughout the simulation.
- **Get Batched Step Result for Agent Group : `env.get_step_result(agent_group: str)`**
Returns a `BatchedStepResult` corresponding to the agent_group given as input.
A `BatchedStepResult` contains information about the state of the agents in a group
such as the observations, the rewards, the done flags and the agent identifiers. The
data is stored in `np.array`s whose first dimension is always the number of agents that
requested a decision in the simulation since the last call to `env.step()`. Note that the
number of agents is not guaranteed to remain constant during the simulation.
- **Set Actions for Agent Group : `env.set_actions(agent_group: str, action: np.array)`**
Sets the actions for a whole agent group. `action` is a 2D `np.array` of `dtype=np.int32`
in the discrete action case and `dtype=np.float32` in the continuous action case.
The first dimension of `action` is the number of agents that requested a decision
since the last call to `env.step()`. The second dimension is the number of action branches
for the multi-discrete action type and the number of actions for the continuous action type.
- **Set Action for Agent : `env.set_action_for_agent(agent_group: str, agent_id: int, action: np.array)`**
Sets the action for a specific Agent in an agent group. `agent_group` is the name of the
group the Agent belongs to and `agent_id` is the integer identifier of the Agent. `action`
is a 1D array of `dtype=np.int32` with size equal to the number of action branches for the
multi-discrete action type, or a 1D array of `dtype=np.float32` with size equal to the
number of actions for the continuous action type.


__Note:__ If no action is provided for an agent group between two calls to `env.step()`, then
the default action will be all zeros (in either discrete or continuous action space).
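Putting these methods together, the following is a minimal sketch of an interaction loop that sends random actions to a single agent group. It assumes a continuous action space (as in 3DBall) and is only an illustration, not the trainer used by `mlagents-learn`.

```python
import numpy as np
from mlagents.envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="3DBall", base_port=5005, seed=1, side_channels=[])
env.reset()

group_name = env.get_agent_groups()[0]            # first (and here, only) agent group
group_spec = env.get_agent_group_spec(group_name)

for _ in range(500):
    step_result = env.get_step_result(group_name)
    n_agents = step_result.n_agents()              # agents that requested a decision
    # Random continuous actions of shape (number of agents, action size).
    action = np.random.randn(n_agents, group_spec.action_size).astype(np.float32)
    env.set_actions(group_name, action)
    env.step()

env.close()
```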
#### BatchedStepResult and StepResult

A `BatchedStepResult` has the following fields:

- `obs` is a list of numpy arrays containing the observations collected by the group
  of agents. The first dimension of each array corresponds to the batch size of
  the group (the number of agents that requested a decision since the last call to
  `env.step()`).
- `reward` is a float vector of length batch size. Corresponds to the
rewards collected by each agent since the last simulation step.
- `done` is an array of booleans of length batch size. Is true if the
associated Agent was terminated during the last simulation step.
- `max_step` is an array of booleans of length batch size. Is true if the
associated Agent reached its maximum number of steps during the last
simulation step.
- `agent_id` is an int vector of length batch size containing the unique
  identifier of the corresponding Agent. This is used to track Agents
  across simulation steps.
- `action_mask` is an optional list of two-dimensional arrays of booleans,
  only available for the multi-discrete action type.
  Each array corresponds to an action branch. The first dimension of each
  array is the batch size and the second contains a mask for each action of
  the branch. If true, the action is not available to the agent during
  this simulation step.

It also has the following two methods:

- `n_agents()` Returns the number of agents that requested a decision since
  the last call to `env.step()`.
- `get_agent_step_result(agent_id: int)` Returns a `StepResult`
  for the Agent with the `agent_id` unique identifier.

A `StepResult` has the following fields:

- `obs` is a list of numpy arrays containing the observations collected by the
  agent. (Each array has one less dimension than the corresponding array in `BatchedStepResult`.)
- `reward` is a float. Corresponds to the reward collected by the agent
  since the last simulation step.
- `done` is a bool. Is true if the Agent was terminated during the last
simulation step.
- `max_step` is a bool. Is true if the Agent reached its maximum number of
steps during the last simulation step.
- `agent_id` is an int and a unique identifier for the corresponding Agent.
- `action_mask` is an optional list of one-dimensional arrays of booleans,
  only available for the multi-discrete action type.
  Each array corresponds to an action branch and contains a mask
  for each action of the branch. If true, the action is not available to
  the agent during this simulation step.
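As a concrete illustration, the sketch below (assuming `env` and `group_name` were set up as in the loop above) reads the batched result for a group and then the per-agent view of a single Agent:

```python
step_result = env.get_step_result(group_name)

print("agents in batch :", step_result.n_agents())
print("first obs shape :", step_result.obs[0].shape)   # (batch size, *observation dims)
print("rewards         :", step_result.reward)
print("done flags      :", step_result.done)

# Per-agent view: take the first agent id in the batch and get its StepResult.
first_id = step_result.agent_id[0]
agent_result = step_result.get_agent_step_result(first_id)
print("agent reward    :", agent_result.reward)         # a single float
print("agent obs shape :", agent_result.obs[0].shape)   # one less dimension than above
```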

#### AgentGroupSpec

An Agent Group can have either discrete or continuous actions. To check which type
it is, use `spec.is_action_discrete()` or `spec.is_action_continuous()`. If discrete,
the action tensors are expected to be `np.int32`. If continuous, the actions are
expected to be `np.float32`.

An `AgentGroupSpec` has the following fields:

- `observation_shapes` is a List of Tuples of int: each Tuple corresponds
  to an observation's dimensions (without the number of agents dimension).
  The shape tuples have the same ordering as the observation lists of
  BatchedStepResult and StepResult.
- `action_type` is the type of data of the action. It can be discrete or
  continuous. If discrete, the action tensors are expected to be `np.int32`. If
  continuous, the actions are expected to be `np.float32`.
- `action_size` is an `int` corresponding to the expected dimension of the action
array.
- In continuous action space it is the number of floats that constitute the action.
  - In discrete action space (same as multi-discrete) it corresponds to the
    number of branches (the number of independent actions).
- `discrete_action_branches` is a Tuple of int, only present for the discrete action space. Each int
  corresponds to the number of different options for each branch of the action.
  For example: in a game with a direction input (no movement, left, right) and a jump input
  (no jump, jump), there will be two branches (direction and jump), the first one with 3
  options and the second with 2 options (`action_size = 2` and
  `discrete_action_branches = (3,2,)`).
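To make these shapes concrete, here is a small sketch (again assuming `env` and `group_name` from the earlier examples) that inspects a spec and builds an all-zero action batch with the dtype and shape the group expects:

```python
import numpy as np

spec = env.get_agent_group_spec(group_name)
print("observation shapes :", spec.observation_shapes)

n_agents = env.get_step_result(group_name).n_agents()
if spec.is_action_continuous():
    # (batch size, number of floats per action)
    action = np.zeros((n_agents, spec.action_size), dtype=np.float32)
else:
    # (batch size, number of branches); each entry must be a valid option index,
    # i.e. smaller than the matching entry of discrete_action_branches.
    print("branch sizes :", spec.discrete_action_branches)
    action = np.zeros((n_agents, spec.action_size), dtype=np.int32)
env.set_actions(group_name, action)
```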


### Modifying the environment from Python
The Environment can be modified by using side channels to send data to the
@@ -194,8 +267,3 @@ var academy = FindObjectOfType<Academy>();
var sharedProperties = academy.FloatProperties;
float property1 = sharedProperties.GetPropertyWithDefault("parameter_1", 0.0f);
```
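On the Python side, the counterpart to the C# snippet above is a sketch along these lines. The import path and method names of `FloatPropertiesChannel` are assumptions about the current `ml-agents-envs` layout, so check them against your installed version.

```python
from mlagents.envs.environment import UnityEnvironment
# Assumed import path; check where FloatPropertiesChannel lives in your version.
from mlagents.envs.side_channel.float_properties_channel import FloatPropertiesChannel

float_props = FloatPropertiesChannel()
env = UnityEnvironment(file_name="3DBall", side_channels=[float_props])

# Assumed setter: the C# code above should then read 2.0 for "parameter_1".
float_props.set_property("parameter_1", 2.0)
env.reset()
```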

## mlagents-learn

For more detailed documentation on using `mlagents-learn`, check out
[Training ML-Agents](Training-ML-Agents.md)
40 changes: 28 additions & 12 deletions gym-unity/gym_unity/envs/__init__.py
@@ -4,6 +4,10 @@
import numpy as np
from mlagents.envs.environment import UnityEnvironment
from gym import error, spaces
from mlagents.envs.brain_conversion_utils import (
step_result_to_brain_info,
group_spec_to_brain_parameters,
)


class UnityGymException(error.Error):
@@ -53,10 +57,9 @@ def __init__(
)

# Take a single step so that the brain information will be sent over
if not self._env.brains:
if not self._env.get_agent_groups():
self._env.step()

self.name = self._env.academy_name
self.visual_obs = None
self._current_state = None
self._n_agents = None
@@ -67,18 +70,17 @@ def __init__(
self._allow_multiple_visual_obs = allow_multiple_visual_obs

# Check brain configuration
if len(self._env.brains) != 1:
if len(self._env.get_agent_groups()) != 1:
raise UnityGymException(
"There can only be one brain in a UnityEnvironment "
"if it is wrapped in a gym."
)
if len(self._env.external_brain_names) <= 0:
raise UnityGymException(
"There are not any external brain in the UnityEnvironment"
)

self.brain_name = self._env.external_brain_names[0]
brain = self._env.brains[self.brain_name]
self.brain_name = self._env.get_agent_groups()[0]
self.name = self.brain_name
brain = group_spec_to_brain_parameters(
self.brain_name, self._env.get_agent_group_spec(self.brain_name)
)

if use_visual and brain.number_visual_observations == 0:
raise UnityGymException(
@@ -103,7 +105,11 @@ def __init__(
)

# Check for number of agents in scene.
initial_info = self._env.reset()[self.brain_name]
self._env.reset()
initial_info = step_result_to_brain_info(
self._env.get_step_result(self.brain_name),
self._env.get_agent_group_spec(self.brain_name),
)
self._check_agents(len(initial_info.agents))

# Set observation and action spaces
@@ -153,7 +159,11 @@ def reset(self):
Returns: observation (object/list): the initial observation of the
space.
"""
info = self._env.reset()[self.brain_name]
self._env.reset()
info = step_result_to_brain_info(
self._env.get_step_result(self.brain_name),
self._env.get_agent_group_spec(self.brain_name),
)
n_agents = len(info.agents)
self._check_agents(n_agents)
self.game_over = False
@@ -201,7 +211,13 @@ def step(self, action):
# Translate action into list
action = self._flattener.lookup_action(action)

info = self._env.step(action)[self.brain_name]
spec = self._env.get_agent_group_spec(self.brain_name)
action = np.array(action).reshape((self._n_agents, spec.action_size))
self._env.set_actions(self.brain_name, action)
self._env.step()
info = step_result_to_brain_info(
self._env.get_step_result(self.brain_name), spec
)
n_agents = len(info.agents)
self._check_agents(n_agents)
self._current_state = info