# Modifying Reinforcement Learning Environments
### Or, How To Better Train Your Robot Ant

In this post we'll talk about reinforcement learning, one of the three main categories of machine learning, as the means to train a simulated robot to complete a task.  Rather than a general introduction, which is done very well in many places, we'll be taking some off-the-shelf solutions and tweaking them in a couple different ways in order to discuss how those modifications affect the learning, rewards, and ultimately the robot agent's behavior.

## A Quick Overview of Reinforcement Learning

Learning a new skill can sometimes be daunting - maybe you read books, take a class, watch videos of people who already know - maybe you prefer to simply dive in and make a mess of things as you go, learning from the process.  Similarly, there are a number of ways in which a computer can be taught a new skill.  In the world of machine learning, reinforcement learning takes the approach of diving right in and trying things out to see how they go.  

Core to reinforcement learning is the simulation of an environment and how it responds to a computer agent taking different actions in that environment.  The agent is dropped into this environment, knowing nothing at all other than an algorithm that feeds it the state of the environment and gives it a numerical reward based on how the agent's previous actions have modified that environment, toward or away from a defined reward function.  

As a basic example, picture a robot ant, given a physics-based environment and a simple goal of getting close to a single coordinate point in that environment - the closer the ant gets to that point, the greater the reward it receives. The reinforcement learning routine gives the ant an input state from the environment, known as the observables, and the ant chooses what action to take next - this is then fed back into the machinery for the next time step, where the ant receives a reward for how well it progressed toward its ultimate goal.

The ant doesn't know gravity and doesn't understand how its limbs work, but it does have a limited set of actions that it can take.  Through a long series of policy-guided and random actions, trial and error, the ant runs through an entire scenario, or episode, to the point the environment declares the process done.  Most likely, at the start of the learning process, that episode ends early because the ant fell over flailing.  The ant is then reset at the start of the scenario, with a new, fresh environment, and asked to fulfill the task set out by the reward function that it's been given, with one alteration: it now has knowledge of what happened in the past.  The mapping from environmental state to optimal action is slowly built through experience, with some random noise applied to the action choice to ensure the ant doesn't sit around twiddling its thumbs, thinking that's the best it can do.  Given many hundreds of episodes and millions of simulated time steps, the agent slowly builds a policy that chooses actions that maximize the reward it expects to receive for the entire episode, given past experience. And, eventually, hopefully, out comes an ant that can run.

[running ant animation]
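
The loop described above can be sketched in a few lines of plain Python. This is a toy stand-in rather than the PyBullet ant: `LineWorld`, its progress-based reward, and `random_policy` are all hypothetical names invented for illustration, and the "agent" here never learns anything. The point is only to show the reset / act / step / reward cycle that every episode follows.

```python
import random


class LineWorld:
    """Hypothetical 1-D toy environment: an agent on a line tries to reach `target`."""

    def __init__(self, target=5.0, max_steps=50):
        self.target = target
        self.max_steps = max_steps

    def reset(self):
        self.position = 0.0
        self.steps = 0
        return self.position  # the observation is just the position

    def step(self, action):
        old_distance = abs(self.target - self.position)
        self.position += action
        self.steps += 1
        new_distance = abs(self.target - self.position)
        reward = old_distance - new_distance  # positive when we got closer
        done = new_distance < 0.5 or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


def random_policy(state):
    # An untrained agent: ignores the state and acts at random.
    return random.uniform(-1.0, 1.0)


random.seed(0)
print(run_episode(LineWorld(), random_policy))       # random flailing
print(run_episode(LineWorld(), lambda state: 1.0))   # always step toward the target
```

A policy that always steps toward the target collects the full reward of 5.0 in five steps, while the random policy usually wanders until the step limit ends the episode. Training is the process of moving from the second kind of behavior to the first.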

Clearly, the whole process has a lot more going on than the broad-strokes introduction above can cover.  To learn more about what reinforcement learning is and some of the fine details, we encourage the reader to check out the excellent introductory resource at https://spinningup.openai.com/en/latest/

What we are, instead, concerned with here is a more practical look at how to take an algorithm, plus an environment, and adjust them to your own specific needs and goals.  Specifically, we'll be taking the algorithm 'Twin Delayed Deep Deterministic Policy Gradients', or [TD3](https://github.com/sfujim/TD3), and using it to train a simulated robotic ant that is defined in PyBullet3's pre-built [environments](https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/gym/pybullet_envs).  You can read more about TD3 [here](https://spinningup.openai.com/en/latest/algorithms/td3.html), and we'll be working from a direct fork of the original author's git repo to skip questions about implementation.

## Teaching Ourselves To Run

Should you wish to follow along with our discussion, you can clone our provided repo [here](https://github.com/dillonroach/TD3). Assuming that you already have conda installed, the first step is to create a conda environment from the included `environment.yml`:

`conda env create -f environment.yml`

After this, activate the environment, launch Jupyter Lab, and check out [TD3notebook.ipynb](https://github.com/dillonroach/TD3/blob/master/TD3notebook.ipynb). This is a direct translation of the author-provided `main.py`: all we've done is stash the configuration variables in a dictionary named `args` and move all the code that would normally be executed into a function called `main()`, so it can be called simply from the notebook.

Looking more closely at `main()` now, you should see that the function is mostly concerned with setting values for variables to be used once you get to the `for` loop, which is where the meat of the work happens. In the following section, note that for the first `start_timesteps` time steps the action is simply sampled at random from the possible choices; this helps fill the replay buffer and gives a baseline before actual policy choices are made:

```     
        if t < args['start_timesteps']:
            action = env.action_space.sample()
        else:
            action = (
                policy.select_action(np.array(state))
                + np.random.normal(0, max_action * args['expl_noise'], size=action_dim)
            ).clip(-max_action, max_action)
```
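
To make the exploration-noise step concrete, here is a minimal standalone sketch of the same idea: add zero-mean Gaussian noise scaled by `max_action * expl_noise` to each action component, then clip back into the valid range. The helper name `noisy_action` is ours, and it uses plain Python lists and the standard `random` module rather than NumPy so that it runs without any dependencies; it is an illustration, not the repo's code.

```python
import random


def noisy_action(policy_action, max_action, expl_noise, rng=random):
    """Add Gaussian exploration noise to each action component, then clip."""
    noisy = []
    for a in policy_action:
        a = a + rng.gauss(0.0, max_action * expl_noise)
        # Keep the result inside the range the environment accepts.
        noisy.append(max(-max_action, min(max_action, a)))
    return noisy


rng = random.Random(0)  # seeded so the example is repeatable
print(noisy_action([0.9, -0.9, 0.0], max_action=1.0, expl_noise=0.1, rng=rng))
```

Setting `expl_noise` to zero returns the policy's action unchanged, while larger values push the agent to try actions further from what the policy currently believes is best.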

The bulk of the actual training happens in only a few lines; the below section takes the selected action from above, applies it to the environment and returns the new environment state, with reward and a done flag.

```
        next_state, reward, done, _ = env.step(action) 
        done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

        # Store data in replay buffer
        replay_buffer.add(state, action, next_state, reward, done_bool)

        state = next_state
        episode_reward += reward
        
        # Train agent after collecting sufficient data
        if t >= args['start_timesteps']:
            policy.train(replay_buffer, args['batch_size'])
```
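
Two details in the excerpt above are easy to miss, so here is a small sketch of each. `TinyReplayBuffer` is a hypothetical, simplified stand-in for the repo's replay buffer (we assume nothing about its internals beyond the `add` call shown above), and `done_flag` mirrors the `done_bool` line: when an episode ends only because it hit the step limit, the transition is stored as "not done", since the ant did not actually fail or finish.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Hypothetical minimal replay buffer, for illustration only."""

    def __init__(self, max_size=1000):
        # A bounded deque silently drops the oldest transition when full.
        self.storage = deque(maxlen=max_size)

    def add(self, state, action, next_state, reward, done):
        self.storage.append((state, action, next_state, reward, done))

    def sample(self, batch_size):
        # Uniform random minibatch of past transitions.
        return random.sample(list(self.storage), batch_size)


def done_flag(done, episode_timesteps, max_episode_steps):
    """Same logic as the done_bool line: a time-limit cutoff is stored as 0."""
    return float(done) if episode_timesteps < max_episode_steps else 0.0


print(done_flag(True, 10, 1000))    # the ant fell over early
print(done_flag(True, 1000, 1000))  # the episode simply ran out of time
```

Sampling old transitions at random, rather than training only on the most recent step, is what lets the agent keep learning from experience gathered many episodes ago.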

Once the environment reaches the described `done` state, an environment reset is called and it starts the process all over again.  Every `eval_freq` number of time steps the machinery evaluates the policy against a number of episodes outside the training process, and saves the current policy for good measure.  That's it.  Nice having all the complicated heavy lifting already coded for us.

If you run the notebook as is, it will train for two million time steps with all the standard hyperparameters the authors set up, and out will pop a policy that allows the robot ant to sprint like the animation above.

Great! Now let's have our newly-trained ant run to a few different points around the area and see how it behaves.

[animation of ant falling over, not going anywhere]

Oh. As it turns out, the way the ant is set up, it learns to run off in the positive-x direction... and ONLY that.  It was never given the opportunity to learn from any other experience, and so it didn't.  The resulting policy is very effective at that one specific thing, running off to the right, but completely fragile to any other request or to changes in the environment it runs in.

## Modifying The Environment

So, let's look at how we might go about modifying our ant and its environment to make the agent a bit more flexible.
Of course, you can define the entirety of your environment, its actors, the steps and rewards, etc., by yourself from scratch, but that's a lot of work if you don't need to.  Instead, we rely on Python subclassing to override the particular things about the ant and its environment that we want to alter, while leaving everything else intact.  If you're new to the concept of a subclass, it's rather straightforward: define `class MyNewClass(AnOldClass):` in your code and `MyNewClass` will carry forward all the functions that `AnOldClass` had, except where you override them with a function of the same name. In our repository, if you now look at [override_ant.py](https://github.com/dillonroach/TD3/blob/master/override_ant.py), we take all the pieces from [the original ant](https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/gym/pybullet_envs) and give them new names, while also exposing the key functions in each piece that we might want to modify.  The subclassing here gets messy, so here's a quick rundown of what inherits from what:

```
    class MyAntBulletEnv(WalkerBaseBulletEnv): # <- this is our entry point to the new environment
          self.robot = MyAnt()
                         \\\
                      class MyAnt(MyWalkerBase): # MyAntBulletEnv sets the robot as MyAnt
                                   \\\
                                   class MyWalkerBase(WalkerBase): #MyAnt subclasses MyWalkerBase, itself from WalkerBase
                                   
                                       def init():
                                           ...
                                       def step():
                                           ...
                      
```
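
For readers new to subclassing, the override mechanism used throughout this section can be seen in a toy example. `OldWalker` and `MyWalker` are hypothetical classes written for this illustration; they are not the PyBullet classes, though the attribute names echo the walk target discussed in this post.

```python
class OldWalker:
    """Hypothetical stand-in for a library class we do not want to rewrite."""

    def __init__(self):
        self.walk_target_x = 1000.0
        self.walk_target_y = 0.0

    def describe(self):
        return f"walking toward ({self.walk_target_x}, {self.walk_target_y})"


class MyWalker(OldWalker):
    """Overrides only __init__; describe() is inherited unchanged."""

    def __init__(self):
        super().__init__()          # keep everything the parent sets up
        self.walk_target_x = 20.0   # then change just what we need
        self.walk_target_y = 20.0


print(OldWalker().describe())
print(MyWalker().describe())
```

Calling `super().__init__()` first is the design choice that keeps the subclass small: the parent does all of its usual setup, and the subclass only touches the values it cares about.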

If you take a look at `MyWalkerBase()` and compare to `WalkerBase()` from [the original](https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/robot_locomotors.py) we've copied over `init()` and `step()` as they were written, then tweaked small things.  In this first go, we've only used all this to set the robot target x and y to a new location: 20m, 20m.

In order to now use this new, modified environment, we simply add some registration boilerplate to our notebook:

```
import gym
from gym.envs.registration import registry


def register(id, *args, **kwargs):
    # Skip registration if this id already exists (e.g. when a notebook cell is re-run)
    if id in registry.env_specs:
        return
    return gym.envs.registration.register(id, *args, **kwargs)


# Point gym at our subclassed environment in override_ant.py
register(id='MyAntBulletEnv-v0',
         entry_point='override_ant:MyAntBulletEnv',
         max_episode_steps=1000,
         reward_threshold=2500.0)
```
And then set `"env" : "MyAntBulletEnv-v0",` instead of the original, in the arguments dictionary.  You can see an example of this change in [TD3notebook-MyAnt.ipynb](https://github.com/dillonroach/TD3/blob/master/TD3notebook-MyAnt.ipynb).  If you run this notebook, you'll end up with an ant that runs to (20,20) instead of off to the right for a kilometer.

[animation of ant running to (20,20)]

To be perfectly clear, this was completely unnecessary for this little change; in our original `main()` we could simply have called `env.robot.walk_target_x = 20` and `env.robot.walk_target_y = 20` after those pieces were defined, and we'd have ended with the same result.  The point here was to show you how we set things up for when we want to do much more.  And we'll need to do more, of course, because this new ant walks very well to point (20,20), but will fail to get to any other location you ask of it.

This is effectively reinforcement learning life-lesson number one: if you give your agent a narrowly defined task and train it to complete that specific task, it will become good at that task, but make any slight modification to what you want it to do... and things go south quickly.  You would have to re-train it for every single new task you wanted.  While we're at it, it's also worth mentioning: it's easy to think of the reward function as the place where we define tasks for our agent, but there's a good reason it's called a reward function and not something else - the agent is coaxed along by that reward, but how it eventually goes about maximizing that reward may surprise you.  It's important to keep your reward general enough that it doesn't try to specifically pre-determine what you want the agent to do.

## Generalizing

So how do we train the agent so that it's not as fragile to changing requirements?  As with most things ML, unless you've got a fun trick up your sleeve: more data.  If, rather than training our ant to go to a specific (x,y), we give it a random target location with each episode, it will slowly build knowledge of more than simply one set of motions.  By modifying our `main()` to include the code below after the `done` condition, the agent will learn 'walking' more generally.

```
        if done: 
        
            #samples x,y from a circle of r=sqrt(20**2+20**2)
            r = np.linalg.norm([20,20])
            rand_deg = np.random.randint(0,360) # degrees here for reader clarity, rather than directly in 2pi

            rand_x = r*np.cos(np.pi/180 * rand_deg)
            rand_y = r*np.sin(np.pi/180 * rand_deg)
            
            env.robot.walk_target_x = rand_x
            env.robot.walk_target_y = rand_y
            state, done = env.reset(), False
 
```
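
The sampling step can also be pulled out into a small helper that is easy to test on its own. This is a sketch under our own assumptions: the function name `sample_target_on_circle` is hypothetical, and unlike the excerpt above it draws a continuous angle in radians instead of an integer number of degrees.

```python
import math
import random


def sample_target_on_circle(radius, rng=random):
    """Pick a uniformly random direction and return the (x, y) point at `radius`."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return radius * math.cos(angle), radius * math.sin(angle)


rng = random.Random(0)          # seeded so the example is repeatable
radius = math.hypot(20, 20)     # same distance from the origin as the point (20, 20)
x, y = sample_target_on_circle(radius, rng)
print(x, y, math.hypot(x, y))   # the last value is always the radius
```

Every sampled target is the same distance from the starting point, so only the direction of travel changes from episode to episode.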

As it so happens, the robot ant tends to prefer learning to walk in one particular orientation - in our experience, the change above will, on its own, produce a policy that handles all positive-x locations relatively well, but falters when asked to go to any point behind the ant.  Given a very long training time and some creative changes to the action noise, this could be mitigated, but in this case there's a more straightforward solution.

If you look at what the robot ant knows about its environment as set out in [`calc_state()`](https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/robot_locomotors.py#L35) there's a good amount of orientation-related information that doesn't completely make it into the observation space that's returned.  Thankfully, with all the work we did setting up our own ant environment we know how to get some of this information back into the reward function.  Namely, we define an `angle_to_target` in the robot's init step and set it to `None`.  Then, in the `step()` function we add

```
    if self.angle_to_target is None:
        self.angle_to_target = np.arctan2(self.walk_target_y, self.walk_target_x)
    
    angle_old = self.angle_to_target
    self.angle_to_target = self.walk_target_theta - self.body_rpy[2]
    angle_progress = float( (np.cos(self.angle_to_target) - np.cos(angle_old))/self.scene.dt )

```

Be sure to include `angle_progress` in the `self.rewards` sum; the agent is then encouraged to align itself with the path to its target, which makes the points behind it easier to learn.  Note the use of cosine in the `angle_progress` calculation - similar to a dot product, the cosine is largest when the ant's heading is aligned with the vector to the target, so `angle_progress` is positive whenever the ant turns toward the target.  The `scene.dt` division normalizes the reward to the simulated time step, so that if a frame is dropped or skipped the rewards don't get modified.
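
A few numbers make the behavior of this term easier to see. The standalone function below reproduces only the arithmetic of the `angle_progress` line; the time step `dt = 0.01` is an assumed value chosen for the example, not the simulator's actual setting.

```python
import math


def angle_progress(angle_old, angle_new, dt):
    """Change in cos(angle to target) per unit of simulated time."""
    return (math.cos(angle_new) - math.cos(angle_old)) / dt


dt = 0.01  # assumed time step, for illustration only

# Turning from 90 degrees off-target to 45 degrees off-target: rewarded.
print(angle_progress(math.pi / 2, math.pi / 4, dt))

# Turning away from the target: penalized.
print(angle_progress(math.pi / 4, math.pi / 2, dt))

# Holding the same heading: no contribution.
print(angle_progress(0.3, 0.3, dt))
```

Because cosine is an even function, turning toward the target from the left or from the right earns the same reward, so the term does not bias the ant toward turning in one direction.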

You can find these changes implemented in our [TD3notebook-random_points.ipynb](https://github.com/dillonroach/TD3/blob/master/TD3notebook-random_points.ipynb) and the associated [override_ant_random_points.py](https://github.com/dillonroach/TD3/blob/master/override_ant_random_points.py).  You can train an agent directly from these, or use the trained model we've included in the `/models/random_circle_final/` folder - copy those files into the `/models/` folder and run [TD3_view_random_points.ipynb]() to view the output and save animations.

As a bonus, since we've trained the ant to walk to any point from its starting point, and since the observation space for the ant is all relative to itself, rather than global, we can now feed the policy a string of random coordinates and the ant can walk a dynamic path through them all.  Our ant's all grown up.
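
One way such a path could be driven is to switch the walk target whenever the ant gets close enough to the current waypoint. The helper below is a hypothetical sketch of that bookkeeping only; the name `advance_waypoint`, the `tolerance` value, and the idea of switching on distance are our assumptions, and the viewing notebook may do this differently.

```python
import math


def advance_waypoint(position, waypoints, index, tolerance=1.0):
    """Return the index of the waypoint the agent should currently head for."""
    x, y = position
    tx, ty = waypoints[index]
    reached = math.hypot(tx - x, ty - y) <= tolerance
    # Move on to the next waypoint once close enough, but stay on the last one.
    if reached and index < len(waypoints) - 1:
        return index + 1
    return index


waypoints = [(5.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
print(advance_waypoint((0.0, 0.0), waypoints, 0))  # still far from the first waypoint
print(advance_waypoint((4.5, 0.2), waypoints, 0))  # close enough, move to the next one
```

In a viewing loop, the returned index would pick the coordinates to write into `env.robot.walk_target_x` and `env.robot.walk_target_y`, the same attributes used earlier to set the target.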

[animation of ant walking long path]