Update Real world transfer doc
erdnaxe committed Aug 14, 2020
1 parent 18fa058 commit 430e24b
Showing 2 changed files with 13 additions and 32 deletions.
14 changes: 7 additions & 7 deletions docs/training_one_leg.md
@@ -128,7 +128,7 @@ The conclusion of this experiment is that we can keep `nminibatch=1`, i.e. not sp

## Tweaking the number of simulation episodes

A simulation episode contains 32 steps, so the number of steps `n_steps` generated by one environment corresponds to $32 \times n_{episodes}$.
A simulation episode contains 32 steps, so the number of steps `n_steps` generated by one environment corresponds to $32 \\times n\_{episodes}$.

This experiment changes the number of simulation episodes run in each environment (32 environments in our case). More simulation episodes use more CPU time and produce a larger batch of data.
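As a quick sanity check of this relation (a minimal sketch, not code from the repository; the episode count is hypothetical):

```python
steps_per_episode = 32   # fixed episode length in the simulation
n_episodes = 4           # hypothetical number of episodes collected per environment

# number of steps generated by one environment before each policy update
n_steps = steps_per_episode * n_episodes
print(n_steps)           # 128
```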

@@ -214,14 +214,14 @@ A good compromise seems to be `noptepochs=30`.

# Tweaking the discount factor

**The discount factor (`gamma`, $\gamma$) is the coefficient used when summing all step rewards to get the episode return.**
**The discount factor (`gamma`, $\\gamma$) is the coefficient used when summing all step rewards to get the episode return.**

$$
return = \sum_n \gamma^n~reward_n
return = \\sum_n \\gamma^n~reward_n
$$

$\gamma=1$ means that the learning does not differentiate between achieving good rewards early or late in the simulation episode.
We often see a value of $\gamma$ between 0.9 and 0.999 in the literature to promote getting good rewards early.
$\\gamma=1$ means that the learning does not differentiate between achieving good rewards early or late in the simulation episode.
We often see a value of $\\gamma$ between 0.9 and 0.999 in the literature to promote getting good rewards early.
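As a minimal illustration of how $\gamma$ weights the step rewards (a sketch, not the project's code):

```python
def discounted_return(rewards, gamma):
    # return = sum_n gamma**n * reward_n
    return sum(gamma ** n * r for n, r in enumerate(rewards))

rewards = [1.0] * 32                    # one 32-step episode with constant reward
print(discounted_return(rewards, 1.0))  # 32.0: no preference for early rewards
print(discounted_return(rewards, 0.9))  # ~9.66: early rewards dominate the return
```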

<details>
<summary>
@@ -266,7 +266,7 @@ We observe that a discount factor between 0.8 and 0.9 gives the best results. If the
# Tweaking the clip range

The PPO algorithm avoids overly large policy updates by clipping them. It uses a ratio that measures the difference between the new and old policies and clips this ratio.
With a clip range of $\varepsilon=0.2$, the ratio is clipped to stay between $1-\varepsilon = 0.8$ and $1+\varepsilon = 1.2$.
With a clip range of $\\varepsilon=0.2$, the ratio is clipped to stay between $1-\\varepsilon = 0.8$ and $1+\\varepsilon = 1.2$.
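A minimal sketch of this clipping, assuming the standard PPO clipped surrogate objective (not code from this repository):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Pessimistic (min) PPO objective for a single sample."""
    clipped_ratio = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

# A ratio of 1.5 is effectively limited to 1.2 when the advantage is positive
print(clipped_surrogate(1.5, advantage=1.0))  # 1.2
```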

<details>
<summary>
@@ -351,7 +351,7 @@ A learning rate of 0.01 gives the best sample efficiency. A smaller rate makes t

# Tweaking the GAE smoothing factor

**The `lam` hyperparameter represents the smoothing factor $\lambda$ in the Generalized Advantage Estimation.**
**The `lam` hyperparameter represents the smoothing factor $\\lambda$ in the Generalized Advantage Estimation.**
Generalized Advantage Estimation is a method for estimating the advantage used by PPO.
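A minimal sketch of GAE for a single episode (simplified: no terminal-state handling, and the function name is mine):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.95, lam=0.9):
    # `values` holds the critic estimates, with one extra entry for the
    # bootstrap value of the state reached after the last step.
    advantages = np.zeros(len(rewards))
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages
```

With `lam=0` this reduces to a one-step temporal-difference estimate; with `lam=1` it becomes a Monte-Carlo-style estimate.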

<details>
31 changes: 6 additions & 25 deletions docs/transfer_real_world.md
@@ -6,30 +6,11 @@

# Preparing transfer to real world

## Using bigger timesteps
The learning experiments in the previous section were run with an environment time step of 50 ms,
which is the minimum period at which we can control and observe all servomotors.

At each time step, we need to gather observations from all servomotors.
This cannot be done in parallel, as the robot uses a single serial connection.
**This limits the environment time step to 50 ms.**
As explained in [observation vector comparisons](training_one_leg.md#observation-vector-comparison),
**everything observed in simulation is also observable in the real world.**
There is no need to implement the forward kinematics to compute the position of the fingertip.

This speed limit also means that we will not attempt torque control of the servomotors, as it would be too slow.

To train the policy with a time step of 50 ms,
we subdivide it by 12 to keep the physics simulation time step around 4 ms.
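A sketch of what this sub-stepping could look like with PyBullet (the project's environment code may differ; names here are mine):

```python
import pybullet as p

CONTROL_DT = 0.05   # 50 ms environment time step
SUBSTEPS = 12       # 0.05 / 12 ≈ 4.2 ms physics time step

p.connect(p.DIRECT)
p.setTimeStep(CONTROL_DT / SUBSTEPS)

def environment_step():
    # One environment step advances the physics engine 12 times.
    for _ in range(SUBSTEPS):
        p.stepSimulation()
```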

Compared to the previous section, we changed the following training hyperparameters to account for the new time step:

```python
timestep_limit = 32    # fewer steps are required
gamma = 0.95           # discount factor, lowered to reduce vibrations
n_steps = 128          # batch size reduced, fewer steps are required
learning_rate = 10e-4  # better performance
```
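For reference, a sketch of how these values could be passed to stable-baselines' `PPO2` (the environment below is a stand-in, not the project's hexapod environment, and the actual training script may differ):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Stand-in environment; the project uses its own leg/hexapod environment.
env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])

model = PPO2(
    MlpPolicy,
    env,
    gamma=0.95,           # discount factor, lowered to reduce vibrations
    n_steps=128,          # reduced batch size
    learning_rate=10e-4,  # i.e. 1e-3
    verbose=1,
)
model.learn(total_timesteps=10000)
```

Note that `timestep_limit` is an episode-length setting handled on the environment side rather than a `PPO2` argument.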

![Training results](img/transfer_real_world_new_timestep.png)

<video style="max-width:100%;height:auto" preload="metadata" controls="">
<source src="https://perso.crans.org/erdnaxe/videos/projet_hexapod/transfer_real_world_simulation.mp4" type="video/mp4">
</video><br/>

**To reduce vibrations, the discount factor was reduced to 0.95.**
Each environment got its "real world" equivalent, which does not use PyBullet and will **not be able to compute a reward**.
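As a rough illustration (the class and the serial-bus interface below are hypothetical, not the repository's actual code), such a real-world counterpart could look like:

```python
import gym
import numpy as np

class RealWorldLegEnv(gym.Env):
    """Hypothetical real-world counterpart of the simulated environment."""

    def __init__(self, bus):
        # `bus` wraps the single serial connection to the servomotors (assumed interface).
        self.bus = bus
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)

    def reset(self):
        return self._observe()

    def step(self, action):
        self.bus.send_targets(action)   # hypothetical helper: write target positions
        obs = self._observe()
        # No PyBullet here, so the reward cannot be computed on the real robot.
        return obs, 0.0, False, {}

    def _observe(self):
        # Hypothetical helper: read back positions and speeds over the serial bus.
        return np.asarray(self.bus.read_state(), dtype=np.float32)
```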
