Update Real world transfer doc
erdnaxe committed Aug 14, 2020
1 parent 18fa058 commit 430e24b
Showing 2 changed files with 13 additions and 32 deletions.
14 changes: 7 additions & 7 deletions docs/training_one_leg.md
@@ -128,7 +128,7 @@ The conclusion of this experiment is that we can keep `nminibatch=1`, i.e. not sp

## Tweaking the number of simulation episodes

A simulation episode contains 32 steps, so the number of steps `n_steps` generated by one environment corresponds to $32 \times n_{episodes}$.
A simulation episode contains 32 steps, so the number of steps `n_steps` generated by one environment corresponds to $32 \\times n\_{episodes}$.

This experiment changes the number of simulation episodes run in each environment (32 environments in our case). More simulation episodes use more CPU time and produce a larger batch of data.
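As a quick sanity check of this relation (a minimal sketch, not code from the repository; the episode count is hypothetical):

```python
steps_per_episode = 32   # fixed episode length in the simulation
n_episodes = 4           # hypothetical number of episodes collected per environment

# number of steps generated by one environment before each policy update
n_steps = steps_per_episode * n_episodes
print(n_steps)           # 128
```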

@@ -214,14 +214,14 @@ A good compromise seems to be `noptepochs=30`.

# Tweaking the discount factor

**The discount factor (`gamma`, $\gamma$) is the coefficient used when summing all step rewards to get the episode return.**
**The discount factor (`gamma`, $\\gamma$) is the coefficient used when summing all step rewards to get the episode return.**

$$
return = \sum_n \gamma^n~reward_n
return = \\sum_n \\gamma^n~reward_n
$$

$\gamma=1$ means that the learning does not differentiate between achieving good rewards early or late in the simulation episode.
We often see a value of $\gamma$ between 0.9 and 0.999 in the literature to promote getting good rewards early.
$\\gamma=1$ means that the learning does not differentiate between achieving good rewards early or late in the simulation episode.
We often see a value of $\\gamma$ between 0.9 and 0.999 in the literature to promote getting good rewards early.
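As a minimal illustration of how $\gamma$ weights the step rewards (a sketch, not the project's code):

```python
def discounted_return(rewards, gamma):
    # return = sum_n gamma**n * reward_n
    return sum(gamma ** n * r for n, r in enumerate(rewards))

rewards = [1.0] * 32                    # one 32-step episode with constant reward
print(discounted_return(rewards, 1.0))  # 32.0: no preference for early rewards
print(discounted_return(rewards, 0.9))  # ~9.66: early rewards dominate the return
```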

<details>
<summary>
@@ -266,7 +266,7 @@ We observe that a discount factor between 0.8 and 0.9 gives the best results. If the
# Tweaking the clip range

The PPO algorithm avoids overly large policy updates by clipping them. It uses a ratio that measures the difference between the new and old policies and clips this ratio.
With a clip range of $\varepsilon=0.2$, the ratio is clipped to stay between $1-\varepsilon = 0.8$ and $1+\varepsilon = 1.2$.
With a clip range of $\\varepsilon=0.2$, the ratio is clipped to stay between $1-\\varepsilon = 0.8$ and $1+\\varepsilon = 1.2$.
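A minimal sketch of this clipping, assuming the standard PPO clipped surrogate objective (not code from this repository):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Pessimistic (min) PPO objective for a single sample."""
    clipped_ratio = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

# A ratio of 1.5 is effectively limited to 1.2 when the advantage is positive
print(clipped_surrogate(1.5, advantage=1.0))  # 1.2
```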

<details>
<summary>
@@ -351,7 +351,7 @@ A learning rate of 0.01 gives the best sample efficiency. A smaller rate makes t

# Tweaking the GAE smoothing factor

**The `lam` hyperparameter represents the smoothing factor $\lambda$ in the Generalized Advantage Estimation.**
**The `lam` hyperparameter represents the smoothing factor $\\lambda$ in the Generalized Advantage Estimation.**
Generalized Advantage Estimation is a method for estimating the advantage used by PPO.
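A minimal sketch of GAE for a single episode (simplified: no terminal-state handling, and the function name is mine):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.95, lam=0.9):
    # `values` holds the critic estimates, with one extra entry for the
    # bootstrap value of the state reached after the last step.
    advantages = np.zeros(len(rewards))
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages
```

With `lam=0` this reduces to a one-step temporal-difference estimate; with `lam=1` it becomes a Monte-Carlo-style estimate.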

<details>
31 changes: 6 additions & 25 deletions docs/transfer_real_world.md
@@ -6,30 +6,11 @@

# Preparing transfer to real world

## Using bigger timesteps
The learning experiments in the previous section were run with an environment time step of 50 ms,
which is the minimum period at which we can control and observe all servomotors.

At each time step, we need to gather observations from all servomotors.
This cannot be done in parallel, as the robot uses a single serial connection.
**This limits the environment time step to 50 ms.**
As explained in [observation vector comparisons](training_one_leg.md#observation-vector-comparison),
**everything observed in simulation is also observable in the real world.**
There is no need to implement the forward kinematics to compute the position of the fingertip.

This speed limit also means that we will not attempt torque control of the servomotors, as it would be too slow.

To train the policy with a time step of 50 ms,
we subdivide it by 12 to keep the physics simulation time step around 4 ms.
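A sketch of what this sub-stepping could look like with PyBullet (the project's environment code may differ; names here are mine):

```python
import pybullet as p

CONTROL_DT = 0.05   # 50 ms environment time step
SUBSTEPS = 12       # 0.05 / 12 ≈ 4.2 ms physics time step

p.connect(p.DIRECT)
p.setTimeStep(CONTROL_DT / SUBSTEPS)

def environment_step():
    # One environment step advances the physics engine 12 times.
    for _ in range(SUBSTEPS):
        p.stepSimulation()
```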

Compared to the previous section, we changed the following training hyperparameters to account for the new time step:

```python
timestep_limit = 32    # fewer steps are required
gamma = 0.95           # discount factor, lowered to reduce vibrations
n_steps = 128          # batch size reduced, fewer steps are required
learning_rate = 10e-4  # better performance
```
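For reference, a sketch of how these values could be passed to stable-baselines' `PPO2` (the environment below is a stand-in, not the project's hexapod environment, and the actual training script may differ):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Stand-in environment; the project uses its own leg/hexapod environment.
env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])

model = PPO2(
    MlpPolicy,
    env,
    gamma=0.95,           # discount factor, lowered to reduce vibrations
    n_steps=128,          # reduced batch size
    learning_rate=10e-4,  # i.e. 1e-3
    verbose=1,
)
model.learn(total_timesteps=10000)
```

Note that `timestep_limit` is an episode-length setting handled on the environment side rather than a `PPO2` argument.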

![Training results](img/transfer_real_world_new_timestep.png)

<video style="max-width:100%;height:auto" preload="metadata" controls="">
<source src="https://perso.crans.org/erdnaxe/videos/projet_hexapod/transfer_real_world_simulation.mp4" type="video/mp4">
</video><br/>

**To reduce vibrations, the discount factor was reduced to 0.95.**
Each environment got its "real world" equivalent, which does not use PyBullet and will **not be able to compute a reward**.
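As a rough illustration (the class and the serial-bus interface below are hypothetical, not the repository's actual code), such a real-world counterpart could look like:

```python
import gym
import numpy as np

class RealWorldLegEnv(gym.Env):
    """Hypothetical real-world counterpart of the simulated environment."""

    def __init__(self, bus):
        # `bus` wraps the single serial connection to the servomotors (assumed interface).
        self.bus = bus
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)

    def reset(self):
        return self._observe()

    def step(self, action):
        self.bus.send_targets(action)   # hypothetical helper: write target positions
        obs = self._observe()
        # No PyBullet here, so the reward cannot be computed on the real robot.
        return obs, 0.0, False, {}

    def _observe(self):
        # Hypothetical helper: read back positions and speeds over the serial bus.
        return np.asarray(self.bus.read_state(), dtype=np.float32)
```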
