This is a JavaScript reinforcement learning framework.
In progress..
Let's look at a state s0 that contains 3 actions, all of which point back to the state s0. The first action a0 receives a reward of 1. The second action a1 receives a penalty of 1 (i.e. a reward of -1). The third action a2 remains neutral and results in neither reward nor punishment. Each action has only one possible state change, so the transition probability is 1.0 (100%). The whole setup does not do anything useful, but it shows the procedure in detail:
As one can see, a0 is the best option and leads to the maximum reward. a1 leads to punishment and is the most unfavorable choice, while a2 is the neutral variant without any reward. Let's calculate that:
The written-out version:
```javascript
var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* create a0, a1 and a2 */
var a0 = rl.addAction(s0);
var a1 = rl.addAction(s0);
var a2 = rl.addAction(s0);

/* add the action to state connections: addStateChange(action, targetState, probability, reward) */
rl.addStateChange(a0, s0, 1.0, 1);
rl.addStateChange(a1, s0, 1.0, -1);
rl.addStateChange(a2, s0, 1.0, 0);

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));
```
The short version:
```javascript
var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* s0.a0, s0.a1 and s0.a2: addAction(state, new StateChange(targetState, probability, reward)) */
rl.addAction(s0, new StateChange(s0, 1.0, 1));
rl.addAction(s0, new StateChange(s0, 1.0, -1));
rl.addAction(s0, new StateChange(s0, 1.0, 0));

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));
```
It returns:
```json
[
    [1, -1, 0]
]
```
As we suspected above, a0 is the winner with the maximum value of Q(s=0): Q(s=0,a=0) = 1. The returned Q structure is an array with one entry per state, each containing the Q-values of that state's actions. The discountFactor is set to 0 because we only want to consider one iteration step. The discount factor determines the importance of future rewards: a factor of 0 makes the agent "short-sighted" by considering only the current rewards, while a factor close to 1 makes it strive for a high long-term reward. Because it is set to 0 here, only the immediate reward matters, which yields exactly the result shown above.
The situation doesn't change if we look a little more far-sightedly and set the discount factor close to 1 (e.g. 0.9):
```javascript
var discountFactor = 0.9;
```
It returns:
```json
[
    [9.991404955442832, 7.991404955442832, 8.991404955442832]
]
```
Q(s=0,a=0) is still the winner with the maximum of Q(s=0) ≈ 10. The calculateQ function iterates the Markov formula shown above until the change in Q falls below a certain threshold; the default value of this threshold is 0.001.
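This result can also be checked by hand: in the fixed point, Q(s=0,a=0) = 1 + 0.9 · max_a Q(s=0,a), and since a0 is the greedy action, max_a Q(s=0,a) = 1 / (1 − 0.9) = 10; consequently Q(s=0,a=1) = -1 + 0.9 · 10 = 8 and Q(s=0,a=2) = 0 + 0.9 · 10 = 9. The following minimal sketch (plain JavaScript, independent of the framework and not its actual implementation) illustrates such a threshold-based iteration for this single-state example:

```javascript
/* A minimal value-iteration sketch for the single-state example above
   (not the framework's internal implementation, just an illustration). */
var gamma = 0.9;          /* discount factor */
var threshold = 0.001;    /* stop when Q changes less than this */
var rewards = [1, -1, 0]; /* rewards of a0, a1 and a2 (all lead back to s0) */

var Q = [0, 0, 0];
var delta;
do {
    delta = 0;
    var maxQ = Math.max.apply(null, Q);
    for (var a = 0; a < rewards.length; a++) {
        var updated = rewards[a] + gamma * maxQ;
        delta = Math.max(delta, Math.abs(updated - Q[a]));
        Q[a] = updated;
    }
} while (delta > threshold);

console.log(JSON.stringify(Q)); /* approaches [10, 8, 9] */
```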
2.2.1.2 Watch the demo
Discount rate 0.9:
Let's look at the following example:
If we look at this example in the short term (discountRate = 0), it is a good idea to permanently take a1 from s0 and stay in state s0: we always get a reward of 2 and never receive the punishment of -5. From a far-sighted point of view (discountRate = 0.9), it is better to take a0, because in the future we will receive a reward of 10 in addition to the punishment of -5 (a total of +5 per round trip instead of only +2). Let's calculate that:
```javascript
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0 and s1 */
var s0 = rl.addState();
var s1 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s1, 1.0, -5));
rl.addAction(s0, new StateChange(s0, 1.0, 2));

/* s1.a0 */
rl.addAction(s1, new StateChange(s0, 1.0, 10));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));
```
It returns:
```json
[
    [21.044799074176453, 20.93957918978874],
    [28.93957918978874]
]
```
As we expected, from a far-sighted point of view it is better to choose s0.a0, which has the maximum value Q(s=0,a=0) ≈ 21.04.
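These numbers can be sanity-checked against the Bellman optimality equation for this two-state setup (with discount factor γ = 0.9):

$$
\begin{aligned}
Q(s_1,a_0) &= 10 + \gamma \max_a Q(s_0,a)\\
Q(s_0,a_0) &= -5 + \gamma\, Q(s_1,a_0) = 4 + \gamma^2 \max_a Q(s_0,a)\\
Q(s_0,a_1) &= 2 + \gamma \max_a Q(s_0,a)
\end{aligned}
$$

Since a0 is the greedy action in s0, the maximum equals Q(s=0,a=0), which gives Q(s=0,a=0) = 4 / (1 − 0.81) ≈ 21.05, Q(s=0,a=1) ≈ 20.95 and Q(s=1,a=0) ≈ 28.95, matching the computed values up to the stopping threshold.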
2.2.2.2 Watch the demo
Discount rate 0.9:
| discountRate | type | s0 | s1 | s0 (winner) | s1 (winner) |
|---|---|---|---|---|---|
| 0.0 | short-sighted | [-5, 2] | [10] | a1 | a0 |
| 0.1 | short-sighted | [-3.98, 2.22] | [10.22] | a1 | a0 |
| 0.5 | half short-sighted | [1.00, 4.00] | [12.00] | a1 | a0 |
| 0.9 | far-sighted | [21.04, 20.94] | [28.94] | a0 | a0 |
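The rows of this table can be reproduced by sweeping the discount rate over the same two-state setup, for example with a loop like the following sketch (it assumes the same API as in the listings above):

```javascript
/* Sweep several discount rates over the two-state example above
   and print the resulting Q values for each of them. */
[0.0, 0.1, 0.5, 0.9].forEach(function (discountRate) {
    var rl = new ReinforcementLearning.mdp();
    var s0 = rl.addState();
    var s1 = rl.addState();
    /* s0.a0 and s0.a1 */
    rl.addAction(s0, new StateChange(s1, 1.0, -5));
    rl.addAction(s0, new StateChange(s0, 1.0, 2));
    /* s1.a0 */
    rl.addAction(s1, new StateChange(s0, 1.0, 10));
    console.log(discountRate, JSON.stringify(rl.calculateQ(discountRate)));
});
```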
Graphic:
Let's look at a somewhat more complex example:
From a short-sighted point of view, it is a good idea to permanently take a0 and stay in state s0. But what about the far-sighted view? Is courage rewarded in this case? Let's calculate that:
```javascript
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0, s1 and s2 */
var s0 = rl.addState();
var s1 = rl.addState();
var s2 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s0, 1.0, 1));
rl.addAction(s0, new StateChange(s0, 0.5, -2), new StateChange(s1, 0.5, 0));

/* s1.a0 and s1.a1 */
rl.addAction(s1, new StateChange(s1, 1.0, 0));
rl.addAction(s1, new StateChange(s2, 1.0, -50));

/* s2.a0 */
rl.addAction(s2, new StateChange(s0, 0.8, 100), new StateChange(s1, 0.1, 0), new StateChange(s2, 0.1, 0));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));
```
It returns:
```json
[
    [61.75477734479686, 67.50622243150205],
    [76.25766751820726, 84.73165595751362],
    [149.70275422340958]
]
```
Looking at the example far-sightedly (discountRate = 0.9), it is a good idea to take action a1 in state s0, action a1 in state s1 and action a0 in state s2.
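The best action per state can be read directly from the returned Q structure (an array of Q-value arrays, one per state), for example with a small argmax helper. The greedyPolicy function below is just an illustrative helper, not part of the framework:

```javascript
/* Derive the greedy policy (index of the best action per state)
   from the Q structure returned by calculateQ. */
function greedyPolicy(Q) {
    return Q.map(function (actionValues) {
        return actionValues.indexOf(Math.max.apply(null, actionValues));
    });
}

console.log(greedyPolicy(Q)); /* [1, 1, 0]: a1 in s0, a1 in s1, a0 in s2 */
```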
2.2.3.2 Watch the demo
Discount rate 0.9:
| discountRate | type | s0 | s1 | s2 | s0 (winner) | s1 (winner) | s2 (winner) |
|---|---|---|---|---|---|---|---|
| 0.0 | short-sighted | [1, -1] | [0, -50] | [80] | a0 | a0 | a0 |
| 0.1 | short-sighted | [1.11, -0.94] | [0, -41.91] | [80.90] | a0 | a0 | a0 |
| 0.5 | half short-sighted | [2, -0.5] | [0, -7.47] | [85.05] | a0 | a0 | a0 |
| 0.9 | far-sighted | [61.76, 67.51] | [76.27, 84.74] | [149.71] | a1 | a1 | a0 |
Graphic:
In progress..
2.2.4.2 Watch the demo
In progress..
In progress..
In Progress
In progress..
3.2.1.2 Watch the demo
In progress..
Imagine a person who is currently on the field at x=5, y=3. The goal is to find a safe way to the field at x=1, y=3. The red fields must be avoided: they are chasms that endanger the person (negative rewards, i.e. punishments). Which way should the person go?
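Before running the framework's Q-learning solver, here is what a single tabular Q-learning update looks like in general. This is only a generic sketch of the textbook update rule, not the framework's internal code; alpha (the learning rate) is a hypothetical parameter of this sketch:

```javascript
/* Generic tabular Q-learning update (illustrative sketch only):
   Q[s][a] += alpha * (reward + gamma * max over a' of Q[sNext][a'] - Q[s][a]) */
function qLearningUpdate(Q, s, a, reward, sNext, alpha, gamma) {
    var bestNext = Math.max.apply(null, Q[sNext]);
    Q[s][a] += alpha * (reward + gamma * bestNext - Q[s][a]);
}
```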
Let's calculate that:
```javascript
var discountRate = 0.95;

var width = 5;
var height = 3;

/* rewards: R[x][y] is the reward of the field at (x, y), zero-based
   (the goal field gets +100, the chasms get -10) */
var R = {
    0: {2: 100},
    1: {2: -10},
    2: {2: -10, 1: -10},
    3: {2: -10},
    4: {2: 0, 0: -10}
};

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);
```
It returns:
3.2.2.2 Watch the demo
In progress.
As in example 3.2.2, but with a bigger grid world:
```javascript
var width = 10;
var height = 5;

var R = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
```
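For completeness, here is a sketch of running this bigger grid world with exactly the same calls as in example 3.2.2 (assuming the same API as shown above):

```javascript
/* Same procedure as in example 3.2.2, only with the bigger grid world. */
var discountRate = 0.95;

var rlQLearning = new ReinforcementLearning.qLearning();
rlQLearning.buildGridWorld(width, height, R);

var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

rlQLearning.printTableGridWorld(Q, width, R);
```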
That's easy now:
Now imagine that the person is drunk. That means that with a certain probability the person steps to the right or to the left, although they wanted to go straight ahead. Depending on how drunk the person is, we choose a probability of 2.5% of going left and a probability of 2.5% of going right (splitT = 0.025). What is the safest way now? Preliminary consideration: moving away from the chasms first and staying away from them might now be better than taking the shortest route.
Let's calculate that:
```javascript
var discountRate = 0.95;

var width = 10;
var height = 5;

var R = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};

var splitT = 0.025;

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();
rlQLearning.adoptConfig({splitT: splitT});

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);
```
It returns:
3.2.3.2 Watch the demo
In progress.
- All flowcharts were created with Google Drive - thanks!
- Björn Hempel bjoern@hempel.li - Initial work - https://github.com/bjoern-hempel
This tutorial is licensed under the MIT License - see the LICENSE.md file for details
Have fun! :)