
A JavaScript reinforcement learning framework

This is a JavaScript reinforcement learning framework.

1. Introduction

In progress..

2. Markov decision process (MDP)

2.1 Theory

2.1.1 The Bellman equation

V^*(s) = \max_{a} \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \cdot V^*(s')] \quad \forall s

2.1.2 The value iteration algorithm

V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \cdot V_k(s')] \quad \forall s

2.1.3 The Q-value iteration algorithm

Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \cdot \max_{a'} Q_k(s',a')] \quad \forall (s,a)
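To make the Q-value iteration concrete, here is a minimal, framework-independent sketch in plain JavaScript. It is only an illustration, not part of the framework's API (the calculateQ method used in the examples below plays this role); the transition model is assumed to be given as transitions[s][a], an array of {state, probability, reward} objects.

/*
 * Minimal Q-value iteration sketch (illustration only, not the framework's API).
 * transitions[s][a] is an array of {state, probability, reward} objects,
 * i.e. all possible state changes of action a taken in state s.
 */
function qValueIteration(transitions, discountFactor, threshold) {
    /* start with Q_0(s, a) = 0 */
    var Q = transitions.map(function (actions) {
        return actions.map(function () { return 0; });
    });

    var maxChange;

    do {
        maxChange = 0;

        var Qnext = [];

        for (var s = 0; s < transitions.length; s++) {
            Qnext[s] = [];

            for (var a = 0; a < transitions[s].length; a++) {
                var q = 0;

                /* Q_{k+1}(s,a) = Σ_s' T(s,a,s') · [R(s,a,s') + γ · max_a' Q_k(s',a')] */
                for (var i = 0; i < transitions[s][a].length; i++) {
                    var sc = transitions[s][a][i];
                    q += sc.probability * (sc.reward + discountFactor * Math.max.apply(null, Q[sc.state]));
                }

                maxChange = Math.max(maxChange, Math.abs(q - Q[s][a]));
                Qnext[s][a] = q;
            }
        }

        Q = Qnext;
    } while (maxChange > threshold);

    return Q;
}

For the super basic example of section 2.2.1, transitions would be [[[{state: 0, probability: 1.0, reward: 1}], [{state: 0, probability: 1.0, reward: -1}], [{state: 0, probability: 1.0, reward: 0}]]], and qValueIteration(transitions, 0.9, 0.001) returns approximately [[10, 8, 9]].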

2.2 Usage

2.2.1 Super basic example

Let's look at a single state s0 with three actions that all point back to the state s0. The first action a0 receives a reward of 1. The second action a1 receives a penalty of 1 (a reward of -1). The third action a2 remains neutral and results in neither reward nor punishment. Each action has only one possible state change, so every transition probability is 1.0 (100%). The whole setup does not do anything useful, but it illustrates the procedure:

super basic example

As one can easily see, a0 is the best option and leads to the maximum reward. a1 only brings punishment and is the least favorable choice, while a2 is the neutral option without any reward or penalty. Let's calculate that:

2.2.1.1 Code

The written-out version:

var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* create a0, a1 and a2 */
var a0 = rl.addAction(s0);
var a1 = rl.addAction(s0);
var a2 = rl.addAction(s0);

/* add the action-to-state connections (state changes): addStateChange(action, targetState, probability, reward) */
rl.addStateChange(a0, s0, 1.0,  1);
rl.addStateChange(a1, s0, 1.0, -1);
rl.addStateChange(a2, s0, 1.0,  0);

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));

The short version:

var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* s0.a0, s0.a1 and s0.a2 */
rl.addAction(s0, new StateChange(s0, 1.0,  1));
rl.addAction(s0, new StateChange(s0, 1.0, -1));
rl.addAction(s0, new StateChange(s0, 1.0,  0));

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));

It returns:

[
    [1, -1, 0]
]

As we suspected above, a0 is the winner with the maximum value of Q(s=0): Q(s=0,a=0) = 1. The discountFactor is set to 0 because we only want to consider one iteration step. The discountFactor determines the importance of future rewards: a factor of 0 makes the agent "short-sighted" by considering only the immediate rewards, while a factor close to 1 makes it strive for a high long-term reward. Because it is set to 0 here, only the next step matters, which leads to the result shown above.

The situation doesn't change if we take a more far-sighted view and set the discount factor close to 1 (e.g. 0.9):

var discountFactor = 0.9;

It returns:

[
    [9.991404955442832, 7.991404955442832, 8.991404955442832]
]

Q(s=0,a=0) is still the winner with the maximum of Q(s=0) ≈ 10. The calculateQ function repeats the Q-value iteration above until the change in Q between two iterations falls below a certain threshold; the default value of this threshold is 0.001.
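The value of roughly 10 can also be derived by hand as a geometric series of discounted rewards: always choosing a0 yields a reward of 1 in every step, so

Q(s=0,a=0) = 1 + 0.9 \cdot 1 + 0.9^2 \cdot 1 + \ldots = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1 - 0.9} = 10

The values of a1 and a2 differ only in their immediate reward (-1 and 0 instead of 1), which explains the results of roughly 8 and 9.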

2.2.1.2 Watch the demo

Discount rate 0.9:

super basic example

2.2.2 Basic example

Let's look at the following example:

basic example

If we look at this example in the short term (discountRate = 0), it is a good idea to always take a1 from s0 and stay in state s0: we get a reward of 2 every time and avoid the punishment of -5. From a far-sighted point of view (discountRate = 0.9), it is better to take a0, because after the punishment of -5 we receive a reward of 10 in the next step (a sum of 5 per round trip instead of only 2). Let's calculate that:

2.2.2.1 Code
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0 and s1 */
var s0 = rl.addState();
var s1 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s1, 1.0, -5));
rl.addAction(s0, new StateChange(s0, 1.0,  2));

/* s1.a0 */
rl.addAction(s1, new StateChange(s0, 1.0, 10));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));

It returns:

[
    [21.044799074176453, 20.93957918978874],
    [28.93957918978874]
]

As we expected, from a far-sighted point of view it is better to choose s0.a0, which has the maximum value Q(s=0,a=0) ≈ 21.04.
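This value can also be checked by hand. The optimal far-sighted policy alternates between the two states (take s0.a0 for -5, then s1.a0 for 10, and so on), so Q(s=0,a=0) must satisfy

Q(s=0,a=0) = -5 + 0.9 \cdot (10 + 0.9 \cdot Q(s=0,a=0)) \quad \Rightarrow \quad Q(s=0,a=0) = \frac{-5 + 0.9 \cdot 10}{1 - 0.9^2} \approx 21.05

which matches the calculated value up to the iteration threshold.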

2.2.2.2 Watch the demo

Discount rate 0.9:

basic example

2.2.2.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s0 (winner) | s1 (winner) |
|---|---|---|---|---|---|
| 0.0 | short-sighted | [-5, 2] | [10] | a1 | a0 |
| 0.1 | short-sighted | [-3.98, 2.22] | [10.22] | a1 | a0 |
| 0.5 | half short-sighted | [1.00, 4.00] | [12.00] | a1 | a0 |
| 0.9 | far-sighted | [21.04, 20.94] | [28.94] | a0 | a0 |

Graphic:

basic example result

2.2.3 More complex example

Let's look at a somewhat more complex example:

more complex example

Short-sightedly, it is a good idea to always take a0 and stay in state s0. But what does the far-sighted view say? Is courage rewarded in this case? Let's calculate that:

2.2.3.1 Code
var discountRate =  0.9;

var rl = new ReinforcementLearning.mdp();

/* s0, s1 and s2 */
var s0 = rl.addState();
var s1 = rl.addState();
var s2 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s0, 1.0,   1));
rl.addAction(s0, new StateChange(s0, 0.5,  -2), new StateChange(s1, 0.5, 0));

/* s1.a0 and s1.a1 */
rl.addAction(s1, new StateChange(s1, 1.0,   0));
rl.addAction(s1, new StateChange(s2, 1.0, -50));

/* s2.a0 */
rl.addAction(s2, new StateChange(s0, 0.8, 100), new StateChange(s1, 0.1, 0), new StateChange(s2, 0.1, 0));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));

It returns:

[
    [61.75477734479686, 67.50622243150205],
    [76.25766751820726, 84.73165595751362],
    [149.70275422340958]
]

Looking at the example far-sightedly (discountRate = 0.9), it is a good idea to take action a1 in state s0, action a1 in state s1 and action a0 in state s2.
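The value of state s2 shows nicely how the transition probabilities enter the calculation. Using the optimal values V*(s) = max_a Q(s,a) from the result above (V*(s0) ≈ 67.51, V*(s1) ≈ 84.73, V*(s2) ≈ 149.70), the Q-value iteration formula reproduces the result:

Q(s=2,a=0) \approx 0.8 \cdot (100 + 0.9 \cdot 67.51) + 0.1 \cdot (0 + 0.9 \cdot 84.73) + 0.1 \cdot (0 + 0.9 \cdot 149.70) \approx 149.70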

2.2.3.2 Watch the demo

Discount rate 0.9:

more complex example

2.2.3.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s2 | s0 (winner) | s1 (winner) | s2 (winner) |
|---|---|---|---|---|---|---|---|
| 0.0 | short-sighted | [1, -1] | [0, -50] | [80] | a0 | a0 | a0 |
| 0.1 | short-sighted | [1.11, -0.94] | [0, -41.91] | [80.90] | a0 | a0 | a0 |
| 0.5 | half short-sighted | [2, -0.5] | [0, -7.47] | [85.05] | a0 | a0 | a0 |
| 0.9 | far-sighted | [61.76, 67.51] | [76.27, 84.74] | [149.71] | a1 | a1 | a0 |

Graphic:

more complex example result

more complex example result

more complex example result

more complex example result

2.2.4 Real example

2.2.4.1 Code

In progress..

2.2.4.2 Watch the demo

In progress..

2.2.4.3 Comparison of different discount rates

In progress..

3. Temporal Difference Learning and Q-Learning

3.1 Theory

3.1.1 Formula

In progress..
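As a rough orientation until this section is written out, the general (textbook) temporal difference Q-Learning update rule reads as follows; note that this is the standard form and not necessarily the exact variant implemented by this framework:

Q_{k+1}(s,a) \leftarrow (1 - \alpha) \cdot Q_k(s,a) + \alpha \cdot [R(s, a, s') + \gamma \cdot \max_{a'} Q_k(s',a')]

Here α is the learning rate that controls how strongly a newly observed transition to s' overwrites the old estimate, and γ is the discount rate as before. In contrast to the Q-value iteration of section 2.1.3, the expected value over all possible state changes is replaced by single observed transitions, which is why the grid world examples below use a large number of random exploration iterations.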

3.2 Usage

3.2.1 Real example

3.2.1.1 Code

In progress..

3.2.1.2 Watch the demo

In progress..

3.2.2 Simple Grid World

Imagine a person who is currently on the field x=5, y=3. The goal is to find a safe way to the field x=1, y=3. The red fields must be avoided: they are chasms that endanger the person (negative rewards, i.e. punishments). Which way should the person go?

grid world raw small

Let's calculate that:

3.2.2.1 Code
var discountRate = 0.95;
var width  = 5;
var height = 3;
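/* reward map (assumed to be indexed as R[x][y] with zero-based coordinates): 100 marks the goal field, -10 the chasms */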
var R      = {
    0: {2: 100},
    1: {2: -10},
    2: {2: -10, 1: -10},
    3: {2: -10},
    4: {2: 0, 0: -10}
};

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);

It returns:

grid world calculated small

3.2.2.2 Watch the demo

In progress.

3.2.3 Extended Grid World

The same as example 3.2.2, just with a bigger grid world:

var width  = 10;
var height = 5;
var R      = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};

So far, that is still easy:

grid world raw wide

Now imagine that the person is drunk. That means that with a certain probability the person steps to the right or to the left, although they wanted to go straight ahead. Depending on how drunk the person is, we choose a probability of 2.5% of drifting to the left and 2.5% of drifting to the right (splitT = 0.025). What is the safest way now? Preliminary consideration: first moving away from the chasms and then staying away from them might now be better than taking the shortest route.
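Conceptually, every intended move of the drunk person splits into three possible moves. The following self-contained sketch only illustrates how the probabilities are distributed for splitT = 0.025; the function name and the direction representation are made up for this illustration and are not part of the framework:

/* illustration only (not part of the framework): how an intended move splits
   into three possible moves when the person is drunk */
function splitMove(intendedDirection, splitT) {
    /* directions ordered clockwise: up, right, down, left */
    var directions = ['up', 'right', 'down', 'left'];
    var i = directions.indexOf(intendedDirection);

    return [
        {direction: directions[i],           probability: 1 - 2 * splitT}, /* straight ahead */
        {direction: directions[(i + 3) % 4], probability: splitT},         /* drift to the left */
        {direction: directions[(i + 1) % 4], probability: splitT}          /* drift to the right */
    ];
}

console.log(splitMove('up', 0.025));
/* [ { direction: 'up',    probability: 0.95  },
     { direction: 'left',  probability: 0.025 },
     { direction: 'right', probability: 0.025 } ] */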

Let's calculate that:

3.2.3.1 Code
var discountRate = 0.95;
var width  = 10;
var height = 5;
var R      = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
var splitT = 0.025;

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();
rlQLearning.adoptConfig({splitT: splitT});

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);

It returns:

grid world calculated small drunk

3.2.3.2 Watch the demo

In progress.

A. Tools

B. Authors

C. Licence

This tutorial is licensed under the MIT License - see the LICENSE.md file for details.

D. Closing words

Have fun! :)