
A JavaScript reinforcement learning framework

This is a JavaScript reinforcement learning framework.

1. Introduction

In progress..

2. Markov decision process (MDP)

2.1 Theory

2.1.1 The Bellman equation

V^*(s) = \max_{a} \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \cdot V^*(s')] \quad \forall s

2.1.2 The value iteration algorithm

V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \cdot V_k(s')] \quad \forall s

2.1.3 The Q-value iteration algorithm

Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \cdot \max_{a'} Q_k(s',a')] \quad \forall (s,a)
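To make the Q-value iteration concrete, here is a minimal, framework-independent sketch in plain JavaScript. It is only an illustration, not part of the framework's API (the calculateQ method used in the examples below plays this role); the transition model is assumed to be given as transitions[s][a], an array of {state, probability, reward} objects.

/*
 * Minimal Q-value iteration sketch (illustration only, not the framework's API).
 * transitions[s][a] is an array of {state, probability, reward} objects,
 * i.e. all possible state changes of action a taken in state s.
 */
function qValueIteration(transitions, discountFactor, threshold) {
    /* start with Q_0(s, a) = 0 */
    var Q = transitions.map(function (actions) {
        return actions.map(function () { return 0; });
    });

    var maxChange;

    do {
        maxChange = 0;

        var Qnext = [];

        for (var s = 0; s < transitions.length; s++) {
            Qnext[s] = [];

            for (var a = 0; a < transitions[s].length; a++) {
                var q = 0;

                /* Q_{k+1}(s,a) = Σ_s' T(s,a,s') · [R(s,a,s') + γ · max_a' Q_k(s',a')] */
                for (var i = 0; i < transitions[s][a].length; i++) {
                    var sc = transitions[s][a][i];
                    q += sc.probability * (sc.reward + discountFactor * Math.max.apply(null, Q[sc.state]));
                }

                maxChange = Math.max(maxChange, Math.abs(q - Q[s][a]));
                Qnext[s][a] = q;
            }
        }

        Q = Qnext;
    } while (maxChange > threshold);

    return Q;
}

For the super basic example of section 2.2.1, transitions would be [[[{state: 0, probability: 1.0, reward: 1}], [{state: 0, probability: 1.0, reward: -1}], [{state: 0, probability: 1.0, reward: 0}]]], and qValueIteration(transitions, 0.9, 0.001) returns approximately [[10, 8, 9]].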

2.2 Usage

2.2.1 Super basic example

Let's look at a single state s0 with three actions that all point back to the state s0. The first action a0 receives a reward of 1. The second action a1 receives a penalty of 1 (a reward of -1). The third action a2 remains neutral and results in neither reward nor punishment. Each action has only one possible state change, so every transition probability is 1.0 (100%). The whole setup does not do anything useful, but it illustrates the procedure:

super basic example

As one can easily see, a0 is the best option and leads to the maximum reward. a1 only brings punishment and is the least favorable choice, while a2 is the neutral option without any reward or penalty. Let's calculate that:

2.2.1.1 Code

The written-out version:

var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* create a0, a1 and a2 */
var a0 = rl.addAction(s0);
var a1 = rl.addAction(s0);
var a2 = rl.addAction(s0);

/* add the action-to-state connections (state changes): addStateChange(action, targetState, probability, reward) */
rl.addStateChange(a0, s0, 1.0,  1);
rl.addStateChange(a1, s0, 1.0, -1);
rl.addStateChange(a2, s0, 1.0,  0);

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));

The short version:

var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* s0.a0, s0.a1 and s0.a2 */
rl.addAction(s0, new StateChange(s0, 1.0,  1));
rl.addAction(s0, new StateChange(s0, 1.0, -1));
rl.addAction(s0, new StateChange(s0, 1.0,  0));

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));

It returns:

[
    [1, -1, 0]
]

As we suspected above, a0 is the winner with the maximum value of Q(s=0): Q(s=0,a=0) = 1. The discountFactor is set to 0 because we only want to consider one iteration step. The discountFactor determines the importance of future rewards: a factor of 0 makes the agent "short-sighted" by considering only the immediate rewards, while a factor close to 1 makes it strive for a high long-term reward. Because it is set to 0 here, only the next step matters, which leads to the result shown above.

The situation doesn't change if we take a more far-sighted view and set the discount factor close to 1 (e.g. 0.9):

var discountFactor = 0.9;

It returns:

[
    [9.991404955442832, 7.991404955442832, 8.991404955442832]
]

Q(s=0,a=0) is still the winner with the maximum of Q(s=0) ≈ 10. The calculateQ function repeats the Q-value iteration above until the change in Q between two iterations falls below a certain threshold; the default value of this threshold is 0.001.
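The value of roughly 10 can also be derived by hand as a geometric series of discounted rewards: always choosing a0 yields a reward of 1 in every step, so

Q(s=0,a=0) = 1 + 0.9 \cdot 1 + 0.9^2 \cdot 1 + \ldots = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1 - 0.9} = 10

The values of a1 and a2 differ only in their immediate reward (-1 and 0 instead of 1), which explains the results of roughly 8 and 9.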

2.2.1.2 Watch the demo

Discount rate 0.9:

super basic example

2.2.2 Basic example

Let's look at the following example:

basic example

If we look at this example in the short term (discountRate = 0), it is a good idea to always take a1 from s0 and stay in state s0: we get a reward of 2 every time and avoid the punishment of -5. From a far-sighted point of view (discountRate = 0.9), it is better to take a0, because after the punishment of -5 we receive a reward of 10 in the next step (a sum of 5 per round trip instead of only 2). Let's calculate that:

2.2.2.1 Code
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0 and s1 */
var s0 = rl.addState();
var s1 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s1, 1.0, -5));
rl.addAction(s0, new StateChange(s0, 1.0,  2));

/* s1.a0 */
rl.addAction(s1, new StateChange(s0, 1.0, 10));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));

It returns:

[
    [21.044799074176453, 20.93957918978874],
    [28.93957918978874]
]

As we expected, from a far-sighted point of view it is better to choose s0.a0, which has the maximum value Q(s=0,a=0) ≈ 21.04.
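This value can also be checked by hand. The optimal far-sighted policy alternates between the two states (take s0.a0 for -5, then s1.a0 for 10, and so on), so Q(s=0,a=0) must satisfy

Q(s=0,a=0) = -5 + 0.9 \cdot (10 + 0.9 \cdot Q(s=0,a=0)) \quad \Rightarrow \quad Q(s=0,a=0) = \frac{-5 + 0.9 \cdot 10}{1 - 0.9^2} \approx 21.05

which matches the calculated value up to the iteration threshold.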

2.2.2.2 Watch the demo

Discount rate 0.9:

basic example

2.2.2.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s0 (winner) | s1 (winner) |
|---|---|---|---|---|---|
| 0.0 | short-sighted | [-5, 2] | [10] | a1 | a0 |
| 0.1 | short-sighted | [-3.98, 2.22] | [10.22] | a1 | a0 |
| 0.5 | half short-sighted | [1.00, 4.00] | [12.00] | a1 | a0 |
| 0.9 | far-sighted | [21.04, 20.94] | [28.94] | a0 | a0 |

Graphic:

basic example result

2.2.3 More complex example

Let's look at a somewhat more complex example:

more complex example

Short-sightedly, it is a good idea to always take a0 and stay in state s0. But what does the far-sighted view say? Is courage rewarded in this case? Let's calculate that:

2.2.3.1 Code
var discountRate =  0.9;

var rl = new ReinforcementLearning.mdp();

/* s0, s1 and s2 */
var s0 = rl.addState();
var s1 = rl.addState();
var s2 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s0, 1.0,   1));
rl.addAction(s0, new StateChange(s0, 0.5,  -2), new StateChange(s1, 0.5, 0));

/* s1.a0 and s1.a1 */
rl.addAction(s1, new StateChange(s1, 1.0,   0));
rl.addAction(s1, new StateChange(s2, 1.0, -50));

/* s2.a0 */
rl.addAction(s2, new StateChange(s0, 0.8, 100), new StateChange(s1, 0.1, 0), new StateChange(s2, 0.1, 0));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));

It returns:

[
    [61.75477734479686, 67.50622243150205],
    [76.25766751820726, 84.73165595751362],
    [149.70275422340958]
]

Looking at the example far-sightedly (discountRate = 0.9), it is a good idea to take action a1 in state s0, action a1 in state s1 and action a0 in state s2.
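The value of state s2 shows nicely how the transition probabilities enter the calculation. Using the optimal values V*(s) = max_a Q(s,a) from the result above (V*(s0) ≈ 67.51, V*(s1) ≈ 84.73, V*(s2) ≈ 149.70), the Q-value iteration formula reproduces the result:

Q(s=2,a=0) \approx 0.8 \cdot (100 + 0.9 \cdot 67.51) + 0.1 \cdot (0 + 0.9 \cdot 84.73) + 0.1 \cdot (0 + 0.9 \cdot 149.70) \approx 149.70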

2.2.3.2 Watch the demo

Discount rate 0.9:

more complex example

2.2.3.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s2 | s0 (winner) | s1 (winner) | s2 (winner) |
|---|---|---|---|---|---|---|---|
| 0.0 | short-sighted | [1, -1] | [0, -50] | [80] | a0 | a0 | a0 |
| 0.1 | short-sighted | [1.11, -0.94] | [0, -41.91] | [80.90] | a0 | a0 | a0 |
| 0.5 | half short-sighted | [2, -0.5] | [0, -7.47] | [85.05] | a0 | a0 | a0 |
| 0.9 | far-sighted | [61.76, 67.51] | [76.27, 84.74] | [149.71] | a1 | a1 | a0 |

Graphic:

more complex example result

more complex example result

more complex example result

more complex example result

2.2.4 Real example

2.2.4.1 Code

In progress..

2.2.4.2 Watch the demo

In progress..

2.2.4.3 Comparison of different discount rates

In progress..

3. Temporal Difference Learning and Q-Learning

3.1 Theory

3.1.1 Formula

In progress..
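As a rough orientation until this section is written out, the general (textbook) temporal difference Q-Learning update rule reads as follows; note that this is the standard form and not necessarily the exact variant implemented by this framework:

Q_{k+1}(s,a) \leftarrow (1 - \alpha) \cdot Q_k(s,a) + \alpha \cdot [R(s, a, s') + \gamma \cdot \max_{a'} Q_k(s',a')]

Here α is the learning rate that controls how strongly a newly observed transition to s' overwrites the old estimate, and γ is the discount rate as before. In contrast to the Q-value iteration of section 2.1.3, the expected value over all possible state changes is replaced by single observed transitions, which is why the grid world examples below use a large number of random exploration iterations.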

3.2 Usage

3.2.1 Real example

3.2.1.1 Code

In progress..

3.2.1.2 Watch the demo

In progress..

3.2.2 Simple Grid World

Imagine a person who is currently on the field x=5, y=3. The goal is to find a safe way to the field x=1, y=3. The red fields must be avoided: they are chasms that endanger the person (negative rewards, i.e. punishments). Which way should the person go?

grid world raw small

Let's calculate that:

3.2.2.1 Code
var discountRate = 0.95;
var width  = 5;
var height = 3;
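/* reward map (assumed to be indexed as R[x][y] with zero-based coordinates): 100 marks the goal field, -10 the chasms */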
var R      = {
    0: {2: 100},
    1: {2: -10},
    2: {2: -10, 1: -10},
    3: {2: -10},
    4: {2: 0, 0: -10}
};

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);

It returns:

grid world calculated small

3.2.2.2 Watch the demo

In progress.

3.2.3 Extended Grid World

The same as example 3.2.2, just with a bigger grid world:

var width  = 10;
var height = 5;
var R      = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};

So far, that is still easy:

grid world raw wide

Now imagine that the person is drunk. That means that with a certain probability the person steps to the right or to the left, although they wanted to go straight ahead. Depending on how drunk the person is, we choose a probability of 2.5% of drifting to the left and 2.5% of drifting to the right (splitT = 0.025). What is the safest way now? Preliminary consideration: first moving away from the chasms and then staying away from them might now be better than taking the shortest route.
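Conceptually, every intended move of the drunk person splits into three possible moves. The following self-contained sketch only illustrates how the probabilities are distributed for splitT = 0.025; the function name and the direction representation are made up for this illustration and are not part of the framework:

/* illustration only (not part of the framework): how an intended move splits
   into three possible moves when the person is drunk */
function splitMove(intendedDirection, splitT) {
    /* directions ordered clockwise: up, right, down, left */
    var directions = ['up', 'right', 'down', 'left'];
    var i = directions.indexOf(intendedDirection);

    return [
        {direction: directions[i],           probability: 1 - 2 * splitT}, /* straight ahead */
        {direction: directions[(i + 3) % 4], probability: splitT},         /* drift to the left */
        {direction: directions[(i + 1) % 4], probability: splitT}          /* drift to the right */
    ];
}

console.log(splitMove('up', 0.025));
/* [ { direction: 'up',    probability: 0.95  },
     { direction: 'left',  probability: 0.025 },
     { direction: 'right', probability: 0.025 } ] */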

Let's calculate that:

3.2.3.1 Code
var discountRate = 0.95;
var width  = 10;
var height = 5;
var R      = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
var splitT = 0.025;

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();
rlQLearning.adoptConfig({splitT: splitT});

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);

It returns:

grid world calculated small drunk

3.2.3.2 Watch the demo

In progress.

A. Tools

B. Authors

C. Licence

This tutorial is licensed under the MIT License - see the LICENSE.md file for details.

D. Closing words

Have fun! :)