-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Demo of chapter 21 of the 4th edition (#1103)
* add demo of chapter 18 * add demo of chapter 21 * remove chapter 18 * monior style change
- Loading branch information
1 parent
483bf81
commit a01aceb
Showing
3 changed files
with
636 additions
and
0 deletions.
There are no files selected for viewing
212 changes: 212 additions & 0 deletions
212
notebooks/chapter21/Active Reinforcement Learning.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,212 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# ACTIVE REINFORCEMENT LEARNING\n", | ||
"\n", | ||
"This notebook mainly focuses on active reinforce learning algorithms. For a general introduction to reinforcement learning and passive algorithms, please refer to the notebook of **[Passive Reinforcement Learning](./Passive%20Reinforcement%20Learning.ipynb)**.\n", | ||
"\n", | ||
"Unlike Passive Reinforcement Learning in Active Reinforcement Learning, we are not bound by a policy pi and we need to select our actions. In other words, the agent needs to learn an optimal policy. The fundamental tradeoff the agent needs to face is that of exploration vs. exploitation. \n", | ||
"\n", | ||
"## QLearning Agent\n", | ||
"\n", | ||
"The QLearningAgent class in the rl module implements the Agent Program described in **Fig 21.8** of the AIMA Book. In Q-Learning the agent learns an action-value function Q which gives the utility of taking a given action in a particular state. Q-Learning does not require a transition model and hence is a model-free method. Let us look into the source before we see some usage examples." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%psource QLearningAgent" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The Agent Program can be obtained by creating the instance of the class by passing the appropriate parameters. Because of the __ call __ method the object that is created behaves like a callable and returns an appropriate action as most Agent Programs do. To instantiate the object we need a `mdp` object similar to the `PassiveTDAgent`.\n", | ||
"\n", | ||
" Let us use the same `GridMDP` object we used above. **Figure 17.1 (sequential_decision_environment)** is similar to **Figure 21.1** but has some discounting parameter as **gamma = 0.9**. The enviroment also implements an exploration function **f** which returns fixed **Rplus** until agent has visited state, action **Ne** number of times. The method **actions_in_state** returns actions possible in given state. It is useful when applying max and argmax operations." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let us create our object now. We also use the **same alpha** as given in the footnote of the book on **page 769**: $\\alpha(n)=60/(59+n)$ We use **Rplus = 2** and **Ne = 5** as defined in the book. The pseudocode can be referred from **Fig 21.7** in the book." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os, sys\n", | ||
"sys.path = [os.path.abspath(\"../../\")] + sys.path\n", | ||
"from rl4e import *\n", | ||
"from mdp import sequential_decision_environment, value_iteration" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"q_agent = QLearningAgent(sequential_decision_environment, Ne=5, Rplus=2, \n", | ||
" alpha=lambda n: 60./(59+n))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now to try out the q_agent we make use of the **run_single_trial** function in rl.py (which was also used above). Let us use **200** iterations." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for i in range(200):\n", | ||
" run_single_trial(q_agent,sequential_decision_environment)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now let us see the Q Values. The keys are state-action pairs. Where different actions correspond according to:\n", | ||
"\n", | ||
"north = (0, 1) \n", | ||
"south = (0,-1) \n", | ||
"west = (-1, 0) \n", | ||
"east = (1, 0)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"q_agent.Q" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The Utility U of each state is related to Q by the following equation.\n", | ||
"\n", | ||
"$$U (s) = max_a Q(s, a)$$\n", | ||
"\n", | ||
"Let us convert the Q Values above into U estimates.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"U = defaultdict(lambda: -1000.) # Very Large Negative Value for Comparison see below.\n", | ||
"for state_action, value in q_agent.Q.items():\n", | ||
" state, action = state_action\n", | ||
" if U[state] < value:\n", | ||
" U[state] = value" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we can output the estimated utility values at each state:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"defaultdict(<function __main__.<lambda>()>,\n", | ||
" {(0, 0): -0.0036556430391564178,\n", | ||
" (1, 0): -0.04862675963288682,\n", | ||
" (2, 0): 0.03384490363100474,\n", | ||
" (3, 0): -0.16618771401113092,\n", | ||
" (3, 1): -0.6015323978614368,\n", | ||
" (0, 1): 0.09161077177913537,\n", | ||
" (0, 2): 0.1834607974581678,\n", | ||
" (1, 2): 0.26393277962204903,\n", | ||
" (2, 2): 0.32369726495311274,\n", | ||
" (3, 2): 0.38898341569576245,\n", | ||
" (2, 1): -0.044858154562400485})" | ||
] | ||
}, | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"U" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let us finally compare these estimates to value_iteration results." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{(0, 1): 0.3984432178350045, (1, 2): 0.649585681261095, (3, 2): 1.0, (0, 0): 0.2962883154554812, (3, 0): 0.12987274656746342, (3, 1): -1.0, (2, 1): 0.48644001739269643, (2, 0): 0.3447542300124158, (2, 2): 0.7953620878466678, (1, 0): 0.25386699846479516, (0, 2): 0.5093943765842497}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(value_iteration(sequential_decision_environment))" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
Oops, something went wrong.