Commit

Update 2016-11-16-Deep-Learning-Research-Review-Week-2&barryclark#58-…Reinforcement-Learning.html
adeshpande3 committed Nov 19, 2016
1 parent 9b4b251 commit fc662c5
Showing 1 changed file with 11 additions and 11 deletions.
---
<link href="https://afeld.github.io/emoji-css/emoji.css" rel="stylesheet">
<img src="/assets/Cover6th.png">
<p><em>This is the 2<sup>nd</sup> installment of a new series called Deep Learning Research Review. Every couple weeks or so, I&rsquo;ll be summarizing and explaining research papers in specific subfields of deep learning. This week focuses on Reinforcement Learning. </em><a href="https://adeshpande3.github.io/adeshpande3.github.io/Deep-Learning-Research-Review-Week-1-Generative-Adversarial-Nets" target="_blank"><em>Last time</em></a><em> was Generative Adversarial Networks ICYMI</em></p>
<h2><strong>Introduction to Reinforcement Learning</strong></h2>
<p><span style="text-decoration: underline;">3 Categories of Machine Learning</span></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Before getting into the papers, let&rsquo;s first talk about what <strong>reinforcement learning</strong> is. The field of machine learning can be separated into 3 main categories.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Policy iteration is great and all, but it only works when we have a given MDP. The MDP essentially tells you how the environment works, which realistically is not something you&rsquo;ll be given in real-world scenarios. When we aren&rsquo;t given an MDP, we use model-free methods that go directly from the agent&rsquo;s experience of and interactions with the environment to the value functions and policies. We&rsquo;re going to be doing the same steps of policy evaluation and policy improvement, just without the information given by the MDP.</p>
<p>The way we do this: instead of improving our policy by optimizing over the state value function, we&rsquo;re going to optimize over the action value function Q. Remember how we decomposed the state value function into the sum of the immediate reward and the value function of the successor state? Well, we can do the same with our Q function.</p>
<img src="/assets/IRL7.png">
<p>Now, we&rsquo;re going to go through the same process of policy evaluation and policy improvement, except we replace our state value function V with our action value function Q. I&rsquo;m going to skip over the details of what changes with the evaluation/improvement steps. To understand MDP-free evaluation and improvement methods, you&rsquo;d need to cover topics such as Monte Carlo Learning, Temporal Difference Learning, and SARSA, each of which would require a whole blog post by itself (if you are interested, though, please take a listen to David Silver&rsquo;s <a href="https://www.youtube.com/watch?v=PnHCvfgC_ZA" target="_blank">Lecture 4</a> and <a href="https://www.youtube.com/watch?v=0g4j2k_Ggc4" target="_blank">Lecture 5</a>). Right now, however, I&rsquo;m going to jump ahead to value function approximation and the methods discussed in the AlphaGo and Atari papers, and hopefully that should give a taste of modern RL techniques. <strong>The main takeaway is that we want to find the optimal policy &pi;<sup>*</sup> that maximizes our action value function Q.</strong></p>
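<p>To make that takeaway concrete, here&rsquo;s a minimal tabular sketch (my own illustration, not code from either paper, and it glosses over the Monte Carlo/TD/SARSA distinctions above): we never touch an MDP; Q is estimated purely from interaction, and the policy is simply &ldquo;act (mostly) greedily with respect to the current Q&rdquo;. The Gym-style <code>env.reset()</code>/<code>env.step()</code> interface and integer states are assumptions made for the sake of the example.</p>
<pre><code>import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal model-free control sketch: no MDP (transition probabilities or
    reward model) is given; Q is learned from experience alone.
    Assumes integer states and a Gym-style env:
    env.reset() returns s, env.step(a) returns (s_next, r, done, info)."""
    q = np.zeros((n_states, n_actions))           # action value function Q(s, a)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Policy improvement: act epsilon-greedily with respect to Q
            if np.random.rand() &lt; epsilon:
                a = np.random.randint(n_actions)  # explore
            else:
                a = int(np.argmax(q[s]))          # exploit the current Q estimate
            s_next, r, done, _ = env.step(a)
            # Policy evaluation: one-step update toward the immediate reward
            # plus the discounted value of the best next action
            target = r + gamma * np.max(q[s_next]) * (not done)
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q   # the greedy policy is then pi(s) = argmax over a of Q(s, a)
</code></pre>
<p>(This particular update is the Q-learning flavor; SARSA would instead use the Q value of the action actually taken next.)</p>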
<p><span style="text-decoration: underline;">Value Function Approximation</span></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; So, if you think about everything we&rsquo;ve learned up until this point, we&rsquo;ve treated our problem in a relatively simplistic way. Look at the above Q equation. We&rsquo;re taking in a specific state S and action A, and then computing a number that basically tells us what the expected return is. Now let&rsquo;s imagine that our agent moves 1 millimeter to the right. This means we have a whole new state S&rsquo;, and now we&rsquo;re going to have to compute a Q value for that. In real-world RL problems, there are millions and millions of states, so it&rsquo;s important that our value function is able to generalize, meaning we don&rsquo;t have to store a completely separate value for every possible state. The solution is to use a <strong>Q value function approximation </strong>that is able to generalize to unknown states.</p>
<p>So, what we want is some function, let&rsquo;s call it Qhat, that gives a rough approximation of the Q value given some state S and some action A.</p>
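<p>As a rough illustration (a sketch of the general idea, not the formulation used in either paper), the simplest Qhat is a linear function of some feature vector x(S, A) with a weight vector w, and we nudge w so that Qhat(S, A) moves toward whatever return target our evaluation step produces:</p>
<pre><code>import numpy as np

def qhat(w, x):
    """Linear value function approximation: Qhat(S, A; w) = w . x(S, A),
    where x(S, A) is a feature vector describing the state-action pair."""
    return np.dot(w, x)

def sgd_update(w, x, target, lr=0.01):
    """One stochastic gradient descent step on the squared error
    (target - Qhat)^2; for a linear Qhat the gradient w.r.t. w is just x."""
    error = target - qhat(w, x)
    return w + lr * error * x

# Hypothetical usage with 8 made-up features per state-action pair
w = np.zeros(8)
x_sa = np.random.rand(8)              # x(S, A) for some observed (S, A)
w = sgd_update(w, x_sa, target=1.0)   # target = observed return estimate
</code></pre>
<p>Swap the linear function for a deep neural network and you essentially have the setup the next two papers build on.</p>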
<p><strong>Other Resources for Learning RL</strong></p>
<p><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </strong>Phew. That was a lot of info. By no means, however, was that a comprehensive overview of the field. If you&rsquo;d like a more in-depth overview of RL, I&rsquo;d strongly recommend these resources.</p>
<ul>
<li>David Silver (from DeepMind) Reinforcement Learning <a href="https://www.youtube.com/watch?v=2pWv7GOvuf0&amp;list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT" target="_blank">Video Lectures</a>
<ul>
<li>My <a href="https://docs.google.com/document/d/1TjmYDOxQzOQ0jd0lUiFOVQ1hNAmtwKfSAACck9dR7a8/edit?usp=sharing" target="_blank">personal notes</a> from the RL course</li>
</ul>
</li>
<li>Sutton and Barto&rsquo;s <a href="https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf" target="_blank">Reinforcement Learning Textbook</a> (This is really the holy grail if you are determined to learn the ins and outs of this subfield)</li>
<li>Andrej Karpathy&rsquo;s <a href="http://karpathy.github.io/2016/05/31/rl/" target="_blank">Blog Post</a> on RL (Start with this one if you want to ease into RL and want to see a really well-done practical example)</li>
<li><a href="http://ai.berkeley.edu/lecture_videos.html" target="_blank">UC Berkeley CS 188</a> Lectures 8-11</li>
<li><a href="https://gym.openai.com/" target="_blank">Open AI Gym</a>: When you feel comfortable with RL, try creating your own agents with this reinforcement learning toolkit that Open AI created</li>
</ul>
<h2><span style="text-decoration: underline;"><a href="http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf" target="_blank"><strong>DQN For Reinforcement Learning</strong></a></span><strong> (RL With Atari Games)</strong></h2>
<img src="/assets/IRL10.png">
<p><span style="text-decoration: underline;">Introduction</span></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; This paper was published by Google DeepMind in February of 2015 and graced the cover of Nature, a world-famous weekly journal of science. This was one of the first successful attempts at combining deep neural networks with reinforcement learning (<a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf" target="_blank">This</a> was DeepMind&rsquo;s original paper). The paper showed that their system was able to play Atari games at a level comparable to professional game testers across a set of 49 games. Let&rsquo;s take a look at how they did it.</p>
<p><span style="text-decoration: underline;">Approach</span></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Okay, so remember where we left off in the intro tutorial at the beginning of the post? We had just described the main goal of having to optimize our action value function Q. The folks at DeepMind approached this task through a Deep Q-Network, or a DQN. This network is able to come up with successful policies that optimize Q, all from inputs in the form of pixels from the game screen and the current score.</p>
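<p>To make the pixels-to-policy idea concrete, here&rsquo;s a minimal PyTorch-style sketch of that kind of network: it maps a stack of preprocessed game frames to one Q value per possible action, and the greedy policy just picks the argmax. The 84x84x4 input and the layer sizes below are assumptions for illustration (they follow the commonly reported DQN setup); the exact preprocessing and layer configuration are spelled out in the paper.</p>
<pre><code>import torch
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of a pixels-to-Q-values network in the spirit of the DQN paper:
    input is a stack of 4 preprocessed 84x84 frames, output is one Q value
    per possible joystick action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # Q(s, a) for every action in one pass
        )

    def forward(self, frames):           # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))

# Greedy policy: take the action with the highest predicted Q value
net = DQN(n_actions=18)                  # Atari exposes up to 18 actions
state = torch.zeros(1, 4, 84, 84)        # dummy frame stack
action = net(state).argmax(dim=1).item()
</code></pre>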
<p><span style="text-decoration: underline;">Network Architecture</span></p>
<img src="/assets/IRL14.png">
<p>As you remember, the value function is basically a metric for measuring &ldquo;how good it is to be in a particular situation&rdquo;. If you look at #4, you can see, based on the trajectory of the ball and the location of the bricks, that we&rsquo;re in for a lot of points, and the high value is quite representative of that.</p>
<p>All 49 Atari games used the same network architecture, algorithm, and hyperparameters, which is an impressive testament to the robustness of such an approach to reinforcement learning. The combination of deep networks and traditional reinforcement learning strategies, like Q-learning, proved to be a great breakthrough in setting the stage for&hellip;</p>
<h2><span style="text-decoration: underline;"><a href="http://www.nature.com/nature/journal/v529/n7587/pdf/nature16961.pdf" target="_blank"><strong>Mastering AlphaGo with RL</strong></a></span></h2>
<img src="/assets/IRL15.png">
<p><span style="text-decoration: underline;">Introduction</span></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong>4-1</strong>. That&rsquo;s the record DeepMind&rsquo;s RL agent had against one of the best Go players in the world, Lee Sedol. In case you didn&rsquo;t know, Go is an abstract strategy game of capturing territory on a game board. It is considered to be one of the hardest games in the world for AI because of the incredible number of different game scenarios and moves. The paper begins with a comparison of Go and common board games like chess and checkers. While those can be attacked with variations of tree search algorithms, Go is a totally different animal because there are about 250<sup>150</sup> different possible sequences of moves in a game. It&rsquo;s clear that reinforcement learning was needed, so let&rsquo;s look into how AlphaGo managed to beat the odds.</p>
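<p>(For a sense of scale: that figure comes from a branching factor of roughly 250 legal moves per position over a game lasting roughly 150 moves, and 250<sup>150</sup> is on the order of 10<sup>360</sup> possible games; chess, by comparison, is roughly 35<sup>80</sup>, or about 10<sup>123</sup>. Exhaustively searching a tree of that size simply isn&rsquo;t an option.)</p>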