# dennybritz/reinforcement-learning


## Model-Free Prediction & Control with Monte Carlo (MC)

### Learning Goals

- Understand the difference between Prediction and Control
- Know how to use the MC method for predicting state values and state-action values
- Understand the on-policy first-visit MC control algorithm
- Understand off-policy MC control algorithms
- Understand Weighted Importance Sampling
- Understand the benefits of MC algorithms over the Dynamic Programming approach

### Summary

- Dynamic Programming approaches assume complete knowledge of the environment (the MDP). In practice, we often don't have full knowledge of how the world works.
- Monte Carlo (MC) methods can learn directly from experience collected by interacting with the environment. An episode of experience is a series of `(State, Action, Reward, Next State)` tuples.
- MC methods work based on episodes. We sample episodes of experience and make updates to our estimates at the end of each episode. MC methods have high variance (due to lots of random decisions within an episode) but are unbiased.
- MC Policy Evaluation: Given a policy, we want to estimate the state-value function V(s). Sample episodes of experience and estimate V(s) as the return received from that state onwards, averaged across all sampled episodes. The same technique works for the action-value function Q(s, a). Given enough samples, this is proven to converge.
- MC Control: The idea is the same as for Dynamic Programming. Use MC Policy Evaluation to evaluate the current policy, then improve the policy greedily. The problem: How do we ensure that we explore all states if we don't know the full environment?
- Solution to the exploration problem: Use epsilon-greedy policies instead of fully greedy policies. When making a decision, act randomly with probability epsilon. This will learn the optimal epsilon-greedy policy.
- Off-Policy Learning: How can we learn about the actual optimal (greedy) policy while following an exploratory (epsilon-greedy) policy? We can use importance sampling, which weights returns by their probability of occurring under the policy we want to learn about.
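The MC Policy Evaluation idea above can be sketched as a minimal first-visit MC prediction routine. All names here are illustrative (this is not the notebooks' code), and the toy two-state episode stands in for an environment like Blackjack:

```python
from collections import defaultdict

def first_visit_mc_prediction(sample_episode, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns across sampled episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()  # list of (state, action, reward) tuples
        # Work backwards through the episode, accumulating the return G.
        G = 0.0
        returns = []
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns.append((state, G))
        returns.reverse()
        # First-visit: only count the return from the first time a state appears.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V

# Toy deterministic chain: "A" steps to "B" with reward 0, then the
# episode ends from "B" with reward +1.
def sample_episode():
    return [("A", "go", 0.0), ("B", "go", 1.0)]

V = first_visit_mc_prediction(sample_episode, num_episodes=100)
print(V["A"], V["B"])  # both estimates are 1.0 with gamma=1
```

Because updates only happen at episode boundaries, the estimate for a state is simply the sample mean of the returns observed after its first visit, which is what makes the estimator unbiased.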
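The epsilon-greedy trick for the exploration problem amounts to mixing a uniform random policy into the greedy one. A minimal sketch, assuming `Q` is a dict keyed by `(state, action)` pairs (the helper name is illustrative):

```python
from collections import defaultdict

def make_epsilon_greedy_policy(Q, epsilon, num_actions):
    """Return a function mapping a state to action probabilities.

    Each action gets probability epsilon / num_actions; the greedy
    action additionally gets the remaining 1 - epsilon mass.
    """
    def policy(state):
        probs = [epsilon / num_actions] * num_actions
        best_action = max(range(num_actions), key=lambda a: Q[(state, a)])
        probs[best_action] += 1.0 - epsilon
        return probs
    return policy

# Usage: with Q favoring action 1 in state "s" and epsilon = 0.1,
# the policy picks action 1 with probability 0.95 and action 0 with 0.05.
Q = defaultdict(float)
Q[("s", 1)] = 5.0
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, num_actions=2)
probs = policy("s")
print(probs)
```

Since every action keeps at least epsilon / num_actions probability, every reachable state-action pair continues to be visited, which is what MC control needs to keep improving its Q estimates.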
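The off-policy idea with weighted importance sampling can be sketched as the incremental update from Sutton & Barto, where the target policy is greedy with respect to Q and the behavior policy is exploratory. Function and variable names here are illustrative, not the notebooks' own:

```python
from collections import defaultdict

def off_policy_mc_update(Q, C, episode, behavior_probs, actions, gamma=1.0):
    """Weighted-importance-sampling update from one episode.

    episode: list of (state, action, reward) generated by the behavior policy.
    behavior_probs[t]: behavior policy's probability of the action taken at step t.
    C accumulates the importance-sampling weights per (state, action).
    """
    G = 0.0  # return following step t
    W = 1.0  # cumulative importance-sampling ratio pi/b
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        G = gamma * G + reward
        C[(state, action)] += W
        # Incremental weighted average: move Q toward the return G.
        Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
        greedy_action = max(actions, key=lambda a: Q[(state, a)])
        if action != greedy_action:
            break  # pi(a|s) = 0 under the greedy target policy; ratio is zero
        W /= behavior_probs[t]  # pi(a|s) = 1 for the greedy action

# Usage: a one-step episode where the behavior policy took action 0
# with probability 0.5 and received reward +1.
Q = defaultdict(float)
C = defaultdict(float)
off_policy_mc_update(Q, C, episode=[("s", 0, 1.0)],
                     behavior_probs=[0.5], actions=[0, 1])
print(Q[("s", 0)])  # 1.0: the first weighted return
```

Dividing by the accumulated weight C rather than a raw visit count is what makes this the *weighted* importance-sampling estimator, which has lower variance than ordinary importance sampling at the cost of some bias that vanishes in the limit.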

### Lectures & Readings

**Required:**

**Optional:**

- David Silver's RL Course Lecture 4 - Model-Free Prediction (video, slides)
- David Silver's RL Course Lecture 5 - Model-Free Control (video, slides)

### Exercises

- Get familiar with the Blackjack environment (Blackjack-v0)
- Implement Monte Carlo Prediction to estimate state-action values
- Implement the on-policy first-visit Monte Carlo Control algorithm
- Implement the off-policy every-visit Monte Carlo Control algorithm with Weighted Importance Sampling