# Implementing On-policy MC control

Now, let's implement the MC control method with an epsilon-greedy policy for the blackjack game. That is, we will see how to use the MC control method to find the optimal policy for blackjack.

First, let's import the necessary libraries:

In [1]:
import gym
import pandas as pd
from collections import defaultdict
import random

Create a blackjack environment:

In [2]:
env = gym.make('Blackjack-v0')

Initialize the dictionary for storing the Q values:

In [3]:
Q = defaultdict(float)

Initialize the dictionary for storing the total return of the state-action pair:

In [4]:
total_return = defaultdict(float)

Initialize the dictionary for storing the count of the number of times a state-action pair is
visited:

In [5]:
N = defaultdict(int)

## Define the epsilon-greedy policy

We learned that we select actions based on the epsilon-greedy policy, so we define a
function called epsilon_greedy_policy, which takes the state and the Q values as input
and returns the action to perform in the given state:

In [6]:
def epsilon_greedy_policy(state,Q):
    
    #set the epsilon value to 0.5
    epsilon = 0.5
    
    #sample a random value from the uniform distribution; if the sampled value is less than
    #epsilon, select a random action, else select the action with the maximum Q value
    
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(range(env.action_space.n), key = lambda x: Q[(state,x)])
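
To see what this selection rule does in practice, here is a minimal, self-contained sketch that does not depend on gym. The helper name epsilon_greedy, the n_actions and rng parameters, and the Q values in Q_demo are assumptions made for illustration only. With epsilon set to 0.5 and two actions, the greedy action should be picked roughly 1 - 0.5 + 0.5/2 = 75% of the time, because the random branch can also land on it:

```python
import random
from collections import defaultdict

def epsilon_greedy(state, Q, n_actions=2, epsilon=0.5, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

# Seeded generator so the demo is reproducible
rng = random.Random(0)

# Hypothetical Q values for one blackjack state (illustration only)
Q_demo = defaultdict(float)
state = (14, 10, False)
Q_demo[(state, 0)] = -0.3
Q_demo[(state, 1)] = 0.2   # action 1 is the greedy action here

n_samples = 10_000
picks = [epsilon_greedy(state, Q_demo, rng=rng) for _ in range(n_samples)]
greedy_fraction = picks.count(1) / n_samples
print(f"greedy action chosen in {greedy_fraction:.1%} of samples")
```

Note that Python's max returns the first maximal element, so when all Q values are tied (as they are for a freshly initialized defaultdict), the greedy branch returns action 0.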

## Generating an episode

Now, let's generate an episode using the epsilon-greedy policy. We define a function called
generate_episode which takes the Q value as an input and returns the episode.

First, let's set the number of time steps:

In [7]:
num_timesteps = 100

In [8]:
def generate_episode(Q):
    
    #initialize a list for storing the episode
    episode = []
    
    #initialize the state using the reset function
    state = env.reset()
    
    #then for each time step
    for t in range(num_timesteps):
        
        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)
        
        #perform the selected action and store the next state information
        next_state, reward, done, info = env.step(action)
        
        #store the state, action, reward in the episode list
        episode.append((state, action, reward))
        
        #if the episode has ended then break the loop, else make the next state the
        #current state
        if done:
            break
            
        state = next_state

    return episode
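
The reset and step calls above follow the older gym API, where reset returns only the state and step returns four values. Newer gym releases (0.26 and later) and gymnasium return (state, info) from reset and five values from step. If you are running a newer version, a small wrapper along the following lines may help. This is only a sketch: reset_env and step_env are hypothetical helper names, not part of gym, and the two stand-in classes exist only so the example runs without gym installed:

```python
def reset_env(env):
    """Return only the initial state, whichever API version env follows."""
    out = env.reset()
    # Newer API: (state, info) where info is a dict
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out

def step_env(env, action):
    """Return (next_state, reward, done), whichever API version env follows."""
    out = env.step(action)
    if len(out) == 5:
        # Newer API: separate terminated and truncated flags
        next_state, reward, terminated, truncated, _info = out
        return next_state, reward, terminated or truncated
    next_state, reward, done, _info = out
    return next_state, reward, done

# Stand-in environments (assumptions for illustration, not real gym classes)
class OldStyleEnv:
    def reset(self):
        return (14, 10, False)
    def step(self, action):
        return (20, 10, False), 1.0, True, {}

class NewStyleEnv:
    def reset(self):
        return (14, 10, False), {}
    def step(self, action):
        return (20, 10, False), 1.0, True, False, {}

old_state = reset_env(OldStyleEnv())
new_state = reset_env(NewStyleEnv())
old_step = step_env(OldStyleEnv(), 1)
new_step = step_env(NewStyleEnv(), 1)
```

With these helpers, generate_episode could call reset_env(env) and step_env(env, action) in place of env.reset() and env.step(action).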

## Computing the optimal policy

Now, let's learn how to compute the optimal policy. First, let's set the number of iterations, that is, the number of episodes we want to generate:

In [9]:
num_iterations = 50000

We learned that in the on-policy control method, we are not given any policy as an
input. Instead, we start from an arbitrary policy and improve it iteratively by computing
the Q values. Since we extract the policy from the Q function, we don't have to define the
policy explicitly: as the Q values improve, the policy improves implicitly. That is, in the
first iteration we generate an episode by extracting the epsilon-greedy policy from the
initialized Q function, in which every value is zero. Over a series of iterations, the Q
function moves toward the optimal Q function, and the extracted policy moves toward the
optimal policy.

In [10]:
#for each iteration
for i in range(num_iterations):
    
    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)
    
    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]
    
    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each state-action pair
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode (first-visit MC)
        if (state, action) not in all_state_action_pairs[0:t]:
            
            #compute the return R of the state-action pair as the undiscounted sum of
            #rewards from time step t to the end of the episode
            R = sum(rewards[t:])
            
            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R
            
            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]
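
Dividing total_return by N recomputes the average of all the returns seen so far for a state-action pair. The same average can also be maintained incrementally, without storing the running total, using the update Q = Q + (R - Q) / N. The sketch below checks that the two forms agree on a few made-up returns. The names Q_inc, N_inc, and update_incremental, as well as the sample returns, are assumptions for illustration:

```python
from collections import defaultdict

Q_inc = defaultdict(float)   # running average of returns per state-action pair
N_inc = defaultdict(int)     # visit counts per state-action pair

def update_incremental(pair, R):
    """Fold one new return R into the running average for pair."""
    N_inc[pair] += 1
    Q_inc[pair] += (R - Q_inc[pair]) / N_inc[pair]

# Hypothetical returns observed for one state-action pair
pair = ((14, 10, False), 1)
returns = [1.0, -1.0, -1.0, 0.0]

for R in returns:
    update_incremental(pair, R)

# The plain average, as computed by total_return / N in the loop above
batch_mean = sum(returns) / len(returns)
print(Q_inc[pair], batch_mean)
```

Both approaches give the same Q value; the incremental form simply avoids keeping the total_return dictionary.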

Thus, with every iteration, the Q value estimates are refined, and the policy extracted from them improves as well.
After all the iterations, we can look at the Q value of each state-action pair in a pandas
data frame for more clarity.

First, let's convert the Q value dictionary to a pandas data frame:

In [11]:
df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])

Let's look at the first few rows of the data frame:

In [12]:
df.head(11)

| | state_action pair | value |
|---|---|---|
| 0 | ((14, 10, False), 0) | -0.641944 |
| 1 | ((14, 10, False), 1) | -0.617698 |
| 2 | ((11, 10, False), 1) | -0.170015 |
| 3 | ((12, 3, False), 0) | -0.180328 |
| 4 | ((12, 3, False), 1) | -0.320388 |
| 5 | ((13, 1, False), 0) | -0.752381 |
| 6 | ((11, 6, False), 1) | 0.0 |
| 7 | ((17, 6, False), 0) | -0.118644 |
| 8 | ((10, 9, False), 0) | -0.714286 |
| 9 | ((10, 9, False), 1) | -0.041322 |


As we can observe, we have the Q values for the state-action pairs that were visited during training. Now we can extract
the policy by selecting the action that has the maximum Q value in each state.
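
As a minimal sketch of this extraction step, the function below builds a state-to-action dictionary from a Q dictionary keyed by (state, action). The function name extract_greedy_policy and the n_actions parameter are assumptions for illustration. One design choice to be aware of: state-action pairs that were never visited are treated here as negative infinity, so only actions with an actual estimate can be chosen. The sample entries are copied from the rows shown above:

```python
def extract_greedy_policy(Q, n_actions=2):
    """Map each state appearing in Q to the action with the highest Q value."""
    states = {state for (state, _action) in Q}
    policy = {}
    for state in states:
        # Unvisited pairs default to -inf so they are never preferred (assumption)
        policy[state] = max(
            range(n_actions),
            key=lambda a: Q.get((state, a), float("-inf")),
        )
    return policy

# A few Q values taken from the data frame rows above
Q_sample = {
    ((14, 10, False), 0): -0.641944,
    ((14, 10, False), 1): -0.617698,
    ((12, 3, False), 0): -0.180328,
    ((12, 3, False), 1): -0.320388,
    ((11, 10, False), 1): -0.170015,
}

policy = extract_greedy_policy(Q_sample)
for state, action in sorted(policy.items()):
    print(state, "->", action)
```

In the blackjack environment, action 0 means stick and action 1 means hit, so for these sample values the extracted policy hits on (14, 10, False) and sticks on (12, 3, False). Calling extract_greedy_policy(Q) on the trained Q dictionary works the same way.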

To learn more about how to select actions based on these Q values, see the section of the book on implementing on-policy control.