# Every-visit MC prediction with the blackjack game

To follow this section, it helps to recap the every-visit Monte Carlo method we
learned earlier. Let's now implement every-visit MC prediction with the blackjack
game step by step.

Import the necessary libraries:

In [1]:
import gym
import pandas as pd
from collections import defaultdict

Create a blackjack environment:

In [2]:
env = gym.make('Blackjack-v0')

## Defining a policy

We learned that in the prediction method, we are given an input policy and we predict
its value function. So, we first define a policy function that acts as the input
policy, that is, the policy whose value function will be predicted in the upcoming steps.

As shown below, our policy function takes the state as an input. If state[0], the sum of
our cards, is greater than 18 (that is, 19 or more), it returns action 0 (stand);
otherwise, it returns action 1 (hit):

In [3]:
def policy(state):
    return 0 if state[0] > 18 else 1

This is a simple, sensible policy rather than a provably optimal one: it makes sense to
perform action 0 (stand) when our sum is already 19 or more. That is, when the sum is
that high, we don't have to perform action 1 (hit) and receive a new card, which may
cause us to go bust and lose the game.

For example, let's generate an initial state by resetting the environment as shown below:

In [4]:
state = env.reset()
print(state)

(20, 7, False)


As you can notice, state[0] = 20, that is, the sum of our cards is 20, so in this case our
policy returns action 0 (stand), as shown below:

In [5]:
print(policy(state))

0
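
To see where the policy switches from hit to stand without needing the environment, here is a minimal sketch that evaluates the same policy on a few hand-written states. The dealer card (7) and the usable-ace flag (False) are arbitrary assumptions chosen only for illustration:

```python
def policy(state):
    # Same rule as above: stand (0) when the player's sum exceeds 18, otherwise hit (1)
    return 0 if state[0] > 18 else 1

# Hand-written states: (player sum, dealer's visible card, usable ace)
for player_sum in (12, 18, 19, 21):
    state = (player_sum, 7, False)
    action = policy(state)
    print(state, "->", "stand" if action == 0 else "hit")
```

The switch happens between a sum of 18 (hit) and a sum of 19 (stand).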


Now that we have defined the policy, in the next sections we will predict the value
function (state values) of this policy.

## Generating an episode
Next, we generate an episode using the given policy. We define a function
called generate_episode, which takes the policy as an input and generates an episode
using it.

First, let's set the maximum number of time steps per episode:

In [6]:
num_timestep = 100

In [7]:
def generate_episode(policy):
    
    #let's define a list called episode for storing the episode
    episode = []
    
    #initialize the state by resetting the environment
    state = env.reset()
    
    #then for each time step
    for i in range(num_timestep):
        
        #select the action according to the given policy
        action = policy(state)
        
        #perform the action and store the next state information
        next_state, reward, done, info = env.step(action)
        
        #store the state, action, reward into our episode list
        episode.append((state, action, reward))
        
        #If the next state is a final state then break the loop else update the next state to the current state
        if done:
            break
            
        state = next_state

    return episode
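
The function above relies on the older Gym call shapes, where reset returns only the state and step returns four values. Newer Gym and Gymnasium releases are understood to return (state, info) from reset and five values from step. The following is a minimal sketch, not the author's implementation, of an episode generator that tolerates both shapes. The names generate_episode_compat and ScriptedEnv are hypothetical, and ScriptedEnv is only a scripted stand-in so that the example runs without Gym installed:

```python
def policy(state):
    return 0 if state[0] > 18 else 1

class ScriptedEnv:
    """Hypothetical stand-in for the blackjack environment: replays fixed transitions."""
    def __init__(self, new_api):
        self.new_api = new_api
        self.t = 0

    def reset(self):
        self.t = 0
        start = (12, 10, False)
        return (start, {}) if self.new_api else start

    def step(self, action):
        # Scripted outcome: the first hit reaches 15, the second hit goes bust
        script = [((15, 10, False), 0, False), ((25, 10, False), -1, True)]
        next_state, reward, done = script[self.t]
        self.t += 1
        if self.new_api:
            return next_state, reward, done, False, {}
        return next_state, reward, done, {}

def generate_episode_compat(env, policy, max_steps=100):
    episode = []
    result = env.reset()
    # A 2-tuple whose second item is a dict is treated as (state, info)
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        state = result[0]
    else:
        state = result
    for _ in range(max_steps):
        action = policy(state)
        step_result = env.step(action)
        if len(step_result) == 5:
            next_state, reward, terminated, truncated, _ = step_result
            done = terminated or truncated
        else:
            next_state, reward, done, _ = step_result
        episode.append((state, action, reward))
        if done:
            break
        state = next_state
    return episode

print(generate_episode_compat(ScriptedEnv(new_api=False), policy))
print(generate_episode_compat(ScriptedEnv(new_api=True), policy))
```

Passing the environment as an argument, instead of reading a global env, is a design choice made here so the function can be exercised against a stub.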

Let's take a look at what the output of our generate_episode function looks like. Note
that we generate the episode using the policy we defined earlier:

In [8]:
generate_episode(policy)

[((12, 10, False), 1, 0), ((15, 10, False), 1, -1)]

As we can observe, our output is in the form of [(state, action, reward)]. As shown above,
we have two states in our episode. We performed action 1 (hit) in the state (12, 10,
False) and received a reward of 0, and then performed action 1 (hit) again in the state
(15, 10, False) and received a reward of -1, which means we went bust and lost the game.
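
As a small illustration of this structure, the sketch below unpacks the sample episode shown above and computes, for every time step, the undiscounted sum of rewards from that step onward. The variable names returns and running are assumptions introduced here, and the backward pass is just one possible way to obtain these sums:

```python
# Sample episode copied from the output above: (state, action, reward) per step
episode = [((12, 10, False), 1, 0), ((15, 10, False), 1, -1)]

# Split the list of triples into three parallel tuples
states, actions, rewards = zip(*episode)

# Walk backward so each step's sum reuses the sum of the step after it
returns = []
running = 0
for reward in reversed(rewards):
    running += reward
    returns.append(running)
returns.reverse()

for state, action, ret in zip(states, actions, returns):
    print(state, action, ret)
```

Both steps end up with a return of -1, because the only non-zero reward arrives at the final step.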

Now that we have learned how to generate an episode using the given policy, next, we will
look at how to compute the value of the state (value function) using the every-visit MC
method.

## Computing the value function

We learned that in order to predict the value function, we generate several episodes using
the given policy and compute the value of a state as its average return across those
episodes. Let's see how to implement that.

First, we define total_return and N as dictionaries for storing the total return and the
number of times each state is visited across episodes, respectively.

In [9]:
total_return = defaultdict(float)
N = defaultdict(int)

Set the number of iterations, that is, the number of episodes, we want to generate:

In [10]:
num_iterations = 10000

In [11]:
#then, for every iteration
for i in range(num_iterations):
    
    #generate the episode using the given policy, that is, generate an episode using the policy
    #function we defined earlier
    episode = generate_episode(policy)
    
    #store all the states, actions, rewards obtained from the episode
    states, actions, rewards = zip(*episode)
    
    #then, for each state in the episode
    for t, state in enumerate(states):

        #compute the return R of the state as the sum of rewards from step t onward
        R = sum(rewards[t:])

        #update the total_return of the state
        total_return[state] = total_return[state] + R

        #update the number of times the state is visited
        N[state] = N[state] + 1
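
To make the "every-visit" part concrete, here is a minimal sketch that applies the same update to two hand-written toy episodes. The states "A" and "B" and their rewards are hypothetical and are not blackjack states; they are chosen so that one state appears twice within a single episode:

```python
from collections import defaultdict

# Hypothetical toy episodes of (state, action, reward); "A" occurs twice in the first one
episodes = [
    [("A", 1, 0), ("B", 1, 0), ("A", 0, 1)],
    [("B", 1, 0), ("A", 0, -1)],
]

total_return = defaultdict(float)
N = defaultdict(int)

for episode in episodes:
    states, actions, rewards = zip(*episode)
    for t, state in enumerate(states):
        # Every visit counts, including repeated visits within one episode
        R = sum(rewards[t:])
        total_return[state] += R
        N[state] += 1

values = {state: total_return[state] / N[state] for state in N}
print(dict(N))
print(values)
```

State "A" is counted three times (twice in the first episode, once in the second), which is exactly what distinguishes every-visit from first-visit averaging.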

After computing total_return and N, we can convert them into a pandas data
frame for a better understanding. (Note that this is just to make the algorithm
clearer; we don't have to convert to a pandas data frame, and we can also
implement this efficiently using just the dictionaries.)


Convert the total_return dictionary to a data frame:

In [12]:
total_return = pd.DataFrame(total_return.items(),columns=['state', 'total_return'])

Convert the counter N dictionary to a data frame:

In [13]:
N = pd.DataFrame(N.items(),columns=['state', 'N'])

Merge the two data frames on states:

In [14]:
df = pd.merge(total_return, N, on="state")

Look at the first few rows of the data frame:

In [15]:
df.head(10)

| | state | total_return | N |
|---|---|---|---|
| 0 | (7, 7, False) | -4.0 | 16 |
| 1 | (11, 7, False) | 19.0 | 43 |
| 2 | (16, 7, False) | -38.0 | 104 |
| 3 | (19, 7, False) | 55.0 | 113 |
| 4 | (20, 8, False) | 96.0 | 129 |
| 5 | (20, 2, False) | 94.0 | 142 |
| 6 | (15, 5, False) | -42.0 | 93 |
| 7 | (20, 5, False) | 62.0 | 115 |
| 8 | (12, 3, False) | -55.0 | 91 |
| 9 | (15, 3, False) | -36.0 | 96 |


As we can observe from above, we have the total return and
the number of times each state is visited.

Next, we can compute the value of each state as its average return:

In [16]:
df['value'] = df['total_return']/df['N']

Let's look at the first few rows of the data frame:

In [17]:
df.head(10)

| | state | total_return | N | value |
|---|---|---|---|---|
| 0 | (7, 7, False) | -4.0 | 16 | -0.25 |
| 1 | (11, 7, False) | 19.0 | 43 | 0.44186 |
| 2 | (16, 7, False) | -38.0 | 104 | -0.365385 |
| 3 | (19, 7, False) | 55.0 | 113 | 0.486726 |
| 4 | (20, 8, False) | 96.0 | 129 | 0.744186 |
| 5 | (20, 2, False) | 94.0 | 142 | 0.661972 |
| 6 | (15, 5, False) | -42.0 | 93 | -0.451613 |
| 7 | (20, 5, False) | 62.0 | 115 | 0.53913 |
| 8 | (12, 3, False) | -55.0 | 91 | -0.604396 |
| 9 | (15, 3, False) | -36.0 | 96 | -0.375 |


As we can observe, we now have the value of each state, which is just the average of the
returns of that state across several episodes. Thus, we have successfully predicted the
value function of the given policy using the every-visit MC method.
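
As mentioned earlier, pandas is not required for this step. The following minimal sketch computes the same averages using plain dictionaries. To stay self-contained it uses three (total_return, N) pairs copied from the rows shown above rather than a fresh run, and the name value is an assumption introduced here:

```python
# Three rows copied from the data frame above
total_return = {(7, 7, False): -4.0, (11, 7, False): 19.0, (20, 8, False): 96.0}
N = {(7, 7, False): 16, (11, 7, False): 43, (20, 8, False): 129}

# The value of a state is its total return divided by its visit count
value = {state: total_return[state] / N[state] for state in N}

for state, v in value.items():
    print(state, round(v, 6))

# A state that was never visited has no estimate; .get returns None for it
print(value.get((4, 1, False)))
```

Looking up a state is then a dictionary access such as value[(20, 8, False)], with .get as a safe option for states that never appeared in any episode.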

Okay, let's check the value of some states and see how well our value function is
estimated for the given policy. Recall that, to generate episodes, we used a policy that
selects action 0 (stand) when the sum is greater than 18 and action 1 (hit) otherwise.

Let's evaluate the value of the state (21,9,False). Our sum of cards is already 21, so
this is a good state and should have a high value. Let's see what our estimated value of
the state is:

In [18]:
df[df['state']==(21,9,False)]['value'].values

array([0.90163934])

As we can observe above, the value of the state is high.

Now, let's check the value of the state (5,8,False). Our sum of cards is just 5 and the
dealer's single visible card has a high value, 8, so in this case the value of the state
should be lower. Let's see what our estimated value of the state is:

In [19]:
df[df['state']==(5,8,False)]['value'].values

array([0.08333333])

As we can notice, the value of the state is lower.

Thus, we learned how to predict the value function of the given policy using the
every-visit MC prediction method. In the next section, we will look at how to compute the
value of the state using the first-visit MC method.