<a href="https://colab.research.google.com/github/akshaydp1995/Deep-Reinforcement-Learning/blob/master/Blackjack_Reinforcement_Learning_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Game rules: 

Blackjack is a card game in which the player plays against the dealer. 
The game starts with two cards dealt to each side; the player can see only one of the dealer's cards, and the other stays face down. The player can hit (1) to be dealt another card, or stick (0) once the sum of their cards is close to, but not over, 21. If the sum exceeds 21, the player goes bust and the dealer wins. Once the player sticks, the dealer draws cards until the sum of the dealer's cards is greater than or equal to 17. If the sum of the dealer's cards exceeds 21, the dealer goes bust and the player wins. If neither goes bust, the two sums are compared and the one closer to 21 wins; equal sums are a draw.
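
The rules above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the environment's own code: the helper names (`hand_sum`, `dealer_play`, `outcome`) are hypothetical, and it assumes an infinite deck in which face cards count as 10 and an ace counts as 11 whenever that does not bust the hand.

```python
import random

# Assumed card values: ace = 1, number cards at face value, face cards = 10.
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21

def hand_sum(hand):
    """Best total for the hand, counting one ace as 11 when possible."""
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)

def is_bust(hand):
    return hand_sum(hand) > 21

def dealer_play(hand, rng):
    """Dealer keeps drawing until the total is at least 17."""
    hand = list(hand)
    while hand_sum(hand) < 17:
        hand.append(rng.choice(DECK))
    return hand

def outcome(player, dealer):
    """+1 if the player wins, -1 if the player loses, 0 for a draw."""
    if is_bust(player):
        return -1
    if is_bust(dealer):
        return 1
    p, d = hand_sum(player), hand_sum(dealer)
    return (p > d) - (p < d)

# Example: the player sticks on 19, then the dealer plays out a hand of 16.
final_dealer = dealer_play([10, 6], random.Random(0))
print(outcome([10, 9], final_dealer))
```

The value returned by `outcome` corresponds to the three reward values used later: -1 for a loss, 0 for a draw and 1 for a win.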

**Monte Carlo Method on OpenAI gym's Blackjack environment - My implementation**

Importing the necessary libraries

In [0]:
import sys
import gym
import numpy as np
from collections import defaultdict

Creating the environment

In [8]:
env = gym.make('Blackjack-v0')


Observation Space - a 3-tuple (0-31 *player card sum*, 0-10 *dealer showing card*, 0-1 *binary player usable ace*) - 32 × 11 × 2 = 704 possible states

Action Space - Either Stick (0) or Hit (1)
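
As a quick check of the state count, the three components can be enumerated directly. This is a small sketch that hard-codes the sizes quoted above instead of querying the environment; the names `STICK` and `HIT` are illustrative constants, not part of gym. The count is the size of the declared space, not the number of states actually reachable in play.

```python
from itertools import product

# Component sizes taken from the description above (assumed, not read from env).
player_sums = range(32)        # 0-31
dealer_cards = range(11)       # 0-10
usable_ace = (False, True)     # binary flag

all_states = list(product(player_sums, dealer_cards, usable_ace))
print(len(all_states))         # 704

# Illustrative names for the two actions.
STICK, HIT = 0, 1

# A state is a plain tuple and can be unpacked like this:
player_sum, dealer_card, has_usable_ace = (19, 10, False)
```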

In [9]:
print("Observation Space - ", env.observation_space)
print("Action Space -      ", env.action_space)
print("Reward -             Discrete(3)") # Reward is -1 if player loses, 0 if draw, 1 if player wins

Observation Space -  Tuple(Discrete(32), Discrete(11), Discrete(2))
Action Space -       Discrete(2)
Reward -             Discrete(3)


In [10]:
for episode in range(5):                           # 5 episodes or episodic tasks
  state = env.reset()                              # reset environment
  print("** New Game! **")
  while True:     
    print(state) 
    action = env.action_space.sample()             # choosing a random action (equiprobable random policy followed here)
    state, reward, done, info = env.step(action)   # running an episode 
    if done:    
      print('Reward: ', reward)
      print('You won :)' if reward > 0 else 'Draw :|' if reward == 0 else 'You lost :(')  # a reward of 0 is a draw, not a loss
      print("Game Ends!\n")
      break                                        # if the Blackjack game ends, break out of the loop and start the next episode

** New Game! **
(19, 10, False)
Reward:  1.0
You won :)
Game Ends!

** New Game! **
(13, 6, False)
(20, 6, False)
Reward:  1.0
You won :)
Game Ends!

** New Game! **
(9, 1, False)
Reward:  -1.0
You lost :(
Game Ends!

** New Game! **
(12, 5, False)
Reward:  1.0
You won :)
Game Ends!

** New Game! **
(19, 2, False)
Reward:  1.0
You won :)
Game Ends!



**Monte Carlo Prediction**

In [0]:
def generate_episode(env):
  episode = []                                           # One episode
  state = env.reset()                                    
  while True:
    probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]  # Policy: stick with 80% probability if player card sum > 18, otherwise hit with 80% probability
    action = np.random.choice(np.arange(2), p=probs)      
    next_state, reward, done, info = env.step(action)    # Acting in the environment
    episode.append(state)                                # Record the state in which the action was taken, not the state it led to
    episode.append(action)
    episode.append(reward)
    state = next_state                                   # If game doesn't end here, go to the next state
    if done:
      break
  return episode

Testing that episode generation works

In [12]:
for i in range(3):
    print(generate_episode(env))

[(20, 5, False), 0, 1.0]
[(11, 3, False), 0, -1.0]
[(15, 4, False), 0, 1.0]
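
Each episode is a flat list of the form `[state, action, reward, state, action, reward, ...]`. Monte Carlo prediction needs the return that follows each step, so a short sketch of how the flat list could be regrouped and turned into returns may help. The function name `returns_from_episode` and the two-step episode below are hypothetical and used only for illustration.

```python
def returns_from_episode(episode, gamma=1.0):
    """Regroup a flat [s, a, r, s, a, r, ...] list and attach the return to each step."""
    steps = list(zip(episode[0::3], episode[1::3], episode[2::3]))
    g = 0.0
    out = []
    # Walk backwards so each return is the reward plus the discounted later return.
    for state, action, reward in reversed(steps):
        g = reward + gamma * g
        out.append((state, action, g))
    out.reverse()
    return out

# Hypothetical two-step episode: hit on 13, then stick on 20 and win.
episode = [(13, 6, False), 1, 0.0, (20, 6, False), 0, 1.0]
print(returns_from_episode(episode))
# [((13, 6, False), 1, 1.0), ((20, 6, False), 0, 1.0)]
```

With `gamma=1.0` every step in a Blackjack episode receives the final reward as its return, because all intermediate rewards are 0.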


Updating the Q-table

In [0]:
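def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays, keyed by state and indexed by action
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        # generate_episode returns a flat list [s, a, r, s, a, r, ...]; regroup it into (state, action, reward) steps
        epi = generate_episode(env)
        steps = list(zip(epi[0::3], epi[1::3], epi[2::3]))
        # discounted return following each step, computed backwards from the end of the episode
        returns = [0.0] * len(steps)
        g = 0.0
        for t in reversed(range(len(steps))):
            g = steps[t][2] + gamma * g
            returns[t] = g
        visited = set()
        for t, (state, action, reward) in enumerate(steps):
            # first-visit MC: only the first occurrence of each (state, action) pair in an episode is counted
            if (state, action) in visited:
                continue
            visited.add((state, action))
            N[state][action] += 1
            returns_sum[state][action] += returns[t]
            # Q is the average return observed so far for this (state, action) pair
            Q[state][action] = returns_sum[state][action] / N[state][action]
    return Q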

Training

In [20]:
Q = mc_prediction_q(env, 5000, generate_episode)  # obtain the action-value function

Episode 5000/5000.
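
Once a Q-table is available, one common follow-up is to collapse it into a state-value estimate under the same 80/20 policy, using V(s) = sum over a of pi(a|s) * Q(s, a). The sketch below is one possible way to do that; the function name `state_values_under_policy` is hypothetical, and the Q-table entries are made-up numbers for illustration, not results from the training run above.

```python
def state_values_under_policy(Q, threshold=18):
    """V(s) = pi(stick|s) * Q(s, stick) + pi(hit|s) * Q(s, hit) for the 80/20 policy."""
    V = {}
    for state, action_values in Q.items():
        # Same rule as in generate_episode: mostly stick above the threshold, mostly hit otherwise.
        stick_prob = 0.8 if state[0] > threshold else 0.2
        V[state] = stick_prob * action_values[0] + (1 - stick_prob) * action_values[1]
    return V

# Made-up Q-table entries, for illustration only.
Q_example = {
    (20, 5, False): [0.7, -0.9],
    (13, 6, False): [-0.2, -0.3],
}
V = state_values_under_policy(Q_example)
for state, value in V.items():
    print(state, round(value, 2))
```

The same function should also accept the `Q` returned by `mc_prediction_q`, since it only indexes each entry by action.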