# BlackJack - using RL 🂿 🂡 🂾 🂭 🃛 🂱 🂿

> We will be finding the most optimal policy for BlackJack - a popular casino game using reinforcement learning techniques. We will be using Monte Carlo, Q learning
and SARSA algorithm and compare the final result to see which is performing the best and draw intuition from the results. 

## Game setting

A single round goes on like this :
- place your bet
- deal cards
  - You get two cards face up
  - The dealer gets one card face up and one face down
- player's turn
- dealer's turn
- Win/Lose
  - Beat the dealer without busting - win a 1:1 payout
  - You get BlackJack but dealer doesn't - win a 3:2 payout
  - If you bust or dealer beats you - lose your bet
  - Both get the same number - tie (No profit/loss)


### Your Options During Play

1)    Hit: Take another card.

       - Example: You have 12 (7 + 5). You "hit" and get a 9 → 21! You stand.

1)    Stand: Keep your current hand.

       - Example: You have 18 (10 + 8). You "stand" to avoid busting.

1)    Double Down: Double your bet, take one more card, then stand.

       - Example: You have 11 (5 + 6). You double your bet, take one card (e.g., 10), and now have 21.

1)    Split: If your first two cards match (e.g., two 8s), split them into two separate hands (each with its own bet).

       - Example: You have two 8s (16). You split → now play two hands: Hand 1 (8 + ?) and Hand 2 (8 + ?).

       - Aces: If you split Aces, you usually get only one card per Ace.

1)    Surrender (if allowed): Give up your hand and lose half your bet.

       - Example: You have 16 (10 + 6), dealer shows a 10. You surrender to save half your bet.

1)    Insurance: If the dealer’s upcard is an Ace, you can bet half your original wager that the dealer has Blackjack.

       - If dealer has Blackjack, insurance pays 2:1.

       - If not, you lose the insurance bet.

### Dealer Rules

-   The dealer must follow strict rules:

    -   Hit until 17 or higher.

    -   Stand on 17+.

## Importing libs

In [163]:
import gymnasium as gym
from gymnasium.envs.registration import register
from typing import Tuple, List
from gymnasium.envs.toy_text.frozen_lake import FrozenLakeEnv
from gymnasium import spaces
from tqdm import tqdm
import numpy as np
import time
from collections import defaultdict
import random
import jdc
import pprint

## Timer decorator

In [164]:
def timer(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        duration = end - start
        if args and hasattr(args[0], '__dict__'):
            setattr(args[0], f'{func.__name__}_time', duration)
        print(f"Function '{func.__name__}' took {duration:.4f} seconds")
        return result
    return wrapper

## BlackJack Environment setup

In [None]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from collections import deque

class BlackjackEnv(gym.Env):
    metadata = {'render_modes': ['human']}

    def __init__(self, render_mode=None):
        super(BlackjackEnv, self).__init__()

        # Action space: 0=stand, 1=hit, 2=double down, 3=split
        self.action_space = spaces.Discrete(4)
        
        # Observation space
        self.observation_space = spaces.Tuple((
            spaces.Discrete(32),  # Player total (0-31)
            spaces.Discrete(11),  # Dealer upcard (0-10, 0=not visible)
            spaces.Discrete(2),   # Usable ace (0=False, 1=True)
            spaces.Discrete(2),   # Can split (0=False, 1=True)
            spaces.Discrete(2),   # Hand active (0=inactive, 1=active)
        ))
        
        # Game state variables
        self.player_hands = None
        self.dealer_hand = None
        self.current_hand_idx = None
        self.done = None
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        # Initialize game
        self.player_hands = [{'cards': [], 'bet': 1, 'active': True}]
        self.dealer_hand = []
        self.current_hand_idx = 0
        self.done = False
        
        # Deal initial cards
        self._deal_initial_cards()

        if self.render_mode == 'human':
            self.render()
        
        # Return initial obs
        return self._get_obs(), {}
    
    def step(self, action):
            hand = self.player_hands[self.current_hand_idx]
            terminated = False
            reward = 0
            
            # lights, camera and ACTION
            if action == 0:  # Stand
                hand['active'] = False
            elif action == 1:  # Hit
                self._deal_card(hand['cards'])
                if self._hand_value(hand['cards']) > 21:
                    hand['active'] = False
            elif action == 2:  # Double down
                hand['bet'] *= 2
                self._deal_card(hand['cards'])
                hand['active'] = False
            elif action == 3:  # Split
                if len(hand['cards']) == 2 and hand['cards'][0] == hand['cards'][1]:
                    # Create new hand with same bet
                    new_hand = {
                        'cards': [hand['cards'].pop()],
                        'bet': hand['bet'],
                        'active': True
                    }
                    self.player_hands.append(new_hand)
                    # Deal to both hands
                    self._deal_card(hand['cards'])
                    self._deal_card(new_hand['cards'])
                else:
                    # Invalid split - we will treat it as stand for ease
                    hand['active'] = False

            # Check if current hand is done
            if not hand['active']:
                # Move to next active hand
                self._activate_next_hand()
            
            # payout if all hands are done
            if self.done:
                reward = self._payout()
                terminated = True

            return self._get_obs(), reward, terminated, False, {}
    
    def render(self):
            if self.render_mode != 'human':
                return
                
            print(" *** ")
            print(f"Current Hand: {self.current_hand_idx+1}/{len(self.player_hands)}")
            print(f"Your Cards: {self.player_hands[self.current_hand_idx]['cards']}")
            print(f"Your Total: {self._hand_value(self.player_hands[self.current_hand_idx]['cards'])}")
            print(f"Dealer Shows: {self.dealer_hand[0]}")
            print(f"Current Bet: {self.player_hands[self.current_hand_idx]['bet']}")
            print("0: Stand | 1: Hit | 2: Double Down | 3: Split")
            print(" *** ")
            
    def _deal_initial_cards(self):
        # Deal to player 
        self._deal_card(self.player_hands[0]['cards'])
        self._deal_card(self.player_hands[0]['cards'])
        
        # Deal to dealer (one up, one down)
        self._deal_card(self.dealer_hand)
        self._deal_card(self.dealer_hand)

    def _deal_card(self, hand_cards):
        card = self.np_random.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10])
        hand_cards.append(card)

    def _hand_value(self, cards):
        total = sum(cards)
        aces = cards.count(1) + cards.count(11)
        
        # Convert aces from 11 to 1 if needed
        while total > 21 and aces > 0:
            total -= 10
            aces -= 1
            
        return total
    
    def _get_obs(self):
        hand = self.player_hands[self.current_hand_idx]
        cards = hand['cards']
        
        # Calculate hand value and usable ace
        total = sum(cards)
        usable_ace = 0
        if 1 in cards and total <= 11:
            total += 10
            usable_ace = 1
            
        # Check split e
        can_split = int(
            len(cards) == 2 and 
            cards[0] == cards[1] and 
            len(self.player_hands) < 4  # Max 4 hands
        )
        
        return (
            min(total, 31),  
            self.dealer_hand[0],
            usable_ace,
            can_split,
            int(hand['active'])
            )
    def _activate_next_hand(self):
        # Find next active hand
        for i in range(self.current_hand_idx + 1, len(self.player_hands)):
            if self.player_hands[i]['active']:
                self.current_hand_idx = i
                return
                
        # No more active hands
        self.done = True

    def _payout(self):
        total_reward = 0
        dealer_value = self._dealer_play()
        
        for hand in self.player_hands:
            player_value = self._hand_value(hand['cards'])
            bet = hand['bet']
            
            # Player busts
            if player_value > 21:
                total_reward -= bet
                continue
                
            # Dealer busts
            if dealer_value > 21:
                total_reward += bet
                continue
                
            # Compare values
            if player_value > dealer_value:
                # Blackjack pays 3:2
                if len(hand['cards']) == 2 and player_value == 21:
                    total_reward += bet * 1.5
                else:
                    total_reward += bet
            elif player_value < dealer_value:
                total_reward -= bet
            # Push returns 0
            
        return total_reward
    
    def _dealer_play(self):
            # Dealer plays according to fixed rules
            while True:
                value = self._hand_value(self.dealer_hand)
                if value >= 17:
                    return value
                self._deal_card(self.dealer_hand)

### Registering

In [166]:
gym.register(
    id="Blackjack-v0",
    entry_point=BlackjackEnv,
)

In [167]:
env = BlackjackEnv(render_mode='human')
obs, _ = env.reset()
done = False

while not done:
    env.render()
    action = env.action_space.sample()  # Replace with agent policy
    obs, reward, done, _, _ = env.step(action)
    env.render()  # Optional visual output

env.render()
print(f"Final Reward: {reward}")

 *** 
Current Hand: 1/1
Your Cards: [np.int64(9), np.int64(9)]
Your Total: 18
Dealer Shows: 1
Current Bet: 1
0: Stand | 1: Hit | 2: Double Down | 3: Split
 *** 
 *** 
Current Hand: 1/1
Your Cards: [np.int64(9), np.int64(9)]
Your Total: 18
Dealer Shows: 1
Current Bet: 1
0: Stand | 1: Hit | 2: Double Down | 3: Split
 *** 
 *** 
Current Hand: 1/1
Your Cards: [np.int64(9), np.int64(9), np.int64(1)]
Your Total: 19
Dealer Shows: 1
Current Bet: 1
0: Stand | 1: Hit | 2: Double Down | 3: Split
 *** 
 *** 
Current Hand: 1/1
Your Cards: [np.int64(9), np.int64(9), np.int64(1)]
Your Total: 19
Dealer Shows: 1
Current Bet: 1
0: Stand | 1: Hit | 2: Double Down | 3: Split
 *** 
 *** 
Current Hand: 1/1
Your Cards: [np.int64(9), np.int64(9), np.int64(1)]
Your Total: 19
Dealer Shows: 1
Current Bet: 1
0: Stand | 1: Hit | 2: Double Down | 3: Split
 *** 
 *** 
Current Hand: 1/1
Your Cards: [np.int64(9), np.int64(9), np.int64(1)]
Your Total: 19
Dealer Shows: 1
Current Bet: 1
0: Stand | 1: Hit | 2: Double Down

## Monte Carlo Implementation

In [169]:
@timer
def monte_carlo(env, episodes=10000, alpha=0.1, discount=0.99, epsilon=0.1):

    """
    Monte Carlo control using first-visit method and epsilon-greedy policy.
    Returns Q table of state-action values.
    """

    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))

    for ep in range(episodes):
        state, _ = env.reset()
        done = False
        episode = []

        while not done:
            # epsilon-greedy on Q[state]
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                vals = Q[tuple(state)]
                best_actions = np.flatnonzero(vals == vals.max())
                action = int(np.random.choice(best_actions))

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((tuple(state), action, reward))
            state = next_state

        G = 0
        visited = set()
        for s, a, r in reversed(episode):
            G = discount * G + r
            if (s, a) not in visited:
                visited.add((s, a))
                Q[s][a] += alpha * (G - Q[s][a])

    return Q


## Temporal Difference Implementation

A generic TD update for $Q(s_t, a_t)$ takes the form:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\text{Target} - Q(s_t, a_t)\right],
$$

where $\alpha$ is a step-size (learning rate), and Target is an estimate of the return just one-step ahead plus estimated future values.

- In **SARSA**, the target is:

$$
r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}),
$$

using the next action $a_{t+1}$ actually chosen by the current policy (**on-policy**).

- In **Q-learning**, the target is:

$$
r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'),
$$

using the best possible next action according to current $Q$ (**off-policy**, because it imagines following the greedy policy from the next state even if the       behavior policy actually explores).

Summing up:

- **SARSA update**:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right].
$$

- **Q-learning update**:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right].
$$

In both cases, during learning we select actions via an $\epsilon$-greedy policy over current $Q$: with probability $\epsilon$ choose a random action, else choose:

$$
\arg\max_a Q(s, a).
$$

This ensures exploration.


In [None]:
@timer
def q_learning(env, episodes=1000, alpha=0.1, discount=0.99, epsilon=0.1):
    """
    Q-Learning with epsilon-greedy
    """
    n_actions = env.action_space.n
    # Q[state_tuple] -> np.array of action-values
    Q = defaultdict(lambda: np.zeros(n_actions, dtype=float))

    for ep in range(episodes):
        state, _ = env.reset()
        state = tuple(state)  # convert to hashable
        done = False

        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                vals = Q[state]
                best_actions = np.flatnonzero(vals == vals.max())
                action = int(np.random.choice(best_actions))

            next_state, reward, terminated, truncated, _ = env.step(action)
            next_state = tuple(next_state)
            done = terminated or truncated

            # Q-Learning update (off-policy)
            if done:
                target = reward
            else:
                target = reward + discount * np.max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])

            state = next_state

    return Q


In [None]:
@timer
def sarsa(env, episodes=1000, alpha=0.1, discount=0.99, epsilon=0.1):
    """
    SARSA (on-policy TD) with epsilon-greedy
    """
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions, dtype=float))

    for ep in range(episodes):
        obs, _ = env.reset()
        state = tuple(obs)

        # Choose initial action
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            vals = Q[state]
            best_actions = np.flatnonzero(vals == vals.max())
            action = int(np.random.choice(best_actions))

        done = False
        while not done:
            next_obs, reward, terminated, truncated, _ = env.step(action)
            next_state = tuple(next_obs)
            done = terminated or truncated

            # Choose next action (epsilon-greedy)
            if random.random() < epsilon:
                next_action = env.action_space.sample()
            else:
                vals_next = Q[next_state]
                best_actions = np.flatnonzero(vals_next == vals_next.max())
                next_action = int(np.random.choice(best_actions))

            # SARSA update: if done, we treat Q[next_state, next_action] as 0
            target = reward + (discount * Q[next_state][next_action] if not done else 0.0)
            Q[state][action] += alpha * (target - Q[state][action])

            state, action = next_state, next_action

    return Q


## Policy Evaluation

Testing our policy on the environments and  printing the time for al the algorithms and the average return obtained

In [None]:
def evaluate_policy(env, Q, episodes=100, discount=1.0):
    """
    Evaluate a given policy by running episodes.
    Returns the average total discounted return.
    """
    total_return = 0.0
    for ep in range(episodes):
        state, _ = env.reset()
        done = False
        G = 0.0
        t = 0
        while not done:
            # Greedy action
            best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten()
            action = int(np.random.choice(best_actions))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            G += (discount**t) * reward
            t += 1 # keeping track of the power to raise discount with
            state = next_state
        total_return += G
    avg_return = total_return / episodes
    return avg_return


## Testing the Algorithms

In [173]:
env = gym.make("Blackjack-v0")
# Train and evaluate each algorithm
mc_Q = monte_carlo(env, episodes=5000000, alpha=0.03, discount=0.99, epsilon=0.05)
ql_Q = q_learning(env, episodes=5000000, alpha=0.1, discount=0.99, epsilon=0.05)
sa_Q = sarsa(env, episodes=5000000, alpha=0.1, discount=0.99, epsilon=0.05)

print("Average Return (MC):", evaluate_policy(env, mc_Q, episodes=1000, discount=0.99))
print("Average Return (Q-Learning):", evaluate_policy(env, ql_Q, episodes=1000, discount=0.99))
print("Average Return (SARSA):", evaluate_policy(env, sa_Q, episodes=1000, discount=0.99))
print()

Function 'monte_carlo' took 468.8506 seconds
Function 'q_learning' took 510.0618 seconds
Function 'sarsa' took 554.9023 seconds
Average Return (MC): -0.05497446518940097
Average Return (Q-Learning): -0.08789791263351324
Average Return (SARSA): -0.07783266020118046



## Result

Even after 5 million episodes, none of the algorithms average return was positive though it's very close to zero. It might be that the game becomes even when played many times and you are likely to neither gain nor lose anything in a long turn. Of course this is just a speculation. Another possibility might be that our model free algorithms are not suited for this environment and we need to discover more.