# Blackjack with Policy Gradient (REINFORCE)

We'll implement a simple policy-gradient agent for the Blackjack environment using Gymnasium's `Blackjack-v1`.

Steps:
- Define a stochastic (softmax) policy with a small neural network
- Sample episodes to estimate returns
- Update the policy with REINFORCE
- Track performance over time


In [None]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

print("Libraries imported.")


## Environment & Policy Network
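Each observation is a tuple `(player_sum, dealer_card, usable_ace)`. All three components are already discrete, so we one-hot encode the whole tuple into a single vector of length 32 × 11 × 2 = 704 and use a small MLP to produce the action logits for π(a|s).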
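To make the encoding concrete, here is a minimal pure-Python sketch of where the single 1 lands in the flattened vector. It assumes the row-major layout that NumPy's `reshape(-1)` uses by default; the helper name `flat_index` is hypothetical and is not used elsewhere in this notebook.

```python
# Shape of the one-hot tensor: (player_sum, dealer_card, usable_ace)
OBS_SPACE = (32, 11, 2)

def flat_index(ps, dc, ua, shape=OBS_SPACE):
    """Row-major position of the 1 in the flattened one-hot vector (hypothetical helper)."""
    n_ps, n_dc, n_ua = shape
    ps = min(ps, n_ps - 1)  # clamp the player sum to the last bucket
    return (ps * n_dc + dc) * n_ua + int(ua)

vector_length = OBS_SPACE[0] * OBS_SPACE[1] * OBS_SPACE[2]
print(vector_length)               # 704
print(flat_index(14, 10, False))   # (14 * 11 + 10) * 2 + 0 = 328
```

Because the dealer card runs from 1 to 10, the slots with dealer index 0 are never set; the extra column only keeps the indexing simple.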


In [None]:
env = gym.make("Blackjack-v1", sab=True)

# feature encoding: player sum 0..31, dealer card 1..10 (index 0 unused), usable_ace {0,1}
obs_space = (32, 11, 2)
feat_dim = int(np.prod(obs_space))  # 704; cast to a plain int for nn.Linear

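def encode(obs):
    """One-hot encode an observation tuple into a flat float32 vector."""
    ps, dc, ua = obs
    x = np.zeros(obs_space, dtype=np.float32)
    x[min(ps, 31), dc, int(ua)] = 1.0  # clamp the player sum to the last bucket
    return x.reshape(-1)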

class Policy(nn.Module):
    def __init__(self, input_dim, hidden=64, outputs=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, outputs)
        )
    def forward(self, x):
        return self.net(x)

policy = Policy(feat_dim)
opt = optim.Adam(policy.parameters(), lr=1e-2)

print("Policy ready; input_dim=", feat_dim)


## REINFORCE Loop
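For each sampled episode, compute the discounted return G_t at every step, then take a gradient step on the loss −Σ_t log π(a_t|s_t) · G_t. Minimizing this loss increases the expected return.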
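The return computation can be sketched on its own, assuming the recursion G_t = r_t + γ · G_{t+1} with G = 0 after the final step. The function name `discounted_returns` is illustrative; the training cell performs the same loop inline.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] with G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for r in reversed(rewards):  # walk backwards from the last reward
        G = r + gamma * G
        out.append(G)
    out.reverse()                # restore chronological order
    return out

# A three-step episode that ends in a win, with gamma = 0.5 for easy arithmetic
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

In Blackjack the reward is zero until the hand is resolved, so earlier actions only receive credit through this discounting.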


In [None]:
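def choose(pi, obs):
    """Sample an action from the policy and return it with its log-probability."""
    x = torch.from_numpy(encode(obs)).float().unsqueeze(0)  # add a batch dimension
    logits = pi(x)
    # Categorical applies the softmax internally when given logits
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()
    return a.item(), dist.log_prob(a)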

returns, lengths = [], []
for ep in range(2000):
    obs, _ = env.reset()
    logps, rs = [], []
    done = False
    while not done:
        a, logp = choose(policy, obs)
        obs, r, term, trunc, _ = env.step(a)
        logps.append(logp)
        rs.append(r)
        done = term or trunc
    # compute discounted returns G_t = r_t + gamma * G_{t+1}, working backwards (gamma = 0.99)
    G = 0.0
    rets = []
    for r in reversed(rs):
        G = r + 0.99 * G
        rets.append(G)
    rets = list(reversed(rets))
    returns.append(sum(rs))
    lengths.append(len(rs))
    # policy gradient step: minimize -sum_t log pi(a_t|s_t) * G_t
    opt.zero_grad()
    loss = -(torch.cat(logps) * torch.tensor(rets, dtype=torch.float32)).sum()
    loss.backward()
    opt.step()
    if (ep+1) % 100 == 0:
        print(f"Episode {ep+1}: avg return (100) = {np.mean(returns[-100:]):.2f}")

print("Training complete.")
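Per-episode returns in Blackjack are noisy (each is −1, 0, or +1 under these settings), so a smoothed curve is easier to read than the raw list. The following is a minimal sketch of a sliding-window mean; the helper `moving_average` is hypothetical and not part of the original notebook.

```python
def moving_average(values, window):
    """Mean of each consecutive `window`-sized slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            out.append(running / window)
    return out

print(moving_average([1.0, -1.0, 1.0, 1.0], window=2))  # [0.0, 0.0, 1.0]
```

Applied to the training run, `moving_average(returns, 100)` would give one smoothed value per episode from the 100th onward.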


# Blackjack with Gymnasium and Deep Q-Learning (DQN)

DQN is an off-policy reinforcement learning algorithm that learns an action-value function $Q(s,a)$: an estimate of the expected cumulative (discounted) reward obtained by taking action $a$ in state $s$ and acting greedily with respect to $Q$ afterwards.
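The quantity the network is regressed toward is the one-step Q-learning target, r + γ · max over a' of Q(s', a'), which reduces to r when the episode has ended. Here is a minimal pure-Python sketch of that target; `td_target` is an illustrative name and is not how Stable-Baselines3 exposes it. In a full DQN the next-state values would typically come from a separate target network.

```python
def td_target(reward, next_q_values, terminated, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if terminated:
        return reward  # no future value after a terminal state
    return reward + gamma * max(next_q_values)

# Non-terminal step: bootstrap from the best next-state action value
print(td_target(0.0, [0.2, -0.4], terminated=False, gamma=0.5))  # 0.1
# Terminal step (e.g. the player busts): the target is the reward alone
print(td_target(-1.0, [0.2, -0.4], terminated=True))             # -1.0
```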


In [None]:
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

## Create the Environment

## Train the Model

## Render the Trained Agent