# Minesweeper LLM Competition - Custom GRPO Training

## Goal
Fine-tune an LLM with LoRA using GRPO to play Minesweeper through a JSON interface:
- **Input**: JSON game state (board configuration)
- **Output**: JSON action (reveal or flag a cell)

Teams will compete to train the best Minesweeper-playing LLM!

## Training Approach
- **Model**: GPT-OSS 20B with LoRA, or any other model already present in the /root/.cache/huggingface/hub directory [**Using a model from anywhere other than /root/.cache/huggingface/hub will lead to disqualification**]
- **Method**: GRPO (Group Relative Policy Optimization), SFT, or any other RL method (you are not restricted to GRPO)
- **Framework**: Unsloth (2-6x faster, 70% less VRAM)
- **Hardware**: AMD GPU (ROCm)
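
GRPO ranks each sampled completion against the other completions drawn for the same prompt. The sketch below is a simplified illustration of that normalization, not the trainer's actual code; the function name `group_relative_advantages` and the `eps` value are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Center and scale rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One completion scored better than its three siblings: it gets a positive advantage
print(group_relative_advantages([-110.0, -110.0, -110.0, 4.5]))

# Identical rewards across the group carry no learning signal
print(group_relative_advantages([-110.0] * 4))  # → [0.0, 0.0, 0.0, 0.0]
```

When every completion in a group receives the same reward, all advantages are zero, so a reward function only helps if it separates the completions within a group.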

# Load Model with Unsloth

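Load the base model. This run uses Qwen3-4B, the only model in the local cache listed below; the LoRA adapters are added in the next section: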

In [1]:
!ls -1 /root/.cache/huggingface/hub/

models--Qwen--Qwen3-4B


In [2]:
from unsloth import FastLanguageModel
import os
import torch

os.environ['HF_HOME'] = '/workspace/huggingface_cache'
os.makedirs('/workspace/huggingface_cache', exist_ok=True)

max_seq_length = 1024  # Max context length
lora_rank = 64    # LoRA rank (higher = smarter but slower; 4 is too low for reasoning tasks)

model_path = "/root/.cache/huggingface/hub/models--Qwen--Qwen3-4B/snapshots/1cfa9a7208912126459214e8b04321603b3df60c/"

# Try loading with explicit torch_dtype
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    load_in_4bit = False,
    max_seq_length = max_seq_length,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
)

# Report which device the model was placed on (placement is handled by device_map="auto")
print(f"Model device: {model.device}")
print("Model loaded successfully!")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
INFO 02-15 09:03:33 [__init__.py:225] Automatically detected platform rocm.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.6: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc2.dev161+g8a297115e.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+gitb2fb688. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled -

`torch_dtype` is deprecated! Use `dtype` instead!
[2026-02-15 09:03:36] INFO modeling.py:1004: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model device: cuda:0
Model loaded successfully!


# Add LoRA Adapters

Add LoRA layers for efficient finetuning:

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank * 2,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Unsloth 2025.10.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
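
The number of trainable LoRA parameters follows directly from the rank and the shapes of the targeted projections. The sketch below works through that arithmetic. The Qwen3-4B layer shapes in it (hidden size 2560, 32 query heads and 8 key/value heads of dimension 128, MLP width 9728) are assumptions not stated in this notebook; with them, the result equals the 132,120,576 trainable parameters that the training banner reports further down.

```python
# Assumed Qwen3-4B shapes (not stated in this notebook)
HIDDEN = 2560
Q_OUT = 32 * 128   # 32 query heads x head_dim 128
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128
MLP = 9728
NUM_LAYERS = 36    # matches "patched 36 layers" in the log above
RANK = 64          # lora_rank from the load cell

# (in_features, out_features) of each targeted module
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# LoRA adds two matrices per module: A (rank x in) and B (out x rank)
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # → 132,120,576
```

Because the count scales linearly with `RANK`, halving `lora_rank` to 32 would halve the trainable parameters under the same assumptions.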


# Minesweeper Game Implementation

Custom Minesweeper environment supporting:
- Customizable board size and mine count
- Actions: reveal or flag cells
- Win: reveal all safe cells
- Lose: reveal a mine

In [4]:
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Set
import random

@dataclass
class MinesweeperGame:
    rows: int
    cols: int
    num_mines: int
    seed: Optional[int] = None
    _rng: random.Random = field(init=False, repr=False)
    _board: List[List[int]] = field(init=False, repr=False)  # -1 = mine, 0-8 = count
    _revealed: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _flagged: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _state: str = field(default="ongoing", init=False, repr=False)

    def __post_init__(self):
        if self.num_mines >= self.rows * self.cols:
            raise ValueError("Too many mines for board size")
        self._rng = random.Random(self.seed)
        self._board = [[0 for _ in range(self.cols)] for _ in range(self.rows)]
        self._place_mines()
        self._calculate_numbers()

    def _place_mines(self):
        """Place mines randomly on the board"""
        positions = [(r, c) for r in range(self.rows) for c in range(self.cols)]
        mine_positions = self._rng.sample(positions, self.num_mines)
        for r, c in mine_positions:
            self._board[r][c] = -1

    def _calculate_numbers(self):
        """Calculate numbers for each cell based on adjacent mines"""
        for r in range(self.rows):
            for c in range(self.cols):
                if self._board[r][c] == -1:
                    continue
                count = 0
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < self.rows and 0 <= nc < self.cols:
                            if self._board[nr][nc] == -1:
                                count += 1
                self._board[r][c] = count

    def _reveal_cell(self, row: int, col: int) -> bool:
        """Reveal a cell. Returns True if valid move, False if invalid.
        Uses iterative flood-fill to avoid recursion limit on large boards.
        (Issue #11: was recursive; Issue typo: fixed 'bself' -> 'self')
        """
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed or (row, col) in self._flagged:
            return False

        stack = [(row, col)]
        while stack:
            r, c = stack.pop()
            if (r, c) in self._revealed:
                continue

            self._revealed.add((r, c))

            # Hit a mine!
            if self._board[r][c] == -1:
                self._state = "failed"
                return True

            # Auto-reveal neighbors if cell is 0
            if self._board[r][c] == 0:
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < self.rows and 0 <= nc < self.cols
                                and (nr, nc) not in self._revealed
                                and (nr, nc) not in self._flagged):
                            stack.append((nr, nc))

        return True

    def _flag_cell(self, row: int, col: int) -> bool:
        """Flag/unflag a cell. Returns True if valid, False if invalid"""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed:
            return False
        
        if (row, col) in self._flagged:
            self._flagged.remove((row, col))
        else:
            self._flagged.add((row, col))
        return True

    def do_action(self, action: dict) -> str:
        """Execute an action and return a status string.

        Returns one of:
          'ok'               - valid move executed
          'mine'             - revealed a mine (game over)
          'win'              - game won after this move
          'invalid_format'   - bad action dict / missing keys / bad types
          'out_of_bounds'    - coordinates outside the board
          'already_revealed' - cell was already revealed
          'flagged_cell'     - tried to reveal a flagged cell
          'invalid_flag'     - tried to flag a revealed cell
          'game_over'        - game was already over before this call

        (Issue #13: previously every invalid move looked like hitting a mine.
         Invalid moves still end the game with state 'failed', but the return
         value now tells formatting errors apart from revealing a mine.)
        """
        if self._state != "ongoing":
            return "game_over"

        if not isinstance(action, dict):
            self._state = "failed"
            return "invalid_format"

        action_type = action.get("type")
        row = action.get("row")
        col = action.get("col")

        if action_type not in ["reveal", "flag"] or row is None or col is None:
            self._state = "failed"
            return "invalid_format"

        try:
            row, col = int(row), int(col)
        except (ValueError, TypeError):
            self._state = "failed"
            return "invalid_format"

        if not (0 <= row < self.rows and 0 <= col < self.cols):
            self._state = "failed"
            return "out_of_bounds"

        if action_type == "reveal":
            if (row, col) in self._revealed:
                self._state = "failed"
                return "already_revealed"
            if (row, col) in self._flagged:
                self._state = "failed"
                return "flagged_cell"
            valid = self._reveal_cell(row, col)
        else:
            if (row, col) in self._revealed:
                self._state = "failed"
                return "invalid_flag"
            valid = self._flag_cell(row, col)

        if not valid:
            self._state = "failed"
            return "invalid_format"

        self._check_win()

        if self._state == "failed":
            return "mine"
        if self._state == "success":
            return "win"
        return "ok"

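    def _check_win(self):
        """Check if player has won. A game that is already lost is never overwritten."""
        # Without this guard, revealing a mine as the Nth cell (N = number of
        # safe cells) would flip the state from "failed" to "success".
        if self._state != "ongoing":
            return
        safe_cells = self.rows * self.cols - self.num_mines
        if len(self._revealed) == safe_cells:
            self._state = "success"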

    def get_visible_board(self) -> List[List[str]]:
        """Get board state as player sees it"""
        visible = []
        for r in range(self.rows):
            row = []
            for c in range(self.cols):
                if (r, c) in self._flagged:
                    row.append('F')
                elif (r, c) in self._revealed:
                    val = self._board[r][c]
                    row.append('*' if val == -1 else str(val))
                else:
                    row.append('.')
            visible.append(row)
        return visible

    def state(self) -> str:
        return self._state

    def pretty_print(self) -> str:
        """Pretty print the board"""
        visible = self.get_visible_board()
        lines = []
        
        # Header
        header = "   " + " ".join(f"{i:2d}" for i in range(self.cols))
        lines.append(header)
        lines.append("  " + "─" * (self.cols * 3 + 1))
        
        # Board
        for r, row in enumerate(visible):
            line = f"{r:2d}│ " + "  ".join(row)
            lines.append(line)
        
        return "\n".join(lines)

# Test the Game

In [5]:
# Create test game
game = MinesweeperGame(rows=6, cols=6, num_mines=5)
print(game.pretty_print())
print(f"State: {game.state()}")

# Test action
game.do_action({"type": "reveal", "row": 0, "col": 0})
print("\nAfter revealing (0,0):")
print(game.pretty_print())
print(f"State: {game.state()}")

    0  1  2  3  4  5
  ───────────────────
 0│ .  .  .  .  .  .
 1│ .  .  .  .  .  .
 2│ .  .  .  .  .  .
 3│ .  .  .  .  .  .
 4│ .  .  .  .  .  .
 5│ .  .  .  .  .  .
State: ongoing

After revealing (0,0):
    0  1  2  3  4  5
  ───────────────────
 0│ 0  1  .  .  .  .
 1│ 0  1  .  .  .  .
 2│ 0  1  1  1  1  .
 3│ 0  0  0  0  1  1
 4│ 0  0  0  0  0  0
 5│ 0  0  0  0  0  0
State: ongoing


# JSON Input/Output Format

## Input Format (Game State)
```json
{
  "board": [
    ["1", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."]
  ],
  "rows": 6,
  "cols": 6,
  "mines": 5,
  "flags_placed": 0,
  "cells_revealed": 1
}
```
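
The prompt builder in this notebook (`format_state_for_llm`) flattens the 2D board into a single `board_grid` string to shorten the prompt. The sketch below compares the two serializations on the example board. The helper name `to_grid_string` is made up for this illustration, and character count is used only as a rough proxy for token count.

```python
import json

board = [
    ["1", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
]

def to_grid_string(rows):
    """Join each row with spaces and the rows with newlines."""
    return "\n".join(" ".join(row) for row in rows)

nested = json.dumps({"board": board}, indent=2)
compact = json.dumps({"board_grid": to_grid_string(board)}, indent=2)
print(len(nested), len(compact))

# The grid string can be split back into the original 2D list
restored = [line.split(" ") for line in to_grid_string(board).split("\n")]
print(restored == board)
```

The trade-off is that the model has to infer row and column indices from line and token positions rather than from explicit list nesting.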

## Output Format (Action)
```json
{"type": "reveal", "row": 2, "col": 3}
```
or
```json
{"type": "flag", "row": 1, "col": 4}
```
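
A strict check of a single action string against this format could look like the sketch below. It is an illustration only: `validate_action` and its error labels are hypothetical names, not part of the competition harness or of the parser used later in this notebook.

```python
import json

def validate_action(raw, rows, cols):
    """Return (action, error); exactly one of the two is None."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None, "invalid_json"
    if not isinstance(action, dict):
        return None, "invalid_json"
    if action.get("type") not in ("reveal", "flag"):
        return None, "invalid_type"
    row, col = action.get("row"), action.get("col")
    # bool is a subclass of int in Python, so reject it explicitly
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in (row, col)):
        return None, "invalid_coordinates"
    if not (0 <= row < rows and 0 <= col < cols):
        return None, "out_of_bounds"
    return action, None

print(validate_action('{"type": "reveal", "row": 2, "col": 3}', 6, 6))
print(validate_action('{"type": "flag", "row": 1, "col": 9}', 6, 6))
print(validate_action("reveal 2 3", 6, 6))
```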

In [6]:
import json

'''
Important hints:

1. The prompt is crucial. Keep the LLM terse: it must not write out its reasoning.
    Any reasoning must stay hidden, and the output must be a single JSON object.
    In our experiments, verbose output exhausted the max-token budget, which caused
    JSON parsing failures, i.e. disqualification. Expected output:
    {"type": "reveal", "row": <row_index>, "col": <col_index>}
    or
    {"type": "flag", "row": <row_index>, "col": <col_index>}

2. Make sure your model generalizes to arbitrary N*M board shapes and mine counts.

3. Do not flag a cell that is already flagged: the game will go into recursion and you will incur a heavy penalty.

4. Do not flag a cell that is already revealed: the game will go into recursion and you will incur a heavy penalty.
'''

def format_state_for_llm(game: MinesweeperGame) -> str:
    """Convert game state to a highly compressed JSON prompt for LLM"""
    
    # 1. Compress the 2D array into a tight string grid to save hundreds of tokens
    board_str = "\n".join([" ".join(row) for row in game.get_visible_board()])
    
    state = {
        "board_grid": board_str, 
        "rows": game.rows,
        "cols": game.cols,
        "mines": game.num_mines,
        "flags_placed": len(game._flagged),
        "cells_revealed": len(game._revealed),
    }

    # 2. Aggressive Anti-Verbosity Prompt
    prompt = f"""You are a Minesweeper AI. Analyze the state and make a move.
    
CRITICAL RULES:
- DO NOT use <think> tags.
- DO NOT output any reasoning, explanations, or text.
- Start your response IMMEDIATELY with {{ and end with }}.
- ONLY output a valid JSON object.

Game state:
{json.dumps(state, indent=2)}

Legend:
- "." = unrevealed
- "F" = flagged
- "0"-"8" = adjacent mines

Format:
{{"type": "reveal", "row": <R>, "col": <C>}}
or
{{"type": "flag", "row": <R>, "col": <C>}}"""
    
    return prompt

def parse_llm_action(response: str) -> Optional[dict]:
    """Extract JSON action from LLM response.
    
    Finds all JSON-like objects and returns the LAST one matching the
    expected schema.  LLMs typically reason through options and place
    their final answer at the end, so taking the last valid match is
    more robust than taking the first.
    """
    import re
    best = None
    for match in re.finditer(r'\{[^{}]*\}', response):
        try:
            action = json.loads(match.group())
            if ("type" in action and "row" in action and "col" in action
                    and action["type"] in ["reveal", "flag"]):
                best = action
        except json.JSONDecodeError:
            continue
    return best

# Test formatting
game = MinesweeperGame(rows=6, cols=6, num_mines=5)
prompt = format_state_for_llm(game)
print(prompt[:500] + "...")

You are a Minesweeper AI. Analyze the state and make a move.

CRITICAL RULES:
- DO NOT use <think> tags.
- DO NOT output any reasoning, explanations, or text.
- Start your response IMMEDIATELY with { and end with }.
- ONLY output a valid JSON object.

Game state:
{
  "board_grid": ". . . . . .\n. . . . . .\n. . . . . .\n. . . . . .\n. . . . . .\n. . . . . .",
  "rows": 6,
  "cols": 6,
  "mines": 5,
  "flags_placed": 0,
  "cells_revealed": 0
}

Legend:
- "." = unrevealed
- "F" = flagged
- "0"-"8"...


# Test Model Before Training

See how the base model performs without finetuning:

In [7]:
from transformers import TextStreamer

game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=42)
prompt = format_state_for_llm(game)

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize = False,
    add_generation_prompt = True,
)

print("=== Base Model Response ===")
output = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0.3, # Lowered temperature to discourage creative rambling
    max_new_tokens = 512,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

=== Base Model Response ===
<think>
Okay, let's see. The user provided a Minesweeper game state where the board is entirely unrevealed, with 6x6 grid, 5 mines, and no flags or revealed cells. My job is to figure out the best move as an AI.

First, since the board is all unrevealed, there's no immediate information to deduce mine locations. But with 5 mines in 36 cells, the probability of each cell being a mine is 5/36, which is about 14%. But since the AI needs to make a move, maybe the best approach is to start by revealing a cell that's in a position where it can give more information. However, since there are no flags or revealed cells, the AI can't use any existing clues.

But wait, the rules say that the AI must make a move. Since there are no flags placed yet, maybe the AI should start by revealing a cell. But how to choose which one? Since there's no information, maybe the safest bet is to pick a cell that's in the middle, but that's just a guess. Alternatively, maybe the AI sho
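
The response above is cut off inside the `<think>` block, so it never reaches a JSON action. The sketch below reproduces that failure mode with a standalone copy of the "last flat JSON object" extraction idea. `last_action` is a name made up for this example, the `truncated` text is a shortened stand-in for the real output, and the `finished` continuation is hypothetical.

```python
import json
import re

def last_action(text):
    """Return the last flat JSON object that has type/row/col keys, else None."""
    found = None
    for match in re.finditer(r"\{[^{}]*\}", text):
        try:
            candidate = json.loads(match.group())
        except json.JSONDecodeError:
            continue
        if all(key in candidate for key in ("type", "row", "col")):
            found = candidate
    return found

# Shortened stand-in for a response that hit the token limit mid-thought
truncated = "<think>\nOkay, let's see. The board is entirely unrevealed, so maybe the AI sho"
# Hypothetical continuation if the token budget had been large enough
finished = truncated + "uld pick a corner.\n</think>\n" + '{"type": "reveal", "row": 0, "col": 0}'

print(last_action(truncated))  # → None
print(last_action(finished))   # → {'type': 'reveal', 'row': 0, 'col': 0}
```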

# GRPO Reward Functions

Define reward functions to guide the model's learning:

In [8]:
def valid_json_reward(completions, **kwargs):
    """Reward valid JSON action format and heavily penalize <think> tags."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        action = parse_llm_action(response)

        score = 0.0
        
        # 1. Massive penalty for verbosity or <think> tags
        if "<think>" in response or response.strip()[:1] != "{":
            score -= 10.0  # Punish it heavily for not starting immediately with JSON
            
        # 2. Reward or punish based on parsing success
        if action is None:
            score -= 50.0  # -50 as per the AMD rules for Invalid JSON
        else:
            score += 5.0   # Base reward for successfully formatting a JSON
            
        scores.append(score)

    return scores


def gameplay_scores(completions, **kwargs):
    """
    Scoring Criteria:
    1.  Flag cell that IS a mine        → +15
    2.  Flag cell that is NOT a mine    → -10
    3.  Reveal cell that IS a mine      → -25 (round over, team goes to next round)
    4.  Reveal cell that is safe        → +10 or +15 (+10 is for randomly guessed OR +15 if logically deducible)
    5.  Flag already flagged cell       → -12
    6.  Reveal already revealed cell    → -12
    7.  Out of bounds                   → -15
    8.  Total flags > total mines       → -10
    9.  Invalid JSON                    → -50
    10. Win the game                    → +100 (big bonus) - Winning here means Flagging all the mines + Revealing all the safe cells
    11. Reveal a flagged cell           → -8
    12. Flag a revealed cell            → -8
    """
    scores = []

    # Get game state info passed from dataset
    seeds = kwargs.get("seed", [])
    move_histories = kwargs.get("move_history", [])

    for idx, completion in enumerate(completions):
        response = completion[0]["content"]
        action = parse_llm_action(response)

        # Criterion 9: Invalid JSON
        if action is None:
            scores.append(-50.0)
            continue

        # Reconstruct EXACT game state
        if idx < len(seeds) and idx < len(move_histories):
            seed = seeds[idx]
            move_history_raw = move_histories[idx]

            if isinstance(move_history_raw, str):
                move_history = json.loads(move_history_raw)
            else:
                move_history = move_history_raw

            # Reconstruct game
            # Note: kwargs wraps dataset columns in lists during GRPO batched processing
            r = kwargs.get("rows", [4] * len(completions))[idx]
            c = kwargs.get("cols", [4] * len(completions))[idx]
            m = kwargs.get("num_mines", [2] * len(completions))[idx]
            
            # Reconstruct game with correct dimensions
            game = MinesweeperGame(rows=r, cols=c, num_mines=m, seed=seed)
            for prev_action in move_history:
                game.do_action(prev_action)

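            # Coerce coordinates the same way do_action does; malformed values
            # are scored like invalid JSON (criterion 9) instead of raising.
            try:
                row, col = int(action.get("row", -1)), int(action.get("col", -1))
            except (ValueError, TypeError):
                scores.append(-50.0)
                continue
            action_type = action.get("type", "")

            # Shaping term (not in the official list): small per-move efficiency penalty
            score = -0.5
            # Criterion 7: Out of bounds
            if not (0 <= row < game.rows and 0 <= col < game.cols):
                scores.append(score - 15.0)
                continue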

            # State Checks
            is_mine = (game._board[row][col] == -1)
            is_revealed = ((row, col) in game._revealed)
            is_flagged = ((row, col) in game._flagged)

            if action_type == "flag":
                if is_flagged:
                    score += -12.0  # Criterion 5: flag an already flagged cell
                elif is_revealed:
                    score += -8.0   # Criterion 12: flag a revealed cell
                elif len(game._flagged) >= game.num_mines:
                    score += -10.0  # Criterion 8: total flags > total mines
                else:
                    if is_mine:
                        score += 20.0  # Criterion 1 (shaped: +20 here vs +15 in the list above)
                        
                        # Criterion 10: win bonus
                        if len(game._flagged) + 1 == game.num_mines and len(game._revealed) == (game.rows * game.cols - game.num_mines):
                            score += 100.0
                    else:
                        score += -15.0 # Criterion 2 (shaped: -15 here vs -10 in the list above)

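            elif action_type == "reveal":
                if is_revealed:
                    score += -12.0  # Criterion 6: reveal an already revealed cell
                elif is_flagged:
                    score += -8.0   # Criterion 11: reveal a flagged cell
                elif is_mine:
                    score += -30.0  # Criterion 3 (shaped: -30 here vs -25 in the list above)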
                else:
                    # Adjacency Heuristic
                    is_adjacent_to_revealed = False
                    for dr in [-1, 0, 1]:
                        for dc in [-1, 0, 1]:
                            nr, nc = row + dr, col + dc
                            if 0 <= nr < game.rows and 0 <= nc < game.cols:
                                if (nr, nc) in game._revealed:
                                    is_adjacent_to_revealed = True
                                    break
                        if is_adjacent_to_revealed: break
                    
                    # Criterion 4: adjacency is used as a proxy for a deducible move
                    if is_adjacent_to_revealed:
                        score += 20.0  # Treated as logically deducible (shaped: +20 here vs +15 in the list above)
                    else:
                        if len(game._revealed) == 0:
                            score += 10.0  # Completely fine if it's the very first move of the game
                        else:
                            score += -5.0  # YOLO penalty: Guessing in the dark mid-game is bad

                    # Shaping term (not in the official list): zero-cell discovery bonus
                    if game._board[row][col] == 0:
                        score += 10.0

                    # Criterion 10: win bonus. Apply the move to the reconstructed
                    # game so cells opened by flood fill are counted too.
                    if game.do_action({"type": "reveal", "row": row, "col": col}) == "win":
                        score += 100.0
            else:
                score += -10.0 # Unknown action type

            scores.append(score)
        else:
            scores.append(0.0) # Fallback

    return scores

# Create Training Dataset

Generate diverse game states for training:

In [9]:
import numpy as np
from datasets import Dataset

def generate_game_states(num_samples=1000, rows=6, cols=6, num_mines=5,
                         rng_seed=42):
    """
    Generate EXACTLY num_samples diverse Minesweeper game states.
    
    Mix of:
    - Fresh games (20-30%)
    - Mid-game states (70-80%)
    
    IMPORTANTLY: Stores seed + move_history (as JSON string) so reward
    function can reconstruct the EXACT game state!
    
    Keeps generating until it has num_samples valid ongoing games or reaches
    the max_attempts safety limit (in which case fewer samples are returned).
    
    Args:
        rng_seed: Seed for numpy/random RNG for reproducibility.
    """
    # Seed RNG for reproducibility across runs
    np.random.seed(rng_seed)
    random.seed(rng_seed)

    dataset_items = []
    attempts = 0
    max_attempts = num_samples * 3  # Safety limit
    
    while len(dataset_items) < num_samples and attempts < max_attempts:
        attempts += 1
        seed = np.random.randint(100000)
        game = MinesweeperGame(rows=rows, cols=cols, num_mines=num_mines, seed=seed)
        
        # Make 0-5 random moves (0 = fresh game, 1-5 = mid-game)
        num_moves = np.random.randint(0, 6)
        move_history = []
        
        for _ in range(num_moves):
            board = game.get_visible_board()
            unrevealed = []
            for r in range(rows):
                for c in range(cols):
                    if board[r][c] == '.':
                        unrevealed.append((r, c))
            
            if unrevealed and game.state() == "ongoing":
                r, c = random.choice(unrevealed)
                action = {"type": "reveal", "row": r, "col": c}
                game.do_action(action)
                move_history.append(action)
            else:
                break
        
        # Only add ongoing games (skip failed/completed games)
        if game.state() == "ongoing":
            prompt_text = format_state_for_llm(game)
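            dataset_items.append({
                "prompt": [{"role": "user", "content": prompt_text}],
                "seed": seed,  # Store seed to reconstruct game
                # Store the board dimensions too: the reward function reads them
                # from kwargs and otherwise falls back to a 4x4 board with 2 mines
                "rows": rows,
                "cols": cols,
                "num_mines": num_mines,
                # IMPORTANT: Serialize as JSON string to avoid HF Dataset
                # schema inference mangling list-of-dicts into dict-of-lists
                "move_history": json.dumps(move_history),
            })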
    
    return Dataset.from_list(dataset_items)

# Generate training dataset
print("Generating training dataset...")
dataset = generate_game_states(num_samples=1000, rows=6, cols=6, num_mines=5)
print(f"Created EXACTLY {len(dataset)} training examples (all ongoing games)")

# Count fresh vs mid-game
fresh_count = sum(1 for item in dataset if item["move_history"] == "[]")
print(f"  Fresh games: {fresh_count} ({fresh_count/len(dataset)*100:.1f}%)")
print(f"  Mid-game states: {len(dataset) - fresh_count} ({(len(dataset)-fresh_count)/len(dataset)*100:.1f}%)")

# Show example
print("\nExample training prompt:")
print(dataset[0]["prompt"][0]["content"][:400] + "...")
print(f"Seed: {dataset[0]['seed']}, Previous moves: {len(json.loads(dataset[0]['move_history']))}")

Generating training dataset...
Created EXACTLY 1000 training examples (all ongoing games)
  Fresh games: 279 (27.9%)
  Mid-game states: 721 (72.1%)

Example training prompt:
You are a Minesweeper AI. Analyze the state and make a move.

CRITICAL RULES:
- DO NOT use <think> tags.
- DO NOT output any reasoning, explanations, or text.
- Start your response IMMEDIATELY with { and end with }.
- ONLY output a valid JSON object.

Game state:
{
  "board_grid": "0 0 0 1 . .\n1 1 1 1 . .\n. . . . 1 .\n. . . . . .\n. . . . . .\n. 1 . . . .",
  "rows": 6,
  "cols": 6,
  "mines": 5...
Seed: 15795, Previous moves: 4
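
Storing only the seed and the move history is enough to rebuild a board because mine placement is driven by a seeded `random.Random`. The sketch below shows both halves of that idea in isolation. `mine_positions` is a simplified stand-in for the game's mine placement, and the single-move history in `record` is invented for the example.

```python
import json
import random

def mine_positions(rows, cols, num_mines, seed):
    """Sample mine cells; the same seed always yields the same cells."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    return set(rng.sample(cells, num_mines))

# Hypothetical dataset row: the history is stored as a JSON string
record = {
    "seed": 15795,
    "move_history": json.dumps([{"type": "reveal", "row": 0, "col": 0}]),
}

first = mine_positions(6, 6, 5, record["seed"])
second = mine_positions(6, 6, 5, record["seed"])
history = json.loads(record["move_history"])

print(first == second)
print(history[0]["type"], history[0]["row"], history[0]["col"])
```

Replaying `history` on a game created with the same seed and the same board dimensions then reproduces the state the prompt was generated from.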


# Configure GRPO Training

Set up GRPO trainer with all hyperparameters:

In [10]:
from trl import GRPOConfig, GRPOTrainer

# Calculate max lengths
max_prompt_length = 600   # JSON state prompt
max_completion_length = max_seq_length - max_prompt_length

# GRPO Configuration
training_args = GRPOConfig(
    temperature = 1.0,
    learning_rate = 5e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4,  # Issue #6: was 1; 4 gives 16 effective completions per update
    num_generations = 4,  # Generate 4 actions per state
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 20,      # Adjust based on compute budget
    save_steps = 5,
    report_to = "none",
    output_dir = "minesweeper_custom_outputs",
)

print("Training configuration:")
print(f"  Max steps: {training_args.max_steps}")
print(f"  Generations per state: {training_args.num_generations}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  LoRA rank: {lora_rank}")

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4
Training configuration:
  Max steps: 20
  Generations per state: 4
  Learning rate: 5e-05
  LoRA rank: 64
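
The length and batch settings above interact, so it can help to work the numbers out explicitly. The sketch below does that with plain arithmetic. The step that turns completions into unique prompts assumes each prompt contributes exactly `num_generations` completions to an update, which is an assumption about the trainer rather than something this notebook states.

```python
max_seq_length = 1024
max_prompt_length = 600
max_completion_length = max_seq_length - max_prompt_length

per_device_batch = 4        # Unsloth raises the configured 1 to num_generations
gradient_accumulation = 4
num_generations = 4
max_steps = 20

completions_per_step = per_device_batch * gradient_accumulation
# Assumption: every prompt contributes num_generations completions per update
prompts_per_step = completions_per_step // num_generations
prompts_seen = prompts_per_step * max_steps

print(max_completion_length)  # → 424
print(completions_per_step)   # → 16
print(prompts_seen)           # → 80
```

Under that assumption, 20 steps touch about 80 of the 1,000 generated states, so `max_steps` is the main lever for covering more of the dataset.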


In [11]:
from transformers import TrainerCallback

class MinesweeperEvalCallback(TrainerCallback):
    """Periodically play games during training and log win rate.
    (Issue #8: no validation / reward tracking in the original notebook.)
    """

    def __init__(self, eval_every_steps=50, num_games=5):
        self.eval_every_steps = eval_every_steps
        self.num_games = num_games

    def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):
        if state.global_step % self.eval_every_steps != 0:
            return

        tokenizer = processing_class
        if tokenizer is None or model is None:
            return

        # Temporarily set model to eval mode
        was_training = model.training
        model.eval()

        wins = 0
        for i in range(self.num_games):
            game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=10000 + i)
            moves = 0
            while game.state() == "ongoing" and moves < 50:
                prompt = format_state_for_llm(game)
                text = tokenizer.apply_chat_template(
                    [{"role": "user", "content": prompt}],
                    tokenize=False,
                    add_generation_prompt=True,
                )
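                inputs = tokenizer(text, return_tensors="pt").to(model.device)
                output = model.generate(
                    **inputs,
                    temperature=0.7,
                    max_new_tokens=128,
                    do_sample=True,
                )
                # Decode only the newly generated tokens, not the echoed prompt
                new_tokens = output[0][inputs["input_ids"].shape[1]:]
                response = tokenizer.decode(new_tokens, skip_special_tokens=True)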
                action = parse_llm_action(response)
                if action is None:
                    break
                game.do_action(action)
                moves += 1
            if game.state() == "success":
                wins += 1

        win_rate = wins / self.num_games
        print(f"\n[Eval @ step {state.global_step}] Win rate: {wins}/{self.num_games} ({win_rate*100:.0f}%)\n")

        if was_training:
            model.train()

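# Note: training is configured with max_steps = 20, so an interval of 50 never
# fires; lower eval_every_steps (e.g. to 5) to actually see evaluations.
eval_callback = MinesweeperEvalCallback(eval_every_steps=50, num_games=5)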
print("Eval callback created: plays 5 games every 50 steps")

Eval callback created: plays 5 games every 50 steps
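
The callback only evaluates when `global_step % eval_every_steps == 0`, so the interval has to fit inside the training run. The sketch below lists the steps at which that condition holds; `eval_steps` is a helper written for this illustration.

```python
def eval_steps(max_steps, eval_every_steps):
    """Steps in 1..max_steps where global_step % eval_every_steps == 0."""
    return [s for s in range(1, max_steps + 1) if s % eval_every_steps == 0]

print(eval_steps(max_steps=20, eval_every_steps=50))  # → []
print(eval_steps(max_steps=20, eval_every_steps=5))   # → [5, 10, 15, 20]
```

With the 20-step configuration used in this notebook, the 50-step interval produces no evaluations at all.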


# Train the Model

Start GRPO training with reward functions:

In [12]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        valid_json_reward,   # Reward valid JSON format
        gameplay_scores,     # Reward good gameplay
    ],
    args = training_args,
    use_reference_model = False,
    train_dataset = dataset,
    callbacks = [eval_callback],  # Periodic gameplay evaluation
)

print("Starting training...")
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 20
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576 of 4,154,588,672 (3.18% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 40960, 'temperature': 0.6, 'top_p': 0.95}. If this is not desired, please set these values explicitly.
  out = torch_matmul(X, W.t(), out = out)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,sampling / sampling_logp_difference / mean,sampling / sampling_logp_difference / max,sampling / importance_sampling_ratio / min,sampling / importance_sampling_ratio / mean,sampling / importance_sampling_ratio / max,kl,rewards / valid_json_reward / mean,rewards / valid_json_reward / std,rewards / gameplay_scores / mean,rewards / gameplay_scores / std
1,0.0,-110.0,0.0,424.0,424.0,424.0,1.0,0.0,0.0,0.0,0,0,0,0,0,0.000667,-60.0,0.0,-50.0,0.0
2,0.0,-102.84375,14.3125,423.75,420.0,424.0,0.9375,420.0,420.0,420.0,No Log,No Log,No Log,No Log,No Log,0.000672,-56.5625,13.750001,-46.28125,14.875001
3,0.0,-110.0,0.0,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.000753,-60.0,0.0,-50.0,0.0
4,0.0,-110.0,0.0,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.003944,-60.0,0.0,-50.0,0.0
5,0.0001,-110.0,0.0,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.024335,-60.0,0.0,-50.0,0.0
6,0.0002,-102.84375,14.3125,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.061265,-56.5625,13.750001,-46.28125,14.875001
7,0.0003,-102.84375,14.3125,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.087364,-56.5625,13.750001,-46.28125,14.875001
8,0.0001,-110.0,0.0,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.035295,-60.0,0.0,-50.0,0.0
9,0.0003,-110.0,0.0,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.080578,-60.0,0.0,-50.0,0.0
10,0.0003,-102.84375,14.3125,424.0,424.0,424.0,1.0,0.0,0.0,0.0,No Log,No Log,No Log,No Log,No Log,0.081395,-56.5625,13.750001,-46.28125,14.875001




TrainOutput(global_step=20, training_loss=0.0003303109772332391, metrics={'train_runtime': 282.3787, 'train_samples_per_second': 1.133, 'train_steps_per_second': 0.071, 'total_flos': 0.0, 'train_loss': 0.0003303109772332391})
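
In the log above, steps 1, 3, 4, 5, 8 and 9 report a reward of -110 with `reward_std` of 0 and a `clipped_ratio` of 1.0: every completion hit the length cap and received the same penalty. GRPO scores each completion relative to the other completions for the same prompt, so a group with identical rewards yields no learning signal. The snippet below is a simplified sketch of that normalization, not TRL's or Unsloth's actual implementation. `group_relative_advantages`, the group size of 4, and the reward value 5.0 are illustrative assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical
    return [(r - mean) / (std + eps) for r in rewards]

# Every completion truncated and penalized identically: all advantages are zero
print(group_relative_advantages([-110.0, -110.0, -110.0, -110.0]))

# One completion scores differently, so the group produces a gradient signal
print(group_relative_advantages([-110.0, -110.0, -110.0, 5.0]))
```

Under this view, shortening the model's output (for example by disabling thinking in the training prompts) or raising the completion length limit are plausible ways to get varied rewards within a group.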

# Test Trained Model

Evaluate the finetuned model:

In [13]:
# Test on new game
test_game = MinesweeperGame(rows=6, cols=6, num_mines=5)
test_prompt = format_state_for_llm(test_game)

# Removed reasoning_effort="low" for train/eval consistency
test_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": test_prompt}],
    tokenize = False,
    add_generation_prompt = True,
)

print("=== Trained Model Response ===")
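inputs = tokenizer(test_text, return_tensors = "pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample = True,  # sample explicitly so that temperature applies
    temperature = 0.7,
    max_new_tokens = 128,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

# Parse and test action (decode only the generated tokens, not the prompt)
response_text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
action = parse_llm_action(response_text)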
print(f"\nParsed action: {action}")

if action:
    test_game.do_action(action)
    print(f"\nGame state after action: {test_game.state()}")
    print(test_game.pretty_print())

=== Trained Model Response ===
<think>
Okay, let's see. The user provided a Minesweeper game state where the board is completely unrevealed, there are 5 mines, and no flags or cells revealed yet. My job is to figure out the best move here.

First, since the board is all unrevealed, the AI needs to decide whether to flag a cell or reveal one. But since no flags are placed yet, maybe starting by revealing a cell that's likely to be safe. But with 5 mines on a 6x6 grid, there are 36 cells total. So 5 mines mean 31 safe cells

Parsed action: None
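
The response above was cut off at the 128-token limit while still inside the `<think>` block, so no JSON was ever produced and parsing returned `None`. The sketch below shows one way a parser might treat such output. `extract_action` is a hypothetical helper and is not the notebook's `parse_llm_action`; the action schema (`type`, `row`, `col`) is taken from the outputs shown in this notebook.

```python
import json
import re

def extract_action(response):
    """Return the first valid action dict found after any <think> block, else None."""
    if "<think>" in response and "</think>" not in response:
        return None  # reasoning was truncated before an answer was produced
    # Drop completed reasoning so JSON-like text inside it is not mistaken for the answer
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    for match in re.finditer(r"\{[^{}]*\}", answer):
        try:
            action = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if (
            isinstance(action, dict)
            and action.get("type") in ("reveal", "flag")
            and isinstance(action.get("row"), int)
            and isinstance(action.get("col"), int)
        ):
            return action
    return None

print(extract_action("<think>Okay, let's see. The board is"))
print(extract_action('<think>corner first</think>{"type": "reveal", "row": 0, "col": 0}'))
```

Distinguishing "truncated reasoning" from "malformed JSON" in this way could also let a reward function penalize the two cases differently.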


# Evaluation: Play Complete Games

Test the model on multiple complete games:

In [14]:
def play_full_game(
    model,
    tokenizer,
    rows=4,
    cols=4,
    num_mines=2,
    seed=None,
    max_moves=50
):
    """Play a complete Minesweeper game with the model"""
    game = MinesweeperGame(rows=rows, cols=cols, num_mines=num_mines, seed=seed)
    moves = 0

    while game.state() == "ongoing" and moves < max_moves:
        # 🔒 Force safe first move
        if moves == 0:
            action = {"type": "reveal", "row": 0, "col": 0}
            game.do_action(action)
            moves += 1
            continue

        # Format game state
        prompt = format_state_for_llm(game)

        text = tokenizer.apply_chat_template(
            [
                {
                    "role": "system",
                    "content": "You MUST output ONLY a valid JSON object. No explanation. No <think>."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )

        inputs = tokenizer(text, return_tensors="pt").to(model.device)

        output = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=False,  # greedy decoding; temperature is not used
            pad_token_id=tokenizer.eos_token_id,
        )

        # Decode ONLY generated tokens
        input_len = inputs.input_ids.shape[1]
        response = tokenizer.decode(
            output[0][input_len:],
            skip_special_tokens=True
        ).strip()

        print("\nRAW MODEL OUTPUT:")
        print(response)

        action = parse_llm_action(response)
        print("PARSED ACTION:", action)

        if action is None:
            print("❌ ACTION IS NONE - STOPPING GAME")
            break

        game.do_action(action)
        moves += 1

    return game, moves


# =======================
# SINGLE-GAME DEBUG RUN
# =======================

game, moves = play_full_game(
    model,
    tokenizer,
    rows=4,
    cols=4,
    num_mines=2,
    seed=0
)

print("\nFINAL GAME STATE:", game.state())
print("TOTAL MOVES:", moves)



RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": "flag", "row": 3, "col": 3}
PARSED ACTION: {'type': 'flag', 'row': 3, 'col': 3}

RAW MODEL OUTPUT:
{"type": 
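
The trace above shows the model proposing the same flag at (3, 3) on every turn, so the game makes no progress until `max_moves` is exhausted. A minimal sketch of a guard against this is given below. `RepeatGuard` is a hypothetical helper, and it assumes actions are dicts with `type`, `row` and `col` keys as in the parsed output above.

```python
class RepeatGuard:
    """Track proposed actions and report when one is repeated too often."""

    def __init__(self, max_repeats=1):
        self.max_repeats = max_repeats
        self.counts = {}

    def should_stop(self, action):
        key = (action["type"], action["row"], action["col"])
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] > self.max_repeats

guard = RepeatGuard(max_repeats=1)
proposed = [{"type": "flag", "row": 3, "col": 3}] * 3
for turn, action in enumerate(proposed, start=1):
    if guard.should_stop(action):
        print(f"Turn {turn}: repeated action {action}, stopping early")
        break
```

Inside `play_full_game`, a check like this could sit just before `game.do_action(action)`, either ending the game or falling back to a different move.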

# Save the Model

Save your trained model for competition submission:

In [16]:
# Save LoRA adapters
model.save_pretrained("my_minesweeper_model")
tokenizer.save_pretrained("my_minesweeper_model")

print("Model saved to: my_minesweeper_model/")


Model saved to: my_minesweeper_model/
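
Before submitting, it may be worth confirming that the directory actually contains adapter files. The sketch below assumes `model` is a PEFT-wrapped LoRA model, in which case `save_pretrained` normally writes `adapter_config.json` plus an adapter weight file. `missing_adapter_files` is a hypothetical helper, not part of Unsloth or PEFT.

```python
import os

def missing_adapter_files(model_dir):
    """List expected LoRA adapter files that are absent from model_dir."""
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    missing = []
    if "adapter_config.json" not in present:
        missing.append("adapter_config.json")
    # PEFT writes safetensors by default; older setups may write a .bin file instead
    if not present & {"adapter_model.safetensors", "adapter_model.bin"}:
        missing.append("adapter_model.safetensors (or adapter_model.bin)")
    return missing

print(missing_adapter_files("my_minesweeper_model"))
```

An empty list suggests the adapter was written; anything else is a hint to re-run the save cell.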


# Competition Tips

## Improve Your Model:

1. **Adjust Reward Functions**
   - Increase rewards for logical deduction
   - Add penalties for random moves
   - Reward flagging correct mines

2. **Tune Hyperparameters**
   - Increase `max_steps` for longer training
   - Adjust `learning_rate` (try 1e-5 to 1e-4)
   - Increase `lora_rank` for more capacity
   - Adjust `num_generations` (2-8)

3. **Better Training Data**
   - Generate more diverse states
   - Include harder scenarios (more mines)
   - Add states requiring logical deduction

4. **Advanced Techniques**
   - Multi-step rollouts in reward function
   - Curriculum learning (easy → hard boards)
   - Ensemble multiple models
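
As a concrete starting point for tip 1, the sketch below scores a single action against the hidden mine layout, rewarding correct flags and penalizing wasted or repeated moves. It is not the notebook's `gameplay_scores`. `shaped_reward`, the set-of-coordinates board representation, and all numeric values are illustrative assumptions.

```python
def shaped_reward(action, mines, revealed, flagged, rows, cols):
    """Score one action against the hidden mine layout (illustrative values only)."""
    cell = (action["row"], action["col"])
    if not (0 <= cell[0] < rows and 0 <= cell[1] < cols):
        return -10.0  # out-of-bounds move
    if cell in revealed:
        return -5.0   # wasted move on an already revealed cell
    if action["type"] == "flag":
        if cell in flagged:
            return -5.0  # repeating an existing flag makes no progress
        return 5.0 if cell in mines else -3.0
    if action["type"] == "reveal":
        return -20.0 if cell in mines else 2.0
    return -10.0  # unknown action type

mines = {(1, 2), (3, 3)}
first = {"type": "flag", "row": 3, "col": 3}
print(shaped_reward(first, mines, set(), set(), 4, 4))
print(shaped_reward(first, mines, set(), {(3, 3)}, 4, 4))
```

Penalizing a repeated flag is one way to discourage the kind of loop where the model proposes the same cell on every turn.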

## Team Strategy:
- Experiment with different reward functions
- Try different board sizes during training
- Analyze failed games to improve rewards
- Use temperature sampling during evaluation
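
Varying board sizes during training pairs naturally with curriculum learning. The sketch below maps training progress to board settings, moving from easy to hard. `curriculum_config` and the stage values are hypothetical; the `rows`, `cols` and `num_mines` keys mirror the `MinesweeperGame` arguments used in this notebook.

```python
# Illustrative stages: (rows, cols, num_mines), ordered from easy to hard
STAGES = [(4, 4, 2), (5, 5, 3), (6, 6, 5)]

def curriculum_config(step, total_steps, stages=STAGES):
    """Pick board settings for a training step, advancing through stages evenly."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    index = min(int(progress * len(stages)), len(stages) - 1)
    rows, cols, num_mines = stages[index]
    return {"rows": rows, "cols": cols, "num_mines": num_mines}

for step in (0, 10, 19):
    print(step, curriculum_config(step, total_steps=20))
```

The returned dict could be passed as keyword arguments when generating training states, for example `MinesweeperGame(**curriculum_config(step, total_steps))`.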

Good luck!