# **Transformer-Based Chess Engine**

### **References**

**Noever, D., Ciolino, M., & Kalin, J. (2020).**  
*The Chess Transformer: Mastering Play using Generative Language Models.*  
[https://arxiv.org/abs/2008.04057](https://arxiv.org/abs/2008.04057)

### **Install Dependencies and Import Libraries**

In [None]:
%pip install transformers
%pip install gpt-2-simple
%pip install bertviz
%pip install datasets

In [21]:
import re
from pathlib import Path
import random
from datasets import load_dataset
import gpt_2_simple as gpt2
from bertviz import model_view



### **Preprocessing the Dataset**

This step was performed locally, outside of Google Colaboratory. You do not need to execute this step again.

In [None]:
def preprocess_pgn_file(input_path, output_path):
	if output_path.exists():
		print(f"⚠️ {output_path.name} already exists, skipping...")
		return
	
	with open(input_path, encoding="utf-8", errors="ignore") as f:
		text = f.read()

	text = re.sub(r'\[.*?\]', '', text)

	text = re.sub(r'\{[^}]*\}', '', text)
	text = re.sub(r'\([^)]*\)', '', text)
	text = re.sub(r';[^\n]*', '', text)

	games = re.split(r'\n\s*\n', text)

	saved = 0

	with open(output_path, "w", encoding="utf-8") as out:
		for game in games:
			game = game.strip()
			if not game:
				continue

			clean = re.sub(r'\s+', ' ', game).strip()

			# clean = re.sub(r'\b\d+\.\s*', '', clean)

			clean = re.sub(r'^(.+?)\s(1-0|0-1|1/2-1/2|\*)$', r'[Result \2] \1', clean)

			if clean.startswith("[Result"):
				out.write(clean + "\n")
				saved += 1

	print(f"✅ {Path(input_path).name} → {Path(output_path).name} ({saved} games saved)")

def process_all_pgn_files(input_directory, output_directory):
	input_path = Path(input_directory)
	output_path = Path(output_directory)

	output_path.mkdir(parents=True, exist_ok=True)

	files = sorted(input_path.glob("*.pgn"))
	
	for pgn_file in files:
		output_file = output_path / pgn_file.name.replace(".pgn", ".txt")
		preprocess_pgn_file(pgn_file, output_file)
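
As a rough illustration of what the cleaning steps above produce, the sketch below applies the same regular expressions to an in-memory string instead of a file. The `clean_pgn_text` helper and the two-game `SAMPLE_PGN` snippet are hypothetical, made up for this example; they are not part of the original pipeline.

```python
import re

# Hypothetical two-game PGN snippet, invented for illustration only.
SAMPLE_PGN = """[Event "Rated Blitz game"]
[Result "1-0"]

1. e4 e5 2. Nf3 {a comment} Nc6 (2... d6) 3. Bb5 a6 1-0

[Event "Rated Blitz game"]
[Result "1/2-1/2"]

1. d4 d5 2. c4 e6 ; rest-of-line comment
3. Nc3 Nf6 1/2-1/2
"""


def clean_pgn_text(text):
    """Return one '[Result X] movetext' string per game (illustrative helper)."""
    text = re.sub(r'\[.*?\]', '', text)    # drop PGN tag pairs
    text = re.sub(r'\{[^}]*\}', '', text)  # drop brace comments
    text = re.sub(r'\([^)]*\)', '', text)  # drop non-nested variations
    text = re.sub(r';[^\n]*', '', text)    # drop rest-of-line comments

    lines = []
    for game in re.split(r'\n\s*\n', text):  # games are separated by blank lines
        clean = re.sub(r'\s+', ' ', game).strip()
        if not clean:
            continue
        # Move the trailing result token to the front as a "[Result X]" prefix.
        clean = re.sub(r'^(.+?)\s(1-0|0-1|1/2-1/2|\*)$', r'[Result \2] \1', clean)
        if clean.startswith("[Result"):
            lines.append(clean)
    return lines


cleaned = clean_pgn_text(SAMPLE_PGN)
for line in cleaned:
    print(line)
# [Result 1-0] 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6
# [Result 1/2-1/2] 1. d4 d5 2. c4 e6 3. Nc3 Nf6
```

Note that the variation pattern only handles parentheses that are not nested, which mirrors the behavior of the function above.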

In [15]:
def sample_from_huggingface(dataset_name="gabridulol/chess",
                            split="train",
                            sample_size=2_701_489,
                            seed=42):

    print(f"📦 Loading dataset: {dataset_name} (split: {split})")
    ds = load_dataset(dataset_name, split=split, streaming=True)
    
    random.seed(seed)
    reservoir = []
    total_seen = 0

    print(f"🎯 Sampling {sample_size:,} games using reservoir sampling...")

    for example in ds:
        text = example.get("text") or example.get("content") or str(example)
        if not text.strip():
            continue

        clean_game = " ".join(text.strip().splitlines())
        clean_game = " ".join(clean_game.split())

        total_seen += 1
        if len(reservoir) < sample_size:
            reservoir.append(clean_game)
        else:
            j = random.randint(0, total_seen - 1)
            if j < sample_size:
                reservoir[j] = clean_game

        if total_seen % 100000 == 0:
            print(f"⏳ Seen {total_seen:,} games...")

    print(f"\n✅ Finished. Total seen: {total_seen:,}, sample kept: {len(reservoir):,}")
    
    output_path = Path("data/chess/lichess-elite-sample/lichess_elite_1st_train.txt")
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(reservoir))

    print(f"💾 Saved sample to: {output_path.resolve()}")
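
The sampling loop above follows the classic reservoir sampling scheme (often called Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length. The sketch below isolates that logic on a small in-memory stream; the `reservoir_sample` helper and the `game-N` strings are hypothetical and exist only for this illustration.

```python
import random


def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of up to k items from an iterable."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items are always kept.
            reservoir.append(item)
        else:
            # Replacement phase: item number `seen` is kept with probability k / seen.
            j = rng.randint(0, seen - 1)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 10 items from a generator of 1,000 placeholder games.
sample = reservoir_sample((f"game-{i}" for i in range(1000)), k=10)
print(len(sample))  # 10
```

Only the reservoir itself is held in memory, which is why this approach suits a streamed dataset: the full set of games never needs to be loaded at once.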

In [None]:
process_all_pgn_files(
	"data/chess/lichess-elite-database",
	"data/chess/lichess-elite-dataset")

In [None]:
sample_from_huggingface(
    dataset_name="data/chess/lichess-elite-dataset",
    split="train",
    sample_size=2_701_489,
    seed=42
)

In [18]:
def count_games_in_file(file_path):
	count = 0
	with open(file_path, encoding="utf-8", errors="ignore") as f:
		for line in f:
			if line.startswith("[Result"):
				count += 1
	return count

def total_games_in_directory(directory_path):
	total = 0
	path = Path(directory_path)
	files = sorted(path.glob("*.txt"))
	for txt_file in files:
		total += count_games_in_file(txt_file)
	return total

In [19]:
total = total_games_in_directory("data/chess/lichess-elite-dataset")
total_sample = total_games_in_directory("data/chess/lichess-elite-sample")

print(f"✅ Total games processed: {total}")
print(f"✅ Total games processed for 1st training: {total_sample}")
print(f"✅ Total games processed for 2nd training: {total}")

✅ Total games processed: 27014886
✅ Total games processed for 1st training: 2701489
✅ Total games processed for 2nd training: 27014886


### **Dataset Overview**

The dataset comes from the [Lichess Elite Database](https://database.nikonoel.fr), which contains chess games played by highly rated players and is distributed in PGN (Portable Game Notation) format. For this project, the original PGN files were converted to plain text (TXT), with one game per line. Preprocessing strips the PGN tag pairs, comments, and parenthesized variations, so each line retains only the game result, written as a leading `[Result ...]` tag, followed by the numbered moves in Standard Algebraic Notation (SAN). The processed dataset is available on [Hugging Face](https://huggingface.co/datasets/gabridulol/chess).
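
To make the line format concrete, here is a minimal sketch that splits one processed line into its result and its SAN moves. It assumes every line has the form `[Result <result>] <movetext>` written by the preprocessing step; the `parse_processed_line` helper and the example line are hypothetical and were not taken from the published dataset.

```python
import re

# Assumed line layout: "[Result <result>] <movetext>".
LINE_PATTERN = re.compile(r'^\[Result (1-0|0-1|1/2-1/2|\*)\] (.+)$')


def parse_processed_line(line):
    """Return (result, san_moves) for a processed line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    result, movetext = match.groups()
    # Drop move-number tokens such as "1." or "12..." and keep the SAN moves.
    moves = [tok for tok in movetext.split() if not re.fullmatch(r'\d+\.+', tok)]
    return result, moves


parsed = parse_processed_line("[Result 1-0] 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6")
print(parsed)
# ('1-0', ['e4', 'e5', 'Nf3', 'Nc6', 'Bb5', 'a6'])
```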

### **Dataset Statistics**

- **Total games processed:** 27,014,886  
- **Total games processed for 1st training:** 10% (2,701,489 games)  
- **Total games processed for 2nd training:** 100% (27,014,886 games)

### **The Chess Transformer: Mastering Play using Generative Language Models**