# Creating a dataset to train Llama 2 to play chess

This notebook imports a dataset of chess moves from professional tournaments from a CSV file. Each row contains a string listing every move in a game. Each move number is written as an integer followed by a period (e.g. move 12 is written as "12."). Within a move, an uppercase letter identifies the piece, a lowercase letter (a-h) gives the file, an integer (1-8) gives the rank, and 'x' marks a take.

The legend for the chess pieces is listed below. The pawn has no key: a move written without a piece letter is a pawn move.
* 'B': Bishop
* 'K': King
* 'N': Knight
* 'Q': Queen
* 'R': Rook
* (no letter): Pawn

For example, "Ng6" denotes a knight moving to g6, and "Qxa4" denotes the queen taking whatever is on a4.
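
The notation above can be sketched as a small parser. This is only an illustration for the simple moves described here: the names `PIECES`, `MOVE_PATTERN` and `describe_move` are hypothetical and are not used elsewhere in the notebook, and anything more elaborate (pawn takes, castling, check markers, promotions) is deliberately left unhandled.

```python
import re

# Piece letters from the legend above; a move with no letter is a pawn move.
PIECES = {"B": "bishop", "K": "king", "N": "knight", "Q": "queen", "R": "rook"}

# Hypothetical pattern for the simple moves described here:
# optional piece letter, optional 'x', then file (a-h) and rank (1-8).
MOVE_PATTERN = re.compile(r"^([BKNQR])?(x)?([a-h])([1-8])$")


def describe_move(move):
    """Return (piece, is_take, square) for a simple move, or None."""
    match = MOVE_PATTERN.match(move)
    if match is None:
        return None  # anything outside the simple pattern is out of scope
    piece, take, file_, rank = match.groups()
    return PIECES.get(piece, "pawn"), take is not None, file_ + rank


print(describe_move("Ng6"))   # ('knight', False, 'g6')
print(describe_move("Qxa4"))  # ('queen', True, 'a4')
print(describe_move("e4"))    # ('pawn', False, 'e4')
```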

Each of the keys is replaced with a unique string that corresponds to a single token in the Llama 2 tokenizer's vocabulary. For Llama to learn from these moves, they need to be translated into the language that Llama understands: tokens. A sequence of 3 tokens is also added to the start of each row, to signal to Llama that the rest of the tokens are chess moves.
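
To make the translation concrete, here is a minimal sketch that applies the notebook's key-to-token table to single moves. It swaps in a different technique from the notebook's code: one regex pass over a lookup table instead of sequential `str.replace` calls, so characters inside an already inserted token are never looked at again. The names `KEY_TO_TOKEN`, `KEY_PATTERN` and `keys_to_tokens` are hypothetical.

```python
import re

# Key-to-token table copied from this notebook's replacements.
KEY_TO_TOKEN = {
    "x": "<0xC9>",
    "a": "<0x0A>", "b": "<0x0C>", "c": "<0x0D>", "d": "<0x0E>",
    "e": "<0x0F>", "f": "<0x90>", "g": "<0x99>", "h": "<0x9A>",
    "1": "<0x9C>", "2": "<0x9D>", "3": "<0x9E>", "4": "<0x9F>",
    "5": "<0xA0>", "6": "<0xA9>", "7": "<0xAA>", "8": "<0xAC>",
    "K": "<0x10>", "R": "<0x11>", "B": "<0x12>", "Q": "<0x13>", "N": "<0x14>",
}
CHESS_START_TOKEN = "<0xF9><0xFF><0xFE>"

# One character class covering every key, so each character is visited once.
KEY_PATTERN = re.compile("[" + "".join(KEY_TO_TOKEN) + "]")


def keys_to_tokens(move):
    """Replace every chess key in a single move with its string token."""
    return KEY_PATTERN.sub(lambda m: KEY_TO_TOKEN[m.group(0)], move)


print(keys_to_tokens("Ng6"))                     # <0x14><0x99><0xA9>
print(keys_to_tokens("Qxa4"))                    # <0x13><0xC9><0x0A><0x9F>
print(CHESS_START_TOKEN + keys_to_tokens("e4"))  # <0xF9><0xFF><0xFE><0x0F><0x9F>
```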

Note that "tokenize" is used here to mean translating the chess keys into these unique strings. Tokenizing in the model's sense goes one step further: encoding a string with the model's tokenizer yields a list of token _IDs_, integers that each identify a unique token. Most libraries will train Llama 2 models using strings instead of token IDs, which is why I convert the chess keys into string tokens rather than token IDs.
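
The difference between string tokens and token IDs can be illustrated without loading a model. The sketch below splits a tokenised string into its string tokens and then looks each one up in a toy vocabulary. The IDs are made up for illustration and are not the real Llama 2 token IDs; `TOKEN_PATTERN`, `split_tokens` and `toy_vocab` are hypothetical names.

```python
import re

# Matches one string token: a byte-style token such as <0x14>, or the eos token.
TOKEN_PATTERN = re.compile(r"<0x[0-9A-F]{2}>|</s>")


def split_tokens(text):
    """Split a tokenised string into its individual string tokens."""
    return TOKEN_PATTERN.findall(text)


tokenised = "<0x14><0x99><0xA9></s>"  # "Ng6" followed by the eos token
tokens = split_tokens(tokenised)

# TOY vocabulary for illustration only: these IDs are invented and are
# NOT the IDs that the Llama 2 tokenizer assigns.
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [toy_vocab[token] for token in tokens]

print(tokens)     # ['<0x14>', '<0x99>', '<0xA9>', '</s>']
print(token_ids)  # [1, 2, 3, 0]
```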

In [53]:
import pandas as pd
import re
from pprint import pprint

moves = pd.read_csv('moves_raw.csv')["moves"]

eos = '</s>'
black_move_token  = "<0xFC>"
white_move_token  = "<0xFD>"
chess_start_token = "<0xF9><0xFF><0xFE>"

# token to signify a take
moves = moves.str.replace('x',"<0xC9>")

# add chess_start_token to the front
moves = moves.apply(lambda text: chess_start_token+text)

# remove game results (1-0: white wins, 0-1: black wins, 1/2-1/2: draw)
moves = moves.str.replace("1-0","")
moves = moves.str.replace("0-1","")
moves = moves.str.replace("1/2-1/2","")

# replace move numbering and spaces with the eos token, so that eos separates every move
moves = moves.apply(lambda x: re.sub(r'\d+\.', eos, x))
moves = moves.str.replace(' '+eos, eos)
moves = moves.str.replace(eos+' ', eos)
moves = moves.str.replace(' ', eos)

# replace positions with tokens
moves = moves.str.replace('a',"<0x0A>")
moves = moves.str.replace('b',"<0x0C>")
moves = moves.str.replace('c',"<0x0D>")
moves = moves.str.replace('d',"<0x0E>")
moves = moves.str.replace('e',"<0x0F>")
moves = moves.str.replace('f',"<0x90>")
moves = moves.str.replace('g',"<0x99>")
moves = moves.str.replace('h',"<0x9A>")
moves = moves.str.replace('1',"<0x9C>")
moves = moves.str.replace('2',"<0x9D>")
moves = moves.str.replace('3',"<0x9E>")
moves = moves.str.replace('4',"<0x9F>")
moves = moves.str.replace('5',"<0xA0>")
moves = moves.str.replace('6',"<0xA9>")
moves = moves.str.replace('7',"<0xAA>")
moves = moves.str.replace('8',"<0xAC>")

# replace pieces with tokens
# (done after the rank digits, because these tokens contain the digits 1-4)
moves = moves.str.replace('K',"<0x10>")
moves = moves.str.replace('R',"<0x11>")
moves = moves.str.replace('B',"<0x12>")
moves = moves.str.replace('Q',"<0x13>")
moves = moves.str.replace('N',"<0x14>")

# append tokens to signify whose move comes next
# create this function to use in the apply method
def turn_token(row):
    row = row.split(eos) # split the text into a list divided by the eos token
    text = ''
    for i in range(len(row)):
        if i%2:
            # odd indices hold white's moves, so black moves next
            # insert the eos token and black_move_token
            text += row[i]+eos+black_move_token
        else:
            # index 0 holds the chess_start_token and the other even indices
            # hold black's moves, so white moves next
            # insert the eos token and white_move_token
            text += row[i]+eos+white_move_token

    return text

moves = moves.apply(turn_token)
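
The cell above maps only the keys it lists, so any other character in a row passes through unchanged. Whether this dataset contains such characters is not known here. A minimal sanity check, sketched below with the hypothetical names `KNOWN_TOKEN` and `leftover_characters`, strips every known token from a row and returns whatever is left.

```python
import re

# Every token this notebook emits is either a byte-style token or the eos token.
KNOWN_TOKEN = re.compile(r"<0x[0-9A-F]{2}>|</s>")


def leftover_characters(tokenised_row):
    """Return whatever remains after removing every known token."""
    return KNOWN_TOKEN.sub("", tokenised_row)


# A fully tokenised row: start tokens, eos, white's turn, "e4", eos, black's turn.
clean_row = "<0xF9><0xFF><0xFE></s><0xFD><0x0F><0x9F></s><0xFC>"
print(repr(leftover_characters(clean_row)))  # ''

# Hypothetical row in which a '+' marker was never mapped to a token.
row_with_marker = "<0x13><0x9A><0xA0>+</s><0xFC>"
print(repr(leftover_characters(row_with_marker)))  # '+'
```

Applied to the Series, `moves[moves.apply(leftover_characters) != ""]` would list the rows that still contain untranslated characters.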

In [55]:
# Saving the Pandas Series as a csv

moves.to_csv("tokenised_moves.csv", index=False)

In [56]:
# Save the Pandas Series as a JSON list of {"text": ...} records
import json

json_list = [{"text": text} for text in moves]
with open('tokenised_moves.json', 'w') as f:
    json.dump(json_list, f, indent=4)
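
As a quick check of the JSON layout, the sketch below writes two stand-in rows in the same `{"text": ...}` format and reads them back. The rows are hypothetical examples rather than real data, and a temporary directory is used so that nothing in the working folder is overwritten.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the tokenised Series: two short rows.
rows = ["<0xF9><0xFF><0xFE></s><0xFD>", "<0x0F><0x9F></s><0xFC>"]
json_list = [{"text": text} for text in rows]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenised_moves.json")
    # Write the records the same way as the cell above.
    with open(path, "w") as f:
        json.dump(json_list, f, indent=4)
    # Read them back to confirm the round trip.
    with open(path) as f:
        loaded = json.load(f)

restored = [record["text"] for record in loaded]
print(restored == rows)  # True
```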

In [58]:
# Save the Pandas Series as a parquet file
# (pandas needs a parquet engine such as pyarrow or fastparquet installed)

pd.DataFrame(moves).to_parquet("tokenised_moves.parquet", index=False)