## Final Project Proposal

By Evan Richardson

[GitHub Repo](https://github.com/ejrich/chess-predictions)

I've enjoyed chess for a long time and have wanted to dive into the deep end of machine learning right away.  For this project, I would like to analyze the position of the pieces to find the most likely outcome of a game between equally skilled players.  To gather the data, I will initially load PGN (Portable Game Notation) data from https://www.ficsgames.org/download.html for games rated above 2000.  From this data, I created a data loader that interprets each move and creates a snapshot of the game after each move.  I hope to use these snapshots to build a model that predicts who is likely to win from a given state of the game.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Dataset

All the data I have gathered is from https://www.ficsgames.org/download.html, which provides data for a lot of games: over 7000 in January 2018 alone, which is the file I am currently using.  The data is provided as PGN rather than CSV, so I wrote an interpreter that translates the moves in the file into the actual board state after every move.  For each board state, I wrote out the result of the game (used for testing), the number of each type of piece for each player, each player's total number of pieces, and the piece and its color for each square on the board, which flattens the board object into columns.  I did not even load the entire file, and there are still over 50,000 rows.

An example game notated with PGN looks like this:

1. e4 c5 2. f4 e6 3. Nf3 Nc6 4. d3 d5 5. e5 Qa5+ 6. c3 Nh6 7. Be2 Be7 8. O-O O-O 9. Re1 Qb6 10. Nbd2 c4+ 11. d4 Nf5 12. Nf1 Bd7 13. g4 Nh4 14. Kh1 Nxf3 15. Bxf3 Bh4 16. Re2 Rad8 17. Rg2 a6 18. Ng3 Ne7 19. Nh5 Qa5 20. Bd2 Qb5 21. Be1 Bxe1 22. Qxe1 h6 23. g5 hxg5 24. Rxg5 Ng6 25. Qg3 Qxb2 26. Rg1 Qxc3 27. Nf6+ gxf6 28. Rxg6+ fxg6 29. Qxg6+ Kh8 30. Qh6# {Black checkmated} 1-0
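As a rough illustration of the first step of interpreting a game like this, the sketch below splits PGN movetext into individual moves and converts the result marker into the numeric `Result` value used in the dataset. It is not the interpreter from the repo: `parse_movetext` and `RESULT_TO_TARGET` are hypothetical names, and the sketch only tokenizes the text without replaying moves on a board.

```python
import re

# Hypothetical mapping from PGN result markers to the numeric Result column.
RESULT_TO_TARGET = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}


def parse_movetext(movetext):
    """Split PGN movetext into a list of moves and a numeric result."""
    # Drop {...} comments such as "{Black checkmated}".
    text = re.sub(r"\{[^}]*\}", " ", movetext)
    moves, result = [], None
    for token in text.split():
        if token in RESULT_TO_TARGET:
            result = RESULT_TO_TARGET[token]
        elif token == "*":
            result = None  # "*" marks an unfinished game
        elif re.fullmatch(r"\d+\.+", token):
            continue  # move number such as "12."
        else:
            moves.append(token)  # a move such as "Nf3", "O-O", or "Qh6#"
    return moves, result


# A shortened version of the game above, keeping its comment and result.
example = "1. e4 c5 2. f4 e6 3. Nf3 Nc6 {Black checkmated} 1-0"
moves, result = parse_movetext(example)
print(moves)
print(result)
```

Each move in the returned list would then be applied to a board object in order, with one snapshot row written after every move.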


In [7]:
df = pd.read_csv('../data/201801_games.csv')

In [11]:
print(df.shape)
df.head(10)

(52500, 143)


Unnamed: 0,Result,WhitePawns,BlackPawns,WhiteRooks,BlackRooks,WhiteBishops,BlackBishops,WhiteKnights,BlackKnights,WhiteQueen,...,d8_color,d8_piece,e8_color,e8_piece,f8_color,f8_piece,g8_color,g8_piece,h8_color,h8_piece
0,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
1,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
2,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
3,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
4,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
5,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
6,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
7,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
8,1.0,8,8,2,2,2,2,2,2,1,...,2,Q,2,K,2,B,2,N,2,R
9,1.0,8,8,2,2,2,2,2,2,1,...,0,,2,K,2,B,2,N,2,R


The first 10 rows in this set belong to the same game.  For example, the d8_color and d8_piece columns in the last row show that the black queen has moved off d8.  There are about 2300 games, with an average of about 22 moves per game.

The color and piece columns are categorical, with color taking the values 0 = Empty, 1 = White, and 2 = Black.

Pieces are '' = Empty, P = Pawn, R = Rook, B = Bishop, N = Knight, Q = Queen, and K = King.

The piece counts are numeric: each is an integer count of the pieces of that type that a side has on the board.
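The column layout described above can be sketched as follows. This is only an illustration under stated assumptions: the board is represented as a plain dict from square name to `(color, piece)`, which is a hypothetical stand-in for the board object in the repo, and `starting_board`, `flatten`, and `count_pieces` are made-up helper names. The `BlackQueen` column is assumed by analogy with `WhiteQueen`, since it is hidden behind the `...` in the output above.

```python
FILES = "abcdefgh"
RANKS = "12345678"
EMPTY, WHITE, BLACK = 0, 1, 2  # color codes used in the dataset


def starting_board():
    """Return the initial position as {square: (color, piece)}."""
    board = {}
    for file_, piece in zip(FILES, "RNBQKBNR"):
        board[file_ + "1"] = (WHITE, piece)
        board[file_ + "2"] = (WHITE, "P")
        board[file_ + "7"] = (BLACK, "P")
        board[file_ + "8"] = (BLACK, piece)
    return board


def flatten(board):
    """Turn a board into one flat row of <square>_color / <square>_piece values."""
    row = {}
    for rank in RANKS:
        for file_ in FILES:
            # Squares missing from the dict are empty: color 0, piece ''.
            color, piece = board.get(file_ + rank, (EMPTY, ""))
            row[f"{file_}{rank}_color"] = color
            row[f"{file_}{rank}_piece"] = piece
    return row


def count_pieces(board):
    """Count each piece type per side, e.g. WhitePawns or BlackRooks."""
    names = {"P": "Pawns", "R": "Rooks", "B": "Bishops", "N": "Knights", "Q": "Queen"}
    counts = {}
    for side, label in ((WHITE, "White"), (BLACK, "Black")):
        for letter, name in names.items():
            counts[label + name] = sum(
                1 for color, piece in board.values()
                if color == side and piece == letter
            )
    return counts


board = starting_board()
row = flatten(board)
print(len(row))                            # 64 squares x 2 columns each
print(row["d8_color"], row["d8_piece"])    # black queen on its home square
print(count_pieces(board)["WhitePawns"])
```

Removing `d8` from the dict and flattening again gives `d8_color = 0` and `d8_piece = ''`, which matches the last row of the table above.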

Result takes one of three values: 1 = White win, 0.5 = Draw, and 0 = Black win.  This opens up interesting possibilities, since I could use either classification or regression.

Classification can predict the winner, but regression has the advantage of producing a graded value that indicates who is likely winning in the current board state.
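To make the two framings concrete, here is a minimal sketch of how the same `Result` value could serve as either kind of target. The label strings and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical class names for the three Result values.
CLASS_LABELS = {1.0: "white_win", 0.5: "draw", 0.0: "black_win"}


def classification_target(result):
    """Treat Result as one of three discrete classes."""
    return CLASS_LABELS[float(result)]


def regression_target(result):
    """Treat Result as a number between 0 (Black win) and 1 (White win)."""
    return float(result)


results = [1.0, 0.5, 0.0]
class_targets = [classification_target(r) for r in results]
numeric_targets = [regression_target(r) for r in results]
print(class_targets)
print(numeric_targets)
```

With the classification target, a draw is just a third category. With the regression target, a draw sits halfway between the two wins, so predictions between the extremes have a natural reading.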

### Objectives for Analyzing the Dataset

Using this dataset, I would like a metric of who is currently winning the game and by how much.  Hopefully I will be able to bundle both into one regression model whose output is a value between 0 and 1, with values close to 0 meaning black is winning and values close to 1 meaning white is winning.
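One possible way to read such an output is sketched below, assuming the model returns a single number per board state. The `interpret` function and its margin formula (distance from 0.5, rescaled to the range 0 to 1) are illustrative assumptions, not part of an existing model.

```python
def interpret(prediction):
    """Map a regression output to (leader, margin).

    margin is 0.0 for an even position and 1.0 for a certain win.
    """
    # A regression model can overshoot, so clip to the valid range first.
    p = min(1.0, max(0.0, prediction))
    margin = abs(p - 0.5) * 2
    if p > 0.5:
        leader = "white"
    elif p < 0.5:
        leader = "black"
    else:
        leader = "even"
    return leader, margin


for value in (0.75, 0.25, 0.5, 1.2):
    print(value, interpret(value))
```

Clipping matters because a plain regression is not guaranteed to stay within 0 and 1, even though every training target does.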

I will obviously not have every possible position, so my dataset will be limited.  The player with the higher rating will likely win most of the time, but that player will also likely be making better moves and reaching better board states, so the outcome should still be reflected in the positions.

I also hope that the model will be most accurate when using all of the features, so that any possible move can produce a different output from the model.