<a href="https://colab.research.google.com/github/eireford/chess_data/blob/main/Chess_Data_Preprocessing_first_pass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chess Data Analysis
*Preprocessing first pass*

Thanks to the **Chess Research Project** for providing the raw data.
https://chess-research-project.readthedocs.io/en/latest/

Author: Eire Ford eireford@gmail.com [eireford.com](http://eireford.com)

In [None]:
!pip install dask[dataframe] --upgrade

In [None]:
import os
import dask.dataframe as dd
import pandas as pd
import numpy as np
    
filename = "all.pgn.zip"
source = "https://storage.googleapis.com/eire_ford_chess_data/chess-research-project/" + filename

if os.path.isfile(filename):
    print("File exists.")
else:
    print("File does not exist. Downloading...")
    # Fetch the archive, then extract all.pgn from it.
    !wget {source}
    !unzip {filename}
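
Outside Colab, where the `!` shell escapes are not available, the same download-if-missing step can be sketched with the standard library alone. This is a minimal sketch, not the notebook's own method: `fetch_and_extract` and its parameters are hypothetical names, and unlike the cell above it extracts the archive on every call rather than only after a fresh download.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url, filename, dest_dir="."):
    """Download `url` to `filename` unless it already exists, then unzip it.

    Hypothetical helper for illustration; returns the archive's member names.
    """
    if not os.path.isfile(filename):
        # Skip the download when a local copy is already present.
        urllib.request.urlretrieve(url, filename)
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

With the variables defined above, a call would look like `fetch_and_extract(source, filename)`.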

In [None]:
raw = dd.read_table('all.pgn',sep='\n',encoding='ISO-8859-1',header=None)

Dask reads the file in as a single column, which Dask maps to the index of a series with undefined values. The resulting index is of type object and possibly less efficient than an integer range-based index. Rather than reindexing, copy the autogenerated index to a new column 'description' and leave the index out of the next write to file. On the next file read, a new integer index will be auto-generated.

https://stackoverflow.com/questions/46174556/can-i-set-the-index-column-when-reading-a-csv-using-python-dask

In [None]:
raw['description'] = raw[0]

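Each chess match is described by sequential rows of text, with each attribute on its own line. We are most interested in the game outcome and the list of moves, but we also keep the event name for convenience. Each of the next three cells flags one row type with a regular expression that is matched at the start of the line.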

In [None]:
raw['isMove'] = raw[0].str.match('\\t')

In [None]:
raw['isOutcome'] = raw[0].str.match('[0-9]+\\-[0-9]+')

In [None]:
raw['isEvent'] = raw[0].str.match('\\[Event ')
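
The three flags above can be checked on a handful of lines with the standard `re` module, since `str.match` anchors the pattern at the start of each string just as `re.match` does. This is a minimal sketch: the sample lines are made up for illustration and the real file's layout may differ; `classify` is a hypothetical helper.

```python
import re

# Hypothetical sample lines; the real file's layout may differ.
sample_lines = [
    '[Event "Example Open"]',
    '[Site "Nowhere"]',
    '1-0',
    '1/2-1/2',
    '\t1. e4 e5 2. Nf3 Nc6',
]

# Same patterns as the cells above, written as raw strings.
patterns = {
    "isEvent": re.compile(r"\[Event "),
    "isOutcome": re.compile(r"[0-9]+\-[0-9]+"),
    "isMove": re.compile(r"\t"),
}


def classify(line):
    """Return one boolean per flag; re.match only tests the start of the line."""
    return {name: bool(pattern.match(line)) for name, pattern in patterns.items()}


flags = [classify(line) for line in sample_lines]

# Keep a line when at least one flag is set, mirroring the filter applied before saving.
kept = [line for line, f in zip(sample_lines, flags) if any(f.values())]
```

Under these assumptions the `[Site ...]` line is dropped, and so is a result written as `1/2-1/2`, because the outcome pattern expects digits immediately before the hyphen. Whether such lines occur in the source file is not shown here.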

Keep only the event, move, and outcome rows, and save them with their flag columns as a Parquet dataset.

In [None]:
keep = raw['isEvent'] | raw['isMove'] | raw['isOutcome']
raw[keep][['description', 'isEvent', 'isOutcome', 'isMove']].to_parquet('all_events_first_pass', write_index=False)

Tar, gzip, and upload to Google Cloud Storage.

In [None]:
!tar -czvf all_events_first_pass.tar.gz ./all_events_first_pass
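
The archive step can also be sketched with Python's `tarfile` module for environments without a `tar` binary. This is a minimal sketch under assumptions: `make_tar_gz` is a hypothetical helper, and it stores members under the bare directory name rather than with the leading `./` that the shell command above records.

```python
import os
import tarfile


def make_tar_gz(source_dir, archive_path):
    """Write `source_dir` into a gzip-compressed tar and return its member names.

    Hypothetical helper for illustration.
    """
    # Store members under the directory's own name, not its full path.
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname=top_level)
    # Reopen the archive to list what was actually written.
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

For the dataset written above, the call would be `make_tar_gz('all_events_first_pass', 'all_events_first_pass.tar.gz')`.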

In [None]:
from google.colab import auth
auth.authenticate_user()
!gcloud init

In [None]:
!gsutil -m cp ./all_events_first_pass.tar.gz gs://eire_ford_chess_data/