<a href="https://colab.research.google.com/github/eireford/chess_data/blob/main/Chess_Data_Preprocessing_One.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chess Data Analysis by Eire Ford
*Preprocessing One*

Thanks to the **Chess Research Project** for providing the raw data.
https://chess-research-project.readthedocs.io/en/latest/

In [None]:
import os

if os.path.isfile('/content/all.pgn.zip'):
    print("File exists.")
else:
    print("File does not exist. Downloading...")
    !gsutil cp gs://eire_ford_chess_data/chess-research-project/all.pgn.zip ./

In [None]:
!unzip ./all.pgn.zip

In [None]:
!pip install "dask[dataframe]" --upgrade

In [None]:
import dask.dataframe as dd

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
all = dd.read_table('all.pgn',sep='\n',encoding='ISO-8859-1',header=None)

Dask reads the file in as a single column, which Dask maps to the index of a series with undefined values. The resulting index is of type object and is possibly less efficient than an integer range-based index. Rather than reindexing, copy the autogenerated index to a new column, 'description', and leave the index out of the next write to file. On the next file read, a new integer index will be auto-generated.

https://stackoverflow.com/questions/46174556/can-i-set-the-index-column-when-reading-a-csv-using-python-dask

In [None]:
# Create a description column from the raw text column (column 0)
all['description'] = all[0]

Each chess match is described by sequential rows of text, with each attribute on its own line. We are most interested in the game outcome and the list of moves.
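
To make the row types concrete, here is a minimal sketch that applies the same two patterns used in the cells below to a handful of lines with the standard `re` module. `re.match` anchors at the start of the string, which is also how pandas `str.match` behaves. The sample lines and the `classify` helper are invented for illustration and are assumptions about the layout, not rows taken from the actual file.

```python
import re

# Hypothetical sample lines; the real file layout may differ.
sample_lines = [
    '[Event "Example Open"]',
    '[White "Player A"]',
    "1-0",
    "\te4 e5 Nf3 Nc6",
    "1/2-1/2",
]

MOVE = re.compile(r"\t")                 # line starts with a tab
OUTCOME = re.compile(r"[0-9]+\-[0-9]+")  # digits, hyphen, digits at line start


def classify(line):
    """Return 'move', 'outcome', or 'other' for a single line."""
    if MOVE.match(line):
        return "move"
    if OUTCOME.match(line):
        return "outcome"
    return "other"


labels = [classify(line) for line in sample_lines]
print(labels)
```

In this sketch the last sample line, `1/2-1/2`, is labelled `other`: the outcome pattern expects a hyphen directly after the first run of digits, and here a slash follows instead. Whether the raw file writes draws that way is not established here, so it may be worth checking how drawn games appear before relying on the outcome flag.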

In [None]:
all['isMove'] = all[0].str.match('\\t')

In [None]:
all['isOutcome'] = all[0].str.match('[0-9]+\\-[0-9]+')

Mark and keep the `[Event ` line that begins each match as an additional check on data completeness.
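
One way such a completeness check could work is to compare event rows against outcome rows. The sketch below does this over plain Python tuples rather than the Dask frame. It assumes each match contributes one event row followed later by one outcome row; that ordering, the sample rows, and the `count_incomplete` helper are all hypothetical.

```python
# Hypothetical flagged rows as (isEvent, isOutcome) pairs, in file order.
rows = [
    (True, False),   # event row opens game 1
    (False, True),   # outcome of game 1
    (True, False),   # event row opens game 2
    (True, False),   # event row opens game 3; game 2 never got an outcome
    (False, True),   # outcome of game 3
]


def count_incomplete(rows):
    """Count games whose event row is not followed by an outcome row
    before the next event row or the end of the data."""
    incomplete = 0
    open_game = False
    for is_event, is_outcome in rows:
        if is_event:
            if open_game:
                incomplete += 1  # previous game ended without an outcome
            open_game = True
        elif is_outcome:
            open_game = False
    if open_game:
        incomplete += 1  # last game ended without an outcome
    return incomplete


n_events = sum(1 for is_event, _ in rows if is_event)
n_outcomes = sum(1 for _, is_outcome in rows if is_outcome)
print(n_events, n_outcomes, count_incomplete(rows))
```

If the two counts disagree, or the incomplete count is nonzero, some matches are probably missing a line or the patterns are missing some rows.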

In [None]:
all['isEvent'] = all[0].str.match('\\[Event ')

Save out the typed and filtered dataset. 

In [None]:
keep = all['isEvent'] | all['isMove'] | all['isOutcome']
all[keep][['description', 'isEvent', 'isOutcome', 'isMove']].to_parquet('all_first_pass', write_index=False)

Tar, gzip, and upload to Google Cloud Storage.

In [None]:
!tar -czvf all_first_pass.tar.gz ./all_first_pass

In [None]:
from google.colab import auth
auth.authenticate_user()
!gcloud init

In [None]:
!gsutil -m cp ./all_first_pass.tar.gz gs://eire_ford_chess_data/