## 20daysofcode challenge
**ML | DAY 9: CHESS END-GAME**
- Load the data into a pandas dataframe and insert the column headers, with the target column labeled `target`.
- Implement a 70-30 train-test split on the data without using the sklearn method.
- Preprocess the data.
- Select (engineer) 18 features for training and suggest an algorithm.

Dataset: https://github.com/Fortune-Adekogbe/30-Days-of-ML/tree/master/Day-9

## Import Libraries

In [16]:
import pandas as pd
import numpy as np

from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

## Load Dataset
We load the data into a pandas dataframe.

In [17]:
chess = pd.read_csv('kr-vs-kp1.csv')
chess.head()

Unnamed: 0,f,f.1,f.2,f.3,f.4,f.5,f.6,f.7,f.8,f.9,...,f.23,f.24,f.25,f.26,f.27,f.28,t.2,t.3,n.1,won
0,f,f,f,f,t,f,f,f,f,f,...,f,f,f,f,f,f,t,t,n,won
1,f,f,f,f,t,f,t,f,f,f,...,f,f,f,f,f,f,t,t,n,won
2,f,f,f,f,f,f,f,f,t,f,...,f,f,f,f,f,f,t,t,n,won
3,f,f,f,f,f,f,f,f,f,f,...,f,f,f,f,f,f,t,t,n,won
4,f,f,f,f,f,f,f,f,f,f,...,f,f,t,f,f,f,t,t,n,won


In [18]:
chess.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3195 entries, 0 to 3194
Data columns (total 37 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   f       3195 non-null   object
 1   f.1     3195 non-null   object
 2   f.2     3195 non-null   object
 3   f.3     3195 non-null   object
 4   f.4     3195 non-null   object
 5   f.5     3195 non-null   object
 6   f.6     3195 non-null   object
 7   f.7     3195 non-null   object
 8   f.8     3195 non-null   object
 9   f.9     3195 non-null   object
 10  f.10    3195 non-null   object
 11  f.11    3195 non-null   object
 12  l       3195 non-null   object
 13  f.12    3195 non-null   object
 14  n       3195 non-null   object
 15  f.13    3195 non-null   object
 16  f.14    3195 non-null   object
 17  t       3195 non-null   object
 18  f.15    3195 non-null   object
 19  f.16    3195 non-null   object
 20  f.17    3195 non-null   object
 21  f.18    3195 non-null   object
 22  f.19    3195 non-null   

- There are 3195 samples, 37 columns, and no missing values ... our dataset is clean!
- The column names look wrong compared with the details given [here](https://github.com/Fortune-Adekogbe/30-Days-of-ML/blob/master/Day-9/kr-vs-kp1.csv): pandas appears to have read the first data row as the header. Hence our first task will be to rename them.

### Rename features

In [19]:
chess.columns

Index(['f', 'f.1', 'f.2', 'f.3', 'f.4', 'f.5', 'f.6', 'f.7', 'f.8', 'f.9',
       'f.10', 'f.11', 'l', 'f.12', 'n', 'f.13', 'f.14', 't', 'f.15', 'f.16',
       'f.17', 'f.18', 'f.19', 'f.20', 'f.21', 't.1', 'f.22', 'f.23', 'f.24',
       'f.25', 'f.26', 'f.27', 'f.28', 't.2', 't.3', 'n.1', 'won'],
      dtype='object')

In [20]:
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd', 
            'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr', 'skrxp',
            'spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg', 'target']
chess.columns = features

In [21]:
chess.columns

Index(['bkblk', 'bknwy', 'bkon8', 'bkona', 'bkspr', 'bkxbq', 'bkxcr', 'bkxwp',
       'blxwp', 'bxqsq', 'cntxt', 'dsopp', 'dwipd', 'hdchk', 'katri', 'mulch',
       'qxmsq', 'r2ar8', 'reskd', 'reskr', 'rimmx', 'rkxwp', 'rxmsq', 'simpl',
       'skach', 'skewr', 'skrxp', 'spcop', 'stlmt', 'thrsk', 'wkcti', 'wkna8',
       'wknck', 'wkovl', 'wkpos', 'wtoeg', 'target'],
      dtype='object')
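
The rename above could also be folded into the load step. Names such as `f`, `f.1` and `won` suggest the file has no header row, so `read_csv` used the first record as column names. A minimal sketch, assuming that is the case, passes `header=None` together with `names` so that no record is lost. The inline three-column sample and the shortened `names` list are hypothetical stand-ins for the real file and the full 37-name list.

```python
import io

import pandas as pd

# Hypothetical 3-column stand-in for the headerless kr-vs-kp1.csv file
sample_csv = io.StringIO("f,f,won\nf,t,won\nt,f,nowin\n")

# Shortened, illustrative name list; the real file needs all 37 names
names = ["bkblk", "bknwy", "target"]

# header=None tells pandas that the first line is data, not column names
sample = pd.read_csv(sample_csv, header=None, names=names)
print(sample.shape)  # all three records are kept
```

With the real file, the same call would take the file name in place of `sample_csv` and the full `features` list in place of `names`.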

In [22]:
chess.head()

Unnamed: 0,bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,...,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg,target
0,f,f,f,f,t,f,f,f,f,f,...,f,f,f,f,f,f,t,t,n,won
1,f,f,f,f,t,f,t,f,f,f,...,f,f,f,f,f,f,t,t,n,won
2,f,f,f,f,f,f,f,f,t,f,...,f,f,f,f,f,f,t,t,n,won
3,f,f,f,f,f,f,f,f,f,f,...,f,f,f,f,f,f,t,t,n,won
4,f,f,f,f,f,f,f,f,f,f,...,f,f,t,f,f,f,t,t,n,won


### One-Hot Encoding
We convert the categorical data into numerical data by creating 0/1 indicator (dummy) columns.

In [23]:
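# dtype=int keeps the indicators as 0/1 integers (newer pandas versions default to booleans)
chess = pd.get_dummies(chess, drop_first=True, dtype=int)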
chess.head()

Unnamed: 0,bkblk_t,bknwy_t,bkon8_t,bkona_t,bkspr_t,bkxbq_t,bkxcr_t,bkxwp_t,blxwp_t,bxqsq_t,...,spcop_t,stlmt_t,thrsk_t,wkcti_t,wkna8_t,wknck_t,wkovl_t,wkpos_t,wtoeg_t,target_won
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,1
1,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,1,1,0,1
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,1,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,1,0,1


Next, we strip the `_t` and `_won` suffixes so that most columns get their original names back.

In [24]:
chess.columns = chess.columns.str.replace("_t", "").str.replace("_won", "")
chess.columns

Index(['bkblk', 'bknwy', 'bkon8', 'bkona', 'bkspr', 'bkxbq', 'bkxcr', 'bkxwp',
       'blxwp', 'bxqsq', 'cntxt', 'dsopp', 'dwipd_l', 'hdchk', 'katri_n',
       'katri_w', 'mulch', 'qxmsq', 'r2ar8', 'reskd', 'reskr', 'rimmx',
       'rkxwp', 'rxmsq', 'simpl', 'skach', 'skewr', 'skrxp', 'spcop', 'stlmt',
       'thrsk', 'wkcti', 'wkna8', 'wknck', 'wkovl', 'wkpos', 'wtoeg',
       'target'],
      dtype='object')
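
The `dwipd_l`, `katri_n` and `katri_w` names come from features that are not plain `t`/`f` flags. With `drop_first=True`, `get_dummies` drops the first level (in sorted order) of each column and keeps one indicator per remaining level. A small sketch on made-up rows (the values are illustrative, not taken from the dataset) shows the effect:

```python
import pandas as pd

# Made-up rows: one two-level feature and one three-level feature
toy = pd.DataFrame({
    "bkblk": ["f", "t", "f"],
    "katri": ["b", "n", "w"],
})

# dtype=int gives 0/1 columns regardless of the pandas version
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())  # ['bkblk_t', 'katri_n', 'katri_w']
```

A two-level feature ends up as a single column, while a three-level feature needs two indicators, which is why the encoded frame has 38 columns rather than 37.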

In [25]:
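# Rank the features by Spearman correlation with the target.
# head(19) keeps the target itself (correlation 1.0) plus the top 18 features.
features_18 = chess.corr('spearman')['target'].sort_values(ascending=False).head(19)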
features_18

target     1.000000
rimmx      0.452506
katri_w    0.208207
wkpos      0.146392
bkxbq      0.140117
wkcti      0.126511
rkxwp      0.101570
dwipd_l    0.099383
cntxt      0.068404
simpl      0.045852
thrsk      0.040347
reskd      0.030871
dsopp      0.015495
qxmsq      0.012269
reskr      0.007640
bknwy      0.003760
bkblk      0.001238
wtoeg      0.000131
bkona     -0.012707
Name: target, dtype: float64
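
Sorting by signed correlation places strongly negative features at the bottom of the list, even though they can carry as much information about the target as strongly positive ones. One possible alternative, sketched here on synthetic data with hypothetical column names, ranks by absolute Spearman correlation instead:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)

# Synthetic binary features (hypothetical names, not from the dataset)
toy = pd.DataFrame({
    "same": target,                         # identical to the target
    "opposite": 1 - target,                 # perfectly negatively related
    "noise": rng.integers(0, 2, size=200),  # unrelated to the target
    "target": target,
})

# Rank by the magnitude of the correlation, ignoring its sign
corr = toy.corr(method="spearman")["target"].drop("target")
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

On the chess data, `ranked.head(18).index` would then give 18 feature names chosen by strength of association rather than by sign.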

In [26]:
chess_processed = chess[features_18.index]
chess_processed.head()

Unnamed: 0,target,rimmx,katri_w,wkpos,bkxbq,wkcti,rkxwp,dwipd_l,cntxt,simpl,thrsk,reskd,dsopp,qxmsq,reskr,bknwy,bkblk,wtoeg,bkona
0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0


### Splitting dataset into Training and Test sets

In [28]:
# The dataset looks ordered ('won' on one end, 'nowin' on the other), so we shuffle it
# before splitting to get a representative split; random_state makes the shuffle reproducible
chess_shuffled = shuffle(chess_processed, random_state=42)

In [29]:
#select 70% of the data for training
train_length = round(len(chess_shuffled) * 0.7)
train = chess_shuffled[:train_length].copy()

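#use the remaining 30% of the data for testing
test = chess_shuffled[train_length:].copy()
#keep the test labels for later evaluation before dropping them from the features
Y_test = test['target'].astype(int)
test.drop(labels=['target'], axis=1, inplace=True)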

#split training set into X_train and Y_train
X_train = train.drop(labels=['target'], axis=1)
Y_train = train['target'].astype(int)
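
Since the challenge asks for a split without the sklearn method, the shuffle itself can also be done with NumPy alone. The helper below is a sketch: its name `manual_split`, the `train_frac` and `seed` parameters, and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def manual_split(df, train_frac=0.7, seed=0):
    """Shuffle row positions with NumPy and cut them at train_frac."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))    # random order of row positions
    cut = round(len(df) * train_frac)   # number of training rows
    return df.iloc[order[:cut]], df.iloc[order[cut:]]


# Toy frame standing in for chess_processed
toy = pd.DataFrame({"rimmx": [0, 1] * 5, "target": [1, 0] * 5})
train, test = manual_split(toy)

# Keep the labels of both parts so a model can be evaluated later
X_train, Y_train = train.drop(columns="target"), train["target"]
X_test, Y_test = test.drop(columns="target"), test["target"]
print(len(train), len(test))  # 7 3
```

Fixing the seed keeps the split reproducible between notebook runs.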

### Data Preprocessing
#### Feature Scaling

In [30]:
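# standardize the features: fit the scaler on the training set only,
# then reuse the same mean and standard deviation for the test set
scaler_x = StandardScaler()
X_train_sc = scaler_x.fit_transform(X_train)
test_sc = scaler_x.transform(test)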

# the target is a binary class label, so it is left unscaled;
# flatten it to the 1-D shape that sklearn classifiers expect
Y_train = Y_train.values.ravel()
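
Standardization subtracts each column's mean and divides by its standard deviation. In the usual workflow those statistics are estimated on the training set and then reused for the test set, so both sets share one scale. A NumPy-only sketch of that arithmetic on made-up numbers:

```python
import numpy as np

# Made-up feature matrices (rows are samples, columns are features)
X_train = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
X_test = np.array([[1.0, 1.0], [0.0, 1.0]])

# Statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_sc = (X_train - mean) / std
X_test_sc = (X_test - mean) / std  # reuse the training mean and std
print(X_test_sc)
```

Here both columns have mean 0.5 and standard deviation 0.5, so the test rows map to `[1, 1]` and `[-1, 1]`.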

## Algorithm Suggestion:
We note that:
- Labels are provided in the dataset, so we need a supervised machine learning algorithm.
- The challenge is a binary classification problem, hence we need to build a *classification model*, which further narrows down the kind of supervised ML algorithm we should consider.

My first instinct would be to use a **Decision Tree/Random Forest** algorithm, as it evaluates many decision paths over the features to predict an outcome.
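
As a rough illustration of that suggestion, the sketch below fits scikit-learn's `RandomForestClassifier` on synthetic data with 18 binary features. The data, the rule that generates the label, and the hyperparameter values are all assumptions made for the example; nothing here was run on the chess dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 18 binary features, label driven by the first one
X = rng.integers(0, 2, size=(300, 18))
y = X[:, 0]

# Manual 70-30 split of the already random rows
cut = round(len(X) * 0.7)
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# n_estimators and random_state are illustrative choices, not tuned values
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

On the real data, `X_train_sc` and `Y_train` would take the place of the synthetic arrays, and the held-out 30% would provide the accuracy estimate.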