# Preprocessing a Dataset

A vital first step in many machine learning tasks is preprocessing your dataset. In this notebook we will run an experiment using two types of dataset preprocessing: min-max normalization (called "regularization" throughout this notebook, although that term more commonly refers to penalizing model complexity) and feature standardization (z-score scaling). We will then train a simple neural network on data prepared with each method and compare the results to a baseline trained on the non-preprocessed dataset.

We will train three different models using data transformed in these ways:
- No preprocessing (baseline)
- Regularization
- Feature Standardization

Regularization (normalization) transforms all of the values present in the dataset so that they lie between 0.0 and 1.0, by subtracting the minimum value and dividing by the range (maximum minus minimum). Sometimes it is helpful to transform these values so they lie between -1.0 and 1.0 instead.
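
As a minimal sketch of the rescaling described above: the array `X` and the helper name `min_max_scale` are made up for illustration, and this version scales each feature column separately, which is an assumption rather than what the preprocessing cell later in this notebook does.

```python
import numpy as np

def min_max_scale(X, low=0.0, high=1.0):
    """Rescale each column of X linearly into [low, high] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Use a span of 1.0 for constant columns to avoid dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    unit = (X - col_min) / span          # values now lie in [0, 1]
    return unit * (high - low) + low     # stretch and shift into [low, high]

# Toy data, not taken from the Pokemon dataset.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

print(min_max_scale(X))             # each column becomes 0.0, 0.5, 1.0
print(min_max_scale(X, -1.0, 1.0))  # each column becomes -1.0, 0.0, 1.0
```

Passing `low=-1.0, high=1.0` covers the -1.0 to 1.0 variant mentioned above without a second function.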

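Feature standardization rescales each feature so that it has zero mean (zero-centered) and unit variance. For each feature, this is achieved by subtracting the mean of that feature across the dataset and then dividing by that feature's standard deviation. Note that this only shifts and scales the data; it does not change the shape of the distribution, so a standardized feature is Gaussian only if it was Gaussian to begin with.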
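
The per-feature arithmetic can be sketched on a toy array as follows. The helper name `standardize` and the data are hypothetical; the optional `mean` and `std` arguments are an assumption added so that statistics computed on one split could be reused on another.

```python
import numpy as np

def standardize(X, mean=None, std=None):
    """Z-score each column of X (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0) if std is None else std
    # Constant columns have std 0; divide by 1.0 so they become 0 instead of NaN.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Toy data, not taken from the Pokemon dataset.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

Z = standardize(X)
print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```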

In [9]:
# imports, etc...
import math
import csv
import numpy as np
from sklearn import metrics
from sklearn.utils import shuffle
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical  # np_utils was removed in recent Keras releases

## Split Data into Train/Test
For this task we are attempting to classify Pokemon by type (grass, water, etc.) given 6 statistics such as HP, Attack, and Speed as features. I recognize this isn't the most exciting classification problem, but it was a weekly coding challenge from Siraj Raval's [dataset preparation video](https://www.youtube.com/watch?v=0xVqLJe9_CY). The 5-7 participants who submitted results seemed to reach classification accuracies from ~14% to 75%, with a mode of ~30%.

Our first step is to parse the `Pokemon.csv`, extract only the features that we want, and then split our data into testing/training sets.

In [10]:
np.random.seed(1337) # l337 4 lyf3

features, labels = ([], [])
with open('../data/Pokemon.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        features.append([*row[5:11]]) 
        labels.append(row[2])

# remove column names
features.pop(0)
labels.pop(0)

# convert labels from strings to ints, then to one-hot vectors
# (sorted so the label-to-int mapping is the same on every run)
uniq_labels = sorted(set(labels))
label_2_int = {label: i for i, label in enumerate(uniq_labels)}
labels = to_categorical([label_2_int[l] for l in labels])

# shuffle the data
# note: the np.float / np.int aliases were removed in NumPy 1.24; the builtins work everywhere
features, labels = shuffle(np.array(features).astype(float),
                           np.array(labels).astype(int), random_state=0)

#split 75% training, 25% testing
split = math.floor(len(labels) * 0.75) 
train_X, train_y = (features[0:split], labels[0:split])
test_X, test_y = (features[split:], labels[split:])

print('{} training samples'.format(len(train_X)))
print('{} testing samples'.format(len(test_X)))

600 training samples
200 testing samples
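
The shuffle-and-slice step above could also be sketched with scikit-learn's `train_test_split`. Synthetic arrays stand in for the Pokemon features and labels here, and the `stratify` argument is an addition that the original cell does not use; it keeps the class proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real data: 800 rows, 6 features, 18 classes.
rng = np.random.RandomState(0)
features = rng.rand(800, 6)
class_ids = rng.randint(0, 18, size=800)

# 75% training, 25% testing, shuffled with a fixed seed for reproducibility.
train_X, test_X, train_y, test_y = train_test_split(
    features, class_ids, test_size=0.25, random_state=0, stratify=class_ids)

print(train_X.shape, test_X.shape)  # (600, 6) (200, 6)
```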


## Preprocessing

In [11]:
# baseline, no preprocessing at all, not even normalization
train_X_base = np.copy(train_X)
test_X_base = np.copy(test_X)

# regularization (min-max normalization): scale values to roughly 0.0-1.0 using
# the single global min and max of the training set (not per feature)
train_X_norm = np.nan_to_num((train_X - np.min(train_X)) / (np.max(train_X) - np.min(train_X)))
test_X_norm = np.nan_to_num((test_X - np.min(train_X)) / (np.max(train_X) - np.min(train_X)))

# feature standardization
# Note: This can also be achieved with sklearn's preprocessing.scale(...) or
# preprocessing.StandardScaler(...), but I chose to do it with vanilla numpy
# here for demonstration.
train_X_std = np.copy(train_X)
test_X_std = np.copy(test_X)
# for each feature, subtract the mean and divide by the std dev
train_X_std = np.nan_to_num((train_X_std - np.mean(train_X, axis=0)) / np.std(train_X, axis=0))
# note: we use the mean and std dev of the training set (train_X) because the
# test set is theoretically yet "unseen" by the algorithm
test_X_std = np.nan_to_num((test_X_std - np.mean(train_X, axis=0)) / np.std(train_X, axis=0))
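
For comparison, here is a hedged sketch of the scikit-learn route mentioned in the comment above. Random arrays stand in for `train_X` and `test_X`, so the numbers are not the notebook's; the point is only the fit-on-train, transform-both pattern.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
train_X = rng.rand(600, 6) * 100   # synthetic stand-in for the training features
test_X = rng.rand(200, 6) * 100    # synthetic stand-in for the test features

# Fit on the training split only, then reuse the fitted statistics on the test split.
std_scaler = StandardScaler().fit(train_X)
train_X_std = std_scaler.transform(train_X)
test_X_std = std_scaler.transform(test_X)

# Matches the manual NumPy version (both use the population std, ddof=0).
manual = (train_X - train_X.mean(axis=0)) / train_X.std(axis=0)
print(np.allclose(train_X_std, manual))  # True

# MinMaxScaler works per feature, unlike the single global min/max used above.
mm_scaler = MinMaxScaler().fit(train_X)
train_X_norm = mm_scaler.transform(train_X)
test_X_norm = mm_scaler.transform(test_X)  # may fall slightly outside [0, 1]
```

Because the scalers are fitted on the training split, transformed test values are not guaranteed to stay inside the training range.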

## Training the Classifiers
This can take a while depending on the resources you have access to. If you see an asterisk in the `In [*]` label to the left, the cell below is still running. Upon completion you should see three printed statements below it. If training takes far too long, try fewer neurons per layer: `Dense(XXX, ...)`

In [None]:
def get_model(num_inputs):
    # Keras 2 renamed `init` to `kernel_initializer` (and `nb_epoch` to `epochs`)
    model = Sequential()
    model.add(Dense(128, input_dim=num_inputs, kernel_initializer='random_normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, kernel_initializer='random_normal', activation='relu'))
    model.add(Dropout(0.5))
    # 18 output units, one per Pokemon type in the labels
    model.add(Dense(18, kernel_initializer='random_normal', activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

# one freshly initialized model per dataset variant
clf_base = get_model(train_X.shape[1])
clf_norm = get_model(train_X.shape[1])
clf_std = get_model(train_X.shape[1])

clf_base.fit(train_X_base, train_y, epochs=1000, verbose=0)
print('Base classifier training complete.')
clf_norm.fit(train_X_norm, train_y, epochs=1000, verbose=0)
print('Regularized classifier training complete.')
clf_std.fit(train_X_std, train_y, epochs=1000, verbose=0)
print('Standardized classifier training complete.')

Base classifier training complete.
Regularized classifier training complete.


## Evaluation

In [15]:
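def print_accuracy(name, expected, predicted):
    # count samples whose most probable predicted class matches the true class
    correct = 0
    for i in range(len(predicted)):
        if np.argmax(predicted[i]) == np.argmax(expected[i]):
            correct += 1
    # divide by the number of samples passed in, not the global test_y
    print('Accuracy of %s: %.2f' % (name, correct / len(expected)))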

pred_base = clf_base.predict(test_X_base)
pred_norm = clf_norm.predict(test_X_norm)
pred_std = clf_std.predict(test_X_std)

# print(np.sum(pred_norm, axis=1))

print_accuracy('Base', test_y, pred_base)
print_accuracy('Normalized', test_y, pred_norm)
print_accuracy('Standardized', test_y, pred_std)

y = [np.argmax(pred) for pred in test_y]
base = [np.argmax(pred) for pred in pred_base]
norm = [np.argmax(pred) for pred in pred_norm]
std = [np.argmax(pred) for pred in pred_std]

# uncomment for more verbose logging...
message = "\nClassification report for %s:\n%s\n"
# print(message % ('base', metrics.classification_report(y, base)))
# print(message % ('norm', metrics.classification_report(y, norm)))
# print(message % ('std', metrics.classification_report(y, std)))

Accuracy of Base: 0.30
Accuracy of Normalized: 0.28
Accuracy of Standardized: 0.24
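
The counting loop in `print_accuracy` can also be written without an explicit loop. The sketch below uses toy one-hot targets and made-up probabilities rather than the notebook's predictions, and cross-checks the result against scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy one-hot targets and predicted probabilities: 4 samples, 3 classes.
expected = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0]])
predicted = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.5, 0.3]])

true_ids = expected.argmax(axis=1)   # class index of each one-hot row
pred_ids = predicted.argmax(axis=1)  # most probable class per sample

print(np.mean(true_ids == pred_ids))       # 0.75
print(accuracy_score(true_ids, pred_ids))  # 0.75
```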


As you can see, the regularized and standardized datasets performed similarly to, or slightly worse than, the baseline (0.28 and 0.24 versus 0.30). I've found that this is sometimes the case: just because data is preprocessed doesn't mean that the model will necessarily perform better than it does on the original data. That said, preprocessing is a very important step in the ML pipeline, and you should always try it to at least see whether you can get a better result than with the original data. Some datasets and algorithms need it more than others.