# Data Preprocessing
1. Balance the dataset
2. Divide the dataset into training, validation, and test sets
3. Save the data in a tensor-friendly format

### Balancing the dataset
Say we have a dataset with pictures of cats and dogs where 90% of the pictures are cats. 

Our machine learning algorithm may find that the easiest way to maximise accuracy is to say all pictures are cats, as this alone gives 90% accuracy. Such a model has learned nothing about telling cats from dogs.

We refer to the initial probability of picking a photo of some class as a prior. The priors in the above example are 0.9 for cats and 0.1 for dogs. The priors are balanced when 50% of the photos are cats and 50% are dogs. The example above therefore has unbalanced priors, and that imbalance is what makes the all-cats answer look good.
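
To make the numbers concrete, here is a minimal sketch that computes the priors and the accuracy of an "always predict the majority class" answer. The `labels` array is made up for illustration (0 for cat, 1 for dog) and is not part of the Audiobooks data.

```python
import numpy as np

# Hypothetical labels for illustration: 0 = cat, 1 = dog, 90% cats
labels = np.array([0] * 90 + [1] * 10)

# Priors are the relative class frequencies
priors = np.bincount(labels) / labels.size

# A "model" that always predicts the most frequent class
majority_class = int(np.argmax(priors))
baseline_accuracy = float(np.mean(labels == majority_class))

print(priors)             # class frequencies for cats and dogs
print(baseline_accuracy)  # accuracy of always answering "cat"
```

Any real model trained on such data has to beat this baseline before its accuracy means anything.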

In our business case, exploring the targets shows that most customers did not convert, so we must balance the dataset before we proceed. We do this by counting the targets that are 1 and keeping only the same number of 0s.

### Import libraries

In [1]:
import numpy as np
from sklearn import preprocessing as pp

# We load preprocessing as we will use it to standardise inputs
# We almost always standardise inputs as it usually greatly improves
# results

### Extract the data from the csv

In [2]:
raw_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

# A delimiter is a sequence of one or more characters used to specify
# the boundary between separate independent regions in plain text or
# other data streams

unscaled_inputs = raw_data[:, 1:-1]
# the inputs are all columns in the csv except the first and last
# the first column is the arbitrarily chosen ID 

targets_all = raw_data[:,-1]
# the last column in the csv is the targets

### Balance the dataset
1. We will count the number of targets that are 1s (as we know there are more 0s)
2. We will keep as many 0s as 1s (and we will delete the others) 

In [3]:
num_one_targets = int(np.sum(targets_all))
# Since the targets can only take values 0 and 1, by summing all the
# targets, we see how many 1s there are

zero_counter = 0
ind_remove = []

# This loop will keep as many 0s as 1s and then put the index of any
# extra 0s in the ind_remove list for us to remove later
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_counter += 1
        if zero_counter > num_one_targets:
            ind_remove.append(i)
            
# targets_all.shape[0] gives the number of rows in the csv 
# (i.e. number of targets)

# the zero_counter will increase every time there is a zero in the 
# targets, eventually the zero_counter will have counted all 0s in
# the targets. But, once there are more 0s than 1s, the index of the
# extra 0s is put in the ind_remove list for us to remove later

In [4]:
unscaled_inputs_ep = np.delete(unscaled_inputs, ind_remove, axis=0)
targets_ep = np.delete(targets_all, ind_remove, axis=0)

# np.delete(array, indices, axis) is a function that returns a copy
# of the array with the given indices removed along an axis; in this
# case it removes all rows whose index is in ind_remove

# ep is for 'equal priors' which is what we wanted to achieve
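
The counting loop can also be written without an explicit Python loop. The sketch below is one possible vectorised equivalent, not the notebook's own method. The helper name `balance_indices` and the demo array are hypothetical, and it assumes binary targets with at least as many 0s as 1s.

```python
import numpy as np

def balance_indices(targets):
    """Return the row indices to drop so that 0s and 1s are equally many.

    Mirrors the loop: the first `num_ones` zeros are kept and every
    later zero is marked for removal.
    """
    num_ones = int(np.sum(targets))
    zero_positions = np.flatnonzero(targets == 0)
    return zero_positions[num_ones:]

# Small made-up target vector: two 1s and five 0s
demo_targets = np.array([0, 1, 0, 0, 1, 0, 0])
demo_remove = balance_indices(demo_targets)
demo_balanced = np.delete(demo_targets, demo_remove, axis=0)

print(demo_remove)    # indices of the surplus zeros
print(demo_balanced)  # equal number of 0s and 1s
```

Like the loop, this keeps the earliest 0s and drops the later ones, so the result depends on the row order of the file.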

### Standardise the inputs

In [5]:
scaled_inputs = pp.scale(unscaled_inputs_ep)

# preprocessing.scale(x) is a function that standardises an array
# along an axis; by default each column (axis=0) is centred to zero
# mean and scaled to unit variance
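
For intuition, standardisation can be sketched by hand in NumPy. As far as I understand the defaults of `preprocessing.scale`, this should match it for columns with non-zero variance, but treat that as an assumption. The `demo_inputs` matrix is made up, and this sketch would divide by zero on a constant column.

```python
import numpy as np

# Made-up feature matrix: 4 samples, 2 features on very different scales
demo_inputs = np.array([[1.0, 200.0],
                        [2.0, 400.0],
                        [3.0, 600.0],
                        [4.0, 800.0]])

col_mean = demo_inputs.mean(axis=0)
col_std = demo_inputs.std(axis=0)   # population standard deviation (ddof=0)

demo_scaled = (demo_inputs - col_mean) / col_std

print(demo_scaled.mean(axis=0))  # approximately 0 for each column
print(demo_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns are on the same footing even though the raw values differ by a factor of 200.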

### Shuffle the data

In [6]:
# This is important for effective batching

shuffled_ind = np.arange(scaled_inputs.shape[0])
# np.arange([start], stop) is a function that returns an array of
# evenly spaced values within a given interval
# In this case shuffled_ind = the integers from 0 up to (but not
# including) the number of rows in scaled_inputs (i.e. all row indices)

np.random.shuffle(shuffled_ind)
# This shuffles the indices

shuffled_inputs = scaled_inputs[shuffled_ind]
# scaled_inputs[shuffled_ind] returns the rows in the shuffled order;
# using the same indices for the targets keeps each row paired with
# its target

shuffled_targets = targets_ep[shuffled_ind]
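
The shuffle above gives a different order on every run. If a reproducible order is wanted, one option is a seeded generator, sketched below on made-up arrays. The seed value 42 is arbitrary, and `np.random.default_rng` is a substitute for the `np.random.shuffle` call used in the notebook, not what the notebook itself does.

```python
import numpy as np

# Made-up data: row i of the inputs belongs to target i
demo_inputs = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ...
demo_targets = np.array([0, 2, 4, 6, 8])    # equals the first column

rng = np.random.default_rng(42)             # seed chosen arbitrarily
perm = rng.permutation(demo_inputs.shape[0])

shuffled_demo_inputs = demo_inputs[perm]
shuffled_demo_targets = demo_targets[perm]

# The pairing survives because both arrays use the same permutation
print(np.array_equal(shuffled_demo_inputs[:, 0], shuffled_demo_targets))  # True
```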

### Split the data into training, validation, and test

In [7]:
samples_count = shuffled_inputs.shape[0]

train_size = int(0.8*samples_count)
val_size = int(0.1*samples_count)
test_size = samples_count - train_size - val_size

train_inputs = shuffled_inputs[:train_size]
train_targets = shuffled_targets[:train_size]

val_inputs = shuffled_inputs[train_size : train_size + val_size]
val_targets = shuffled_targets[train_size : train_size + val_size]

test_inputs = shuffled_inputs[train_size + val_size :]
test_targets = shuffled_targets[train_size + val_size:]

# We want to check we have not just balanced the dataset but also the
# training, validation, and test datasets

print(np.sum(train_targets), train_size, np.sum(train_targets)/train_size)
print(np.sum(val_targets), val_size, np.sum(val_targets)/val_size)
print(np.sum(test_targets), test_size, np.sum(test_targets)/test_size)

# They are all approximately 50%

1774.0 3579 0.4956691813355686
232.0 447 0.5190156599552572
231.0 448 0.515625
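
The three sizes printed above add up to 3579 + 447 + 448 = 4474 samples. As a quick check of the arithmetic, the sketch below recomputes the split sizes for that count. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(samples_count, train_frac=0.8, val_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train_size = int(train_frac * samples_count)
    val_size = int(val_frac * samples_count)
    test_size = samples_count - train_size - val_size
    return train_size, val_size, test_size

sizes = split_sizes(4474)
print(sizes)  # (3579, 447, 448)
```

Because `int()` truncates, any rounding remainder ends up in the test set, which is why it has one more sample than the validation set.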


### Save the 3 datasets in *.npz

In [8]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_val', inputs=val_inputs, targets=val_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
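
Note that `np.savez` appends the `.npz` extension when the file name does not already have it, and that the keyword names become the keys of the archive. A small round-trip sketch, using a temporary directory and made-up arrays so that it does not touch the real files:

```python
import os
import tempfile
import numpy as np

demo_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
demo_targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_data')      # no extension given
    np.savez(path, inputs=demo_inputs, targets=demo_targets)

    saved_files = os.listdir(tmp_dir)
    with np.load(path + '.npz') as npz:
        loaded_keys = sorted(npz.files)
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(saved_files)   # the extension was added automatically
print(loaded_keys)   # the keyword names used in np.savez
```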

# Create the machine learning algorithm

Let's discuss our net quickly. We have 10 input nodes, 2 hidden layers, and 2 output nodes. Each hidden layer has 50 nodes, which gives the network enough capacity to capture relationships that a linear or logistic model cannot.

However, we don't want to put too many nodes in the hidden layers initially, so that training completes quickly; 50 is a reasonable starting number.
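
To get a feel for the size of this net, we can count its trainable parameters. The sketch below assumes fully connected layers with one bias per node, which is the usual setup for dense layers. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(sizes):
    """Count weights and biases of a fully connected net with the given layer sizes."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per node
    return total

net_params = dense_param_count([10, 50, 50, 2])
linear_params = dense_param_count([10, 2])

print(net_params)     # 550 + 2550 + 102 = 3202
print(linear_params)  # 20 + 2 = 22
```

A model with no hidden layers would have only 22 parameters on the same inputs and outputs, which illustrates how much extra capacity the two hidden layers add.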

### Import the relevant libraries

In [9]:
import tensorflow as tf
# numpy is also needed but was imported earlier

### Data

In [10]:
npz = np.load('Audiobooks_data_train.npz')
# Recall that each .npz file stores two named arrays, 'inputs' and
# 'targets', which we access by key

train_inputs = npz['inputs'].astype(float)
train_targets = npz['targets'].astype(int)

# We expect all inputs to be floats, and we expect all targets to be 
# integers. This is good practice even if we know we saved the 
# targets as integers and not boolean values or floats

npz = np.load('Audiobooks_data_val.npz')
val_inputs = npz['inputs'].astype(float)
val_targets = npz['targets'].astype(int)

npz = np.load('Audiobooks_data_test.npz')
test_inputs = npz['inputs'].astype(float)
test_targets = npz['targets'].astype(int)

### Model

Outline, optimizers, loss, early stopping, training

In [11]:
input_size = 10
output_size = 2
hidden_size = 50

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_size, activation = 'relu'),
    tf.keras.layers.Dense(hidden_size, activation = 'relu'),
    tf.keras.layers.Dense(output_size, activation = 'softmax')
])

model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size = 100
max_epochs = 100

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_inputs,
          train_targets,
          batch_size = batch_size,
          epochs = max_epochs,
          callbacks = [early_stopping],
          validation_data = (val_inputs, val_targets),
          verbose = 2
         )

# We no longer need the layers.Flatten() as we have preprocessed the
# data properly so it is already in this format

# Indicating the batch size in .fit() will automatically batch the
# data

# By default callbacks.EarlyStopping() monitors the validation loss
# ('val_loss') and, with the default patience of 0, stops training
# the first time the validation loss fails to improve

# EarlyStopping(patience) configures the early stopping mechanism of
# the algorithm. 'patience' sets how many consecutive epochs without
# improvement we tolerate before stopping; here we allow 2
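
The effect of `patience` can be illustrated with a simplified pure-Python sketch. This is not the Keras implementation: the function `stopping_epoch` and the loss values are hypothetical, and the sketch only covers `patience` of 1 or more.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop once `patience` consecutive epochs have passed
    without the validation loss improving on the best value seen so far.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch
hypothetical_losses = [0.42, 0.33, 0.30, 0.31, 0.29, 0.30, 0.32, 0.28]

print(stopping_epoch(hypothetical_losses, patience=1))  # 4
print(stopping_epoch(hypothetical_losses, patience=2))  # 7
```

With a patience of 1 the single bump at epoch 4 ends training, while a patience of 2 rides it out and stops only after two epochs in a row fail to improve.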

Epoch 1/100
36/36 - 1s - loss: 0.5475 - accuracy: 0.7485 - val_loss: 0.4224 - val_accuracy: 0.8658 - 721ms/epoch - 20ms/step
Epoch 2/100
36/36 - 0s - loss: 0.3646 - accuracy: 0.8771 - val_loss: 0.3308 - val_accuracy: 0.8792 - 94ms/epoch - 3ms/step
Epoch 3/100
36/36 - 0s - loss: 0.3151 - accuracy: 0.8852 - val_loss: 0.3034 - val_accuracy: 0.8904 - 91ms/epoch - 3ms/step
Epoch 4/100
36/36 - 0s - loss: 0.2938 - accuracy: 0.8919 - val_loss: 0.2896 - val_accuracy: 0.8904 - 90ms/epoch - 3ms/step
Epoch 5/100
36/36 - 0s - loss: 0.2807 - accuracy: 0.8944 - val_loss: 0.2853 - val_accuracy: 0.8993 - 92ms/epoch - 3ms/step
Epoch 6/100
36/36 - 0s - loss: 0.2723 - accuracy: 0.8983 - val_loss: 0.2734 - val_accuracy: 0.8971 - 93ms/epoch - 3ms/step
Epoch 7/100
36/36 - 0s - loss: 0.2646 - accuracy: 0.8994 - val_loss: 0.2696 - val_accuracy: 0.9016 - 91ms/epoch - 3ms/step
Epoch 8/100
36/36 - 0s - loss: 0.2603 - accuracy: 0.9022 - val_loss: 0.2637 - val_accuracy: 0.9016 - 92ms/epoch - 3ms/step
Epoch 9/100
36

<keras.callbacks.History at 0x172d3ac6700>

### Test the model

In [12]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [13]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.21. Test accuracy: 92.41%
