# Audiobooks business case

## Data Preprocessing

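Data preprocessing matters in machine learning because the accuracy of a model depends on the quality of the data it is trained on.

This notebook preprocesses the dataset and saves it in the npz format for the machine learning algorithm.

The indices are shuffled before the dataset is balanced. The data is arranged by date, and balancing drops the surplus 0s from the end of the array, so without this shuffle the retained 0s would all come from the earliest records.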



### STEP 1: Extract the data from the csv

In [20]:
# Import the necessary packages
## NB: sklearn preprocessing library is used to standardize the dataset
import pandas as pd
import numpy as np
from sklearn import preprocessing

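# Load the data
raw_data = np.loadtxt('Audiobooks_dataset.csv', delimiter=',')
#raw_data
# Alternatively, the csv can be loaded with pandas and converted to a NumPy array
# (header=None assumes the file has no header row, which np.loadtxt above also requires):
## raw_data = pd.read_csv('Audiobooks_dataset.csv', header=None).to_numpy()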

# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is the targets)
unscaled_inputs = raw_data[:,1:-1]


# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_data[:,-1]
targets_all

array([1., 1., 1., ..., 0., 0., 0.])
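
The slicing used above can be checked on a small array. The sketch below uses a hypothetical four-column toy array (ID, two features, target); the values and the names `toy_data`, `toy_inputs`, and `toy_targets` are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Hypothetical toy rows: [customer_id, feature_a, feature_b, target]
toy_data = np.array([
    [101.0, 1620.0, 5.33, 1.0],
    [102.0, 324.0, 7.47, 0.0],
    [103.0, 2160.0, 10.13, 0.0],
])

toy_inputs = toy_data[:, 1:-1]   # drop the first (ID) and last (target) columns
toy_targets = toy_data[:, -1]    # keep only the last column

print(toy_inputs.shape)   # (3, 2)
print(toy_targets)        # [1. 0. 0.]
```

The inputs keep one row per customer, and the targets become a one-dimensional array of the same length.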

### STEP 2: Shuffle the dataset

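The indices of the dataset need to be shuffled before balancing.

The dataset is shuffled again after balancing. Balancing keeps only the first 0s it encounters, so every row after the last retained 0 is a 1; without a second shuffle, the train, validation, and test sets would not contain similar proportions of 1s.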



In [22]:
# When the data was collected it was actually arranged by date
# It is therefore necessary to shuffle the indices of the data, 
# so that the data is not arranged in any way when it is being fed to the algorithm.
# Also, because there will be batching, it is required for the data to be as randomly spread out as possible

# Grab the indices of the dataset in order  
shuffled_indices = np.arange(unscaled_inputs.shape[0]) 

# randomly shuffle the indices extracted
np.random.shuffle(shuffled_indices)
#print(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
unscaled_inputs = unscaled_inputs[shuffled_indices]
targets_all = targets_all[shuffled_indices]
print(unscaled_inputs)
#print(targets_all)

[[1620.   1620.      5.33 ...    0.      0.    249.  ]
 [ 324.    324.      5.33 ...    0.      0.     12.  ]
 [2160.   2160.      5.33 ...    0.     30.     41.  ]
 ...
 [2160.   2160.      5.33 ...  113.4     0.    157.  ]
 [ 324.    324.      7.47 ...    0.      0.      0.  ]
 [2160.   2160.      5.33 ...    0.      0.     82.  ]]
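
Because `np.random.shuffle` gives a different order on every run, it can help to seed the generator while debugging. The following is a minimal sketch on toy arrays, assuming NumPy's `default_rng` generator; the seed value and the `toy_` names are arbitrary choices, not something the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen only for repeatability

toy_inputs = np.arange(10).reshape(5, 2)   # 5 rows, 2 features: row i is [2i, 2i+1]
toy_targets = np.array([1, 1, 0, 0, 0])

# One permutation of the row indices, applied to both arrays,
# so that every input row stays paired with its own target
order = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[order]
shuffled_toy_targets = toy_targets[order]

print(order)
print(shuffled_toy_inputs)
print(shuffled_toy_targets)
```

Indexing both arrays with the same `order` is what keeps inputs and targets aligned; shuffling each array separately would break the pairing.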


### STEP 3: Balance the dataset

In [23]:
# Balancing the dataset ensures that there are as many "1s" as there are "0s"

# First count how many targets are 1 (meaning that the customer did convert)
ones_in_target = int(np.sum(targets_all))


# Set a counter for targets that are 0 (meaning that the customer did not convert)
target_zeros_counter = 0

# Surplus 0s need to be removed from the dataset (this assumes the 0s outnumber the 1s).
# Declare a list that will hold the indices of the rows to remove:
indices_to_remove = []

# Count the number of targets that are 0. 
# Once there are as many 0s as 1s, mark every further entry where the target is 0.

for i in range(targets_all.shape[0]): 
    if targets_all[i] == 0:
        target_zeros_counter += 1
        if target_zeros_counter > ones_in_target:
            indices_to_remove.append(i)  # store the indices to be removed

# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
balanced_unscaled_inputs = np.delete(unscaled_inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets_all, indices_to_remove, axis=0)
print(balanced_unscaled_inputs)

[[1620.   1620.      5.33 ...    0.      0.    249.  ]
 [ 324.    324.      5.33 ...    0.      0.     12.  ]
 [2160.   2160.      5.33 ...    0.     30.     41.  ]
 ...
 [1620.   1620.      8.   ...    0.      0.      9.  ]
 [1188.   1188.      5.87 ...    0.      0.    208.  ]
 [2160.   2160.     10.13 ...    0.      0.     71.  ]]
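
The loop above can also be written without an explicit Python loop. The sketch below is one possible vectorized equivalent, assuming the same rule (keep every 1 and only the first zeros, up to the number of 1s); the function name `balance_by_dropping_zeros` and the toy arrays are hypothetical.

```python
import numpy as np

def balance_by_dropping_zeros(inputs, targets):
    """Keep every 1 and only the first zeros (as many as there are 1s), preserving row order."""
    num_ones = int(np.sum(targets == 1))
    zero_positions = np.flatnonzero(targets == 0)   # row indices of all zeros, in order
    surplus = zero_positions[num_ones:]             # zeros beyond the first num_ones
    balanced_inputs = np.delete(inputs, surplus, axis=0)
    balanced_targets = np.delete(targets, surplus, axis=0)
    return balanced_inputs, balanced_targets

toy_inputs = np.arange(7).reshape(7, 1)
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])

toy_balanced_inputs, toy_balanced_targets = balance_by_dropping_zeros(toy_inputs, toy_targets)
print(toy_balanced_targets)             # [0 1 0 1]
print(toy_balanced_inputs.ravel())      # [0 1 2 4]
```

On the toy targets there are two 1s, so only the first two 0s survive and the rows at positions 3, 5, and 6 are dropped.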


### STEP 4: Standardize the inputs

In [24]:
# The input data is now standardized using the sklearn preprocessing library
scaled_inputs = preprocessing.scale(balanced_unscaled_inputs)
print(scaled_inputs)
# NB: only the inputs are standardized, not the targets

[[ 1.22939327e-01 -2.44379273e-01 -3.41337700e-01 ... -4.45110996e-01
  -1.40423545e-01  1.91454550e+00]
 [-2.47657078e+00 -1.72705917e+00 -3.41337700e-01 ... -4.45110996e-01
  -1.40423545e-01 -6.34836635e-01]
 [ 1.20606854e+00  3.73404018e-01 -3.41337700e-01 ... -4.45110996e-01
   4.56062953e+01 -3.22886922e-01]
 ...
 [ 1.22939327e-01 -2.44379273e-01  1.43309424e-01 ... -4.45110996e-01
  -1.40423545e-01 -6.67107295e-01]
 [-7.43564041e-01 -7.38605906e-01 -2.43319180e-01 ... -4.45110996e-01
  -1.40423545e-01  1.47351315e+00]
 [ 1.20606854e+00  3.73404018e-01  5.29938028e-01 ... -4.45110996e-01
  -1.40423545e-01 -1.80323312e-04]]
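
To make the standardization step concrete, the sketch below reproduces by hand what `preprocessing.scale` is understood to do with its default arguments: subtract each column's mean and divide by each column's population standard deviation. It uses NumPy only and a made-up toy array; it assumes no column is constant (a zero standard deviation would need special handling).

```python
import numpy as np

# Hypothetical toy inputs: 3 rows, 2 features on very different scales
toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

col_mean = toy.mean(axis=0)   # per-column mean
col_std = toy.std(axis=0)     # per-column population standard deviation (ddof=0)

toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0))  # each column's mean is now approximately 0
print(toy_scaled.std(axis=0))   # each column's standard deviation is now approximately 1
```

After scaling, both features are on a comparable scale even though the raw values differ by an order of magnitude.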


### STEP 5: Shuffle the data

In [25]:
# Balancing kept only the first 0s it encountered, so every row after the last retained 0 is a 1.
# Shuffle the indices again, so the classes are spread evenly when we split and feed the data.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = balanced_targets[shuffled_indices]
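
The effect of this second shuffle can be seen on synthetic targets. The sketch below is an illustration under stated assumptions: the targets are randomly generated (roughly 20% ones), the seed is arbitrary, and the helper `ones_fraction_head_tail` is hypothetical. It compares the share of 1s in the first 75% of the rows with the share in the remaining rows, before and after reshuffling.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Hypothetical, already shuffled targets: roughly 20% ones and 80% zeros
toy_targets = (rng.random(1000) < 0.2).astype(float)

# Same balancing rule as in Step 3, written as a mask:
# keep all 1s and only the first num_ones zeros
num_ones = int(toy_targets.sum())
keep = (toy_targets == 1) | (np.cumsum(toy_targets == 0) <= num_ones)
balanced = toy_targets[keep]

def ones_fraction_head_tail(targets, head_frac=0.75):
    """Fraction of 1s in the first head_frac of the rows and in the remaining rows."""
    cut = int(head_frac * targets.shape[0])
    return targets[:cut].mean(), targets[cut:].mean()

head_before, tail_before = ones_fraction_head_tail(balanced)
reshuffled = balanced[rng.permutation(balanced.shape[0])]
head_after, tail_after = ones_fraction_head_tail(reshuffled)

print(head_before, tail_before)   # tail holds a much larger share of 1s
print(head_after, tail_after)     # both close to 0.5 after reshuffling
```

Without the reshuffle, a split taken by position would give the later sets far more 1s than the training set.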

### STEP 6: Split the dataset into train, validation, and test

In [26]:
# In this step the data is split into 75% training, 15% validation and 10% testing

# Get the total number of samples
samples_count = shuffled_inputs.shape[0]

# Get the number of samples for the training (75%) and validation (15%) sets
train_samples_count = int(0.75 * samples_count)
validation_samples_count = int(0.15 * samples_count)

# Nominal number of samples for the test set (10%).
# The test slices below take every remaining row, so because each count is rounded down
# they can hold slightly more rows than this number.
test_samples_count = int(0.1 * samples_count)


# Create variables that record the inputs and targets for training

train_inputs = shuffled_inputs[:train_samples_count]    # the first 75% of the dataset
train_targets = shuffled_targets[:train_samples_count]  # the first 75% of the dataset

# Create variables that record the inputs and targets for validation.
# They account for the next 15% of the dataset.
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# The dataset was balanced to be 50-50 (for targets 0 and 1), but the training, validation, and test sets
# were taken from a shuffled dataset, so each of them is only approximately 50-50.
# Note that each time you rerun this code, you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole notebook, the npzs will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1685.0 3355 0.5022354694485842
325.0 671 0.4843517138599106
227.0 447 0.5078299776286354
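
The three printed counts add up to 4473, one less than the 4474 rows implied by the 2237 ones in a balanced dataset, because each count is rounded down independently. One way to make the counts add up exactly is to derive the test count by subtraction. The helper `split_counts` below is a hypothetical sketch, not part of the notebook.

```python
def split_counts(samples_count, train_frac=0.75, validation_frac=0.15):
    """Return (train, validation, test) counts that always sum to samples_count."""
    train_count = int(train_frac * samples_count)
    validation_count = int(validation_frac * samples_count)
    # The test set takes whatever is left, so no row is lost to rounding
    test_count = samples_count - train_count - validation_count
    return train_count, validation_count, test_count

print(split_counts(4474))   # (3355, 671, 448)
```

With this count, the proportion of 1s in the test set would be computed over the rows the test slice actually contains.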


### STEP 7: Save the three datasets in *.npz

In [27]:
# Save the three datasets in *.npz.
# Each file stores its arrays under the same keys ('inputs' and 'targets') for easy access.

# np.savez appends the .npz extension and writes the files to the current working directory.

np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
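
For reference, a saved file can be read back with `np.load`, using the same keyword names that were passed to `np.savez`. The sketch below round-trips a small toy array through a temporary directory so that it does not touch the real files; the file name `toy_data_train` and the toy values are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

toy_inputs = np.array([[0.1, -0.2], [1.3, 0.4]])
toy_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')
    # np.savez appends the .npz extension, so this writes toy_data_train.npz
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The arrays are retrieved by the keyword names used when saving
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(loaded_inputs)
print(loaded_targets)
```

The same pattern, with `'Audiobooks_data_train.npz'` as the path, would load the training set in the notebook that trains the model.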