# Audiobooks Project - Preprocessing Stage

Steps:
    1. Preprocess the data
    2. Balance the dataset
    3. Create 3 datasets (training, validation, test)
    4. Save the newly created sets in a tensor-friendly format (.npz)

### extract the data from the csv

In [1]:
import numpy as np
from sklearn import preprocessing # to standardize the inputs

In [2]:
raw_csv_data = np.loadtxt('Audiobooks.csv', delimiter=',')

# new variable excluding the ID and Targets:
unscaled_inputs_all = raw_csv_data[:,1:-1]

# new variable to record the targets:
targets_all = raw_csv_data[:,-1]
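
To make the slicing concrete, here is a small sketch on a made-up array. The values and the names `toy_csv`, `toy_inputs` and `toy_targets` are assumptions for illustration only; the layout (ID first, target last) mirrors what the cell above expects from the CSV.

```python
import numpy as np

# Hypothetical rows laid out as: ID, feature 1, feature 2, target
toy_csv = np.array([
    [101, 5.0, 2.0, 1],
    [102, 3.0, 8.0, 0],
])

# 1:-1 drops the first column (ID) and the last column (target)
toy_inputs = toy_csv[:, 1:-1]

# -1 selects only the last column (target)
toy_targets = toy_csv[:, -1]

print(toy_inputs.shape, toy_targets.shape)
```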

### balance the dataset

In [3]:
# count the targets that are 1s
# the targets are only 0 or 1, so their sum equals the number of 1s
num_one_targets = int(np.sum(targets_all))

# counting the number of zeros:
zero_targets_counter = 0

# variable to record indices to be removed:
indices_to_remove = []

# iterate over the dataset and balance it:
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i) # indices of the surplus zero-target rows we don't need

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
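
The loop above can probably be written without explicit iteration. The following is a minimal sketch on toy data, assuming the targets are coded as 0 and 1; the helper name `balance_binary` and the toy arrays are hypothetical, not part of the original notebook.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Keep every 1-target row and only the first equally many 0-target rows."""
    num_ones = int(np.sum(targets))
    zero_indices = np.flatnonzero(targets == 0)
    # zeros beyond the first `num_ones` are surplus and get removed
    surplus = zero_indices[num_ones:]
    return np.delete(inputs, surplus, axis=0), np.delete(targets, surplus, axis=0)

# toy data: 2 ones and 5 zeros
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(14, dtype=float).reshape(7, 2)

balanced_inputs, balanced_targets = balance_binary(toy_inputs, toy_targets)
print(balanced_targets)
```

Like the loop, this keeps the zero-target rows that appear first in the file and drops the later ones.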

### standardize the inputs

In [4]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
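
As a rough illustration of what standardization means here, the sketch below centers each column to zero mean and scales it to unit variance using plain NumPy. This mirrors the default per-column behaviour of `preprocessing.scale`; the toy array and the name `standardized` are assumptions for the example.

```python
import numpy as np

# toy inputs: 3 samples, 2 features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# subtract the column mean and divide by the column standard deviation
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```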

### shuffle the data

In [5]:
# batching comes later, so we shuffle the rows now to spread the samples as randomly as possible
# the same shuffled indices are applied to inputs and targets so each row keeps its own target

shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
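
The shuffle above gives a different order on every run. If a repeatable split is wanted, one option is a seeded generator. This is only a sketch: the seed value 42 and the toy arrays are arbitrary assumptions, and the original notebook does not fix a seed.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for illustration

toy_inputs = np.arange(10).reshape(5, 2)   # row i holds the values 2*i and 2*i + 1
toy_targets = np.array([0, 1, 0, 1, 1])

# one permutation of the row indices, reused for both arrays
order = rng.permutation(toy_inputs.shape[0])

toy_shuffled_inputs = toy_inputs[order]
toy_shuffled_targets = toy_targets[order]
```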

### split the data into train, validation, and test

In [11]:
samples_count = shuffled_inputs.shape[0]

# using the 80 / 10 / 10 split
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

# check the balance of each split: number of 1s, number of samples, share of 1s
print('Training: ', np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print('Validation: ', np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print('Test: ', np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

Training:  1797.0 3579 0.5020955574182733
Validation:  214.0 447 0.47874720357941836
Test:  226.0 448 0.5044642857142857


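The share of targets equal to 1 is close to 50% in each split, so the classes stay balanced after shuffling and splitting.
A deviation of about 5 percentage points either side of 50% is acceptable.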
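
The tolerance mentioned here could be turned into an explicit check. The helper below is a hypothetical sketch: the name `is_balanced` and the default tolerance of 0.05 are assumptions taken from the rule of thumb in the text, not a fixed standard.

```python
import numpy as np

def is_balanced(targets, tolerance=0.05):
    """Return True if the share of 1s is within `tolerance` of 0.5."""
    share_of_ones = float(np.mean(targets))
    return abs(share_of_ones - 0.5) <= tolerance

print(is_balanced(np.array([1, 0, 1, 0])))  # share of 1s is 0.5
print(is_balanced(np.array([1, 1, 1, 0])))  # share of 1s is 0.75
```

It could be called on `train_targets`, `validation_targets` and `test_targets` right after the split.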

### save the 3 datasets in .npz

In [12]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
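
For reference, here is a minimal sketch of how a file saved this way can be read back. It uses a temporary directory and toy arrays so it runs on its own; the file name `toy_data_train` and the choice to cast the targets to `int` are assumptions for the example.

```python
import os
import tempfile

import numpy as np

toy_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # np.savez appends '.npz' when the file name does not already end with it
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(float)
        loaded_targets = npz['targets'].astype(int)

print(loaded_inputs.shape, loaded_targets)
```

The keys `inputs` and `targets` are the keyword names passed to `np.savez`.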

That completes the preprocessing. The code up to this point can be reused to preprocess any dataset with two classes, as long as the targets are coded as 0 and 1 and the columns are laid out the same way (ID first, target last).

For a dataset with more than two classes, only the balancing step needs to be modified to handle the additional classes; the rest of the code works unchanged.
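
One possible way to generalize the balancing step to several classes is to undersample every class down to the size of the smallest one. The sketch below is an assumption about how that could look, not the notebook's method; `balance_classes` and the toy arrays are hypothetical.

```python
import numpy as np

def balance_classes(inputs, targets):
    """Keep the first `smallest` rows of every class, preserving row order."""
    classes, counts = np.unique(targets, return_counts=True)
    smallest = counts.min()
    keep = np.sort(np.concatenate(
        [np.flatnonzero(targets == c)[:smallest] for c in classes]
    ))
    return inputs[keep], targets[keep]

# toy data: class 0 appears 4 times, classes 1 and 2 appear twice each
toy_targets = np.array([0, 1, 2, 0, 0, 1, 2, 0])
toy_inputs = np.arange(8).reshape(8, 1)

balanced_inputs, balanced_targets = balance_classes(toy_inputs, toy_targets)
print(np.bincount(balanced_targets))
```

With two classes coded as 0 and 1 where zeros are the majority, this keeps the same rows as the loop in the balancing cell.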