# Audiobooks business case

## Preprocess the data: balance the dataset, create three datasets (training, validation, and test), and save them in a tensor-friendly format (e.g. *.npz)

Since we are dealing with real-life data, we need to preprocess it before training. The code below is not complicated, but it is crucial for building a good model.

If you want to understand each step, go through the code. The same approach should work for most supervised learning datasets laid out as many input columns followed by a single column containing the targets. Keep in mind that a specific problem may require additional preprocessing.

Note that the header row, which contains the names of the categories, has been removed from the CSV file. We only want the numeric data.

This code does not include comments; it is the same as the code in the lesson. Please refer to the other file if you want the commented version.

### Extract the data from the csv

In [1]:
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('1.1 Audiobooks_data.csv', delimiter=',')

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]
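
If your own copy of the CSV still contains a header row, `np.loadtxt` cannot convert the column names to floats. The following is a minimal sketch, using a small in-memory CSV whose column names and values are made up for illustration (they are not taken from the Audiobooks file). It shows how `skiprows=1` skips the header and how the same slicing separates the inputs from the targets:

```python
import io
import numpy as np

# Hypothetical three-row CSV with a header; the values are made up for illustration.
csv_text = io.StringIO(
    "id,feature_a,feature_b,target\n"
    "1,10.0,0.5,0\n"
    "2,20.0,1.5,1\n"
    "3,30.0,2.5,0\n"
)

# skiprows=1 tells loadtxt to ignore the first (header) line.
demo_data = np.loadtxt(csv_text, delimiter=',', skiprows=1)

# Same slicing as above: drop column 0, keep the middle columns as inputs,
# and take the last column as the targets.
demo_inputs = demo_data[:, 1:-1]
demo_targets = demo_data[:, -1]

print(demo_inputs.shape)   # (3, 2)
print(demo_targets)        # [0. 1. 0.]
```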

### Balance the dataset

In [17]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []
print(targets_all.shape[0])
print(targets_all)
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
print(unscaled_inputs_equal_priors)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
print(targets_equal_priors)

14084
[0. 0. 0. ... 0. 0. 1.]
[[1620.   1620.     19.73 ... 1603.8     5.     92.  ]
 [2160.   2160.      5.33 ...    0.      0.      0.  ]
 [2160.   2160.      5.33 ...    0.      0.    388.  ]
 ...
 [2160.   2160.      5.33 ...    0.      0.      6.  ]
 [1674.   3348.      7.99 ...    0.      0.      0.  ]
 [1674.   3348.      5.33 ...    0.      0.      0.  ]]
[0. 0. 0. ... 1. 1. 1.]
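
The loop above keeps every row with target 1 and only as many rows with target 0 as there are 1s, dropping the remaining 0s. The same rule can be sketched without an explicit loop. The helper name `balance_by_dropping_zeros` and the toy arrays are made up for this example; it is one possible vectorized equivalent, not the lesson's code:

```python
import numpy as np

def balance_by_dropping_zeros(inputs, targets):
    """Keep every 1-target row and only the first N 0-target rows,
    where N is the number of 1-targets (same rule as the loop above)."""
    num_ones = int(np.sum(targets))
    zero_positions = np.flatnonzero(targets == 0)
    # Zeros beyond the first num_ones are the ones to drop.
    to_remove = zero_positions[num_ones:]
    return (np.delete(inputs, to_remove, axis=0),
            np.delete(targets, to_remove, axis=0))

# Made-up toy data: 6 zeros and 2 ones.
toy_targets = np.array([0., 0., 1., 0., 0., 1., 0., 0.])
toy_inputs = np.arange(16, dtype=float).reshape(8, 2)

balanced_inputs, balanced_targets = balance_by_dropping_zeros(toy_inputs, toy_targets)
print(balanced_targets)  # [0. 0. 1. 1.]
```

Because the kept 0s are simply the first ones in file order, the balanced arrays are not randomly ordered, which is one reason the data is shuffled in a later step.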


### Standardize the inputs

In [18]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
print(scaled_inputs)

[[ 0.21053387 -0.18888517  1.97823887 ...  4.80955413 11.83828419
   0.09415043]
 [ 1.27894497  0.41646744 -0.39082475 ... -0.41569922 -0.20183481
  -0.80255852]
 [ 1.27894497  0.41646744 -0.39082475 ... -0.41569922 -0.20183481
   2.979214  ]
 ...
 [ 1.27894497  0.41646744 -0.39082475 ... -0.41569922 -0.20183481
  -0.7440775 ]
 [ 0.31737498  1.7482432   0.04679395 ... -0.41569922 -0.20183481
  -0.80255852]
 [ 0.31737498  1.7482432  -0.39082475 ... -0.41569922 -0.20183481
  -0.80255852]]
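
`preprocessing.scale` standardizes each column to zero mean and unit variance. As a rough NumPy-only sketch of that computation (the helper name `standardize_columns` and the toy array are made up, and the handling of constant columns here is an assumption of this sketch):

```python
import numpy as np

def standardize_columns(x):
    """Subtract each column's mean and divide by its population standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Assumption for this sketch: leave constant columns unscaled
    # so that we never divide by zero.
    std = np.where(std == 0, 1.0, std)
    return (x - mean) / std

# Made-up toy data with two columns on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
scaled = standardize_columns(toy)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```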


### Shuffle the data

In [20]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
print(shuffled_indices)
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
print(shuffled_inputs)
print(shuffled_targets)

[   0    1    2 ... 4471 4472 4473]
[[-0.64419501 -0.67316726 -0.26579083 ... -0.41569922 -0.20183481
  -0.80255852]
 [ 1.27894497  0.41646744 -0.39082475 ...  5.91794121 -0.20183481
   1.1565556 ]
 [-0.64419501 -0.67316726 -0.26579083 ... -0.41569922 -0.20183481
   1.1565556 ]
 ...
 [ 0.21053387 -0.18888517  0.39886313 ...  2.27609796  2.20618899
  -0.68559648]
 [-0.64419501 -0.67316726 -0.39082475 ...  2.17757466 -0.20183481
   0.87389734]
 [-2.35365278 -1.64173144 -0.39082475 ...  0.63990752 -0.20183481
  -0.80255852]]
[1. 0. 1. ... 0. 0. 0.]
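
Indexing the inputs and the targets with the same shuffled index array is what keeps each row paired with its own target. A small sketch on made-up toy data illustrates this. It uses a seeded generator so the order is reproducible; the seed value and the variable names are arbitrary choices for this example, not part of the lesson's code:

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value 42 is arbitrary

# Made-up toy data: row i is [2i, 2i + 1] and its target is i.
toy_inputs = np.arange(10, dtype=float).reshape(5, 2)
toy_targets = np.arange(5, dtype=float)

# One random ordering of the row indices, applied to both arrays.
order = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[order]
shuffled_toy_targets = toy_targets[order]

# Each row still sits next to its own target: first feature == 2 * target.
print(np.array_equal(shuffled_toy_inputs[:, 0], 2 * shuffled_toy_targets))  # True
```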


### Split the dataset into train, validation, and test

In [21]:
samples_count = shuffled_inputs.shape[0]
print(samples_count)
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

4474
1793.0 3579 0.5009779267951942
230.0 447 0.5145413870246085
214.0 448 0.47767857142857145
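
The three sizes follow from simple arithmetic: `int()` truncates the 80% and 10% shares, and the test set takes whatever remains, so the three counts always add up to the total. A small sketch (the helper name `split_counts` is made up for this example) reproduces the counts printed above:

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set gets the remainder."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))  # (3579, 447, 448)
```

The last column of the printed output is the share of 1-targets in each subset. Values close to 0.5 in all three indicate that the shuffle kept each subset roughly balanced.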


### Save the three datasets in *.npz

In [22]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
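
`np.savez` appends the `.npz` extension to the file name when it is missing, and the keyword names (`inputs`, `targets`) become the keys of the saved archive. A minimal round-trip sketch, using made-up arrays and a hypothetical file name in a temporary directory, shows how the data can be read back:

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for one of the datasets.
demo_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
demo_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_data_train')  # hypothetical file name
    np.savez(path, inputs=demo_inputs, targets=demo_targets)

    # savez appended '.npz', so that is the name to load.
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape, loaded_targets)  # (2, 2) [0 1]
```

Casting the targets to integers on load is a choice made for this sketch; whether it is needed depends on the loss function used later.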