## Business Case Tensor Example

#### Practical example. Audiobooks

#### Preprocess the data. Balance the dataset. Create 3 datasets: training, validation and testing. Save the newly created sets in a tensor format (e.g. *.npz)

Since we are dealing with real-life data, we need to preprocess it a bit. This is the relevant code. It is not that hard, but it is closer to data engineering than to machine learning.

If you want to know how to do that, go through the code and the comments. In any case, this should do the trick for any dataset organized as many input columns followed by one column containing the targets (the usual layout for supervised learning datasets).

Note that we have removed the header, which contains the names of the categories. We simply want the data.

### Extract the data from the CSV

In [1]:
import numpy as np
from sklearn import preprocessing  # we use scikit-learn to standardize the inputs; we almost always standardize the inputs
                                   # Standardizing gains 10% accuracy in this problem
raw_csv_data = np.loadtxt("C:/Users/camay/OneDrive/Desktop/Data Science Bootcamp/Audiobooks_data.csv", delimiter = ',')

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]
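
To make the two slices above concrete, here is a minimal sketch on a made-up array. The values are hypothetical, and treating the first column as a customer ID is an assumption (the notebook only shows that the slice `1:-1` drops the first and last columns).

```python
import numpy as np

# Hypothetical 3-row table: [assumed ID column, two input features, target]
toy_data = np.array([
    [101, 2.0, 7.5, 1],
    [102, 3.5, 1.0, 0],
    [103, 0.5, 4.2, 0],
])

toy_inputs = toy_data[:, 1:-1]   # drop the first column and the last (target) column
toy_targets = toy_data[:, -1]    # keep only the last column

print(toy_inputs.shape)   # (3, 2)
print(toy_targets)        # [1. 0. 0.]
```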

### Balance the dataset

In [2]:
# Count the number of 1s. The targets are 0 or 1, so their sum is the number of 1s
num_ones_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):  # The shape of targets_all on axis=0 is the length of the vector
    if targets_all[i] == 0:    # Increase the zeros counter by 1 whenever the target is 0
        zero_targets_counter += 1
        if zero_targets_counter > num_ones_targets:   # Once we have seen more zeros than there are 1s, every further zero is surplus
            indices_to_remove.append(i)               # Record the index of each data point to be removed

unscaled_input_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)  # Deletes the listed rows along axis 0
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)  # Same thing for the targets column (vector)
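
The loop above keeps the first zeros it meets and drops the rest. As an alternative sketch (not the notebook's own method), the same selection can be written without a loop using `np.flatnonzero`. The toy arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 2 ones and 5 zeros
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

num_ones = int(np.sum(toy_targets))

# Positions of all zeros, in order; everything past the first `num_ones` is surplus
zero_positions = np.flatnonzero(toy_targets == 0)
surplus_zero_indices = zero_positions[num_ones:]

balanced_inputs = np.delete(toy_inputs, surplus_zero_indices, axis=0)
balanced_targets = np.delete(toy_targets, surplus_zero_indices, axis=0)

print(balanced_targets)  # [0 1 0 1]
```

Both versions remove the later zeros in file order, so if the CSV is sorted in some meaningful way the surviving zeros are not a random sample. Drawing the zeros to keep at random would be a different design choice.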
    

### Standardize the inputs

In [3]:
scaled_inputs = preprocessing.scale(unscaled_input_equal_priors) #standardizes an array along an axis
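
With its default arguments, `preprocessing.scale` centers each column to mean 0 and scales it to unit variance. The sketch below reproduces that arithmetic in plain NumPy on a made-up array, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical inputs: 3 samples, 2 features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

col_mean = toy.mean(axis=0)   # one mean per column
col_std = toy.std(axis=0)     # population standard deviation (ddof=0) per column
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```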

### Shuffle the data

In [4]:
# Since we are batching, we must shuffle the data

shuffled_indices = np.arange(scaled_inputs.shape[0])  # np.arange returns evenly spaced values within a given interval
np.random.shuffle(shuffled_indices)  # shuffles the numbers in the given sequence in place

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
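
The key point in the cell above is that one set of shuffled indices is applied to both arrays, so each row keeps its own target. A minimal sketch on made-up data follows. It uses a seeded `np.random.default_rng` so the shuffle is repeatable, which is an addition for illustration and not something the notebook does, and the seed value is arbitrary.

```python
import numpy as np

# Hypothetical data built so that row i is [2*i, 2*i + 1] and its target is i
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.arange(5)

rng = np.random.default_rng(42)                 # arbitrary seed, chosen only for repeatability
order = rng.permutation(toy_inputs.shape[0])    # a shuffled copy of 0..4

shuffled_toy_inputs = toy_inputs[order]
shuffled_toy_targets = toy_targets[order]

# Every row still lines up with its original target
print(np.array_equal(shuffled_toy_inputs[:, 0], 2 * shuffled_toy_targets))  # True
```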

### Split the dataset into train, validation and test

In [7]:
samples_count = shuffled_inputs.shape[0]

# Let's split the data: 80% train, 10% validation and 10% test

train_samples_count = int(0.8 * samples_count)
validation_count = int(.1 * samples_count)
test_count = samples_count - train_samples_count - validation_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_count]

test_inputs = shuffled_inputs[train_samples_count+validation_count:]
test_targets = shuffled_targets[train_samples_count+validation_count:]


# It is useful to check if we have balanced the dataset

print(np.sum(train_targets),train_samples_count, np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets), validation_count, np.sum(validation_targets)/validation_count)
print(np.sum(test_targets),test_count, np.sum(test_targets)/test_count)

1751.0 3579 0.48924280525286395
242.0 447 0.5413870246085011
244.0 448 0.5446428571428571
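
Each printed row shows the number of 1s, the subset size, and their ratio, so values near 0.5 indicate a balanced subset. As a side note, the three slices can also be produced in one call with `np.split`. This is a sketch of an alternative on a made-up array, not the notebook's own code.

```python
import numpy as np

# Hypothetical data: 10 samples with 2 features each
samples = np.arange(20).reshape(10, 2)
n = samples.shape[0]

train_end = int(0.8 * n)                      # first 80% of the rows
validation_end = train_end + int(0.1 * n)     # next 10% of the rows

# np.split cuts at the given row indices; the remainder becomes the test set
train, validation, test = np.split(samples, [train_end, validation_end])

print(train.shape[0], validation.shape[0], test.shape[0])  # 8 1 1
```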


### Save the three datasets in *.npz

In [10]:
np.savez('Audiobooks_data_train', inputs = train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation', inputs = validation_inputs, targets= validation_targets)
np.savez('Audiobooks_data_test', inputs = test_inputs, targets = test_targets)
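
To show how the saved files can be read back, here is a minimal round-trip sketch. It writes a made-up array to a temporary directory under a hypothetical file name, so it does not touch the real `Audiobooks_data_*.npz` files. Note that `np.savez` appends the `.npz` extension when the name does not already have it.

```python
import os
import tempfile
import numpy as np

# Hypothetical toy data standing in for one of the three subsets
toy_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
toy_targets = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')   # hypothetical file name
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The keyword names used in np.savez become the keys of the loaded archive
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(loaded_inputs.shape, loaded_targets)  # (2, 2) [1 0]
```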

In [12]:
import os
os.getcwd()

'C:\\Users\\camay\\Data Science Bootcamp 2024\\Section 53'