# **Audiobook data**

In [1]:
import os

os.listdir()

['Audiobooks_data.csv',
 'audiobook_train.npz',
 '.ipynb_checkpoints',
 'Untitled.ipynb',
 'audiobook-preproc.ipynb',
 'audiobook_val.npz',
 'audiobook_test.npz']

In [2]:
import pandas as pd

This data represents **2 years** of customer engagement. We are going to do supervised learning, and the target is the last column.<br>
The target is a boolean: during the **6 months** after the data was gathered, we checked whether the customer purchased another book (*1* if *yes*, *0* if *no*). <br>
In other words, we have a **classification problem**: will the customer buy again or not? The audiobook company wants to do targeted marketing, so it needs to know whether a customer is *likely* to make a new purchase.

In [3]:
col_names = [
    'ID',   
    'Book_length_overall', # Total length of all books, in minutes
    'Book_length_avg', # Total length of all books / number of books, in minutes
    'Price_avg',
    'Price_overall',
    'Review', # Number of reviews
    'Review_over_10',  # Average review score out of 10
    'Minutes_listened',
    'Completion', # Minutes listened as a share of total book length
    'Support_requests', # Number of support requests
    'Last_visit', # Days between purchase and last visit
    'Targets',
]

ab = pd.read_csv('Audiobooks_data.csv', names=col_names)
ab.head()

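| | ID | Book_length_overall | Book_length_avg | Price_avg | Price_overall | Review | Review_over_10 | Minutes_listened | Completion | Support_requests | Last_visit | Targets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 873 | 2160.0 | 2160 | 10.13 | 10.13 | 0 | 8.91 | 0.0 | 0.0 | 0 | 0 | 1 |
| 1 | 611 | 1404.0 | 2808 | 6.66 | 13.33 | 1 | 6.5 | 0.0 | 0.0 | 0 | 182 | 1 |
| 2 | 705 | 324.0 | 324 | 10.13 | 10.13 | 1 | 9.0 | 0.0 | 0.0 | 1 | 334 | 1 |
| 3 | 391 | 1620.0 | 1620 | 15.31 | 15.31 | 0 | 9.0 | 0.0 | 0.0 | 0 | 183 | 1 |
| 4 | 819 | 432.0 | 1296 | 7.11 | 21.33 | 1 | 9.0 | 0.0 | 0.0 | 0 | 0 | 1 |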


## **Preprocessing**
- Balance the dataset (first task)
- Divide it into training, validation and test subsets
- Prepare the data for TensorFlow

In [4]:
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')
unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]

**Balancing the dataset**<br>
Balance the two target classes (0 and 1) so that the model cannot score well simply by always predicting the majority class.

In [5]:
print(f'Number of ones: {raw_csv_data[:,-1].sum()}')
print(f'Number of zeros: {raw_csv_data[:,-1].shape[0] - raw_csv_data[:,-1].sum()}')

Number of ones: 2237.0
Number of zeros: 11847.0


As we can see, the numbers of *ones* and *zeros* are imbalanced: only about 16 % of the targets are ones (2237 of 14084). That may lead the model to favour predicting zero over one. <br>
Therefore, we balance the data so that each class makes up *50 %*. <br>
The downside is that we have to discard a large amount of data. One way to mitigate this is to repeat the balancing several times, each time using a different portion of the observations with zero as the target.

In [6]:
num_one = int(targets_all.sum())
zero_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0: 
        zero_counter += 1 # Count the zeros seen so far in the targets
        # Once more zeros than ones have been seen, mark every further zero for removal
        if zero_counter > num_one:  
            indices_to_remove.append(i)
            
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
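
The loop above always keeps the first zeros it meets. As a hedged alternative, the sketch below balances the classes by random undersampling, so that a different seed keeps a different portion of the zero-target rows. The helper `undersample_majority`, its `seed` parameter and the toy arrays are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def undersample_majority(inputs, targets, seed=0):
    """Return a class-balanced copy by randomly dropping majority-class rows."""
    rng = np.random.default_rng(seed)
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    # Keep as many rows per class as the smaller class contains
    n_keep = min(one_idx.size, zero_idx.size)
    keep = np.concatenate([
        rng.choice(one_idx, size=n_keep, replace=False),
        rng.choice(zero_idx, size=n_keep, replace=False),
    ])
    keep.sort()  # preserve the original row order
    return inputs[keep], targets[keep]

# Toy data (hypothetical): 3 ones and 7 zeros
toy_targets = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(20, dtype=float).reshape(10, 2)

balanced_inputs, balanced_targets = undersample_majority(toy_inputs, toy_targets, seed=1)
print(balanced_targets.sum(), balanced_targets.shape[0])
```

Calling the helper again with another `seed` gives a balanced set built from a different subset of zeros, which is one way to realise the repeated balancing idea described earlier.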

**Standardize the inputs**

In [7]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
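
To make the standardization step concrete, here is a minimal NumPy sketch of what column-wise scaling to zero mean and unit variance does. It is meant to mirror the default behaviour of `preprocessing.scale` rather than replace it; the helper `standardize` and the toy matrix are assumptions introduced for illustration.

```python
import numpy as np

def standardize(x):
    """Scale each column to zero mean and unit variance (population std, ddof=0)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (x - mean) / std, mean, std

# Toy matrix (hypothetical): two features on very different scales
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

toy_scaled, toy_mean, toy_std = standardize(toy)
print(toy_scaled.mean(axis=0))  # close to 0 for every column
print(toy_scaled.std(axis=0))   # close to 1 for every column
```

Returning the mean and standard deviation alongside the scaled data makes it possible to apply the same transformation to new observations later.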

**Shuffle the data**

In [8]:
shuffled_indices = np.arange(scaled_inputs.shape[0]) # get the indices for the inputs
np.random.shuffle(shuffled_indices) # Shuffling the indices for good batching

# Shuffled dataset
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
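
The key point of the shuffle above is that inputs and targets are indexed with the *same* permuted indices, so every row keeps its own label. The small sketch below checks this on toy data; the seeded generator (seed 42 is arbitrary) and the toy arrays are assumptions added for reproducibility and illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator so the shuffle is repeatable

# Toy data (hypothetical): each target equals the first feature of its row
toy_inputs = np.array([[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_targets = np.array([0.0, 1.0, 2.0, 3.0])

perm = rng.permutation(toy_inputs.shape[0])  # shuffled row indices
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]

# The pairs stay aligned: the first feature still matches the target in every row
print(np.array_equal(shuffled_toy_inputs[:, 0], shuffled_toy_targets))
```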

**Split the dataset**

In [9]:
samples_count = shuffled_inputs.shape[0]

# We will use the 70-20-10 split
train_count = int(.7*samples_count)
val_count = int(.2*samples_count)
test_count = samples_count - train_count - val_count

# Splitting the dataset
train_inputs = shuffled_inputs[:train_count]
train_targets = shuffled_targets[:train_count]
print(f'% of ones in the training target set: {train_targets.sum()/train_count:.1%}')

val_inputs = shuffled_inputs[train_count:train_count+val_count]
val_targets = shuffled_targets[train_count:train_count+val_count]
print(f'% of ones in the validation target set: {val_targets.sum()/val_count:.1%}')

test_inputs = shuffled_inputs[train_count+val_count:]
test_targets = shuffled_targets[train_count+val_count:]
print(f'% of ones in the test target set: {test_targets.sum()/test_count:.1%}')

% of ones in the training target set: 50.6%
% of ones in the validation target set: 49.3%
% of ones in the test target set: 47.4%


All three sets contain roughly 50 % ones, so they are balanced and we are good to go.
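
The 70-20-10 split can also be wrapped in small helpers, which makes the arithmetic easy to check. This is a sketch under the assumption that the data has already been shuffled; the function names `split_counts` and `train_val_test_split` are hypothetical.

```python
import numpy as np

def split_counts(n_samples, train_frac=0.7, val_frac=0.2):
    """Return (train, val, test) sizes; the test set takes whatever remains."""
    train_count = int(train_frac * n_samples)
    val_count = int(val_frac * n_samples)
    test_count = n_samples - train_count - val_count
    return train_count, val_count, test_count

def train_val_test_split(inputs, targets, train_frac=0.7, val_frac=0.2):
    """Cut already-shuffled arrays into three consecutive (inputs, targets) pairs."""
    train_count, val_count, _ = split_counts(inputs.shape[0], train_frac, val_frac)
    bounds = [train_count, train_count + val_count]
    return list(zip(np.split(inputs, bounds), np.split(targets, bounds)))

# Toy data (hypothetical): 10 rows, 2 features
toy_inputs = np.arange(20, dtype=float).reshape(10, 2)
toy_targets = np.arange(10, dtype=float)

(train_x, train_y), (val_x, val_y), (test_x, test_y) = train_val_test_split(toy_inputs, toy_targets)
print(train_x.shape[0], val_x.shape[0], test_x.shape[0])
```

Because the test size is computed as the remainder, the three sizes always add up to the number of samples even when the fractions do not divide it evenly.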

**Save the data as `.npz` files**

In [10]:
np.savez('audiobook_train', inputs=train_inputs, targets=train_targets)
np.savez('audiobook_val', inputs=val_inputs, targets=val_targets)
np.savez('audiobook_test', inputs=test_inputs, targets=test_targets)
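
For completeness, here is a hedged sketch of how such a file can be read back with `np.load`, using the same `inputs` and `targets` keys. It writes a toy file to a temporary directory so that it does not touch the real data; the file name `toy_split` and the integer cast of the targets are assumptions, the latter being a common choice when the labels feed a classification loss.

```python
import os
import tempfile
import numpy as np

# Toy arrays (hypothetical) standing in for one of the saved splits
toy_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_split')  # np.savez appends the '.npz' extension
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # Arrays are retrieved by the keyword names used when saving
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape, loaded_targets)
```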