# Audiobooks business case

### Problem

-   Data: 

    - The data comes from an audiobook app and relates to the audio versions of books only.
    
    - Each customer is described by the following variables:

        - Customer ID
        - Book length overall (sum of the lengths, in minutes, of all purchases)
        - Book length avg (average length, in minutes, of all purchases)
        - Price paid overall (sum of the prices of all purchases)
        - Price paid avg (average price of all purchases)
        - Review (a Boolean: whether the customer left a review)
        - Review out of 10 (the score, if the customer left a review)
        - Total minutes listened
        - Completion (from 0 to 1)
        - Support requests (number of support requests, covering everything from a forgotten password to help using the app)
        - Last visited minus purchase date (in days)

    - The target is a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the target records whether the customer bought again in the following 6 months. This is a classification problem with two classes, "won't buy" and "will buy", represented by 0 and 1.


-   Goal: 

    Build a machine learning model, based on the available data, that predicts whether a customer will buy again from the audiobook company. The model can also help identify which metrics matter most for a customer coming back. Identifying customers who are likely to return creates value and growth opportunities.


In [14]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

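# Load the data.
# NOTE: the file appears to have no header row, so read_csv promotes the first
# record to column names (see the output of the next cell). Passing header=None
# would keep that record; the outputs shown in this notebook were produced with
# the call as written.
raw_csv_data = pd.read_csv('Audiobooks_data.csv')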

# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is our targets)

unscaled_inputs_all = raw_csv_data.iloc[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data.iloc[:,-1]

In [15]:
raw_csv_data

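| Unnamed: 0 | 00994 | 1620 | 1620.1 | 19.73 | 19.73.1 | 1 | 10.00 | 0.99 | 1603.80 | 5 | 92 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1143 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |
| 1 | 2059 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.00 | 0.0 | 0 | 388 | 0 |
| 2 | 2882 | 1620.0 | 1620 | 5.96 | 5.96 | 0 | 8.91 | 0.42 | 680.4 | 1 | 129 | 0 |
| 3 | 3342 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.22 | 475.2 | 0 | 361 | 0 |
| 4 | 3416 | 2160.0 | 2160 | 4.61 | 4.61 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14078 | 28220 | 1620.0 | 1620 | 5.33 | 5.33 | 1 | 9.00 | 0.61 | 988.2 | 0 | 4 | 0 |
| 14079 | 28671 | 1080.0 | 1080 | 6.55 | 6.55 | 1 | 6.00 | 0.29 | 313.2 | 0 | 29 | 0 |
| 14080 | 31134 | 2160.0 | 2160 | 6.14 | 6.14 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |
| 14081 | 32832 | 1620.0 | 1620 | 5.33 | 5.33 | 1 | 8.00 | 0.38 | 615.6 | 0 | 90 | 0 |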
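
The column labels in the output above (`00994`, `1620`, `1620.1`, ...) look like a data record rather than names, which suggests the CSV has no header row. A minimal sketch of reading such a file without losing the first record is given below. The column names are hypothetical (they are not in the original file) and their order is an assumption inferred from the values, where the 0-to-1 completion figure appears before the minutes listened. The two sample records are copied from the output above and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical column names; the order is an assumption inferred from the data.
column_names = [
    "customer_id", "book_length_overall", "book_length_avg",
    "price_overall", "price_avg", "review", "review_out_of_10",
    "completion", "minutes_listened", "support_requests",
    "last_visited_minus_purchase", "target",
]

# Two records taken from the displayed output, used instead of the real CSV.
sample_csv = io.StringIO(
    "00994,1620,1620,19.73,19.73,1,10.00,0.99,1603.80,5,92,0\n"
    "1143,2160.0,2160,5.33,5.33,0,8.91,0.00,0.0,0,0,0\n"
)

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(sample_csv, header=None, names=column_names)

# Same slicing as in the notebook: drop the ID column and the target column.
inputs = df.iloc[:, 1:-1]
targets = df.iloc[:, -1]

print(df.shape, inputs.shape, targets.shape)
```

With named columns, later steps can also refer to features by name instead of by position.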


### Balance the dataset

In [16]:
# Count how many targets are 1 
num_one_targets = int(np.sum(targets_all))

zero_targets_counter = 0

indices_to_remove = []

# Count the targets that are 0.
# Once as many 0s as 1s have been seen, mark every further 0-target row for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Delete all indices in indices_to_remove.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)

In [17]:
targets_equal_priors.shape


(4474,)

In [18]:
unscaled_inputs_equal_priors.shape

(4474, 10)
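
The balancing loop above can also be written without an explicit Python loop. The sketch below uses toy arrays (an assumption for illustration, not the real data) and is meant to reproduce the same rule: keep every 1, keep the first zeros until their count matches the number of 1s, and drop the remaining zeros.

```python
import numpy as np

# Toy data standing in for the real inputs and targets.
targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])
inputs = np.arange(len(targets) * 2).reshape(len(targets), 2)

num_one_targets = int(targets.sum())

# Positions of all zero targets, in their original order.
zero_positions = np.flatnonzero(targets == 0)

# Keep the first `num_one_targets` zeros and remove the rest.
indices_to_remove = zero_positions[num_one_targets:]

balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets, indices_to_remove, axis=0)

print(balanced_targets, balanced_inputs.shape)
```

Note that this rule always discards the zeros that appear latest in the file. If the rows are ordered in some meaningful way, shuffling before balancing may give a more representative sample of the majority class.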

### Standardize the inputs

In [19]:
scaler = StandardScaler()
scaled_inputs = scaler.fit_transform(unscaled_inputs_equal_priors)
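
One caveat: fitting the scaler on the full dataset means the validation and test rows contribute to the mean and standard deviation used on the training rows. A common alternative is to fit the scaler on the training split only and reuse it for the other splits. The sketch below illustrates that ordering on synthetic data; the array names and sizes are assumptions and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the balanced, unscaled inputs and targets.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
y = rng.integers(0, 2, size=200)

# Split first, so the held-out rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned here only
X_test_scaled = scaler.transform(X_test)        # same statistics reused

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The held-out split will generally not have exactly zero mean and unit variance after this transform, which is expected.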

### Shuffle the data

In [20]:
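# Shuffle inputs and targets together while preserving the pairing.
# The explicit shuffle is left commented out: train_test_split (next step)
# shuffles the rows by default (shuffle=True) and keeps inputs and targets paired.
# shuffled_inputs, shuffled_targets = shuffle(scaled_inputs, targets_equal_priors, random_state=42)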
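
For reference, a small sketch of what `sklearn.utils.shuffle` does when given two arrays: both are permuted with the same row order, so each input row stays attached to its target. The toy arrays are assumptions chosen so the pairing is easy to check.

```python
import numpy as np
from sklearn.utils import shuffle

# Row i of `inputs` is [2*i, 2*i + 1], and its target is i.
inputs = np.arange(10).reshape(5, 2)
targets = np.arange(5)

shuffled_inputs, shuffled_targets = shuffle(inputs, targets, random_state=42)

# The pairing survives: the first column is still twice the target.
pairing_preserved = np.array_equal(shuffled_inputs[:, 0], 2 * shuffled_targets)
print(pairing_preserved)
```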

### Split the dataset into train, validation, and test

In [21]:
# Split into training (80%) and temporary (20%) sets
train_inputs, temp_inputs, train_targets, temp_targets = train_test_split(
    scaled_inputs, targets_equal_priors, test_size=0.2, random_state=42
)
# print(len(train_inputs), len(train_targets))
# print(len(temp_inputs), len(temp_targets))

# Split the temporary set into validation and test (50% each of the remaining 20%)
validation_inputs, test_inputs, validation_targets, test_targets = train_test_split(
    temp_inputs, temp_targets, test_size=0.5, random_state=42
)

# Check balancing
print(np.sum(train_targets), len(train_targets), np.sum(train_targets) / len(train_targets))
print(np.sum(validation_targets), len(validation_targets), np.sum(validation_targets) / len(validation_targets))
print(np.sum(test_targets), len(test_targets), np.sum(test_targets) / len(test_targets))

1787 3579 0.4993014808605756
226 447 0.5055928411633109
224 448 0.5
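
The class proportions printed above are close to 50% but not identical across splits, because a plain random split does not control the class mix. If exact proportions matter, `train_test_split` accepts a `stratify` argument. The sketch below shows the same 80/10/10 scheme with stratification on toy data; the arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 1000 rows, half zeros and half ones.
X = np.arange(2000).reshape(1000, 2)
y = np.array([0, 1] * 500)

# 80% train, 20% temporary, keeping the class ratio in both parts.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split the temporary part in half: 10% validation, 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

for name, part in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(name, int(part.sum()), len(part), part.mean())
```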


### Save the three datasets in *.npz

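An .npz file is a simple way to combine multiple arrays into a single file.

Internally, an .npz file is a Zip archive containing one .npy file per array, named after the keyword used when saving (here, `inputs` and `targets`).
`np.savez` stores the arrays uncompressed; `np.savez_compressed` writes the same format with compression.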

In [22]:
# Save the three datasets in *.npz.

np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
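
To read a saved split back, `np.load` returns an object that is indexed by the keyword names used in `np.savez`. The sketch below writes a small toy split to a temporary directory and loads it again; the file name `example_split` and the toy arrays are hypothetical, and the casts to `float64` and `int64` are only one reasonable choice of types.

```python
import os
import tempfile

import numpy as np

# Toy arrays standing in for one of the saved splits.
inputs = np.random.default_rng(0).normal(size=(4, 10))
targets = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_split")  # hypothetical file name

    # np.savez appends the .npz extension when it is missing.
    np.savez(path, inputs=inputs, targets=targets)

    # Load inside a context manager so the archive is closed afterwards.
    with np.load(path + ".npz") as npz:
        stored_names = sorted(npz.files)
        loaded_inputs = npz["inputs"].astype(np.float64)
        loaded_targets = npz["targets"].astype(np.int64)

print(stored_names, loaded_inputs.shape, loaded_targets.shape)
```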