# Audiobooks business case

# Problem

You are given data from an audiobook app, so all of it relates to the audio versions of books only. Every customer in the database has made at least one purchase, which is why they are in the database. The purpose of this project is to create a machine learning algorithm, based on our available data, that can predict whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is little reason to spend money on advertising to them. If we can focus our efforts on customers who are likely to convert again, we can make great savings. Moreover, this model can help identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length overall (sum of the length in minutes of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, their score out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from a forgotten password to assistance with using the App), and Last visited minus purchase date (in days).

These are the inputs (excluding the customer ID, which is completely arbitrary: it only identifies the person, more like a name than a number).
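
As a minimal sketch of this layout, assuming the CSV columns follow the order in which the variables are listed above, the slicing used later in this notebook leaves exactly 10 input columns. The column labels are illustrative only (the file has no header), and the single row of numbers is made up.

```python
import numpy as np

# Illustrative labels only; the order is assumed to match the list above.
column_names = [
    "customer_id",
    "book_length_overall", "book_length_avg",
    "price_paid_overall", "price_paid_avg",
    "review", "review_out_of_10",
    "total_minutes_listened", "completion",
    "support_requests", "last_visited_minus_purchase_date",
    "target",
]

# One made-up row with the same number of columns (not real customer data).
fake_row = np.arange(len(column_names), dtype=float).reshape(1, -1)

# Drop the first column (ID) and the last column (target), as the notebook does.
inputs = fake_row[:, 1:-1]
input_names = column_names[1:-1]
targets = fake_row[:, -1]

print(inputs.shape[1], len(input_names))
```

This is also where the value `input_size = 10` used for the model later on comes from.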

The targets are a Boolean variable (0 or 1). We take a period of 2 years for our inputs and the following 6 months for the targets. In other words, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. Six months sounds like a reasonable time: if they don't convert within 6 months, chances are they've gone to a competitor or didn't like something about the way Audiobook delivers information.

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again. This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (.npz)

When dealing with real-life data, it's crucial that we preprocess it to create a good model.

If you want to know how to do that, go through the commented code. In any case, this should do the trick for most datasets organized in this way: many inputs, followed by 1 cell containing the targets (supervised learning datasets). Keep in mind that a specific problem may require additional preprocessing.

PS: The header row (which contains the names of the categories) has been excluded, solely because we just want the data.
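
`np.loadtxt` expects purely numeric content, which is why a header row would get in the way. If your own copy of a CSV still has one, one option is the `skiprows` parameter. A small sketch, using an in-memory file with made-up values and hypothetical column names:

```python
import io
import numpy as np

# Made-up CSV text with a header line; column names and values are hypothetical.
csv_text = "id,feature_a,feature_b,target\n1,10.0,0.5,0\n2,20.0,0.7,1\n"

# skiprows=1 tells loadtxt to ignore the first line (the header).
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

print(data.shape)  # rows x columns of numeric data only
```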

### Quick notes

This is a project made by a data science student, so I wrote the comments to remind myself what each part of the code does, how it is useful for data scientists, and what I should keep in mind while building deep learning models. Keep in mind that many of the comments are not useful if you are already an experienced data scientist, but they can be really useful to newcomers (such as myself).

Thank you for your attention and have fun.

### Import the libraries and extract the data from the csv

In [12]:
#NumPy is a package for working with multidimensional arrays in Python, widely used in data science.
import numpy as np
#The sklearn preprocessing module is used to standardize the data.
from sklearn import preprocessing
#TensorFlow is only used for the machine learning algorithm, not for preprocessing the data.
import tensorflow as tf

#Load the data and assign it to the variable raw_csv_data
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

#The inputs are all columns in the csv except the first one [:,0], which is the customer ID (it contains no useful information),
#and the last one [:,-1], which holds our targets.

#Load all columns except the first (IDs) and the last (targets) and assign them to unscaled_inputs_all
unscaled_inputs_all = raw_csv_data[:,1:-1]

#The targets are in the last column. That's how datasets are conventionally organized.
#Load the targets and assign it to targets_all
targets_all = raw_csv_data[:,-1]

### Balance the dataset

In [13]:
#Count how many targets are 1(meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

#Set a counter for targets that are 0(meaning that the customer did not convert)
zero_targets_counter = 0

#We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
#Declare a variable that will do that:
indices_to_remove = []

#We want to have the same number of 0s and 1s on targets.
#Count the number of targets that are 0. 
#Once there are as many 0s as 1s, mark entries where the target is 0.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

#Create two new variables, one that will contain the inputs and one that will contain the targets,
#by deleting all the indices that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
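
The loop above keeps every 1 and only the first `num_one_targets` zeros. The same selection can be sketched without an explicit loop. This is an alternative formulation, assuming the targets contain only 0s and 1s and that zeros outnumber ones, shown on a small made-up array:

```python
import numpy as np

# Made-up targets for illustration: 3 ones and 6 zeros.
targets_all = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])

num_one_targets = int(np.sum(targets_all))

# Positions of all zeros, in their original order.
zero_indices = np.flatnonzero(targets_all == 0)

# Every zero after the first num_one_targets zeros is surplus.
indices_to_remove = zero_indices[num_one_targets:]

targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)

print(targets_equal_priors)
```

The same `indices_to_remove` would be passed to `np.delete` for the inputs, so that inputs and targets stay aligned.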

### Standardize the inputs

In [14]:
#Here we will use the sklearn functionality, which has good preprocessing capabilities.
#If you try to run this code without standardizing the inputs, you will get 
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
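
For reference, `preprocessing.scale` standardizes each column to zero mean and unit variance. A NumPy-only sketch of that computation on made-up numbers (it does not handle constant columns, where the standard deviation is zero):

```python
import numpy as np

# Made-up inputs: 4 samples, 2 features on very different scales.
unscaled = np.array([[1000.0, 0.1],
                     [2000.0, 0.2],
                     [3000.0, 0.3],
                     [4000.0, 0.4]])

# Column-wise mean and (population) standard deviation.
mean = unscaled.mean(axis=0)
std = unscaled.std(axis=0)

scaled = (unscaled - mean) / std

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both features live on a comparable range even though the raw values differ by several orders of magnitude.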

### Shuffle the data

In [15]:
#When the data was collected it was arranged by date, so we shuffle it to avoid feeding the model ordered data.
#Since we will be batching, we want the data to be as randomly spread out as possible.
#Create the variable shuffled_indices and shuffle it randomly, so the data is not arranged in any particular order when we feed it.
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

#Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
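
Indexing both arrays with the same shuffled indices keeps every input row paired with its target. The sketch below checks this on made-up data. It also uses a seeded generator (`np.random.default_rng`, an addition that this notebook does not use) so that the shuffle is repeatable between runs:

```python
import numpy as np

# Made-up data: the target of each row equals the first value of that row.
inputs = np.array([[0, 0], [1, 10], [2, 20], [3, 30], [4, 40]])
targets = inputs[:, 0].copy()

# A seeded generator makes the permutation reproducible; 42 is arbitrary.
rng = np.random.default_rng(42)
shuffled_indices = rng.permutation(inputs.shape[0])

shuffled_inputs = inputs[shuffled_indices]
shuffled_targets = targets[shuffled_indices]

# Pairs stay aligned after shuffling.
print(np.array_equal(shuffled_inputs[:, 0], shuffled_targets))
```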

### Split the dataset into train, validation, and test

In [16]:
#Count the total number of samples and assign it to the samples_count variable.
samples_count = shuffled_inputs.shape[0]

#Keep in mind that the training dataset should be much bigger than the other two.
#Count the samples in each subset, assuming we want an 80-10-10 (commonly used) distribution of training, validation, and test.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

#We don't need to repeat the process above for the test dataset; it simply gets all of the remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

#Create variables that record the inputs and targets for training
#In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

#Create variables that record the inputs and targets for validation.
#They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

#Create variables that record the inputs and targets for test.
#Again, they are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

#Each time you rerun this code you will get different values, because the data is shuffled randomly every time.
#Normally you only preprocess once, so you don't need to run the preprocessing part of the code every time.
#Check that your training, validation and test sets are balanced, since they are taken from a shuffled dataset.

#Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
#You want the proportions to be close to 50% (0.5), and the training dataset to be a lot bigger
#than the rest.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1791.0 3579 0.5004191114836547
226.0 447 0.5055928411633109
220.0 448 0.49107142857142855
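
The three sample counts printed above follow from the integer arithmetic of the split. Taking the total of 4474 samples (3579 + 447 + 448 from the output above), a short check with a hypothetical helper named `split_counts`:

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; test takes whatever remains."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

# 4474 is the sum of the three sample counts printed above.
print(split_counts(4474))  # → (3579, 447, 448)
```

Because `int()` truncates, the test set absorbs the rounding remainder, which is why it ends up one sample larger than the validation set here.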


### Save the three datasets in *.npz

In [17]:
#Save the three datasets in *.npz.
#Always remember to name your datasets in a coherent way, in this case, the name of the original file + the dataset.
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
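
A small round-trip sketch of the `.npz` format, using a temporary directory and made-up arrays so that it does not touch the files saved above. It shows two details worth knowing: `np.savez` appends the `.npz` extension itself, and the arrays are later retrieved by the keyword names used when saving.

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for a real dataset.
inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    # No extension given: np.savez appends ".npz" itself.
    path_without_ext = os.path.join(tmp_dir, "example_train")
    np.savez(path_without_ext, inputs=inputs, targets=targets)

    saved_files = os.listdir(tmp_dir)

    # Arrays are retrieved by the keyword names used when saving.
    with np.load(path_without_ext + ".npz") as npz:
        loaded_inputs = npz["inputs"]
        loaded_targets = npz["targets"]

print(saved_files)
print(np.array_equal(loaded_inputs, inputs), np.array_equal(loaded_targets, targets))
```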

All the code above preprocesses the data: it balances the dataset, splits it into training, validation and test sets, and saves them in a format suited to machine learning (.npz).
Now it's time to create the machine learning algorithm.

### Load the data

In [18]:
#Create a temporary variable npz, where we will store each of the three Audiobooks datasets
npz = np.load('Audiobooks_data_train.npz')

#Extract the inputs using the keyword under which we saved them
#To ensure that they are all floats, let's also take care of that
train_inputs = npz['inputs'].astype(float)
#Targets must be int because of sparse_categorical_crossentropy (we want to be able to smoothly one-hot encode them)
train_targets = npz['targets'].astype(int)

#Load the validation data in the temporary variable
npz = np.load('Audiobooks_data_validation.npz')
#We can load the inputs and the targets in the same line
validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

#Load the test data in the npz variable.
npz = np.load('Audiobooks_data_test.npz')
#Create the variables for the test inputs and targets.
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

### Create and train the model

In [24]:
#Set the input and output sizes
input_size = 10
output_size = 2
#Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50
    
#Define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])


#Choose the optimizer and the loss function

#We define the optimizer we'd like to use, the loss function, and the metrics we are interested in obtaining at 
#each iteration.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

#Training
#That's where we train the model we have built.

#Set the batch size
batch_size = 100

#Set a maximum number of training epochs
#This will rarely be the real number of epochs, because beyond a certain point continuing to train would only increase the
#chances of overfitting the model.
max_epochs = 100

#Set an early stopping mechanism
#We will set patience=2, to be a bit tolerant of random validation loss increases.
#This is very important to stop the training before the model starts overfitting.
#When the validation loss starts increasing, that may be an indicator of overfitting.
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

#Fit the model
#Note that this time the train, validation and test data are not iterable
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions called by a task when a task is completed
          # task here is to check if val_loss is increasing
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2 # making sure we get enough information about the training process
          )  

Epoch 1/100
36/36 - 0s - loss: 0.5261 - accuracy: 0.8284 - val_loss: 0.3941 - val_accuracy: 0.8837
Epoch 2/100
36/36 - 0s - loss: 0.3577 - accuracy: 0.8779 - val_loss: 0.3122 - val_accuracy: 0.8949
Epoch 3/100
36/36 - 0s - loss: 0.3152 - accuracy: 0.8838 - val_loss: 0.2912 - val_accuracy: 0.8949
Epoch 4/100
36/36 - 0s - loss: 0.2963 - accuracy: 0.8896 - val_loss: 0.2761 - val_accuracy: 0.8993
Epoch 5/100
36/36 - 0s - loss: 0.2832 - accuracy: 0.8947 - val_loss: 0.2643 - val_accuracy: 0.9060
Epoch 6/100
36/36 - 0s - loss: 0.2752 - accuracy: 0.8963 - val_loss: 0.2545 - val_accuracy: 0.9150
Epoch 7/100
36/36 - 0s - loss: 0.2689 - accuracy: 0.8986 - val_loss: 0.2487 - val_accuracy: 0.9150
Epoch 8/100
36/36 - 0s - loss: 0.2624 - accuracy: 0.8991 - val_loss: 0.2446 - val_accuracy: 0.9195
Epoch 9/100
36/36 - 0s - loss: 0.2579 - accuracy: 0.9011 - val_loss: 0.2399 - val_accuracy: 0.9195
Epoch 10/100
36/36 - 0s - loss: 0.2548 - accuracy: 0.9025 - val_loss: 0.2434 - val_accuracy: 0.9195
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x1d85e385b48>
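
The early stopping rule can be sketched in plain Python. This is a simplified model of what `EarlyStopping(patience=2)` does with its default settings (monitoring `val_loss`), not the Keras implementation itself, and `stopping_epoch` is a hypothetical helper name. The first ten values are the `val_loss` figures from the log above; the eleventh is hypothetical, since the log is cut off at that point.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# val_loss values for epochs 1 to 10, copied from the training log above.
logged = [0.3941, 0.3122, 0.2912, 0.2761, 0.2643,
          0.2545, 0.2487, 0.2446, 0.2399, 0.2434]

# Only one epoch without improvement so far, so no stop yet.
print(stopping_epoch(logged))
# Hypothetical 11th value that is still above the best loss of 0.2399.
print(stopping_epoch(logged + [0.2410]))
```

With `patience=2`, a single bad epoch (such as epoch 10 in the log) is tolerated, and training only stops once a second consecutive epoch fails to beat the best validation loss.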

### Test the model
After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.

The test is the absolute final instance. You should not test before you are completely done with adjusting your model.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

In [20]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [21]:
#Prints the loss and accuracy of the test dataset.
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.23. Test accuracy: 91.96%
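
To make the two reported numbers concrete, here is a NumPy sketch of how a sparse categorical cross-entropy loss and an accuracy can be computed from softmax outputs. The probabilities and targets are made up, and the code mirrors the definitions rather than the Keras internals:

```python
import numpy as np

# Made-up softmax outputs for 4 customers: columns are P(class 0), P(class 1).
probabilities = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.6, 0.4],
                          [0.3, 0.7]])
# Made-up integer targets (0 = won't buy, 1 = will buy).
targets = np.array([0, 1, 1, 1])

# Loss: mean negative log of the probability assigned to the true class.
true_class_probabilities = probabilities[np.arange(len(targets)), targets]
loss = -np.mean(np.log(true_class_probabilities))

# Accuracy: share of samples whose most probable class equals the target.
predicted_classes = np.argmax(probabilities, axis=1)
accuracy = np.mean(predicted_classes == targets)

print(round(loss, 4), accuracy)
```

Note that the targets are plain integers rather than one-hot vectors, which is the "sparse" part of `sparse_categorical_crossentropy`.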


Using the initial model and hyperparameters given in this notebook, the final test accuracy should be around 91%.

Note that each time the code is rerun, we get a different accuracy because each training is different. 

Please note that this is a suboptimal solution: there is still room to build on it and improve it.