### Practical Example. Audiobooks

## Problem

You are given data from an Audiobook app. Logically, it relates only to the audio versions of books. Each customer in the database has made a purchase at least once, which is why they are in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts ONLY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

So these are the inputs (excluding Customer ID, as it is completely arbitrary; it's more like a name than a number).

The targets are a Boolean variable (so 0 or 1). We are taking a period of 2 years in our inputs, and the next 6 months as targets. So, in fact, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again. 

This is a classification problem with two classes: "won't buy" and "will buy", represented by 0s and 1s.


## Create the machine learning algorithm

### Import the relevant libraries

In [1]:
import numpy as np
import tensorflow as tf

### Data

In [2]:
npz = np.load('Audiobooks_data_train.npz')

train_inputs = npz['inputs'].astype(float)
train_targets = npz['targets'].astype(int)

npz = np.load('Audiobooks_data_validation.npz')

validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

npz = np.load('Audiobooks_data_test.npz')

test_inputs, test_targets = npz["inputs"].astype(float), npz['targets'].astype(int)

# Unlike the MNIST example, our train, validation and test data are simply in array form.
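
The loading cell above relies on each `.npz` archive holding an `inputs` array and a `targets` array. As a minimal sketch of that layout, the snippet below writes a small synthetic archive and reads it back the same way. The data, the file name `demo_split.npz`, and the `demo_*` / `loaded_*` names are assumptions made for illustration; they are not the real Audiobooks files.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in data (NOT the real Audiobooks data): 6 customers, 10 features each.
rng = np.random.default_rng(0)
demo_inputs = rng.normal(size=(6, 10))
demo_targets = np.array([0, 1, 0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this round trip.
    path = os.path.join(tmp_dir, "demo_split.npz")
    np.savez(path, inputs=demo_inputs, targets=demo_targets)

    # Read the archive back with the same keys and casts as the cell above.
    with np.load(path) as npz_demo:
        loaded_inputs = npz_demo["inputs"].astype(float)
        loaded_targets = npz_demo["targets"].astype(int)

# Quick sanity checks: one row per customer, one target per row, and the class counts.
print(loaded_inputs.shape, loaded_targets.shape)
print(np.bincount(loaded_targets))
```

Running the same shape and class-count checks on the real train, validation and test arrays is a cheap way to confirm that each split has the expected 10 input columns before training.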

### Model

Outline, optimizers, loss, early stopping and training

In [3]:
input_size = 10
output_size = 2
hidden_layer_size = 50

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size = 100

max_epochs = 100

# model.fit(train_inputs, 
#           train_targets,
#           batch_size=batch_size,
#           epochs = max_epochs,
#           validation_data = (validation_inputs, validation_targets),
#           verbose = 2
#          )
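
The model ends in a 2-unit softmax layer and is trained with `sparse_categorical_crossentropy`, which takes integer class labels (0 or 1) rather than one-hot vectors. The NumPy sketch below shows the idea behind that pairing on made-up numbers. The helper functions and the `demo_*` values are illustrative assumptions, and library implementations typically add numerical safeguards that this sketch omits.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(targets, probs):
    # Pick the predicted probability of the true class for each row,
    # then average the negative log of those probabilities.
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked).mean()


# Made-up raw outputs of a 2-unit final layer for three customers.
demo_logits = np.array([[2.0, 0.0],
                        [0.0, 0.0],
                        [-1.0, 1.0]])
demo_targets = np.array([0, 1, 1])  # integer labels, not one-hot vectors

demo_probs = softmax(demo_logits)
demo_loss = sparse_categorical_crossentropy(demo_targets, demo_probs)

print(demo_probs)
print(demo_loss)
```

Each row of `demo_probs` sums to 1, so the two outputs can be read as the probabilities of "won't buy" and "will buy".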

### Setting an Early Stop Mechanism

In [4]:
# We add early stopping because the model was overfitting: the validation loss sometimes increased during training.
# We didn't use this on the MNIST example because that dataset was very well preprocessed. This time we need an early stopping mechanism.

# By default, EarlyStopping monitors the validation loss and stops the training process
# the first time the validation loss stops improving.
# A small increase is not always a problem, so we may prefer to let 1 or 2 validation increases slide.
# The 'patience' argument of EarlyStopping() sets how many such epochs we tolerate.
early_stopping = tf.keras.callbacks.EarlyStopping(patience = 2)
model.fit(train_inputs, 
          train_targets,
          batch_size=batch_size,
          epochs = max_epochs,
          callbacks = [early_stopping],
          validation_data = (validation_inputs, validation_targets),
          verbose = 2
         )



Epoch 1/100
36/36 - 4s - 107ms/step - accuracy: 0.7125 - loss: 0.5684 - val_accuracy: 0.7584 - val_loss: 0.4770
Epoch 2/100
36/36 - 0s - 9ms/step - accuracy: 0.7687 - loss: 0.4483 - val_accuracy: 0.7651 - val_loss: 0.4170
Epoch 3/100
36/36 - 1s - 19ms/step - accuracy: 0.7877 - loss: 0.4045 - val_accuracy: 0.7785 - val_loss: 0.3920
Epoch 4/100
36/36 - 0s - 9ms/step - accuracy: 0.8027 - loss: 0.3830 - val_accuracy: 0.7606 - val_loss: 0.3875
Epoch 5/100
36/36 - 0s - 9ms/step - accuracy: 0.8033 - loss: 0.3709 - val_accuracy: 0.7875 - val_loss: 0.3717
Epoch 6/100
36/36 - 0s - 9ms/step - accuracy: 0.8078 - loss: 0.3628 - val_accuracy: 0.7562 - val_loss: 0.3778
Epoch 7/100
36/36 - 0s - 9ms/step - accuracy: 0.8134 - loss: 0.3551 - val_accuracy: 0.7740 - val_loss: 0.3616
Epoch 8/100
36/36 - 0s - 9ms/step - accuracy: 0.8145 - loss: 0.3508 - val_accuracy: 0.8098 - val_loss: 0.3579
Epoch 9/100
36/36 - 0s - 10ms/step - accuracy: 0.8134 - loss: 0.3460 - val_accuracy: 0.7897 - val_loss: 0.3527
Epoch 

<keras.src.callbacks.history.History at 0x1d56aa62950>
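
To make the role of `patience` concrete, here is a simplified pure-Python sketch of patience-based stopping, applied to the nine validation losses visible in the log above. The function `stopping_epoch` is a hypothetical helper written for illustration; it is not the Keras implementation, and it ignores options such as `min_delta`.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Simplified rule: count consecutive epochs without a new best validation
    loss and stop once that count reaches `patience`.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation losses copied from epochs 1 to 9 of the training log above.
logged_val_losses = [0.4770, 0.4170, 0.3920, 0.3875, 0.3717,
                     0.3778, 0.3616, 0.3579, 0.3527]

print(stopping_epoch(logged_val_losses, patience=2))  # None: no two misses in a row
print(stopping_epoch(logged_val_losses, patience=1))  # 6: the loss rose at epoch 6
```

Under this simplified rule, the single increase at epoch 6 is tolerated with `patience = 2`, whereas a patience of 1 would have ended training there.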

### Test model

In [5]:
# model.evaluate() returns the loss value and metrics values for the model in 'test mode'

test_loss, test_accuracy = model.evaluate(test_inputs,test_targets)

14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.8302 - loss: 0.3630


In [6]:
print('\nTest loss: {0:.2f}. Test accuracy: {1: .2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.35. Test accuracy:  82.59%
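
The accuracy reported above is the fraction of test customers whose most probable class matches the target. As a minimal sketch of that calculation, the snippet below uses made-up probabilities in place of real model output. The `demo_*` values are assumptions for illustration; with the trained model, the probabilities would presumably come from something like `model.predict(test_inputs)`.

```python
import numpy as np

# Made-up softmax outputs for four customers: columns are P(won't buy), P(will buy).
demo_probs = np.array([[0.9, 0.1],
                       [0.3, 0.7],
                       [0.6, 0.4],
                       [0.2, 0.8]])
demo_targets = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability.
predicted = demo_probs.argmax(axis=1)

# Accuracy is the share of predictions that match the targets.
accuracy = (predicted == demo_targets).mean()

print(f"Accuracy: {accuracy * 100:.2f}%")  # Accuracy: 75.00%
```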
