# Audiobooks

You are given data from an audiobook app. Logically, it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model, based on the available data, that predicts whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money on advertising to them. If we focus our efforts only on customers who are likely to convert again, we can make substantial savings. Moreover, the model can help identify which metrics matter most for a customer coming back. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. Six months seems like a reasonable window: if a customer doesn't convert within 6 months, chances are they have gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

Good luck!

## Import Libraries

In [3]:
import numpy as np
import tensorflow as tf


## Data

In [4]:
# Load the training split from its .npz archive into a temporary variable
temp_npz = np.load('Audiobooks_data_train.npz')
# Ensure all inputs are floats (np.float was removed in NumPy 1.24, so use np.float64)
train_inputs = temp_npz['inputs'].astype(np.float64)
# Ensure all targets are integers (0 or 1); np.int was removed as well, so use np.int64
train_targets = temp_npz['targets'].astype(np.int64)

# Do the same for the validation and test data
temp_npz = np.load('Audiobooks_data_validation.npz')
validation_inputs = temp_npz['inputs'].astype(np.float64)
validation_targets = temp_npz['targets'].astype(np.int64)

temp_npz = np.load('Audiobooks_data_test.npz')
test_inputs = temp_npz['inputs'].astype(np.float64)
test_targets = temp_npz['targets'].astype(np.int64)
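
Before building the model, it can be worth confirming that each split has the expected shape and that the targets contain only 0s and 1s. The sketch below is one possible way to do that. `summarize_split` is a hypothetical helper that is not part of the original notebook, and the small arrays stand in for the real `train_inputs` and `train_targets`.

```python
import numpy as np

def summarize_split(inputs, targets, n_features=10):
    """Check one data split and return its size and class counts.

    Hypothetical helper for illustration; n_features=10 matches the
    ten input variables described in the introduction.
    """
    inputs = np.asarray(inputs)
    targets = np.asarray(targets)
    # Inputs must be a 2-D array with one row per customer
    if inputs.ndim != 2 or inputs.shape[1] != n_features:
        raise ValueError(f"expected shape (n, {n_features}), got {inputs.shape}")
    # There must be exactly one target per input row
    if targets.shape != (inputs.shape[0],):
        raise ValueError("targets must have one entry per input row")
    # Targets must contain only the two class labels
    if not np.isin(targets, (0, 1)).all():
        raise ValueError("targets must contain only 0 and 1")
    counts = np.bincount(targets.astype(np.int64), minlength=2)
    return {"n_samples": int(inputs.shape[0]),
            "n_class_0": int(counts[0]),
            "n_class_1": int(counts[1])}

# Stand-in data: 6 customers with 10 features each
demo_inputs = np.zeros((6, 10), dtype=np.float64)
demo_targets = np.array([0, 1, 1, 0, 1, 0], dtype=np.int64)
summary = summarize_split(demo_inputs, demo_targets)
print(summary)
```

Calling the same helper on the validation and test splits would show whether the two classes are represented in similar proportions across all three, which matters when accuracy is the reported metric.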

## Model

In [6]:
# there are 10 inputs and 2 outputs
input_size = 10
output_size = 2
# hidden layer
hidden_layer_size = 50

#Model
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer works the same way, but uses softmax to output class probabilities
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

#Optimizer and loss function
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

#Training
# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
# note that this time the train, validation and test data are not iterable
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions that Keras calls at set points during training
          # this one checks val_loss after each epoch and stops training when it no longer improves
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose=2 # making sure we get enough information about the training process
          )


Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 1s - loss: 0.5332 - accuracy: 0.7890 - val_loss: 0.4171 - val_accuracy: 0.8568
Epoch 2/100
3579/3579 - 0s - loss: 0.3707 - accuracy: 0.8762 - val_loss: 0.3369 - val_accuracy: 0.8725
Epoch 3/100
3579/3579 - 0s - loss: 0.3196 - accuracy: 0.8868 - val_loss: 0.3137 - val_accuracy: 0.8770
Epoch 4/100
3579/3579 - 0s - loss: 0.2991 - accuracy: 0.8916 - val_loss: 0.3011 - val_accuracy: 0.8770
Epoch 5/100
3579/3579 - 0s - loss: 0.2864 - accuracy: 0.8938 - val_loss: 0.2950 - val_accuracy: 0.8814
Epoch 6/100
3579/3579 - 0s - loss: 0.2754 - accuracy: 0.8975 - val_loss: 0.2878 - val_accuracy: 0.8837
Epoch 7/100
3579/3579 - 0s - loss: 0.2685 - accuracy: 0.8991 - val_loss: 0.2834 - val_accuracy: 0.8859
Epoch 8/100
3579/3579 - 0s - loss: 0.2625 - accuracy: 0.9000 - val_loss: 0.2838 - val_accuracy: 0.8904
Epoch 9/100
3579/3579 - 0s - loss: 0.2562 - accuracy: 0.9030 - val_loss: 0.2789 - val_accuracy: 0.8881
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x1b24edf9f08>
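
To make the effect of `patience=2` concrete, here is a minimal pure-Python sketch of the stopping rule. It is a simplified model of the callback's default behaviour (monitoring `val_loss` with no minimum improvement threshold), not Keras's actual implementation, and `early_stopping_epoch` is a hypothetical name introduced for illustration. The first list holds the `val_loss` values for epochs 1 to 9 copied from the log above; the second list is made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without a new best val_loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss for epochs 1-9, copied from the training log above
logged = [0.4171, 0.3369, 0.3137, 0.3011, 0.2950, 0.2878, 0.2834, 0.2838, 0.2789]
print(early_stopping_epoch(logged, patience=2))

# Hypothetical run in which val_loss fails to improve twice in a row
print(early_stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.30], patience=2))
```

In the logged run, the validation loss rises slightly at epoch 8 and then improves again at epoch 9, so a patience of 2 lets training continue. Under this sketch, a patience of 1 would have stopped the same run at epoch 8.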

## Test the model

In [7]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [8]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.25. Test accuracy: 89.96%
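
The two numbers reported here can be reproduced by hand from the model's softmax outputs and the integer targets. The NumPy sketch below shows the idea under simple assumptions: the probabilities and targets are invented for illustration, and both function names are hypothetical rather than taken from the notebook or from Keras.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability assigned to the true class."""
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    # For each row, pick the probability predicted for its true class
    true_class_probs = probs[np.arange(len(targets)), targets]
    # Clip to avoid taking log(0)
    return float(-np.mean(np.log(np.clip(true_class_probs, 1e-7, 1.0))))

def accuracy(probs, targets):
    """Fraction of rows whose most probable class equals the target."""
    predictions = np.argmax(np.asarray(probs), axis=1)
    return float(np.mean(predictions == np.asarray(targets)))

# Made-up softmax outputs for 4 customers: columns are P(class 0), P(class 1)
demo_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4],
                       [0.3, 0.7]])
demo_targets = np.array([0, 1, 1, 1])

demo_loss = sparse_categorical_crossentropy(demo_probs, demo_targets)
demo_accuracy = accuracy(demo_probs, demo_targets)
print(f"loss: {demo_loss:.4f}, accuracy: {demo_accuracy:.2f}")
```

The third customer is misclassified, so accuracy is 3 out of 4, while the loss also reflects how confident each prediction was. This is why loss and accuracy do not always move together from one epoch to the next.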


### The final test accuracy is around 90%
### Note: each time the code is rerun, the accuracy will differ slightly, because weight initialization and batch shuffling are random.
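
The run-to-run variation comes from randomness in training, such as the starting weights. The NumPy sketch below illustrates only that one source: it draws a weight matrix for a 10-input, 50-unit layer with and without a fixed seed. The helper `init_dense_weights` and the uniform range it uses are assumptions made for illustration and do not reproduce what Keras does internally.

```python
import numpy as np

def init_dense_weights(n_inputs, n_units, seed=None):
    """Draw a random weight matrix for one dense layer (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Small symmetric uniform range; the exact scheme is an assumption,
    # chosen only to show the effect of seeding
    limit = np.sqrt(6.0 / (n_inputs + n_units))
    return rng.uniform(-limit, limit, size=(n_inputs, n_units))

# Without a seed, two draws give different starting weights
unseeded_a = init_dense_weights(10, 50)
unseeded_b = init_dense_weights(10, 50)

# With the same seed, two draws give identical starting weights
seeded_a = init_dense_weights(10, 50, seed=42)
seeded_b = init_dense_weights(10, 50, seed=42)

print(np.array_equal(unseeded_a, unseeded_b))
print(np.array_equal(seeded_a, seeded_b))
```

In TensorFlow 2, the corresponding step is to call `tf.random.set_seed` before building the model, although seeding alone may not make every run identical on all hardware.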