# Audio books

The data come from an Audiobook app and describe purchases of audio versions of books. Every customer in the database has made at least one purchase, which is why they appear in it. We want to build a machine learning model from the available data that predicts whether a customer will buy again from the Audiobook company.

The idea is that if a customer has low potential, there is no reason to spend money on advertising to them.
If we focus our efforts only on customers who are likely to convert again, we can make substantial savings.
Moreover, the model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

We have a .csv summarizing the data. 
There are several variables:

- Customer ID
- Book length in mins_avg (average of all purchases)
- Book length in minutes_sum (sum of all purchases)
- Price Paid_avg (average of all purchases)
- Price paid_sum (sum of all purchases)
- Review (a Boolean variable)
- Review (out of 10)
- Total minutes listened
- Completion (from 0 to 1)
- Support requests (number)
- Last visited minus purchase date (in days)

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.
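As a minimal sketch of separating the arbitrary ID from the inputs, the snippet below slices a toy table. The column order (ID first, ten inputs, target last) and all values are assumptions made for illustration; they are not taken from the actual .csv.

```python
import numpy as np

# Toy stand-in for the raw table. Assumed layout (hypothetical):
# column 0 = customer ID, columns 1..10 = inputs, last column = target.
rng = np.random.default_rng(0)
n_customers = 5
ids = np.arange(1000, 1000 + n_customers).reshape(-1, 1)
features = rng.random((n_customers, 10))
targets = rng.integers(0, 2, size=(n_customers, 1))
raw = np.hstack([ids, features, targets])

# Drop the ID column and split off the target column.
inputs = raw[:, 1:-1]
labels = raw[:, -1].astype(int)

print(inputs.shape, labels.shape)
```

Keeping the ID out of the inputs prevents the model from treating an arbitrary identifier as if it carried numerical meaning.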

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months is a reasonable horizon: if customers don't convert within 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is to create a machine learning algorithm that predicts whether a customer will buy audiobooks from the company again.

We will use TensorFlow to build a classification model for this problem.

# 1. Import libraries

In [3]:
import numpy as np
import tensorflow as tf

# 2. Ensuring Data quality

In [7]:
# load the training data into a temporary variable npz, which is reused for each of the three Audiobooks datasets
npz = np.load('Audiobooks_data_train.npz')

# cast the inputs to float so that all features share the same numerical data type
train_inputs = npz['inputs'].astype(float)

# targets must be integer class indices (0 or 1): sparse_categorical_crossentropy
# expects integer labels rather than one-hot encoded vectors
train_targets = npz['targets'].astype(int)

# load the validation data into the temporary variable
npz = np.load('Audiobooks_data_validation.npz')

# extract the validation inputs and targets
validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

# load the test data into the temporary variable
npz = np.load('Audiobooks_data_test.npz')

# extract the test inputs and targets
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)
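The three loading steps repeat the same pattern, so they could be wrapped in a small helper that also checks the arrays line up. This is only a sketch: `load_split` and the toy file are hypothetical, and the only detail taken from the notebook is that each .npz file holds `inputs` and `targets` arrays.

```python
import os
import tempfile

import numpy as np


def load_split(path):
    """Load one .npz split, casting inputs to float and targets to int."""
    with np.load(path) as npz:
        inputs = npz['inputs'].astype(float)
        targets = npz['targets'].astype(int)
    # every input row needs exactly one target
    if inputs.shape[0] != targets.shape[0]:
        raise ValueError('inputs and targets have different lengths')
    return inputs, targets


# Toy stand-in file so the sketch runs without the real datasets.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_split.npz')
np.savez(toy_path,
         inputs=np.ones((4, 10), dtype=np.float32),
         targets=np.array([0, 1, 1, 0], dtype=np.int8))

toy_inputs, toy_targets = load_split(toy_path)
print(toy_inputs.dtype, toy_inputs.shape, toy_targets.shape)
```

With the real files, the same helper could be called once per split, for example `load_split('Audiobooks_data_train.npz')`.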

# 3. Modelling

In [9]:

input_size = 10
output_size = 2
hidden_layer_size = 50
    
# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense implements: output = activation(dot(input, weight) + bias)

    # it takes several arguments, but the most important ones for us are the layer size and the activation function

    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer

    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer

    # the final layer is no different, except that it uses the softmax activation
    # so that the two outputs can be read as class probabilities

    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])
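To make the formula `output = activation(dot(input, weight) + bias)` concrete, here is a NumPy-only sketch of a forward pass with the same layer sizes. The random weights and the toy batch are assumptions for illustration; this is not how Keras initializes or stores its parameters.

```python
import numpy as np


def relu(x):
    # element-wise max(x, 0)
    return np.maximum(x, 0.0)


def softmax(x):
    # subtract the row maximum for numerical stability
    exp = np.exp(x - x.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def dense(inputs, weights, bias, activation):
    # output = activation(dot(input, weight) + bias)
    return activation(inputs @ weights + bias)


rng = np.random.default_rng(0)
input_size, hidden_layer_size, output_size = 10, 50, 2

# illustrative random parameters (hypothetical values)
w1, b1 = rng.normal(scale=0.1, size=(input_size, hidden_layer_size)), np.zeros(hidden_layer_size)
w2, b2 = rng.normal(scale=0.1, size=(hidden_layer_size, hidden_layer_size)), np.zeros(hidden_layer_size)
w3, b3 = rng.normal(scale=0.1, size=(hidden_layer_size, output_size)), np.zeros(output_size)

batch = rng.normal(size=(4, input_size))     # 4 toy customers
hidden1 = dense(batch, w1, b1, relu)         # 1st hidden layer
hidden2 = dense(hidden1, w2, b2, relu)       # 2nd hidden layer
probabilities = dense(hidden2, w3, b3, softmax)  # output layer

print(probabilities.shape)
```

Each row of `probabilities` sums to 1, which is what lets the two outputs be read as the probabilities of "will not convert" and "will convert".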


## Choose the optimizer and the loss function

In [10]:

# we define the optimizer ('adam'),
# the loss function ('sparse_categorical_crossentropy'),
# and the metrics ('accuracy') we want reported for each epoch
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
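As a rough illustration of what the sparse categorical cross-entropy loss measures, the sketch below computes it in NumPy from integer targets and predicted probabilities. The function name, the clipping constant, and the example values are assumptions; Keras's own implementation may differ in detail.

```python
import numpy as np


def sparse_cce(targets, probabilities, eps=1e-7):
    """Mean negative log-probability assigned to the correct class."""
    clipped = np.clip(probabilities, eps, 1.0)
    # pick, for each row, the probability of its true class index
    picked = clipped[np.arange(len(targets)), targets]
    return float(np.mean(-np.log(picked)))


# made-up predictions for three customers
example_probs = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.5, 0.5]])
example_targets = np.array([0, 1, 1])  # integer class indices, not one-hot

loss = sparse_cce(example_targets, example_probs)
print(round(loss, 4))
```

Because the targets are used directly as indices, they must be integers; this is why the targets were cast with `astype(int)` when loading the data.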

# 4. Training

In [11]:

# set the batch size: the weights are updated after each mini-batch rather than after the full dataset
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism (in order to prevent overfitting); by default it monitors val_loss
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
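The patience logic can be sketched in plain Python. This is a simplified illustration of the idea, not Keras's actual implementation, and the validation losses are made up.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# made-up validation losses
made_up_losses = [0.47, 0.41, 0.38, 0.39, 0.37, 0.38, 0.375, 0.36]

print(stopping_epoch(made_up_losses, patience=2))
print(stopping_epoch(made_up_losses, patience=1))
```

With `patience=1` the single uptick at epoch 4 would already end training, whereas `patience=2` tolerates it and only stops after two consecutive epochs without a new best loss.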

# 5. Fitting the model

In [13]:
# fit the model
# the train and validation data are plain NumPy arrays (not batched iterables), so batch_size is passed to fit() directly
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # maximum number of epochs (fewer if early stopping kicks in)
          # callbacks are objects that Keras calls at set points during training, such as the end of each epoch
          # here the callback checks whether val_loss has stopped improving
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose=2)  # print one line of metrics per epoch

Epoch 1/100
36/36 - 2s - loss: 0.5935 - accuracy: 0.6544 - val_loss: 0.4749 - val_accuracy: 0.7875 - 2s/epoch - 45ms/step
Epoch 2/100
36/36 - 0s - loss: 0.4534 - accuracy: 0.7768 - val_loss: 0.4086 - val_accuracy: 0.7875 - 94ms/epoch - 3ms/step
Epoch 3/100
36/36 - 0s - loss: 0.4042 - accuracy: 0.7932 - val_loss: 0.3790 - val_accuracy: 0.8009 - 94ms/epoch - 3ms/step
Epoch 4/100
36/36 - 0s - loss: 0.3822 - accuracy: 0.7958 - val_loss: 0.3660 - val_accuracy: 0.8166 - 94ms/epoch - 3ms/step
Epoch 5/100
36/36 - 0s - loss: 0.3684 - accuracy: 0.8005 - val_loss: 0.3523 - val_accuracy: 0.7964 - 94ms/epoch - 3ms/step
Epoch 6/100
36/36 - 0s - loss: 0.3604 - accuracy: 0.7985 - val_loss: 0.3428 - val_accuracy: 0.8367 - 110ms/epoch - 3ms/step
Epoch 7/100
36/36 - 0s - loss: 0.3526 - accuracy: 0.8145 - val_loss: 0.3386 - val_accuracy: 0.8389 - 94ms/epoch - 3ms/step
Epoch 8/100
36/36 - 0s - loss: 0.3470 - accuracy: 0.8148 - val_loss: 0.3326 - val_accuracy: 0.8098 - 110ms/epoch - 3ms/step
Epoch 9/100
36/

<keras.src.callbacks.History at 0x1e2c584ebe0>
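The `36/36` in the log is the number of mini-batches per epoch. Assuming the final, partially filled batch is also counted, the step count is the sample count divided by the batch size, rounded up. The sketch below works backwards from the observed 36 steps to the range of training-set sizes consistent with it; the exact size is not stated in the notebook.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    # the last batch may be smaller than batch_size, hence the ceiling
    return math.ceil(n_samples / batch_size)


batch_size = 100
observed_steps = 36

# range of training-set sizes that would give 36 steps
smallest = (observed_steps - 1) * batch_size + 1
largest = observed_steps * batch_size

print(smallest, largest)
```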

# 6. Testing the model

After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

In [17]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [18]:
print('\nTest loss: {0:.2f}. \nTest accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.36. 
Test accuracy: 80.80%
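Accuracy alone does not show which kind of error the model makes. As a sketch, class predictions can be derived from softmax outputs (such as those returned by `model.predict(test_inputs)`) and tallied into a confusion matrix. The probabilities and targets below are made up for illustration and are not results from this model.

```python
import numpy as np

# made-up softmax outputs for five customers
example_probs = np.array([[0.80, 0.20],
                          [0.30, 0.70],
                          [0.60, 0.40],
                          [0.10, 0.90],
                          [0.55, 0.45]])
true_targets = np.array([0, 1, 1, 1, 0])

# the predicted class is the one with the highest probability
predicted = example_probs.argmax(axis=1)
accuracy = float((predicted == true_targets).mean())

# rows = true class, columns = predicted class
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (true_targets, predicted), 1)

print(accuracy)
print(confusion)
```

In this toy case the one mistake is a customer who did convert but was predicted not to, which is the kind of error that would cause a likely buyer to be skipped by advertising.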


We obtained a test accuracy of about 81% in this run. Each time the code is rerun, the accuracy differs slightly because each training run is different.
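Run-to-run variation of this kind typically comes from randomness in training, such as the initial weights and the order in which samples are shuffled. The NumPy sketch below illustrates the principle with a hypothetical `init_weights` helper: fixing the seed makes the "initial weights" repeatable. In recent TensorFlow versions, `tf.keras.utils.set_random_seed` plays a similar role, although some operations may still be non-deterministic.

```python
import numpy as np


def init_weights(seed=None):
    """Draw illustrative initial weights for a 10 -> 50 layer."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-0.1, 0.1, size=(10, 50))


# without a seed, every call starts from different weights
unseeded_a = init_weights()
unseeded_b = init_weights()

# with a fixed seed, the starting point is identical
seeded_a = init_weights(seed=42)
seeded_b = init_weights(seed=42)

print(np.array_equal(unseeded_a, unseeded_b))
print(np.array_equal(seeded_a, seeded_b))
```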
