In [1]:
# I have data from an Audiobook app. Naturally, it relates only to the audio versions of books.
# Every customer in the database has made at least one purchase; that is why they are in the database.
# I want to build a machine learning model on the available data that predicts whether
# a customer will buy again from the Audiobook company.
# The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on
# advertising to them.
# If I can focus my efforts ONLY on customers who are likely to convert again, the company can make great savings.
# Moreover, this model can help identify the most important metrics for a customer to come back again.
# Identifying new customers creates value and growth opportunities.
# I have a .csv summarizing the data.
# There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in
# minutes_sum (sum of all purchases), Price paid_avg (average of all purchases), Price paid_sum (sum of all purchases),
# Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests
# (number), and Last visited minus purchase date (in days). These are the inputs, excluding Customer ID, which is
# completely arbitrary (it is more like a name than a number). That leaves 10 inputs.
# The target is a Boolean variable (0 or 1).
# I am taking a period of 2 years for the inputs, and the following 6 months for the targets.
# So, in fact, I am predicting whether, based on the last 2 years of activity and engagement, a customer will convert in
# the next 6 months. 6 months sounds like a reasonable time.
# If they don't convert within 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of
# digesting information.
# The task is: create a machine learning algorithm that is able to predict whether a customer will buy again.
# This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

In [2]:
# Create the machine learning algorithm

In [3]:
# Import the relevant libraries

In [4]:
import numpy as np
import tensorflow as tf

In [5]:
# let's create a temporary variable npz, where we will store each of the three Audiobooks datasets

In [6]:
npz = np.load('Audiobooks_data_train.npz')
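
# A minimal sketch of what np.load returns for an .npz archive, assuming the files were written with
# np.savez under the keys 'inputs' and 'targets' (the keys used in this notebook). The file name
# demo_audiobooks.npz and all values below are made up for illustration; they are not the real data.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one of the Audiobooks .npz files (made-up values)
rng = np.random.default_rng(0)
demo_inputs = rng.normal(size=(5, 10))      # 5 customers, 10 input features
demo_targets = rng.integers(0, 2, size=5)   # 0 = won't buy, 1 = will buy

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_audiobooks.npz')
    # np.savez stores each keyword argument as a named array in the archive
    np.savez(demo_path, inputs=demo_inputs, targets=demo_targets)

    # np.load returns an NpzFile; each array is read when it is indexed by name
    with np.load(demo_path) as demo_npz:
        saved_keys = sorted(demo_npz.files)
        loaded_inputs = demo_npz['inputs']
        loaded_targets = demo_npz['targets']

print(saved_keys)
print(loaded_inputs.shape, loaded_targets.shape)
```

# Listing the keys and shapes like this is a quick way to confirm that an archive holds what the
# rest of the notebook expects before training on it.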

In [7]:
# I extract the inputs using the key under which I saved them
# and cast them to float to make sure they are all floats

In [8]:
train_inputs = npz['inputs'].astype(np.float64)


In [9]:
# targets must be integers because sparse_categorical_crossentropy expects integer class labels (0 or 1), so we never have to one-hot encode them ourselves
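
# A small NumPy sketch of why integer targets are enough: indexing the predicted probability of the
# true class gives the same loss as one-hot encoding the targets first. The probabilities and labels
# below are made up, and this is only an illustration of the formula, not Keras' actual implementation.

```python
import numpy as np

# Hypothetical softmax outputs for 3 customers over the 2 classes
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
targets = np.array([0, 1, 1])  # integer class labels

# Sparse form: pick the predicted probability of the true class by index
sparse_loss = -np.log(probs[np.arange(len(targets)), targets]).mean()

# Dense form: one-hot encode the targets, then sum over the classes
one_hot = np.eye(2)[targets]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(sparse_loss, dense_loss)
```

# Both numbers agree, which is why the targets only need to be cast to an integer type.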

In [10]:
train_targets = npz['targets'].astype(np.int64)


In [11]:
# we load the validation data in the temporary variable

In [12]:
npz = np.load('Audiobooks_data_validation.npz')

In [13]:
# we can load the inputs and the targets in the same line

In [14]:
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)


In [15]:
# we load the test data in the temporary variable

In [16]:
npz = np.load('Audiobooks_data_test.npz')

In [17]:
# we create 2 variables that will contain the test inputs and the test targets

In [18]:
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)


In [19]:
# Model the data

In [20]:
# Outline, optimizers, loss, early stopping and training

In [21]:
# Set the input and output sizes

In [22]:
input_size = 10
output_size = 2

In [23]:
# Use same hidden layer size for both hidden layers. Not a necessity.

In [24]:
hidden_layer_size = 50

In [25]:
# define what the model will look like

In [26]:
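model = tf.keras.Sequential([
    # two hidden layers of equal width with ReLU activations
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    # output layer: softmax over the two classes (won't buy / will buy)
    tf.keras.layers.Dense(output_size, activation='softmax')
])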
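
# To make the architecture concrete, here is a rough NumPy sketch of the forward pass this model
# computes: 10 inputs -> 50 ReLU units -> 50 ReLU units -> 2 softmax outputs. The weights are random
# stand-ins (not trained values) and the helper names relu and softmax are my own for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
input_size, hidden_layer_size, output_size = 10, 50, 2


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # subtract the row maximum for numerical stability
    exp = np.exp(x - x.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# Random weights and zero biases stand in for the trained parameters
w1 = rng.normal(scale=0.1, size=(input_size, hidden_layer_size))
b1 = np.zeros(hidden_layer_size)
w2 = rng.normal(scale=0.1, size=(hidden_layer_size, hidden_layer_size))
b2 = np.zeros(hidden_layer_size)
w3 = rng.normal(scale=0.1, size=(hidden_layer_size, output_size))
b3 = np.zeros(output_size)

batch = rng.normal(size=(4, input_size))  # 4 made-up customers

hidden_1 = relu(batch @ w1 + b1)
hidden_2 = relu(hidden_1 @ w2 + b2)
demo_probs = softmax(hidden_2 @ w3 + b3)   # one row of class probabilities per customer

# (10*50 + 50) + (50*50 + 50) + (50*2 + 2) trainable parameters
n_params = sum(p.size for p in (w1, b1, w2, b2, w3, b3))
print(demo_probs.shape, n_params)
```

# Each row of demo_probs sums to 1, so the second column can be read as the predicted probability
# that the customer will buy again.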


In [27]:
### Choose the optimizer and the loss function

# we define the optimizer we'd like to use, the loss function, and the metrics we are interested in obtaining at each iteration


In [28]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [29]:
### Training the model
# This is where I train the model I have built.
# set the batch size

In [30]:
batch_size = 100

In [31]:
# set a maximum number of training epochs

In [32]:
max_epochs = 100

In [33]:
# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases

In [34]:
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
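
# A simplified sketch of what patience=2 means, assuming the callback monitors the validation loss
# (the Keras default). The function early_stopping_epoch is hypothetical and ignores options such as
# min_delta and restore_best_weights; the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# the loss improves for 3 epochs, then fails to improve twice in a row
stop_epoch = early_stopping_epoch([0.56, 0.46, 0.42, 0.43, 0.44, 0.40])
# a single bad epoch is tolerated when the loss recovers right after
no_stop = early_stopping_epoch([0.56, 0.46, 0.47, 0.45, 0.44])
print(stop_epoch, no_stop)
```

# With patience=2, one noisy epoch does not end training; two consecutive epochs without a new best
# validation loss do.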

In [35]:
# fit the model
# note that this time the train, validation and test data are plain NumPy arrays, not iterable datasets, so the batch size is passed to fit() explicitly

In [36]:
model.fit(train_inputs, 
          train_targets, 
          batch_size=batch_size, 
          epochs=max_epochs, 
          callbacks=[early_stopping], 
          validation_data=(validation_inputs, validation_targets), 
          verbose=2
          )

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 1s - loss: 0.6689 - accuracy: 0.6208 - val_loss: 0.5597 - val_accuracy: 0.7136
Epoch 2/100
3579/3579 - 0s - loss: 0.5132 - accuracy: 0.7502 - val_loss: 0.4618 - val_accuracy: 0.7427
Epoch 3/100
3579/3579 - 0s - loss: 0.4471 - accuracy: 0.7659 - val_loss: 0.4200 - val_accuracy: 0.7740
Epoch 4/100
3579/3579 - 0s - loss: 0.4161 - accuracy: 0.7860 - val_loss: 0.3931 - val_accuracy: 0.7718
Epoch 5/100
3579/3579 - 0s - loss: 0.3998 - accuracy: 0.7910 - val_loss: 0.3786 - val_accuracy: 0.7897
Epoch 6/100
3579/3579 - 0s - loss: 0.3897 - accuracy: 0.7977 - val_loss: 0.3696 - val_accuracy: 0.8009
Epoch 7/100
3579/3579 - 0s - loss: 0.3818 - accuracy: 0.7938 - val_loss: 0.3665 - val_accuracy: 0.7785
Epoch 8/100
3579/3579 - 0s - loss: 0.3766 - accuracy: 0.7977 - val_loss: 0.3582 - val_accuracy: 0.8009
Epoch 9/100
3579/3579 - 0s - loss: 0.3725 - accuracy: 0.8016 - val_loss: 0.3562 - val_accuracy: 0.8277
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x7f8bf825e410>
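
# A quick piece of arithmetic on the numbers above: with 3579 training samples and batch_size=100,
# each epoch is split into batches as sketched below. The variable names are my own.

```python
import math

n_train = 3579      # training samples reported in the log
batch_size = 100
max_epochs = 100

# number of weight updates per epoch; the last batch is smaller than the rest
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

# upper bound on weight updates if early stopping never triggered
max_updates = steps_per_epoch * max_epochs

print(steps_per_epoch, last_batch_size, max_updates)
```

# So every epoch performs 36 updates (35 full batches and one batch of 79 samples), and early
# stopping usually ends training far below the 3600-update ceiling.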

In [37]:
# Test the model

In [38]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [39]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.38. Test accuracy: 79.24%
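
# For reference, a sketch of how an accuracy figure like the one above is obtained from softmax
# outputs: take the most probable class per customer and compare it with the target. The arrays
# below are made up and do not come from the test set.

```python
import numpy as np

# Hypothetical softmax outputs for 4 customers
demo_probs = np.array([[0.70, 0.30],
                       [0.40, 0.60],
                       [0.20, 0.80],
                       [0.55, 0.45]])
demo_targets = np.array([0, 1, 0, 0])

# predicted class = index of the largest probability in each row
predicted = demo_probs.argmax(axis=1)

# accuracy = share of customers whose predicted class matches the target
demo_accuracy = (predicted == demo_targets).mean()

print('Demo accuracy: {0:.2f}%'.format(demo_accuracy * 100.))
```

# Here 3 of the 4 predictions match, so the demo accuracy is 75%.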
