# Problem
You are given data from an audiobook app. Naturally, it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model, based on the available data, that predicts whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money on advertising to them. If we focus our efforts ONLY on customers who are likely to convert again, we can save a great deal. Moreover, the model can identify the most important metrics for a customer to come back. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs (excluding Customer ID, which is completely arbitrary: it is more like a name than a number).

The targets are a Boolean variable (0 or 1). We take a period of 2 years for the inputs and the following 6 months for the targets. In other words, we predict whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months sounds like a reasonable time frame. If they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

Good luck! 

## Preprocess the data
    Balance the dataset
    Standardize the inputs
    Shuffle the data
    Split the dataset into training, validation, and test sets
    Save the data in a tensor-friendly format (.npz)

In [1]:
import numpy as np
from sklearn import preprocessing 
import tensorflow as tf

raw_data = np.loadtxt('Audiobooks_data.csv',delimiter = ',')
unscaled_inputs = raw_data[:,1:-1]
targets_all = raw_data[:,-1]


In [2]:
# Count the positive class; the 0-class is assumed to be the larger one
count_ones = int(np.sum(targets_all))
count_zero = 0
remove_indices = []

# Keep only as many 0-targets as there are 1-targets; mark surplus 0s for removal
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        count_zero += 1
        if count_zero > count_ones:
            remove_indices.append(i)

unscaled_input = np.delete(unscaled_inputs, remove_indices, axis=0)
target_all = np.delete(targets_all, remove_indices, axis=0)
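
The balancing step can also be written without an explicit loop. The following is a minimal sketch, assuming a binary target array; `balance_binary` and the toy arrays are hypothetical names introduced only for illustration. It keeps every 1-row and the first equally many 0-rows, preserving the original row order.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Keep all 1-rows and the first len(ones) 0-rows, in original order."""
    inputs = np.asarray(inputs)
    targets = np.asarray(targets)
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    # Slicing takes at most as many 0-rows as there are 1-rows
    keep = np.sort(np.concatenate([one_idx, zero_idx[:one_idx.size]]))
    return inputs[keep], targets[keep]

# Toy data (hypothetical): 2 ones and 6 zeros
demo_inputs = np.arange(16).reshape(8, 2)
demo_targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])

balanced_inputs, balanced_targets = balance_binary(demo_inputs, demo_targets)
print(balanced_targets)
```

On the toy data, rows 0, 1, 2 and 4 survive, so both classes end up with two rows each.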

In [3]:
scaled_input = preprocessing.scale(unscaled_input)
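
`preprocessing.scale` centers each column to zero mean and scales it to unit variance. A minimal NumPy sketch of that computation follows; `standardize` and the toy matrix are hypothetical names used only for illustration, and the guard for constant columns is an assumption of this sketch.

```python
import numpy as np

def standardize(x):
    """Column-wise (x - mean) / std, using the population standard deviation."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=0)
    std = x.std(axis=0)      # ddof=0, i.e. population standard deviation
    std[std == 0] = 1.0      # leave constant columns unscaled instead of dividing by zero
    return (x - mean) / std

# Toy matrix (hypothetical): two features on very different scales
demo = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])
demo_scaled = standardize(demo)
print(demo_scaled.mean(axis=0), demo_scaled.std(axis=0))
```

After scaling, both columns live on the same scale, which keeps features such as total minutes from dominating features such as completion.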

In [4]:
num_shuffles = np.arange(scaled_input.shape[0])
np.random.shuffle(num_shuffles)

shuffled_inputs = scaled_input[num_shuffles]
shuffled_targets = target_all[num_shuffles]
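
The shuffle above gives a different order on every run. If reproducible splits are wanted, one option is a seeded generator. This is a sketch under that assumption; the seed value and the toy arrays are arbitrary and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen arbitrarily for illustration

# Toy data (hypothetical): the first column encodes the original row index times 2
demo_inputs = np.arange(10).reshape(5, 2)
demo_targets = np.array([0, 1, 0, 1, 1])

# One permutation indexes both arrays, so inputs and targets stay aligned
perm = rng.permutation(demo_inputs.shape[0])
shuffled_demo_inputs = demo_inputs[perm]
shuffled_demo_targets = demo_targets[perm]
print(perm)
```

Indexing both arrays with the same permutation is what keeps each input row paired with its own target.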

In [5]:
num_samples = shuffled_inputs.shape[0]

train_samples = int(0.8 * num_samples)
validation_samples = int(0.1 * num_samples)
test_samples = num_samples - train_samples - validation_samples

train_inputs = shuffled_inputs[:train_samples]
train_targets = shuffled_targets[:train_samples]

validation_inputs = shuffled_inputs[train_samples:train_samples + validation_samples]
validation_targets = shuffled_targets[train_samples:train_samples + validation_samples]

test_inputs = shuffled_inputs[train_samples + validation_samples:]
test_targets = shuffled_targets[train_samples + validation_samples:]
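
Since the dataset was balanced before the split, each subset should contain roughly 50% positive targets. A quick check could look like the sketch below; `split_report` and the toy targets are hypothetical names used only for illustration.

```python
import numpy as np

def split_report(targets, train_frac=0.8, val_frac=0.1):
    """Return {split name: (row count, share of 1-targets)} for an 80/10/10 style split."""
    targets = np.asarray(targets)
    n = targets.shape[0]
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    parts = {
        'train': targets[:n_train],
        'validation': targets[n_train:n_train + n_val],
        'test': targets[n_train + n_val:],
    }
    return {name: (len(part), float(np.mean(part))) for name, part in parts.items()}

# Toy targets (hypothetical): 20 rows, perfectly balanced
demo_targets = np.array([0, 1] * 10)
report = split_report(demo_targets)
print(report)
```

Applied to `shuffled_targets`, shares far from 0.5 in any subset would suggest reshuffling before training.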

In [6]:
np.savez('Audio_books_data_train',inputs = train_inputs,targets = train_targets)
np.savez('Audio_books_data_validate',inputs = validation_inputs,targets = validation_targets)
np.savez('Audio_books_data_test',inputs = test_inputs,targets = test_targets)
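
Note that `np.savez` appends the `.npz` extension when the file name does not already have it, which is why the files are loaded later with `.npz` in their names. The round trip can be sketched as follows; the temporary directory and the toy arrays are assumptions made only for illustration.

```python
import os
import tempfile
import numpy as np

# Toy arrays (hypothetical): 4 rows with 10 features each
demo_inputs = np.random.default_rng(0).normal(size=(4, 10))
demo_targets = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_split')        # no extension given
    np.savez(path, inputs=demo_inputs, targets=demo_targets)
    saved_files = os.listdir(tmp_dir)                 # the file is stored as demo_split.npz
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(saved_files)
```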

## Create a machine learning algorithm
    Outline the model
    Compile the model
    Fit the model
    Test the model

In [13]:
# np.float and np.int were deprecated in NumPy 1.20 and removed in 1.24,
# so explicit dtypes are used instead
npz = np.load('Audio_books_data_train.npz')
tarining_input = npz['inputs'].astype(np.float64)
tarining_target = npz['targets'].astype(np.int64)

npz = np.load('Audio_books_data_validate.npz')
validate_input = npz['inputs'].astype(np.float64)
validate_target = npz['targets'].astype(np.int64)

npz = np.load('Audio_books_data_test.npz')
test_input = npz['inputs'].astype(np.float64)
test_target = npz['targets'].astype(np.int64)

In [23]:
input_size = 10
output_size = 2
hidden_layers_size = 50

model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_layers_size,activation = 'relu'),
        tf.keras.layers.Dense(hidden_layers_size,activation = 'relu'),
    
        tf.keras.layers.Dense(output_size,activation = 'softmax')
])

In [24]:
model.compile(optimizer = 'adam',loss = 'sparse_categorical_crossentropy',metrics = ['accuracy'])
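
The `sparse_categorical_crossentropy` loss takes integer class labels (0 or 1 here) rather than one-hot vectors, and averages the negative log of the probability assigned to the true class. A NumPy sketch of that idea follows; the function below is a hypothetical re-implementation for illustration, not the Keras code, and the clipping constant is an assumption.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels, eps=1e-7):
    """Mean negative log-probability of the true class; labels are integer class ids."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0)
    labels = np.asarray(labels)
    picked = probs[np.arange(labels.shape[0]), labels]   # probability of the true class per row
    return float(-np.mean(np.log(picked)))

# Toy softmax outputs (hypothetical) for three customers and their true labels
demo_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.5, 0.5]])
demo_labels = np.array([0, 1, 1])
demo_loss = sparse_categorical_crossentropy(demo_probs, demo_labels)
print(demo_loss)
```

Confident correct predictions contribute values near zero, while the undecided third row contributes `-log(0.5)`.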

In [25]:
early_stop = tf.keras.callbacks.EarlyStopping(patience=2)
model.fit(tarining_input,
          tarining_target, 
          batch_size=100, 
          epochs=100,
          callbacks=[early_stop], 
          validation_data=(validate_input, validate_target), 
          verbose = 2 
          )

Epoch 1/100
36/36 - 1s - loss: 0.5676 - accuracy: 0.7631 - val_loss: 0.4027 - val_accuracy: 0.8881 - 732ms/epoch - 20ms/step
Epoch 2/100
36/36 - 0s - loss: 0.3734 - accuracy: 0.8726 - val_loss: 0.2933 - val_accuracy: 0.9038 - 69ms/epoch - 2ms/step
Epoch 3/100
36/36 - 0s - loss: 0.3235 - accuracy: 0.8785 - val_loss: 0.2743 - val_accuracy: 0.9060 - 68ms/epoch - 2ms/step
Epoch 4/100
36/36 - 0s - loss: 0.3042 - accuracy: 0.8866 - val_loss: 0.2603 - val_accuracy: 0.9128 - 68ms/epoch - 2ms/step
Epoch 5/100
36/36 - 0s - loss: 0.2913 - accuracy: 0.8908 - val_loss: 0.2511 - val_accuracy: 0.9172 - 67ms/epoch - 2ms/step
Epoch 6/100
36/36 - 0s - loss: 0.2815 - accuracy: 0.8941 - val_loss: 0.2450 - val_accuracy: 0.9128 - 66ms/epoch - 2ms/step
Epoch 7/100
36/36 - 0s - loss: 0.2740 - accuracy: 0.8966 - val_loss: 0.2378 - val_accuracy: 0.9150 - 66ms/epoch - 2ms/step
Epoch 8/100
36/36 - 0s - loss: 0.2660 - accuracy: 0.8963 - val_loss: 0.2387 - val_accuracy: 0.9128 - 66ms/epoch - 2ms/step
Epoch 9/100
36

<keras.callbacks.History at 0x7f2ddc4789a0>
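
`EarlyStopping(patience=2)` monitors the validation loss by default and stops training once it has failed to improve for two consecutive epochs. The rule can be sketched in plain Python; `early_stop_epoch` is a hypothetical helper written for illustration and is not the Keras implementation.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss          # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1            # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Validation losses of the first 8 epochs from the log above
logged = [0.4027, 0.2933, 0.2743, 0.2603, 0.2511, 0.2450, 0.2378, 0.2387]
# Made-up losses (hypothetical) that stall after epoch 2
stalled = [0.40, 0.30, 0.31, 0.32, 0.29]

print(early_stop_epoch(logged), early_stop_epoch(stalled))
```

In the logged run, epoch 8 is the first epoch without improvement, so after 8 epochs the counter stands at 1 and training continues.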

In [26]:
model.evaluate(test_input,test_target)



[0.24687887728214264, 0.9017857313156128]
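
`model.evaluate` returns the test loss followed by the test accuracy, so the model classifies roughly 90% of the test customers correctly. Accuracy alone does not show which class the errors fall in. A sketch of a per-class breakdown follows; the helper and the toy probabilities are hypothetical and used only for illustration.

```python
import numpy as np

def accuracy_and_confusion(probs, labels):
    """Accuracy plus a 2x2 confusion matrix (rows: true class, columns: predicted class)."""
    labels = np.asarray(labels)
    preds = np.argmax(probs, axis=1)            # pick the class with the highest probability
    accuracy = float(np.mean(preds == labels))
    confusion = np.zeros((2, 2), dtype=int)
    np.add.at(confusion, (labels, preds), 1)    # count each (true, predicted) pair
    return accuracy, confusion

# Toy softmax outputs (hypothetical) and true labels
demo_probs = np.array([[0.8, 0.2],
                       [0.3, 0.7],
                       [0.6, 0.4],
                       [0.1, 0.9]])
demo_labels = np.array([0, 1, 1, 1])
demo_accuracy, demo_confusion = accuracy_and_confusion(demo_probs, demo_labels)
print(demo_accuracy)
print(demo_confusion)
```

With the notebook's model, the probabilities would come from `model.predict(test_input)` and the labels from `test_target`.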