This exercise uses data from an audiobook company. The dataset has 12 columns: the first is the customer ID and the last is the target (0 or 1), which shows whether the customer bought again 6 months after their first purchase. That leaves 10 predictors (inputs) and about 14,000 rows of data.

Here we preprocess the data, split it into training, validation, and test sets, and feed it into the model. The goal is to predict whether a customer will buy again, so that ads can be targeted at customers who are likely to become regular buyers.


In [3]:
import numpy as np
from sklearn import preprocessing
import tensorflow as tf

# Data Preprocessing

In [4]:
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter = ',')

In [5]:
unscaled_inputs_all = raw_csv_data[:,1:-1] # Keep all rows; drop column 0 (customer ID) and the last column (targets)
targets_all = raw_csv_data[:,-1] # Targets are in the last column

## Balancing the dataset

In [6]:
#Our targets are either 0 or 1
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0

#Percentage of all targets that are 0
x =((targets_all.shape[0] - num_one_targets)/targets_all.shape[0])*100
x1 = np.round(x,2)
print(x1,'Data targets are incredibly uneven (84%-0, 16%-1)')

84.12 Data targets are incredibly uneven (84%-0, 16%-1)


In [27]:
print(targets_all.shape[0])

14084


In [7]:
#We want a balanced dataset (as many 0s as 1s) for better training results
#Therefore we have to remove some input-target pairs whose target is 0

indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)
            
#"Equal priors" means both classes are equally represented in the data
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis = 0)
#np.delete returns a copy of the array without the rows listed in indices_to_remove

In [39]:
print('Amount of rows deleted:',targets_all.shape[0]-unscaled_inputs_equal_priors.shape[0])

Amount of rows deleted: 9610
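The loop above keeps every row whose target is 1 and only the first `num_one_targets` rows whose target is 0, in file order. As a rough sketch, the same selection can be written without a Python loop. The helper name `balance_by_undersampling` and the small demo arrays are made up for illustration, and the sketch assumes there are at least as many 0s as 1s.

```python
import numpy as np

def balance_by_undersampling(inputs, targets):
    """Keep all rows with target 1 and only the first N rows with target 0,
    where N is the number of 1s (hypothetical helper, not part of the notebook)."""
    targets = np.asarray(targets)
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    # Keep the first len(one_idx) zeros, mirroring the counter-based loop
    keep = np.sort(np.concatenate([one_idx, zero_idx[:len(one_idx)]]))
    return inputs[keep], targets[keep]

# Tiny made-up example: 2 ones and 5 zeros
demo_inputs = np.arange(14).reshape(7, 2)
demo_targets = np.array([0, 1, 0, 0, 1, 0, 0])
bal_inputs, bal_targets = balance_by_undersampling(demo_inputs, demo_targets)
print(bal_targets)
```

Because the kept zeros are simply the earliest ones in the file, and the data is ordered by date, a random choice of zeros might be worth considering as an alternative.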


## Inputs standardization

In [8]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
print(scaled_inputs.shape[0],
scaled_inputs.shape[1],
     '.10 are the inputs of the hidden layer (10 predictors)')

4474 10 .10 are the inputs of the hidden layer (10 predictors)
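`preprocessing.scale` standardizes each column to zero mean and unit variance. In this notebook it runs on all 4474 rows before the split, so the validation and test rows contribute to the means and standard deviations. A possible alternative, sketched here in plain NumPy, is to compute the statistics on the training rows only and reuse them for the other sets. The helper name `standardize_with_train_stats` and the random demo data are assumptions for illustration.

```python
import numpy as np

def standardize_with_train_stats(train, *others):
    """Standardize every array using the column mean/std of `train` only
    (hypothetical helper, not part of the notebook)."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)          # population std (ddof=0)
    std[std == 0] = 1.0              # avoid dividing by zero on constant columns
    return [(x - mean) / std for x in (train, *others)]

# Made-up data standing in for 10 predictors
rng = np.random.default_rng(0)
demo = rng.normal(loc=5.0, scale=3.0, size=(100, 10))
demo_train, demo_val = demo[:80], demo[80:]
scaled_train, scaled_val = standardize_with_train_stats(demo_train, demo_val)
```

With this approach the training columns have exactly zero mean and unit variance, while the validation columns are only approximately standardized, which is expected.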


## Data Shuffle

In [9]:
# When the data was collected it was actually arranged by date
# We want the data to be as randomly spread as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
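The shuffle above gives a different order on every run, so the split sizes stay the same but the rows in each set (and the class proportions printed later) change. A minimal sketch of a reproducible shuffle, assuming NumPy's `Generator` API and an arbitrary seed value; the demo arrays are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# Made-up data: row i is [2*i, 2*i + 1] and its target is i % 2
demo_inputs = np.arange(12).reshape(6, 2)
demo_targets = np.array([0, 1, 0, 1, 0, 1])

# One permutation indexes both arrays, so input-target pairs stay aligned
perm = rng.permutation(demo_inputs.shape[0])
shuffled_demo_inputs = demo_inputs[perm]
shuffled_demo_targets = demo_targets[perm]
```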

## Split dataset into training, validation and test

In [50]:
samples_count = shuffled_inputs.shape[0]

# Train, validation and test data should be split as 80-10-10
train_samples_count = int(samples_count*0.8)
validation_samples_count = int(samples_count*0.1)
test_samples_count = samples_count -(train_samples_count + validation_samples_count)
print(train_samples_count,validation_samples_count,test_samples_count)

# Create variables that record the inputs and targets for training
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]
print(train_inputs.shape[0],train_inputs.shape[1])

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_samples_count]
print(validation_inputs.shape[0],validation_inputs.shape[1])

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]
print(test_inputs.shape[0],test_inputs.shape[1])

3579 447 448
3579 10
447 10
448 10
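The three slices above can also be produced in one step with `np.split`, which cuts an array at the given row indices. This is only a sketch: `split_80_10_10` is a hypothetical helper, and the zero-filled arrays stand in for the shuffled data.

```python
import numpy as np

def split_80_10_10(inputs, targets):
    """Return [(train_x, train_y), (val_x, val_y), (test_x, test_y)]
    (hypothetical helper, not part of the notebook)."""
    n = inputs.shape[0]
    train_end = int(n * 0.8)
    val_end = train_end + int(n * 0.1)
    # np.split cuts at the given indices; the remainder goes to the test set
    input_parts = np.split(inputs, [train_end, val_end])
    target_parts = np.split(targets, [train_end, val_end])
    return list(zip(input_parts, target_parts))

# Placeholder arrays with the same shape as the balanced dataset
demo_inputs = np.zeros((4474, 10))
demo_targets = np.zeros(4474)
(train_x, train_y), (val_x, val_y), (test_x, test_y) = split_80_10_10(demo_inputs, demo_targets)
print(train_x.shape[0], val_x.shape[0], test_x.shape[0])
```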


#### Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.

In [11]:
print(np.sum(train_targets),
      train_samples_count, 
      np.sum(train_targets) / train_samples_count)

1788.0 3579 0.49958088851634536


In [12]:
print(np.sum(validation_targets),
      validation_samples_count, 
      np.sum(validation_targets) / validation_samples_count)

218.0 447 0.48769574944071586


In [13]:
print(np.sum(test_targets),
      test_samples_count,
      np.sum(test_targets) / test_samples_count)

231.0 448 0.515625


## Model

In [14]:
input_size = 10
output_size = 2
hidden_layer_size = 50

model = tf.keras.Sequential([
    
    tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
    tf.keras.layers.Dense(output_size, activation = 'softmax'),
    
])

model.compile(optimizer = 'adam', #another option is the classic sgd
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

batch_size = 100
max_epoch = 100
early_stopping = tf.keras.callbacks.EarlyStopping(patience = 2)

model.fit(train_inputs,
         train_targets,
         batch_size = batch_size,
         epochs = max_epoch,
         callbacks = [early_stopping],
         validation_data = (validation_inputs,validation_targets),
         verbose = 2)

Epoch 1/100
36/36 - 1s - loss: 0.6229 - accuracy: 0.6449 - val_loss: 0.5082 - val_accuracy: 0.7450
Epoch 2/100
36/36 - 0s - loss: 0.4705 - accuracy: 0.7759 - val_loss: 0.4317 - val_accuracy: 0.7673
Epoch 3/100
36/36 - 0s - loss: 0.4195 - accuracy: 0.7784 - val_loss: 0.4020 - val_accuracy: 0.7830
Epoch 4/100
36/36 - 0s - loss: 0.3957 - accuracy: 0.7835 - val_loss: 0.3901 - val_accuracy: 0.7919
Epoch 5/100
36/36 - 0s - loss: 0.3792 - accuracy: 0.8002 - val_loss: 0.3736 - val_accuracy: 0.7897
Epoch 6/100
36/36 - 0s - loss: 0.3690 - accuracy: 0.7963 - val_loss: 0.3728 - val_accuracy: 0.7919
Epoch 7/100
36/36 - 0s - loss: 0.3615 - accuracy: 0.8069 - val_loss: 0.3630 - val_accuracy: 0.8121
Epoch 8/100
36/36 - 0s - loss: 0.3580 - accuracy: 0.8008 - val_loss: 0.3535 - val_accuracy: 0.7987
Epoch 9/100
36/36 - 0s - loss: 0.3507 - accuracy: 0.8075 - val_loss: 0.3528 - val_accuracy: 0.8143
Epoch 10/100
36/36 - 0s - loss: 0.3464 - accuracy: 0.8120 - val_loss: 0.3509 - val_accuracy: 0.8188
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x1a40b9d150>
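The output layer has two softmax units and the loss is `sparse_categorical_crossentropy`, which expects integer class labels (0 or 1) rather than one-hot vectors. The NumPy sketch below shows the arithmetic for one small batch. It is an illustration of the formula, not Keras's implementation, and the logits and labels are made up.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each row."""
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

# Made-up pre-softmax outputs for three customers and their true classes
demo_logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]])
demo_labels = np.array([0, 1, 1])
demo_probs = softmax(demo_logits)
demo_loss = sparse_categorical_crossentropy(demo_probs, demo_labels)
```

A model that assigns probability 0.5 to both classes for every customer scores ln 2 ≈ 0.693, which gives a reference point for reading the loss values in the training log.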

## Model Test

In [15]:
test_loss,test_accuracy = model.evaluate(test_inputs,test_targets)



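An accuracy of around 80% leaves room for improvement, and any tuning should be done against the validation set before the test set is used. Note what the figure means: on the balanced test set, about 80 out of 100 customers are classified correctly. It does not mean that 80 out of 100 targeted customers will respond to the ad campaign. That share is the precision of the "will buy again" class, and on the original data, where only about 16% of customers buy again, it may well be lower.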
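Accuracy counts correct predictions for both classes, whereas the share of targeted customers who actually buy again is the precision of class 1. A small sketch of how the two can be computed side by side; `binary_classification_report` is a hypothetical helper and the labels are made up, not results from this model.

```python
import numpy as np

def binary_classification_report(true_labels, predicted_labels):
    """Accuracy, precision, and recall for class 1 (hypothetical helper)."""
    t = np.asarray(true_labels).astype(int)
    p = np.asarray(predicted_labels).astype(int)
    tp = int(np.sum((t == 1) & (p == 1)))  # targeted and bought again
    tn = int(np.sum((t == 0) & (p == 0)))  # not targeted and did not buy
    fp = int(np.sum((t == 0) & (p == 1)))  # targeted but did not buy
    fn = int(np.sum((t == 1) & (p == 0)))  # missed repeat buyers
    return {
        "accuracy": (tp + tn) / len(t),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels for ten customers
demo_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
demo_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])
report = binary_classification_report(demo_true, demo_pred)
print(report)
```

For the trained model, the predicted labels could be obtained with `np.argmax(model.predict(test_inputs), axis=1)` and compared against `test_targets`.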