## Problem
You are given data from an Audiobook app. Logically, it relates only to the audio versions of books. Each customer in the database has made a purchase at least once, that's why he/she is in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertizing to him/her. If we can focus our efforts ONLY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

So these are the inputs (excluding customer ID, as it is completely arbitrary. It's more like a name, than a number).

The targets are a Boolean variable (so 0, or 1). We are taking a period of 2 years in our inputs, and the next 6 months as targets. So, in fact, we are predicting if: based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

In [1]:
import numpy as np
import pandas as pd

## Input data

In [2]:
raw_data=np.loadtxt('Audiobooks_data.csv',delimiter=',')

### Balancing the data

In [3]:
raw_input=raw_data[:,1:11]
raw_target=raw_data[:,11:]

num_one_target=int(np.sum(raw_target))

zero_target_counts=0
num_index=[]

for i in range(raw_target.shape[0]):
    if raw_target[i]==0:
        zero_target_counts +=1
        if zero_target_counts>num_one_target:
            num_index.append(i)
            
raw_balanced_input=np.delete(raw_input,num_index,axis=0)
raw_balanced_target=np.delete(raw_target,num_index,axis=0)

### Standatdizing the data

In [4]:
from sklearn import preprocessing
raw_standard_data=preprocessing.scale(raw_balanced_input)


### Shuffling the data

In [49]:
shuffle_indices=np.arange(raw_balanced_input.shape[0])
np.random.shuffle(shuffle_indices)

raw_shuffled_data=raw_balanced_input[shuffle_indices]
raw_shuffled_target=raw_balanced_target[shuffle_indices]

### Making train, validation, test set

In [56]:
samples_count=raw_shuffled_data.shape[0]

train_samples_count=int(0.8*samples_count)
validation_samples_count=int(0.1*samples_count)
test_samples_count=samples_count-train_samples_count-validation_samples_count

train_input=raw_shuffled_data[:train_samples_count].astype(np.float)
train_output=raw_shuffled_target[:train_samples_count].astype(np.int)

validation_input=raw_shuffled_data[train_samples_count:validation_samples_count+train_samples_count].astype(np.float)
validation_output=raw_shuffled_target[train_samples_count:validation_samples_count+train_samples_count].astype(np.int)

test_input=raw_shuffled_data[train_samples_count+validation_samples_count:].astype(np.float)
test_output=raw_shuffled_target[train_samples_count+validation_samples_count:].astype(np.int)

### Build the model

In [57]:
import tensorflow as tf

input_size=10
output_size=2
hidden_layer_size=50
model=tf.keras.models.Sequential([
                                tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
                                tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
                                tf.keras.layers.Dense(output_size,activation='softmax')
])

### Compiling

In [58]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

### Fitting the model

In [59]:
BATCH_SIZE=100
NUM_EPOCHS=100

earlystopping=tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_input,train_output,epochs=NUM_EPOCHS,callbacks=[earlystopping],batch_size=BATCH_SIZE,
          validation_data=(validation_input,validation_output),verbose=2)

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 1s - loss: 51.7352 - accuracy: 0.7388 - val_loss: 15.2920 - val_accuracy: 0.8501
Epoch 2/100
3579/3579 - 0s - loss: 6.0405 - accuracy: 0.8086 - val_loss: 1.0377 - val_accuracy: 0.8054
Epoch 3/100
3579/3579 - 0s - loss: 0.9832 - accuracy: 0.7885 - val_loss: 0.5718 - val_accuracy: 0.8389
Epoch 4/100
3579/3579 - 0s - loss: 0.9878 - accuracy: 0.7594 - val_loss: 1.4204 - val_accuracy: 0.6622
Epoch 5/100
3579/3579 - 0s - loss: 1.2033 - accuracy: 0.7812 - val_loss: 0.7336 - val_accuracy: 0.8591


<tensorflow.python.keras.callbacks.History at 0x1c88211b048>

### Testing the model

In [60]:
test_loss,test_accuracy=model.evaluate(test_input,test_output)



In [61]:
print('test loss: %.2f test accuracy: %.2f'%(test_loss,test_accuracy*100))

test loss: 0.97 test accuracy: 84.38
