# Audiobooks

The data comes from an Audiobook app and describes purchases of audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model from the available data that predicts whether a customer will buy again from the Audiobook company.

The idea is that if a customer has low potential, there is no reason to spend money advertising to them.
If we can focus our efforts only on customers who are likely to convert again, we can make substantial savings.
Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

We have a .csv summarizing the data.
There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days). The last column of the file holds the Targets.

These are the inputs (excluding Customer ID, which is completely arbitrary: it is more like a name than a number).

The targets are a Boolean variable (0 or 1). We take a period of 2 years for the inputs and the following 6 months for the targets. In other words, based on the last 2 years of activity and engagement, we predict whether a customer will convert in the next 6 months. 6 months sounds like a reasonable time: if they don't convert within 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is to create a machine learning algorithm, which can predict if a customer will buy audiobooks from the company again or not.

We will use TensorFlow to build a classification model for this problem.

In this case we have a cleaned Audiobooks dataset with no missing or null values. We will perform supervised learning with TensorFlow on this example. This notebook consists of standard EDA and preprocessing, followed by converting the regular text data and saving it as tensors in order to perform deep learning.

# 1. Import libraries

In [4]:
import numpy as np
from sklearn import preprocessing
import pandas as pd

# 2. Extract data from csv

In [16]:
raw_csv_data = np.loadtxt("Audiobooks_data.csv", delimiter=',')

In [15]:
raw_csv_data

array([[9.9400e+02, 1.6200e+03, 1.6200e+03, ..., 5.0000e+00, 9.2000e+01,
        0.0000e+00],
       [1.1430e+03, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [2.0590e+03, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 3.8800e+02,
        0.0000e+00],
       ...,
       [3.1134e+04, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [3.2832e+04, 1.6200e+03, 1.6200e+03, ..., 0.0000e+00, 9.0000e+01,
        0.0000e+00],
       [2.5100e+02, 1.6740e+03, 3.3480e+03, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00]])

In [5]:
unscaled_inputs_all = raw_csv_data[:,1:-1]

targets_all = raw_csv_data[:,-1]
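
The two slices above can be checked on a small stand-in array. This is only a sketch: the array `toy` and its values are hypothetical, and it assumes the same layout as the CSV (ID in the first column, targets in the last).

```python
import numpy as np

# Hypothetical 3-row stand-in for the real CSV: [ID, feature_a, feature_b, target]
toy = np.array([
    [101.0, 1620.0, 5.0, 0.0],
    [102.0, 2160.0, 0.0, 1.0],
    [103.0, 1674.0, 2.0, 0.0],
])

toy_inputs = toy[:, 1:-1]   # drop the first column (ID) and the last one (target)
toy_targets = toy[:, -1]    # keep only the last column

print(toy_inputs.shape)     # (3, 2)
print(toy_targets)          # [0. 1. 0.]
```

The ID column is dropped because it carries no information the model should learn from.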

# 3. Balance the data set

In [6]:
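# Count the targets that are 1
num_one_targets = int(np.sum(targets_all))

# Counter for the targets that are 0
zero_targets_counter = 0

# Row indices to drop so that both classes end up the same size
indices_to_remove = []

# Once as many 0s as 1s have been seen, mark every further 0 for removal
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)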

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
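
The same balancing rule can also be written without an explicit loop. The helper below is a sketch, not part of the original notebook: `balance_indices_to_remove` is a hypothetical name, and the toy arrays are made up for illustration. It keeps the first zeros it meets, as many as there are ones, and returns the indices of the rest.

```python
import numpy as np

def balance_indices_to_remove(targets):
    """Return the indices of surplus zero-target rows.

    The first `number of ones` zeros are kept; every later zero is returned
    for removal, which mirrors the counting loop in the cell above.
    """
    targets = np.asarray(targets)
    num_ones = int(np.sum(targets))
    zero_indices = np.flatnonzero(targets == 0)
    return zero_indices[num_ones:]

# Hypothetical toy data: 2 ones and 5 zeros
toy_targets = np.array([0, 0, 1, 0, 0, 1, 0])
toy_inputs = np.arange(14).reshape(7, 2)

remove = balance_indices_to_remove(toy_targets)
balanced_inputs = np.delete(toy_inputs, remove, axis=0)
balanced_targets = np.delete(toy_targets, remove, axis=0)

print(remove)            # [3 4 6]
print(balanced_targets)  # [0 0 1 1]
```

Note that this rule always discards the zeros that appear latest in the file, so the balanced data is still ordered and needs shuffling before it is split.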



# 4. Standardize the inputs

In [7]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
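
With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below reproduces that arithmetic with plain NumPy on a hypothetical toy array, to show what happens to each feature; it is an illustration, not a replacement for the call above.

```python
import numpy as np

# Hypothetical toy inputs: two features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Standardize column by column: subtract the mean, divide by the standard deviation
column_means = toy.mean(axis=0)
column_stds = toy.std(axis=0)      # population standard deviation (ddof=0)
standardized = (toy - column_means) / column_stds

print(standardized.mean(axis=0).round(6))  # approximately 0 for every column
print(standardized.std(axis=0).round(6))   # 1 for every column
```

After this step every feature contributes on a comparable scale, regardless of whether it was measured in minutes, currency or days.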

## Shuffle the data
Shuffling is done in order to randomize which samples end up in each dataset.

In [8]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
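
If a reproducible shuffle is wanted, one option is a seeded NumPy generator. This is a sketch under assumptions: the seed `42` and the toy arrays are hypothetical, and the notebook itself uses the unseeded `np.random.shuffle`. The key point is the same in both versions: one permutation indexes both arrays, so each input row stays paired with its target.

```python
import numpy as np

rng = np.random.default_rng(42)  # 42 is an arbitrary, hypothetical seed

# Toy data in which row i is [2*i, 2*i + 1] and its target is i
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.arange(5)

# One permutation, applied to inputs and targets alike
perm = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]

# Every input row still sits next to its own target
print(np.array_equal(shuffled_toy_inputs[:, 0] // 2, shuffled_toy_targets))  # True
```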

# 5. Split - Train, Test & Validation
Splitting the data into train, validation and test datasets.

In [9]:
samples_count = shuffled_inputs.shape[0]

# Manually splitting data between Train, Test & Validation
train_samples_count = int(0.8* samples_count)
validation_samples_count = int(0.1* samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count


train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]



validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]


test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]
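
The manual slicing above can be wrapped in a small function. This is a sketch: `split_train_val_test` is a hypothetical helper, not part of the notebook, and the toy arrays only borrow the sample count (4474) implied by the counts printed below.

```python
import numpy as np

def split_train_val_test(inputs, targets, train_frac=0.8, val_frac=0.1):
    """Slice already-shuffled arrays into train, validation and test parts.

    The test part receives whatever remains after the first two slices.
    """
    n = inputs.shape[0]
    train_end = int(train_frac * n)
    val_end = train_end + int(val_frac * n)
    return (
        (inputs[:train_end], targets[:train_end]),
        (inputs[train_end:val_end], targets[train_end:val_end]),
        (inputs[val_end:], targets[val_end:]),
    )

# Hypothetical toy data with the same number of rows as the balanced dataset
toy_inputs = np.arange(4474 * 2).reshape(4474, 2)
toy_targets = np.arange(4474) % 2

train, validation, test = split_train_val_test(toy_inputs, toy_targets)
print(len(train[1]), len(validation[1]), len(test[1]))  # 3579 447 448
```

Giving the remainder to the test set means no sample is lost to rounding when the fractions do not divide the row count evenly.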

In [10]:
print(np.sum(train_targets),train_samples_count, np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets),validation_samples_count, np.sum(validation_targets)/validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets)/ test_samples_count)


1793.0 3579 0.5009779267951942
224.0 447 0.5011185682326622
220.0 448 0.49107142857142855


# 6. Saving the preprocessed data as .npz
The preprocessed data is now saved as tensors so that we can build the model and perform analysis on the data in the next section.


In [11]:
np.savez("Audiobooks_data_train",inputs=train_inputs,targets=train_targets)
np.savez("Audiobooks_data_validation",inputs=validation_inputs,targets=validation_targets)
np.savez("Audiobooks_data_test",inputs=test_inputs,targets=test_targets)
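
For the next section, the saved files can be read back with `np.load`. The round trip below is a sketch on hypothetical toy data written to a temporary directory; the dtype casts are an assumption about what a later model might expect, not something the notebook specifies. Note that `np.savez` appends `.npz` to a file name that lacks it.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays standing in for the preprocessed data
toy_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_data_train")  # hypothetical file name
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # np.savez added the extension, so the file on disk is toy_data_train.npz
    with np.load(path + ".npz") as npz:
        # Assumed dtypes: float inputs and integer class labels
        loaded_inputs = npz["inputs"].astype(np.float32)
        loaded_targets = npz["targets"].astype(np.int64)

print(loaded_inputs.dtype, loaded_targets)  # float32 [0 1]
```

The keyword names passed to `np.savez` (`inputs`, `targets`) become the keys used to read the arrays back.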
