# Audiobooks business case

## Overview

This is a machine learning problem that consists of estimating the likelihood that an audiobook customer will make another purchase (conversion), given past purchases and usage information. If potential recurring customers can be identified, marketing campaigns can be made more effective by focusing effort and resources on these clients.

We were given a raw .csv file that needs to be processed in order to extract the relevant data to build the model. The file represents 2 years of past user engagement data, plus a final column indicating whether the customer made any purchase in the 6 months following that period. In other words, the past data contains the inputs, and the purchase information is the target of the model.

This notebook preprocesses the data and generates files in a suitable format (.npz) for further modeling. The actual modeling is done in a separate notebook, to avoid recreating the files when the kernel is restarted.

## Extract the data from the csv

The data has the following columns:


[ID, Overall Book Length (mins), Average Book Length (mins), Overall Price, Average Price, Review, Review 10/10, Minutes Listened, Completion, Support Requests, Last Visited minus Purchase Date, Targets]



The user ID does not provide any relevant information and will be dropped. The final 'Targets' column indicates whether the user converted in the following 6 months. The remaining columns are the inputs that will be used to predict purchases.

In [1]:
import numpy as np
from sklearn import preprocessing

data_raw = np.loadtxt('Audiobooks_data.csv', delimiter=',')
inputs_raw = data_raw[:,1:-1]
targets_raw = data_raw[:,-1]
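
To make the column slicing concrete, here is a minimal sketch on a tiny made-up array. The values and the reduced number of columns are hypothetical and only stand in for the real file; the slicing expressions are the same as in the cell above.

```python
import numpy as np

# Hypothetical 3-row stand-in for the CSV: ID, two input columns, target
toy = np.array([
    [101, 1620.0, 19.73, 1],
    [102, 2160.0, 5.33, 0],
    [103, 648.0, 8.00, 0],
])

toy_inputs = toy[:, 1:-1]   # drop the first (ID) and last (target) columns
toy_targets = toy[:, -1]    # keep only the last column

print(toy_inputs.shape)   # (3, 2)
print(toy_targets)        # [1. 0. 0.]
```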

## Shuffle the data

First, the data needs to be shuffled. This prevents the model from becoming biased if the file is sorted in some way, such as by day of the week or time of purchase.

In [2]:
# Shuffling the indices
shuffled_indices = np.arange(inputs_raw.shape[0])
np.random.shuffle(shuffled_indices)

# Reorder the inputs and targets using the shuffled indices
inputs_shuffled = inputs_raw[shuffled_indices]
targets_shuffled = targets_raw[shuffled_indices]
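
The shuffle above gives a different order on every run. If reproducible splits are wanted, one option is a seeded generator, sketched below on made-up arrays. The seed value and the `demo_*` names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value 42 is an arbitrary choice

demo_inputs = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ...
demo_targets = np.array([0, 1, 0, 1, 0])

# One permutation applied to both arrays keeps each row paired with its target
perm = rng.permutation(demo_inputs.shape[0])
demo_inputs_shuffled = demo_inputs[perm]
demo_targets_shuffled = demo_targets[perm]
```

Using the same index array for inputs and targets is what keeps the two aligned, exactly as in the cell above.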

## Balance the dataset

The next step is to make sure the data is balanced, since the number of users who converted is probably not equal to the number who didn't. An imbalance could result in a model that performs poorly in real life, because the training data was biased towards a certain result. In this audiobook problem, 'did not convert' represents 84% of the data, so a model trained on the raw data would most likely predict this outcome when applied to real-world data. We therefore discard some of the 'did not convert' samples to obtain a model with better predictive power.

In [3]:
# The data is already shuffled, so dropping every zero-target row after the
# first n_target_one zeros amounts to removing a random subset of them
# Purchases are represented by 1, so summing the targets counts the converted users
n_target_one = int(targets_shuffled.sum())

counter_target_zero = 0
indices_to_remove = []

for i in range(targets_shuffled.shape[0]):
    if targets_shuffled[i] == 0:
        counter_target_zero += 1
        # Once there are as many zeros as ones, mark any further zero for removal
        if counter_target_zero > n_target_one:
            indices_to_remove.append(i)

inputs_prior = np.delete(inputs_shuffled, indices_to_remove, axis=0)
targets_prior = np.delete(targets_shuffled, indices_to_remove, axis=0)
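
The same balancing can also be written without an explicit loop. The following is a minimal sketch under the assumption that targets contain only 0 and 1; the function name `balance_by_undersampling` is hypothetical. It keeps every target-1 row plus the first equally many target-0 rows, in their original order.

```python
import numpy as np

def balance_by_undersampling(inputs, targets):
    """Keep all target-1 rows and only the first equally many target-0 rows."""
    ones_idx = np.flatnonzero(targets == 1)
    zeros_idx = np.flatnonzero(targets == 0)
    # Sorting preserves the original row order, as the loop-based version does
    keep = np.sort(np.concatenate([ones_idx, zeros_idx[:ones_idx.size]]))
    return inputs[keep], targets[keep]

demo_targets = np.array([0, 0, 1, 0, 0, 1, 0])
demo_inputs = np.arange(7).reshape(7, 1)

balanced_inputs, balanced_targets = balance_by_undersampling(demo_inputs, demo_targets)
print(balanced_targets)  # [0 0 1 1]
```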

## Re-shuffle the data

The balancing step removed zeros only from the later part of the dataset, so the remaining target 1s are concentrated towards the end. Re-shuffling the data spreads both classes evenly again.

In [4]:
indices_prior = np.arange(inputs_prior.shape[0])
np.random.shuffle(indices_prior)

inputs_prior = inputs_prior[indices_prior]
targets_prior = targets_prior[indices_prior]

## Standardize the inputs

It is good practice to scale the inputs, since they initially have very different orders of magnitude; standardizing them improves the learning process.

In [5]:
scaled_inputs = preprocessing.scale(inputs_prior)
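
For intuition, standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand with NumPy on a made-up array; with its default arguments, `preprocessing.scale` should produce the same result, since it also uses the population standard deviation.

```python
import numpy as np

# Two hypothetical columns with very different orders of magnitude
demo = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Column-wise standardization: zero mean and unit variance per column
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(manual.mean(axis=0))  # approximately [0. 0.]
print(manual.std(axis=0))   # approximately [1. 1.]
```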

## Split the dataset

We'll split the data and use 80% of it to train the model, 10% to validate and the final 10% to perform a final test.

In [6]:
num_samples = inputs_prior.shape[0]

train_samples = int(0.8 * num_samples)
valid_samples = int(0.1 * num_samples)
test_samples = num_samples - train_samples - valid_samples

train_inputs = scaled_inputs[:train_samples]
train_targets = targets_prior[:train_samples]

valid_inputs = scaled_inputs[train_samples : train_samples + valid_samples]
valid_targets = targets_prior[train_samples : train_samples + valid_samples]

test_inputs = scaled_inputs[train_samples + valid_samples : train_samples + valid_samples + test_samples]
test_targets = targets_prior[train_samples + valid_samples : train_samples + valid_samples + test_samples]

# Checking the balance. They should be roughly 0.5
print(train_targets.sum()/train_samples)
print(valid_targets.sum()/valid_samples)
print(test_targets.sum()/test_samples)


0.502374965074043
0.49217002237136465
0.4888392857142857
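
As a more compact alternative, the three slices can be produced in one call with `np.split`. This is only a sketch: the helper name `split_80_10_10` is hypothetical, and it uses the same rounding as the cell above, so any leftover samples go to the test set.

```python
import numpy as np

def split_80_10_10(array):
    """Split along the first axis into 80% / 10% / remainder."""
    n = array.shape[0]
    train_end = int(0.8 * n)
    valid_end = train_end + int(0.1 * n)
    # np.split cuts at the given indices and returns the three pieces in order
    return np.split(array, [train_end, valid_end])

train, valid, test = split_80_10_10(np.arange(25))
print(len(train), len(valid), len(test))  # 20 2 3
```

Calling the helper once on the inputs and once on the targets yields aligned splits, because both arrays have the same number of rows.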


## Save the three datasets in .npz

In [7]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_valid', inputs=valid_inputs, targets=valid_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
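
For reference, `np.savez` appends the `.npz` extension to the file name, and the arrays are read back by the keyword names used when saving. The sketch below shows the round trip on made-up arrays in a temporary directory; the file name `demo_split` is hypothetical.

```python
import os
import tempfile
import numpy as np

demo_inputs = np.zeros((4, 3))
demo_targets = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_split')  # hypothetical file name
    np.savez(path, inputs=demo_inputs, targets=demo_targets)

    # The file on disk is 'demo_split.npz'; arrays are looked up by keyword
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(loaded_inputs.shape)  # (4, 3)
print(loaded_targets)       # [0 1 1 0]
```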