# Introduction

### The Basics of what we are dealing with

1. We have data from an audiobook app in which every customer has made at least one purchase.
2. We want to build a machine learning model that predicts whether a customer will buy again.
3. Reason: the company should not spend resources on customers who are unlikely to come back.
4. The columns include book length, average book length, price and average price, review, minutes listened, etc.
5. The review column has a value only for customers who left a review. We substitute all missing values with the average review.
6. The inputs cover 2 years of activity, and the targets (whether the person converted) were recorded over the following 6 months, so the data spans 2.5 years in total.
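
The substitution described in point 5 can be sketched as follows. This is a minimal illustration on made-up values, not the code that was used to prepare the CSV; the array `reviews` and its contents are assumptions, and missing reviews are assumed to be marked with `np.nan`.

```python
import numpy as np

# Hypothetical review scores (out of 10); np.nan marks customers who left no review.
reviews = np.array([8.0, np.nan, 10.0, np.nan, 6.0])

# Average over the reviews that actually exist, ignoring the missing entries.
average_review = np.nanmean(reviews)

# Replace every missing entry with that average and keep the real scores as they are.
filled_reviews = np.where(np.isnan(reviews), average_review, reviews)

print(average_review)
print(filled_reviews)
```

The separate "review (1 or 0)" column still records who actually left a review, so that information is not lost by filling in the average.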

### The Data : Columns explained

1. Customer ID
2. Book length overall (in minutes)
3. Book length average (in minutes)
4. Price overall
5. Price average
6. Review (1 if the customer left a review, else 0)
7. Review score out of 10
8. Minutes listened
9. Completion (minutes listened / book length overall)
10. Support requests (forgotten password, etc.): a high count can mean the customer stuck with us despite the hassle, but it can also mean the customer left the platform because of it.
11. Last visited minus purchase date (the higher the number, the more regular the person is)
12. Targets (1 if the customer converted, else 0)

### Since this is real-life data, we need to preprocess it

1. Balance the dataset
2. Divide the dataset into training, validation and test sets
3. Save the data in a tensor friendly format

### What is balancing the data?

Take an example: if 90% of our data are photos of cats and the remaining 10% are photos of dogs, the ML algorithm, which tries to optimize the loss, quickly learns that answering "cat" is right most of the time. It then gives the same prediction every time: cat.
1. The initial probability of picking a photo of a given class is called a prior.
2. In the example above, the priors are 0.9 for cats and 0.1 for dogs.
3. In a balanced dataset the priors are roughly equal: close to 0.5 each with 2 classes, and close to 0.33 each with 3 classes.

In our audiobook data, most of the customers did not convert again, so we must keep an equal number of customers who did and did not convert.
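
To make the idea of priors concrete, here is a small sketch that computes them for the cats and dogs example. The list `labels` is made up for illustration and is not part of the audiobook data.

```python
from collections import Counter

# Hypothetical labels mirroring the example: 90 cats and 10 dogs.
labels = ["cat"] * 90 + ["dog"] * 10

# The prior of a class is its share of all samples.
counts = Counter(labels)
priors = {label: count / len(labels) for label, count in counts.items()}
print(priors)

# A model that always answers with the majority class is right this often,
# which is why an unbalanced dataset can make a useless model look accurate.
majority_accuracy = max(priors.values())
print(majority_accuracy)
```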

# Preprocessing begins 

### Extract the data

In [8]:
import numpy as np
from sklearn import preprocessing
import pandas as pd
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter =',')
unscaled_inputs_all = raw_csv_data[:,1:-1] # every column from the second up to, but not including, the last
targets_all = raw_csv_data[:,-1]

We don't need the customer ID (the first column) because it carries no useful information. So we remove the ID and the targets to produce the inputs.

### Shuffle the data
Shuffling before balancing ensures that the 0-target records we eliminate are a random selection, so shuffling must be done first. It also matters for batching: for example, if the data is arranged by date, each batch would contain records from a single period, and this may confuse SGD when we average the loss across the batches.

In [9]:
shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
seed = 100 # fixed seed so that the shuffle is reproducible
np.random.seed(seed)
np.random.shuffle(shuffled_indices)

unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
targets_all = targets_all[shuffled_indices]

### Balance the data
1. We count the number of targets that are 1
2. We keep as many 0s as there are 1s

In [10]:
count_of_1s = np.sum(targets_all == 1)
count_of_0s = 0 # running count of the 0 targets seen so far
indices_to_remove = [] # np.delete accepts a list of indices, so we start with an empty list

for i in range(targets_all.shape[0]):  # shape on the zero axis is the length of the vector
    if targets_all[i]==0:
        count_of_0s += 1
        if count_of_0s > count_of_1s:
            indices_to_remove.append(i)     
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all,indices_to_remove,axis =0)
targets_equal_priors = np.delete(targets_all,indices_to_remove,axis=0)
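
The same balancing step can also be written without an explicit loop. The sketch below is an alternative formulation on toy arrays, not the notebook's own code; `toy_targets` and `toy_inputs` are made-up stand-ins for `targets_all` and `unscaled_inputs_all`.

```python
import numpy as np

# Toy targets standing in for targets_all: three 1s and six 0s.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
toy_inputs = np.arange(toy_targets.shape[0] * 2).reshape(-1, 2)

count_of_1s = int(np.sum(toy_targets == 1))

# Positions of all 0 targets; every 0 after the first count_of_1s of them is surplus.
zero_indices = np.flatnonzero(toy_targets == 0)
indices_to_remove = zero_indices[count_of_1s:]

# Drop the surplus rows from inputs and targets with the same indices,
# so each input row stays aligned with its target.
balanced_inputs = np.delete(toy_inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(toy_targets, indices_to_remove, axis=0)

print(balanced_targets)
print(balanced_targets.mean())
```

Like the loop above, this keeps the first 0s it meets and drops the later ones, which is why the data has to be shuffled beforehand.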

### Standardize the inputs
If the inputs are not standardized, the accuracy of the model drops by about 10%.

In [11]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
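
For intuition, standardizing means subtracting each column's mean and dividing by its standard deviation, so every feature ends up with mean 0 and standard deviation 1. The sketch below does this by hand with NumPy on a made-up matrix (`toy_inputs` is an assumption for illustration), using the population standard deviation, which is what `preprocessing.scale` uses by default.

```python
import numpy as np

# Toy matrix: two features on very different scales (e.g. minutes and price).
toy_inputs = np.array([[1000.0, 5.0],
                       [2000.0, 7.0],
                       [3000.0, 9.0]])

# Subtract each column's mean and divide by its (population) standard deviation.
column_means = toy_inputs.mean(axis=0)
column_stds = toy_inputs.std(axis=0)
toy_scaled = (toy_inputs - column_means) / column_stds

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns hold the same values even though the raw numbers differed by orders of magnitude, so no feature dominates just because of its units.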

## Important concept
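We shuffled the data and then balanced it. Balancing removes the surplus 0s from the later part of the array, so the remaining 0s sit towards the start and the end contains only 1s. If we split now, one class may accumulate in one of the sets (training, validation, test). So we shuffle the data again.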

In [12]:
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.seed(seed)
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

### Split the data into training, validation and test
We will use an 80-10-10 split.

In [13]:
samples_count = shuffled_inputs.shape[0]
train_count = int(0.8 * samples_count)
validation_count = int(0.1 * samples_count)
test_count = samples_count - train_count - validation_count

train_inputs = shuffled_inputs[:train_count]
train_targets = shuffled_targets[:train_count]


validation_inputs = shuffled_inputs[train_count:train_count+validation_count]
validation_targets = shuffled_targets[train_count:train_count+validation_count]

test_inputs = shuffled_inputs[train_count+validation_count:]
test_targets = shuffled_targets[train_count+validation_count:]

print(np.sum(train_targets),train_count,np.sum(train_targets)/train_count)
print(np.sum(validation_targets),validation_count,np.sum(validation_targets)/validation_count)
print(np.sum(test_targets),test_count,np.sum(test_targets)/test_count)

1772.0 3579 0.4951103660240291
237.0 447 0.5302013422818792
228.0 448 0.5089285714285714


We notice that the priors are approximately 50% in all three sets (values up to about 55% are also acceptable).

## Now we save our data into npz files

In [14]:
np.savez('Audiobook_train_data',inputs = train_inputs,targets = train_targets)
np.savez('Audiobook_validation_data',inputs = validation_inputs,targets = validation_targets)
np.savez('Audiobook_test_data',inputs = test_inputs,targets = test_targets)

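#### Note:
The keyword names in the savez function are arbitrary: you can write any word (cat, dog, input, inputss, etc.) instead of 'inputs', and the same goes for 'targets'. Whatever names you choose become the keys you use when loading the file.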
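
As a quick check of how those names are used later, here is a minimal sketch that saves two toy arrays and reads them back. The file name `toy_data.npz` and the toy arrays are made up for illustration, and the file is written to a temporary folder rather than next to the real data.

```python
import os
import tempfile

import numpy as np

# Toy arrays standing in for train_inputs and train_targets.
toy_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
toy_targets = np.array([0, 1])

# Write to a temporary folder so the sketch does not touch the real files.
path = os.path.join(tempfile.mkdtemp(), 'toy_data.npz')
np.savez(path, inputs=toy_inputs, targets=toy_targets)

# The keyword names chosen in savez become the keys of the loaded archive.
with np.load(path) as npz:
    saved_keys = sorted(npz.files)
    loaded_inputs = npz['inputs']
    loaded_targets = npz['targets']

print(saved_keys)
print(loaded_inputs.shape, loaded_targets.shape)
```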