<a href="https://colab.research.google.com/github/apoorva666/Audiobooks-Customer-Attrition/blob/main/AudioBooks_Customer_Attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***Data Preprocessing - The 'Review10/10' column had many missing values, which have been replaced with the column's average value***
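
The imputation step itself is not shown in this notebook. A minimal sketch of how it could be done with pandas, assuming the missing ratings are stored as NaN (the toy frame below is hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data lives in Audiobooks_data.csv.
toy = pd.DataFrame({
    "Review": [1, 0, 0, 1],
    "Review10/10": [10.0, np.nan, np.nan, 8.0],
})

# Mean of the observed ratings only (pandas skips NaN by default).
mean_rating = toy["Review10/10"].mean()

# Replace every missing rating with that mean.
toy["Review10/10"] = toy["Review10/10"].fillna(mean_rating)
print(toy)
```

In the table printed further down, the value 8.91 recurs wherever Review is 0, which is presumably the average that was filled in.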

**Importing packages & extracting data from .csv file**

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing                       #Provides scale() for standardizing the inputs

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
raw_data=pd.read_csv('/content/drive/My Drive/Colab Notebooks/Audiobooks attrition/Audiobooks_data.csv')

In [None]:
raw_data

Unnamed: 0.1,Unnamed: 0,Book_length(mins)_overall,Book_length(mins)_avg,Price_overall,Price_avg,Review,Review10/10,Completion,Minutes_listened,Support_Request,Last_Visited_mins_Purchase_date,Target
0,994,1620.0,1620,19.73,19.73,1,10.00,0.99,1603.8,5,92,0
1,1143,2160.0,2160,5.33,5.33,0,8.91,0.00,0.0,0,0,0
2,2059,2160.0,2160,5.33,5.33,0,8.91,0.00,0.0,0,388,0
3,2882,1620.0,1620,5.96,5.96,0,8.91,0.42,680.4,1,129,0
4,3342,2160.0,2160,5.33,5.33,0,8.91,0.22,475.2,0,361,0
...,...,...,...,...,...,...,...,...,...,...,...,...
14079,28220,1620.0,1620,5.33,5.33,1,9.00,0.61,988.2,0,4,0
14080,28671,1080.0,1080,6.55,6.55,1,6.00,0.29,313.2,0,29,0
14081,31134,2160.0,2160,6.14,6.14,0,8.91,0.00,0.0,0,0,0
14082,32832,1620.0,1620,5.33,5.33,1,8.00,0.38,615.6,0,90,0


**Understanding the data**

**ID: Customer ID.**

**Book_length(mins)_overall: Sum of the lengths of purchases.**

**Book_length(mins)_avg: Sum of the lengths of purchases divided by the number of purchases.**

**Price_overall & Price_avg: Total money spent, and average money spent per purchase.**

**Review: Boolean. Shows whether the customer left a review. If so, Review10/10 holds the rating the customer gave.**

**Minutes_listened: A measure of engagement, the total number of minutes the user listened to audiobooks.**

**Completion: Minutes_listened / Book_length(mins)_overall.**

**Support_Request: Total number of support requests made by the customer.**

**Last_Visited_mins_Purchase_date: Difference between the customer's last visit and the purchase date; the bigger the difference, the higher the engagement.**
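
The Completion definition can be checked against the rows printed above. A small sketch, using the values from rows 0, 3 and 4 of that table:

```python
import numpy as np

# Minutes_listened and Book_length(mins)_overall from rows 0, 3 and 4 above.
minutes_listened = np.array([1603.8, 680.4, 475.2])
book_length_overall = np.array([1620.0, 1620.0, 2160.0])

# Completion = Minutes_listened / Book_length(mins)_overall
completion = minutes_listened / book_length_overall
print(np.round(completion, 2))
```

Rounded to two decimals, the results agree with the Completion column for those rows (0.99, 0.42, 0.22).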

**Visualization**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
original_data = raw_data.values #Converting dataframe to numpy array

In [None]:
input_data = original_data[:,1:-1]      #Inputs: every column except the first and the last (Target)
target_data = original_data[:,-1]       #Targets: the last column

**Balancing the dataset to prevent bias**

In [None]:
no_one_targets = int(np.sum(target_data)) #Counting the no. of 1s
no_one_targets

2237

In [None]:
no_zero_targets_counter = 0      #Counts the 0 targets seen so far

indices_to_remove = []           #Rows with surplus 0 targets, removed to balance the classes

for i in range(target_data.shape[0]):  #Loop over all rows
    if target_data[i] == 0:
        no_zero_targets_counter += 1
        #Once there are as many 0s as 1s, mark every further 0 for removal
        if no_zero_targets_counter > no_one_targets:
            indices_to_remove.append(i)

balanced_inputs = np.delete(input_data, indices_to_remove, axis=0)    #Drops the marked rows from the inputs
balanced_targets = np.delete(target_data, indices_to_remove, axis=0)  #Drops the same rows from the targets
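
The same balancing rule (keep the first N zeros, where N is the number of ones) can also be written without an explicit loop. A possible vectorized sketch on hypothetical toy arrays; the `toy_*` names are illustrative and not part of the notebook:

```python
import numpy as np

# Hypothetical toy targets: three 1s and six 0s.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
toy_inputs = np.arange(toy_targets.size * 2).reshape(-1, 2)

num_ones = int(toy_targets.sum())

# Row positions of all 0 targets, in their original order.
zero_positions = np.flatnonzero(toy_targets == 0)

# Everything after the first `num_ones` zeros is surplus.
surplus_zero_positions = zero_positions[num_ones:]

balanced_toy_inputs = np.delete(toy_inputs, surplus_zero_positions, axis=0)
balanced_toy_targets = np.delete(toy_targets, surplus_zero_positions, axis=0)
print(balanced_toy_targets)
```

Like the loop above, this keeps the earliest zeros and drops the later ones, so any ordering in the file carries over into the balanced set until it is shuffled.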

**Standardizing the inputs**

In [None]:
scaled_inputs=preprocessing.scale(balanced_inputs)
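
For reference, standardizing means subtracting each column's mean and dividing by its standard deviation. A sketch of that computation in plain NumPy on a hypothetical toy array, assuming the population standard deviation (ddof=0), which is what `preprocessing.scale` uses:

```python
import numpy as np

# Hypothetical toy inputs with two columns on very different scales.
toy_inputs = np.array([[1.0, 100.0],
                       [2.0, 200.0],
                       [3.0, 300.0]])

column_means = toy_inputs.mean(axis=0)
column_stds = toy_inputs.std(axis=0)   # population std (ddof=0)

# Each column ends up with mean 0 and standard deviation 1.
toy_scaled = (toy_inputs - column_means) / column_stds
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```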

**Shuffling data**

In [None]:
shuffled_indices=np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

In [None]:
shuffled_inputs=scaled_inputs[shuffled_indices]
shuffled_targets=balanced_targets[shuffled_indices]
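
The shuffle above gives a different order on every run. If reproducible splits are wanted, one option is a seeded generator. A sketch on hypothetical toy arrays (the seed value 42 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # arbitrary seed, chosen for illustration

# Hypothetical toy data: row i holds the values 2*i and 2*i + 1.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 1])

# One permutation of the row indices, applied to inputs and targets alike,
# so every input row stays paired with its own target.
order = rng.permutation(toy_inputs.shape[0])
toy_shuffled_inputs = toy_inputs[order]
toy_shuffled_targets = toy_targets[order]
```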

**Splitting into train, validation, & testing**

In [None]:
balanced_row_count=shuffled_inputs.shape[0]

train_count = int(0.8 * balanced_row_count)
validation_count = int(0.1 * balanced_row_count)
test_count = balanced_row_count - train_count - validation_count

In [None]:
train_inputs=shuffled_inputs[:train_count]
train_targets=shuffled_targets[:train_count]

validation_inputs = shuffled_inputs[train_count:train_count+validation_count]
validation_targets = shuffled_targets[train_count:train_count+validation_count]

test_inputs = shuffled_inputs[train_count+validation_count:]
test_targets = shuffled_targets[train_count+validation_count:]
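
The three slices can also be produced in one call with `np.split`, which cuts an array at the given positions. A sketch on a hypothetical 20-element array using the same 80/10/10 proportions:

```python
import numpy as np

toy_data = np.arange(20)               # hypothetical stand-in for the shuffled rows
row_count = toy_data.shape[0]

train_end = int(0.8 * row_count)                    # end of the training slice
validation_end = train_end + int(0.1 * row_count)   # end of the validation slice

# Cuts at train_end and validation_end; the remainder becomes the test set.
toy_train, toy_validation, toy_test = np.split(toy_data, [train_end, validation_end])
print(toy_train.size, toy_validation.size, toy_test.size)
```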

**Checking the balance of the split datasets**

In [None]:
print(np.sum(train_targets), train_count, np.sum(train_targets) / train_count)
print(np.sum(validation_targets), validation_count, np.sum(validation_targets) / validation_count)
print(np.sum(test_targets), test_count, np.sum(test_targets) / test_count)

#The last number on each line (the share of 1s) should be close to 0.5

1809.0 3579 0.5054484492875104
198.0 447 0.4429530201342282
230.0 448 0.5133928571428571


**Saving the three datasets in .npz format**

In [None]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
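
Note that `np.savez` appends the `.npz` extension when the file name does not already have it, which is why the files are later loaded as `Audiobooks_data_train.npz` and so on. A small round-trip sketch with hypothetical toy arrays and a temporary directory:

```python
import os
import tempfile

import numpy as np

toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path_without_ext = os.path.join(tmp_dir, "toy_split")   # hypothetical file name
    np.savez(path_without_ext, inputs=toy_inputs, targets=toy_targets)

    saved_path = path_without_ext + ".npz"    # extension added by np.savez
    file_exists = os.path.exists(saved_path)

    # Arrays are retrieved by the keyword names used when saving.
    with np.load(saved_path) as npz:
        loaded_inputs = npz["inputs"]
        loaded_targets = npz["targets"]

print(file_exists, loaded_inputs.shape, loaded_targets.shape)
```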

**Machine Learning**

In [None]:
import numpy as np
import tensorflow as tf

In [None]:
# np.float and np.int were removed in NumPy 1.24; use the explicit dtypes instead
npz=np.load('Audiobooks_data_train.npz')
train_inputs=npz['inputs'].astype(np.float64)
train_targets=npz['targets'].astype(np.int64)

npz=np.load('Audiobooks_data_validation.npz')
validation_inputs=npz['inputs'].astype(np.float64)
validation_targets=npz['targets'].astype(np.int64)

npz=np.load('Audiobooks_data_test.npz')
test_inputs=npz['inputs'].astype(np.float64)
test_targets=npz['targets'].astype(np.int64)

In [None]:
# Set the input and output sizes
input_size = 10
output_size = 2
hidden_layer_size = 300
    
# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense implements: output = activation(dot(input, weight) + bias)
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

# we define the optimizer we'd like to use, the loss function, and the metrics we are interested in obtaining at each iteration
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training
batch_size = 100
max_epochs = 100

#Early stopping
early_stopping=tf.keras.callbacks.EarlyStopping(patience=2) #Patience = number of epochs without improvement in val_loss (the default monitor) tolerated before stopping

# fit the model
model.fit(train_inputs, 
          train_targets, 
          batch_size=batch_size, 
          epochs=max_epochs, # epochs that we will train for 
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2 # making sure we get enough information about the training process
          )  

Epoch 1/100
36/36 - 0s - loss: 0.4540 - accuracy: 0.7555 - val_loss: 0.4407 - val_accuracy: 0.7494
Epoch 2/100
36/36 - 0s - loss: 0.3684 - accuracy: 0.8016 - val_loss: 0.3827 - val_accuracy: 0.8054
Epoch 3/100
36/36 - 0s - loss: 0.3479 - accuracy: 0.8106 - val_loss: 0.3768 - val_accuracy: 0.7987
Epoch 4/100
36/36 - 0s - loss: 0.3464 - accuracy: 0.8178 - val_loss: 0.3574 - val_accuracy: 0.8031
Epoch 5/100
36/36 - 0s - loss: 0.3421 - accuracy: 0.8122 - val_loss: 0.3705 - val_accuracy: 0.7942
Epoch 6/100
36/36 - 0s - loss: 0.3403 - accuracy: 0.8097 - val_loss: 0.3648 - val_accuracy: 0.8076


<tensorflow.python.keras.callbacks.History at 0x7f676804f550>
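
Training stopped after 6 of the 100 allowed epochs. A simplified sketch of the patience rule, not the actual Keras implementation, assuming the defaults (monitoring val_loss, any decrease counts as improvement), applied to the val_loss values printed above:

```python
def epochs_run(val_losses, patience):
    """Return the 1-based epoch at which patience-based early stopping halts."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement this epoch.
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# val_loss values copied from the training log above.
recorded_val_loss = [0.4407, 0.3827, 0.3768, 0.3574, 0.3705, 0.3648]
print(epochs_run(recorded_val_loss, patience=2))  # prints 6
```

The best val_loss is reached at epoch 4; epochs 5 and 6 both fail to beat it, which exhausts the patience of 2.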

**Testing the model**

In [None]:
test_loss,test_accuracy=model.evaluate(test_inputs,test_targets)



In [None]:
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100))   #The loss is not a percentage, so only the accuracy is scaled by 100

Test loss: 0.36. Test accuracy: 80.36%
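
To make the two metrics concrete: sparse categorical cross-entropy is the mean negative log of the probability the model assigns to the true class, and accuracy is the share of rows where the most probable class equals the target. A sketch on hypothetical softmax outputs (these numbers are made up, not model predictions):

```python
import numpy as np

# Hypothetical softmax outputs for four customers (columns: class 0, class 1).
toy_probs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4],
                      [0.3, 0.7]])
toy_targets = np.array([0, 1, 1, 1])

# Probability assigned to the true class of each row.
true_class_probs = toy_probs[np.arange(toy_targets.size), toy_targets]

# Sparse categorical cross-entropy: mean negative log-probability of the true class.
toy_loss = -np.log(true_class_probs).mean()

# Accuracy: fraction of rows where the arg-max class matches the target.
toy_accuracy = (toy_probs.argmax(axis=1) == toy_targets).mean()

print(f"loss={toy_loss:.4f}, accuracy={toy_accuracy:.2f}")
```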
