# APP purchasing prediction with TensorFlow. Part 1

# Preprocessing

Code Created by Luis Enrique Acevedo Galicia

Date: 2019-24-02

Here I present a simple way to build a machine learning model that predicts whether a customer will buy again. Part 1 covers preprocessing the data.

The APP_data.csv file contains, for each user: the user ID, average minutes purchasing the app, total minutes purchasing the app, average price paid, total price paid, whether the user left a review, the review, percentage of all app functions used, minutes using the app, number of times the user has called support, and minutes of returning to our website. The last column is the target: whether the user has purchased again (1) or not (0).


# Preprocessing the data

# The Libraries

In [1]:
import numpy as np
from sklearn import preprocessing

# Get the data

In [2]:
#Load the data
data_raw = np.loadtxt('APP_data.csv', delimiter=',')

#Define inputs and targets: column 0 (user ID) is skipped, the last column is the target
Inputs_unscaled = data_raw[:,1:-1]
Targets = data_raw[:,-1]
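
As a quick check of the slicing above, here is a minimal sketch on a made-up three-row CSV. The values and the `demo_*` names are hypothetical and only stand in for APP_data.csv; the layout (ID first, target last) follows the description of the file.

```python
import io
import numpy as np

# Hypothetical stand-in for APP_data.csv: ID, two features, target
csv_text = "101,12.5,3.0,1\n102,7.0,1.5,0\n103,9.25,2.0,0\n"
demo_raw = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# Same slicing as in the cell above
demo_inputs = demo_raw[:, 1:-1]   # drop the ID column and the target column
demo_targets = demo_raw[:, -1]    # the last column is the target

print(demo_inputs.shape)   # (3, 2)
print(demo_targets)        # [1. 0. 0.]
```

The ID column carries no predictive information, which is presumably why it is left out of the inputs.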


# Balance Data

In [3]:
#Count the targets equal to one (users who purchased again)
Targets_one = int(np.sum(Targets))

#Running count of the targets equal to zero
Targets_zero = 0
#Indices of the rows to remove in order to balance the dataset
Remove_idx = []
#Keep as many zeros as there are ones; every additional zero is marked for removal
for i in range(Targets.shape[0]):
    if Targets[i] == 0:
        Targets_zero += 1
        if Targets_zero > Targets_one:
            Remove_idx.append(i)

#Create balanced inputs and targets
Inputs_unscaled_balanced = np.delete(Inputs_unscaled, Remove_idx, axis = 0)
Targets_balanced = np.delete(Targets, Remove_idx, axis = 0)
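
The same balancing can probably be written without the explicit loop. The sketch below is an alternative I am assuming is equivalent to the cell above (it keeps the first zeros it meets, up to the number of ones); `drop_surplus_zeros` and the demo arrays are hypothetical names and data, not part of the original code.

```python
import numpy as np

def drop_surplus_zeros(inputs, targets):
    """Keep all ones and only the first len(ones) zeros, in original order."""
    n_ones = int(np.sum(targets))
    zero_idx = np.flatnonzero(targets == 0)   # positions of all zeros
    remove_idx = zero_idx[n_ones:]            # zeros beyond the number of ones
    balanced_inputs = np.delete(inputs, remove_idx, axis=0)
    balanced_targets = np.delete(targets, remove_idx, axis=0)
    return balanced_inputs, balanced_targets

# Made-up example: 2 ones and 5 zeros
demo_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
demo_inputs = np.arange(14, dtype=float).reshape(7, 2)

bal_inputs, bal_targets = drop_surplus_zeros(demo_inputs, demo_targets)
print(bal_targets)        # [0. 1. 0. 1.]
print(bal_inputs.shape)   # (4, 2)
```

Note that both versions only remove zeros, so they balance the data only when zeros are the majority class.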

# Standardize Data (inputs) and shuffle all data

In [4]:
#Scale the inputs
Inputs_scaled = preprocessing.scale(Inputs_unscaled_balanced)

#Shuffle data to randomly spread dataset
Idx_shuffle = np.arange(Inputs_scaled.shape[0])
np.random.shuffle(Idx_shuffle)

Inputs_shuffle = Inputs_scaled[Idx_shuffle]
Targets_shuffle = Targets_balanced[Idx_shuffle]
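
To make the two steps above more concrete, here is a small sketch in plain NumPy. It assumes that `preprocessing.scale` with default arguments standardizes each column to zero mean and unit variance, and it reproduces that by hand on made-up data. The fixed seed is my own assumption for reproducibility; the cell above shuffles without a seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, not in the original

# Made-up inputs and targets
demo_inputs = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [4.0, 30.0]])
demo_targets = np.array([0.0, 1.0, 1.0, 0.0])

# Column-wise z-score: subtract the column mean, divide by the column std
col_mean = demo_inputs.mean(axis=0)
col_std = demo_inputs.std(axis=0)   # population std (ddof=0)
demo_scaled = (demo_inputs - col_mean) / col_std

# One shared index permutation keeps every input row paired with its target
idx = rng.permutation(demo_scaled.shape[0])
inputs_shuffled = demo_scaled[idx]
targets_shuffled = demo_targets[idx]

print(demo_scaled.mean(axis=0))  # approximately 0 for every column
print(demo_scaled.std(axis=0))   # approximately 1 for every column
```

Shuffling the index array, rather than the two arrays separately, is what keeps inputs and targets aligned.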

# Create the train, validation and test data sets

In [5]:
#total number of samples
Tsample = Inputs_shuffle.shape[0]

#define the fraction of samples for training and validation; the remainder is the test set
Train = 0.80
Valid = 0.10


Train_samples = int(Train*Tsample)
Valid_samples = int(Valid*Tsample)
Test_samples = Tsample-Train_samples-Valid_samples

#define Train data set
Inputs_Train = Inputs_shuffle[:Train_samples]
Targets_Train = Targets_shuffle[:Train_samples]

#define Validation data set
Inputs_Valid = Inputs_shuffle[Train_samples:Train_samples+Valid_samples]
Targets_Valid = Targets_shuffle[Train_samples:Train_samples+Valid_samples]


#define Test data set
Inputs_Test = Inputs_shuffle[Train_samples+Valid_samples:]
Targets_Test = Targets_shuffle[Train_samples+Valid_samples:]

#Verify results: each set should contain roughly half ones and half zeros

print("Total Train samples are %s, fraction of ones is %s  and zeros is %s  " %(Train_samples, np.sum(Targets_Train)/Train_samples, 1-np.sum(Targets_Train)/Train_samples ))
print("\n")
print("Total Validation samples are %s, fraction of ones is %s  and zeros is %s  " %(Valid_samples, np.sum(Targets_Valid)/Valid_samples, 1-np.sum(Targets_Valid)/Valid_samples ))
print("\n")
print("Total Test samples are %s, fraction of ones is %s  and zeros is %s  " %(Test_samples, np.sum(Targets_Test)/Test_samples, 1-np.sum(Targets_Test)/Test_samples ))




Total Train samples are 3579, fraction of ones is 0.49762503492595694  and zeros is 0.502374965074043  


Total Validation samples are 447, fraction of ones is 0.5123042505592841  and zeros is 0.4876957494407159  


Total Test samples are 448, fraction of ones is 0.5066964285714286  and zeros is 0.4933035714285714  
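
The sample counts printed above add up to 3579 + 447 + 448 = 4474 rows. The sketch below reruns only the size arithmetic for that total, so the rounding is easy to follow; `split_sizes` is a hypothetical helper, not part of the original code.

```python
import numpy as np

def split_sizes(n_samples, train_frac=0.80, valid_frac=0.10):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    n_train = int(train_frac * n_samples)   # int() truncates toward zero
    n_valid = int(valid_frac * n_samples)
    n_test = n_samples - n_train - n_valid
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(4474)
print(n_train, n_valid, n_test)  # 3579 447 448

# The three slices cover every row exactly once
rows = np.arange(4474)
train_rows = rows[:n_train]
valid_rows = rows[n_train:n_train + n_valid]
test_rows = rows[n_train + n_valid:]
```

Because the two `int()` calls round down, the test set absorbs the leftover row, which is why it holds 448 samples instead of 447.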


# Save the data sets in *.npz format

In [6]:
np.savez('APP_data_Train', inputs=Inputs_Train, targets=Targets_Train)
np.savez('APP_data_Valid', inputs=Inputs_Valid, targets=Targets_Valid)
np.savez('APP_data_Test', inputs=Inputs_Test, targets=Targets_Test)
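
For reference, a file written this way can be read back with `np.load`, using the same keyword names (`inputs`, `targets`) as keys. The sketch below is a round trip on made-up arrays in a temporary folder; the file name `demo_Train` and the casts to float and integer types are my own assumptions about what a later training step might want.

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for one of the data sets
demo_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
demo_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_Train')   # hypothetical file name
    # np.savez appends the .npz extension when it is missing
    np.savez(path, inputs=demo_inputs, targets=demo_targets)

    # The keyword names used in np.savez become the keys of the archive
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape)   # (2, 2)
print(loaded_targets)        # [1 0]
```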