# **Introduction**

The purpose of this project is to predict customer salaries from other features in a given dataset. The data was provided by the Australia and New Zealand Banking Group (ANZ) and is a synthesized representation of actual transaction information. To achieve this we will use two different models: the first is ***multiple regression*** and the other is a ***regression tree model***, which is a decision tree that produces continuous numerical values.

## Packages

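To complete this analysis we will mainly rely on four packages, namely:
1. tidyverse for preprocessing and manipulation
2. broom for summarizing model output
3. rpart for the regression tree
4. vtreat for cross validation

We also use readxl to load the data and lubridate to handle dates and times.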



In [None]:
#load the tidyverse, broom, readxl, lubridate, rpart and vtreat packages
library(tidyverse)
library(broom)
library(readxl)
library(lubridate)
library(rpart)
library(vtreat)

In [None]:
anz_xlsx <- "../input/anz-transaction/anz.xlsx"
ANZ <- read_excel(anz_xlsx)

#observe file structure
glimpse(ANZ)
#print the first 5 rows
head(ANZ, 5)
#count the rows: 12043 rows
nrow(ANZ)


### **Data Preprocessing**
To perform this analysis we first need to make our data tidy and search for outliers. This process can be divided into the following three steps:
1. Search for missing values.
2. Remove outliers
3. Extract relevant information.

In [None]:
#search for missing values.

apply(ANZ, 2, function(x) sum(is.na(x)| x == ''))
# check the number of unique values for each column
apply(ANZ, 2, function(x) length(unique(x)))



Some values are missing, especially for merchant information; this is probably because not all transactions were purchase transactions. The counts also reveal that there are 100 unique accounts and that each transaction is unique. All values are in the same currency.
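
The counting step above can be illustrated on a small made-up data frame. The `toy` data and its column names are invented for illustration and are not taken from the ANZ file. This sketch uses `sapply`, which checks each column in its own type, as one possible alternative to `apply`:

```r
# Made-up data: one NA and one empty string in `merchant`, one NA in `amount`
toy <- data.frame(
  merchant = c("a", NA, "", "b"),
  amount   = c(1, 2, NA, 4),
  stringsAsFactors = FALSE
)

# Count entries that are either NA or an empty string, column by column
missing_counts <- sapply(toy, function(x) sum(is.na(x) | x == ""))
missing_counts
# merchant has 2 missing entries, amount has 1
```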



In [None]:
# confirm the one-to-one link between account and customer_id
ANZ %>% select(account,customer_id) %>%
 unique() %>%
 nrow()
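
One way to read the check above: if the number of distinct (account, customer_id) pairs equals both the number of distinct accounts and the number of distinct customers, the link is one-to-one. A minimal base R sketch on made-up IDs (the `toy` data frame is hypothetical):

```r
# Made-up IDs: three accounts, each owned by exactly one customer
toy <- data.frame(
  account     = c("ACC-1", "ACC-1", "ACC-2", "ACC-3"),
  customer_id = c("CUS-1", "CUS-1", "CUS-2", "CUS-3"),
  stringsAsFactors = FALSE
)

n_pairs     <- nrow(unique(toy[, c("account", "customer_id")]))
n_accounts  <- length(unique(toy$account))
n_customers <- length(unique(toy$customer_id))

# TRUE only when every account maps to one customer and vice versa
one_to_one <- n_pairs == n_accounts && n_pairs == n_customers
one_to_one
```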

### **Extracting Relevant Information**

This data set contains both location and time data, and valuable insights might be gained from them. To extract this information we can use the lubridate package for the date-time data and the ***separate()*** function for the location data.

In [None]:


#separate long and lat
anz <- ANZ %>% separate("long_lat", c("c_long", "c_lat"),sep=' ')
#turn c_long and c_lat into numeric values and re-add them to the data frame
anz$c_long<- as.numeric(anz$c_long)      
anz$c_lat <- as.numeric(anz$c_lat)


In [None]:

#extract datetime information into new columns in the dataset
anz_timely <- anz %>%
#mutate() adds new columns and the time related functions help extract information
                   mutate(datetime = ymd_hms(extraction)) %>%
                   mutate(time_of_day = ifelse(am(datetime), "AM", "PM"),
                          weekday = wday(datetime, label = TRUE),
                          month = month(datetime, label = TRUE),
                          date = ymd(date)) 
          


### Removing outliers

We should comb through the data for plausible outliers and normalize the data. We can also check whether all transactions fall within our date and location range.


In [None]:

# check the range of customer locations
# show the transactions of customers located outside Australia's bounding box
anz %>%
  filter(!(c_long > 113 & c_long < 154 & c_lat > (-44) & c_lat < (-10)))
# keep only the transactions inside the bounding box, dropping the outlier
anz_new <- anz_timely %>%
            filter(c_long > 113 & c_long < 154 & c_lat > (-44) & c_lat < (-10))
 


A single customer is located outside Australia, so we exclude all transactions from this account moving forward, as extreme values of this kind can affect the accuracy of the model.

In [None]:
#check date range
DateRange <- seq(min(anz_new$date), max(anz_new$date), by = 1)
# transactions from 2018-08-16 are missing, but no transaction falls outside the expected date range.
DateRange[!DateRange %in% anz_new$date]
 

      


## Feature Engineering

Next we select or generate candidate variables from our data set; these will be the "features" for our model. We can start by aggregating the total salary and the total spending for each customer. We can also find less obvious correlations with the help of *scatterplots*.


#### **Grouping by Customer ID**

The data has a column that denotes the customer ID, aptly named *customer_id*. We can aggregate the data by customer ID and extract information such as:
* The average (median) balance
* Relative balance (balance compared to the median)
* Total salary
* Total spending
* Spending frequency

The **relative balance** metric is the customer's median balance divided by the median balance across all transactions in the data set.

To isolate salary transactions we can use the ***filter()*** function from the ***dplyr*** package. Alternatively, I suspect that only salary transactions are represented as "credit" in the *movement* column, which would make manipulation easier.

Since we are trying to predict salaries we need to create a column for total salary; we can also make one for the frequency of salaries and one for the average balance of the customer. We can create these features separately, then combine them using the ***inner_join*** function from ***dplyr***.
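
To make the relative balance concrete, here is a minimal sketch on made-up balances for two hypothetical customers, "A" and "B". It assumes the definition used in the code cells of this notebook: the customer's median balance divided by the median balance over all rows.

```r
library(dplyr)

# Made-up balances for two hypothetical customers
toy <- tibble(
  customer_id = c("A", "A", "A", "B", "B", "B"),
  balance     = c(100, 200, 300, 1000, 2000, 3000)
)

# Median over every row: (300 + 1000) / 2 = 650
overall_median <- median(toy$balance)

toy_balance <- toy %>%
  group_by(customer_id) %>%
  summarize(exp_balance = median(balance),
            relative_balance = exp_balance / overall_median)

toy_balance
# A: exp_balance 200, relative balance 200/650 (below the overall median)
# B: exp_balance 2000, relative balance 2000/650 (above the overall median)
```

A relative balance below 1 marks a customer whose typical balance sits under the overall median, and a value above 1 marks one above it.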

In [None]:
#check that all salary transactions are recorded as "credit"
anz_new %>% group_by(txn_description, movement) %>% summarize(n())

In [None]:
#create a new data frame with the median and relative balance of each customer
anz_balance <- anz_new %>%
                mutate(median_balance = median(balance)) %>%
                group_by(customer_id) %>% 
                summarize(exp_balance = median(balance),
                          relative_balance = unique(exp_balance/median_balance),
                          gender = unique(gender), 
                          age = unique(age),
                          longitude = unique(c_long),
                          latitude = unique(c_lat))


  
head(anz_balance)



In [None]:
#create a new dataframe for the total salary and spending for each customer,
anz_aggregate <- anz_new %>%
                   group_by(customer_id, movement) %>% 
                   summarize(total = sum(amount),
                             median = median(amount)) %>%
                   ungroup() %>%
                   pivot_wider(names_from= c("movement"), values_from = c("total", "median")) %>%
                   transmute(customer_id = customer_id,
                             spending = total_debit,
                             salary = total_credit, 
                             median_spending = median_debit)
head(anz_aggregate)


The two previous code cells created columns for total salary, total spending, median account balance and other variables that can be used in our model. Next, we join the tables with the ***inner_join*** function from the ***dplyr*** package, and then look for correlations between the explanatory variables (the inputs to the prediction) and the response variable (the value we predict).

In [None]:
#join the two aggregated data frames
anz_sparkly <- anz_balance %>%
                 inner_join(anz_aggregate, by =  "customer_id")

Additionally, we can add extra columns indicating the frequency of transactions for each customer.

In [None]:
#create new table with columns indicating frequency
anz_txn_vol <- anz_new %>%  
                 group_by(customer_id, txn_description) %>%
                 summarize(txn_volume = n()) %>%
                 spread("txn_description", "txn_volume", fill = 0)

In [None]:
#join all three datasets
anz_tidy <- anz_sparkly %>% 
             inner_join(anz_txn_vol, by =  "customer_id")
             
             
head(anz_tidy)

Now we can visualize the relationship between certain columns and the response variable with the help of scatterplots (for continuous variables) and boxplots (for categorical variables).

### Transaction Against Balance

In the tables we created, we calculated several summary metrics, including the median balance. We can visualize each variable to determine which of these metrics has the highest correlation with salary. We can create a function that automates much of this process.

In [None]:
#create a function that takes the x-axis variable (passed through "...") and plots it against salary
ggcor<-function(...){ggplot(anz_tidy, aes(..., salary, color = gender)) +
                       geom_point()
                       }
                     

In [None]:
#create a plot of salary vs. log(exp_balance), the log of the median balance
ggcor(log(exp_balance))
#correlation between exp_balance and salary
cor(anz_tidy$exp_balance , anz_tidy$salary)

In [None]:

#correlation between log(exp_balance) and salary
cor(log(anz_tidy$exp_balance) , anz_tidy$salary)

* The raw median balance shows little positive correlation with salary. However, if the log of the median balance is taken before plotting, an easy-to-spot pattern emerges; this is because of the nature of the balance distribution.
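
The effect of the log transform on a correlation coefficient can be seen in a small constructed example. The numbers below are made up so that `y` is exactly linear in `log(x)`; they are not derived from the ANZ data.

```r
# Constructed data: x grows exponentially while y grows linearly
y <- 1:20
x <- exp(y)

# Pearson correlation on the raw scale is dragged down by the extreme x values
cor_raw <- cor(x, y)

# After the log transform the relationship is a straight line
cor_log <- cor(log(x), y)

c(raw = cor_raw, log = cor_log)
```

Pearson correlation only measures linear association, so a variable that spans several orders of magnitude can look weakly related until it is put on a log scale.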

#### Correlation With Age

The code cell below shows that there is little correlation between age and salary.

In [None]:
#plot by age
ggcor(age)

ggplot(anz_tidy, aes(age, salary)) + 
  geom_point() +
  xlim(20, 90)

cor(anz_tidy$age , anz_tidy$salary)

#### Correlation between Spending and Salary


In [None]:
#plot spending vs salary
ggcor(spending)

#spending 
cor(anz_tidy$spending, anz_tidy$salary)

In [None]:
#median_spending vs salary
ggcor(median_spending)

In [None]:
#transaction volume vs. salary
ggcor(PAYMENT)
cor(anz_tidy$PAYMENT, anz_tidy$salary)

#### Relationship Between Location and Salary


In [None]:
#shows that there is no strong linear relationship between location and salary
cor(anz_tidy$longitude, anz_tidy$salary)
cor(anz_tidy$latitude, anz_tidy$salary)

#### Relationship Between Gender and Salary

The code cell below shows that men in the distribution tend to have a slightly higher salary than others.


In [None]:
ggplot(anz_tidy, aes(gender, salary)) +
  geom_boxplot(fill = "pink", color = "blue") +
  ggtitle("Salary vs. Gender")

From the analysis above we can now select the features of the model, they are:
* gender
* age
* expected balance
* spending(total amount spent)
* PAYMENT

### **Modelling**

#### Modelling with Cross Validation Plan

Creating a k-fold cross-validation plan is a way to validate the modelling process by testing the model's out-of-sample performance. This can be done with the help of the *kWayCrossValidation* function from the vtreat package. This function splits the data into **K** pairs of training and testing data.
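
To show what such a plan contains, here is a rough base R sketch of the same idea. `make_split_plan` is a hypothetical helper written for illustration and is not the vtreat implementation; it only mimics the list structure with `train` and `app` index vectors that the loop in this notebook relies on.

```r
# Hypothetical helper: assign each of n rows to one of k folds at random
make_split_plan <- function(n, k) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  lapply(seq_len(k), function(i) {
    list(train = which(fold_id != i),  # rows used to fit the model
         app   = which(fold_id == i))  # held-out rows the model is applied to
  })
}

set.seed(1)
plan <- make_split_plan(n = 10, k = 3)

# Each row is held out exactly once across the three folds
held_out <- sort(unlist(lapply(plan, function(p) p$app)))
held_out
```

Because every row appears in exactly one `app` set, each prediction is made by a model that never saw that row during training.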



In [None]:
#create cross validation plan
#note: 2020-10-6 is evaluated as arithmetic, so the seed used here is 2004
set.seed(2020-10-6,  sample.kind="Rounding")
splitplan <- kWayCrossValidation(nrow(anz_tidy), 3)

In [None]:
#train the linear model
#initialize a column to hold the out-of-sample predictions
anz_tidy$pred_lm <- 0

k <- 3 
set.seed(2020-10-6,  sample.kind="Rounding")
for (i in 1:k){
    split <- splitplan[[i]]
    model <- lm(salary ~ factor(gender) + age + log(exp_balance) + spending + PAYMENT, data = anz_tidy[split$train,])
    anz_tidy$pred_lm[split$app] <- predict(model, newdata = anz_tidy[split$app,])
 }



In [None]:
#find the RMSE and R-squared for the cross validated model.
anz_tidy %>%
  summarize(rmse = sqrt(mean((pred_lm - salary)^2)),
            r_squared = 1 - sum((salary - pred_lm)^2) / sum((salary - mean(salary))^2))

In [None]:


# Use nrow to get the number of rows in anz_tidy
N <- nrow(anz_tidy)


# Create the vector of N uniform random variables
gp <- runif(N)

# Use gp to create the training set: anz_train (about 75% of data) and anz_test (about 25% of data)
anz_train <- anz_tidy[gp < 0.75, ]
anz_test <- anz_tidy[gp >= 0.75, ]

#train the model on the training split
salary_mod <- lm(salary ~ factor(gender) + age + log(exp_balance) + spending + PAYMENT, data = anz_train)
#glance() summarizes the fit on the training data; it does not score anz_test
glance(salary_mod)

#the model produced an adj.r.squared value of 0.3414312 (an in-sample statistic)
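
Since *glance()* describes how well a model fits the data it was trained on, scoring a held-out set needs explicit predictions. The sketch below shows one way to do that. It uses the built-in `mtcars` data and a fixed row split purely as stand-ins, so the variable names and numbers are not from the ANZ analysis.

```r
# Stand-in data: first 24 rows for training, last 8 rows for testing
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample RMSE, in the same unit as the response
test_rmse <- sqrt(mean((test$mpg - pred)^2))

# Out-of-sample R-squared, measured against the test-set mean
test_r2 <- 1 - sum((test$mpg - pred)^2) / sum((test$mpg - mean(test$mpg))^2)

c(rmse = test_rmse, r_squared = test_r2)
```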

### Model Results 

On the cross-validation test the model had a ***Root Mean Square Error*** of **5772.106 AUD** and a **Coefficient of Determination** of 0.2598248. The root mean square error is a measure of the model's accuracy and is in the same unit as the *response variable*; it can be thought of as the typical size of the model's prediction error. The coefficient of determination (R-squared) is a measure of how well the model explains the variance in the data; values closer to 1 indicate that the model explains the variability in the data well. The model fitted on the 75% training split had a higher adjusted R-squared value (0.3414312), indicating a better fit. However, that figure is reported by *glance()* for the data the model was trained on, so it is not directly comparable to the cross-validated value.
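
For reference, the two metrics computed in the cross-validation cell can be written as follows, where $y_i$ is the observed salary, $\hat{y}_i$ the predicted salary and $\bar{y}$ the mean salary over the $n$ customers (this notation is chosen here for illustration and does not appear in the code):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

Note that $R^2$ computed this way on out-of-sample predictions can be negative when the model predicts worse than the mean.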

In [None]:
ggplot(anz_tidy, aes(pred_lm, salary)) +
 geom_point(colour = "violetred1") +
 geom_abline(colour = "darkblue") +
 xlab("Prediction") +
 ggtitle("Ground Truth Vs. Prediction")

###  Predicting With Decision Trees

Unlike the linear model, decision trees can capture non-linear relationships. Simply put, this means the model can learn patterns that a linear model cannot. We will construct this model using a grid search. This means we will construct trees with different *hyperparameters* (***presets***) and choose the one that has the lowest error.

To accomplish this we will follow these steps:
1. Split the data into three parts: the training set (70%), the validation set (15%) and the test set (15%).
2. Create a list of possible presets.
3. Create a list of models, one for each preset.
4. Compute the root mean square error of each model against the validation set.
5. Select the model with the least error.
6. Make predictions on the test set.
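
The steps above can be sketched end to end on a small stand-in problem. The example uses the built-in `mtcars` data, a fixed row split and a tiny two-by-two grid, all chosen for illustration only; `rmse` is a local helper defined here rather than a package function.

```r
library(rpart)

# Step 1: stand-in split into training, validation and test rows
train <- mtcars[1:20, ]
valid <- mtcars[21:26, ]
test  <- mtcars[27:32, ]

# Step 2: every combination of the candidate presets
grid <- expand.grid(minsplit = c(4, 8), maxdepth = c(1, 3))

# Local helper for the root mean square error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Step 3: one regression tree per row of the grid
models <- lapply(seq_len(nrow(grid)), function(i) {
  rpart(mpg ~ wt + hp, data = train, method = "anova",
        control = rpart.control(minsplit = grid$minsplit[i],
                                maxdepth = grid$maxdepth[i]))
})

# Step 4: score every tree on the validation set
valid_rmse <- vapply(models, function(m) rmse(valid$mpg, predict(m, valid)),
                     numeric(1))

# Steps 5 and 6: keep the best tree and score it once on the test set
best_tree <- models[[which.min(valid_rmse)]]
test_rmse <- rmse(test$mpg, predict(best_tree, test))
test_rmse
```

Keeping the test set out of the selection step means its RMSE is not biased by the choice of presets.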


In [None]:
#create three random sets from the data
N <- nrow(anz_tidy)
randomizer <- sample(1:3, N, replace = TRUE, prob = c(0.70, 0.15, 0.15))

#Create train set
anz_tree_train <- anz_tidy[randomizer == 1,]

#Create validation set
anz_tree_validt <- anz_tidy[randomizer == 2,]

#Create a test set
anz_tree_test <- anz_tidy[randomizer == 3,]

In [None]:
#Create a list of all possible combinations of presets.
minsplit <- seq(5, 10, 1)
maxdepth <- seq(5, 15, 1)
tree_presets <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)

# Number of potential models in the grid
(num_models <- nrow(tree_presets))



In [None]:
# Initialize empty list for models
anz_tree_models <- list()

#Create a formula for the prediction
fmla <- salary ~ gender + age + log(exp_balance) + spending + PAYMENT

# Write a loop to create a model for each row in "tree_presets"
for (i in 1:num_models) {

    # Get minsplit, maxdepth values at row i
    minsplit <- tree_presets$minsplit[i]
    maxdepth <- tree_presets$maxdepth[i]

    # Train a model and store in the list
    anz_tree_models[[i]] <- rpart(formula = fmla, 
                               data = anz_tree_train, 
                               method = "anova",
                               minsplit = minsplit,
                               maxdepth = maxdepth)
}


In [None]:

#Initialize a numeric vector for the root mean square errors
RMSE <- numeric(num_models)

for (i in 1:num_models) {

    #select the ith model
    model <- anz_tree_models[[i]]

    # Make predictions on the validation set with each model
    pred <- predict(model, anz_tree_validt)

    # Evaluate predictions
    RMSE[i] <- Metrics::rmse(anz_tree_validt$salary, pred)
}



In [None]:
#select the model with the lowest error
 good_tree <- anz_tree_models[[which.min(RMSE)]]

In [None]:
# Test new model 
 
predg <- predict(good_tree, anz_tree_test)
Metrics::rmse(actual = anz_tree_test$salary, predicted = predg)

### Decision Tree Results 

The decision tree produces an RMSE of 4758.24 AUD on its test set, which is of the same order as the linear model's cross-validated RMSE of 5772.106 AUD.

#### Recommendations and Conclusions

Both the linear model and the decision tree produce relatively high error values, and the R-squared value of the linear model indicates that it explains only a small share of the variation in salary.

A more advanced model with more information, such as education and occupation, could produce more accurate predictions.
