## SMAVA

Test exercise - Question 2 - Data investigation

Reduce the features used and try other models on the whole dataset or on per-bank subsets.

### Overview

#### Acceptance

First, we find that a per-bank acceptance classifier can reach 64% accuracy using a radial SVM.
A logistic model reaches only 60%. (Both figures are cross-validated within the train set.)

Note that this is with a reduced feature set that is invariant across all banks, i.e. the model lets each bank decide from the same customer data. There is no conditioning on x6 or x7.
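
As a rough illustration of what "cross-validated within the train set" means here, the sketch below runs a manual 5-fold cross-validation of a logistic model on the three narrow features. The data frame `toy` and its coefficients are synthetic assumptions, not the real `train` data, and base R `glm` stands in for the caret call used later in this notebook.

```r
set.seed(1)

# Synthetic stand-in for one bank's rows (assumption: not the real data).
n <- 500
toy <- data.frame(x1 = rnorm(n), x10 = rnorm(n), x4 = rnorm(n))
p <- plogis(0.8 * toy$x1 - 0.5 * toy$x4)
toy$accepted <- factor(ifelse(runif(n) < p, "YES", "NO"), levels = c("NO", "YES"))

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k-1 folds and score accuracy on the held-out fold.
acc <- sapply(seq_len(k), function(i) {
  fit <- glm(accepted ~ x1 + x10 + x4, data = toy[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  pred <- ifelse(prob > 0.5, "YES", "NO")
  mean(pred == toy$accepted[fold == i])
})

cv_accuracy <- mean(acc)
cv_accuracy
```

The reported accuracy is the mean over the five held-out folds, so no row is scored by a model that saw it during fitting.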

#### Interest Rate

There are two categorical fields, x6 and x7, which may help decide the interest rate offered by a bank.
x6 is two-valued, x7 has 2000 values!

We should also revisit the correlations. x1, x10 and x4 are good for acceptance, but the discarded x values could correlate with interest rate.
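
One way to revisit the correlations is to rank every numeric feature by its correlation with `interestRate`. This is a minimal sketch on synthetic data: `toy` and the dependence of the rate on `x5` are invented for illustration and say nothing about the real data.

```r
set.seed(2)

# Synthetic stand-in for accepted rows at one bank (assumption).
n <- 300
toy <- data.frame(x1 = rnorm(n), x5 = rnorm(n), x8 = rnorm(n))
toy$interestRate <- 5 + 1.5 * toy$x5 + rnorm(n)

# Correlate each feature with the target and sort by absolute strength.
xs <- setdiff(names(toy), "interestRate")
rate_cor <- sapply(toy[xs], function(v) cor(v, toy$interestRate, use = "complete.obs"))
rate_cor <- rate_cor[order(-abs(rate_cor))]
rate_cor
```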


In [2]:
## weaves
## smava

getwd()

## load in packages
library(tidyverse)
library(rpart.plot)

library(e1071)
library(MASS)
library(mlbench)
library(gbm)
library(kernlab)
library(RSNNS)

## last loaded, first found methods
library(caret)

## implementation
library(doMC)

registerDoMC(cores = detectCores(all.tests = FALSE, logical = TRUE))

options(useFancyQuotes = TRUE)

-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Loading required package: rpart


Attaching package: 'MASS'


The following object is masked from 'package:dplyr':

    select


Loaded gbm 2.1.5


Attaching package: 'kernlab'


The following object is masked from 'package:purrr':

    cross


The following object is masked from 'package:ggplot2':

    alpha


Loading required package: Rcpp

Loading req

In [3]:
load("bak/in/train.rdata")
load("bak/in/test.rdata")

train0 <- data.frame(train) # a local copy.

In [None]:
## Logistic regression for acceptance using Narrow Feature set with conditioning on bank over all banks 
method = 'polr' # fails in jupyter-lab, 59% standalone
method = 'plr' # fails in jupyter-lab
method = 'multinom' # fails in jupyter-lab

method = 'glm' # 0.59

fit1 <- train(
  form = accepted ~ x1 + x10 + x4 + bank,
  data = train,
  trControl = trainControl(method = "cv", number = 5),
  method = method,
  family = "binomial"
)
fit1

In [4]:
## Not very good. Let's try a predictor specific to a bank
## List the banks.
smava1 <- list()
smava1$banks <- unique(train[["bank"]])
length(smava1$banks)
smava1$banks

In [5]:
## Randomly choose a bank to look at
fidx <- function(N0, n=3) sample(1:N0, n, replace = FALSE)

t <- smava1$banks
tag <- t[fidx(length(t))]
tag <- tag[1:1]
tag
colnames(train)

In [6]:
cols <- colnames(train)
cols <- cols[grep('^x', cols)]
cols <- setdiff(cols, "x3")
xcols <- c(cols, "accepted")
xcols

In [7]:
## Specific to a bank
## And impute x2
tag
outcomes <- train[ train$bank %in% tag, "accepted"]

train1 <- train[ train$bank %in% tag, xcols]

train1$x2na <- 0
train1[is.na(train1$x2), "x2na"] <- 1

v0 <- train1[train1$x2 & train1$x2na == 0, "x2"]
v0 <- mean(v0[["x2"]])

train1[is.na(train1$x2), "x2"] <- v0

dim(train1)
head(train1)

# test1[ is.na(test1$x2), "x2" ] <- smava0$x2impute

x1,x2,x4,x5,x6,x7,x8,x9,x10,accepted,x2na
<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>
1.7044089,5.4123865,-1.209511,1.852449,FBW,PBSEO,0.7316942,0.2678961,0.3731034,NO,0
6.2399036,3.9923303,3.601843,-0.6324472,FBW,O07XX,3.2772413,0.1890368,0.3876183,YES,0
0.6404203,0.4344275,-1.211129,-1.3633386,ACS,QTDO0,-2.5410318,0.2141143,0.3487595,NO,0
2.8531209,7.9625419,4.341369,4.72531,FBW,QUTOJ,9.0522801,0.2389058,0.4001331,YES,1
0.7511392,7.9625419,5.90378,7.7164806,ACS,49CT6,13.8247432,0.3056381,0.3548651,YES,1
0.3687868,7.9625419,-3.454459,5.578333,FBW,IEWGU,1.9566097,0.8271761,0.2675379,NO,1
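
The imputation step above can be wrapped in a small helper so that the same training mean is reused on the test set, as the commented-out `test1` line hints. This is a sketch under assumptions: `impute_with_flag`, `train_toy` and `test_toy` are hypothetical names, not objects from this notebook.

```r
# Impute a numeric column with a given value and record where values were missing.
# (Hypothetical helper, not part of the original notebook.)
impute_with_flag <- function(df, col, value) {
  flag <- paste0(col, "na")
  df[[flag]] <- as.numeric(is.na(df[[col]]))
  df[[col]][is.na(df[[col]])] <- value
  df
}

train_toy <- data.frame(x2 = c(1, NA, 3, NA, 8))
test_toy  <- data.frame(x2 = c(NA, 2))

# The mean comes from the training rows only, so nothing leaks from test to train.
x2_mean <- mean(train_toy$x2, na.rm = TRUE)   # 4

train_toy <- impute_with_flag(train_toy, "x2", x2_mean)
test_toy  <- impute_with_flag(test_toy,  "x2", x2_mean)
train_toy
test_toy
```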


In [None]:
## Logistic fit for one bank is barely any better.
## Don't add x7 to the formula: the fit does not converge.

fit1x <- train(
  form = accepted ~ x1 + x10 + x4,
  data = train1,
  trControl = trainControl(method = "cv", number = 5),
  method = "glm",
  family = "binomial"
)
fit1x

### Note

Tried a few formulae; adding x2 gave no improvement.
Limiting the data to just one bank gave no real improvement either.

Try some SVM classifiers - these only converge on small data-sets. There are a number of 

A pair plot of the features shows a couple of centrally clustered X-Y scatters, which suggests that a radial kernel could be a good method.
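
To see why a central cluster favours a radial method, the sketch below compares a plain logistic fit with one that is given the squared distance from the origin as its only feature. This is a base R stand-in for the idea behind a radial kernel, not the kernlab SVM itself, and the data in `toy` is synthetic.

```r
set.seed(3)

# Synthetic data: acceptance is likely near the origin and unlikely far from it.
n <- 400
toy <- data.frame(x1 = rnorm(n), x4 = rnorm(n))
r2 <- toy$x1^2 + toy$x4^2
toy$accepted <- factor(ifelse(runif(n) < plogis(2 * (1.4 - r2)), "YES", "NO"),
                       levels = c("NO", "YES"))

# In-sample accuracy of a fitted glm (training accuracy, not cross-validated).
train_accuracy <- function(fit) {
  pred <- ifelse(predict(fit, type = "response") > 0.5, "YES", "NO")
  mean(pred == toy$accepted)
}

linear_fit <- glm(accepted ~ x1 + x4, data = toy, family = binomial)
radial_fit <- glm(accepted ~ I(x1^2 + x4^2), data = toy, family = binomial)

acc_linear <- train_accuracy(linear_fit)
acc_radial <- train_accuracy(radial_fit)
c(linear = acc_linear, radial = acc_radial)
```

A straight-line boundary cannot separate an inner cluster from its surroundings, whereas a distance-based feature can.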

In [None]:
## Formally constrain to the Narrow Feature set
cols <- c("x1", "x10", "x4", "accepted")
df0 <- train1[,cols]

In [None]:
plot(df0)

In [None]:
## Heuristically try other models
## 0.65 accurate by bank for acceptance with just these three x.

method <- 'svmLinear2'
method <- 'svmPoly'

## all of these give 64%; only the last assignment to method takes effect
method <- 'svmRadial'
method <- 'svmRadialSigma'
method <- 'svmRadialCost'

In [None]:
## By bank - acceptance using Narrow Feature set

fit1x <- train(
  form = accepted ~ .,
  data = df0,
  trControl = trainControl(method = "cv", number = 5),
  method = method    
)
fit1x

In [None]:
trainPred <- predict(fit1x, df0)
# postResample(testPred, testClass)

conf0 <- confusionMatrix(trainPred, df0[["accepted"]], positive = "YES")
conf0

### Recap and Plan: customer to acceptances at each bank

A per-bank predictor trained with 'svmRadialCost' reaches 64% accuracy, cross-validated on the training set,
using just the fields "x1", "x10", "x4": the Narrow Feature Set.

So for each customer, we want to know which banks might accept them.

A customer gets a different set of values from each bank, but not for the features in the Narrow Feature Set.
So the customer record can be duplicated for each bank, and a prediction made for all customers with that bank's predictor.

That would produce the banks.rdata results without any methodological issues, but the result is unordered.
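
The duplication step can be sketched as a cross join of customers and banks, followed by scoring each row with its bank's model. Everything here is an assumption for illustration: the data is synthetic, the customer numbers are made up, and `glm` stands in for the per-bank SVM.

```r
set.seed(4)

# Synthetic training data for three hypothetical banks.
banks <- c("B1", "B2", "B3")
n <- 600
toy <- data.frame(bank = sample(banks, n, replace = TRUE),
                  x1 = rnorm(n), x10 = rnorm(n), x4 = rnorm(n),
                  stringsAsFactors = FALSE)
toy$accepted <- factor(ifelse(runif(n) < plogis(toy$x1), "YES", "NO"),
                       levels = c("NO", "YES"))

# One acceptance model per bank (glm is a stand-in for the SVM).
fits <- lapply(split(toy, toy$bank), function(d)
  glm(accepted ~ x1 + x10 + x4, data = d, family = binomial))

# Two hypothetical customers, described only by the Narrow Feature Set.
customers <- data.frame(customerNumber = c(101, 102),
                        x1 = c(1.5, -1.0), x10 = c(0.2, 0.4), x4 = c(0.3, -0.5))

# merge() with no shared columns is a cross join: one row per customer and bank.
grid <- merge(customers, data.frame(bank = banks, stringsAsFactors = FALSE))

# Score each row with the model belonging to that row's bank.
grid$p_accept <- vapply(seq_len(nrow(grid)), function(i) {
  fit <- fits[[as.character(grid$bank[i])]]
  unname(predict(fit, newdata = grid[i, ], type = "response"))
}, numeric(1))
grid$accepted <- ifelse(grid$p_accept > 0.5, "YES", "NO")
grid[order(grid$customerNumber, grid$bank), ]
```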

### Plan: bank ordering by interest rate

We then have to order the bank choices by the interest rate offered. We should train a second predictor, a regression, on all accepted customers at a bank and the interest rate they receive. We also have to consider whether x6 and x7 are correlated with interest rate.
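
A minimal sketch of the ordering step, assuming the acceptance decisions are already available: fit one rate regression per bank, predict a rate for the customer at each accepting bank, and sort. The data, the per-bank price levels in `base_rate`, and the choice to sort ascending (lowest rate first) are all assumptions.

```r
set.seed(5)

# Synthetic accepted rows for three hypothetical banks.
banks <- c("B1", "B2", "B3")
n <- 300
toy <- data.frame(bank = sample(banks, n, replace = TRUE),
                  x1 = rnorm(n), x10 = rnorm(n),
                  stringsAsFactors = FALSE)
base_rate <- c(B1 = 4, B2 = 6, B3 = 8)   # invented per-bank price level
toy$interestRate <- unname(base_rate[toy$bank]) + 0.5 * toy$x1 + rnorm(n, sd = 0.3)

# One interest-rate regression per bank.
rate_fits <- lapply(split(toy, toy$bank), function(d)
  lm(interestRate ~ x1 + x10, data = d))

# One hypothetical customer and a pretend result of the acceptance step.
customer <- data.frame(x1 = 0.2, x10 = -0.1)
offers <- data.frame(bank = banks, accepted = c("YES", "NO", "YES"),
                     stringsAsFactors = FALSE)

# Predict the rate at each bank, keep accepting banks, order by rate.
offers$rate <- vapply(offers$bank, function(b)
  unname(predict(rate_fits[[b]], newdata = customer)), numeric(1), USE.NAMES = FALSE)
ranked <- offers[offers$accepted == "YES", ]
ranked <- ranked[order(ranked$rate), ]
ranked
```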

In [None]:
## Visual - check the feature relationships - customer is constant across bank. Interest rate varies.
cols <- unique(c(cols, "x1", "x4", "x10"))
df2 <- train[order(train$customerNumber, train$bank), c("customerNumber", "accepted", "bank", c(cols, "x6", "x7"), "x2", "interestRate")]
head(df2, n=50)

In [None]:
## Logistic regression using Narrow Feature set, pooled over all banks (no bank term in the formula).

fit1 <- train(
  form = accepted ~ x1 + x10 + x4,
  data = train,
  trControl = trainControl(method = "cv", number = 5),
  method = "glm",
  family = "binomial"
)
fit1

In [None]:
## An interest rate predictor would be like this. 
df3 <- df2[df2$accepted == "YES", c("bank", "interestRate", "customerNumber", "x7", "x6", cols)]
df3[order(-df3$interestRate, df3$bank),]

## Built for each bank it would take the x values, returning interestRate for all 

The customer numbers appear at all banks, so this could be a weighted regression from the x values to interestRate,
with a two-level tree on x6 and x7.
There are 30 banks, so a per-bank approach needs 30 predictors. Try a global regression predictor using x6, x7 and the x values.

In [8]:
# Look at interest rates for one bank.
df4 <- train[train$bank == tag & train$accepted == "YES", setdiff(colnames(train), c("customerNumber", "bank", "accepted", "x3", "x6", "x7"))]

x2bar <- mean(df4$x2, trim=0, na.rm=TRUE)
x2bar
df4[is.na(df4$x2), "x2"] <- x2bar

In [None]:
plot(df4)

In [None]:
## Overall look at the data for this bank
summary(df4)

In [None]:
## Nothing new on correlations

c0 <- cor(df4)
corrplot::corrplot(c0, method="number", order="hclust")

In [None]:
## Look for a regression and check Rsquared

regressControl  <- trainControl(method="repeatedcv",
                    number = 4,
                    repeats = 5
                    ) 

## An R^2 of only 0.3 is not good enough; adding x6 does not improve it.
regress <- train(interestRate ~ .,
           data = df4,
           method  = "lm",
           trControl = regressControl)
regress

In [None]:
## Let's overfit with a multi-layer perceptron and see if I can use x6 and x7.
## The regression model is too inaccurate. Try a simple multi-layer perceptron, almost a look-up table.

fControl  <- trainControl(method="repeatedcv",
                    number = 4,
                    repeats = 5
                    )
method = 'mlp'
method = 'mlpWeightDecay'

mi <- getModelInfo(model = method, regex = FALSE)[[1]]
p0 <- data.frame(mi$parameters)
p0

In [None]:
data.frame(interaction.depth = c(1, 5, 9))

In [10]:
## "x6", "x7"
df4 <- train[train$bank == tag & train$accepted == "YES", setdiff(colnames(train), c("customerNumber", "bank", "accepted", "x3"))]

x2bar <- mean(df4$x2, trim=0, na.rm=TRUE)
x2bar
df4[is.na(df4$x2), "x2"] <- x2bar
head(df4)

x1,x2,x4,x5,x6,x7,x8,x9,x10,interestRate
<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
6.2399036,3.99233,3.601843,-0.6324472,FBW,O07XX,3.277241,0.1890368,0.3876183,12.215344
2.8531209,7.655432,4.341369,4.72531,FBW,QUTOJ,9.05228,0.2389058,0.4001331,3.34864
0.7511392,7.655432,5.90378,7.7164806,ACS,49CT6,13.824743,0.3056381,0.3548651,2.666739
2.7489449,4.250198,3.325267,1.9186531,FBW,GPX9G,5.926024,0.8271125,0.5347677,3.336138
6.9003717,4.110479,-5.17311,0.8183836,FBW,WXATN,-4.403885,0.3888524,0.4672193,14.509473
0.4255057,14.477147,1.60709,-3.3508678,ACS,4AME2,-1.415343,0.5023046,0.3325895,2.348572


In [None]:
## Try a simple neural network. `size` sets the number of hidden units, not layers; with 11 you can get R^2 to 0.11, which is awful.

fit0 <- train(interestRate ~ x1 + x10 + x7 + x6,
           data = df4,
           method  = method,
           trControl = fControl, tuneGrid = data.frame(size=c(7)))
fit0

In [14]:
head(df4)

x1,x2,x4,x5,x6,x7,x8,x9,x10,interestRate
<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
6.2399036,3.99233,3.601843,-0.6324472,FBW,O07XX,3.277241,0.1890368,0.3876183,12.215344
2.8531209,7.655432,4.341369,4.72531,FBW,QUTOJ,9.05228,0.2389058,0.4001331,3.34864
0.7511392,7.655432,5.90378,7.7164806,ACS,49CT6,13.824743,0.3056381,0.3548651,2.666739
2.7489449,4.250198,3.325267,1.9186531,FBW,GPX9G,5.926024,0.8271125,0.5347677,3.336138
6.9003717,4.110479,-5.17311,0.8183836,FBW,WXATN,-4.403885,0.3888524,0.4672193,14.509473
0.4255057,14.477147,1.60709,-3.3508678,ACS,4AME2,-1.415343,0.5023046,0.3325895,2.348572


## Try an SVM regression model

This is developed in smava03.ipynb.

The final configuration is 60% accurate, by bank using svmRadial

In [None]:
# just use train1 no preProcess, just these features.

fit0 <- train(interestRate ~ x1 + x10 + x4,
              data = train1,
              method = 'svmRadial',
              preProcess = c("zv"),
              # tuneGrid = fGrid,
              trControl = fControl)
fit0

## Note

There's a lot of variation in the interest rate within a bank. It may be that one of the factor fields is a product code.
The interest rate might be multi-modal with respect to the codes in x6 (or x7).

So I should check the pair-plot with interestRate.
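
Besides a pair plot, a quick numeric check for this kind of multi-modality is to summarise `interestRate` separately for each level of the factor. The sketch below uses synthetic data in which the two x6 levels are given different rate levels on purpose; the real data may show no such split.

```r
set.seed(6)

# Synthetic rows: the rate level depends on x6 by construction (assumption).
toy <- data.frame(x6 = rep(c("ACS", "FBW"), each = 50), stringsAsFactors = FALSE)
toy$interestRate <- ifelse(toy$x6 == "ACS", 3, 12) + rnorm(100)

# Per-level summaries: well-separated groups mean the pooled rate is multi-modal.
by_x6 <- tapply(toy$interestRate, toy$x6, summary)
medians <- tapply(toy$interestRate, toy$x6, median)
by_x6
medians
```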

In [None]:
## check x6 and x7 - chi-sq by eye
## No clear classification on interestRate

## Note
## One day, I'll learn dplyr
tag <- "x6"
tag1 <- "ACS"

tag <- "x7"
tags <- unique(train[[tag]])
length(tags)
tag1<- tags[2]

cols <- c("x10", "bank", "x5", tag, "interestRate")
df3 <- train[!is.na(train$interestRate), cols]
df3 <- df3[ order(df3$bank, df3[[tag]]), cols]
# lapply(df3[ df3$bank == "B1",], summary)

lapply(df3[ df3[[tag]] == tag1 & df3$bank == "B1",], summary)
lapply(df3[ df3[[tag]] != tag1 & df3$bank == "B1",], summary)

## x7 product codes

Is the x7 a product field that is common to all banks?

It looks like bank B9 has never offered a particular x7
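
One way to check which banks have never offered a given x7 is to cross-tabulate bank against x7 and keep the empty cells. This is a sketch on a tiny invented data frame: the two x7 codes appear elsewhere in this notebook, but the counts and bank assignments here are made up.

```r
# Toy rows (assumption): B9 appears only with code 69TO8.
toy <- data.frame(
  bank = c("B1", "B1", "B9", "B9", "B2", "B2"),
  x7   = c("69TO8", "157H7", "69TO8", "69TO8", "157H7", "69TO8"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, flatten to long form, and keep combinations with zero rows.
counts <- as.data.frame(table(bank = toy$bank, x7 = toy$x7), stringsAsFactors = FALSE)
never_offered <- counts[counts$Freq == 0, c("bank", "x7")]
never_offered
```

With 2000 x7 levels the full table is large and sparse, so filtering on `Freq == 0` for a handful of common codes is likely more readable than printing the whole table.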

In [None]:
x7s <- c("69TO8", "157H7")
x7s0 <- unique(train[["x7"]])
length(x7s0)
df0 <- data.frame(with(train, table(x7, x6)))
df0[order(-df0$Freq),]
df0 <- data.frame(with(train, table(x7, bank)))
df0[order(-df0$Freq),]

In [None]:
summary(train$x6)
summary(train$x7)
with(train, table(x6, x7))

In [None]:
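## No obvious splits?
cp0 <- 0.005
method0 <- "class" # classification tree, overridden below
method0 <- "anova" # regression tree, since interestRate is numeric
x7s
tag <- "WXATN"
tag <- x7s[1]
df0 <- train[train$x7 == tag, ]
tag
## x7 is constant within df0, so only bank and x6 can produce splits
tree <- rpart(interestRate ~ bank + x7 + x6, data=df0, cp=cp0, method = method0)
                         # a larger cp (e.g. .02) gives a smaller tree for demo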

rpart.plot(tree)

## x6 FBW and ACS

In [None]:
with(warpbreaks, table(wool, tension))

In [None]:
with(train, table(bank, x7))

In [None]:
## Interesting products
## Trying to see if any relationship between x7 and interestRate. Very sparse data, two codes have more than 20 per bank.
## Possibly a standard product?
x7s <- c("69TO8", "157H7")

In [None]:
df0 <- train[!is.na(train$interestRate) & train$x7 %in% x7s, c("bank", "interestRate", "x7")]

f <- function(x) list(summary(x))

## Interest-rate summary per bank
by_bank <- df0 %>% group_by(bank) %>% summarise(
  rate_summary = f(interestRate)
)
by_bank

## Interest-rate summary per x7 code
by_x7 <- df0 %>% group_by(x7) %>% summarise(
  rate_summary = f(interestRate)
)
by_x7


In [None]:
summary(1:10)