 # Insurance prediction
 
Exploratory analysis of data from a marketing campaign in the insurance industry.

In [6]:
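### Loading libraries
library(Information)   # create_infotables(), plot_infotables()
library(grid)          # gpar(), used to style grid.table() output
library(gridExtra)     # grid.table()
library(ClustOfVar)    # hclustvar(), cutreevar(), kmeansvar()
library(reshape2)      # melt()
library(plyr)          # join()

# Prefer fixed over scientific notation when printing numbers
options(scipen=10)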


In [2]:
### Loading the data
data(train, package="Information")
data(valid, package="Information")

In [4]:
### Exclude the control group
train <- subset(train, TREATMENT==1)
valid <- subset(valid, TREATMENT==1)

In [5]:
### Ranking variables using penalized Information Value (IV)
IV <-  create_infotables(data=train,
                  valid=valid,
                  y="PURCHASE")

grid.table(head(IV$Summary), rows=NULL)

In [None]:
grid.table(IV$Tables$N_OPEN_REV_ACTS, rows=NULL)

In [None]:
plot_infotables(IV, "N_OPEN_REV_ACTS")

In [None]:
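# `n` was never defined; the top four variables by IV are assumed here
n <- as.character(IV$Summary$Variable[1:4])
for (i in seq_along(n)){
    print(plot_infotables(IV, n[i]))  # print explicitly: plots are not auto-drawn inside a loop
}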

In [None]:
# Plot the top nine variables together
plot_infotables(IV, IV$Summary$Variable[1:9])

In [None]:
# IV without a validation set (no cross-validation penalty)
IV <- create_infotables(data=train, y="PURCHASE")

In [None]:
# Penalized IV again, this time with 20 bins per variable
IV <- create_infotables(data=train, valid=valid, y="PURCHASE", bins=20)

In [None]:
grid.table(IV$Tables$N_OPEN_REV_ACTS,
           gp=gpar(fontsize=12),
           rows=NULL)

In [None]:
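# NIV needs both treatment and control rows, so reload the unfiltered data
# into a separate environment instead of reusing the TREATMENT==1 subsets
full <- new.env()
data(train, valid, package="Information", envir=full)
NIV <- create_infotables(data=full$train,
                         valid=full$valid,
                         y="PURCHASE",
                         trt="TREATMENT")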

In [None]:
grid.table(head(NIV$Summary),
           rows=NULL,
           gp=gpar(fontsize=12))

## Combining IV Analysis With Variable Clustering

Variable clustering divides a set of numeric variables into mutually exclusive clusters. The algorithm attempts to generate clusters such that:

- the correlations between variables assigned to the same cluster are maximized;
- the correlations between variables in different clusters are minimized.

Using this algorithm we can replace a large set of variables with a single member of each cluster, often with little loss of information. The question is which member to choose from a given cluster. One option is to choose the variable that has the highest multiple correlation with the variables within its cluster and the lowest correlation with variables outside the cluster. A more meaningful choice for predictive modeling is to choose the variable that has the highest information value.
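The selection rule can be sketched on toy data. This is an illustration only, not the ClustOfVar algorithm: base R's `hclust()` on the dissimilarity `1 - r^2` is swapped in as a stand-in for variable clustering, and the variable names and IV values are made up.

```r
# Two latent factors, each observed through two noisy copies
set.seed(1)
n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- data.frame(
  a1 = z1 + rnorm(n, sd = 0.1),
  a2 = z1 + rnorm(n, sd = 0.1),
  b1 = z2 + rnorm(n, sd = 0.1),
  b2 = z2 + rnorm(n, sd = 0.1)
)

# Stand-in for variable clustering: hierarchical clustering on 1 - r^2,
# so strongly correlated variables end up close together
hc      <- hclust(as.dist(1 - cor(X)^2), method = "average")
cluster <- cutree(hc, k = 2)

# Hypothetical information values, one per variable
iv <- c(a1 = 0.30, a2 = 0.12, b1 = 0.05, b2 = 0.20)

tab <- data.frame(Variable = names(cluster),
                  Cluster  = unname(cluster),
                  IV       = unname(iv[names(cluster)]),
                  stringsAsFactors = FALSE)

# Rank variables within each cluster by descending IV and keep the best one
tab$Rank <- ave(-tab$IV, tab$Cluster, FUN = rank)
selected <- tab[tab$Rank == 1, c("Variable", "Cluster", "IV")]
print(selected)
```

The `ave(..., FUN = rank)` idiom is the same one used in the cells that follow to rank the real variables by `AdjIV` within each cluster.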

In [None]:
tree <- hclustvar(train[,!(names(train) %in% c("PURCHASE", "TREATMENT"))])
nvars <- length(tree[tree$height<0.7])
part_init<-cutreevar(tree,nvars)$cluster
kmeans<-kmeansvar(X.quanti=train[,!(names(train) %in% c("PURCHASE", "TREATMENT"))],init=part_init)

In [None]:
clusters <- cbind.data.frame(melt(kmeans$cluster), row.names(melt(kmeans$cluster)))
names(clusters) <- c("Cluster", "Variable")
clusters <- join(clusters, IV$Summary, by="Variable", type="left")
clusters <- clusters[order(clusters$Cluster),]
clusters$Rank <- ave(-clusters$AdjIV, clusters$Cluster, FUN=rank)

In [None]:
selected_members <- subset(clusters, Rank==1)
selected_members$Rank <- NULL

Using variable clustering in combination with IV cuts the number of variables from 68 to 21:

In [None]:
nrow(selected_members)
nrow(clusters)

In [None]:
grid.table(head(selected_members),
           rows=NULL,
           gp=gpar(fontsize=12))

### Summary

The purpose of exploratory analysis and variable screening is to get to know the data and assess “univariate” predictive strength, before we deploy more sophisticated variable selection approaches.

The weight of evidence (WOE) and information value (IV) provide a great framework for performing exploratory analysis and variable screening prior to building a binary classifier (e.g., logistic regression). The framework handles missing values and character variables seamlessly, and the output is easy to interpret.
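To make the two quantities concrete, here is a minimal base R sketch that computes WOE and IV by hand for one pre-binned predictor. The toy data, the bin labels, and the sign convention (log of the event share over the non-event share) are assumptions for illustration; the package additionally handles binning and the cross-validation penalty.

```r
# Toy data: one binned predictor and a binary outcome (1 = purchase)
toy <- data.frame(
  bin = rep(c("low", "mid", "high"), times = c(40, 40, 20)),
  y   = c(rep(1, 4),  rep(0, 36),   # low:  4 events, 36 non-events
          rep(1, 10), rep(0, 30),   # mid: 10 events, 30 non-events
          rep(1, 10), rep(0, 10)),  # high: 10 events, 10 non-events
  stringsAsFactors = FALSE
)

# Counts of events and non-events per bin
events     <- tapply(toy$y == 1, toy$bin, sum)
non_events <- tapply(toy$y == 0, toy$bin, sum)

# Share of all events / all non-events that falls in each bin
p_event     <- events / sum(events)
p_non_event <- non_events / sum(non_events)

# WOE per bin, and IV as the weighted sum of WOE over bins
woe <- log(p_event / p_non_event)
iv  <- sum((p_event - p_non_event) * woe)

print(round(woe, 3))
print(round(iv, 3))
```

A positive WOE marks a bin where purchasers are over-represented, and a negative WOE marks one where they are under-represented. Every term in the IV sum is non-negative, so a larger IV means the bins separate the two groups more strongly.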

The information value originates from information theory and is closely related to the concept of mutual information.
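The link to information theory can be written out explicitly. With $p_i$ and $q_i$ denoting the share of events and of non-events that fall in bin $i$ (notation introduced here for illustration), the standard definition of IV expands as follows:

$$
\mathrm{IV} \;=\; \sum_i (p_i - q_i)\,\ln\frac{p_i}{q_i}
\;=\; \sum_i p_i \ln\frac{p_i}{q_i} \;+\; \sum_i q_i \ln\frac{q_i}{p_i}
\;=\; D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,p)
$$

In other words, the IV is the symmetrized Kullback-Leibler divergence between the bin distribution of events and that of non-events.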

The Information package is specifically written to perform this type of analysis using parallel processing. It also supports exploratory analysis for uplift models, a growing area within marketing analytics. The package is not designed to transform data into WOE vectors for Naive Bayes models, although this feature could be added later.