March 1, 2018 
<br>Data Society
<br>Interview Presentation
<br> Alison Peebles Madigan



# **Random Forest Exercise**
***

## Data

Begin by reading in the data with `read.csv`, as we have before.

This is a bank marketing dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing):
> "The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y). Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed."



 <div class="panel-group" id="accordion-1">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-1">
        For Variable Descriptions click here</a>
      </h4>
    </div>
    <div id="collapse1-1" class="panel-collapse collapse">
      <div class="panel-body">Input variables:
<br> **# bank client data:**
<br> 1 - age (numeric)
<br>2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
<br>3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
<br>4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
<br>5 - default: has credit in default? (categorical: 'no','yes','unknown')
<br>6 - housing: has housing loan? (categorical: 'no','yes','unknown')
<br>7 - loan: has personal loan? (categorical: 'no','yes','unknown')
<br>**# related with the last contact of the current campaign:**
<br>8 - contact: contact communication type (categorical: 'cellular','telephone') 
<br>9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
<br>10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
<br>11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
<br>**# other attributes:**
<br>12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
<br>13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
<br>14 - previous: number of contacts performed before this campaign and for this client (numeric)
<br>15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
<br>**# social and economic context attributes**
<br>16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
<br>17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
<br>18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
<br>19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
<br>20 - nr.employed: number of employees - quarterly indicator (numeric)

<br> **Output variable (desired target):**
<br>21 - y - has the client subscribed a term deposit? (binary: 'yes','no')</div>
    </div>
  </div>
</div>

In [None]:
# Local copy of the data; the UCI bank CSV files are semicolon-delimited
test <- read.csv(file = "Documents/bank-full.csv", sep = ";", header = TRUE)

In [None]:
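# stringsAsFactors = TRUE keeps y and the other text columns as factors
# (since R 4.0 the default is FALSE, which would leave them as character)
df_rf <- read.csv(file = 'https://raw.githubusercontent.com/aapeebles/DataSocietyTraining/master/bank-additional-full.csv',
                  sep = ";", header = TRUE, stringsAsFactors = TRUE)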

### Start with EDA & Summary Data - Examine the Dataset 

Once again we can use `names()`, `str()`, and `summary()` to examine the entire dataset.
<br>We can also use `table()` to explore specific categorical dimensions.

In [None]:
names(df_rf)

In [None]:
str(df_rf)

In [None]:
summary(df_rf)

#### Target Variable of Classification Model
Let's examine the distribution of our target variable and make sure R recognizes it as a factor, so that `randomForest` builds a classification model rather than a regression.
<br> `as.factor()` converts the target variable to a factor.

<br>Do we need to do this for this dataset? 
<br>When might we need to? What would the variable look like?
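As a minimal sketch of when the conversion matters: suppose the target had been stored as 0/1 numbers instead of text labels. The vector `y_raw` below is hypothetical toy data, not the bank dataset.

```r
# Hypothetical toy target coded as 0/1 numbers (not the bank data)
y_raw <- c(0, 1, 0, 0, 1)
str(y_raw)   # numeric: randomForest would assume regression for this target

# Convert to a factor with readable labels so classification is assumed
y_fac <- factor(y_raw, levels = c(0, 1), labels = c("no", "yes"))
str(y_fac)
table(y_fac)
```

If `str()` already reports the target as a factor with the expected levels, no conversion is needed.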

In [None]:
table(df_rf$y)
table(df_rf$y)*100/nrow(df_rf)

### Pause for the Gini Index

Who remembers what numbers we would use to calculate the default probability of misclassifying a datapoint if we only used the target proportions?

In [None]:
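# The numerators are the class counts from table(df_rf$y); 41188 is nrow(df_rf)
counts <- table(df_rf$y)
1 - (counts[["no"]]/41188)**2 - (counts[["yes"]]/41188)**2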
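The same calculation can be wrapped in a small helper that works for any number of classes. This is a sketch: `gini_impurity` and `toy_counts` are names introduced here for illustration, and the counts are made up rather than taken from the bank data.

```r
# Gini impurity of a vector of class counts: 1 - sum(p_k^2),
# where p_k is the proportion of observations in class k
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical 90/10 split between two classes
toy_counts <- c(no = 90, yes = 10)
gini_impurity(toy_counts)   # 1 - 0.9^2 - 0.1^2 = 0.18
```

A pure node (all one class) gives 0, and an even two-class split gives the two-class maximum of 0.5.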

***

## Random Forest Libraries and Functions

In [None]:
library(randomForest)

In [None]:
help(randomForest)

In [None]:
set.seed(300)

In [None]:
rf_mod <- randomForest(y ~ .,
                       data = df_rf, importance = TRUE, mtry = 4, ntree = 400, replace = TRUE)

In [None]:
plot(rf_mod)

In [None]:
# Variable Importance Plot
varImpPlot(rf_mod,
           sort = T,
           main="Variable Importance",
           n.var=17)

In [None]:
rf_mod

In [None]:
summary(rf_mod)

## Wait, something seems off here...
<br> Why is duration so strong?

In [None]:
library(plyr)
library(ggplot2)
ddply(df_rf,~y,summarise,mean=mean(duration),sd=sd(duration))
qplot(y, duration, data=df_rf, geom=c("boxplot"), 
   fill=y, main="Duration by Answer",
   xlab="Subscribed (y)", ylab="Last contact duration (seconds)")

Okay, so we **can't** use this variable to help predict: the call duration is only known once the call is over, and by then the outcome is known too. 
Let's redo the random forest without that variable. 


I could start typing the whole formula again.... OR:

In [None]:
varNames <- names(df_rf)
# Exclude the response variable and duration
varNames <- varNames[!varNames %in% c("y","duration")]

# Add a + sign between explanatory variables
varNames1 <- paste(varNames, collapse = "+")

# Add response variable and convert to a formula object
rf.form <- as.formula(paste("y", varNames1, sep = " ~ "))
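Base R also has `reformulate()`, which builds the same kind of formula without pasting strings together. The sketch below uses hypothetical column names (`toy_names`) in place of `names(df_rf)`.

```r
# Hypothetical column names standing in for names(df_rf)
toy_names <- c("age", "job", "duration", "y")

# Drop the response and the variable we want to exclude
predictors <- setdiff(toy_names, c("y", "duration"))

# Build the formula y ~ age + job directly from the character vector
toy_form <- reformulate(predictors, response = "y")
toy_form
```

Either approach produces a formula object that can be passed as the first argument to a modelling function.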

And how would I pass that formula to `randomForest()`?

In [None]:
rf_mod2 <- randomForest(rf.form,
                        data = df_rf, importance = TRUE, mtry = 4, ntree = 400, replace = TRUE)

In [None]:
# Variable Importance Plot
varImpPlot(rf_mod2,
           sort = T,
           main="Variable Importance",
           n.var=17)

In [None]:
plot(rf_mod2)

In [None]:
print(rf_mod2)

In [None]:
df_rf$predicted.response <- predict(rf_mod2,df_rf)

In [None]:
library(e1071)
library(caret)

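# Create Confusion Matrix
# Note: these predictions are for the same rows the model was trained on,
# so the metrics are optimistic; print(rf_mod2) reports the out-of-bag error.
confusionMatrix(data=df_rf$predicted.response,
                reference=rf_mod2$y,
                positive='yes')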

## Further Exploration - Test and Train

In [None]:
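# Assign each row to group 1 (train, about 60%) or group 2 (test, about 40%)
sample.ind <- sample(2,
                     nrow(df_rf),
                     replace = TRUE,
                     prob = c(0.6,0.4))
df_rf.train <- df_rf[sample.ind==1,]
df_rf.test <- df_rf[sample.ind==2,]

# Check that the class proportions are similar in both sets
table(df_rf.train$y)/nrow(df_rf.train)
table(df_rf.test$y)/nrow(df_rf.test)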


In [None]:
rf_mod3 <- randomForest(rf.form,
                        data = df_rf.train, importance = TRUE, mtry = 4, ntree = 400, replace = TRUE)


In [None]:
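df_rf.test$predicted.response <- predict(rf_mod3,df_rf.test)
# Compare test-set predictions with the test-set labels
# (rf_mod3$y holds the training labels, which would not line up)
confusionMatrix(data=df_rf.test$predicted.response,
                reference=df_rf.test$y,
                positive='yes')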
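When one class is much rarer than the other, accuracy alone can hide poor performance on the rare class. As a rough sketch, the main rates that `confusionMatrix()` reports can be computed by hand from a plain `table()`. The `truth` and `pred` vectors below are hypothetical toy values, not model output.

```r
# Hypothetical true labels and predictions (toy data, not from rf_mod3)
truth <- factor(c("no", "no", "no", "no", "yes", "yes", "no", "yes"),
                levels = c("no", "yes"))
pred  <- factor(c("no", "no", "yes", "no", "yes", "no", "no", "yes"),
                levels = c("no", "yes"))

# Rows are predictions, columns are the actual classes
cm <- table(Predicted = pred, Actual = truth)
cm

accuracy    <- sum(diag(cm)) / sum(cm)              # share of all rows predicted correctly
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])  # share of actual 'yes' that were caught
specificity <- cm["no", "no"] / sum(cm[, "no"])     # share of actual 'no' that were caught
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Here 'yes' is treated as the positive class, matching `positive='yes'` in the calls above.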

### References:

- http://dni-institute.in/blogs/random-forest-using-r-step-by-step-tutorial/
- http://cogns.northwestern.edu/cbmg/LiawAndWiener2002.pdf
- http://trevorstephens.com/post/73770963794/titanic-getting-started-with-r-part-5-random