# Analyses in R Final Project

### Load packages

In [11]:
library("dplyr")
library("ggplot2")
library("car")
library("rcompanion")
library("IDPmisc")
library("caret")
library("gvlma")
library("predictmeans")
library("e1071")
library("lmtest")
library("lattice")

In [12]:
apps = read.csv('../Data/BALANCEDDummyDropOL.csv')

In [13]:
head(apps)

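| | X | totalIncome | famSize | ageYrs | yrsEmpl | UNEMPLOYED | ApprStatus | ownsCarR | ownsRealtyR | eduLvlR | incomeTypeR | housingTypeR | famStatusR | genderR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 0 | 238500 | 2 | 47 | 13.5991841 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 1 | 135000 | 3 | 41 | 4.799551 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| 3 | 2 | 157500 | 2 | 40 | 12.5067592 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 4 | 3 | 225000 | 2 | 24 | 0.2573633 | 1 | 1 | 0 | 1 | 2 | 1 | 1 | 0 | 0 |
| 5 | 4 | 67500 | 2 | 37 | 3.2635852 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 0 |
| 6 | 5 | 180000 | 4 | 42 | 14.8449318 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |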


In [15]:
# subset data to keep continuous variables and approval status

keeps <- c("totalIncome", "famSize", "ageYrs", "yrsEmpl", "ApprStatus")
apps2 <- apps[keeps]

In [16]:
head(apps2)

| | totalIncome | famSize | ageYrs | yrsEmpl | ApprStatus |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 1 | 238500 | 2 | 47 | 13.5991841 | 1 |
| 2 | 135000 | 3 | 41 | 4.799551 | 0 |
| 3 | 157500 | 2 | 40 | 12.5067592 | 0 |
| 4 | 225000 | 2 | 24 | 0.2573633 | 1 |
| 5 | 67500 | 2 | 37 | 3.2635852 | 1 |
| 6 | 180000 | 4 | 42 | 14.8449318 | 1 |


# Stepwise Logistic Regression

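## Create a model that uses approval status as the response variable and the continuous variables as candidate predictors. Note that the cells below fit it with `lm()`, which treats the 0/1 approval status as a numeric outcome (a linear probability model) rather than fitting a logistic regression.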

In [17]:
# get a baseline
FitAll = lm(ApprStatus ~ ., data = apps2)
summary(FitAll)


Call:
lm(formula = ApprStatus ~ ., data = apps2)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5444 -0.5222  0.4657  0.4776  0.5001 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.310e-01  3.227e-02  16.454   <2e-16 ***
totalIncome  1.509e-08  7.518e-08   0.201    0.841    
famSize     -8.963e-03  6.586e-03  -1.361    0.174    
ageYrs       1.094e-04  4.857e-04   0.225    0.822    
yrsEmpl      6.553e-04  1.249e-03   0.525    0.600    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4996 on 8496 degrees of freedom
Multiple R-squared:  0.0002693,	Adjusted R-squared:  -0.0002014 
F-statistic: 0.5721 on 4 and 8496 DF,  p-value: 0.6829


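### All predictor p-values are > 0.05 (only the intercept is significant), and the overall F-test p-value is 0.6829, so none of the predictors is statistically significant.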
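
Because `ApprStatus` is a 0/1 outcome, a logistic regression fit with `glm()` and `family = binomial` is the usual alternative to `lm()`. The following is only a minimal sketch on simulated stand-in data: the data frame `sim` and its columns `income_k`, `famSize`, and `approved` are assumptions made for illustration, not the `apps2` data, so the numbers it prints say nothing about the real applications.

```r
# Simulated stand-in data (an assumption, NOT the apps2 data set)
set.seed(42)
n <- 500
sim <- data.frame(
  income_k = rnorm(n, mean = 150, sd = 40),     # income in thousands
  famSize  = sample(1:5, n, replace = TRUE)
)
# Binary outcome generated independently of the predictors
sim$approved <- rbinom(n, size = 1, prob = 0.5)

# family = binomial makes glm() fit a logistic regression
logit_fit <- glm(approved ~ income_k + famSize, data = sim, family = binomial)
summary(logit_fit)

# type = "response" returns fitted probabilities, which lie between 0 and 1
p_hat <- predict(logit_fit, type = "response")
range(p_hat)
```

A `glm` object can be passed to `step()` in the same way as an `lm` object, so the selection steps used in this notebook would carry over.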

## Backward Elimination - start with all variables in the model and remove them one at a time

In [19]:
step(FitAll, direction = 'backward')

Start:  AIC=-11793.74
ApprStatus ~ totalIncome + famSize + ageYrs + yrsEmpl

              Df Sum of Sq    RSS    AIC
- totalIncome  1   0.01005 2120.6 -11796
- ageYrs       1   0.01267 2120.6 -11796
- yrsEmpl      1   0.06871 2120.6 -11796
- famSize      1   0.46223 2121.0 -11794
<none>                     2120.5 -11794

Step:  AIC=-11795.7
ApprStatus ~ famSize + ageYrs + yrsEmpl

          Df Sum of Sq    RSS    AIC
- ageYrs   1   0.01083 2120.6 -11798
- yrsEmpl  1   0.07539 2120.6 -11797
- famSize  1   0.46188 2121.0 -11796
<none>                 2120.6 -11796

Step:  AIC=-11797.65
ApprStatus ~ famSize + yrsEmpl

          Df Sum of Sq    RSS    AIC
- yrsEmpl  1   0.07129 2120.6 -11799
<none>                 2120.6 -11798
- famSize  1   0.51847 2121.1 -11798

Step:  AIC=-11799.37
ApprStatus ~ famSize

          Df Sum of Sq    RSS    AIC
- famSize  1   0.47902 2121.1 -11799
<none>                 2120.6 -11799

Step:  AIC=-11799.45
ApprStatus ~ 1




Call:
lm(formula = ApprStatus ~ 1, data = apps2)

Coefficients:
(Intercept)  
     0.5221  


In [20]:
lm(formula = ApprStatus ~ totalIncome + ageYrs + yrsEmpl + famSize, data = apps2)


Call:
lm(formula = ApprStatus ~ totalIncome + ageYrs + yrsEmpl + famSize, 
    data = apps2)

Coefficients:
(Intercept)  totalIncome       ageYrs      yrsEmpl      famSize  
  5.310e-01    1.509e-08    1.094e-04    6.553e-04   -8.963e-03  


# Backward Elimination Observations:
* The AIC decreased at every step as variables were eliminated (from -11793.74 with all four predictors to -11799.45 with none), so backward elimination dropped income, age, years employed, and family size in turn and ended at the intercept-only model. None of the four variables was kept as an explanatory variable for CC approval.
* For reference, the coefficients of the full four-predictor model (refit above) are very small. Total income, age, and years employed have positive signs and family size has a negative sign, but none of these coefficients is statistically distinguishable from zero, so the signs should not be read as real effects on CC approval.
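
To read off which model `step()` actually selected, it can help to store its return value, which is the final fitted model, and inspect that object directly. This is a small sketch on simulated data: `sim`, `x1`, `x2`, `x3`, and `y` are hypothetical names, and the toy response is built so that only `x1` matters.

```r
# Simulated stand-in data (an assumption, NOT the apps2 data set)
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Only x1 truly drives the response in this toy example
sim$y <- 2 * sim$x1 + rnorm(n)

full_fit <- lm(y ~ ., data = sim)

# trace = 0 suppresses the step-by-step tables;
# the return value is the final fitted model
selected <- step(full_fit, direction = "backward", trace = 0)

formula(selected)         # predictors that survived elimination
extractAIC(selected)[2]   # AIC of the selected model, on step()'s scale
extractAIC(full_fit)[2]   # AIC of the starting model, for comparison
```

The model printed at the very end of a `step()` run is this same returned object, so its formula is the one to interpret.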

In [21]:
# Is the full four-predictor model any good? Refit it and inspect the summary
fitsome = lm(ApprStatus ~ totalIncome + ageYrs + yrsEmpl + famSize, data = apps2)
summary(fitsome)


Call:
lm(formula = ApprStatus ~ totalIncome + ageYrs + yrsEmpl + famSize, 
    data = apps2)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5444 -0.5222  0.4657  0.4776  0.5001 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.310e-01  3.227e-02  16.454   <2e-16 ***
totalIncome  1.509e-08  7.518e-08   0.201    0.841    
ageYrs       1.094e-04  4.857e-04   0.225    0.822    
yrsEmpl      6.553e-04  1.249e-03   0.525    0.600    
famSize     -8.963e-03  6.586e-03  -1.361    0.174    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4996 on 8496 degrees of freedom
Multiple R-squared:  0.0002693,	Adjusted R-squared:  -0.0002014 
F-statistic: 0.5721 on 4 and 8496 DF,  p-value: 0.6829


## The multiple R-squared of 0.0002693 means that the model explains about 0.03% of the variation in the CC approval status variable; the other 99.97% is not explained by these predictors. The adjusted R-squared is slightly negative (-0.0002014), which says the model does no better than the intercept alone. The F-test p-value of 0.6829 reinforces that we would be best served not using these predictors at all.
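
Both statistics can be reproduced from their definitions, which also shows why the adjusted value can dip below zero. With R-squared = 0.0002693, n = 8501 observations, and p = 4 predictors, the adjustment gives 1 - (1 - 0.0002693) * 8500 / 8496, which is approximately -0.0002014 and matches the summary output. The sketch below checks the same formulas against `summary()` on simulated data; `sim`, `x1`, `x2`, and `y` are hypothetical names used only for illustration.

```r
# Simulated stand-in data (an assumption, NOT the apps2 data set)
set.seed(7)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 0.5 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)

rss <- sum(residuals(fit)^2)           # residual sum of squares
tss <- sum((sim$y - mean(sim$y))^2)    # total sum of squares
p   <- length(coef(fit)) - 1           # number of predictors (intercept excluded)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hand-computed values next to the ones reported by summary()
c(r2 = r2, adj_r2 = adj_r2)
c(r2 = summary(fit)$r.squared, adj_r2 = summary(fit)$adj.r.squared)
```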

## Forward Selection - start with no predictors and add them one at a time until there's no appreciable improvement of the model

In [22]:
# create the intercept-only starting model (no predictors)
fitstart = lm(ApprStatus ~ 1, data = apps2)
summary(fitstart)


Call:
lm(formula = ApprStatus ~ 1, data = apps2)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5221 -0.5221  0.4779  0.4779  0.4779 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.522056   0.005418   96.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4995 on 8500 degrees of freedom


### The intercept of 0.52 is the average approval status of applicants in our dataset, i.e. about 52% of the rows have ApprStatus = 1.
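
The equality between the intercept of an intercept-only model and the sample mean holds for any response, and for a 0/1 variable the mean is simply the share of 1s. A quick check on a simulated 0/1 vector (the vector `y` here is an assumption, not the real `ApprStatus` column):

```r
# Simulated 0/1 outcome (an assumption, NOT the real ApprStatus column)
set.seed(3)
y <- rbinom(1000, size = 1, prob = 0.52)

# Intercept-only model: no predictors on the right-hand side
null_fit <- lm(y ~ 1)

unname(coef(null_fit))   # intercept of the intercept-only model
mean(y)                  # sample mean, i.e. the share of 1s
```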

In [23]:
# Begin the forward selection
step(fitstart, direction = 'forward', scope = (formula(FitAll)))

Start:  AIC=-11799.45
ApprStatus ~ 1

              Df Sum of Sq    RSS    AIC
<none>                     2121.1 -11799
+ famSize      1   0.47902 2120.6 -11799
+ ageYrs       1   0.05712 2121.1 -11798
+ yrsEmpl      1   0.03185 2121.1 -11798
+ totalIncome  1   0.00900 2121.1 -11798



Call:
lm(formula = ApprStatus ~ 1, data = apps2)

Coefficients:
(Intercept)  
     0.5221  


In [24]:
# forward selection stopped at the intercept-only model; refit the full model for comparison:
fitsome2 = lm(ApprStatus ~ famSize + ageYrs + yrsEmpl + totalIncome, data = apps2)
summary(fitsome2)


Call:
lm(formula = ApprStatus ~ famSize + ageYrs + yrsEmpl + totalIncome, 
    data = apps2)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5444 -0.5222  0.4657  0.4776  0.5001 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.310e-01  3.227e-02  16.454   <2e-16 ***
famSize     -8.963e-03  6.586e-03  -1.361    0.174    
ageYrs       1.094e-04  4.857e-04   0.225    0.822    
yrsEmpl      6.553e-04  1.249e-03   0.525    0.600    
totalIncome  1.509e-08  7.518e-08   0.201    0.841    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4996 on 8496 degrees of freedom
Multiple R-squared:  0.0002693,	Adjusted R-squared:  -0.0002014 
F-statistic: 0.5721 on 4 and 8496 DF,  p-value: 0.6829


### Only one iteration was performed because adding any single variable failed to lower the AIC, so forward selection stopped at the intercept-only model. The result is the same as with backward elimination: neither approach retains any of the 4 variables.
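
The AIC that `step()` prints for an `lm` fit is n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. For the intercept-only model, 8501 * log(2121.1 / 8501) + 2 is approximately -11799.5, in line with the reported -11799.45 given that the printed RSS is rounded. An extra predictor lowers the AIC only if it shrinks the RSS by enough to offset the penalty of 2. The sketch below reproduces the formula on simulated data and compares it with `extractAIC()`; the helper `step_aic` and the data frame `sim` are hypothetical names introduced for illustration.

```r
# Simulated stand-in data (an assumption, NOT the apps2 data set)
set.seed(11)
n <- 400
sim <- data.frame(x = rnorm(n))
sim$y <- rnorm(n)                    # response generated independently of x

null_fit <- lm(y ~ 1, data = sim)    # intercept only
x_fit    <- lm(y ~ x, data = sim)    # intercept plus one predictor

# Hypothetical helper: AIC on the scale step() uses for lm fits
step_aic <- function(fit) {
  rss   <- sum(residuals(fit)^2)
  n_obs <- length(residuals(fit))
  n_obs * log(rss / n_obs) + 2 * length(coef(fit))
}

# Hand-computed values next to the ones from extractAIC()
c(null = step_aic(null_fit), with_x = step_aic(x_fit))
c(null = extractAIC(null_fit)[2], with_x = extractAIC(x_fit)[2])
```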

# I suspect that a hybrid stepwise approach will yield the same result as forward selection and backward elimination. Let's try it for fun.

In [25]:
# Hybrid stepwise regression - start with no predictors (similar to forward)
step(fitstart, direction="both", scope=formula(FitAll))

Start:  AIC=-11799.45
ApprStatus ~ 1

              Df Sum of Sq    RSS    AIC
<none>                     2121.1 -11799
+ famSize      1   0.47902 2120.6 -11799
+ ageYrs       1   0.05712 2121.1 -11798
+ yrsEmpl      1   0.03185 2121.1 -11798
+ totalIncome  1   0.00900 2121.1 -11798



Call:
lm(formula = ApprStatus ~ 1, data = apps2)

Coefficients:
(Intercept)  
     0.5221  


## Suspicions were confirmed - the hybrid approach also stops at the intercept-only model, so none of the 4 predictor variables improves our ability to predict CC approval.