-
Notifications
You must be signed in to change notification settings - Fork 2
/
Home_Mortgage_Part-2.Rmd
257 lines (196 loc) · 8.12 KB
/
Home_Mortgage_Part-2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
---
title: "Home Mortgage Disclosure Act Data"
output: html_notebook
---
Description:
Cross-section data on the Home Mortgage Disclosure Act (HMDA).
A data frame containing 2,380 observations on 14 variables.
```{r}
library(AER)
data(HMDA)
dataset<- data.frame(HMDA)
names(dataset)
dim(dataset)
str(dataset)
summary(dataset)
head(dataset)
sapply(dataset,function(x) sum(is.na(x)))
```
As we can see data is clean, there are no missing values. Categorical values are already defined and correctly labeled.
pirat,hirat,lvrat,phist are left skewed.
QQ-plot and density plots for payment to income ratio
```{r}
qqnorm(dataset$pirat)
qqline(dataset$pirat,col='blue')
plot(density(dataset$pirat),main='Pirat')
```
The plot looks right skewed with few outliers.
QQ-plot and density plots housing expense to income ratio
```{r}
qqnorm(dataset$hirat)
qqline(dataset$hirat,col='blue')
plot(density(dataset$hirat),main='hirat')
```
The plot looks right skewed with few outliers.
QQ-plot and dnsity plots Loan to value ratio
```{r}
qqnorm(dataset$lvrat)
qqline(dataset$lvrat,col='blue')
plot(density(dataset$lvrat),main='lvrat')
```
QQ-plot and density plots unemployment
```{r}
qqnorm(dataset$unemp)
qqline(dataset$unemp,col='blue')
plot(density(dataset$unemp),main='Unemp')
```
Box-Plots
```{r}
library(ggplot2)
p10 <- ggplot(dataset, aes(x = deny, y = unemp)) +
geom_boxplot()
p10
p11 <- ggplot(dataset, aes(x = deny, y = pirat)) +
geom_boxplot()
p11
```
From the various QQ-Plots and Box-Plots we can conclude outliers are present.
Model fitting:
We split the data into two chunks: training and testing set. The training set will be used to fit our model which we will be testing over the testing set.
```{r}
dt = sort(sample(nrow(dataset), nrow(dataset)*.8))
train<-dataset[dt,]
test<-dataset[-dt,]
```
Now, let's fit the model.
```{r}
model1 <- glm(deny~.,family=binomial(link='logit'),data=train)
summary(model1)
```
By using function summary() we obtain the results of our model.
Interpreting the results of our logistic regression model:
Now we can analyze the fitting and interpret what the model is telling us.
First of all, we can see that hirat, mhist3,mhist4,unemp and condominyes are not statistically significant. Whereas pirat, phistyes, insuranceyes, afamyes are statistically significant variables based on the p-values and AIC is 1024.9.
Now we can run the anova() function on the model to analyze the table of deviance
```{r}
anova(model1, test="Chisq")
```
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Again, adding lvrat, chist,phist,afam and hschool significantly reduces the residual deviance. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Ultimately what you would like to see is a significant drop in deviance and the AIC.
```{r}
model2 <- glm(deny~pirat+lvrat+chist+phist+selfemp+insurance+afam+single+hschool,family=binomial(link='logit'),data=train)
summary(model2)
```
In model two hirat,unemp,mhist and condomin are removed. AIC is 1016.6.
```{r}
anova(model2, test="Chisq")
```
```{r}
model3 <- glm(deny~pirat+lvrat+chist+phist+insurance+afam+single,family=binomial(link='logit'),data=train)
summary(model3)
```
```{r}
anova(model3, test="Chisq")
```
Even we can see the p-value associated with selfemp and unemp is not significant as its large but removing the element from model increases the AIC value. So model3 is not a good model.
From here we conclude that model2 is the best. Now we will use forward selection to verify our model.
```{r}
fit1<- glm(deny~pirat,family=binomial(link='logit'),data=train)
summary(fit1)
fit2<-glm(deny~pirat+hirat,family=binomial(link='logit'),data=train)
summary(fit2)
fit3<-glm(deny~pirat+lvrat,family=binomial(link='logit'),data=train)
summary(fit3)
fit4<-glm(deny~pirat+lvrat+chist,family=binomial(link='logit'),data=train)
summary(fit4)
fit5<-glm(deny~pirat+lvrat+chist+mhist,family=binomial(link='logit'),data=train)
summary(fit5)
fit6<-glm(deny~pirat+lvrat+chist+phist,family=binomial(link='logit'),data=train)
summary(fit6)
fit7<-glm(deny~pirat+lvrat+chist+phist+unemp,family=binomial(link='logit'),data=train)
summary(fit7)
fit8<-glm(deny~pirat+lvrat+chist+phist+selfemp,family=binomial(link='logit'),data=train)
summary(fit8)
fit9<-glm(deny~pirat+lvrat+chist+phist+insurance,family=binomial(link='logit'),data=train)
summary(fit9)
anova(fit9, test="Chisq")
fit10<-glm(deny~pirat+lvrat+chist+phist+insurance+condomin,family=binomial(link='logit'),data=train)
summary(fit10)
fit11<-glm(deny~pirat+lvrat+chist+phist+insurance+afam,family=binomial(link='logit'),data=train)
summary(fit11)
fit12<-glm(deny~pirat+lvrat+chist+phist+insurance+afam+single,family=binomial(link='logit'),data=train)
summary(fit12)
fit13<-glm(deny~pirat+lvrat+chist+insurance+afam+single+hschool,family=binomial(link='logit'),data=train)
summary(fit13)
```
fit2 discard as AIC increases and p-value is too large
fit3 is good AIC reduces
fit4
fit5 even though the p-values are too large the AIC of the model decreases**
fit6 is good
fit7 although the AIC remains same we can see thr p-value associated with unemp is large so fit7 is not a good model
fit8 AIC remains same, the p-value is greater than 0.05 so fit8 is discarded
fit9 Even though the AIC increases, but the deviance decreases significantly and the assosiated p-value is significant we will keep fit9
fit10 discard based on p-values
fit11 is good reduces AIC a lot
fit12 is good
fit13 discard
So fit12 is the best model. Comparing it with our previous model3
```{r}
model3 <- glm(deny~pirat+lvrat+chist+phist+insurance+afam+single,family=binomial(link='logit'),data=train)
summary(model3)
model2 <- glm(deny~pirat+lvrat+chist+phist+selfemp+insurance+afam+single+hschool,family=binomial(link='logit'),data=train)
fit12<-glm(deny~pirat+lvrat+chist+phist+insurance+afam+single,family=binomial(link='logit'),data=train)
summary(fit12)
```
So from the forward model selection method we can conclude Model3 is the best.
So our model is :
model3 <- glm(deny~pirat+lvrat+chist+phist+insurance+afam+single,family=binomial(link='logit'),data=train)
```{r}
fm1 <- lm(I(as.numeric(deny) - 1) ~ pirat+lvrat+chist+phist+insurance+afam+single, data = dataset)
summary(fm1)
fm2 <- lm(I(as.numeric(deny) - 1) ~ pirat+lvrat+chist+insurance+afam+single+hschool, data = dataset)
summary(fm2)
fm3 <- lm(I(as.numeric(deny) - 1) ~., data = dataset)
summary(fm3)
```
```{r}
plot(model3)
```
Partial F-Test
```{r}
full_model<-glm(deny~.,family=binomial(link='logit'),data=train)
model3 <- glm(deny~pirat+lvrat+chist+phist+insurance+afam+single,family=binomial(link='logit'),data=train)
anova(full_model,model3)
```
Assessing the predictive ability of the model
```{r}
train$model3 <- predict(model2, train, type="response")
head(train)
tail(train)
library(gmodels)
library(ggplot2)
library (Hmisc)
library (caTools)
library (ROCR)
colAUC(train$model3,train$deny, plotROC=TRUE)
predict1 <- ifelse(train$model3>0.95, 1, 0)
tab1 <- table(predicted = predict1, actual = train$deny)
tab1
accuracy1<-(tab1[1,1]+tab1[2,2])/(tab1[1,1]+tab1[2,2]+tab1[1,2]+tab1[2,1])
recall1<-(tab1[2,2])/(tab1[1,2]+tab1[2,2])
precision1<-(tab1[2,2])/(tab1[2,2]+tab1[2,1])
print(c("Accuracy:",accuracy1))
print(c("Precision:",precision1))
print(c("Recall:",recall1))
test$model3 <- predict(model3, test, type='response')
colAUC(test$model3,test$deny, plotROC=TRUE)
predict2 <- ifelse(test$model3>0.95, 1,0)
tab2 <- table(predicted = predict2, actual = test$deny)
tab2
accuracy<-(tab2[1,1]+tab2[2,2])/(tab2[1,1]+tab2[2,2]+tab2[1,2]+tab2[2,1])
recall<-(tab2[2,2])/(tab2[1,2]+tab2[2,2])
precision<-(tab2[2,2])/(tab2[2,2]+tab2[2,1])
print(c("Accuracy:",accuracy))
print(c("Precision:",precision))
print(c("Recall:",recall))
```