forked from jbaquerot/Data-Analysis
-
Notifications
You must be signed in to change notification settings - Fork 0
/
W7.Rmd
167 lines (131 loc) · 5.76 KB
/
W7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
Weekly Quiz 7
========================================================
### Question 2
Define a data set according to the code
```
set.seed(53535)
xValues = seq(0,2*pi,length=100)
yValues = rnorm(100) + sin(xValues)
```
Fit linear models with the yValues as outcome and a natural cubic spline model for the xValues as the covariates. Fit the model with degrees of freedom equal to each integer between 1 and 10. For each model, calculate the root mean squared error (RMSE) between the fitted values and the observed yValues (the rmse() function in R may help). At what number of degrees of freedom is there the most dramatic drop in the RMSE? Why does this make sense?
1. The RMSE drops between df=3 and df=4. This is because the sinusoidal model has one inflection points - like a cubic function.
2. The RMSE drops between df=2 and df=3. This is because the sinusoidal model has one inflection points - like a cubic function.
3. The RMSE drops between df=2 and df=3. This is because df = 4 and use too many parameters.
4. The RMSE drops between df=1 and df=2. This is because the sinusoidal model has a quadratic shape - like a cubic function. ✔✘
✘✔
```{r}
#install.packages("devtools")
library(devtools)
#install_github("medley","mewo2")
library(medley)
library(splines)
set.seed(53535)
xValues = seq(0,2*pi,length=100)
yValues = rnorm(100) + sin(xValues)
for (iDF in 1:9){
ns1 <- ns(xValues, df=iDF)
lm1 <- lm(yValues ~ ns1)
rmse1 <- rmse(predict(lm1,data=yValues),yValues)
ns2 <- ns(xValues, df=iDF+1)
lm2 <- lm(yValues ~ ns2)
rmse2 <- rmse(predict(lm2,data=yValues),yValues)
print(c(iDF,":",rmse2-rmse1))
}
```
### Question 3
Load the simpleboot package (you may have to install it first) with the following commands:
```
library(simpleboot)
data(airquality)
attach(airquality)
```
Calculate the 75th percentile of the Wind variable. Then set the seed to 883833 and use the one.boot function with 1,000 replications to calculate the bootstrap standard error of the 75th percentile of the Wind variable.
1. The 75th percentile is: 9.7 The bootstrap s.d. is: 0.5965868
2. The 75th percentile is: 11.5 The bootstrap s.d. is: 0.5965868 ✔
3. The 75th percentile is: 7.4 The bootstrap s.d. is: 0.3463986
4. The 75th percentile is: 11.5 The bootstrap s.d. is: 0.5125929
```{r}
#install.packages("simpleboot")
library(simpleboot)
data(airquality)
attach(airquality)
print("75th percentile:")
quantile(Wind, 0.75)
set.seed(883833)
stderr <- function(x) quantile(x, 0.75)
result <- one.boot(Wind, stderr, R=1000)
print("The bootstrap s.d. is:")
sd(result$t[,1])
```
### Question 4
Load the Cars93 data:
```
data(Cars93,package="MASS")
```
Set the seed to 7363 and calculate three trees using the tree() function on bootstrapped samples (samples with replacement). Each tree should treat the DriveTrain variable as the outcome and Price and Type as covariates. Predict the value of the following data frame
```
newdata = data.frame(Type = "Large",Price = 20)
```
with each tree and report the majority vote winner along with the percentage of votes among the three trees for that value.
1. FWD Percent of Votes = 80%
2. Rear Percent of Votes = 66%
3. Front Percent of Votes = 80%
4. Front Percent of Votes = 100% ✔
✘✔
```{r}
library(tree)
data(Cars93,package="MASS")
set.seed(7363)
tree1 <- tree(DriveTrain ~ Type + Price,
data=Cars93[sample(1:dim(Cars93)[1], replace=TRUE),])
tree2 <- tree(DriveTrain ~ Type + Price,
data=Cars93[sample(1:dim(Cars93)[1], replace=TRUE),])
tree3 <- tree(DriveTrain ~ Type + Price,
data=Cars93[sample(1:dim(Cars93)[1], replace=TRUE),])
newdata = data.frame(Type = "Large",Price = 20)
predict(tree1, newdata)
predict(tree2, newdata)
predict(tree3, newdata)
```
### Question 5
Load the vowel.train and vowel.test data sets:
```
library(ElemStatLearn)
data(vowel.train)
data(vowel.test)
```
Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit:
1. a random forest predictor relating the factor variable y to the remaining variables and
2. an svm predictor using the svm() function in the e1071 package.
What are the error rates for the two approaches on the test data set?
What is the error rate when the two methods agree on a prediction?
1. Test error random forest = 0.4199134 Test error svm = 0.3874459 Test error both agree = 0.6607143 ✘
2. Test error random forest = 2.123354 Test error svm = 2.038695 Test error both agree = 1.995975
3. Test error random forest = 0.4199134 Test error svm = 0.3874459 Test error both agree = 0.2823129 ✔
4. Test error random forest = 2.123354 Test error svm = 2.038695 Test error both agree = 2.123354
```{r}
library(ElemStatLearn)
library(randomForest)
library(e1071)
library(medley)
data(vowel.train)
data(vowel.test)
vowel.test$y <- as.factor(vowel.test$y)
vowel.train$y <- as.factor(vowel.train$y)
set.seed(33833)
testError <- function(pred, data, outcome){
sum(predict(pred, data)!= outcome)/length(outcome)
}
#Random forest
vowelTree <- randomForest(y ~ . , data=vowel.train, prox = TRUE)
print("Test error random forest:")
vowelTreeError <- testError(vowelTree, vowel.test, vowel.test$y)
print(vowelTreeError)
#SVN
vowelSVM <- svm(y ~ . , data=vowel.train)
print("Test error svm:")
vowelSVMError <- testError(vowelSVM, vowel.test, vowel.test$y )
print(vowelSVMError)
print("What is the error rate when the two methods agree on a prediction?")
sum((predict(vowelTree, vowel.test)==predict(vowelSVM, vowel.test)) & (predict(vowelTree, vowel.test)!=vowel.test$y) & (predict(vowelSVM, vowel.test)!=vowel.test$y))/sum(predict(vowelTree, vowel.test)==predict(vowelSVM, vowel.test))
```