W7.Rmd

Weekly Quiz 7
========================================================

### Question 2
Define a data set according to the code

```
set.seed(53535)
xValues = seq(0,2*pi,length=100)
yValues = rnorm(100) + sin(xValues)
```

Fit linear models with the yValues as outcome and a natural cubic spline model for the xValues as the covariates. Fit the model with degrees of freedom equal to each integer between 1 and 10. For each model, calculate the root mean squared error (RMSE) between the fitted values and the observed yValues (the rmse() function in R may help). At what number of degrees of freedom is there the most dramatic drop in the RMSE? Why does this make sense?

1. The RMSE drops between df=3 and df=4. This is because the sinusoidal model has one inflection points - like a cubic function.
2. The RMSE drops between df=2 and df=3. This is because the sinusoidal model has one inflection points - like a cubic function.
3. The RMSE drops between df=2 and df=3. This is because df = 4 and use too many parameters.
4. The RMSE drops between df=1 and df=2. This is because the sinusoidal model has a quadratic shape - like a cubic function. ✔✘


✘✔

```{r}
#install.packages("devtools")
library(devtools) 
#install_github("medley","mewo2") 
library(medley)
library(splines)
set.seed(53535)
xValues = seq(0,2*pi,length=100)
yValues = rnorm(100) + sin(xValues)

for (iDF in 1:9){
  ns1 <- ns(xValues, df=iDF)
  lm1 <- lm(yValues ~ ns1)
  rmse1 <- rmse(predict(lm1,data=yValues),yValues)
  
  ns2 <- ns(xValues, df=iDF+1)
  lm2 <- lm(yValues ~ ns2)
  rmse2 <- rmse(predict(lm2,data=yValues),yValues)
  
  print(c(iDF,":",rmse2-rmse1))
}

```
### Question 3
Load the simpleboot package (you may have to install it first) with the following commands:
```
library(simpleboot) 
data(airquality)
attach(airquality)
```
Calculate the 75th percentile of the Wind variable. Then set the seed to 883833 and use the one.boot function with 1,000 replications to calculate the bootstrap standard error of the 75th percentile of the Wind variable.

1. The 75th percentile is: 9.7 The bootstrap s.d. is: 0.5965868
2. The 75th percentile is: 11.5 The bootstrap s.d. is: 0.5965868 ✔
3. The 75th percentile is: 7.4 The bootstrap s.d. is: 0.3463986
4. The 75th percentile is: 11.5 The bootstrap s.d. is: 0.5125929

```{r}
#install.packages("simpleboot")
library(simpleboot) 
data(airquality)
attach(airquality)
print("75th percentile:")
quantile(Wind, 0.75)

set.seed(883833)
stderr <- function(x) quantile(x, 0.75)
result <- one.boot(Wind, stderr, R=1000)
print("The bootstrap s.d. is:")
sd(result$t[,1])

```      
       

### Question 4
Load the Cars93 data:
```
data(Cars93,package="MASS")
```
Set the seed to 7363 and calculate three trees using the tree() function on bootstrapped samples (samples with replacement). Each tree should treat the DriveTrain variable as the outcome and Price and Type as covariates. Predict the value of the following data frame
```
newdata = data.frame(Type = "Large",Price = 20)
```
with each tree and report the majority vote winner along with the percentage of votes among the three trees for that value.

1. FWD Percent of Votes = 80%
2. Rear Percent of Votes = 66%
3. Front Percent of Votes = 80%
4. Front Percent of Votes = 100%  ✔

✘✔
```{r}
library(tree)
data(Cars93,package="MASS")
set.seed(7363)
tree1 <- tree(DriveTrain ~ Type + Price, 
              data=Cars93[sample(1:dim(Cars93)[1], replace=TRUE),])
tree2 <- tree(DriveTrain ~ Type + Price, 
              data=Cars93[sample(1:dim(Cars93)[1], replace=TRUE),])
tree3 <- tree(DriveTrain ~ Type + Price, 
              data=Cars93[sample(1:dim(Cars93)[1], replace=TRUE),])

newdata = data.frame(Type = "Large",Price = 20)

predict(tree1, newdata)
predict(tree2, newdata)
predict(tree3, newdata)

```

### Question 5
Load the vowel.train and vowel.test data sets:
```
library(ElemStatLearn)
data(vowel.train)
data(vowel.test) 
```
Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit:

1. a random forest predictor relating the factor variable y to the remaining variables and 
2. an svm predictor using the svm() function in the e1071 package.

What are the error rates for the two approaches on the test data set?
What is the error rate when the two methods agree on a prediction?

1. Test error random forest = 0.4199134 Test error svm = 0.3874459 Test error both agree = 0.6607143 ✘
2. Test error random forest = 2.123354 Test error svm = 2.038695 Test error both agree = 1.995975
3. Test error random forest = 0.4199134 Test error svm = 0.3874459 Test error both agree = 0.2823129 ✔
4. Test error random forest = 2.123354 Test error svm = 2.038695 Test error both agree = 2.123354

```{r}
library(ElemStatLearn)
library(randomForest)
library(e1071)
library(medley)
data(vowel.train)
data(vowel.test)
vowel.test$y <- as.factor(vowel.test$y)
vowel.train$y <- as.factor(vowel.train$y)
set.seed(33833)

testError <- function(pred, data, outcome){
  sum(predict(pred, data)!= outcome)/length(outcome)
}

#Random forest
vowelTree <- randomForest(y ~ . , data=vowel.train, prox = TRUE)
print("Test error random forest:")
vowelTreeError <- testError(vowelTree, vowel.test, vowel.test$y)
print(vowelTreeError)

#SVN
vowelSVM <- svm(y ~ . , data=vowel.train)
print("Test error svm:")
vowelSVMError <- testError(vowelSVM, vowel.test, vowel.test$y )
print(vowelSVMError)


print("What is the error rate when the two methods agree on a prediction?")
sum((predict(vowelTree, vowel.test)==predict(vowelSVM, vowel.test)) & (predict(vowelTree, vowel.test)!=vowel.test$y) & (predict(vowelSVM, vowel.test)!=vowel.test$y))/sum(predict(vowelTree, vowel.test)==predict(vowelSVM, vowel.test))


                                                                                                                                                        ```