---
title: "STAT2621 main topics"
output: html_notebook
---
## Chapter 1&2 - Introduction (Tut1,2)
- Errors: Type I (alpha) - H0 is true but rejected;
Type II (beta) - H0 is false but not rejected
- Sensitivity = TP/(TP+FN) – true positive rate
- Specificity = TN/(TN+FP) – true negative rate
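A minimal sketch computing both rates from a 2x2 confusion matrix (the counts here are hypothetical, not from the course data):
```{r}
# Hypothetical confusion matrix: rows = truth, columns = prediction
conf <- matrix(c(40, 10,   # TP, FN
                 5, 45),   # FP, TN
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("Pos","Neg"), pred = c("Pos","Neg")))
sensitivity <- conf["Pos","Pos"] / sum(conf["Pos", ]) # TP/(TP+FN)
specificity <- conf["Neg","Neg"] / sum(conf["Neg", ]) # TN/(TN+FP)
c(sensitivity = sensitivity, specificity = specificity)
```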
```{r}
nrow(df)
ncol(df)#number of row and column
colnames(df) #get column names
dimnames(tbl) <- list(believe = c("Yes","No"), ##row
age = c("Age3","Age4","Age5","Age6")) ##column name
#create data
data.frame("believe" = c(rep("Yes",63),rep("No",55)),
"age" = c(rep("Age3",30),rep("Age4",13),rep("Age5",15),
rep("Age6",5),rep("Age3",5),rep("Age4",10),
rep("Age5",12),rep("Age6",28)))
# convert factors to numeric
for(i in 1:9) {
bc[, i] <- as.numeric(as.character(bc[, i]))
}
#missing value
any(is.na(df))
sum(is.na(df))
df1<-na.omit(df) #remove the missing values and form df1
Hitters %>%
is.na() %>%
sum()
Hitters = Hitters %>%
na.omit()
#table of missing value
apply(is.na(df), 2, sum)
# MARGIN=1: the manipulation is performed on rows
# MARGIN=2: the manipulation is performed on columns
# MARGIN=c(1,2): the manipulation is performed on rows and columns
#create groups
df$agegrp <- findInterval(df$Age, c(20,40,60,80))
#number of distinct factors in a column
table(df$LCID)
## Table for a single variable
tab1 <- table(Arthritis$Improved)
prop.table(tab1) #proportions for a single-variable table
count(df, Report.date) # equivalent to as.data.frame(table(df$Report.date))
#data frame with table categorized by two groups
as.data.frame(table(df$Report.date, df$Gender))
#select the column with the most missing values (col_mv from the apply above)
which(col_mv==max(col_mv))
# change the character variable into date variable
df1$Report.date = as.Date(df1$Report.date, format="%d/%m/%Y")
#time specific
df$hour_id<-strptime(df$hour_id,format='%m/%d/%Y %H:%M')
r_d<-floor_date(df$hour_id,'day') # floor_date() is from the lubridate package
table(as.character(r_d)) #number of records per day
#median
col_median=apply(df[5:22],2,function(x) tapply(x,df$LCID,median))#Calculate the median value of each column (variable) per LCID
apply(df[5:22],2,function(x) tapply(x,floor_date(df$hour_id,'day'),median)) #median value per day of df
#aggregate
aggregate(mtcars, by=list(mtcars$cyl), FUN=mean, na.rm=TRUE)
#sort
df_com_case<-tapply(df$NumComplains,df$LCID,sum) #sum of complaints split by LCID
head(sort(df_com_case,decreasing = TRUE),10) #take the top 10
mtcars[order(mtcars$mpg,-mtcars$cyl),] #sort df by mpg ascending and cyl descending
#split the string only get first 5
substring(df$LCID,1,5)
#merge dataframe by
merge(dfA,dfB, by=c("ID","Country"))
#create factor
as.factor(ifelse(insurance_smoker$bmi>30,1,0))#create a variable bmi_factor which equals one if bmi > 30 and zero otherwise
## Check factor levels (categories)
levels(Arthritis$Improved)
## Cross table by two variables
xtabs(~ Treatment +Improved, Arthritis)
table(df_q1$SEX, df_q1$TYPE)
```
- Plots
```{r}
#a boxplot of charges by smoker and sex
p <-ggplot(insurance, aes(x=sex, y=charges, fill=smoker))+ #by smoker and sex
geom_boxplot(outlier.colour="red")+
ggtitle("boxplot of charges by smoker and sex")+ #title
xlab("smoker & sex")+ #x name
ylab("insurance charges")+ #yname
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
ggplot(dat, aes(x = grps, y = x, fill = grps)) +
geom_boxplot()
boxplot(weight~group, data = PlantGrowth)
boxplot(airquality$Ozone,
main = "Mean ozone in parts per billion at Roosevelt Island",
xlab = "Parts Per Billion",
ylab = "Ozone",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
# Box plot with two factor variables
boxplot(Days ~ A * B, data=df1, frame = FALSE,
col = c("#00AFBB", "#E7B800"), ylab="Days")
# a scatter plot between charges and bmi by smoker
p <- ggplot(insurance, aes(x=bmi, y=charges, color=smoker, group=smoker)) # group by smoker "group"
p + geom_point() +
geom_smooth(method=lm, se=FALSE, aes(fill=smoker))+
xlab("bim") #scatter and line
# ....by sex ans bmifactor
p <- ggplot(insurance_smoker, aes(x=bmi, y=charges, color=sex, shape=bmi_factor)) # color distinguish sex, shape distinguish factor
p + geom_point() +
geom_smooth(method=lm, se=FALSE, aes(fill=sex))+ #se = TRUE plot CI
xlab("bmi")+
ylab("charges")
# density plot of charges by bmifactor
ggplot(insurance_smoker, aes(x=charges, color=sex, fill=bmi_factor))+
geom_density(alpha=0.4)
ggplot(grp_date, aes(x=`daily total reported cases`))+
geom_histogram(aes(y=..density..), colour= "black", fill="white",binwidth = 5)+
geom_density(alpha=.2, fill = "#FF6666")+
ggtitle("Density Histogram plot of Daily Total Reported Cases") +
xlab("daily total reported cases") +
ylab("Density")+
geom_vline(aes(xintercept=mean(`daily total reported cases`)),color="blue", linetype="dashed", size=1) +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
#bar plot
barplot(table(df_q1$TYPE)) #need table
#bar plot: cancer distribution by sex and type
counts <- table(df_q1$SEX, df_q1$TYPE)
#side-by-side
barplot(counts, main="Cancer Distribution by SEX and Type",
xlab="Cancer Type", ylab="Number of Patients", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
#stacked
barplot(counts, main="Cancer Distribution by SEX and Type",
xlab="Cancer Type", ylab="Number of Patients", col=c("darkblue","red"),
legend = rownames(counts))
# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
#pie chart
case_clf<-table(df$Case.classification.)
names<-names(case_clf)
names<-names[order(case_clf)]
case_clf <- sort(case_clf)
pct <- round(as.numeric(case_clf)/sum(as.numeric(case_clf))*100)
lbls <- paste(names, pct)
lbls <- paste(lbls,"%",sep="")
pie(case_clf, lbls, col = rainbow(length(case_clf)), main = "Pie Chart of Case Classification")
```
```{r}
#dplyr
library(dplyr)
df %>%
group_by(poison)%>%
summarise(
count_poison = n(),
mean_time = mean(time, na.rm = TRUE),
sd_time = sd(time, na.rm = TRUE)
)
group_by(my_data, supp, dose) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
sd = sd(len, na.rm = TRUE)
)
df %>% filter(Case.classification. == i) %>% count(Report.date)
```
## Chapter 3- Descriptive stats (tut2)
- mean, median, mode
```{r}
names(table(x))[table(x)==max(table(x))] #mode
```
- summary stats
```{r}
summary(df)
library(psych)
describe(df_q1[vars]) # obtain all the summary statistics except for the quantiles
```
- coefficient of variation: $cv = \sigma/\mu$
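A one-line sketch, assuming a numeric vector `x`:
```{r}
sd(x) / mean(x) # sample coefficient of variation
```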
- skewness, kurtosis
```{r}
library(moments)
skewness(x)
kurtosis(x)
```
- testing for normality:
- P-P and Q-Q
- Shapiro Wilk test
- Kolmogorov-Smirnov Test
- Cramér-von Mises test
```{r}
plot((rank(x)-0.5)/length(x), pnorm(x, mean=mean(x), sd=sd(x)), main="P-P plot")
abline(0, 1, col=2) #pp plot
qqnorm(x) #qqplot
abline(mean(x), sd(x), col=2, lwd=2)
shapiro.test(my_data$len)
ks.test(my_data$len,"pnorm",mean=mean(my_data$len),sd=sd(my_data$len))
library(goftest)
cvm.test(my_data$len,"pnorm",mean=mean(my_data$len),sd=sd(my_data$len))
```
- Chisq test for One-way tables
e.g. Do the data support the hypothesis that the sex ratio is 50:50?
```{r}
chisq.test(c(49,51))
```
### Numerical stats for 2+ variables
- Pearson correlation coefficient: values of continuous variable
- Spearman rank correlation: ranks of numerical variables
```{r}
cor(mtcars$wt, mtcars$mpg, method="pearson")
cor(mtcars$wt, mtcars$mpg, method="spearman")
```
- Test whether it is an important factor (correlation is significant)
```{r}
# test the Pearson's correlation of AGE and DAYC
n = 47
rc = cor(df_q1$AGE, df_q1$DAY_C, method='pearson')
Tc = rc * sqrt(n-2) / sqrt(1-rc^2)
pt(Tc, n-2, lower.tail = TRUE) # one-sided p-value; use 2*pt(-abs(Tc), n-2) for a two-sided test
```
#### Association of Nominal Variables:
- Pearson Chisq test: whether X and Y are independent
```{r}
chisq.test(tbl, correct=F) # tbl: a matrix/table of counts
```
- Likelihood ratio test
```{r}
# Log likelihood ratio test
library(DescTools)
observed = c(7, 34)
theoretical = c(0.174, 0.826)
GTest(x = observed,
p = theoretical,
correct = "none")
GTest(tbl) # or pass a matrix/table of counts directly
```
- Fisher's exact test: whether the two variables are independent; non-parametric
```{r}
Input =("
Site Alive Dead
Lower 43 7
Middle 44 6
Upper 49 1")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
fisher.test(Matriz,alternative="two.sided")
```
#### Ordinal
- correlation of scores: Spearman rank correlation
- Test for existence of trends: Mantel-Haenszel (MH) test
```{r}
source("pears.cor.r")
pears.cor(job.satis, c(7.5, 20, 32.5, 60), c(1,2,3,4)) # c() vectors of score values
```
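`pears.cor.r` is a course-provided script; below is a hedged sketch of the MH trend statistic it presumably computes, $M^2 = (n-1)r^2 \sim \chi^2_1$, for a hypothetical 2x4 table with row scores `u` and column scores `v`:
```{r}
# Hypothetical 2x4 table of counts
tbl <- matrix(c(34, 53, 88, 70,
                12, 35, 83, 61), nrow = 2, byrow = TRUE)
u <- c(1, 2)                 # row scores
v <- c(7.5, 20, 32.5, 60)    # column scores
n <- sum(tbl)
# Pearson correlation of the scores, weighted by cell counts
ubar <- sum(u * rowSums(tbl)) / n
vbar <- sum(v * colSums(tbl)) / n
cov_uv <- sum(outer(u - ubar, v - vbar) * tbl) / n
r <- cov_uv / sqrt(sum((u - ubar)^2 * rowSums(tbl)) / n *
                   sum((v - vbar)^2 * colSums(tbl)) / n)
M2 <- (n - 1) * r^2
pchisq(M2, df = 1, lower.tail = FALSE) # p-value for trend
```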
#### Various measures of association
```{r}
library(vcd)
assocstats(heart)
```
## Chapter 4- Compare 2 groups of CONTINUOUS (tut3)
### Testing for 1 group mean
- t-test assumptions: iid, normally distributed, constant variance
- t-test for non-normal data:
- Heavy-tailed: confidence intervals become wider -> loss of power for any hypothesis test
- Skewed distribution: the population variance may be infinite; the sampling distribution of the t-statistic is not symmetric, though its skewness decreases as the sample size increases; the power of any hypothesis test is even lower; use the median as a measure of center
- CLT: n >= 30 is usually enough even for a highly skewed population
- bootstrap: sample size too small for CLT
```{r}
###Bootstrapping
B = 10000
mx = rep(0, B) # initialize a vector to store mean of each bootstrap sample.
sx = rep(0,B) # initialize a vector to store s.d. of each bootstrap sample.
tx = rep(0,B) # initialize a vector to store t-statistic of each bootstrap sample.
n = length(bootsample) # get the sample size.
for(b in 1:B){
# First get a sample of size n with replacement from the original sample.
set.seed(b)
sample_b = sample(bootsample, replace=T)
mx[b] = mean(sample_b)
sx[b] = sd(sample_b)
tx[b] = sqrt(n) * (mx[b] - mean(bootsample))/sx[b]
}
#sample quantile
t_lower = quantile(tx,0.025)
t_upper = quantile(tx,0.975)
#confidence interval
mean(bootsample) + t_lower*sd(bootsample)/sqrt(n)
mean(bootsample) + t_upper*sd(bootsample)/sqrt(n)
#bootstrap testing for mean
mu0 = 75;
t_mu0 = (mean(bootsample) - mu0)/sd(bootsample)*sqrt(n)
p_boot = sum(tx>t_mu0)/length(tx) #H1: mu>m0
p_boot
```
### Testing for 2 groups mean
- paired t-test on Di: $D_i = X_i-Y_i$
```{r}
df$Y <- df$A - df$B
t.test(df_q1$Y, mu = 0, alternative = "greater", conf.level = 0.95) # alternative can also be "less" or "two.sided"
t.test(x=bank$Current,y=bank$Start,mu=6000, paired=T) #before and after
```
- 2-sample t-test (equal variance) on $\bar{X} - \bar{Y}$
- assumption: independence & equal variances
- But, data exhibits dependence due to: cluster effect; serial and spatial correlation
- Test equality of 2 variances: F test
```{r}
# test whether the two variances are equal or not.
var.test(Age~Sex,data=bank, alternative="two.sided")
```
- 2-sample t-test (unequal variance) on $\bar{X} - \bar{Y}$
- t test
```{r}
t.test(Age~Sex, data=bank, var.equal=FALSE) # var.equal=FALSE gives the Welch test; TRUE gives the pooled t-test
t.test(x, y, mu = 0, alternative = "greater", conf.level = 0.95)
```
- Welch's Test
```{r}
t.test(x,y,alternative="less")
```
### ANOVA testing homogeneity of k pop means
- assumption: all pop. variance are equal; iid normally distributed
$H_0: \mu_1 = \mu_2 = ... = \mu_k$
$H_1$: at least one population mean is different from the rest
```{r}
summary(aov(mg~product, data=soy))
```
### Compare Distribution of 2 groups
- Wilcoxon ranksum test ($\mu_x = \mu_y$)
- preferred when: 2 samples are compared; 2 groups of data are independent; continuous or ordinal variable; non-parametric: deviate from normal distribution
```{r}
wilcox.test(respiration~condition, data=soil, alternative="two.sided",exact=TRUE)
wilcox.test(y, x, paired=TRUE, exact=TRUE) #wilcoxon signed rank test
#numerically
# There are no ties in this data, so wilcox.test() and the normal approximation give the same result
# note that wilcox.test() cannot handle ties exactly (with ties the results differ slightly)
x = cars$`suggested.retail.price`
y = cars$`dealer.cost`
m = length(x); n = length(y)
N = m+n
sort.xy = sort(c(x,y))
order.xy = order(c(x,y))
z.stat = as.numeric(order.xy%in%(1:m))
W = sum(c(1:N)*z.stat) # W is the sum of the ranks of the x-sample within the pooled data
Z = (W-m*(N+1)/2)/(sqrt(m*n*(N+1)/12))
W # Wilcoxon rank test statistic based on formula
```
### Variance stabilizing transformation
Fulfills the assumption that all population variances are equal
- log-transformation
- box-cox
- square-root
- arcsin
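A hedged sketch of applying these transformations before re-running ANOVA, assuming a data frame `df` with positive response `y` and group factor `g` (the arcsine version additionally assumes `y` is a proportion; Box-Cox is shown in Chapter 7):
```{r}
df$y_log    <- log(df$y)          # log-transformation (requires y > 0)
df$y_sqrt   <- sqrt(df$y)         # square-root (e.g. for counts)
df$y_arcsin <- asin(sqrt(df$y))   # arcsine (for proportions in [0,1])
summary(aov(y_log ~ g, data = df))
```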
## Chapter 5- Comparing 2 groups of CATEGORICAL (tut4)
### One-sample case
- z-test for proportion p: binomial dist.
```{r}
prop.test(x=9,n=10,p=.2,correct=F) # returns Wilson CI #"correct" for continuity correction
```
- one-sample exact binomial test for p: when the sample size is not too large
```{r}
binom.test(x=9,n=10,p=.2)
```
```{r}
library(DescTools)
BinomCI(x=9,n=10,conf.level = 0.95, sides="two.sided",
method="wilson")
BinomCI(x=9,n=10,conf.level = 0.95, sides="two.sided",
method="wald")
BinomCI(x=9,n=10,conf.level = 0.95, sides="two.sided",
method="clopper-pearson") ###exact CI
```
### Two-sample case
- z-test for p_x and p_y
$H_0: p_x = p_y$
```{r}
prop.test(x=c(9,4), n=c(12,13),correct=F) #p1_hat = 9/12, p2_hat = 4/13
#confidence interval for p1-p2 provided
```
- McNemar's test for 2 paired samples
$H_0: \theta_2 = \theta_3$ (the two off-diagonal probabilities are equal, i.e. there is no effect)
```{r}
Performance <- matrix(c(4,9,3,16),nrow=2)
Performance
mcnemar.test(Performance,correct=F)
```
- Test for 2x2 contingency table: whether X and Y are independent
```{r}
#pearson chisq test
tbl = matrix(c(173,150,125,73), nrow=2)
chisq.test(tbl,correct=F)
```
## Chapter 6- Linear Regression (tut5)
- Hypothesis testing on $\beta$:
$H_0: K'\beta = m$; the test statistic follows $F(s, n-p-1)$, where $K'$ is an $s \times (p+1)$ matrix of full row rank
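A hedged sketch of testing such a general linear hypothesis with `car::linearHypothesis()`, assuming the `fit.full` model fitted in the chunk below (the two hypothesis rows are illustrative):
```{r}
library(car)
# H0: beta_1 = 0 and beta_2 = beta_3 jointly (s = 2 rows of K')
linearHypothesis(fit.full, c("X1 = 0", "X2 - X3 = 0"))
```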
```{r}
#model
# read this data
corns = read.table('yield.txt')
colnames(corns) <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7")
head(corns)
fit.full = lm(Y~., data=corns)
summary(fit.full)
#plot
plot(df$X3, df$Y, xlab="X3: time to run 1.5 miles (in minutes)",ylab="Y: oxygen intake rate")
#prediction and CI
pd1 = predict(fit1, newdata=data.frame(X3=9.9),interval="predict",se.fit = T) #prediction interval
pd1 = predict(fit1, newdata=data.frame(X3=9.9),interval="confidence",se.fit = T) #confidence interval
#prediction interval computed by hand (equivalent to interval="predict")
pd1$fit[1] - abs(qt(0.05/2, df=nrow(df)-2))* sqrt(pd1$se.fit^2+pd1$residual.scale^2)
pd1$fit[1] + abs(qt(0.05/2, df=nrow(df)-2))* sqrt(pd1$se.fit^2+pd1$residual.scale^2)
#Perform F-test on the significance of all the parameters
#read from summary(fit)
```
- Explain why we examine the residual plots of the residuals against each of the five independent variables, and obtain the residual-vs-fitted plot and the QQ plot.
```{r}
# get the residual v.s. fitted values plots, QQ plot
plot(fit.full)
# plot(fit.full) also produces the scale-location and residual-vs-leverage plots
# get the residuals
sales.res = resid(fit.full)
# draw the plot for each variable
par(mfrow=c(3,2))
plot(sales$X1, sales.res, ylab="Residuals", xlab="X1")
abline(0, 0)
plot(sales$X2, sales.res, ylab="Residuals", xlab="X2")
abline(0, 0)
plot(sales$X3, sales.res, ylab="Residuals", xlab="X3")
abline(0, 0)
plot(sales$X4, sales.res, ylab="Residuals", xlab="X4")
abline(0, 0)
plot(sales$X5, sales.res, ylab="Residuals", xlab="X5")
abline(0, 0)
```
Based on the model assumption, the error must be independent of the independent variables. Hence, the residual plots between the residuals and each of the independent variables must exhibit random patterns.
## Chapter 7- Variable selection & Model Diagnostics (tut6)
### Model Selection
#### best subset selection
- adjusted $R^2$:
can be negative; it offsets some of the automatic increase in $R^2$ from adding variables and gives a fairer measure of goodness of fit;
choose the larger value
- Mallow's Cp: choose a model with Cp close to p+1 and small p
- AIC, BIC: choose the lower value
```{r}
library(olsrr)
print(cbind(c(ols_mallows_cp(fit1, fit2.full),
ols_mallows_cp(fit2, fit2.full),
ols_mallows_cp(fit3, fit2.full),
ols_mallows_cp(fit4, fit2.full),
ols_mallows_cp(fit5, fit2.full),
ols_mallows_cp(fit6, fit2.full)),
c(summary(fit1)$r.squared,summary(fit2)$r.squared,
summary(fit3)$r.squared,summary(fit4)$r.squared,
summary(fit5)$r.squared,summary(fit6)$r.squared),
c(summary(fit1)$adj.r.squared,summary(fit2)$adj.r.squared,
summary(fit3)$adj.r.squared,summary(fit4)$adj.r.squared,
summary(fit5)$adj.r.squared,summary(fit6)$adj.r.squared)))
BIC(fit1)
```
#### Sequential selection method
- Forward selection: start from intercept
- Backward selection: start from full
- Stepwise selection
- Exhaustive search
```{r}
library(leaps)
#selection
regfit_full = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward") # or "backward", "seqrep", "exhaustive"
reg_summary = summary(regfit_full)
#plot criterion
par(mfrow = c(2,2))
plot(reg_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The which.max() function can be used to identify the location of the maximum point of a vector
adj_r2_max = which.max(reg_summary$adjr2) # 11
# The points() command works like the plot() command, except that it puts points
# on a plot that has already been created instead of creating a new plot
points(adj_r2_max, reg_summary$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)
# We'll do the same for C_p and BIC, this time looking for the models with the SMALLEST statistic
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(reg_summary$cp) # 10
points(cp_min, reg_summary$cp[cp_min], col = "red", cex = 2, pch = 20)
plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(reg_summary$bic) # 6
points(bic_min, reg_summary$bic[bic_min], col = "red", cex = 2, pch = 20)
#plot regsubset
plot(regfit_full, scale="adjr2")
coef(regfit_full, 11)
plot(regfit_full, scale="Cp")
plot(regfit_full, scale="bic")
```
### Model Diagnostics on model error
#### Assumptions:
- linearity: log-transformation, polynomial transformation
- equal and constant variance: variance stabilizing transformation
- independence: iid errors
- normality
```{r}
#residual plot
# At any fitted value, the mean of the residuals should be roughly 0. If this is the case, the linearity assumption is valid. For this reason, we generally add a horizontal line at y=0 to emphasize this point.
# At every fitted value, the spread of the residuals should be roughly the same. If this is the case, the constant variance assumption is valid.
par(mfrow = c(1, 2))
plot(fitted(initech_fit), resid(initech_fit), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, col = "darkorange", lwd = 2)
qqnorm(resid(initech_fit), main = "Normal Q-Q Plot", col = "darkgrey") #qqplot
qqline(resid(initech_fit), col = "dodgerblue", lwd = 2)
```
#### Non-linearity
- linearity: log-transformation on y
polynomial transformation on x_i
```{r}
#polynomial trans
mark_mod_poly2 = lm(sales ~ advert + I(advert ^ 2), data = marketing)
summary(mark_mod_poly2)
ggplot(data = marketing, aes(x = advert, y = sales)) +
stat_smooth(method = "lm", se = FALSE, color = "green", formula = y ~ x) +
stat_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x + I(x ^ 2)) +
stat_smooth(method = "lm", se = FALSE, color = "red", formula = y ~ x + I(x ^ 2)+ I(x ^ 3)) +
geom_point(colour = "black", size = 3)
#faster way to specify a model with many higher order terms
fit6_alt = lm(mpg ~ poly(mph, 6), data = econ) #poly uses orthogonal polynomials
fit6_alt2 = lm(mpg ~ poly(mph, 6, raw = TRUE), data = econ) # equal to use I() repeatedly
```
#### Non-constant variance (Variance Stabilizing Transformation)
Fulfills the assumption that all population variances are equal
- log-transformation
```{r}
initech_fit_log = lm(log(salary) ~ years, data = initech)
#plot log fit curve
plot(salary ~ years, data = initech, col = "grey", pch = 20, cex = 1.5,
main = "Salaries at Initech, By Seniority")
curve(exp(initech_fit_log$coef[1] + initech_fit_log$coef[2] * x),
from = 0, to = 30, add = TRUE, col = "darkorange", lwd = 2)
#compare RMSE
sqrt(mean((initech$salary - fitted(initech_fit)) ^ 2))
sqrt(mean((initech$salary - exp(fitted(initech_fit_log))) ^ 2))
```
- box-cox
$(y^\lambda-1)/\lambda$
```{r}
library(MASS)
boxcox(savings_model, plotit = TRUE, lambda = seq(0.5, 1.5, by = 0.1)) #determine lambda
gala_model_cox = lm((((Species ^ 0.3) - 1) / 0.3) ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala) #refit model with the chosen lambda
EnvStats::boxcox(df1$Days, lambda = c(-1, 2), optimize = TRUE, objective.name = "Log-Likelihood") # the EnvStats version optimizes lambda directly
shapiro.test(BoxCox(df1$Days, lambda)) # BoxCox() from DescTools/forecast; check normality after transformation
```
- square-root
- arcsin
#### Non-normality
- QQ-plot and Shapiro test
- log-transformation sometimes helps
- otherwise a bootstrap analysis can help find the p-value (see the sketch below)
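A hedged sketch of a case-resampling bootstrap p-value for a slope, assuming a fitted simple model `fit` on a data frame `df` with response `y` and predictor `x`:
```{r}
B <- 2000
beta_hat <- coef(fit)[2]
beta_star <- replicate(B, {
  idx <- sample(nrow(df), replace = TRUE)        # resample cases
  coef(lm(y ~ x, data = df[idx, ]))[2]
})
# two-sided bootstrap p-value for H0: beta = 0 (recenter the bootstrap distribution)
mean(abs(beta_star - beta_hat) >= abs(beta_hat))
```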
#### Unusual Observations: Outliers and influential observations
- Outlier
- Influential observation: could cause the LSEs to be substantially different from what they would be if it were removed from the data
- Cook's distance: flag observations with distance > 4 x mean Cook's distance
- Remedies: check for recording typos; drop the point if it is implausible; use a dummy variable to remove its effect
```{r}
#detect outliers
outlier_values <- boxplot.stats(ozone$pressure_height)$out # outlier values.
boxplot(ozone$pressure_height, main="Pressure Height", boxwex=0.1)
mtext(paste("Outliers: ", paste(outlier_values, collapse=", ")), cex=0.6)
library(car)
outlierTest(mod)
library(outliers) # outlier() and scores() below are from the outliers package
outlier(y) #gets the observation farthest from the mean
outlier(y,opposite=TRUE) #fetches outlier from the other side.
scores(x) # z-scores => (x-mean)/sd
scores(x, type="chisq") # chi-sq scores => (x - mean(x))^2/var(x)
scores(x, type="t") # t scores
scores(x, type="chisq", prob=0.9) # beyond 90th %ile based on chi-sq, return TRUE/FALSE
```
```{r}
#cooks distance
cooksd <- cooks.distance(lmmodel)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red") # add labels
#track down influential rows
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
head(ozone[influential, ]) # influential observations.
```
#### Multicollinearity
- VIF (variance inflation factor):
$VIF_j = 1/TOL_j= 1/(1-R_j^2)$
- VIF greater than 5 is enough to suspect multicollinearity
- Symptoms:
beta coefficients of some important variables are insignificant although the model is jointly significant
(F-test significant but individual beta tests insignificant);
correlations between some pairs of independent variables are extremely high;
prediction is fine but the coefficients do not make sense in explanation
- Remedies:
- Eliminate one of the highly correlated variables
- PCA (principal components analysis): on the covariance of X, to produce a set of linearly independent combinations of the columns of X s.t. the total variance of the columns of X is preserved
- Add constraints on beta: when X is nearly singular, penalize large values of beta (see the sketch after the next chunk):
- Ridge
- Lasso
```{r}
pairs(seatpos, col = "dodgerblue")
round(cor(seatpos), 2) #correlation matrix
vif(model) # vif() is from the car package
```
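A hedged sketch of the ridge/lasso remedies using the glmnet package on the `seatpos` data from above:
```{r}
library(glmnet)
x <- model.matrix(hipcenter ~ ., data = seatpos)[, -1] # drop the intercept column
y <- seatpos$hipcenter
ridge <- cv.glmnet(x, y, alpha = 0) # alpha = 0: ridge penalty
lasso <- cv.glmnet(x, y, alpha = 1) # alpha = 1: lasso penalty
coef(ridge, s = "lambda.min")
coef(lasso, s = "lambda.min")
```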
e.g. investigate the effect of adding another variable (HtShoes) to this smaller model.
- Added-variable plot: visualizes the residuals of the response and of the added predictor (each regressed on the other predictors) against each other. It is also helpful to regress the residuals of the response on the residuals of the predictor and add the regression line to the plot.
```{r}
plot(resid(hip_model_small) ~ resid(ht_shoes_model_small),
col = "dodgerblue", pch = 20,
xlab = "Residuals, Added Predictor",
ylab = "Residuals, Original Model")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
abline(lm(resid(hip_model_small) ~ resid(ht_shoes_model_small)),
col = "darkorange", lwd = 2)
```
- Here the added-variable plot shows almost no linear relationship. This tells us that adding HtShoes to the model would probably not be worthwhile. Since its variation is largely explained by the other predictors, adding it would do little to improve the model; however, it would increase the variance of the estimates and make the model harder to interpret.
- Had there been a strong linear relationship here, thus a large partial correlation coefficient, it would likely have been useful to add the additional predictor to the model.
- partial correlation coefficient
```{r}
hip_model_small = lm(hipcenter ~ Age + Arm + Ht, data = seatpos)
ht_shoes_model_small = lm(HtShoes ~ Age + Arm + Ht, data = seatpos)
cor(resid(ht_shoes_model_small), resid(hip_model_small))
```
## Chapter 8- One Way ANOVA (multiple K groups)
focus on the trade-off between goodness of fit (minimizing errors) and complexity of model.
### One-way ANOVA
testing equality in means for k>2 independent populations
- Assumptions: iid, constant variance, normality
$H_0: \mu_1=\mu_2=...=\mu_k$
H1: at least two means are not equal
```{r}
fit_full <- lm(Salary~., data=Hitters) # stepwise selection (cf. Chapter 7)
select_model <- step(fit_full, direction="both")
summary(select_model)
anova(select_model, fit_full)
anova_one_way <- aov(time~poison, data = df)
summary(anova_one_way)
```
### Testing equality of group variances
- Bartlett's test:
- sensitive to departures from normality; if the samples are non-normal, Bartlett's test may merely be detecting non-normality
- Levene's test and the Brown-Forsythe test are alternatives
```{r}
bartlett.test(weight ~ group, data = PlantGrowth) # with one independent var
bartlett.test(len ~ interaction(supp,dose), data=ToothGrowth) #collapse multiple factors into a single variable
# Levene's test with one independent variable (leveneTest is from the car package)
leveneTest(weight ~ group, data = PlantGrowth)
# Levene's test with multiple independent variables
leveneTest(len ~ supp*dose, data = ToothGrowth)
bf.test(weight ~ group, data = PlantGrowth) # Brown-Forsythe test, from the onewaytests package
```
### Multiple Comparison
with rejected ANOVA F-test, identify difference in group means
- Fisher's LSD: just usual two-sample t-test for each pairwise comparison
- Bonferroni: $\alpha' = \alpha/c$ where $c = k(k-1)/2$;
very conservative, i.e. harder to reject H0
- Tukey
- Dunnett's (see the sketch after the next chunk)
```{r}
#pairwise comparison
pairwise.t.test(df$time, df$treat, p.adj = "none")
pairwise.t.test(df$time, df$treat, p.adj = "bonferroni")
#conclusions are still drawn at alpha = 0.05 (the adjustment is applied to the p-values)
#LSD
res = aov(df$time~ df$treat)
summary(res)
library(agricolae) # LSD.test is from the agricolae package
LSD.test(df$time, df$treat, DFerror = df.residual(res), MSerror = deviance(res)/df.residual(res), p.adj = "none") # numeric residual DF and MSE from the aov fit
# Tukey test to study each pair of poison level
TUKEY <- TukeyHSD(aov(df$time ~ df$poison), 'df$poison', conf.level=0.95, ordered =TRUE) #ordered for rank
# Tukey test representation:
plot(TUKEY , las=1 , col="brown")
abs(mean(A)-mean(B)) # e.g. 0.1716667 < LSD: do not reject
```
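Dunnett's test compares every treatment with a single control; a hedged sketch using `DescTools::DunnettTest()`, assuming the first level of `treat` is the control:
```{r}
library(DescTools)
DunnettTest(time ~ treat, data = df) # each level vs. the first (control) level
```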
### Kruskal-Wallis ANOVA Test - compares medians/distributions
Extension of the Wilcoxon rank-sum test to k groups
```{r}
kruskal.test(number ~ group, data =df5)
```
## Chapter 9- ANCOVA
- combines a continuous covariate (X) effect with treatment factors
- Hypothesis testing in Covariance Analysis
$y_{ij} = \mu + \alpha_i + \beta x_{ij}+\beta_ix_{ij}+\epsilon_{ij}$
- $\alpha_i$: treatment effects (categorical)
- $\beta$: covariate slope (continuous)
- $\beta_i$: interaction (group-specific slope)
1. $H_0$ (equal slopes, no interaction): all $\beta_i = 0$
2. if $H_0$ cannot be rejected:
- $H_{0x}$ (no x effect): $\beta = 0$
- $H_{0,treatment}$ (no treatment effect): all $\alpha_i = 0$
```{r}
options(contrasts = c("contr.treatment", "contr.poly"))### These are the default contrasts in R
# Analysis of covariance
# here I set drug = "F" as the reference level
# and you can use this to specify the reference level as you need
### relevel: by default R uses the first level as the reference
model.1 = lm (PostT ~ relevel(drug, ref = "F") + PreT + PreT:relevel(drug, ref = "F"), data = df) # PreT:drug is the interaction term
summary(model.1) #get detailed effect estimate
Anova(model.1, type="II") #get the ANOVA table; Anova() is from the car package
#test significance of the interaction term
model.1 = lm(Pulse ~ Temp + Species + Temp:Species, data = Data)
model.2 = lm(Pulse ~ Temp + Species, data = Data) # no interaction
anova(model.2, model.1) ### Interaction is not significant, so the slopes among groups do not differ
# general pattern: anova(model_reduced, model_full)
#test for factor effect
model.3 = lm(Pulse ~ Temp, data = Data)
anova(model.3, model.2) ### The categorical variable (Species) is significant: a treatment effect exists
#test for the effect of beta (the covariate)
model.4 = lm(Pulse ~ Species, data = Data)
anova(model.4, model.2)
contrasts(Data$Species)
```
```{r}
#plot fitted line
I.nought = -6.35729 #intercept
I1 = I.nought + 0
I2 = I.nought + 19.81429 #estimate of one group
I3 = I.nought + -10.18571 #estimate of another effect
B = 3.56961 #beta
plot(x = Data$Temp,
y = Data$Pulse,
col = Data$Species,
pch = 16,
xlab = "Temperature",
ylab = "Pulse")
legend('bottomright',
legend = levels(Data$Species),
col = 1:3,
cex = 1,
pch = 16)
abline(I1, B, lty=1, lwd=2, col = 1)
abline(I2, B, lty=1, lwd=2, col = 2)
abline(I3, B, lty=1, lwd=2, col = 3)
```
```{r}
#y means
tapply(df[, 1], df$drug, mean)
#estimated y means
# The means of Post-Treatment scores per drug
tapply(df[, 3], df$drug, mean)
```
- Assumptions:
- linearity between the covariate and the outcome variable
- outcome variable normally distributed
- Homoscedasticity: iid, constant-variance errors
- no sign of outliers
- independent variables should be categorical
- dependent variable and covariate should be continuous
- observations are independent
```{r}
shapiro.test(residuals(model.3))
tapply(df2$Days, df2$B, shapiro.test)#testing normality for groups
#for non-normality, try a square-root or log transformation
df$PostT_sq = sqrt(df$PostT)
model.4 = lm(PostT_sq ~ PreT, data = df)
summary(model.4)
shapiro.test(residuals(model.4))
```
e.g. when the mean of Y is the largest among patients from the severe weight-gain group, i.e. $\mu_3>\mu_1, \mu_3>\mu_2$, or equivalently
$\beta_1<0$ and $\beta_2<0$
```{r}
model.2 = lm(Days ~ relevel(B, ref = "3"), data = df2)
summary(model.2)
anova(model.2)
```
The p-values are very small, meaning the coefficients are significantly negative. Therefore, we can conclude that the mean of Y is the largest among patients from the severe weight-gain group.
## Chapter 10- Two-way ANOVA (Tut9)
- main effect: test whether means are significantly different between levels of one factor
- interactions: a difference in differences of means
- two-way ANOVA: no interaction;
two-way (fully) factorial ANOVA: includes (all) interactions
$Y_{ijk}= \mu+\alpha_i+\beta_j+(\alpha\beta)_{ij}+\epsilon_{ijk}$
```{r}
# Two-way interaction plot
interaction.plot(x.factor = df1$B, trace.factor = df1$A,
response = df1$Days, fun = mean,
type = "b", legend = TRUE,
xlab = "B", ylab="Days",
pch=c(1,20), col = c("#00AFBB", "#E7B800"))
```
```{r}
res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
#estimate \sigma^2
sum(residuals(res.aov3)^2) / df.residual(res.aov3) # RSS / residual degrees of freedom (here 56)
```
1. Test model assumptions
2. Test the interaction, then individual factors
- interaction: F-test
- factor significance: t-test
e.g. carry out an appropriate test that the two least squares means are equal, against the alternative that the least squares mean of W for A=1 is greater than that for A=2, at the 2.5% level of significance (see the sketch after the next chunk).
$H_0: \alpha = 0,\ H_1: \alpha > 0$
```{r}
#relevel
model = lm(formula = W ~ relevel(A, ref = "2") + relevel(B, ref = "3"),data = df1)
```
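A hedged sketch of the one-sided test from the releveled fit (the coefficient row position is an assumption; check `summary(model)` for the dummy of interest):
```{r}
co <- summary(model)$coefficients
t_stat <- co[2, "t value"]                         # row 2: the A dummy (hypothetical position)
pt(t_stat, df.residual(model), lower.tail = FALSE) # one-sided p-value, compare with 0.025
```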
## Chapter 11- Logistic Regression
### GLM
- Random component: a probability distribution from the exponential family
- Systematic component
- Link function
### Logistic regression model
$logit(\pi)=log(\frac{\pi}{1-\pi})= \beta_0 + \Sigma\beta_ix_i$
- binomial component: $E(Y)= \pi$
- Systematic component $\eta = \beta_0 + \Sigma\beta_ix_i$
- Link function $g(\pi)=log(\frac{\pi}{1-\pi})$
For rare events, use the complementary log-log link
$g(\pi) = log[-log(1-\pi)]$
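A one-line sketch of fitting with this link (`y`, `x`, and `df` are placeholders):
```{r}
glm(y ~ x, family = binomial(link = "cloglog"), data = df) # hypothetical names
```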
IMPORTANT NOTIONS:
- odds = prob(success)/prob(fail)
- risk difference of an event: RD(1:2) = p1 - p2
- risk ratio of an event: RR(1:2) = p1/p2
- odds ratio: OR(1:2) = odds1/odds2
Logistic regression often uses the odds ratio:
- continuous variable:
$OR = \exp(\beta_i)$ per unit increase in $x_i$;
this does not work well when $x_i$ and $x_j$ are highly correlated
- categorical independent variables coded by dummy variables:
$OR(j: \text{last level}) = \exp(\beta_j)$
- categorical independent variables coded by contrast variables:
$OR(j: \text{last}) = \exp(\beta_1 + \beta_2 + ... + 2\beta_j + ... + \beta_{g-1})$
$OR(j:l) = \exp(\beta_j - \beta_l)$
```{r}
mydata$rank <- factor(mydata$rank)
mydata$admit <- factor(mydata$admit)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#interpret: For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
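# A hedged sketch: odds ratios and profile-likelihood CIs from the fit above
exp(coef(mylogit))    # OR = exp(beta)
exp(confint(mylogit)) # CIs for the ORs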
fit2 <- glm(believe~age, family = "binomial", weights = count, data = santa2)
summary(fit2)
## believe age count
## 1 Yes Age3 30