---
title: Descriptive analyses and basic plotting in R
author: Aaron Lun, Catalina Vallejos
date: 4 April 2017
output:
  html_document:
    fig_caption: false
---
```{r, echo=FALSE, results="hide"}
# Make compilation stop upon error:
library(knitr)
opts_chunk$set(error=FALSE)
```
# Introducing (R)markdown
Use triple backticks to indicate code environments:
```{r}
a <- 1
print(a)
```
Set `eval=FALSE` to show the code, but not run it:
```{r, eval=FALSE}
a <- 2
print(a)
```
Set `echo=FALSE` to run the code, but not show it:
```{r, echo=FALSE}
a <- 3
print(a)
```
Set `results="hide"` to hide the results:
```{r, results="hide"}
a <- 4
print(a)
```
For plots, set `fig.width` and `fig.height` to change the dimensions of the output plot:
```{r, fig.width=10, fig.height=6}
hist(rnorm(10000), xlab="X", main="Normal distribution")
```
See `?opts_chunk` and http://yihui.name/knitr/options#chunk_options for more details.
# Exploring the `cars` dataset
## Initial examination of the data
Let's look at one of the datasets that come with the R installation.
From the documentation in `?cars`:
> The data give the speed of cars and the distances taken to stop.
> Note that the data were recorded in the 1920s.
We can inspect the first few elements:
```{r}
head(cars)
```
Or the last few elements - how would we do that?
```{r}
# Put some code here!
tail(<something>)
```
We can have a look at some summary statistics:
```{r}
summary(cars)
```
What functions would we use to directly calculate:
- the mean?
- the median?
- the standard deviation?
- the minimum or maximum?
- the first or third quartiles?
```{r}
# Put some code here!
```
## Generating some summary plots
We can make histograms of the two sets of values:
```{r}
hist(cars$speed, xlab="Speed (mph)", main="Speed distribution")
hist(cars$dist, xlab="Distance (ft)", main="Distance distribution")
```
Check out `?hist` to tune parameters, e.g., number of bars via `breaks`, fill colour with `col`.
```{r}
# Put some code here!
hist(cars$dist, breaks=<something>, col=<something>)
```
(__Note:__ how do you find out what arguments can be supplied to a function?)
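One way to explore this from the console, as a minimal sketch: besides reading the help page (e.g., `?hist`), the base functions `args()` and `formals()` list a function's arguments. The choice of `sd` and of the default `hist` method is only for illustration, and the object names are made up.

```r
# List the arguments of a simple function.
args(sd)
sd.args <- names(formals(sd))
print(sd.args)

# 'hist' is a generic, so most of its arguments live in the default method.
hist.default.fun <- getS3method("hist", "default")
hist.args <- names(formals(hist.default.fun))
print(hist.args)
```

The help page remains the better source for what each argument means; these functions only report names and defaults.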
We can also make a scatter plot of speeds vs distances, below.
Why is the speed on the x-axis?
```{r}
# x-coord y-coord
plot(cars$speed, cars$dist, xlab="Speed (mph)", ylab="Distance (ft)")
```
An alternative method uses the formula notation `~`, i.e., plotting distance as a function of speed.
```{r}
# y ~ x
plot(cars$dist ~ cars$speed)
```
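A formula can also be combined with the `data` argument, so that the column names are looked up inside the data frame. The following is a small sketch of that variant; the object names `stop.formula` and `formula.vars` are made up for illustration.

```r
# A formula is an ordinary R object that can be stored and inspected.
stop.formula <- dist ~ speed
formula.vars <- all.vars(stop.formula)
print(formula.vars)

# With 'data', the variable names are resolved inside the 'cars' data frame.
plot(stop.formula, data=cars, xlab="Speed (mph)", ylab="Distance (ft)")
```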
Check out `?plot` and `?par` for tunable parameters.
```{r}
# Put some code here!
plot(cars$speed, cars$dist, pch=<something>, cex=<something>, col=<something>)
```
# Exploring the `chickwts` dataset
## Initial examination of the data
From the documentation in `?chickwts`:
> An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens.
Examine the data:
```{r}
# Put some code here!
head(<something>)
```
Looking at some summaries of the data:
```{r}
# Put some code here!
summary(<something>)
```
Getting the number of samples for each feed type:
```{r}
table(chickwts$feed)
```
The previous `summary` call computes values across all feed types - what's missing here?
## Computing summaries for each feed type
Using `dplyr`, we can calculate summary statistics for each feed group:
```{r}
library(dplyr)
by.feed.type <- chickwts %>%
    group_by(feed) %>%
    summarise(min = min(weight),
              q1 = quantile(weight, probs = 0.25),
              median = quantile(weight, probs = 0.5),
              mean = mean(weight),
              q3 = quantile(weight, probs = 0.75),
              max = max(weight),
              sd = sd(weight),
              n = n())
by.feed.type
```
Notice the use of the special function `n()`, which counts how many observations
there are in each group.
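For comparison, per-group summaries can also be computed without `dplyr`. The sketch below uses base R's `tapply` and assumes only the built-in `chickwts` data; the object names are illustrative.

```r
# Apply a function to 'weight' separately for each level of 'feed'.
mean.by.feed <- tapply(chickwts$weight, chickwts$feed, mean)
sd.by.feed <- tapply(chickwts$weight, chickwts$feed, sd)
n.by.feed <- tapply(chickwts$weight, chickwts$feed, length)

# Combine the per-group vectors into a single table.
base.summary <- data.frame(feed=names(mean.by.feed), mean=as.vector(mean.by.feed),
    sd=as.vector(sd.by.feed), n=as.vector(n.by.feed))
print(base.summary)
```

The `n` column here plays the same role as `n()` in the `dplyr` version.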
## Making various types of plots
### Making a boxplot
Let's make a boxplot:
```{r}
boxplot(chickwts$weight ~ chickwts$feed, ylab="Chick weight (g)")
```
Again, we use the formula notation to plot weight against the feed type.
Check out `?boxplot` for tunable parameters:
```{r}
feed.cols <- c("blue", "red", "green", "yellow", "purple", "pink")
# Put some code here!
boxplot(chickwts$weight ~ chickwts$feed, ylab="Chick weight (g)", col=<something>)
```
What do the points in the plot represent (hint: have a look at the `range` argument in `?boxplot`)?
### Making a barplot
We can also make a barplot of the mean weights for all feed types.
```{r}
barplot(by.feed.type$mean, ylab = "Mean weight (g)")
```
This would be more useful if it had labels for each bar.
How can we do this (hint: check out `?barplot` for extra parameters)?
```{r}
# Put some code here!
barplot(by.feed.type$mean, names.arg = <something>, ylab="Mean weight (g)")
barplot(by.feed.type$mean, names.arg = <something>, ylab="Mean weight (g)", col=<something>)
```
__Advanced__: how do we add error bars?
Let's compute the standard error:
```{r}
sd.wts <- by.feed.type$sd
n.obs <- by.feed.type$n
se.wts <- sd.wts/sqrt(n.obs)
```
What's the difference between `sd.wts` and `se.wts`, and which one should we use?
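A small simulation may help in thinking about this question. The sketch below uses made-up normal data (not the chick weights) and compares the two quantities for a small and a large sample.

```r
set.seed(100)
small.sample <- rnorm(10)
large.sample <- rnorm(1000)

# The standard deviation describes the spread of the observations.
sd.small <- sd(small.sample)
sd.large <- sd(large.sample)

# The standard error describes the uncertainty of the sample mean.
se.small <- sd.small/sqrt(length(small.sample))
se.large <- sd.large/sqrt(length(large.sample))

print(c(sd.small=sd.small, sd.large=sd.large))
print(c(se.small=se.small, se.large=se.large))
```

The standard deviations should be of broadly similar size in both samples, while the standard error shrinks as the number of observations grows.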
Can you interpret what's happening in the code below?
```{r}
mean.wts <- by.feed.type$mean
x.pos <- barplot(mean.wts, ylab="Mean weight (g)", names.arg = by.feed.type$feed, ylim=c(0, max(mean.wts)*1.1))
mean.plus.se <- mean.wts + se.wts
segments(x.pos, mean.wts, x.pos, mean.plus.se)
segments(x.pos+0.1, mean.plus.se, x.pos-0.1, mean.plus.se)
```
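An alternative that is sometimes used for error bars is `arrows` with a flat arrowhead (`angle=90`). This is a sketch rather than the approach taken above; it recomputes the group means and standard errors with `tapply` so that it runs on its own, and all object names are illustrative.

```r
# Per-feed mean and standard error, computed with base R.
feed.mean <- tapply(chickwts$weight, chickwts$feed, mean)
feed.se <- tapply(chickwts$weight, chickwts$feed, sd) /
    sqrt(tapply(chickwts$weight, chickwts$feed, length))

upper <- feed.mean + feed.se
lower <- feed.mean - feed.se

# barplot() returns the x-coordinates of the bar midpoints.
bar.pos <- barplot(feed.mean, ylab="Mean weight (g)", ylim=c(0, max(upper)*1.1))

# code=3 draws a head at both ends; angle=90 makes each head a flat cap.
arrows(bar.pos, lower, bar.pos, upper, angle=90, code=3, length=0.05)
```

Unlike the `segments` version, this draws the bar in both directions around the mean.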
# Plotting lines with the `Orange` dataset
## Initial examination of the data
This dataset examines the growth of a number of orange trees over time.
```{r}
# Put some code here!
head(<something>)
```
Having a look at some data summaries:
```{r}
# Put some code here!
summary(<something>)
```
How many trees do we have?
How many data points do we have per tree?
```{r}
# Put some code here!
table(Orange$<something>)
```
To calculate the _mean_, _standard deviation_ and _median_ for both age and circumference of each tree:
```{r}
by.tree <- Orange %>%
    group_by(Tree) %>%
    summarise(mean_age = mean(age),
              mean_circumference = <do it yourself>,
              sd_age = sd(age),
              sd_circumference = <do it yourself>,
              median_age = median(age),
              median_circumference = <do it yourself>)
by.tree
```
## Making line plots
Making plots of the data:
```{r}
# Put some code here!
plot(<what is x>, <what is y>, xlab="Age (days)", ylab="Circumference (mm)")
```
Each data point is linked to one at an earlier time, _for the same tree_ - how do we represent this (hint: `lines`)?
```{r}
plot(Orange$age, Orange$circumference, xlab="Age (days)", ylab="Circumference (mm)")
# Put some code here!
is.tree <- <pick all observations for tree 1>
lines(Orange$age[is.tree], Orange$circumference[is.tree])
```
There is a way to automate this kind of repetitive task using a programming technique called a `for` loop.
Try and see if you understand what the following code does:
```{r}
plot(Orange$age, Orange$circumference, xlab="Age (days)", ylab="Circumference (mm)")
for (tree in unique(Orange$Tree)) {
    is.tree <- Orange$Tree==tree
    lines(Orange$age[is.tree], Orange$circumference[is.tree])
}
```
__Advanced:__ Flexing R's graphical muscle:
```{r}
tree.numbers <- as.character(Orange$Tree)
unique.trees <- unique(tree.numbers)
all.colors <- rainbow(length(unique.trees))
names(all.colors) <- unique.trees
plot(Orange$age, Orange$circumference, xlab="Age (days)", ylab="Circumference (mm)",
    col=all.colors[tree.numbers], pch=16)
for (tree in unique.trees) {
    is.tree <- Orange$Tree==tree
    lines(Orange$age[is.tree], Orange$circumference[is.tree], col=all.colors[tree], lwd=2)
}
legend("topleft", col=all.colors, pch=16, lwd=2, legend=paste("Tree", unique.trees))
```
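A key step in the code above is indexing a named vector by name: `all.colors[tree.numbers]` looks up one colour per observation. A toy sketch of the same idea, with made-up names and colours:

```r
# A named vector acts as a lookup table from group label to colour.
lookup <- c(A="red", B="blue")
groups <- c("A", "B", "B", "A")

# Indexing by a character vector returns one entry per element of 'groups'.
point.cols <- lookup[groups]
print(point.cols)
```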
__WARNING!__ Be careful of the ordering of x-coordinates when using `lines`!
How do we fix the following code?
```{r}
x <- runif(50)
y <- (x-0.5)^2
plot(x, y)
lines(x, y)
# Need to order observations by increasing 'x'
plot(x, y)
# Put some code here!
o <- order(<something>)
print(x)
print(x[o])
lines(x[o], y[o])
```
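To see what `order` returns, it can help to try it on a tiny vector first. A minimal sketch with made-up values:

```r
v <- c(0.3, 0.1, 0.2)

# order() gives the positions of the elements from smallest to largest.
idx <- order(v)
print(idx)    # 2 3 1

# Indexing with those positions returns the values in increasing order.
print(v[idx]) # 0.1 0.2 0.3
```

Applying the same index to a second vector keeps each pair of values together.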
# Further on the power of R graphics
A lot of graphical fine-tuning can be performed using the `par` command.
Looking at `?par` reveals a _lot_ of widely applicable graphical parameters that are not listed in the documentation of individual plotting functions.
* `mfrow`, to make multi-panel plots (advanced users may prefer `layout`)
* `mar`, to modify the margins of the plot
* `cex.axis` and `cex.lab`, to change the font size of the axis tick annotations and the axis labels, respectively
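As a rough sketch of how such parameters can be used together (the specific values are arbitrary choices for illustration): `par` returns the previous settings, which makes it straightforward to restore them afterwards.

```r
# Set new parameters; par() invisibly returns the old values.
old.par <- par(mfrow=c(1, 2), mar=c(5, 4, 2, 1), cex.axis=0.8, cex.lab=1.2)
panel.layout <- par("mfrow")

# Two panels, side by side.
hist(cars$speed, xlab="Speed (mph)", main="Speed")
hist(cars$dist, xlab="Distance (ft)", main="Distance")

# Restore the previous settings.
par(old.par)
```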
If you get lost or confused, remember: Dr Google is your friend (if you know what to ask).
An alternative to RStudio's export is to save R plots directly to file with various graphics devices.
For example, to save to PDF:
```{r}
pdf("my_plot.pdf")
plot(cars$speed, cars$dist)
dev.off()
```
PDFs are preferable for viewing plots in high resolution, but you can also save to PNG (e.g., if there are too many data points to store in a PDF).
Some tinkering is required to save in a decent resolution, though:
```{r, fig.keep='none'}
png("my_plot.png", width=2000, height=2000, res=300, pointsize=15)
plot(cars$speed, cars$dist)
dev.off()
```
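Device dimensions can also be set explicitly. For `pdf`, `width` and `height` are given in inches, whereas the PNG call above specifies pixels (2000 pixels at a resolution of 300 pixels per inch is roughly 6.7 inches). A sketch, writing to a temporary file so that nothing in the working directory is overwritten; the file and object names are illustrative.

```r
# tempfile() is used here purely for illustration.
out.file <- tempfile(fileext=".pdf")

# Width and height are in inches for the pdf device.
pdf(out.file, width=6, height=4)
plot(cars$speed, cars$dist, xlab="Speed (mph)", ylab="Distance (ft)")
hist(cars$dist, xlab="Distance (ft)", main="") # each new plot starts a new page
dev.off()

file.exists(out.file)
```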
# Session information
It's always wise to store your session information, so you know which version of R (and of its packages) you ran your analysis on.
```{r}
# Put some code here!
```
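One possible way to keep this information alongside an analysis, sketched under the assumption that a plain text file is acceptable (the temporary file path and object names are purely illustrative):

```r
# Capture the printed session information as a character vector.
session.lines <- capture.output(sessionInfo())

# Write it to a file that can be kept with the analysis outputs.
session.file <- tempfile(fileext=".txt")
writeLines(session.lines, session.file)

# The R version is also available on its own.
print(R.version.string)
```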