-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathv1-introduction-to-r.Rmd
414 lines (288 loc) · 11.8 KB
/
v1-introduction-to-r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
---
title: "1. Introduction to R"
date: "2022-11-29"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{1. Introduction to R}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor: visual
editor_options:
chunk_output_type: inline
---
> **🎯 GOALS**
>
> *Learning or re-capping the basics of writing code in the R language, including mathematical operations, defining variables, different types of data, and using R functions and packages.*
## 1.1 Intro: Milk
- For complex tasks, there are multiple ways to achieve a certain goal
- Some of these ways are more effective than others
- Humans don't always naturally gravitate towards these; they often need to be taught
![](images/milk.gif)
![](images/spss_r.png)
***Figure:** Different ways of achieving complex tasks.*[^1]
[^1]: **Sources:** [gifs.com](https://gifs.com), [University of Wisconsin-Madison](https://www.ssc.wisc.edu/sscc/pubs/spss/classintro/spss_students1.html), [Techademia](https://pages.vassar.edu/acs/r-for-data-analysis-and-graphics/rstudio-screenshot/)
## 1.2 Doing math
- Think of R like a fancy pocket calculator: You give it (numerical) inputs, it gives you an output
- For example: Multiplying two numbers
- In RStudio, hit the green ▶️ symbol to run the R code and see the output below
```{r}
42 * 2
```
> **✍️ EXERCISE**
>
> *Check the current temperature for Berlin on the internet. Then use R to convert it from degrees Celsius to degrees Fahrenheit. The formula is:* $\text{°C}\times1.8+32$
>
> *Replace the three dots (`...`) with R code and hit the green ▶️ symbol to run it.*
```{r, eval=FALSE}
...
```
## 1.3 Variables
- The real power of programming languages: Storing the output of a computation as an intermediate result (**variable**) that can be reused for the next computation
- We assign (define, create) a variable by choosing a custom name (here: `my_var`) and an arrow symbol (`<-`)
```{r}
my_var <- 4 + 4
```
- To show the current value of a variable:
```{r}
my_var
```
- Or both assign *and* show the variable at once, using parentheses:
```{r}
(my_other_var <- 3 * 2)
```
- Re-use a previously defined variable:
```{r}
my_var ^ 2
```
> **✍️ EXERCISE**
>
> *Re-do the temperature calculation from the previous exercise, but this time storing the temperatures in two separate variables.*
```{r, eval=FALSE}
degrees_celsius <- ...
(degrees_fahrenheit <- ...)
```
## 1.4 Data types
- We've only dealt with numbers, but there's also other types of data
- **`numeric`**: A number
```{r}
13.4
```
- **`character`**: A string of letters (must be surrounded by quotation marks)
```{r}
"Hello world"
```
- **`logical`**: The logical statements `TRUE` and `FALSE`
```{r}
TRUE
```
- To check the type of a variable (here `my_var`, defined above):
```{r}
class(my_var)
```
- We often want to store not just a single value (number, string, etc.) but multiple ones
- **`vector`**: A list of elements, where all elements are *of the same type*
- Defined using the `c()` function (for "combine")
```{r}
c(1, 5, 8, 21)
```
- **`data.frame`**: A table, that is, a two-dimensional structure with rows and columns
- Each column is a vector, each row is from the same sample (e.g., participant, trial)
- Each column should be given a name (the part before the equals sign)
```{r}
data.frame(
country = c("Germany", "UK", "Denmark"),
population_mil = c(84.1, 67.1, 5.9),
eu_member = c(TRUE, FALSE, TRUE)
)
```
- **`list`**: A list of elements, where the elements can be of different types
- Extremely powerful (e.g., a list of data frames, a list of lists)
```{r}
list("Hello world!", 42, c("Vector", "inside", "a", "list!"))
```
> **✍️ EXERCISE**
>
> *As a group, create one vector with your names and another vector with your heights (in cm). Combine these two vectors into a data frame.*
```{r, eval=FALSE}
names <- ...
heights <- ...
(persons <- data.frame(
...,
...
))
```
## 1.4 Selecting data
- We often have large amounts of data and want to pick out a subset of them
- A **numeric index** *n* in square brackets (here: `[5]`) selects the *n*th (here: 5th) element
```{r}
my_vector <- c(2, 3, 5, 7, 11, 13, 17, 19, 23)
my_vector[5]
```
- A **range of indices** (e.g., 2nd to 4th) or **vector of indices** (e.g., 1st and 3rd) selects multiple elements
```{r}
my_vector[2:4]
my_vector[c(1, 3)]
```
- A **logical condition** selects all elements for which the test is `TRUE`
```{r}
my_vector[my_vector > 5]
```
- Data frames have two dimensions (rows and columns), so we need **two indices** separated by a comma
```{r}
universities <- data.frame(
city = c("Heidelberg", "Leipzig", "Rostock", "Greifswald"),
established = c(1386, 1409, 1419, 1456),
students_k = c(31.5, 29.5, 14.0, 12.0)
)
universities[1, 3]
```
- Select all rows (or columns) by leaving the respective index **empty**
```{r}
universities[2, ]
```
- Select **a single column** (i.e., vector) with a dollar sign (`$`) and the column name
- Note that this returns a vector
```{r}
universities$established
```
- Again, we can select only records (rows) that fulfill a certain **logical condition**
```{r}
universities[universities$students_k > 20, ]
```
> **✍️ EXERCISE**
>
> *The data frame `ToothGrowth` is a dataset built into R. It contains the measured length of the teeth (column `len`) of 60 guinea pigs after receiving different doses of Vitamin C (column `dose`). The vitamin C was delivered either via ascorbic acid (value `"VC"` in column `supp`) or via orange juice (value `"OJ"`).*
```{r}
head(ToothGrowth)
```
> *Extract the vector of tooth lengths for all guinea pigs that received 2 milligrams (the maximum dose) of vitamin C and via orange juice.*
```{r, eval=FALSE}
(tooth_lengths <- ToothGrowth[... & ..., ]$...)
```
> <details>
>
> <summary>💡 **NOTE: Welcome to the Tidyverse**</summary>
>
> *There's a very popular set of R packages called the "tidyverse"*[^2]*, which includes many functions designed to make the processing of (tabular) data in R easier (note that you'll learn more about R functions and packages below). The result is code that is often a lot more readable.*
>
> *For example, here's the usual "base" R expression for extracting a vector from a dataframe based on a certain condition: `universities[universities$students_k > 20, ]$city`. The corresponding tidyverse code looks like this: `universities %>% filter(students_k > 20) %>% pull(city)`.*
>
> *For simplicity, we'll mostly stick to base R functions and syntax for the remainder of the course -- but you'll surely encounter many useful tidyverse functions and examples as you become a more advanced R programmer.*
>
> </details>
[^2]: Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., ... Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software*, *4*(43), 1686. <https://doi.org/10.21105/joss.01686>
## 1.5 Functions
- R can do a lot more than simple mathematical operations
- Predefined **functions** exist for many types of tasks, e.g., taking the mean of a vector of numbers:
```{r}
my_vector <- c(2, 4, 6, 8)
mean(my_vector)
```
- Here's how we call the different bits and pieces when using a function:
![](images/function.png){width="400"}
- Each function has a unique **name**
- **Arguments** are the inputs that we want the function to do something with
- Each argument has the form of `argument_name = value`
- The first argument of a function is often the data -- it's common to leave its name out (making it a *positional* argument)
- Arguments are separated from each other by commas
- The **return value** is the output that the function gives back to us
> **✍️ EXERCISE**
>
> *Intuitively, one might have thought that the way to compute the mean of four numbers would be `mean(2, 4, 6, 8)`. Why doesn't this work?*
> <details>
>
> <summary>💡 **NOTE: R can be confusing**</summary>
>
> *While not the case for `mean()`, some functions indeed do take an arbitrary number of positional arguments. An example is `paste()`, which pastes together multiple character strings into one long character string.*
>
> *You can check the help file of a function (more on that below) to see if a function behaves like `mean()`, with a single first element (typically called `x`), or like `paste()`, with an arbitrary number of positional input arguments (indicated by `...` in the help file).*
## 1.6 Getting help
- To learn more about a function -- its arguments, return value, examples, etc.:
```{r}
help(mean)
```
- Or via this shortcut:
```{r}
?mean
```
- But note that this only helps when you know (part of) the function name -- if not, try to ask Google (e.g., "averaging numbers in R")
> **✍️ EXERCISE**
>
> *Find out the R function to create a vector of random numbers from a normal distribution. Once you know the name of the function, use it to create a vector of 100 numbers with a mean 500 and a standard deviation of 50.*
```{r, eval=FALSE}
(random_numbers <- ...)
```
## 1.7 Packages
- R comes with "batteries included" -- it has functions for many different tasks
- For other, more specialized tasks, functions are often available in additional **packages** that can be downloaded for free from the internet
- We first need to download and install the package -- this only needs to be done once
```{r, eval=FALSE}
install.packages("cowsay")
```
- Next we load the package -- this needs to be done every time we restart our R session
```{r}
library("cowsay")
```
- Finally, we can use one of the functions from the package:
```{r}
say("What a cool function!")
```
- We can also skip the `library("package_name")` step and directly specify the package of the function:
```{r}
cowsay::say("This also works!", by = "cow")
```
## Further reading
- McNeill, M. (2015). *Base R cheatsheet*. RStudio cheatsheets. <https://github.com/rstudio/cheatsheets/blob/main/base-r.pdf>
- Navarro, D. (2018). Getting started with R. In *Learning statistics with R: A tutorial for psychology students and other beginners* (pp. 37--71). <https://learningstatisticswithr.com/lsr-0.6.pdf>
## Add-on topics
### 1.8 Custom functions
- In addition to the functions in base R and published packages, we can write our own function(s):
```{r}
say_dude <- function(what, by = "cat") {
what <- paste0(what, ", dude!")
cowsay::say(what, by)
}
say_dude("Nice function")
```
- Short functions like this can also be defined on a single line:
```{r}
say_dude <- function(what, by = "cat") cowsay::say(paste0(what, ", dude!"))
```
> **✍️ EXERCISE**
>
> *Write a custom function that implements the Celsius-to-Fahrenheit conversion from Section 1.2.*
```{r, eval=FALSE}
celsius_to_fahrenheit <- ...
celsius_to_fahrenheit(19.3)
```
### 1.9 Repeating stuff
- We often want to **repeat the same operation** on multiple inputs
- Let's assume we have a couple of friends and want to create a friendly message for each of them (stored in a vector or list)
- The naive approach:
```{r}
messages <- character()
messages[1] <- paste("Hello", "Ezra")
messages[2] <- paste("Hello", "Tom")
messages[3] <- paste("Hello", "Samantha")
messages
```
- Better: Using a **for loop**
```{r}
friends <- c("Ezra", "Tom", "Samantha")
messages <- character()
for (i in 1:length(friends)) {
messages[i] <- paste("Hello", friends[i])
}
messages
```
- Even better: **Applying a function** to each element
```{r}
friends <- c("Ezra", "Tom", "Samantha")
greet <- function(x) paste("Hello", x)
(messages <- lapply(friends, greet))
```
- Best of all: Making use of R's **vectorization** of inputs (but this only works for simple functions)
```{r}
friends <- c("Ezra", "Tom", "Samantha")
(messages <- paste("Hello", friends))
```