/
06-vectors.qmd
401 lines (332 loc) · 11.9 KB
/
06-vectors.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
# Vectors
**R is a vectorized language** with built-in arithmetic and relational operators, mathematical functions and types of number. It means that each mathematical function and operator works on a set of data rather than a single scalar value as traditional computer language do, and thus __formal looping can _usually_ be avoided__.
## Different ways of defining a vector
If you want to define a vector by specifying the values, use the function `c()`{.R}, like so:
- Here, `x` is a vector of doubles:
```{r, warnings=FALSE}
x <- c(1, 5, 3, 12, 4.2)
x
```
- But here, `x` is converted to a vector of strings because it contains a string:
```{r, warnings=FALSE}
x <- c(1, 5, 3, "hello")
x
```
- To define a sequence of increasing numbers, either use the notation `start:end` for a sequence going from `start` to `end` by step of 1, or use the `seq()`{.R} function that is more versatile:
```{r, warnings=FALSE}
1:10
seq(-10, 10, by = .5)
seq(-10, 10, length = 6)
seq(-10, 10, along = x)
```
- To repeat values, use the `rep()`{.R} function:
```{r, warnings=FALSE}
rep(0, 10)
rep(c(0, 2), 5)
rep(c(0, 2), each = 5)
```
- To create vectors of random numbers, use `rnorm()`{.R} or `runif()`{.R} for normally or uniformly distributed numbers, respectively:
```{r, warnings=FALSE}
# vector with 10 random values normally distributed around mean
# with given standard deviation `sd`
rnorm(10, mean=3, sd=1)
# vector with 10 random values uniformly distributed between min and max
runif(10, min = 0, max = 1)
```
## Numerical and categorical data types
Data can be of two different types: numerical, or categorical. Let's say you are measuring a the temperature in a room and recording its value over time:
```{r}
T1 <- c(22.3, 23.5, 26.0, 30.2)
```
`T1` is a vector containing **numerical** data.
Let's say that now you are recording the temperature level, which can be `low`, `high` or `medium`
```{r}
T2 <- c("low", "low", "medium", "high")
```
`T2` is a vector containing **categorical** data, _i.e._ the data in this example can fall into either of 3 categories. For now, `T2` is however a vector of strings, and we need to tell R that it contains categorical data by using the function `factor()`{.R}:
```{r}
T2 <- factor(T2)
T2
```
We see here that we now have 3 levels, and a numerization of `T2` leads to obtaining the numbers 1, 2 and 3 according to the levels in `T2`:
```{r}
as.numeric(T2)
```
Numerical data can be converted to factors in the same way -- this can be useful sometimes, _e.g._ when plotting with `ggplot` as we will see later:
```{r}
factor(T1)
```
## Principal operations on vectors
### Mathematical operations
Because R is a vectorized language, you don't need to loop on all elements of a vector to perform element-wise operations on it. Let's say that `x <- 1:6`{.R}, then:
```{r, include=FALSE}
x <- 1:6
```
- Addition of a value to all elements:
```{r}
x + 2.5
```
- Multiplication / division of all elements:
```{r}
x*2
```
- Integer division:
```{r}
x %/% 3
```
- Math functions apply on all elements:
```{r}
sqrt(abs(cos(x)))
```
- Power:
```{r}
x^2.5
```
- Multiplication of vectors of the same size is performed element by element:
```{r}
y <- c(2.3, 5, 7, 10, 12, 20)
x*y
```
- Multiplication of vectors of different sizes: the smaller vector is automatically repeated the number of times needed to get a vector of the size of the larger one. It will work also if the longer object length is not a multiple of shorter object length, but the shorter object will be truncated and you'll get an error:
```{r}
x*1:2
x*1:4
```
- Modulo:
```{r}
x %% 2
```
- Outer product of vectors (the result is a matrix):
```{r}
x %o% y
```
### Accessing values
Let's work on this vector `x`:
```{r}
x <- c(5, 3, 4, 9, 3)
```
#### Accessing values by index
- To access values of a vector `x`, use the `x[i]` notation, where `i` is the index you want to access. `i` can be (and in fact, is always) a vector.
:::{.callout-warning}
## Attention
In R, indexes numbering start at **1** !!!
:::
```{r, warnings=FALSE}
# accessing by indexes
x[3]
# access index 1, 5 and 2
x[c(1,5,2)]
```
- To remove elements `j` from a vector `x`, use the notation `x[-j]`{.R}:
```{r}
# remove elements 1 and 3
x[-c(1,3)]
```
:::{.callout-warning}
## Attention
Writing `x[-c(1,3)]`{.R} will **just print** the result of `x[-c(1,3)]`{.R}, but not actually modify `x`.
To really modify `x`, you'd need to write `x <- x[-c(1,3)]`{.R}.
:::
#### Filtering values with tests
You can access values with booleans. Values getting a `TRUE`{.R} will be kept, while values with a `FALSE`{.R} will be discarded – or return a `NA` if a `TRUE` is given to a non existing value (*i.e.* to an index larger than the size of the vector):
```{r}
x
x[c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE)]
```
Therefore, you can apply tests on your values and filter them very easily:
```{r}
x > 4 # is a vector of booleans
x[x > 4] # is a filtered vector
```
#### Accessing values by name
Finally, values in vectors can be **named**, and thus can be accessed by their name:
```{r, warnings=FALSE}
y <- c(age=32, name="John", pet="Cat")
y
y[c('age','pet')] # prints a named vector
y[['name']] # prints an un-named vector
```
And you can access the names of the vector using `names()`{.R}:
```{r, warnings=FALSE}
names(y)
```
### Sorting, removing duplicates, sampling
- Sorting by ascending number:
```{r, warnings=FALSE}
sort(x)
```
- It works with strings too, but `stringr::str_sort()`{.R} might be needed for strings mixing letters and numbers:
```{r, warnings=FALSE}
sort(c("c","a","d","ab"))
sort(c("c", "a10", "a2", "d", "ab"))
stringr::str_sort(c("c", "a10", "a2", "d", "ab"), numeric = TRUE)
```
- Sorting by descending number:
```{r, warnings=FALSE}
sort(x, decreasing = TRUE)
```
- Inverting the order of the vector:
```{r}
rev(x)
```
- Find the order of the indexes of the sorting:
```{r, warnings=FALSE}
order(x)
x[order(x)] # is thus equivalent to `sort(x)`
```
- Find duplicates:
```{r, warnings=FALSE}
duplicated(x)
```
- Remove duplicates:
```{r, warnings=FALSE}
unique(x)
```
- Choose 3 random values:
```{r, warnings=FALSE}
sample(x, 3)
```
### Maximum and minimum
This is quite straightforward:
```{r, warnings=FALSE}
# maximum of x and its index
x; max(x); which.max(x)
# minimum of x and its index
x; min(x); which.min(x)
# range of a vector
range(x)
```
### Characteristics of vectors
- Size of a vector:
```{r, warnings=FALSE}
length(x)
```
- Statistics on vector:
```{r, warnings=FALSE}
summary(x)
```
- Sum of all terms:
```{r, warnings=FALSE}
sum(x)
```
- Average value:
```{r, warnings=FALSE}
mean(x)
```
- Median value:
```{r, warnings=FALSE}
median(x)
```
- Standard deviation:
```{r, warnings=FALSE}
sd(x)
```
- Count occurrence of values:
```{r, warnings=FALSE}
table(x)
```
- Cumulative sum:
```{r}
cumsum(x)
```
- Term-by-term difference
```{r}
diff(x)
```
### Concatenation of vectors
This is done using the `c()` notation: you basically create a vector of vectors, the result being a new vector:
```{r, warnings=FALSE}
# concatenate vectors
z <- c(-1:4, NA, -x); z
```
Another option is to use the `append()`{.R} function that allows for more options:
```{r, warnings=FALSE}
# append values
append(x, 4) # at the end
append(x, 4:1, 3) # or after a given index
```
## Exercises {#exo-vectors}
<a href="Data/exo-in-class.zip" download target="_blank">Download the archive with all the exercises files</a>, unzip it in your `R class` RStudio project, and edit the R files.
---
<details>
<summary>**Exercise 1**</summary>
Let's create a named vector containing age of the students in the class, the names of each value being the first name of the students. Then:
- Compute the average age and its standard deviation
- Compute the median age
- What is the maximum, minimum and range of the ages in the class?
- What are all the student names in the class?
- Print the sorted ages by increasing and decreasing order
- Print the ages sorted by alphabetically ordered names (increasing and decreasing)
- Show a histogram of the ages distribution using `hist()`{.R} and play with the parameter `breaks` to modify the histogram
- Show a boxplot of the ages distribution using `boxplot()`{.R}
</details>
<details>
<summary>**Exercise 2**</summary>
This exercise is adapted from [here](https://perso.esiee.fr/~courivad/R/05-vectors.html#application-population-in-french-cities-1962-2012).
Open Rstudio and create a new R script, save it as `population.R` in your wanted directory, say `Rcourse/`.
<a href="Data/population.csv" download target="_blank">Download</a> the `population.csv` file and save it in your working directory.
A csv file contains raw data stored as plain text and separated by a comma (Comma Separated Values). Open it with Rstudio.
We can of course directly load such file with R and store its data in an appropriate format (_i.e._ a `data.frame`), but this is for the next chapter. For now, just copy-paste the text in the Rstudio script area to:
- Create a `cities` vector containing all the cities listed in `population.csv`
- Create a `pop_1962` and `pop_2012` vectors containing the populations of each city at these years. Print the 2 vectors.
- Use `names()`{.R} to name values of `pop_1962` and `pop_2012`. Print the 2 vectors again. Are there any change?
- What are the cities with more than 200000 people in 1962? For these, how many residents in 2012?
- What is the population evolution of Montpellier and Nantes?
- Create a `pop_diff` vector to store population change between 1962 and 2012
- Print cities with a negative change
- Print cities which broke the 300000 people barrier between 1962 and 2012
- Compute the total change in population of the 10 largest cities (as of 1962) between 1962 and 2012.
- Compute the population mean for year 1962
- Compute the population mean of Paris over these two years
- Sort the cities by decreasing order of population for 1962
<details>
<summary>Solution</summary>
```{r, warnings=FALSE}
# Create a `cities` vector containing all the cities listed in `population.csv`
cities <- c("Angers", "Bordeaux", "Brest", "Dijon", "Grenoble", "Le Havre",
"Le Mans", "Lille", "Lyon", "Marseille", "Montpellier", "Nantes",
"Nice", "Paris", "Reims", "Rennes", "Saint-Etienne", "Strasbourg",
"Toulon", "Toulouse")
# Create a `pop_1962` and `pop_2012` vectors containing the populations
# of each city at these years. Print the 2 vectors.
pop_1962 <- c(115273,278403,136104,135694,156707,187845,132181,239955,
535746,778071,118864,240048,292958,2790091,134856,151948,
210311,228971,161797,323724)
pop_2012 <- c(149017,241287,139676,152071,158346,173142,143599,228652,
496343,852516,268456,291604,343629,2240621,181893,209860,
171483,274394,164899,453317)
pop_1962; pop_2012
# Use names() to name values of `pop_1962` and `pop_2012`.
# Print the 2 vectors again. Are there any change?
names(pop_2012) <- names(pop_1962) <- cities
pop_1962; pop_2012
# What are the cities with more than 200000 people in 1962?
# For these, how many residents in 2012?
cities200k <- cities[pop_1962>200000]
cities200k; pop_2012[cities200k]
# What is the population evolution of Montpellier and Nantes?
pop_2012['Montpellier'] - pop_1962['Montpellier']; pop_2012['Nantes'] - pop_1962['Nantes']
# Create a `pop_diff` vector to store population change between 1962 and 2012
pop_diff <- pop_2012 - pop_1962
# Print cities with a negative change
cities[pop_diff<0]
# Print cities which broke the 300000 people barrier between 1962 and 2012
cities[pop_2012>300000 & pop_1962<300000]
# Compute the total change in population of the 10 largest cities
# (as of 1962) between 1962 and 2012.
ten_largest <- cities[order(pop_1962, decreasing = TRUE)[1:10]]
sum(pop_2012[ten_largest] - pop_1962[ten_largest])
# Compute the population mean for year 1962
mean(pop_1962)
# Compute the population mean of Paris
mean(c(pop_1962['Paris'], pop_2012['Paris']))
# Sort the cities by decreasing order of population for 1962
(pop_1962_sorted <- sort(pop_1962, decreasing = TRUE))
```
</details>
</details>
<br>
<br>
<br>
<br>
<br>