/
07-dataframes.qmd
796 lines (646 loc) · 25.2 KB
/
07-dataframes.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
# Data frames
## Defining a data.frame
### Defining a data.frame from vectors
In R, the principal object is *the data*. Hence the `data.frame`{.R} object, which is basically a table of vectors. A `data.frame`{.R} is a list presented under the form of a table – *i.e.* a spreadsheet. On a day-to-day basis, you will either define `data.frame`{.R} from existing vectors or other `data.frame`{.R}, or define a `data.frame`{.R} from a file (text, Excel...). In this example, we use <a href="Data/test.dat" download target="_blank">test.dat</a> and <a href="Data/test.xlsx" download target="_blank">test.xlsx</a>.
To define a `data.frame` from known vectors, we just have to do:
```{r, warnings=FALSE, message=FALSE}
x <- seq(-pi, pi, length = 6)
y <- sin(x)
df <- data.frame(x, y) # df is a data.frame (a table)
df
```
Then some information about our table are readily accessible, like:
- The dimension, the number of rows and columns:
```{r, warnings=FALSE, message=FALSE}
dim(df); nrow(df); ncol(df)
```
- Print the first and last 3 values
```{r, warnings=FALSE, message=FALSE}
head(df, 3); tail(df, 3)
```
- Print some information on `df`
```{r, warnings=FALSE, message=FALSE}
str(df)
```
- Print some statistics on `df`
```{r, warnings=FALSE, message=FALSE}
summary(df)
```
If not defined when creating the `data.frame`, the column names will be by default the vector names. To specify your own column names, do it when creating the `data.frame`:
```{r}
df <- data.frame(xxx = x, yyy = y)
head(df, 2)
```
Or, once it's created, do it using `names()`{.R}
```{r}
names(df)
names(df) <- c("X", "Y")
head(df, 2)
```
### Defining a data.frame from a file
#### A text file
Let's say we have `test.dat` that looks like this:
```{bash}
# Bash code:
head Data/test.dat
```
- Then, to read this file into a `data.frame`, we will use `read.table()`{.R}. If you don't specify that the file contains a header, `read.table()`{.R} will default to attributing column names that will be V1, V2, V3, etc:
```{r}
read.table("Data/test.dat")
```
- If your file contains column names, you can use the first line as column names, like so:
```{r}
read.table("Data/test.dat", header=TRUE)
```
- If you want to skip some lines before starting the reading, use `skip`:
```{r}
read.table("Data/test.dat", skip=1)
```
- You can specify your own column names using `col.names`:
```{r}
read.table("Data/test.dat", skip=1, col.names = c("A","B"))
```
- You can type `?read.table` for more options.
#### An Excel file
Now, to read an Excel file, use the `readxl` library:
```{r}
library(readxl) # load readxl from tidyverse to read Excel files
read_excel("Data/test.xlsx", sheet=1)
read_excel("Data/test.xlsx", sheet=2)
```
---
## Accessing values
- Like with vectors, accessing values is done using the `[]` notation, except that here there are two indexes: `df[row, column]`:
```{r, warnings=FALSE}
df[3,1]
```
- In general however, what you want is to access a given column, by its index:
```{r, warnings=FALSE}
df[,1]
df[[1]]# this is a vector too
```
- Or, **preferably**, by its name using the `$` notation:
```{r, warnings=FALSE}
df$X
```
- Finally, you may want to apply filters on your table:
```{r, warnings=FALSE}
df[df$X < 0, ]
```
- Using the function `subset()`{.R}, the conditions are applied on column names (no need for `df$col_name` here, while you need it in the above expression):
```{r, warnings=FALSE}
subset(df, X>1)
subset(df, X>1, select = c(X))
```
---
## Adding columns or rows
### Adding columns
- To add a column, just attribute a value to an column that do not exist yet, it will be created:
```{r, warnings=FALSE}
# Adding columns
df <- data.frame(x,y)
df$z <- df$x^2
df
```
- You can also create a `data.frame` of a `data.frame`:
```{r, warnings=FALSE}
data.frame(df, z=df$x^2, u=cos(df$x))
```
- Finally, you can use the `cbind()`{.R} function to bind two `data.frame` column-wise:
```{r, warnings=FALSE}
df2 <- data.frame(a = 1:length(x), b = 1:length(x))
cbind(df, df2)
```
### Adding rows
For this, use the `rbind()`{.R} function.
:::{.callout-warning}
## Attention
The two `data.frame` must have the same number of columns _**and**_ the same column names.
:::
```{r, warnings=FALSE}
rbind(df, df)
```
### Deleting rows/columns
This works like with vectors:
```{r, warnings=FALSE}
df[-1,]
df[,-1]
```
---
## Tidy up!
### What is tidy data?
**A good practice in R is to *tidy* your data.** R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules:
- Each variable in the data set is placed in its own column
- Each observation is placed in its own row
- Each value is placed in its own cell
```{r tidy, echo=FALSE, fig.cap="Illustration of tidy data.", fig.align="center", out.width="100%"}
knitr::include_graphics("./Plots/tidy.png")
```
Data that satisfies these rules is known as tidy data: you see that thanks to this representation, a 2D table can handle an arbitrary number of variables – this avoids using multi-dimensional arrays or multi-tab Excel documents. Note that it does't matter if a value is repeated in a column.
Here is an example:
```r
df <- read.csv("Data/population.csv")
df # is not tidy
```
<details>
<summary>Show output</summary>
```{r echo=FALSE}
df <- read.csv("Data/population.csv")
df # is not tidy
```
</details>
<br>
```r
library(tidyr)
df <- pivot_longer(df, cols=-year, names_to="city", values_to="pop")
df #is tidy
```
<details>
<summary>Show output</summary>
```{r echo=FALSE, message=FALSE}
library(tidyr)
df <- pivot_longer(df, cols=-year, names_to="city", values_to="pop")
df #is tidy
```
</details>
<br>
```r
# is not tidy
pivot_wider(df, names_from="city", values_from="pop")
```
<details>
<summary>Show output</summary>
```{r echo=FALSE}
# is not tidy
pivot_wider(df, names_from="city", values_from="pop")
```
</details>
<br>
You can find more information on data import and [tidyness](https://garrettgman.github.io/tidying/) on the [data-import cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf) and on the [tidyr](http://www.sthda.com/english/wiki/tidyr-crucial-step-reshaping-data-with-r-for-easier-analyses) package.
```{r pivotgif, echo = FALSE, fig.cap = "Understanding long and wide data with an animation. Source: [tidyexplain](https://github.com/gadenbuie/tidyexplain)", fig.align="center", out.width="70%"}
knitr::include_graphics('./Plots/tidyr-pivoting.gif')
```
### Tibbles
A `tibble`{.R} is an enhanced version of the `data.frame`{.R} provided by the `tibble` package (which is part of the `tidyverse`). The main advantage of `tibble`{.R} is that it has easier initialization and nicer printing than `data.frame`{.R}.
Moreover, the performance are also enhanced for the reading from files with `read_csv()`{.R}, `read_tsv()`{.R}, `read_table()`{.R} and `read_delim()`{.R} that do the same things as their `read.xx()` counterparts and return a `tibble`. Otherwise, the handling is basically the same.
More on tibbles [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html).
```{r, include=FALSE}
rm(x)
```
- Note that when initializing `tibble`s, the construction is iterative. It means that when creating a second column, one can refer to the first one that was created. This does't work with `data.frame`s.
```{r, error=TRUE, warnings=FALSE, message=FALSE}
# won't work unless a `x` vector was created before
data.frame(x=runif(1e3), y=cumsum(x))
library(tidyverse)
tib <- tibble(x=runif(1e3), y=cumsum(x))
tib
```
:::{.callout-warning}
## Attention
:::
`Tibble`s are quite strict about subsetting. `[` always returns another `tibble`. Contrast this with a data frame: sometimes `[` returns a data frame and sometimes it just returns a vector:
```{r}
head(tib[[1]]) # is a vector
head(tib[,1]) # is a tibble
```
Unless you want to get a `tibble`, I recommend always using the `$` notation when you want to get a column as a vector to avoid problems.
- Another interesting feature of `tibble`s is that their columns can contain vectors, like usual, but also lists of any R objects like other tibbles, `nls()`{.R} objects, etc. This is called "nesting", and you can nest and un-nest `tibble`s using these explicit functions:
```{r warning=FALSE, message=FALSE}
tib1 <- tibble(x=1:3, y=1:3)
tib2 <- tibble(x=1:5, y=1:5)
tib <- tibble(number=1:2, data=list(tib1, tib2))
tib
```
```{r}
#| label: fig-excel-nested
#| echo: false
#| fig-cap: "Excel equivalent to a nested `tibble`."
#| out-width: 50%
knitr::include_graphics("./Plots/excel-nested.png")
```
```{r warning=FALSE, message=FALSE}
tib_unnested <- unnest(tib, data)
tib_unnested
tib_unnested_renested <- nest(tib_unnested, data = c(number, y))
tib_unnested_renested
tib_unnested_renested$data # The `data` column is a list
```
---
## Operations in the tidyverse
In the end, base R and the `tidyverse` package provide many efficient functions to perform most of the tasks you would want to perform recursively, thus allowing avoiding explicit for loops (that are slow).
Here are some examples, and you will find much more [here](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html). Take a look at the cheatsheets on [tidyr](https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf) and on [dplyr](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf), it's really helpful.
Let's work on this `tibble`:
```{r, message=FALSE}
# Let's create a random tibble
library(tidyverse)
N <- 500
dt <- tibble(x = rep(runif(N, -1, 1), 3),
y = runif(N*3, -1, 1),
signx = ifelse(x>0, "positive", "negative"),
signy = ifelse(y>0, "positive", "negative")
)
dt
```
### The pipe operator
In the following, we will introduce the pipe operator `|>`{.R}, that was introduced in the version 4.1 of R.
This operator allows a clear syntax for successive operations, as "what is on the left of the operator is given as first argument of what is on the right". It is thus a good habit to write each operation on a separate line to facilitate the reading. This is particularly helpful when performing multiple nested operations. For example, `summary(head(tail(dt),2))`{.R}, which is hard to read, would translate to:
```r
dt |>
tail() |>
head(2) |>
summary()
```
Note that before the 4.1 version of R, the pipe operator was only present thanks to the `magrittr` package, and was written `%>%`{.R}. In `magrittr`'s pipe, retrieving the piped object is done with the operator `.`, while it is done with the operator `_` in base R's.
Silly example:
```{r, error=TRUE}
"Hello" |> gsub("o", "e") # replace the substring "Hello" by "o" in string "e"
"Hello" |> gsub("o", "e", x=_) # replace the substring "o" by "e" in string "Hello"
"Hello" %>% gsub("o", "e") # replace the substring "Hello" by "o" in string "e"
# /!\ In base R pipe, the '_' needs to be for a named argument,
# while it is not necessary for the '%>%' pipe
"Hello" %>% gsub("o", "e", .) # replace the substring "o" by "e" in string "Hello"
```
### Sampling data
```{r, message=FALSE}
dt |> slice(1:3) # by index
dt |> sample_n(3) # randomly
```
### Operations on groups of a variable
`group_by(column)`{.R} groups by similar values of the wanted column(s) and performs the next operations on each element of the group successively.
```{r, message=FALSE}
dt |>
group_by(signx)
dt |>
group_by(signx) |>
sample_n(3)
dt |>
group_by(signx, signy) |>
sample_n(3)
```
### Summary by groups of a variable
`summarise()`{.R} returns a **single value** for each element of the groups.
```{r, message=FALSE}
dt |>
group_by(signx) |>
summarise(count = n(),
mean_x = mean(x),
sd_x = sd(x))
dt |>
group_by(signx, signy) |>
summarise(count = n(),
mean_x = mean(x),
mean_y = mean(y))
```
### Sorting
```{r, message=FALSE}
dt |> arrange(x)
dt |> arrange(x, desc(y))
```
### Merge tables column-wise
At least one column with the **exact** same name must be present in each table to use the `xx_join()`{.R} functions. There are more possibilities than `inner_join()`{.R} that I show here, see the help for more information.
```{r, message=FALSE}
dt2 <- tibble(signx=c("positive","positive","negative","negative"),
signy=c("positive","negative","positive","negative"),
value=c(TRUE, FALSE, FALSE, TRUE))
dt2
inner_join(dt, dt2)
```
### Merge tables row-wise
This works even if there are missing rows.
```{r, message=FALSE}
dt3 <- tibble(a=1:3, b=3:5, c=6:8)
dt4 <- tibble(a=3:1, c=3:5)
bind_rows(dt3, dt4)
```
### Add/modify a column
`mutate()`{.R}, like `$`{.R}, adds a column if it doesn't exist, and modifies it if it does.
```{r, message=FALSE}
dt |> mutate(w=seq_along(x), z=sin(x))
dt |> mutate(x=seq_along(x))
```
### Selecting columns
```{r, message=FALSE}
dt |> select(x) # only x
dt |> select(-x) # all but x
dt |> select(starts_with("sign"))
dt |> select(contains("x"))
```
### Filtering columns
```{r, message=FALSE}
dt |> filter(signx=="positive")
dt |> filter(x<0, y>.1) # multiple filters can be applied at once
```
### Reorder columns
```{r, message=FALSE}
dt |> relocate(y, .after = signy)
```
### Separate columns
The separation is based on standard separators such as "-", "\_", ".", " ", etc. A single separator can be specified with the argument `sep`, otherwise all separators are used. One must provide the resulting vector of new column names: if one value is `NA`{.R}, this column will be discarded. Examples:
```{r, message=FALSE}
dt5 <- tibble(file=list.files(path="Exo/FTIR/Data/", pattern=".xls"))
dt5
dt5 |> separate(file, c(NA, "sample", "temperature", NA), convert = TRUE)
dt5 |> separate(file,
c("name", "extension"),
sep = "\\.")
```
### Apply a function recursively on each element of a column
Take a look at the [cheatsheet on the `purrr` package](https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf) for more options and a visual help on the `map()` family. I show here a use of `purrr::map(vector, function)`{.R} that returns a list. `map(x, f)`{.R} applies the function `f()` to each element of the vector `x`, putting the result in a separate element of a list: `map(x, f) -> list(f(x1), f(x2), ... f(xn))`{.R}. In case `f(xi)` returns a single value, you might want to use `map_dbl()` or `map_chr()`, for example, that will return a vector of doubles or of characters, respectively.
```{r, message=FALSE}
x <- c(pi, pi/3, pi/2)
map(x, sin) # returns a list
x |> map_dbl(sin) # returns a vector
```
Of course, in the above case, it's a stupid use of the power of `map()`{.R}. A typical use case is when you want to read multiple files, for example:
```{r, message=FALSE, warning=FALSE}
dt6 <- tibble(file=list.files(path="Exo/spectro2/Data",
pattern = ".txt",
full.names = TRUE)) |>
slice(1:5) |>
mutate(data = map(file, read_table, col_names = c("w", "Int"))) |>
mutate(file = basename(file))
dt6
```
This is (almost) equivalent to:
```{r, message=FALSE, warning=FALSE}
dt6 <- tibble(file=list.files(path="Exo/spectro2/Data",
pattern = ".txt",
full.names = TRUE)) |>
slice(1:5) |>
mutate(data = map(file, ~read_table(., col_names = c("w", "Int")))) |>
mutate(file = basename(file))
```
You see that you can create the function directly within the call to map using the shortcut `map(vector, ~ function(.))`{.R}. This is useful to provide more arguments to the function -- another solution is to write your own function before the call to `map()`{.R} and then call this function in `map()`{.R}.
Note that in case you need more parameters, you can use `purrr::map2(vector1, vector2, ~function(.x, .y))`{.R}, where `.x` and `.y` refer to `vector1` and `vector2`, respectively (it's always `.x` and `.y` whatever the name of `vector1` and `vector2`).
```{r}
tibble(x=1:3, y=5:7) |>
mutate(sum = map2_dbl(x, y, sum))
tibble(a=list(tibble(x=1:3, y=5:7),
tibble(x=0:3, y=4:7)),
b=list(tibble(x=10:13, y=15:18),
tibble(x=-1:2, y=-14:-17))) |>
mutate(sumx = map2_dbl(a, b, ~sum(.x$x, .y$x)),
sumy = map2_dbl(a, b, ~sum(.x$y, .y$y)))
```
### Nesting and un-nesting data
```{r, message=FALSE}
dt7 <- dt6 |>
mutate(file = basename(file)) |>
unnest(data)
dt7
# Nesting data per repeated values in a column (~equivalent to grouping)
dt7 |> nest(data=-file)
```
### Providing data to ggplot
```{r, message=FALSE}
dt |>
filter(abs(y) > 0.1) |>
ggplot(aes(x=x, y=y, color=signy))+
geom_point()
```
## Exercises {#exo-df}
<a href="Data/exo-in-class.zip" download target="_blank">Download the archive with all the exercises files</a>, unzip it in your `R class` RStudio project, and edit the R files.
---
<details>
<summary>**Exercise 1**</summary>
- Create a 3 column `data.frame`{.R} containing 10 random values, their sinus, and the sum of the two first columns.
- Print the 4 first lines of the table
- Print the second column
- Print the average of the third column
- Using `plot(x,y)`{.R} where `x` and `y` are vectors, plot the 2nd column as a function of the first
- Look into the function `write.table()`{.R} to write a text file containing this `data.frame`{.R}
- Do the all the same things with a `tibble`{.R}
<details>
<summary>Solution</summary>
```{r, warnings=FALSE}
# Create a 3 column `data.frame`{.R} containing 10 random values, their sinus,
# and the sum of the two first columns.
x <- runif(10)
y <- sin(x)
z <- x + y
df <- data.frame(x=x, y=y, z=z)
# Print the 4 first lines of the table
head(df, 4)
# Print the second column
df[,2]
# Print the average of the third column
mean(df$z); mean(df[3]); mean(df[,3])
# Using `plot(x,y)`{.R} where `x` and `y` are vectors,
# plot the 2nd column as a function of the first
plot(df[,1], df[,2])
plot(df$x, df$y)
# Look into the function `write.table()`{.R} to write a text file
# containing this `data.frame`{.R}
write.table(df, "Data/some_data.dat", quote = FALSE, row.names = FALSE)
# # # # # # # # # # # # # # # # #
# Tibble version
library(tidyverse)
df_tib <- tibble(a = runif(10), b = sin(a), c = a + b)
head(df_tib, 4)
df_tib[,2]; df_tib[[2]];
mean(df_tib$c); mean(df_tib[3]); mean(df_tib[,3]); mean(df_tib[[3]])
write.table(df_tib, "Data/some_data.dat", quote = FALSE, row.names = FALSE)
plot(df_tib$a, df_tib$b)
```
</details>
</details>
<details>
<summary>**Exercise 2**</summary>
- Download the files:
- <a href="Data/rubis_01.txt" download target="_blank">rubis_01.txt</a>
- <a href="Data/population.csv" download target="_blank">population.csv</a>
- <a href="Data/FTIR_rocks.xlsx" download target="_blank">FTIR_rocks.xlsx</a>
- Load them into separate `data.frames`{.R}. Look into the options of `read.table()`{.R}, `read.csv()`{.R}, `readxl::read_excel()`{.R}, to get the proper data fields.
- Add column names to the data.frame containing `rubis_01.txt`.
- Print their dimensions.
- Do the same things with tibbles.
<details>
<summary>Solution</summary>
```{r, warnings=FALSE, message=FALSE}
rubis_01 <- read.table("Data/rubis_01.txt", col.names = c("w", "intensity"))
population <- read.csv("Data/population.csv")
FTIR_rocks <- readxl::read_excel("Data/FTIR_rocks.xlsx")
dim(rubis_01); names(rubis_01)
dim(population); names(population)
dim(FTIR_rocks); names(FTIR_rocks)
library(tidyverse)
rubis_01 <- read_table("Data/rubis_01.txt", col_names = c("w", "intensity"))
population <- read_csv("Data/population.csv")
```
</details>
</details>
<details>
<summary>**Exercise 3**</summary>
- Download the TGA data file <a href="Data/ATG.txt" download target="_blank">ATG.txt</a>
- Load it into a `data.frame`{.R}. Look into the options of `read.table()`{.R} to get the proper data fields.
- Do the same with a tibble
<details>
<summary>Solution</summary>
```{r}
d <- read.table("Data/ATG.txt",
skip=12,
header=FALSE,
nrows=4088)
names(d) <- c("Index", "t", "Ts", "Tr", "Value")
head(d)
d <- read.table("Data/ATG.txt",
skip=10,
comment.char="[",
header=TRUE,
nrows=4088)
head(d)
library(tidyverse)
d <- read_table("Data/ATG.txt",
skip = 10,
comment = "[") |>
drop_na()
d
```
</details>
</details>
<details>
<summary>**Exercise 4**</summary>
Download <a href="Data/population.csv" download target="_blank">population.csv</a> and load it into a `tibble`{.R}.
- What are the names of the columns?
- Are the data tidy? make the table tidy if needed
- Create a subset containing the data for Montpellier
+ What is the max and min of population in this city?
+ The average population over time?
- What is the total population in 2012?
- What is the total population per year?
- What is the average population per city over the years?
<details>
<summary>Solution</summary>
```{r}
# Download population.txt and load it into a `data.frame`{.R}.
library(tidyverse)
popul <- read_csv("Data/population.csv")
# What are the names of the columns and the dimension of the table?
names(popul); dim(popul)
# Are the data tidy?
head(popul) # no
popul.tidy <- popul |>
pivot_longer(cols=-year,
names_to = "city",
values_to = "pop"
)
popul.tidy
# Create a subset containing the data for Montpellier
mtp <- subset(popul.tidy, city == "Montpellier")
# I prefer the tidyverse version
mtp <- popul.tidy |> filter(city == "Montpellier")
# What is the max and min of population in this city?
max(mtp$pop)
min(mtp$pop)
range(mtp$pop)
# The average population over time?
mean(mtp$pop)
# What is the total population in 2012?
sum(popul.tidy[popul.tidy$year == 2012, "pop"])
popul.tidy |>
filter(year==2012) |>
select(pop) |>
sum()
# What is the total population per year?
popul.tidy |>
group_by(year) |>
summarise(pop_tot=sum(pop))
# What is the average population per city over the years?
popul.tidy |>
group_by(city) |>
summarise(pop_ave=mean(pop))
```
</details>
</details>
<details>
<summary>**Exercise 5**</summary>
- First, load the `tidyverse` and `lubridate` package
- Load <a href="Data/people1.csv" download target="_blank">people1.csv</a> and <a href="Data/people2.csv" download target="_blank">people2.csv</a> into `pp1` and `pp2`
- Create a new tibble `pp` by using the pipe operator (`%>%`{.R}) and successively:
- joining the two tibbles into one using `inner_join()`{.R}
- adding a column `age` containing the age in years (use `lubridate::time_length(x, 'years')`{.R} with x a time difference in days) by using `mutate()`{.R}
- Display a summary of the table using `str()`{.R}
- Using `groupe_by()`{.R} and `summarize()`{.R}:
- Show the number of males and females in the table (use the counter `n()`{.R})
- Show the average age per gender
- Show the average size per gender and institution
- Show the number of people from each country, sorted by descending population (`arrange()`{.R})
- Using `select()`{.R}, display:
- only the name and age columns
- all but the name column
- Using `filter()`{.R}, show data only for
- Chinese people
- From institution ECL and UCBL
- People older than 22
- People with a `e` in their name
<details>
<summary>Solution</summary>
```{r, warnings=FALSE, message=FALSE}
# First, load the `tidyverse` package
library(tidyverse)
# Load people1.csv and people2.csv
pp1 <- read_csv("Data/people1.csv")
pp2 <- read_csv("Data/people2.csv")
# Create a new tibble `pp` by using the pipe operator (`%>%`)
# and successively:
# - joining the two tibbles into one using `inner_join()`
# - adding a column `age` containing the age in years
# (use lubridate's `time_length(x, 'years')` with x a time
# difference in days) by using `mutate()`
pp <- pp1 |>
inner_join(pp2) |>
mutate(age=time_length(today()-dateofbirth,'years'))
# Display a summary of the table using `str()`
str(pp)
# Using `groupe_by()` and `summarize()`:
# - Show the number of males and females in the table
# (use the counter `n()`)
pp |>
group_by(gender) |>
summarize(count=n())
# - Show the average age per gender
pp |>
group_by(gender) |>
summarize(age=mean(age))
# - Show the average size per gender and institution
pp |>
group_by(gender, institution) |>
summarize(size=mean(size))
# - Show the number of people from each country,
# sorted by descending population
pp |>
group_by(origin) |>
summarize(count=n()) |>
arrange(desc(count))
# Using `select()`, display:
# - only the name and age columns
pp |> select(c(name, age))
# - all but the name column
pp |> select(-name)
# Using `filter()`, show data only for
# - Chinese people
pp |> filter(origin=='China')
# - From institution ECL and UCBL
pp |> filter(institution %in% c('ECL', 'UCBL'))
# - People older than 22
pp |> filter(age>22)
# - People with a `e` in their name
pp |> filter(grepl('e',name))
```
</details>
----
For more interesting exercises in the tidyverse, look at:
- [CO2 emissions: data wrangling and ggplot2](Exo/co2/exercise.html)
- [Religion and babies: data handling, ggplot2 and plotly](Exo/religion_babies/exercise.html)
- [COVID-19: data wrangling, ggplot2](Exo/covid/exercise.html)
- [Nanoparticles statistics from SEM images: data wrangling, ggplot2 and fitting](Exo/SEM_particles/exercise.html)
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>