<!DOCTYPE html>
<html >
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Functional programming and unit testing for data munging with R</title>
<meta name="description" content="This book is an introduction to functional programming and unit testing with the R programming language, for the purpose of data munging">
<meta name="generator" content="bookdown 0.5 and GitBook 2.6.7">
<meta property="og:title" content="Functional programming and unit testing for data munging with R" />
<meta property="og:type" content="book" />
<meta property="og:description" content="This book is an introduction to functional programming and unit testing with the R programming language, for the purpose of data munging" />
<meta name="github-repo" content="b-rodrigues/fput" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Functional programming and unit testing for data munging with R" />
<meta name="twitter:description" content="This book is an introduction to functional programming and unit testing with the R programming language, for the purpose of data munging" />
<meta name="author" content="Bruno Rodrigues">
<meta name="date" content="2017-12-28">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<link rel="prev" href="unit-testing.html">
<link rel="next" href="references.html">
<script src="libs/jquery-2.2.3/jquery.min.js"></script>
<link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-bookdown.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link rel="stylesheet" href="style.css" type="text/css" />
</head>
<body>
<div class="book without-animation with-summary font-size-2 font-family-1" data-basepath=".">
<div class="book-summary">
<nav role="navigation">
<ul class="summary">
<li><a href="./">Functional programming and unit testing for data munging</a></li>
<li class="divider"></li>
<li class="chapter" data-level="1" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i><b>1</b> Why this book?</a><ul>
<li class="chapter" data-level="1.1" data-path="index.html"><a href="index.html#important-notice"><i class="fa fa-check"></i><b>1.1</b> Important notice</a></li>
<li class="chapter" data-level="1.2" data-path="index.html"><a href="index.html#motivation"><i class="fa fa-check"></i><b>1.2</b> Motivation</a></li>
<li class="chapter" data-level="1.3" data-path="index.html"><a href="index.html#who-am-i"><i class="fa fa-check"></i><b>1.3</b> Who am I?</a></li>
<li class="chapter" data-level="1.4" data-path="index.html"><a href="index.html#thanks"><i class="fa fa-check"></i><b>1.4</b> Thanks</a></li>
<li class="chapter" data-level="1.5" data-path="index.html"><a href="index.html#license"><i class="fa fa-check"></i><b>1.5</b> License</a></li>
</ul></li>
<li class="chapter" data-level="2" data-path="intro.html"><a href="intro.html"><i class="fa fa-check"></i><b>2</b> Introduction</a><ul>
<li class="chapter" data-level="2.1" data-path="intro.html"><a href="intro.html#get_r"><i class="fa fa-check"></i><b>2.1</b> Getting R</a></li>
<li class="chapter" data-level="2.2" data-path="intro.html"><a href="intro.html#fprog_overview"><i class="fa fa-check"></i><b>2.2</b> A short overview of functional programming</a></li>
<li class="chapter" data-level="2.3" data-path="intro.html"><a href="intro.html#unit_overview"><i class="fa fa-check"></i><b>2.3</b> A short overview of unit testing</a></li>
<li class="chapter" data-level="2.4" data-path="intro.html"><a href="intro.html#general-recommendations-to-follow-this-book"><i class="fa fa-check"></i><b>2.4</b> General recommendations to follow this book</a></li>
</ul></li>
<li class="chapter" data-level="3" data-path="fprog.html"><a href="fprog.html"><i class="fa fa-check"></i><b>3</b> Functional Programming</a><ul>
<li class="chapter" data-level="3.1" data-path="fprog.html"><a href="fprog.html#fprog_intro"><i class="fa fa-check"></i><b>3.1</b> Introduction</a><ul>
<li class="chapter" data-level="3.1.1" data-path="fprog.html"><a href="fprog.html#function-definitions"><i class="fa fa-check"></i><b>3.1.1</b> Function definitions</a></li>
<li class="chapter" data-level="3.1.2" data-path="fprog.html"><a href="fprog.html#properties-of-functions"><i class="fa fa-check"></i><b>3.1.2</b> Properties of functions</a></li>
</ul></li>
<li class="chapter" data-level="3.2" data-path="fprog.html"><a href="fprog.html#mapping-and-reducing-the-base-way"><i class="fa fa-check"></i><b>3.2</b> Mapping and Reducing: the <em>base</em> way</a><ul>
<li class="chapter" data-level="3.2.1" data-path="fprog.html"><a href="fprog.html#mapping-with-map-and-the-apply-family-of-functions"><i class="fa fa-check"></i><b>3.2.1</b> Mapping with <code>Map()</code> and the <code>*apply()</code> family of functions</a></li>
<li class="chapter" data-level="3.2.2" data-path="fprog.html"><a href="fprog.html#reduce"><i class="fa fa-check"></i><b>3.2.2</b> <code>Reduce()</code></a></li>
</ul></li>
<li class="chapter" data-level="3.3" data-path="fprog.html"><a href="fprog.html#map_reduce_purrr"><i class="fa fa-check"></i><b>3.3</b> Mapping and Reducing: the <code>purrr</code> way</a><ul>
<li class="chapter" data-level="3.3.1" data-path="fprog.html"><a href="fprog.html#the-map-family-of-functions"><i class="fa fa-check"></i><b>3.3.1</b> The <code>map*()</code> family of functions</a></li>
<li class="chapter" data-level="3.3.2" data-path="fprog.html"><a href="fprog.html#reducing-with-purrr"><i class="fa fa-check"></i><b>3.3.2</b> Reducing with <code>purrr</code></a></li>
</ul></li>
<li class="chapter" data-level="3.4" data-path="fprog.html"><a href="fprog.html#basic-anonymous-functions"><i class="fa fa-check"></i><b>3.4</b> Basic anonymous functions</a></li>
<li class="chapter" data-level="3.5" data-path="fprog.html"><a href="fprog.html#wrap-up"><i class="fa fa-check"></i><b>3.5</b> Wrap-up</a></li>
<li class="chapter" data-level="3.6" data-path="fprog.html"><a href="fprog.html#exercises"><i class="fa fa-check"></i><b>3.6</b> Exercises</a></li>
</ul></li>
<li class="chapter" data-level="4" data-path="tidyverse.html"><a href="tidyverse.html"><i class="fa fa-check"></i><b>4</b> The <code>tidyverse</code></a><ul>
<li class="chapter" data-level="4.1" data-path="tidyverse.html"><a href="tidyverse.html#smoking-is-bad-for-you-but-pipes-are-your-friend"><i class="fa fa-check"></i><b>4.1</b> Smoking is bad for you, but pipes are your friend</a></li>
<li class="chapter" data-level="4.2" data-path="tidyverse.html"><a href="tidyverse.html#getting-data-into-r-with-readr-readxl-haven-and-what-are-tibbles"><i class="fa fa-check"></i><b>4.2</b> Getting data into R with <code>readr</code>, <code>readxl</code>, <code>haven</code> and what are <em>tibbles</em></a><ul>
<li class="chapter" data-level="4.2.1" data-path="tidyverse.html"><a href="tidyverse.html#the-swiss-army-knife-of-data-import-and-export-rio"><i class="fa fa-check"></i><b>4.2.1</b> The swiss army knife of data import and export: <code>rio</code></a></li>
</ul></li>
<li class="chapter" data-level="4.3" data-path="tidyverse.html"><a href="tidyverse.html#writing-any-object-to-disk"><i class="fa fa-check"></i><b>4.3</b> Writing any object to disk</a></li>
<li class="chapter" data-level="4.4" data-path="tidyverse.html"><a href="tidyverse.html#using-rstudio-projects-to-manage-paths"><i class="fa fa-check"></i><b>4.4</b> Using RStudio projects to manage paths</a></li>
<li class="chapter" data-level="4.5" data-path="tidyverse.html"><a href="tidyverse.html#transforming-your-data-with-dplyr"><i class="fa fa-check"></i><b>4.5</b> Transforming your data with <code>dplyr</code></a><ul>
<li class="chapter" data-level="4.5.1" data-path="tidyverse.html"><a href="tidyverse.html#filter-and-friends"><i class="fa fa-check"></i><b>4.5.1</b> <code>filter()</code> and friends</a></li>
<li class="chapter" data-level="4.5.2" data-path="tidyverse.html"><a href="tidyverse.html#select-and-its-helpers"><i class="fa fa-check"></i><b>4.5.2</b> <code>select()</code> and its helpers</a></li>
<li class="chapter" data-level="4.5.3" data-path="tidyverse.html"><a href="tidyverse.html#group_by"><i class="fa fa-check"></i><b>4.5.3</b> <code>group_by()</code></a></li>
<li class="chapter" data-level="4.5.4" data-path="tidyverse.html"><a href="tidyverse.html#summarise"><i class="fa fa-check"></i><b>4.5.4</b> <code>summarise()</code></a></li>
<li class="chapter" data-level="4.5.5" data-path="tidyverse.html"><a href="tidyverse.html#mutate-and-transmute"><i class="fa fa-check"></i><b>4.5.5</b> <code>mutate()</code> and <code>transmute()</code></a></li>
<li class="chapter" data-level="4.5.6" data-path="tidyverse.html"><a href="tidyverse.html#tally-and-count"><i class="fa fa-check"></i><b>4.5.6</b> <code>tally()</code> and <code>count()</code></a></li>
<li class="chapter" data-level="4.5.7" data-path="tidyverse.html"><a href="tidyverse.html#joining-tibbles-with-full_join-left_join-right_join-and-all-the-others"><i class="fa fa-check"></i><b>4.5.7</b> Joining <code>tibble</code>s with <code>full_join()</code>, <code>left_join()</code>, <code>right_join()</code> and all the others</a></li>
</ul></li>
<li class="chapter" data-level="4.6" data-path="tidyverse.html"><a href="tidyverse.html#tidy-your-data-with-tidyr"><i class="fa fa-check"></i><b>4.6</b> Tidy your data with <code>tidyr</code></a></li>
<li class="chapter" data-level="4.7" data-path="tidyverse.html"><a href="tidyverse.html#functional-programming-with-purrr-and-purrrlyr"><i class="fa fa-check"></i><b>4.7</b> Functional programming with <code>purrr</code> and <code>purrrlyr</code></a><ul>
<li class="chapter" data-level="4.7.1" data-path="tidyverse.html"><a href="tidyverse.html#mapping-and-reducing-with-purrr-continued"><i class="fa fa-check"></i><b>4.7.1</b> Mapping and reducing with <code>purrr</code>, continued</a></li>
<li class="chapter" data-level="4.7.2" data-path="tidyverse.html"><a href="tidyverse.html#safely-and-possibly"><i class="fa fa-check"></i><b>4.7.2</b> <code>safely()</code> and <code>possibly()</code></a></li>
<li class="chapter" data-level="4.7.3" data-path="tidyverse.html"><a href="tidyverse.html#transposing-lists"><i class="fa fa-check"></i><b>4.7.3</b> «Transposing lists»</a></li>
</ul></li>
<li class="chapter" data-level="4.8" data-path="tidyverse.html"><a href="tidyverse.html#special-packages-for-special-kinds-of-data-forcats-lubridate-and-stringr"><i class="fa fa-check"></i><b>4.8</b> Special packages for special kinds of data: <code>forcats</code>, <code>lubridate</code>, and <code>stringr</code></a><ul>
<li class="chapter" data-level="4.8.1" data-path="tidyverse.html"><a href="tidyverse.html#section"><i class="fa fa-check"></i><b>4.8.1</b> 🐈🐈🐈🐈</a></li>
</ul></li>
<li class="chapter" data-level="4.9" data-path="tidyverse.html"><a href="tidyverse.html#exercises-1"><i class="fa fa-check"></i><b>4.9</b> Exercises</a></li>
</ul></li>
<li class="chapter" data-level="5" data-path="prog-tidyverse.html"><a href="prog-tidyverse.html"><i class="fa fa-check"></i><b>5</b> Programming with the <code>tidyverse</code></a></li>
<li class="chapter" data-level="6" data-path="packages.html"><a href="packages.html"><i class="fa fa-check"></i><b>6</b> Packages</a><ul>
<li class="chapter" data-level="6.1" data-path="packages.html"><a href="packages.html#why-you-need-your-own-packages-in-your-life"><i class="fa fa-check"></i><b>6.1</b> Why you need your own packages in your life</a></li>
<li class="chapter" data-level="6.2" data-path="packages.html"><a href="packages.html#r-packages-the-basics"><i class="fa fa-check"></i><b>6.2</b> R packages: the basics</a></li>
<li class="chapter" data-level="6.3" data-path="packages.html"><a href="packages.html#writing-documentation-for-your-functions"><i class="fa fa-check"></i><b>6.3</b> Writing documentation for your functions</a></li>
<li class="chapter" data-level="6.4" data-path="packages.html"><a href="packages.html#extra-files-inside-your-package-and-dependencies"><i class="fa fa-check"></i><b>6.4</b> Extra files inside your package and dependencies</a><ul>
<li class="chapter" data-level="6.4.1" data-path="packages.html"><a href="packages.html#the-namespace-file"><i class="fa fa-check"></i><b>6.4.1</b> The <code>NAMESPACE</code> file</a></li>
<li class="chapter" data-level="6.4.2" data-path="packages.html"><a href="packages.html#how-can-you-use-functions-from-other-packages-inside-your-package"><i class="fa fa-check"></i><b>6.4.2</b> How can you use functions from other packages inside your package?</a></li>
</ul></li>
<li class="chapter" data-level="6.5" data-path="packages.html"><a href="packages.html#unit-test-your-package"><i class="fa fa-check"></i><b>6.5</b> Unit test your package</a></li>
<li class="chapter" data-level="6.6" data-path="packages.html"><a href="packages.html#checking-the-coverage-of-your-unit-tests-with-covr"><i class="fa fa-check"></i><b>6.6</b> Checking the coverage of your unit tests with <code>covr</code></a></li>
<li class="chapter" data-level="6.7" data-path="packages.html"><a href="packages.html#wrap-up-1"><i class="fa fa-check"></i><b>6.7</b> Wrap-up</a></li>
</ul></li>
<li class="chapter" data-level="7" data-path="unit-testing.html"><a href="unit-testing.html"><i class="fa fa-check"></i><b>7</b> Unit testing</a><ul>
<li class="chapter" data-level="7.1" data-path="unit-testing.html"><a href="unit-testing.html#introduction"><i class="fa fa-check"></i><b>7.1</b> Introduction</a></li>
<li class="chapter" data-level="7.2" data-path="unit-testing.html"><a href="unit-testing.html#unit-testing-with-the-testthat-package"><i class="fa fa-check"></i><b>7.2</b> Unit testing with the <code>testthat</code> package</a></li>
<li class="chapter" data-level="7.3" data-path="unit-testing.html"><a href="unit-testing.html#actually-running-your-tests"><i class="fa fa-check"></i><b>7.3</b> Actually running your tests</a></li>
<li class="chapter" data-level="7.4" data-path="unit-testing.html"><a href="unit-testing.html#wrap-up-2"><i class="fa fa-check"></i><b>7.4</b> Wrap-up</a></li>
<li class="chapter" data-level="7.5" data-path="unit-testing.html"><a href="unit-testing.html#exercises-2"><i class="fa fa-check"></i><b>7.5</b> Exercises</a></li>
</ul></li>
<li class="chapter" data-level="8" data-path="putting-it-all-together-writing-a-package-to-work-on-data.html"><a href="putting-it-all-together-writing-a-package-to-work-on-data.html"><i class="fa fa-check"></i><b>8</b> Putting it all together: writing a package to work on data</a><ul>
<li class="chapter" data-level="8.1" data-path="putting-it-all-together-writing-a-package-to-work-on-data.html"><a href="putting-it-all-together-writing-a-package-to-work-on-data.html#getting-the-data"><i class="fa fa-check"></i><b>8.1</b> Getting the data</a></li>
<li class="chapter" data-level="8.2" data-path="putting-it-all-together-writing-a-package-to-work-on-data.html"><a href="putting-it-all-together-writing-a-package-to-work-on-data.html#your-first-data-munging-package-preparedata"><i class="fa fa-check"></i><b>8.2</b> Your first data munging package: <code>prepareData</code></a><ul>
<li class="chapter" data-level="8.2.1" data-path="putting-it-all-together-writing-a-package-to-work-on-data.html"><a href="putting-it-all-together-writing-a-package-to-work-on-data.html#reading-a-lot-of-datasets-at-once"><i class="fa fa-check"></i><b>8.2.1</b> Reading a lot of datasets at once</a></li>
<li class="chapter" data-level="8.2.2" data-path="putting-it-all-together-writing-a-package-to-work-on-data.html"><a href="putting-it-all-together-writing-a-package-to-work-on-data.html#treating-the-columns-of-your-datasets"><i class="fa fa-check"></i><b>8.2.2</b> Treating the columns of your datasets</a></li>
</ul></li>
</ul></li>
<li class="chapter" data-level="" data-path="references.html"><a href="references.html"><i class="fa fa-check"></i>References</a></li>
<li class="divider"></li>
<li><a href="https://github.com/rstudio/bookdown" target="blank">Published with bookdown</a></li>
</ul>
</nav>
</div>
<div class="book-body">
<div class="body-inner">
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i><a href="./">Functional programming and unit testing for data munging with R</a>
</h1>
</div>
<div class="page-wrapper" tabindex="-1" role="main">
<div class="page-inner">
<section class="normal" id="section-">
<div id="putting-it-all-together-writing-a-package-to-work-on-data" class="section level1">
<h1><span class="header-section-number">Chapter 8</span> Putting it all together: writing a package to work on data</h1>
<p>Everything we have seen until now allows us to develop our own packages with the goal of <em>working</em> on data. By <em>working</em> on data I mean any operation that involves cleaning, transforming, analyzing or plotting data. I will summarize why everything we have seen until now helps us in this task:</p>
<ol style="list-style-type: decimal">
<li>Functional programming makes our code easier to test</li>
<li>Unit tests make sure our code is correct</li>
<li>Packages allow us to forget about paths, which makes unit tests easier to run, and they make writing documentation and sharing our code easier</li>
</ol>
<p>For the rest of this chapter we are going to work with mock datasets that I created. The data is completely random but for our purposes it does not matter. In this chapter, we are going to write a number of functions with the goal of going from these awful, badly formatted datasets to a nice longitudinal data set.</p>
<div id="getting-the-data" class="section level2">
<h2><span class="header-section-number">8.1</span> Getting the data</h2>
<p>You can download the data from the <a href="https://github.com/b-rodrigues/functional_programming_and_unit_testing_for_data_munging">GitHub repository</a> of the book. There are 5 <code>.csv</code> files that make up the datasets we are going to work with:</p>
<ul>
<li><code>data_2000.csv</code></li>
<li><code>data_2001.csv</code></li>
<li><code>data_2002.csv</code></li>
<li><code>data_2003.csv</code></li>
<li><code>data_2004.csv</code></li>
</ul>
<p>The first step, of course, is to load these datasets into R. For 5 datasets, I assume that you would simply write the following in RStudio:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_<span class="dv">2000</span> <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"/path/to/data/data_2000.csv"</span>, <span class="dt">header =</span> <span class="ot">TRUE</span>)
data_<span class="dv">2001</span> <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"/path/to/data/data_2001.csv"</span>, <span class="dt">header =</span> <span class="ot">TRUE</span>)
data_<span class="dv">2002</span> <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"/path/to/data/data_2002.csv"</span>, <span class="dt">header =</span> <span class="ot">TRUE</span>)
data_<span class="dv">2003</span> <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"/path/to/data/data_2003.csv"</span>, <span class="dt">header =</span> <span class="ot">TRUE</span>)
data_<span class="dv">2004</span> <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"/path/to/data/data_2004.csv"</span>, <span class="dt">header =</span> <span class="ot">TRUE</span>)</code></pre></div>
<p>This might be OK for 5 datasets with very similar names, especially since you can do block editing in RStudio. But what if you have hundreds, or even thousands, of datasets? And what if their names are not as well formatted as they are here? We will start our package by writing a function that reads a lot of datasets at once.</p>
</div>
<div id="your-first-data-munging-package-preparedata" class="section level2">
<h2><span class="header-section-number">8.2</span> Your first data munging package: <code>prepareData</code></h2>
<div id="reading-a-lot-of-datasets-at-once" class="section level3">
<h3><span class="header-section-number">8.2.1</span> Reading a lot of datasets at once</h3>
<p>Using RStudio, create a new project as shown in the previous chapter, and select <em>R package</em>. Give it a name, for example <code>prepareData</code>. If you are working with datasets that have a name, for example the <em>Penn World Tables</em>, you could call your package <code>preparePWT</code>, or something similar. By the way, we are going to work on some test datasets that I created for illustration purposes. When you develop your own package to work on your own data, you do not have to write unit tests that use your original data. A subset can be enough, or it might be preferable to take the time to create a small test dataset. It depends on which features of your functions you want to test.</p>
<p>The first function I will show you is actually very general and could work with any dataset. For this reason, I keep it in a package called <code>broTools</code><a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> that contains all the little functions that I use daily. But for illustration purposes, we will put this function inside <code>prepareData</code>, even if it does not have anything directly to do with it. I have called this function <code>read_list()</code> and here is the source code:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#' Reads a list of datasets</span>
<span class="co">#' @param list_of_datasets A list of datasets (names of datasets are strings)</span>
<span class="co">#' @param read_func A function, the read function to use to read the data</span>
<span class="co">#' @param ... Further arguments passed on to read_func</span>
<span class="co">#' @return Returns a list of the datasets</span>
<span class="co">#' @export</span>
<span class="co">#' @examples</span>
<span class="co">#' \dontrun{</span>
<span class="co">#' setwd("path/to/datasets/")</span>
<span class="co">#' list_of_datasets <- list.files(pattern = "[.]csv$")</span>
<span class="co">#' list_of_loaded_datasets <- read_list(list_of_datasets, read_func = read.csv)</span>
<span class="co">#' }</span>
read_list <-<span class="st"> </span><span class="cf">function</span>(list_of_datasets, read_func, ...){
<span class="kw">stopifnot</span>(<span class="kw">length</span>(list_of_datasets)<span class="op">></span><span class="dv">0</span>)
read_and_assign <-<span class="st"> </span><span class="cf">function</span>(dataset, read_func){
dataset_name <-<span class="st"> </span><span class="kw">as.name</span>(dataset)
dataset_name <-<span class="st"> </span><span class="kw">read_func</span>(dataset, ...)
}
<span class="co"># invisible is used to suppress the unneeded output</span>
output <-<span class="st"> </span><span class="kw">invisible</span>(
purrr<span class="op">::</span><span class="kw">map</span>(list_of_datasets,
read_and_assign,
<span class="dt">read_func =</span> read_func)
)
<span class="co"># Remove the ".csv" at the end of the data set names</span>
names_of_datasets <-<span class="st"> </span><span class="kw">c</span>(<span class="kw">unlist</span>(<span class="kw">strsplit</span>(list_of_datasets, <span class="st">"[.]"</span>))[<span class="kw">c</span>(<span class="ot">TRUE</span>, <span class="ot">FALSE</span>)])
<span class="kw">names</span>(output) <-<span class="st"> </span>names_of_datasets
<span class="kw">return</span>(output)
}</code></pre></div>
<p>The basic idea of <code>read_list()</code> is that it takes a list of datasets as its first argument, a function to read in the datasets as its second argument, and, as its third argument, the famous <code>...</code>, which allows the user to pass further options on to functions called in the body of the main function. In this case, further arguments are passed to <code>read_func</code>: for example, if your data does not contain headers, you could pass the option <code>header = FALSE</code> to <code>read_list()</code>, which would then pass it on to <code>read_func</code>. I use <code>purrr::map()</code> to apply <code>read_and_assign()</code>, a helper function whose role is to read in a single dataset, to the whole list of datasets. This step is wrapped inside <code>invisible()</code> to suppress unnecessary output. Finally, I use <code>strsplit()</code> with a regular expression to remove the extension from the name of each dataset. The output is thus a list of datasets where each dataset is named after its file on your hard drive, without the extension.</p>
<p>Save this function in a script called <code>read_list.R</code> in the <code>R</code> folder of your package. Now you need to invoke <code>roxygen2::roxygenise()</code> to create the documentation of your function. I suggest you also run <code>devtools::use_testthat()</code>. This creates the folder that holds your tests, as well as a small <code>testthat.R</code> file with the code that gets called to run your tests. Without this, you might encounter weird issues (for example, <code>covr</code> not finding your tests!).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">roxygen2<span class="op">::</span><span class="kw">roxygenise</span>()</code></pre></div>
<pre><code>First time using roxygen2. Upgrading automatically...
Updating roxygen version in /home/bro/Dropbox/prepareData/DESCRIPTION
Writing NAMESPACE
Writing read_list.Rd</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">devtools<span class="op">::</span><span class="kw">use_testthat</span>()</code></pre></div>
<pre><code>* Adding testthat to Suggests
* Creating `tests/testthat`.
* Creating `tests/testthat.R` from template.</code></pre>
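<p>Before moving on, the way <code>...</code> travels from <code>read_list()</code> down to <code>read_func</code> can be illustrated in isolation. The following is only a minimal sketch using base R: <code>apply_reader()</code> is a hypothetical stand-in that is not part of <code>prepareData</code>, and the temporary file is created just for the demonstration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper, not part of prepareData: it only shows how
# arguments captured by ... reach the reading function.
apply_reader <- function(paths, read_func, ...) {
  lapply(paths, function(path) read_func(path, ...))
}

# A tiny csv file without a header row, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2", "3,4"), tmp)

# Default behaviour of read.csv(): the first line is used as the header
with_header <- apply_reader(tmp, read.csv)

# header = FALSE travels through ... down to read.csv()
no_header <- apply_reader(tmp, read.csv, header = FALSE)

nrow(with_header[[1]])
nrow(no_header[[1]])</code></pre></div>
<p>The second call returns one more row than the first, because the first line of the file is no longer consumed as a header.</p>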
<p>Now let us check the coverage of our package:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"covr"</span>)
cov <-<span class="st"> </span><span class="kw">package_coverage</span>()
<span class="kw">shine</span>(cov)</code></pre></div>
<p>Unsurprisingly, we get a coverage of 0% for our package. We will now write a unit test for this function. For example, let us see if the condition <code>stopifnot(length(list_of_datasets)>0)</code> works. Because you ran <code>devtools::use_testthat()</code>, you should have a folder called <code>tests</code> at the root of your project directory. In it, there is a folder called <code>testthat</code>. This is where you will save your unit tests, and any file needed for the tests to run (for example, mock datasets that are used by tests).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"testthat"</span>)
<span class="kw">library</span>(<span class="st">"prepareData"</span>)
<span class="kw">test_that</span>(<span class="st">"Try to import empty list of datasets: this may be caused because</span>
<span class="st"> the path to the datasets is wrong for instance"</span>,{
list_datasets <-<span class="st"> </span><span class="ot">NULL</span>
<span class="kw">expect_error</span>(<span class="kw">read_list</span>(list_datasets, read_csv, <span class="dt">col_types =</span> <span class="kw">cols</span>()))
})</code></pre></div>
<p>Run the test with <code>CTRL-SHIFT-T</code> if you are using RStudio.</p>
<pre><code>==> devtools::test()
Loading prepareData
Loading required package: testthat
Testing prepareData
.
DONE ===========================================================================</code></pre>
<p>This is the output you should see. If you check the coverage of your package, you should see that the line <code>stopifnot(length(list_of_datasets)>0)</code> is highlighted in green, and you should have around 9% of coverage for your package. You can spend some time getting the coverage as high as possible, but you have to weigh the time it takes to write tests against the benefits you are going to get from them. In the case of this function, I do not really see what more you could test.</p>
<p>Let us use this function to read in the datasets:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"readr"</span>)
<span class="kw">library</span>(<span class="st">"purrr"</span>)
<span class="kw">library</span>(<span class="st">"tibble"</span>)
list_of_data <-<span class="st"> </span><span class="kw">Sys.glob</span>(<span class="st">"assets/*.csv"</span>)
datasets <-<span class="st"> </span><span class="kw">read_list</span>(list_of_data, read_csv, <span class="dt">col_types =</span> <span class="kw">cols</span>())</code></pre></div>
<p><code>list_of_data</code> is a variable that contains the paths to the datasets. I used <code>Sys.glob("assets/*.csv")</code> to find the datasets, which are saved in the <code>assets</code> folder of the book and end with the <code>.csv</code> extension. You could also use <code>list.files("assets", pattern = "\\.csv$", full.names = TRUE)</code> to achieve the same; note that <code>list.files()</code> expects a regular expression, not a wildcard. Let’s take a look inside this list using <code>head()</code>. Called on the list itself, <code>head()</code> would return its first elements, so we use <code>map()</code> to apply <code>head()</code> to each data frame in the list.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">map</span>(datasets, head)</code></pre></div>
<pre><code>## $`assets/data_2000`
## # A tibble: 6 x 6
## id Variable1 other2000 gender2000 eggs2000 spam2000
## <int> <int> <int> <chr> <int> <chr>
## 1 1 32 3 F 80 -1.5035369157
## 2 2 28 2 F 20 -0.1836726393
## 3 3 36 4 M 58 -0.6851988608
## 4 4 28 1 F 30 1.9900760191
## 5 5 34 3 F 14 0.4324725273
## 6 6 30 3 F 40 -0.79001853
##
## $`assets/data_2001`
## # A tibble: 6 x 6
## id VARIABLE1 other2001 Gender2001 eggs2001 spam2001
## <int> <int> <int> <chr> <int> <dbl>
## 1 1 32 3 F 80 -1.5035369
## 2 2 28 2 F 20 -0.1836726
## 3 3 36 4 M 58 -0.6851989
## 4 4 28 1 F 30 1.9900760
## 5 5 34 3 F 14 0.4324725
## 6 6 30 3 F 40 -0.7900185
##
## $`assets/data_2002`
## # A tibble: 6 x 6
## ID variable1 Other2002 gender2002 eggs2002 Spam2002
## <int> <int> <int> <chr> <int> <dbl>
## 1 1 32 3 F 80 -1.5035369
## 2 2 28 2 F 20 -0.1836726
## 3 3 36 4 M 58 -0.6851989
## 4 4 28 1 F 30 1.9900760
## 5 5 34 3 F 14 0.4324725
## 6 6 30 3 F 40 -0.7900185
##
## $`assets/data_2003`
## # A tibble: 6 x 6
## id variable1 other2003 gender2003 EGGS2003 spam2003
## <int> <int> <int> <chr> <int> <dbl>
## 1 1 32 3 F 80 -1.5035369
## 2 2 28 2 F 20 -0.1836726
## 3 3 36 4 M 58 -0.6851989
## 4 4 28 1 F 30 1.9900760
## 5 5 34 3 F 14 0.4324725
## 6 6 30 3 F 40 -0.7900185
##
## $`assets/data_2004`
## # A tibble: 6 x 6
## Id Variable1 Other2004 Gender2004 Eggs2004 Spam2004
## <int> <int> <int> <chr> <int> <dbl>
## 1 1 32 3 F 80 -1.5035369
## 2 2 28 2 F 20 -0.1836726
## 3 3 36 4 M 58 -0.6851989
## 4 4 28 1 F 30 1.9900760
## 5 5 34 3 F 14 0.4324725
## 6 6 30 3 F 40 -0.7900185</code></pre>
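<p>Going back to how the file paths were collected: the difference between a wildcard and a regular expression can be checked on a throwaway folder. The sketch below uses a hypothetical folder called <code>assets_demo</code> with empty files, created in the temporary directory of the R session.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Temporary folder with two CSV files and one unrelated file (all hypothetical)
folder <- file.path(tempdir(), "assets_demo")
dir.create(folder, showWarnings = FALSE)
invisible(file.create(file.path(folder, c("data_2000.csv", "data_2001.csv", "notes.txt"))))

# Sys.glob() expands a shell-style wildcard
from_glob <- Sys.glob(file.path(folder, "*.csv"))

# list.files() takes a directory and a regular expression instead
from_list <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

basename(from_glob)
basename(from_list)</code></pre></div>
<p>Both calls should find the two CSV files and ignore <code>notes.txt</code>.</p>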
<p>The datasets we will work with all have the same variables and the same individuals. We have datasets for the years 2000 to 2004. It would be much better for analysis if we could have clean variable names and merge all the datasets together into a single, longitudinal dataset. In short, what we need:</p>
<ul>
<li>Have nice names for the columns.</li>
<li>Remove the year from the name of the columns and add a column containing the year.</li>
<li>Merge every dataset together.</li>
</ul>
<p>These steps make the dataset tidy, as explained in <span class="citation">Wickham (<a href="#ref-wickham2014tidy">2014</a><a href="#ref-wickham2014tidy">b</a>)</span>. Of course, depending on your needs, you might need further operations, for example creating new variables. For now, we are going to focus on these three steps.</p>
</div>
<div id="treating-the-columns-of-your-datasets" class="section level3">
<h3><span class="header-section-number">8.2.2</span> Treating the columns of your datasets</h3>
<p>Let us take a look at the column names of the datasets:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">map</span>(datasets, colnames)</code></pre></div>
<pre><code>## $`assets/data_2000`
## [1] "id" "Variable1" "other2000" "gender2000" "eggs2000"
## [6] "spam2000"
##
## $`assets/data_2001`
## [1] "id" "VARIABLE1" "other2001" "Gender2001" "eggs2001"
## [6] "spam2001"
##
## $`assets/data_2002`
## [1] "ID" "variable1" "Other2002" "gender2002" "eggs2002"
## [6] "Spam2002"
##
## $`assets/data_2003`
## [1] "id" "variable1" "other2003" "gender2003" "EGGS2003"
## [6] "spam2003"
##
## $`assets/data_2004`
## [1] "Id" "Variable1" "Other2004" "Gender2004" "Eggs2004"
## [6] "Spam2004"</code></pre>
<p>This is very messy. We need a function that cleans up this mess and “normalizes” the column names. It turns out that we’re lucky: the function <code>janitor::clean_names()</code> from the <code>janitor</code> package does exactly this. Let’s use it and see the output:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"janitor"</span>)
datasets <-<span class="st"> </span><span class="kw">map</span>(datasets, clean_names)
<span class="kw">map</span>(datasets, colnames)</code></pre></div>
<pre><code>## $`assets/data_2000`
## [1] "id" "variable1" "other2000" "gender2000" "eggs2000"
## [6] "spam2000"
##
## $`assets/data_2001`
## [1] "id" "variable1" "other2001" "gender2001" "eggs2001"
## [6] "spam2001"
##
## $`assets/data_2002`
## [1] "id" "variable1" "other2002" "gender2002" "eggs2002"
## [6] "spam2002"
##
## $`assets/data_2003`
## [1] "id" "variable1" "other2003" "gender2003" "eggs2003"
## [6] "spam2003"
##
## $`assets/data_2004`
## [1] "id" "variable1" "other2004" "gender2004" "eggs2004"
## [6] "spam2004"</code></pre>
<p>This is much better. If <code>clean_names()</code> didn’t exist, you would have had to write your own function for this, which could have been a complicated exercise depending on how messy and heterogeneous the variable names in your data are. However, <code>clean_names()</code> does a great job, so there’s no need to reinvent the wheel!</p>
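<p>To give an idea of what writing such a function yourself could look like, here is a very reduced sketch. The function <code>lowercase_names()</code> is hypothetical and only handles the mixed-case problem seen above; <code>janitor::clean_names()</code> covers many more cases.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical stand-in for janitor::clean_names(); covers only the simple case
# of mixed-case column names
lowercase_names <- function(dataset){
  colnames(dataset) <- tolower(colnames(dataset))
  dataset
}

# Toy data frame with the same kind of messy names as the 2004 dataset
toy <- data.frame(Id = 1:2, VARIABLE1 = c(32, 28), Eggs2004 = c(80, 20))
colnames(lowercase_names(toy))</code></pre></div>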
<p>Now we would like to remove the years from the column names and add a column containing the year to each dataset. Let us start by writing a function that removes the years from the column names. For this function, a little knowledge of regular expressions will not hurt. Here is what the function looks like:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#' Remove year strings from column names</span>
<span class="co">#' @param list_of_datasets A list containing named datasets</span>
<span class="co">#' @return A list of datasets with the year strings removed from the column names</span>
<span class="co">#' @description This function removes year strings from column names, meaning that a column called</span>
<span class="co">#' "eggs9000" gets renamed into "eggs"</span>
<span class="co">#' @export</span>
<span class="co">#' @examples</span>
<span class="co">#' \dontrun{</span>
<span class="co">#' #`list_of_data_sets` is a list containing named data sets</span>
<span class="co">#' # For example, to access the first data set, called dataset_1 you would</span>
<span class="co">#' # write</span>
<span class="co">#' list_of_data_sets$dataset_1</span>
<span class="co">#' remove_years_from_strings(list_of_data_sets)</span>
<span class="co">#' }</span>
remove_years_from_strings <-<span class="st"> </span><span class="cf">function</span>(list_of_datasets){
for_one_dataset <-<span class="st"> </span><span class="cf">function</span>(dataset){
<span class="co"># strsplit() accepts regular expressions, so it's easy to get rid of a number made up of</span>
<span class="co"># *exactly* 4 digits</span>
<span class="kw">colnames</span>(dataset) <-<span class="st"> </span><span class="kw">unlist</span>(<span class="kw">strsplit</span>(<span class="kw">colnames</span>(dataset), <span class="st">"</span><span class="ch">\\</span><span class="st">d{4}"</span>, <span class="dt">perl =</span> <span class="ot">TRUE</span>))
<span class="kw">return</span>(dataset)
}
output <-<span class="st"> </span>purrr<span class="op">::</span><span class="kw">map</span>(list_of_datasets, for_one_dataset)
<span class="kw">return</span>(output)
}</code></pre></div>
<p>and here is the accompanying unit test:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"testthat"</span>)
<span class="kw">library</span>(<span class="st">"prepareData"</span>)
<span class="kw">library</span>(<span class="st">"readr"</span>)
data_sets <-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">pattern =</span> <span class="st">"2001"</span>)
data_list <-<span class="st"> </span><span class="kw">read_list</span>(data_sets, read_csv, <span class="dt">col_types =</span> <span class="kw">cols</span>())
<span class="kw">test_that</span>(<span class="st">"Test remove years from strings"</span>,{
data_list_result <-<span class="st"> </span>purrr<span class="op">::</span><span class="kw">map</span>(data_list, janitor<span class="op">::</span>clean_names)
data_list_result <-<span class="st"> </span><span class="kw">remove_years_from_strings</span>(data_list_result)
expect <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"id"</span>, <span class="st">"variable1"</span>, <span class="st">"other"</span>, <span class="st">"gender"</span>, <span class="st">"eggs"</span>, <span class="st">"spam"</span>)
actual <-<span class="st"> </span><span class="kw">colnames</span>(data_list_result[[<span class="dv">1</span>]])
<span class="kw">expect_equal</span>(expect, actual)
})</code></pre></div>
<p>For the unit test to work, I had to add the dataset for the year 2001 in the <code>tests/testthat</code> directory. Again, this dataset does not have to be the real dataset you will ultimately be working on. A mock dataset with simulated data on 10 rows and with the same column names works exactly the same!</p>
<p>Let’s take a look at the output:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">datasets <-<span class="st"> </span><span class="kw">remove_years_from_strings</span>(datasets)
<span class="kw">map</span>(datasets, colnames)</code></pre></div>
<pre><code>## $`assets/data_2000`
## [1] "id" "variable1" "other" "gender" "eggs" "spam"
##
## $`assets/data_2001`
## [1] "id" "variable1" "other" "gender" "eggs" "spam"
##
## $`assets/data_2002`
## [1] "id" "variable1" "other" "gender" "eggs" "spam"
##
## $`assets/data_2003`
## [1] "id" "variable1" "other" "gender" "eggs" "spam"
##
## $`assets/data_2004`
## [1] "id" "variable1" "other" "gender" "eggs" "spam"</code></pre>
<p>This is starting to look like something!</p>
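<p>One caveat of the <code>strsplit()</code> approach is that it returns an extra empty string when the digits sit at the start of a name, which would break the renaming. If your column names can look like that, a substitution with <code>gsub()</code> might be more forgiving. The helper <code>strip_years()</code> below is a hypothetical alternative, not the function of the package.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: delete any run of four consecutive digits from each name
strip_years <- function(column_names){
  gsub("\\d{4}", "", column_names, perl = TRUE)
}

# The last name starts with the year, the case where splitting is fragile
strip_years(c("id", "variable1", "eggs2000", "2000eggs"))</code></pre></div>
<p>Since <code>gsub()</code> returns exactly one string per input, the result always has as many names as the dataset has columns.</p>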
<p>Now, since we removed the years from the column names, we need to add a column containing the year to each dataset. Here is the function that does this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#' Adds the year column</span>
<span class="co">#' @param list_of_datasets A list containing named datasets</span>
<span class="co">#' @return A list of datasets with the year column</span>
<span class="co">#' @description This function works by extracting the year string contained in</span>
<span class="co">#' the data set name and appending a new column to the data set with the numeric</span>
<span class="co">#' value of the year. This means that the data sets have to have a name of the</span>
<span class="co">#' form data_set_2001 or data_2001_europe, etc</span>
<span class="co">#' @export</span>
<span class="co">#' @examples</span>
<span class="co">#' \dontrun{</span>
<span class="co">#' #`list_of_data_sets` is a list containing named data sets</span>
<span class="co">#' # For example, to access the first data set, called dataset_1 you would</span>
<span class="co">#' # write</span>
<span class="co">#' list_of_data_sets$dataset_1</span>
<span class="co">#' add_year_column(list_of_data_sets)</span>
<span class="co">#' }</span>
add_year_column <-<span class="st"> </span><span class="cf">function</span>(list_of_datasets){
for_one_dataset <-<span class="st"> </span><span class="cf">function</span>(dataset, dataset_name){
<span class="co"># Split the name of the dataset at the "_". The datasets must have a name of the</span>
<span class="co"># form "data_2000" (notice the underscore).</span>
name_year <-<span class="st"> </span><span class="kw">unlist</span>(<span class="kw">strsplit</span>(dataset_name, <span class="st">"[_.]"</span>))
<span class="co"># Get the index of the string that contains digits</span>
index <-<span class="st"> </span><span class="kw">grep</span>(<span class="st">"</span><span class="ch">\\</span><span class="st">d+"</span>, name_year)
<span class="co"># Get the year</span>
year <-<span class="st"> </span><span class="kw">as.numeric</span>(name_year[index])
<span class="co"># Add it to the data set</span>
dataset<span class="op">$</span>year <-<span class="st"> </span>year
<span class="kw">return</span>(dataset)
}
output <-<span class="st"> </span>purrr<span class="op">::</span><span class="kw">map2</span>(list_of_datasets, <span class="kw">names</span>(list_of_datasets), for_one_dataset)
<span class="kw">return</span>(output)
}</code></pre></div>
<p>And its unit test:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"testthat"</span>)
<span class="kw">library</span>(<span class="st">"prepareData"</span>)
<span class="kw">library</span>(<span class="st">"readr"</span>)
data_sets <-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">pattern =</span> <span class="st">"^data_"</span>)
data_list <-<span class="st"> </span><span class="kw">read_list</span>(data_sets, read_csv, <span class="dt">col_types =</span> <span class="kw">cols</span>())
<span class="kw">test_that</span>(<span class="st">"Test add year column"</span>,{
data_list_result <-<span class="st"> </span>purrr<span class="op">::</span><span class="kw">map</span>(data_list, janitor<span class="op">::</span>clean_names)
data_list_result <-<span class="st"> </span><span class="kw">add_year_column</span>(data_list_result)
expect <-<span class="st"> </span><span class="kw">list</span>(<span class="kw">rep</span>(<span class="dv">2001</span>, <span class="dv">1000</span>), <span class="kw">rep</span>(<span class="dv">2002</span>, <span class="dv">1000</span>))
actual <-<span class="st"> </span><span class="kw">list</span>(data_list_result[[<span class="dv">1</span>]]<span class="op">$</span>year, data_list_result[[<span class="dv">2</span>]]<span class="op">$</span>year)
<span class="kw">expect_equal</span>(expect, actual)
})</code></pre></div>
<p>This function does not work if the names of the datasets are not of the form “data_2000”. This means that the function should either take an additional argument where you specify the separator (for example “_”, “.” or even “-”), or fail if the name does not contain an “_”. I like the second solution better:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#' Adds the year column</span>
<span class="co">#' @param list_of_datasets A list containing named datasets</span>
<span class="co">#' @return A list of datasets with the year column</span>
<span class="co">#' @description This function works by extracting the year string contained in</span>
<span class="co">#' the data set name and appending a new column to the data set with the numeric</span>
<span class="co">#' value of the year. This means that the data sets have to have a name of the</span>
<span class="co">#' form data_set_2001 or data_2001_europe, etc</span>
<span class="co">#' @export</span>
<span class="co">#' @examples</span>
<span class="co">#' \dontrun{</span>
<span class="co">#' #`list_of_data_sets` is a list containing named data sets</span>
<span class="co">#' # For example, to access the first data set, called dataset_1 you would</span>
<span class="co">#' # write</span>
<span class="co">#' list_of_data_sets$dataset_1</span>
<span class="co">#' add_year_column(list_of_data_sets)</span>
<span class="co">#' }</span>
add_year_column <-<span class="st"> </span><span class="cf">function</span>(list_of_datasets){
for_one_dataset <-<span class="st"> </span><span class="cf">function</span>(dataset, dataset_name){
<span class="cf">if</span>(<span class="op">!</span>(<span class="st">"_"</span> <span class="op">%in%</span><span class="st"> </span><span class="kw">unlist</span>(<span class="kw">strsplit</span>(dataset_name, <span class="dt">split =</span> <span class="st">""</span>)))){
<span class="kw">stop</span>(<span class="st">"Make sure that your datasets are named like</span>
<span class="st"> `data_2000.csv` or similar. The `_` between `data`</span>
<span class="st"> and `2000` is what matters"</span>)}
<span class="co"># Split the name of the dataset at the "_". The datasets must have a name of the</span>
<span class="co"># form "data_2000" (notice the underscore).</span>
name_year <-<span class="st"> </span><span class="kw">unlist</span>(<span class="kw">strsplit</span>(dataset_name, <span class="dt">split =</span> <span class="st">"[_.]"</span>))
<span class="co"># Get the index of the string that contains digits</span>
index <-<span class="st"> </span><span class="kw">grep</span>(<span class="st">"</span><span class="ch">\\</span><span class="st">d+"</span>, name_year)
<span class="co"># Get the year</span>
year <-<span class="st"> </span><span class="kw">as.numeric</span>(name_year[index])
<span class="co"># Add it to the data set</span>
dataset<span class="op">$</span>year <-<span class="st"> </span>year
<span class="kw">return</span>(dataset)
}
output <-<span class="st"> </span>purrr<span class="op">::</span><span class="kw">map2</span>(list_of_datasets, <span class="kw">names</span>(list_of_datasets), for_one_dataset)
<span class="kw">return</span>(output)
}</code></pre></div>
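<p>For comparison, the first option (an additional argument for the separator) could be sketched as follows. The function <code>extract_year()</code> and its <code>sep</code> argument are hypothetical; the sketch works on a single dataset name and assumes that the year is a piece made up of exactly four digits.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical variant for a single dataset name; `sep` is an assumed argument
extract_year <- function(dataset_name, sep = "_"){
  if(!grepl(sep, dataset_name, fixed = TRUE)){
    stop("The dataset name must contain the separator `", sep, "`")
  }
  # Drop the file extension, then split the name at the separator
  pieces <- unlist(strsplit(tools::file_path_sans_ext(dataset_name), sep, fixed = TRUE))
  # Keep the piece made up of exactly four digits
  as.numeric(pieces[grepl("^\\d{4}$", pieces)])
}

extract_year("data_2000.csv")
extract_year("data-2001-europe", sep = "-")</code></pre></div>
<p>The price of this flexibility is one more argument to document and test, which is one reason to prefer the stricter version above.</p>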
<p>If you check the coverage of <code>add_year_column()</code>, you will see that the lines that check whether the datasets are correctly named never get called. Let’s add a unit test that does this, but first we need to create <em>wrong</em> datasets. Just copy the datasets you have in your tests folder, and rename the copies to <code>wrongdata2001.csv</code> and <code>wrongdata2002.csv</code>. We expect our function to stop with an error message if it tries anything on these datasets:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_sets <-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">pattern =</span> <span class="st">"wrong"</span>)
data_list <-<span class="st"> </span><span class="kw">read_list</span>(data_sets, read_csv, <span class="dt">col_types =</span> <span class="kw">cols</span>())
<span class="kw">test_that</span>(<span class="st">"Test add year column: wrong name"</span>,{
data_list_result <-<span class="st"> </span>purrr<span class="op">::</span><span class="kw">map</span>(data_list, janitor<span class="op">::</span>clean_names)
<span class="kw">expect_error</span>(<span class="kw">add_year_column</span>(data_list_result))
})</code></pre></div>
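<p>The test above passes as soon as any error is raised. If you want to make sure that it is the naming error and not some unrelated failure, <code>expect_error()</code> also accepts a regular expression that is matched against the error message. The sketch below assumes that <code>testthat</code> is installed and uses a hypothetical function, <code>check_name()</code>, that mimics the naming rule.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("testthat")

# Hypothetical checker that mimics the naming rule of add_year_column()
check_name <- function(dataset_name){
  if(!grepl("_", dataset_name, fixed = TRUE)){
    stop("Make sure that your datasets are named like `data_2000.csv`")
  }
  invisible(TRUE)
}

test_that("a badly named dataset triggers the naming error", {
  # The second argument is a regular expression matched against the error message
  expect_error(check_name("wrongdata2001"), "named like")
  expect_true(check_name("data_2001"))
})</code></pre></div>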
<p>Now you have fully covered your function, and you also know when the function breaks. With the informative error message, future you or your coworkers will know how to correctly name the datasets. Let’s try <code>add_year_column()</code> to see how it behaves on our data:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">datasets <-<span class="st"> </span><span class="kw">add_year_column</span>(datasets)
<span class="kw">map</span>(datasets, head)</code></pre></div>
<pre><code>## $`assets/data_2000`
## # A tibble: 6 x 7
## id variable1 other gender eggs spam year
## <int> <int> <int> <chr> <int> <chr> <dbl>
## 1 1 32 3 F 80 -1.5035369157 2000
## 2 2 28 2 F 20 -0.1836726393 2000
## 3 3 36 4 M 58 -0.6851988608 2000
## 4 4 28 1 F 30 1.9900760191 2000
## 5 5 34 3 F 14 0.4324725273 2000
## 6 6 30 3 F 40 -0.79001853 2000
##
## $`assets/data_2001`
## # A tibble: 6 x 7
## id variable1 other gender eggs spam year
## <int> <int> <int> <chr> <int> <dbl> <dbl>
## 1 1 32 3 F 80 -1.5035369 2001
## 2 2 28 2 F 20 -0.1836726 2001
## 3 3 36 4 M 58 -0.6851989 2001
## 4 4 28 1 F 30 1.9900760 2001
## 5 5 34 3 F 14 0.4324725 2001
## 6 6 30 3 F 40 -0.7900185 2001
##
## $`assets/data_2002`
## # A tibble: 6 x 7
## id variable1 other gender eggs spam year
## <int> <int> <int> <chr> <int> <dbl> <dbl>
## 1 1 32 3 F 80 -1.5035369 2002
## 2 2 28 2 F 20 -0.1836726 2002
## 3 3 36 4 M 58 -0.6851989 2002
## 4 4 28 1 F 30 1.9900760 2002
## 5 5 34 3 F 14 0.4324725 2002
## 6 6 30 3 F 40 -0.7900185 2002
##
## $`assets/data_2003`
## # A tibble: 6 x 7
## id variable1 other gender eggs spam year
## <int> <int> <int> <chr> <int> <dbl> <dbl>
## 1 1 32 3 F 80 -1.5035369 2003
## 2 2 28 2 F 20 -0.1836726 2003
## 3 3 36 4 M 58 -0.6851989 2003
## 4 4 28 1 F 30 1.9900760 2003
## 5 5 34 3 F 14 0.4324725 2003
## 6 6 30 3 F 40 -0.7900185 2003
##
## $`assets/data_2004`
## # A tibble: 6 x 7
## id variable1 other gender eggs spam year
## <int> <int> <int> <chr> <int> <dbl> <dbl>
## 1 1 32 3 F 80 -1.5035369 2004
## 2 2 28 2 F 20 -0.1836726 2004
## 3 3 36 4 M 58 -0.6851989 2004
## 4 4 28 1 F 30 1.9900760 2004
## 5 5 34 3 F 14 0.4324725 2004
## 6 6 30 3 F 40 -0.7900185 2004</code></pre>
<p>Just as expected!</p>
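<p>The last step listed earlier, merging every dataset together, could be approached along the following lines. This is only a sketch in base R on two made-up toy datasets, not the function of the package. It assumes that all datasets share the same column names and types, which does not hold yet for the real data, where <code>spam</code> is a character column in the 2000 dataset.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Two toy datasets standing in for the cleaned list (values are made up)
toy_datasets <- list(
  data_2000 = data.frame(id = 1:2, eggs = c(80, 20), year = 2000),
  data_2001 = data.frame(id = 1:2, eggs = c(75, 25), year = 2001)
)

# Stack the datasets on top of each other; this requires identical column names
merged <- do.call(rbind, toy_datasets)
rownames(merged) <- NULL
merged</code></pre></div>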
<p>TBC…</p>
</div>
</div>
</div>
<h3>References</h3>
<div id="refs" class="references">
<div id="ref-wickham2014tidy">
<p>Wickham, Hadley. 2014b. “Tidy Data.” <em>Journal of Statistical Software</em> 59 (10): 1–23. doi:<a href="https://doi.org/10.18637/jss.v059.i10">10.18637/jss.v059.i10</a>.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol start="2">
<li id="fn2"><p>It stands for <code>Bruno Rodrigues' Tools</code>. I’m still working on releasing the package on GitHub, and maybe CRAN.<a href="putting-it-all-together-writing-a-package-to-work-on-data.html#fnref2">↩</a></p></li>
</ol>
</div>
</section>
</div>
</div>
</div>
<a href="unit-testing.html" class="navigation navigation-prev " aria-label="Previous page"><i class="fa fa-angle-left"></i></a>
<a href="references.html" class="navigation navigation-next " aria-label="Next page"><i class="fa fa-angle-right"></i></a>
</div>
</div>
<script src="libs/gitbook-2.6.7/js/app.min.js"></script>
<script src="libs/gitbook-2.6.7/js/lunr.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
<script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
<script>
gitbook.require(["gitbook"], function(gitbook) {
gitbook.start({
"sharing": {
"github": false,
"facebook": true,
"twitter": true,
"google": false,
"weibo": false,
"instapper": false,
"vk": false,
"all": ["facebook", "google", "twitter", "weibo", "instapaper"]
},
"fontsettings": {
"theme": "white",
"family": "sans",
"size": 2
},
"edit": {
"link": "https://github.com/rstudio/bookdown-demo/edit/master/07-all_together.Rmd",
"text": "Edit"
},
"download": ["fp_tdd_data.pdf"],
"toc": {
"collapse": "subsection"
}
});
});
</script>
</body>
</html>