/
libr-datastep.Rmd
747 lines (621 loc) · 24.9 KB
/
libr-datastep.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
---
title: "Data Step Operations"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Data Step Operations}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
Normally, R processes data column-by-column. The data step allows you
to process data row-by-row. Row-by-row processing of data is useful when you
have related columns, and wish to perform conditional logic on those
columns. The `datastep()` function allows you to realize this style of
data processing. It is particularly advantageous when you wish to perform deeply
nested conditional logic. It is also very useful for by-group
processing.
#### Example 1: Simple Data Step
Here is an example of a simple data step:
```{r eval=FALSE, echo=TRUE}
library(libr)
# Add some columns to mtcars using data step logic
df <- datastep(mtcars[1:10, 1:3], {
if (mpg >= 20)
mpgcat <- "High"
else
mpgcat <- "Low"
recdt <- as.Date("1974-06-10")
if (cyl == 8)
is8cyl <- TRUE
})
# View results
df
# mpg cyl disp recdt mpgcat is8cyl
# Mazda RX4 21.0 6 160.0 1974-06-10 High NA
# Mazda RX4 Wag 21.0 6 160.0 1974-06-10 High NA
# Datsun 710 22.8 4 108.0 1974-06-10 High NA
# Hornet 4 Drive 21.4 6 258.0 1974-06-10 High NA
# Hornet Sportabout 18.7 8 360.0 1974-06-10 Low TRUE
# Valiant 18.1 6 225.0 1974-06-10 Low NA
# Duster 360 14.3 8 360.0 1974-06-10 Low TRUE
# Merc 240D 24.4 4 146.7 1974-06-10 High NA
# Merc 230 22.8 4 140.8 1974-06-10 High NA
# Merc 280 19.2 6 167.6 1974-06-10 Low NA
```
### Keep, Drop, and Rename
The data step has parameters to perform basic shaping of the resulting
data frame. These parameters are 'keep', 'drop', and 'rename'. For example,
the above data step could have been performed by sending all columns into
the data step, and keeping only the desired columns. Using the `keep`
parameter also allows you to order the resulting columns.
#### Example 2: Keeping Data Step Variables
```{r eval=FALSE, echo=TRUE}
library(libr)
# Keep and order output columns
df <- datastep(mtcars[1:10,],
keep = c("mpg", "cyl", "disp", "mpgcat", "recdt"), {
if (mpg >= 20)
mpgcat <- "High"
else
mpgcat <- "Low"
recdt <- as.Date("1974-06-10")
if (cyl == 8)
is8cyl <- TRUE
})
df
# mpg cyl disp mpgcat recdt
# Mazda RX4 21.0 6 160.0 High 1974-06-10
# Mazda RX4 Wag 21.0 6 160.0 High 1974-06-10
# Datsun 710 22.8 4 108.0 High 1974-06-10
# Hornet 4 Drive 21.4 6 258.0 High 1974-06-10
# Hornet Sportabout 18.7 8 360.0 Low 1974-06-10
# Valiant 18.1 6 225.0 Low 1974-06-10
# Duster 360 14.3 8 360.0 Low 1974-06-10
# Merc 240D 24.4 4 146.7 High 1974-06-10
# Merc 230 22.8 4 140.8 High 1974-06-10
# Merc 280 19.2 6 167.6 Low 1974-06-10
```
### The Retain Parameter
The retain parameter allows you to define variables
that will be seeded with the value from the previous step. The retain
option is useful for creating cumulative values or for performing
conditions based on the value of the previous row.
#### Example 3: Drop, Retain, and Rename Parameters
```{r eval=FALSE, echo=TRUE}
library(libr)
df <- datastep(mtcars[1:10, ],
drop = c("disp", "hp", "drat", "qsec",
"vs", "am", "gear", "carb"),
retain = list(cumwt = 0 ),
rename = c(mpg = "MPG", cyl = "Cylinders", wt = "Wgt",
cumwt = "Cumulative Wgt"), {
cumwt <- cumwt + wt
})
df
# MPG Cylinders Wgt Cumulative Wgt
# Mazda RX4 21.0 6 2.620 2.620
# Mazda RX4 Wag 21.0 6 2.875 5.495
# Datsun 710 22.8 4 2.320 7.815
# Hornet 4 Drive 21.4 6 3.215 11.030
# Hornet Sportabout 18.7 8 3.440 14.470
# Valiant 18.1 6 3.460 17.930
# Duster 360 14.3 8 3.570 21.500
# Merc 240D 24.4 4 3.190 24.690
# Merc 230 22.8 4 3.150 27.840
# Merc 280 19.2 6 3.440 31.280
```
### By Group Processing
The `datastep()` function also has the capabilities of performing by-group
processing. A by-group is accomplished using the `by` parameter, and passing
a vector of column names that define the group. Once a by-group is
defined, the `first.` and `last.` automatic variables become active, which
allow you to identify the boundaries between groups. Note that, by default,
your data must be sorted properly before sending it into the data step. To
turn the sort check off, set the `sort_check` parameter to FALSE.
#### Example 4: By Groups
```{r eval=FALSE, echo=TRUE}
library(libr)
# Identify start and end of by-groups
df <- datastep(mtcars[1:10,],
keep = c("mpg", "cyl", "gear", "grp"),
by = c("gear"), sort_check = FALSE, {
if (first. & last.)
grp <- "Start - End"
else if (first.)
grp <- "Start"
else if (last.)
grp <- "End"
else
grp <- "-"
})
df
# mpg cyl gear grp
# Mazda RX4 21.0 6 4 Start
# Mazda RX4 Wag 21.0 6 4 -
# Datsun 710 22.8 4 4 End
# Hornet 4 Drive 21.4 6 3 Start
# Hornet Sportabout 18.7 8 3 -
# Valiant 18.1 6 3 -
# Duster 360 14.3 8 3 End
# Merc 240D 24.4 4 4 Start
# Merc 230 22.8 4 4 -
# Merc 280 19.2 6 4 End
```
### By Group Processing of Multiple Variables
If desired, you can pass multiple variables on the `by` parameter. When there
are multiple by groups, the `first.` and `last.` automatic variables described
above will represent an "or" combination of values for all by-variables. In
addition, automatic variables will be created for each variable in the by
group, similar to SAS®. Observe:
#### Example 5: Multiple By Groups
```{r eval=FALSE, echo=TRUE}
library(libr)
# Create sample data
df <- data.frame(HairEyeColor)[seq(2, 32, 2), ]
# Sort by groups
df <- sort(df, by = c("Sex", "Hair"))
# Identify start and end of by-groups
df2 <- datastep(df,
drop = c("Eye", "Freq"),
by = c("Sex", "Hair"), {
fSex <- first.Sex
lSex <- last.Sex
fHair <- first.Hair
lHair <- last.Hair
})
df2
# Hair Sex fSex lSex fHair lHair
# 1 Brown Male TRUE FALSE TRUE FALSE
# 2 Brown Male FALSE FALSE FALSE FALSE
# 3 Brown Male FALSE FALSE FALSE FALSE
# 4 Brown Male FALSE FALSE FALSE TRUE
# 5 Blond Male FALSE FALSE TRUE FALSE
# 6 Blond Male FALSE FALSE FALSE FALSE
# 7 Blond Male FALSE FALSE FALSE FALSE
# 8 Blond Male FALSE TRUE FALSE TRUE
# 9 Brown Female TRUE FALSE TRUE FALSE
# 10 Brown Female FALSE FALSE FALSE FALSE
# 11 Brown Female FALSE FALSE FALSE FALSE
# 12 Brown Female FALSE FALSE FALSE TRUE
# 13 Blond Female FALSE FALSE TRUE FALSE
# 14 Blond Female FALSE FALSE FALSE FALSE
# 15 Blond Female FALSE FALSE FALSE FALSE
# 16 Blond Female FALSE TRUE FALSE TRUE
```
The above `first.Sex`, `last.Sex`, `first.Hair`, and `last.Hair` variables
may also be used in conditions, functions, or any other expression
inside your datastep. Note that like `first.` and `last.` they are
dropped automatically at the end of the datastep. If you want to retain
their values, assign them to a new variable as shown above.
### Using Summary Functions
There may be times when you want to combine row-by-row conditional
processing with column-by-column vector operations. For example,
let's say you want to calculate a mean and then perform conditional
processing on that mean. This
situation can be handled using the `calculate` parameter on the `datastep()`
function. The function will execute the `calculate` block first, add any
assigned variables to the data frame, and then execute the data step. Below
is an example of such a scenario:
#### Example 6: Calculate Block
```{r eval=FALSE, echo=TRUE}
library(libr)
# Categorize mpg as above or below the mean
df <- datastep(mtcars,
keep = c("mpg", "cyl", "mean_mpg", "mpgcat"),
calculate = { mean_mpg = mean(mpg) },
{
if (mpg >= mean_mpg)
mpgcat <- "High"
else
mpgcat <- "Low"
})
df[1:10,]
# mpg cyl mean_mpg mpgcat
# Mazda RX4 21.0 6 20.09062 High
# Mazda RX4 Wag 21.0 6 20.09062 High
# Datsun 710 22.8 4 20.09062 High
# Hornet 4 Drive 21.4 6 20.09062 High
# Hornet Sportabout 18.7 8 20.09062 Low
# Valiant 18.1 6 20.09062 Low
# Duster 360 14.3 8 20.09062 Low
# Merc 240D 24.4 4 20.09062 High
# Merc 230 22.8 4 20.09062 High
# Merc 280 19.2 6 20.09062 Low
```
### Data Steps with `dplyr`
Note that the `datastep()` function is pipe-friendly, and can be combined
with **dplyr** functions in a data pipeline. Also note that the `datastep()`
function will recognize any group attributes added by the `group_by()`
function. Therefore, within a **dplyr** pipeline, it is not necessary to
use any `datastep` parameters. The following example recreates the above
data frame from Example 5, but with a **dplyr** pipeline.
#### Example 7: Data Pipeline
```{r eval=FALSE, echo=TRUE}
library(libr)
library(dplyr)
library(magrittr)
# Add datastep to dplyr pipeline
df <- mtcars %>%
select(mpg, cyl, gear) %>%
mutate(mean_mpg = mean(mpg)) %>%
datastep({
if (mpg >= mean_mpg)
mpgcat <- "High"
else
mpgcat <- "Low"
}) %>%
filter(row_number() <= 10)
df
# mpg cyl gear mean_mpg mpgcat
# 1 21.0 6 4 20.09062 High
# 2 21.0 6 4 20.09062 High
# 3 22.8 4 4 20.09062 High
# 4 21.4 6 3 20.09062 High
# 5 18.7 8 3 20.09062 Low
# 6 18.1 6 3 20.09062 Low
# 7 14.3 8 3 20.09062 Low
# 8 24.4 4 4 20.09062 High
# 9 22.8 4 4 20.09062 High
# 10 19.2 6 4 20.09062 Low
```
### Data Attributes
The **libr** package recognizes several useful data attributes that
are not normally recognized by other R functions. For example, it is
very convenient to assign _label_ and _description_ attributes to your columns,
so other people can understand what data the columns contain.
For this reason,
the `datastep()` function provides an _attrib_ parameter that allows you
to supply such attributes as part of a data step. Attributes are
assigned with a named list and the `dsattr()` object.
#### Example 8: Attributes
```{r eval=FALSE, echo=TRUE}
library(libr)
# Assign label attributes to all columns
df <- datastep(mtcars[1:10, ],
keep = c("mpg", "cyl", "mpgcat"),
calculate = { mean_mpg = mean(mpg) },
attrib = list(mpg = dsattr(label = "Miles Per Gallon"),
cyl = dsattr(label = "Cylinders"),
mpgcat = dsattr(label = "Mileage Category")), {
if (mpg >= mean_mpg)
mpgcat <- "High"
else
mpgcat <- "Low"
})
# View attributes in dictionary
dictionary(df)
# # A tibble: 3 x 10
# Name Column Class Label Description Format Width Justify Rows NAs
# <chr> <chr> <chr> <chr> <chr> <lgl> <int> <chr> <int> <int>
# 1 df mpg numeric Miles Per Gallon NA NA NA NA 10 0
# 2 df cyl numeric Cylinders NA NA NA NA 10 0
# 3 df mpgcat character Mileage Category NA NA 4 NA 10 0
```
### Data Step Array
As mentioned previously, R typically operates in a column-wise manner. That is,
R processes data column-by-column. But what if you need to get a sum or
mean across a row?
This situation is what led to the development of the **data step array**. The
data step array allows you to define a list of columns and
iterate over the list inside a data step. Data step arrays are defined with the
_arrays_ parameter, which accepts a named list of `dsarray()` objects.
To see the array in action, we'll use the _AirPassengers_ sample
data. This data shows international airline passengers by month between
1949 and 1960. The data looks like this:
```{r eval=FALSE, echo=TRUE}
AirPassengers
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1949 112 118 132 129 121 135 148 148 136 119 104 118
# 1950 115 126 141 135 125 149 170 170 158 133 114 140
# 1951 145 150 178 163 172 178 199 199 184 162 146 166
# 1952 171 180 193 181 183 218 230 242 209 191 172 194
# 1953 196 196 236 235 229 243 264 272 237 211 180 201
# 1954 204 188 235 227 234 264 302 293 259 229 203 229
# 1955 242 233 267 269 270 315 364 347 312 274 237 278
# 1956 284 277 317 313 318 374 413 405 355 306 271 306
# 1957 315 301 356 348 355 422 465 467 404 347 305 336
# 1958 340 318 362 348 363 435 491 505 404 359 310 337
# 1959 360 342 406 396 420 472 548 559 463 407 362 405
# 1960 417 391 419 461 472 535 622 606 508 461 390 432
```
This example illustrates how to create row totals, row means, and find the
top month using a data step array. The array has an indexer
to extract values. You can use the indexer to extract a single value or
a subset of values. An empty indexer will return all the values in the
array.
#### Example 9: Using a Data Step Array
```{r eval=FALSE, echo=TRUE}
library(libr)
# Create AirPassengers Data Frame
df <- as.data.frame(t(matrix(AirPassengers, 12,
dimnames = list(month.abb, seq(1949, 1960)))),
stringsAsFactors = FALSE)
# Use datastep array to get year tot, mean, and top month
dat <- datastep(df,
arrays = list(months = dsarray(names(df))),
attrib = list(Tot = 0, Mean = 0, Top = ""),
drop = "mth",
{
Tot <- sum(months[])
Mean <- mean(months[])
for (mth in months) {
if (months[mth] == max(months[])) {
Top <- mth
}
}
})
dat
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Tot Mean Top
# 1949 112 118 132 129 121 135 148 148 136 119 104 118 1520 126.6667 Aug
# 1950 115 126 141 135 125 149 170 170 158 133 114 140 1676 139.6667 Aug
# 1951 145 150 178 163 172 178 199 199 184 162 146 166 2042 170.1667 Aug
# 1952 171 180 193 181 183 218 230 242 209 191 172 194 2364 197.0000 Aug
# 1953 196 196 236 235 229 243 264 272 237 211 180 201 2700 225.0000 Aug
# 1954 204 188 235 227 234 264 302 293 259 229 203 229 2867 238.9167 Jul
# 1955 242 233 267 269 270 315 364 347 312 274 237 278 3408 284.0000 Jul
# 1956 284 277 317 313 318 374 413 405 355 306 271 306 3939 328.2500 Jul
# 1957 315 301 356 348 355 422 465 467 404 347 305 336 4421 368.4167 Aug
# 1958 340 318 362 348 363 435 491 505 404 359 310 337 4572 381.0000 Aug
# 1959 360 342 406 396 420 472 548 559 463 407 362 405 5140 428.3333 Aug
# 1960 417 391 419 461 472 535 622 606 508 461 390 432 5714 476.1667 Jul
```
In the example above, the "Tot", "Mean", and "Top" columns were all calculated
using the datastep array. These types of row-wise statistics are hard
to calculate otherwise.
### Filtering and Duplicating Rows
The datastep provides different ways to control which rows are output.
First, the function has a `where` parameter to pass a filter expression to
the datastep. The where clause will be executed at the end of datastep
processing. Pass in the where clause
using the `expression()` function. Like so:
```{r eval=FALSE, echo=TRUE}
# Prepare sample data
dat <- as.data.frame(HairEyeColor)
# Filter for black hair and blue eyes
res <- datastep(dat,
where = expression(Hair == "Black" & Eye == "Blue"),
{})
res
# Hair Eye Sex Freq
# 1 Black Blue Male 11
# 2 Black Blue Female 9
```
The datastep also recognizes the `delete()` and `output()` functions to
remove or duplicate rows from inside the datastep. These functions give
you conditional control over which rows are output.
```{r eval=FALSE, echo=TRUE}
# Delete rows with frequencies less than 25
res1 <- datastep(dat, {
if (Freq < 25)
delete()
})
res1
# Hair Eye Sex Freq
# 1 Black Brown Male 32
# 2 Brown Brown Male 53
# 3 Brown Blue Male 50
# 4 Blond Blue Male 30
# 5 Brown Hazel Male 25
# 6 Black Brown Female 36
# 7 Brown Brown Female 66
# 8 Brown Blue Female 34
# 9 Blond Blue Female 64
# 10 Brown Hazel Female 29
# Only output rows for brown-eyes and frequencies over 25
res2 <- datastep(dat, {
if (Eye == "Brown") {
if (Freq >= 25) {
output()
}
}
})
res2
# Hair Eye Sex Freq
# 1 Black Brown Male 32
# 2 Brown Brown Male 53
# 3 Black Brown Female 36
# 4 Brown Brown Female 66
```
You can also use the `output()` function to create
datasets from scratch, just like in SAS®. To create a dataset from
scratch, simply pass in an empty data frame and output the desired values.
```{r eval=FALSE, echo=TRUE}
# Create metadata
res3 <- datastep(data.frame(), {
name <- "mtcars"
rows <- nrow(mtcars)
cols <- ncol(mtcars)
output()
name <- "iris"
rows <- nrow(iris)
cols <- ncol(iris)
output()
name <- "beaver1"
rows <- nrow(beaver1)
cols <- ncol(beaver1)
output()
})
res3
# name rows cols
# 1 mtcars 32 11
# 2 iris 150 5
# 3 beaver1 114 4
```
### Set and Merge Operations
When working with data, joining datasets is an essential activity. While there
are many different functions in R to perform joins, the `datastep()`
"set" and "merge" parameters offer unusual flexibility.
The "set" parameter stacks two or more datasets. The "merge" parameter joins
two or more datasets. Together, these two parameters allow you to perform
the most common types of data combinations.
To illustrate, first let's create up some sample datasets. The datasets we
will create include one "region" dataset, and two "stores" datasets. Note
that the columns on the stores datasets are not identical.
```{r eval=FALSE, echo=TRUE}
# Create sample data
region <- read.table(header = TRUE, text = '
REGION NAME
R01 East
R02 West
R03 North
R04 South
', stringsAsFactors = FALSE)
# First stores dataset
stores1 <- read.table(header = TRUE, text = '
ID NAME SIZE REGION FRANCHISE
A01 "Eastern Lumber" L R01 T
A02 "Tri-City Hardwood" M R02 F
A05 "Reliable Hardware" S R01 T
', stringsAsFactors = FALSE)
# Extra column on this one
stores2 <- read.table(header = TRUE, text = '
ID NAME SIZE REGION
A03 "AAA Mills" S R05
A04 "Home and Yard" L R03
', stringsAsFactors = FALSE)
```
Despite not having the same columns, the two stores datasets can be set
using the `datastep()` function. The function will fill in the missing values
automatically. Like so:
```{r eval=FALSE, echo=TRUE}
# Set operation
allstores <- datastep(stores1, set = stores2, {})
# Extra values filled with NA
allstores
# ID NAME SIZE REGION FRANCHISE
# 1 A01 Eastern Lumber L R01 TRUE
# 2 A02 Tri-City Hardwood M R02 FALSE
# 3 A05 Reliable Hardware S R01 TRUE
# 4 A03 AAA Mills S R05 NA
# 5 A04 Home and Yard L R03 NA
```
Let's pretend we noticed the missing data, and decide to fill it in. We
can do that by merging the missing FRANCHISE values to the second stores
dataset. First let's create the missing data:
```{r eval=FALSE, echo=TRUE}
# Create small dataset of missing FRANCHISE values
franchises <- data.frame(FRANCHISE = c(F, F), stringsAsFactors = FALSE)
franchises
# FRANCHISE
# 1 FALSE
# 2 FALSE
```
Next we can merge in the missing data on "stores2", and set the two store
datasets again:
```{r eval=FALSE, echo=TRUE}
# Merge in missing FRANCHISE column
stores2mod <- datastep(stores2, merge = franchises, {})
stores2mod
# ID NAME SIZE REGION FRANCHISE
# 1 A03 AAA Mills S R05 FALSE
# 2 A04 Home and Yard L R03 FALSE
# Set again
allstores <- datastep(stores1, set = stores2mod, {})
# Now everything is aligned
allstores
# ID NAME SIZE REGION FRANCHISE
# 1 A01 Eastern Lumber L R01 TRUE
# 2 A02 Tri-City Hardwood M R02 FALSE
# 3 A05 Reliable Hardware S R01 TRUE
# 4 A03 AAA Mills S R05 FALSE
# 5 A04 Home and Yard L R03 FALSE
```
Observe that we did not have to specify a join condition on the merge. When
no "merge_by" is indicated, the datastep will simply append the new
columns to the right - without complaining. This behavior is very convenient.
Now let's do another join, but this time we will specify a join condition.
We will join in the store regions by the region ID. We will also set up
merge flags so we can see which rows were in which input dataset.
```{r eval=FALSE, echo=TRUE}
# Merge operation - Outer Join
res <- datastep(allstores, merge = region,
merge_by = "REGION",
merge_in = c("inA", "inB"), {})
# View results
res
# ID NAME.1 SIZE REGION FRANCHISE NAME.2 inA inB
# 1 A01 Eastern Lumber L R01 TRUE East 1 1
# 2 A05 Reliable Hardware S R01 TRUE East 1 1
# 3 A02 Tri-City Hardwood M R02 FALSE West 1 1
# 4 A04 Home and Yard L R03 FALSE North 1 1
# 5 A03 AAA Mills S R05 FALSE <NA> 1 0
# 6 <NA> <NA> <NA> R04 NA South 0 1
```
Notice three things:
1. The "Name" field appeared in both datasets, and
was therefore appended with suffixes to distinguish them.
2. The region value "R05" is in the stores dataset, but not in the
region dataset. It appears this value was coded incorrectly.
3. Region "R04" had no stores.
Let's try one last time to fix the above issues. We can fix the column names
with the "rename" parameter, and exclude rows with a "where" expression. Also,
we can recode "R05" to "R04" inside the datastep. Finally, we'll drop
the merge flags to clean up the columns.
```{r eval=FALSE, echo=TRUE}
# Merge operation - Left join and clean up
res <- datastep(allstores, merge = region,
merge_by = "REGION",
merge_in = c("inA", "inB"),
rename = c(NAME.1 = "STORE_NAME", NAME.2 = "REGION_NAME"),
where = expression(inA == TRUE),
drop = c("inA", "inB"),
{
if (REGION == "R05") {
REGION <- "R04"
NAME.2 <- "South"
}
})
#'
# View results
res
# ID STORE_NAME SIZE REGION FRANCHISE REGION_NAME
# 1 A01 Eastern Lumber L R01 TRUE East
# 2 A05 Reliable Hardware S R01 TRUE East
# 3 A02 Tri-City Hardwood M R02 FALSE West
# 4 A04 Home and Yard L R03 FALSE North
# 5 A03 AAA Mills S R04 FALSE South
```
### Datastep Performance
One weakness of the **libr** `datastep()` function is performance. The
function is far slower than the equivalent SAS® datastep. The performance
profile may limit the number of records you are able to reasonably process
with the `datastep()`.
One thing you can do to increase performance is to reduce the number of rows
and columns on the input data. You can perform this pre-filtering with Base R
or **Tidyverse** functions. This strategy is particularly recommended if
you were planning to subset the data anyway using the "where" or "keep" options.
The Base R `subset()` function is convenient to use because it is always available.
Here is an example showing how to reduce the size of the **iris** sample
dataframe using Base R `subset()` before sending it to a datastep.
#### Example 10: Increasing Performance
```
# Subset the input dataset first for only needed rows and columns
dat <- subset(iris, Species == 'versicolor', c('Petal.Length', 'Petal.Width')) |>
datastep({
if (Petal.Length < 3.5)
Petal.Size <- "Short"
else if (Petal.Length > 4.5)
Petal.Size <- "Long"
else
Petal.Size <- "Medium"
})
# View Some Results
dat[1:10, ]
# Petal.Length Petal.Width Petal.Size
# 1 4.7 1.4 Long
# 2 4.5 1.5 Medium
# 3 4.9 1.5 Long
# 4 4.0 1.3 Medium
# 5 4.6 1.5 Long
# 6 4.5 1.3 Medium
# 7 4.7 1.6 Long
# 8 3.3 1.0 Short
# 9 4.6 1.3 Long
# 10 3.9 1.4 Medium
```
Next: [Disclaimer](libr-disclaimer.html)