Permalink
Cannot retrieve contributors at this time
Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
Sign up
Fetching contributors…

--- | |
title: An example of base::split() for looping through groups | |
author: Ariel Muldoon | |
date: '2019-11-27' | |
slug: split-example | |
categories: | |
- r | |
tags: | |
- loops | |
- list | |
description: The base::split() function allows you to "split" your data into defined groups, returning a list with one element per group. This post goes through an example to show the utility of split() for looping through groups. | |
draft: FALSE | |
--- | |
```{r setup, include = FALSE, message = FALSE, purl = FALSE} | |
knitr::opts_chunk$set(comment = "#") | |
devtools::source_gist("2500a85297b742c6f2fb3a14549f5851", | |
filename = 'render_toc.R') | |
set.seed(16) | |
dat = data.frame(id = rep(letters[1:3], each = 10), | |
var1 = round( runif(n = 30, min = 2, max = 5), 1), | |
var2 = round( runif(n = 30, min = 5, max = 25), 1) ) | |
``` | |
I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. "Simplest" could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using `base::split()` as a possible option since it is one I find fairly approachable. | |
I turns out I don't have a go-to example for how to get started with a `split()` approach. So here's a quick blog post about it! `r emo::ji("smile")` | |
## Table of Contents | |
```{r toc, echo = FALSE, purl = FALSE} | |
input = knitr::current_input() | |
render_toc(input) | |
``` | |
# Load R packages | |
I'll load **purrr** for looping through lists. | |
```{r, message = FALSE, warning = FALSE} | |
library(purrr) # 0.3.3 | |
``` | |
# A dataset with groups | |
I made a small dataset to use with `split()`. The `id` variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, `var1` and `var2`. | |
```{r} | |
dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, | |
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, | |
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), | |
var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, | |
3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, | |
3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, | |
22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, | |
6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, | |
11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = "data.frame", row.names = c(NA, | |
-30L)) | |
head(dat) | |
``` | |
# Create separate data.frames per group | |
If the goal is to apply a function to each dataset in each group, we need to pull out a dataset for each `id`. One approach to do this is to make a subset for each group and then apply the function of interest to the subset. A classic approach would be to do the subsetting within a `for()` loop. | |
This is a situation where I find `split()` to be really convenient. It splits the data by a defined group variable so we don't have to subset things manually. | |
The output from `split()` is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of `split()` handy for keeping track of groups in subsequent steps. | |
Here's an example, where I split `dat` by the `id` variable. | |
```{r} | |
dat_list = split(dat, dat$id) | |
dat_list | |
``` | |
# Looping through the list | |
Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer. | |
For example, if I want to fit a linear model of `var1` vs `var2` for each group I might do the looping with `purrr::map()` or `lapply()`. | |
Each element of the new list still has the grouping information attached via the list names. | |
```{r} | |
map(dat_list, ~lm(var1 ~ var2, data = .x) ) | |
``` | |
I could also create a function that fit a model and then returned model output. For example, maybe what I really wanted to do is the fit a linear model and extract $R^2$ for each group model fit. | |
```{r} | |
r2 = function(data) { | |
fit = lm(var1 ~ var2, data = data) | |
broom::glance(fit) | |
} | |
``` | |
The output of my `r2` function, which uses `broom::glance()`, is a data.frame. | |
```{r} | |
r2(data = dat) | |
``` | |
Since the function output is a data.frame, I can use `purrr::map_dfr()` to combine the output per group into a single data.frame. The `.id` argument creates a new variable to store the list names in the output. | |
```{r} | |
map_dfr(dat_list, r2, .id = "id") | |
``` | |
# Splitting by multiple groups | |
It is possible to split data by multiple grouping variables in the `split()` function. The grouping variables must be passed as a list. | |
Here's an example, using the built-in `mtcars` dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a `.` (but see the `sep` argument to control this). | |
```{r} | |
mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) ) | |
mtcars_cylam[1:2] | |
``` | |
If all combinations of groups are not present, the `drop` argument in `split()` allows us to drop missing combinations. By default combinations that aren't present are kept as 0-length data.frames. | |
# Other thoughts on split() | |
I feel like `split()` was a gateway function for me to get started working with lists and associated convenience functions like `lapply()` and `purrr::map()` for looping through lists. I think learning to work with lists and "list loops" also made the learning curve for [list-columns](https://r4ds.had.co.nz/many-models.html#list-columns-1) in data.frames and the `nest()`/`unnest()` approach of analysis-by-groups a little less steep for me. | |
# Just the code, please | |
```{r getlabels, echo = FALSE, purl = FALSE} | |
labs = knitr::all_labels() | |
labs = labs[!labs %in% c("setup", "toc", "getlabels", "allcode", "makescript")] | |
``` | |
```{r makescript, include = FALSE, purl = FALSE} | |
options(knitr.duplicate.label = "allow") # Needed to purl like this | |
input = knitr::current_input() # filename of input document | |
scriptpath = paste(tools::file_path_sans_ext(input), "R", sep = ".") | |
output = here::here("static", "script", scriptpath) | |
knitr::purl(input, output, documentation = 0, quiet = TRUE) | |
``` | |
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](`r paste0("/script/", scriptpath)`). | |
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE} | |
``` |