Skip to content
Switch branches/tags
Go to file
Cannot retrieve contributors at this time
title: An example of base::split() for looping through groups
author: Ariel Muldoon
date: '2019-11-27'
slug: split-example
- r
- loops
- list
description: The base::split() function allows you to "split" your data into defined groups, returning a list with one element per group. This post goes through an example to show the utility of split() for looping through groups.
draft: FALSE
```{r setup, include = FALSE, message = FALSE, purl = FALSE}
knitr::opts_chunk$set(comment = "#")
filename = 'render_toc.R')
dat = data.frame(id = rep(letters[1:3], each = 10),
var1 = round( runif(n = 30, min = 2, max = 5), 1),
var2 = round( runif(n = 30, min = 5, max = 25), 1) )
I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. "Simplest" could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using `base::split()` as a possible option since it is one I find fairly approachable.
I turns out I don't have a go-to example for how to get started with a `split()` approach. So here's a quick blog post about it! `r emo::ji("smile")`
## Table of Contents
```{r toc, echo = FALSE, purl = FALSE}
input = knitr::current_input()
# Load R packages
I'll load **purrr** for looping through lists.
```{r, message = FALSE, warning = FALSE}
library(purrr) # 0.3.3
# A dataset with groups
I made a small dataset to use with `split()`. The `id` variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, `var1` and `var2`.
dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4,
3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6,
3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4,
22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5,
6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1,
11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = "data.frame", row.names = c(NA,
# Create separate data.frames per group
If the goal is to apply a function to each dataset in each group, we need to pull out a dataset for each `id`. One approach to do this is to make a subset for each group and then apply the function of interest to the subset. A classic approach would be to do the subsetting within a `for()` loop.
This is a situation where I find `split()` to be really convenient. It splits the data by a defined group variable so we don't have to subset things manually.
The output from `split()` is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of `split()` handy for keeping track of groups in subsequent steps.
Here's an example, where I split `dat` by the `id` variable.
dat_list = split(dat, dat$id)
# Looping through the list
Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer.
For example, if I want to fit a linear model of `var1` vs `var2` for each group I might do the looping with `purrr::map()` or `lapply()`.
Each element of the new list still has the grouping information attached via the list names.
map(dat_list, ~lm(var1 ~ var2, data = .x) )
I could also create a function that fit a model and then returned model output. For example, maybe what I really wanted to do is the fit a linear model and extract $R^2$ for each group model fit.
r2 = function(data) {
fit = lm(var1 ~ var2, data = data)
The output of my `r2` function, which uses `broom::glance()`, is a data.frame.
r2(data = dat)
Since the function output is a data.frame, I can use `purrr::map_dfr()` to combine the output per group into a single data.frame. The `.id` argument creates a new variable to store the list names in the output.
map_dfr(dat_list, r2, .id = "id")
# Splitting by multiple groups
It is possible to split data by multiple grouping variables in the `split()` function. The grouping variables must be passed as a list.
Here's an example, using the built-in `mtcars` dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a `.` (but see the `sep` argument to control this).
mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
If all combinations of groups are not present, the `drop` argument in `split()` allows us to drop missing combinations. By default combinations that aren't present are kept as 0-length data.frames.
# Other thoughts on split()
I feel like `split()` was a gateway function for me to get started working with lists and associated convenience functions like `lapply()` and `purrr::map()` for looping through lists. I think learning to work with lists and "list loops" also made the learning curve for [list-columns]( in data.frames and the `nest()`/`unnest()` approach of analysis-by-groups a little less steep for me.
# Just the code, please
```{r getlabels, echo = FALSE, purl = FALSE}
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "toc", "getlabels", "allcode", "makescript")]
```{r makescript, include = FALSE, purl = FALSE}
options(knitr.duplicate.label = "allow") # Needed to purl like this
input = knitr::current_input() # filename of input document
scriptpath = paste(tools::file_path_sans_ext(input), "R", sep = ".")
output = here::here("static", "script", scriptpath)
knitr::purl(input, output, documentation = 0, quiet = TRUE)
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](`r paste0("/script/", scriptpath)`).
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE}