Skip to content
No description, website, or topics provided.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
README.md

README.md

title thumbnailImagePosition thumbnailImage coverImage metaAlignment coverMeta date categories tags
Aggregating across columns: data.table, aggregate, dplyr
left
center
out
2018-03-23 21:13:14 -0500
Personal projects
R
dplyr
data.table
aggregate
summary statistics
knitr::opts_chunk$set(echo = TRUE)

This week I'm not doing a data analysis project so much as a data cleaning project. One of the most common problems I come across in data cleaning is how to get summary statistics for various groups in the data. And it's also one of the most annoying problems, because I invariably forget how to do it, and end up having to go back to old code to copy and paste.


Data.table, base R, dplyr

At various times I have used the data.table package, base R "aggregate", or the dplyr package and let me voice my favor for dplyr at the beginning. It's very intuitive and works just as well as the other methods. Like the Dutch cleaning product brand HG, dplyr "doet wat het belooft" (Does what it promises). In this post I'm going to outline and evaluate each of the methods.


The data

For this project I'm going to use the npk data from the R Datasets package. It's a dataframe of 24 observations and 5 variables measuring the results of an experiment of how nitrogen (N), phosphate (P), and potassium (K) affect the crop yield of peas. This dataset is useful for today's project because data is grouped into block (1-6), or we could aggregate by whether crops were exposed to nitrogen, phosphate and/or potassium. The yield of peas (outcome variable) is in units of pounds per plot, so more is better.

Load the data

npk <- as.data.frame(npk)
str(npk)

I will now proceed to take the average yield per block and add that to the dataset as a column.


"mean"

First things first, if all we want to know if the mean yield for all pea crops, we can do this very easily with "mean"

Mean <- mean(npk$yield, na.rm=T) 
Mean

But usually, this is not very useful because what you actually want to know is the crop yield per block.


"data.table"

The very first way I learned to do this is somewhat cumbersome. It involved using the data.table package to create a data table of aggregated statistics, and then merging it (or not) with the original data.

# load the package
library(data.table)

# use data.table to transform the data frame to a data table. Careful, this command with rename your variables to V1, V2, ... Vn
npk_datatable <- data.table(npk$block, npk$yield)
# rename variables
npk_datatable$block <- npk_datatable$V1
npk_datatable$yield <- npk_datatable$V2
npk_datatable <- npk_datatable[,c("block","yield")]

# create an aggregated data set of the mean of "yield" per "block"
mean_datatable <- npk_datatable[, mean(yield, na.rm=T), by=block] 
mean_datatable

The variable is renamed from yield to V1. Aggravating, but there's nothing to do but change the name:

mean_datatable$mean_yield <- mean_datatable$V1
(mean_datatable <- mean_datatable[, c("block","mean_yield")])

"aggregate": Base R

You can do the exact same thing in base R using the aggregate command:

npk_aggregate <- aggregate(npk[, 5], list(npk$block), mean, na.rm=T)
npk_aggregate

Arg! I can feel my blood pressure rising every time R changes the names of my variables. Block is now Group.1 and mean_yield is x. Change names again:

npk_aggregate$mean_yield <- npk_aggregate$x
npk_aggregate$block <- npk_aggregate$Group.1
(npk_aggregate <- npk_aggregate[, c("block","mean_yield")])

These methods both work, but trying to remember them is about as Aggravating as trying to teach my grandma how to double click. Which is to say: very. Enter: dplyr and tidyverse to save the day!


"dplyr"

I started out writing this post not entirely convinced that dplyr is actually better than aggregate. But in the process of writing, I'm convinced. The syntax is just so readable and seems to make intuitive sense. Reproducing it for this post was super easy.

library(dplyr)
npk_dplyr <- npk %>%
  group_by(block) %>%
  summarize(mean(yield, na.rm=T))
npk_dplyr


"merge"

With all of these methods, if we want to append the aggregated information to the original data, we can do so using "merge". For example:

npk_datatable <- merge(npk, mean_datatable, intersect(names(npk), names(mean_datatable)))
head(npk_datatable)

And that's it! A brief illustration of the differences between data.table, base R, and dplyr for summarizing aggregated data.

This blog post can be found on GitHub.

You can’t perform that action at this time.