# EEP/IAS C118 - Spring 2024 - Section 2

## Function of the Week: group_by()

This week, we will review the usage of the function `group_by()`, which was also covered in R Bootcamp part 1.

## Preamble & Load Dataset

Begin by importing the "sleep75.dta" dataset. Note that reading Stata data files (.dta or .DTA) requires the `haven` package, specifically the `read_dta()` function. Let us also load the `dplyr` package, which will come in handy later on.
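As a rough illustration of the loading step, the sketch below writes a tiny made-up table to a temporary .dta file and reads it back with `read_dta()`, so it runs even without the course file. The object names `toy`, `tmp_path`, and `toy_from_stata` and the numbers in the table are hypothetical; only the column names `male` and `earns74` come from this worksheet.

```r
# Load the two packages used throughout this section
library(haven)  # read_dta() / write_dta() for Stata files
library(dplyr)  # tibble(), filter(), group_by(), summarise(), mutate()

# Hypothetical toy data, NOT the real sleep75 values
toy <- tibble(
  male    = c(0, 1, 1),
  earns74 = c(8000, 9000, 12000)
)

# Write the toy data to a temporary Stata file, then read it back
tmp_path <- tempfile(fileext = ".dta")
write_dta(toy, tmp_path)
toy_from_stata <- read_dta(tmp_path)

# Look at the first few rows
head(toy_from_stata)

# With the course file in your working directory, the analogous call would be:
# sleep75 <- read_dta("sleep75.dta")
```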

In [None]:
# Load "haven" and "dplyr" package


In [None]:
# Load in "sleep75.dta" and save it as data-frame object "sleep75"


In [None]:
# Use head() function to explore the dataframe


## Summary stats by groups

Sometimes we may want to group our data by values of certain variables.

For example, we may want to know if the levels of income in our sample differ systematically by gender.

Below, we cover three methods for doing so.

Note: The relevant variables in the dataframe are: `earns74` for income and `male` for gender.

### Method 1: Using `filter()`

If we want summary statistics from our _sleep75_ data split on the variable _male_, we could use `filter()` to subset the dataset by gender to create two separate datasets and run each through `summarise()`.

Note: The `filter()` function comes from the `dplyr` package.
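A minimal sketch of this approach on a made-up table (the values below are hypothetical and are not taken from sleep75; the names `toy`, `toy_female`, `toy_male`, and `avg_income` are placeholders):

```r
library(dplyr)

# Hypothetical toy data standing in for sleep75
toy <- tibble(
  male    = c(0, 0, 1, 1, 1),
  earns74 = c(8000, 10000, 9000, 12000, 15000)
)

# Subset once per gender
toy_female <- filter(toy, male == 0)
toy_male   <- filter(toy, male == 1)

# Run the same summarise() call on each subset
female_mean <- summarise(toy_female, avg_income = mean(earns74))
male_mean   <- summarise(toy_male,   avg_income = mean(earns74))

female_mean  # one row: avg_income = 9000
male_mean    # one row: avg_income = 12000
```

Note that the two subsets and the two summary tables all exist as separate objects.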

In [None]:
# Create a new smaller dataframe with subset of observations from "sleep75" with only female observations

# Create a new smaller dataframe with subset of observations from "sleep75" with only male observations


In [None]:
# Use summarise() to find mean income for females only


In [None]:
# Use summarise() to find mean income for males only


However, this requires storing multiple objects in memory and running the same `summarise()` code twice.

### Method 2: Using `group_by()`

Instead, we can perform the same task more cleanly with `group_by()`:

1. Group the data by the variable male: `sleep75 <- group_by(sleep75, male)`

2. Produce summary statistics using `summarise()`: `summarise(sleep75, "Count of Observations" = n(), "Average Income" = mean(earns74))`

This code produces a table of summary statistics, with one row for females (male = 0) and one row for males (male = 1), as opposed to the default of one row for the entire dataset.

**First column**: the variable we grouped by

**Other columns**: our custom summary statistics

**First row**: summary stats for women and other genders (male = 0)

**Second row**: summary stats for men (male = 1)
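The two steps and the resulting table layout can be sketched on a small made-up table (the values are hypothetical and are not the real sleep75 numbers; `toy`, `toy_grouped`, and `by_gender` are placeholder names):

```r
library(dplyr)

# Hypothetical toy data standing in for sleep75
toy <- tibble(
  male    = c(0, 0, 1, 1, 1),
  earns74 = c(8000, 10000, 9000, 12000, 15000)
)

# Step 1: group by gender
toy_grouped <- group_by(toy, male)

# Step 2: summarise within each group
by_gender <- summarise(
  toy_grouped,
  "Count of Observations" = n(),
  "Average Income"        = mean(earns74)
)

by_gender
# Two rows: male = 0 (count 2, average 9000) and male = 1 (count 3, average 12000)
```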

Let's say we want to go further and create variables based on group membership. We can do that too! If we want a variable equal to the mean of income conditional on gender, we can add it to our grouped data using `mutate()`, again from the `dplyr` package:

`sleep75 <- mutate(sleep75, inc_avg = mean(earns74))`
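To see what this does row by row, here is a small sketch on made-up data (hypothetical values, not the real sleep75 numbers). Each row receives the mean income of its own gender group, and the number of rows does not change:

```r
library(dplyr)

# Hypothetical toy data standing in for sleep75
toy <- tibble(
  male    = c(0, 0, 1, 1, 1),
  earns74 = c(8000, 10000, 9000, 12000, 15000)
)

# Group by gender, then add the group mean as a new column
toy_grouped <- group_by(toy, male)
toy_grouped <- mutate(toy_grouped, inc_avg = mean(earns74))

toy_grouped
# inc_avg is 9000 on the two male = 0 rows and 12000 on the three male = 1 rows
```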

In [None]:
# Take a look at your dataset. It added a new variable named 'inc_avg', and its values are 10956.2 for males and 8214 for females.


Note that summary functions such as `mean()`, `sd()`, `min()`, and `quantile()` all behave this way with grouped data: each is computed within groups rather than over the whole dataset.

Grouped output from `summarise()` will have as many rows as there are distinct values of the grouping variable. We would have had 15 rows if we had grouped on years of education!

When done with grouped transformations, make sure to use `ungroup()` on your data so that later functions operate on the whole dataset again:

`sleep75 <- ungroup(sleep75)`
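A short sketch of the difference, again on made-up data (hypothetical values; `toy_grouped`, `toy_plain`, and `overall` are placeholder names). `group_vars()` reports which variables a data frame is currently grouped by:

```r
library(dplyr)

# Hypothetical toy data standing in for sleep75
toy <- tibble(
  male    = c(0, 0, 1, 1, 1),
  earns74 = c(8000, 10000, 9000, 12000, 15000)
)

toy_grouped <- group_by(toy, male)
group_vars(toy_grouped)   # "male"

# Remove the grouping
toy_plain <- ungroup(toy_grouped)
group_vars(toy_plain)     # character(0), i.e. no grouping left

# summarise() now returns a single row for the whole table
overall <- summarise(toy_plain, avg_income = mean(earns74))
overall                   # avg_income = 10800
```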

You can group by multiple variables; just separate them with commas (e.g. group by age, then by education level within each age).
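As a sketch, assuming the age and education columns are called `age` and `educ` (these column names and all values below are assumptions made for illustration, not taken from this worksheet):

```r
library(dplyr)

# Hypothetical toy data; column names `age` and `educ` are assumed
toy <- tibble(
  age     = c(30, 30, 30, 45, 45),
  educ    = c(12, 12, 16, 12, 16),
  earns74 = c(8000, 10000, 9000, 12000, 15000)
)

# One group per distinct (age, educ) combination
by_age_educ <- summarise(
  group_by(toy, age, educ),
  avg_income = mean(earns74),
  .groups = "drop"  # return an ungrouped table
)

by_age_educ
# Four rows: (30, 12) -> 9000, (30, 16) -> 9000, (45, 12) -> 12000, (45, 16) -> 15000
```

Setting `.groups = "drop"` is one way to avoid carrying leftover grouping into later steps; calling `ungroup()` afterwards would work as well.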

We might also want to create a new variable equal to the mean of income conditional on race.

With `group_by()` from the `dplyr` package (part of the tidyverse), we can do both of these!