# 🚘 Analyzing SF traffic stops with `R`: Part 1

<img src="https://github.com/joshuagrossman/dsb-win-2023/blob/main/opp-munging-plotting/img/sf-traffic.jpg?raw=1" alt="traffic" width="600" align="left"/>

This is Part 1. Other parts can be found [here](https://www.jdgrossman.com).

## Introduction

In this series of tutorials, we'll use `R` to explore traffic stops in San Francisco (SF). In particular, we'll investigate whether there is evidence of racial discrimination in SF's policing practices.

> **Important note**: Policing can be a sensitive subject. It's important to remember that each row in our data represents a real interaction between a police officer and driver. Please keep this in mind as you work through the tutorial, and be sure to engage with the material to the extent you're comfortable.

By the end of the tutorials, you'll have a foundational understanding of the following:
1. 📊 How to use `R` to explore tabular data and calculate descriptive statistics.
2. 📈 How to make an informative plot with `R`.
3. ⚖️ How to approach questions about social policy with data.

Let's get started!

## ✅ Set up

While the core `R` language contains many useful functions (e.g., `sum` and `sample`), there is vast functionality built on top of `R` by community members.

Make sure to run the cell below. It imports additional useful functions, adjusts `R` settings, and loads in data.

In [47]:
# Load in additional functions
library(tidyverse)
library(lubridate)

# Use three digits past the decimal point
options(digits = 3)

# This is where the data is stored.
STOPS_PATH = "https://github.com/joshuagrossman/dsb-win-2023/raw/main/opp-munging-plotting/data/sf_stop_data.rds"

# Read in the data
stops <- read_rds(STOPS_PATH)

### 🖼️ The data frame

Data frames are like spreadsheets in Microsoft Excel or Google Sheets: they have rows and columns, and each cell in the spreadsheet contains data.

Run the cell below to preview the `stops` data. What do you notice?

> 🔎 The `head` function allows us to see the first few rows of a data frame (six by default).

In [41]:
head(stops)

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation


In [43]:
summary(stops)

      date                 time                   location        
 Min.   :2009-01-01   Min.   :00:00:00.000000   Length:636161     
 1st Qu.:2010-06-21   1st Qu.:10:04:00.000000   Class :character  
 Median :2012-01-15   Median :15:18:00.000000   Mode  :character  
 Mean   :2012-05-04   Mean   :14:24:39.721889                     
 3rd Qu.:2014-01-27   3rd Qu.:19:35:00.000000                     
 Max.   :2016-06-30   Max.   :23:59:00.000000                     
                      NA's   :13                                  
      lat            lng         district              age       
 Min.   :36.3   Min.   :-124   Length:636161      Min.   : 10.0  
 1st Qu.:37.7   1st Qu.:-122   Class :character   1st Qu.: 27.0  
 Median :37.8   Median :-122   Mode  :character   Median : 35.0  
 Mean   :37.8   Mean   :-122                      Mean   : 38.1  
 3rd Qu.:37.8   3rd Qu.:-122                      3rd Qu.: 48.0  
 Max.   :39.7   Max.   :-120                      Max.   :100.0  
 N

⬆️ From the preview above, we might guess that each row in the `stops` data frame represents a stop, and each column contains one piece of information about that stop.

> This guess is correct!

### 💭 Asking questions about the data

As an analyst, you might start with some basic questions:

1. How many stops (i.e., rows) are in the `stops` data?
2. What do we know about each stop?
3. When was the earliest stop?
4. What were the most common reasons for stops?
5. Who is most likely to get stopped?

Let's start with the first question: how many rows are in the `stops` data?

In [6]:
nrow(stops)

Looks like we have information on approximately 640,000 stops.

What do we know about each stop?

In [7]:
colnames(stops)

It looks like we have the basics of each stop: time, location, demographics, and outcomes.

## 🚀 Exercise: Stop dates

When did the traffic stops in the `stops` data occur?

Use the `date` column in the `stops` data to get a sense of when stops typically occur. Write a comment explaining your results.

A few pointers:

> 💵 To extract a column from a data frame, use the `$` symbol. To retrieve column `age` from data frame `df`, we write `df$age`.

> You may find the following functions helpful: `sample`, `min`, `max`, `range`, and `print`. You can learn more about a function `f` by running `?f`.
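
As a warm-up for the pointers above, here is a minimal sketch of pulling out a column with `$` and summarizing it. The `toy` data frame and its values are made up for illustration and are not taken from the `stops` data.

```r
# Toy data frame (hypothetical values, not from the `stops` data)
toy <- data.frame(
    date = as.Date(c("2015-03-01", "2015-01-15", "2015-02-10")),
    age = c(31, 45, 22)
)

# `$` pulls out a single column as a vector
toy_dates <- toy$date

earliest <- min(toy_dates)       # earliest date in the column
latest <- max(toy_dates)         # latest date in the column
date_range <- range(toy_dates)   # earliest and latest together

print(earliest)
print(latest)
print(date_range)
```

The same pattern should carry over to `stops$date`, though the real column is much longer.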

In [49]:
## Your code here!
# Count the number of stops on each date. (Open question: how could we group by month instead?)
stop_by_date = stops %>% count(date)
stop_by_date


date,n
<date>,<int>
2009-01-01,258
2009-01-02,314
2009-01-03,308
2009-01-04,345
2009-01-05,349
2009-01-06,379
2009-01-07,409
2009-01-08,375
2009-01-09,319
2009-01-10,283
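
Counting by date gives one row per day. To count by month instead, one possible approach is to round each date down to the first day of its month and then count. The sketch below assumes `floor_date()` from `lubridate` and uses a made-up `toy` tibble rather than the real data.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical dates, not from the `stops` data)
toy <- tibble(
    date = as.Date(c(
        "2009-01-01", "2009-01-20",
        "2009-02-03", "2009-02-04", "2009-02-28"
    ))
)

stops_by_month <- toy %>%
    # Round every date down to the first day of its month
    mutate(month = floor_date(date, unit = "month")) %>%
    # Count the number of rows in each month
    count(month)

stops_by_month
```

Because the year is kept in the rounded date, January 2009 and January 2010 would land in separate rows.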


In [50]:
# getting summary of the date, can do this for the entire data set
summary(stop_by_date) # we see start: 2009-01-01    end: 2016-06-30

      date                  n      
 Min.   :2009-01-01   Min.   :  1  
 1st Qu.:2010-09-23   1st Qu.:205  
 Median :2012-06-15   Median :247  
 Mean   :2012-07-31   Mean   :252  
 3rd Qu.:2014-03-07   3rd Qu.:296  
 Max.   :2016-06-30   Max.   :630  

In [51]:
# Find the largest number of stops on a single date
max(stop_by_date$n)
# Filter the data to find which date has the highest number of stops
most_stops <- stop_by_date$date[stop_by_date$n == max(stop_by_date$n)]
most_stops # 2009-03-20

Finding the earliest stop using the summary info

In [48]:
nrow(stops)

## 🚰 The pipe: `%>%`

Both of these lines of code do exactly the same thing:

In [32]:
# Method 1
print(nrow(stops))

# Method 2
stops %>%
    nrow() %>%
    print()

[1] 636161
[1] 636161


Why should we care? Read on to find out!

### The math of the pipe `%>%`

To process a dataset, we may have to use several functions. For example, we may want to use function `a`, then function `b`, and finally function `c`:

```
c(b(a(data)))
```

To understand what this code is doing, we have to read the code ⏪inside out⏩: we start with `a`, then apply `b`, then apply `c`.

🙀 If we start adding more functions, things get messy:

```
f(e(d(c(b(a(data))))))
```


The pipe `%>%` allows us to turn our code inside out. This makes our code read more like a sentence:

```
# do a(), then b(), then c(), then d(), then e(), then f()

data %>% a() %>% b() %>% c() %>% d() %>% e() %>% f()
```

More readably:

```
data %>%
    a() %>%
    b() %>%
    c() %>%
    d() %>%
    e() %>%
    f()
```

The pipe pushes (i.e., pipes!) what's on the left of the pipe `%>%` into the first argument of the function on the right:

```
x %>% f() == f(x)
x %>% f(y) == f(x, y)
x %>% f(y, z) == f(x, y, z)
```

The pipe `%>%` really ☀️shines☀️ when you have a lot of steps!
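
To make the equivalences above concrete, here is a small sketch using base `R` functions on a made-up vector `x`. It loads `magrittr` directly for the pipe; in this tutorial, `library(tidyverse)` already makes `%>%` available.

```r
library(magrittr)  # provides %>%

# Toy vector (hypothetical values)
x <- c(4.567, 1.234, 9.876)

# x %>% f() is the same as f(x)
nested_max <- max(x)
piped_max <- x %>% max()

# x %>% f(y) is the same as f(x, y)
nested_round <- round(x, 1)
piped_round <- x %>% round(1)

# A longer chain: sort, then round, then keep the first two values
nested_chain <- head(round(sort(x), 1), 2)
piped_chain <- x %>%
    sort() %>%
    round(1) %>%
    head(2)

identical(nested_max, piped_max)
identical(nested_round, piped_round)
identical(nested_chain, piped_chain)
```

The nested and piped versions compute the same thing; only the reading order differs.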

## 📝 Adding new columns with `mutate`

Our data extends from 2009 to the first half of 2016. Suppose we want to examine the most recent full year of data: 2015.

Problem: We don't have a `year` column. To add new columns, we use `mutate`.

🖥️ Usage: `mutate(data, new_col = f(existing_col))`
* `data`: the data frame
* `new_col`: name of the new column to add
* `f`: function to apply to existing column(s) to generate the new column
* `existing_col`: name of existing column

For example, here's how we could add a column to `stops` containing the first digit of the driver's age.

In [34]:
# adding in age_first_digit column
stops %>%
    mutate(age_first_digit = floor(age/10)) %>%
    head()

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,age_first_digit
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation,2
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation,4
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation,4
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation,2
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation,2
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation,3


❗❗❗Important note❗❗❗: Most `R` functions are "copy on modify". In other words, when we apply a function to data, `R` creates a copy of the data and then modifies the copy. The original data is unchanged.

So, `mutate` alone will not change the original data.
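
Here is a minimal sketch of what that means in practice, using a made-up `toy` tibble. The names `toy` and `toy_w_digit` are hypothetical and only used for illustration.

```r
library(dplyr)

# Toy data (hypothetical ages)
toy <- tibble(age = c(22, 44, 45))

# Calling mutate() without assignment returns a modified copy...
toy %>% mutate(age_first_digit = floor(age / 10))

# ...but `toy` itself still has only one column
original_cols <- colnames(toy)
original_cols

# To keep the new column, assign the result to a variable
toy_w_digit <- toy %>% mutate(age_first_digit = floor(age / 10))
new_cols <- colnames(toy_w_digit)
new_cols
```

If we want to keep a new column around, we need to save the result with `<-`.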

### 🚀 Exercise

1. Use `year()` and `mutate()` to add a new column called `yr` to our `stops` data.

> You can read about the `year()` function by running `?year`.

2. Assign the resulting data frame to a new variable called `stops_w_yr`.

3. Finally, run `count(stops_w_yr, yr)`.

> What do you think `count` does? Do you notice any patterns?

In [52]:
?year

In [62]:
# Your code here!
# adding in the yr column
stops_w_yr <- stops %>%
    mutate(yr = year(date))

In [63]:
# getting the count of stops per year
count(stops_w_yr, yr)

yr,n
<dbl>,<int>
2009,110269
2010,104254
2011,99476
2012,82362
2013,74144
2014,39752
2015,85689
2016,40215


## 📝 Selecting rows with `filter`

Now that we have a `yr` column, we want to limit our data to just the stops in 2015.

Problem: We have data from 2009 to 2016. To limit to specific rows, we use `filter`.

🖥️ Usage: `filter(data, condition)`
* `data`: the data frame
* `condition`: a boolean vector where TRUE indicates the rows in `data` to keep.

For example, here's how we could limit `stops` to drivers under 30 years old:

In [60]:
stops %>%
    filter(age < 30) %>%
    head()

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation
2009-01-01,10:44:00,19TH/SANTIAGO,37.7,-122,I,29,white,female,False,,False,Equipment violation
2009-01-01,10:55:00,LA SALLE @ NEWCOMB,37.7,-122,C,26,black,male,False,False,True,Moving violation
2009-01-01,01:10:00,CORDELIA AL & BROADWAY,37.8,-122,A,19,black,male,False,,False,Moving violation


### 🚀 Exercise

1. Use `filter()` to filter the `stops` data to just 2015. Assign the result to a variable called `stops_2015`.

2. In the previous exercise, we saw that there were a lot fewer stops in 2014 than expected. Figure out why.

3. For practice, filter to stops occurring in 2013 or 2014 among female drivers less than 30 years old or more than 60 years old.
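
Before tackling the exercise, it may help to see how several conditions can be combined in one `filter()` call. The sketch below uses a made-up `toy` tibble with only two columns, so it is a hint about syntax rather than a solution.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
    yr = c(2012, 2013, 2014, 2014, 2015),
    age = c(25, 70, 40, 18, 65)
)

toy_subset <- toy %>%
    filter(
        # Conditions separated by commas must ALL be true (logical AND)
        yr %in% c(2013, 2014),   # `%in%` checks membership in a set
        age < 21 | age > 60      # `|` means OR
    )

toy_subset
```

Only rows that satisfy every comma-separated condition are kept, so the OR condition needs to stay inside a single argument.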

In [64]:
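# 1.
# Careful: `yr <= 2015` keeps every stop from 2009 through 2015.
# To keep only stops from 2015, the condition would be `yr == 2015`.
# The cells below that subset `stops_2015` by `yr` rely on this wider range.
stops_2015 <- stops_w_yr %>%
    filter(yr <= 2015)
head(stops_2015)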


date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation,2009
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation,2009
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation,2009
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation,2009
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation,2009
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation,2009


In [74]:
# 2.
# data from entire data set
count(stops_w_yr, gender)

gender,n
<fct>,<int>
male,448773
female,187388


In [82]:
summary(stops_2015[stops_2015$yr==2014,]) # we see that data is only collected until 5-30

      date                 time                   location        
 Min.   :2014-01-01   Min.   :00:00:00.000000   Length:39752      
 1st Qu.:2014-02-08   1st Qu.:10:09:00.000000   Class :character  
 Median :2014-03-15   Median :16:00:00.000000   Mode  :character  
 Mean   :2014-03-15   Mean   :14:37:52.078889                     
 3rd Qu.:2014-04-20   3rd Qu.:20:00:00.000000                     
 Max.   :2014-05-30   Max.   :23:59:00.000000                     
                                                                  
      lat            lng         district              age      
 Min.   :36.6   Min.   :-123   Length:39752       Min.   :13.0  
 1st Qu.:37.7   1st Qu.:-122   Class :character   1st Qu.:27.0  
 Median :37.8   Median :-122   Mode  :character   Median :36.0  
 Mean   :37.8   Mean   :-122                      Mean   :39.1  
 3rd Qu.:37.8   3rd Qu.:-122                      3rd Qu.:49.0  
 Max.   :38.8   Max.   :-121                      Max.   :99.0  
 NA's   :

In [81]:
summary(stops_2015[stops_2015$yr==2015,]) # compared to for example 2015 where data is collected until 12-31

      date                 time                   location        
 Min.   :2015-01-01   Min.   :00:00:00.000000   Length:85689      
 1st Qu.:2015-03-22   1st Qu.:09:21:00.000000   Class :character  
 Median :2015-06-22   Median :15:15:00.000000   Mode  :character  
 Mean   :2015-06-22   Mean   :14:15:34.952911                     
 3rd Qu.:2015-09-21   3rd Qu.:20:01:00.000000                     
 Max.   :2015-12-31   Max.   :23:59:00.000000                     
                                                                  
      lat            lng         district              age      
 Min.   :36.7   Min.   :-123   Length:85689       Min.   :10.0  
 1st Qu.:37.7   1st Qu.:-122   Class :character   1st Qu.:27.0  
 Median :37.8   Median :-122   Mode  :character   Median :36.0  
 Mean   :37.8   Mean   :-122                      Mean   :38.8  
 3rd Qu.:37.8   3rd Qu.:-122                      3rd Qu.:49.0  
 Max.   :39.7   Max.   :-120                      Max.   :99.0  
 NA's   :

## 📝 Aggregating data with `summarize()`

What was the average, median, maximum, and minimum age of drivers in 2015?

Problem: We want to aggregate the values in a column. To do this, we use `summarize()`.

In [83]:
# Old method.
mean(stops_2015$age)
median(stops_2015$age)
max(stops_2015$age)
min(stops_2015$age)

# New method!
stops_2015 %>%
    summarize(
        mean_age = mean(age),
        median_age = median(age),
        max_age = max(age),
        min_age = min(age)
    )

mean_age,median_age,max_age,min_age
<dbl>,<int>,<int>,<int>
,,,


😱 Uh oh. By default, `R` will return `NA` for aggregating functions if at least one element is `NA` (i.e., missing).

> The `na.rm=TRUE` argument will remove (`rm`) all `NA` values.

In [84]:
mean(c(1, 2, 3, 4, NA))
mean(c(1, 2, 3, 4, NA), na.rm=TRUE)

🔄 Let's try things one more time:

In [85]:
# Old method.
mean(stops_2015$age, na.rm=TRUE)
median(stops_2015$age, na.rm=TRUE)
max(stops_2015$age, na.rm=TRUE)
min(stops_2015$age, na.rm=TRUE)

# New method!
stops_2015 %>%
    summarize(
        mean_age = mean(age, na.rm=TRUE),
        median_age = median(age, na.rm=TRUE),
        max_age = max(age, na.rm=TRUE),
        min_age = min(age, na.rm=TRUE)
    )

mean_age,median_age,max_age,min_age
<dbl>,<int>,<int>,<int>
38.1,35,100,10


Neat! But, it's not groundbreaking. `summarize()` really ☀️ shines ☀️ when used with `group_by()`.

## 📝 Getting powerful with `group_by()` and `summarize()`

Here's where things get really interesting. The techniques in this section account for a **huge** chunk of most data science workflows.

Suppose I'm interested in the average age of drivers in each district.

> `unique(v)` returns the set of unique values in a vector `v`

> `sort(v)` sorts a vector `v` in numeric or alphabetical order.

In [86]:
sort(unique(stops_2015$district))

# Alternatively
stops_2015$district %>% unique() %>% sort()

You already have the tools to find the average age of drivers by district!

Looks a little scary though...

In [87]:
stops_2015 %>% filter(district=='A') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='B') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='C') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='D') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='E') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='F') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='G') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='H') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='I') %>% pull(age) %>% mean(na.rm=TRUE)
stops_2015 %>% filter(district=='J') %>% pull(age) %>% mean(na.rm=TRUE)

# 😓

We now know the average age in each district, but there are some issues:
- We had to write a lot of repeated code.
- What if there were 100 districts? Or 1,000,000 districts?
- The results aren't labeled. We'd have to write even more code to label the output.

Here's another way to answer the question, but with less code:

In [88]:
stops_2015 %>%
    group_by(district) %>%
    summarize(avg_age = mean(age, na.rm=TRUE))

district,avg_age
<chr>,<dbl>
A,38.5
B,37.5
C,36.9
D,37.1
E,38.1
F,38.5
G,40.3
H,37.8
I,38.5
J,37.8


# 😮

The next section will explain the magic of grouping.

### 📝 The mechanics of `group_by()`

It's **very** common to calculate an aggregate statistic (e.g., `sum` or `mean`) for different groups (e.g., district or class year).

The *split-apply-combine* paradigm handles these situations:
- **Split** the data by group into mini-datasets
- **Apply** a function to each mini-dataset
- **Combine** the mini-datasets back together
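
To see the three steps one at a time, here is a rough sketch in base `R` on a made-up `toy` data frame. This is only an illustration of the paradigm; `group_by()` and `summarize()` handle these steps for us and may work differently under the hood.

```r
# Toy data (hypothetical values)
toy <- data.frame(
    district = c("A", "A", "B", "B", "B"),
    age = c(20, 30, 40, 50, 60)
)

# Split: one vector of ages per district
ages_by_district <- split(toy$age, toy$district)

# Apply: compute the mean of each mini-dataset
mean_by_district <- lapply(ages_by_district, mean)

# Combine: stack the per-district results into one data frame
combined <- data.frame(
    district = names(mean_by_district),
    avg_age = unlist(mean_by_district, use.names = FALSE)
)

combined
```
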

🖼️ A visual:

<img src="https://github.com/joshuagrossman/dsb-win-2023/blob/main/opp-munging-plotting/img/split-apply-combine.drawio.png?raw=1" alt="splitapplycombine" width="600" align="left"/>

#### 📝 Splitting with `group_by`

`group_by` handles the *splitting* step.

Problem: The data isn't grouped. To split the data, we use `group_by`.

🖥️ Usage: `group_by(data, column)`
* `data`: the data frame
* `column`: the name of the column to group by.

Let's try grouping the `stops_2015` data by district.

In [89]:
stops_2015_grouped = stops_2015 %>%
    group_by(district)

head(stops_2015_grouped)

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation,2009
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation,2009
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation,2009
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation,2009
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation,2009
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation,2009


Wait a second. This looks exactly the same as the regular data:

In [90]:
head(stops_2015)

date,time,location,lat,lng,district,age,race,gender,arrested,contraband_found,searched,reason_for_stop,yr
<date>,<time>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<fct>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>
2009-01-01,10:10:00,1736 PALOU,37.7,-122,C,22,black,female,False,False,True,Equipment violation,2009
2009-01-01,10:15:00,THRIFT/PLYMTH,37.7,-122,I,44,black,male,False,,False,Moving violation,2009
2009-01-01,10:20:00,FLORIDA/19TH,37.8,-122,D,45,white,female,False,,False,Equipment violation,2009
2009-01-01,10:20:00,19TH AVE/MORAGAE,37.8,-122,I,27,white,male,False,,False,Equipment violation,2009
2009-01-01,10:36:00,19TH/LINCOLN,37.8,-122,I,29,white,male,False,,False,Equipment violation,2009
2009-01-01,10:40:00,16TH&JULIAN,37.8,-122,D,34,black,male,False,,False,Moving violation,2009


❗❗Important note❗❗: `group_by` doesn't actually change the underlying data. It invisibly groups the data in the background.

> There is a subtle indication that the data is grouped. If you look at the top of the grouped data frame, you'll see `A grouped_df`. At the top of the ungrouped data, you'll see `A tibble`.

> A *tibble* is a data frame with some extra features.
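
If the printed header is easy to miss, we can also ask `R` directly whether a data frame is grouped. The sketch below uses `is_grouped_df()`, `group_vars()`, and `ungroup()` from `dplyr` on a made-up `toy` tibble.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(district = c("A", "A", "B"), age = c(20, 30, 40))
toy_grouped <- toy %>% group_by(district)

is_grouped_df(toy)           # FALSE: no grouping yet
is_grouped_df(toy_grouped)   # TRUE: grouped in the background
group_vars(toy_grouped)      # "district"

# The values themselves are untouched by grouping
identical(toy$age, toy_grouped$age)

# ungroup() removes the grouping again
toy_ungrouped <- toy_grouped %>% ungroup()
is_grouped_df(toy_ungrouped)
```
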

#### 📝 Applying and combining with `summarize()`

`summarize()` *applies* an aggregating function to each mini-dataset. It then *combines* the mini-datasets.

We've already seen `summarize()` in action:

In [93]:
stops_2015 %>%
    summarize(
        avg_age = mean(age, na.rm=TRUE),
        n_rows = n()
    )

avg_age,n_rows
<dbl>,<int>
38.1,595946


Let's try `summarize()` with grouped data.

> As a bonus, we can also calculate the size of each group with the `n()` function.

In [94]:
stops_2015 %>%
    group_by(district) %>%
    summarize(
        avg_age = mean(age, na.rm=TRUE),
        num_stops_in_district = n()
    )

district,avg_age,num_stops_in_district
<chr>,<dbl>,<int>
A,38.5,51808
B,37.5,74705
C,36.9,70332
D,37.1,62082
E,38.1,46063
F,38.5,40231
G,40.3,62611
H,37.8,81614
I,38.5,74811
J,37.8,31689


That's all there is to it!

### 🚀 Exercise

1. Use `group_by()` and `summarize()` to calculate, by district, (1) the number of stops, (2) the proportion of stops that resulted in a search, and (3) the proportion of **searches** (not stops) that resulted in an arrest. What can you conclude from the results?

2. Redo part 1, but group by race instead of district. What do you conclude from the result?

3. Redo part 1, but group by district **and** race. What is your interpretation of the results?

In [95]:
# Your code here!
# (1) number of stops by district
stops_2015 %>%
  group_by(district) %>%
    summarize(
      num_stops_in_dist = n()
    )

district,num_stops_in_dist
<chr>,<int>
A,51808
B,74705
C,70332
D,62082
E,46063
F,40231
G,62611
H,81614
I,74811
J,31689


In [99]:
# (2) number of stops by district, including the proportion that resulted in a search
stops_2015 %>%
  group_by(district) %>%
    summarize(
      num_stops_in_dist = n(),
      num_searches = sum(searched, na.rm = TRUE),
      prop_search = num_searches / num_stops_in_dist
    )


district,num_stops_in_dist,num_searches,prop_search
<chr>,<int>,<int>,<dbl>
A,51808,2106,0.0407
B,74705,4079,0.0546
C,70332,7939,0.1129
D,62082,4417,0.0711
E,46063,3088,0.067
F,40231,1212,0.0301
G,62611,1353,0.0216
H,81614,5143,0.063
I,74811,2211,0.0296
J,31689,2787,0.0879


In [100]:
# final after adding all three conditions to the table
stops_2015 %>%
  group_by(district) %>%
    summarize(
      num_stops_in_dist = n(),
      num_searches = sum(searched, na.rm = TRUE),
      prop_search_from_stop = num_searches /num_stops_in_dist,
      num_arrests_search_arrest = sum(searched & arrested, na.rm = TRUE),
      prop_search_arrest = num_arrests_search_arrest / num_searches
    )

district,num_stops_in_dist,num_searches,prop_search_from_stop,num_arrests_search_arrest,prop_search_arrest
<chr>,<int>,<int>,<dbl>,<int>,<dbl>
A,51808,2106,0.0407,301,0.143
B,74705,4079,0.0546,458,0.112
C,70332,7939,0.1129,667,0.084
D,62082,4417,0.0711,811,0.184
E,46063,3088,0.067,359,0.116
F,40231,1212,0.0301,177,0.146
G,62611,1353,0.0216,173,0.128
H,81614,5143,0.063,499,0.097
I,74811,2211,0.0296,322,0.146
J,31689,2787,0.0879,295,0.106
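
The table above computes each proportion as one count divided by another. A common shortcut is to take the `mean()` of a logical column, since `TRUE` counts as 1 and `FALSE` as 0. The sketch below shows this on a made-up `toy` tibble; the column names mirror the real data, but the values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(
    district = c("A", "A", "A", "A", "B", "B"),
    searched = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
    arrested = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

rates <- toy %>%
    group_by(district) %>%
    summarize(
        num_stops = n(),
        # The mean of a logical column is the proportion of TRUE values
        prop_searched = mean(searched, na.rm = TRUE),
        # Subsetting with [searched] limits the calculation to searched stops
        prop_arrested_given_search = mean(arrested[searched], na.rm = TRUE)
    )

rates
```

One caveat: with `na.rm=TRUE`, `mean()` drops missing values from the denominator, while dividing by `n()` keeps them. The two approaches can therefore give slightly different answers when a column contains `NA`s.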


In [101]:
# 2. grouping by race
stops_2015 %>%
  group_by(race) %>%
    summarize(
      num_stops_in_dist = n(),
      num_searches = sum(searched, na.rm = TRUE),
      prop_search_from_stop = num_searches /num_stops_in_dist,
      num_arrests_search_arrest = sum(searched & arrested, na.rm = TRUE),
      prop_search_arrest = num_arrests_search_arrest / num_searches
    )

race,num_stops_in_dist,num_searches,prop_search_from_stop,num_arrests_search_arrest,prop_search_arrest
<chr>,<int>,<int>,<dbl>,<int>,<dbl>
asian/pacific islander,104089,1803,0.0173,284,0.1575
black,99008,15892,0.1605,1549,0.0975
hispanic,76765,7049,0.0918,775,0.1099
other,70711,2326,0.0329,293,0.126
white,245373,7265,0.0296,1161,0.1598


In [103]:
# 3. grouping by district and race
stops_2015 %>%
  group_by(district, race) %>%
    summarize(
      num_stops_in_dist = n(),
      num_searches = sum(searched, na.rm = TRUE),
      prop_search_from_stop = num_searches /num_stops_in_dist,
      num_arrests_search_arrest = sum(searched & arrested, na.rm = TRUE),
      prop_search_arrest = num_arrests_search_arrest / num_searches
    )

[1m[22m`summarise()` has grouped output by 'district'. You can override using the
`.groups` argument.


district,race,num_stops_in_dist,num_searches,prop_search_from_stop,num_arrests_search_arrest,prop_search_arrest
<chr>,<chr>,<int>,<int>,<dbl>,<int>,<dbl>
A,asian/pacific islander,8947,172,0.0192,30,0.1744
A,black,5396,620,0.1149,72,0.1161
A,hispanic,4228,373,0.0882,36,0.0965
A,other,10743,279,0.026,45,0.1613
A,white,22494,662,0.0294,118,0.1782
B,asian/pacific islander,9851,196,0.0199,25,0.1276
B,black,11042,1732,0.1569,182,0.1051
B,hispanic,8507,761,0.0895,56,0.0736
B,other,10452,330,0.0316,38,0.1152
B,white,34853,1060,0.0304,157,0.1481
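
The message printed before the table says that, after grouping by two columns, the result is still grouped by `district`. If we'd rather get back an ungrouped data frame (and silence the message), one option is the `.groups` argument of `summarize()`. The sketch below uses a made-up `toy` tibble with placeholder group labels.

```r
library(dplyr)

# Toy data (hypothetical values; "group_1" and "group_2" are placeholder labels)
toy <- tibble(
    district = c("A", "A", "B", "B"),
    race = c("group_1", "group_2", "group_1", "group_1")
)

by_both <- toy %>%
    group_by(district, race) %>%
    # .groups = "drop" returns an ungrouped result
    summarize(num_stops = n(), .groups = "drop")

by_both
is_grouped_df(by_both)
```

Dropping the groups is often a safe default, since leftover grouping can quietly change how later steps behave.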


## Concluding remarks

The method used in the final exercise is called an **outcome test**. Someone actually won a Nobel Prize for this kind of work!

Here's what we'll do in the rest of the tutorial:
- Use 📊plots📈 to reduce the cognitive burden of reading long tables.
- Learn how to combine data from multiple sources
- Dig deeper into our results. Can we say anything about racial/ethnic discrimination based on our results? What additional tests can we conduct? How can we clearly present our findings?
