# POLSCI 3

## Week 2, Lecture 2: Subsetting Data

In this notebook, we will cover subsetting data with the `subset()` function and then taking the mean of variables within those subsets.

## Subsetting in R with `subset()`

Sometimes when we work with large datasets, we want to take a look at a specific *subset* of that data.

For our example today, we'll use a pretty cool dataset. This dataset was gathered by <a href="https://onlinelibrary.wiley.com/doi/10.1111/ajps.12618?af=R" target="_blank">Shoub et al. (2021)</a>.

Let's start by reading in a dataset, as we learned in Lecture 1.

In [1]:
officerdata <- read.csv('ps3_fl_officers_large.csv')
head(officerdata) # head() just shows us the first six rows of a dataset. This dataset is too long to print!

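<table>
<thead>
  <tr><th></th><th>search_occur</th><th>contra</th><th>driver_age</th><th>driver_race</th><th>officer_female</th><th>officer_id</th></tr>
  <tr><th></th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;chr&gt;</th><th>&lt;chr&gt;</th><th>&lt;int&gt;</th><th>&lt;chr&gt;</th></tr>
</thead>
<tbody>
  <tr><td>1</td><td>0</td><td>0</td><td>above 60</td><td></td><td>0</td><td>4a6ccf9467</td></tr>
  <tr><td>2</td><td>0</td><td>0</td><td>under 35</td><td></td><td>0</td><td>f1bc8a7a74</td></tr>
  <tr><td>3</td><td>0</td><td>0</td><td>under 35</td><td>POC</td><td>0</td><td>07e6718e7d</td></tr>
  <tr><td>4</td><td>0</td><td>0</td><td>between 35 and 60</td><td>White</td><td>0</td><td>000f298db7</td></tr>
  <tr><td>5</td><td>0</td><td>0</td><td>between 35 and 60</td><td>White</td><td>0</td><td>4f549e4570</td></tr>
  <tr><td>6</td><td>0</td><td>0</td><td>under 35</td><td></td><td>0</td><td>99d2f19d4f</td></tr>
</tbody>
</table>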


This dataset contains real data on officers and drivers from nearly 860,000 police traffic stops. Each row represents one time that an officer stopped someone. Here is more information about the variables:

- <code>search_occur</code>: Whether or not a search was conducted by the officer at that stop (0 = no search, 1 = search)
- <code>contra</code>: Whether or not contraband (illegal items such as illegal drugs or guns) was found by the officer at that stop (0 = no contraband, 1 = contraband). *Officers can only find contraband if they conduct a search.*
- <code>driver_age</code>: Age group of driver (under 35, between 35 and 60, or above 60)
- <code>driver_race</code>: Race of driver (White = white; POC = non-white)
- <code>officer_female</code>: Officer gender (0 = male, 1 = female)

### How `subset()` works

Here's how `subset()` works:

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == accepted.value)`

This line takes `original.dataset`, keeps only the rows (observations) where `variable.in.dataset` equals `accepted.value`, and saves that subset in `name.of.new.subset.dataset`.

If the variable is a **string** (letters and words) variable, you need to wrap the accepted value in quotation marks, like this (single quotes `'` and double quotes `"` both work):

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == 'accepted.value')`

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == "accepted.value")`

### Hypothetical use of `subset()`

Suppose we have an `oskiStore` dataset:

<table>
<thead>
  <tr>
    <th>Month</th>
    <th>Sweaters</th>
    <th>Hoodies</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Jan</td>
    <td>220</td>
    <td>75</td>
  </tr>
  <tr>
    <td>Feb</td>
    <td>195</td>
    <td>90</td>
  </tr>
  <tr>
    <td>March</td>
    <td>175</td>
    <td>80</td>
  </tr>
  <tr>
    <td>April</td>
    <td>220</td>
    <td>60</td>
  </tr>
</tbody>
</table>

If we run `oski.many.sweaters <- subset(oskiStore, Sweaters == 220)`, then `oski.many.sweaters` will look like this:

<table>
<thead>
  <tr>
    <th>Month</th>
    <th>Sweaters</th>
    <th>Hoodies</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Jan</td>
    <td>220</td>
    <td>75</td>
  </tr>
  <tr>
    <td>April</td>
    <td>220</td>
    <td>60</td>
  </tr>
</tbody>
</table>

In this example, `oski.many.sweaters` is an entirely new dataset, and we can do all the same things with it that we could do with `oskiStore`.

Likewise, if we run `oskiApril <- subset(oskiStore, Month == 'April')`, then `oskiApril` will look like this:

<table>
<thead>
  <tr>
    <th>Month</th>
    <th>Sweaters</th>
    <th>Hoodies</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>April</td>
    <td>220</td>
    <td>60</td>
  </tr>
</tbody>
</table>
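
Both hypothetical subsets can be checked by running them. The sketch below rebuilds the `oskiStore` table by hand with `data.frame()`, a base R function this notebook does not otherwise cover; it is used here only to create the example data. It then applies the same two `subset()` calls.

```r
# Rebuild the hypothetical oskiStore table by hand (illustration only).
oskiStore <- data.frame(
  Month    = c("Jan", "Feb", "March", "April"),
  Sweaters = c(220, 195, 175, 220),
  Hoodies  = c(75, 90, 80, 60)
)

# Numeric variable: no quotes around the accepted value.
oski.many.sweaters <- subset(oskiStore, Sweaters == 220)
oski.many.sweaters   # keeps the Jan and April rows

# String variable: the accepted value goes in quotes.
oskiApril <- subset(oskiStore, Month == 'April')
oskiApril            # keeps only the April row
```

Note that `oskiStore` itself is unchanged; each call to `subset()` returns a new dataset.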



### Real example use of `subset()`

Let's say we want to subset `officerdata` to look specifically at traffic stops made by female officers.

*Hint: You'll need to do exactly this in Activity Notebook 2!*

In [None]:
female.officer.stops <- subset(officerdata, officer_female == 1)
head(female.officer.stops)

We can subset based on values of other variables, too.

In [None]:
search.occurred <- subset(officerdata, search_occur == 1)
head(search.occurred)

## Gathering Statistics: Using `mean()` within a subset

To gather more statistics, we can run `mean()` on a variable within a subset. This gives us the mean of that variable among only the rows in that subset.

### Mean

Let's compare the mean of `contra` between stops by male officers and stops by female officers. Since `contra` is either 0 or 1, this mean is the rate at which contraband was found.

In [None]:
female.officer.stops <- subset(officerdata, officer_female == 1)
mean(female.officer.stops$contra)

In [None]:
male.officer.stops <- subset(officerdata, officer_female == 0)
mean(male.officer.stops$contra)
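
With rows few enough to count by hand, it is easier to see what these means measure. The sketch below uses a made-up `toy.stops` data frame (five invented stops, not taken from the real dataset) and shows that the mean of a 0/1 variable within a subset equals the share of rows in that subset where the variable is 1.

```r
# Made-up data: 5 stops, NOT from the real officer dataset.
toy.stops <- data.frame(
  officer_female = c(1, 1, 0, 0, 0),
  contra         = c(1, 0, 0, 0, 1)
)

toy.female <- subset(toy.stops, officer_female == 1)
toy.male   <- subset(toy.stops, officer_female == 0)

# Mean of a 0/1 variable = share of rows equal to 1.
mean(toy.female$contra)  # 1 of 2 stops -> 0.5
mean(toy.male$contra)    # 1 of 3 stops -> 0.333...
```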

Now let's compare this rate between drivers under 35 and drivers between 35 and 60.

To do this, we need to subset on a **string** variable. We can see above that the `driver_age` variable is a string because it has letters and words.

In [None]:
drivers.under35 <- subset(officerdata, driver_age == 'under 35')
mean(drivers.under35$contra)

In [None]:
drivers.35to60 <- subset(officerdata, driver_age == 'between 35 and 60')
mean(drivers.35to60$contra)

We can also turn these into percentages when printing them by just multiplying them by 100:

In [None]:
prop.drivers.35to60.withcontra <- mean(drivers.35to60$contra)
prop.drivers.35to60.withcontra * 100
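
The whole sequence (subset on a string variable, take the mean, convert to a percentage) can be traced on a tiny example. The sketch below uses invented rows (`toy.stops` is hypothetical, not the real data). `round()` is an optional extra not used elsewhere in this notebook; it only trims the digits that get printed.

```r
# Made-up data: 5 stops, NOT from the real officer dataset.
toy.stops <- data.frame(
  driver_age = c("under 35", "under 35", "under 35",
                 "between 35 and 60", "above 60"),
  contra     = c(1, 0, 0, 0, 0)
)

# String variable, so the accepted value is quoted.
toy.under35 <- subset(toy.stops, driver_age == 'under 35')

# Proportion of these stops with contraband: 1 of 3.
prop.toy.under35.withcontra <- mean(toy.under35$contra)

# Convert the proportion to a percentage and round to one decimal place.
pct.toy.under35.withcontra <- prop.toy.under35.withcontra * 100
round(pct.toy.under35.withcontra, 1)  # 33.3
```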

## More about next class

This was *Week 2, Lecture Notebook 2*. In class, you'll work on *Week 2, Activity Notebook 2*. In that notebook, you'll use what you learned in this notebook to answer very similar problems. Because it's the first week, the notebook will not count towards your grade.

- Every "Day 2" of a week, you'll work on *Activity Notebook 2* in your groups.
- You will have 45 minutes to answer the questions, so until 5:55 pm. (You can start early at 5:00 pm if you want.)
- In these group notebooks, some of the questions will tell you whether you're getting the right answer or not as you go.
- These notebooks will also have open-ended questions. We will grade these manually later, since they can't be graded automatically.
- Finally, we'll regroup and go over the right answers as a class at the end of class.

If you need help during class:

- Check out the R Cheat Sheet in bCourses.
- Click "Ask for Help" while in your breakout room.

Final reminder: you do **NOT** need to turn in this lecture notebook. You only need to turn in in-class activities and problem sets.