**A simple API**

The Census Bureau makes its data available through an API. The American Community Survey (ACS) is a popular data product that took over from the census long form. You can see the data sets they make available, and the questions asked for, say, the ACS, [here](https://www.census.gov/data/developers/data-sets.html). You will need an API key to interact with the data. You can get one [here](https://api.census.gov/data/key_signup.html).

Now, we can call this API from R and pull the data into a table like we're used to. Next time, we'll make maps from the data! But first, we look at one package that exposes the Census API in R. It is called `tidycensus`.

In [None]:
install.packages("tidycensus")

In [None]:
library(tidycensus)
library(tidyverse)

Use the link above and get your own API key. Put the string of characters below.

In [None]:
census_api_key("")

And let's look at some questions. 

In [None]:
pop <- get_acs(geography = "tract", variables = "B00001_001E",state = "NY", county = "New York")

In [None]:
head(pop)

**Strings**

Today we are going to look at text, or "strings of characters," and the ways we might extract data from them. We will move from simple matching to a more elaborate pattern language. Our text or document will again be the [commutations web page](https://www.justice.gov/pardon/obama-commutations). We saw yesterday how we work with text coming from an HTML page, finding the different pieces using hints provided by the structure of the page. SelectorGadget proved helpful, as did the temperamental `rvest`. 

We'll start by reviewing a popular library for working with strings of characters called `stringr`. Here is a [short vignette about stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html).

**Installing a package**

We will start by installing the package. This brings new code to your computer from [CRAN, the Comprehensive R Archive Network.](https://cran.r-project.org) You only have to do this once, or rather, any time improvements to the code are published -- improvements that you want to take advantage of. Packages are R's way of inviting community involvement. People from different universities, research groups, companies, the general public... contribute new code to extend the reach of R to new data types or to introduce new ways of analyzing or visualizing data.

Installing a package is like installing an app in the sense that new software will be downloaded to your computer. As with an app, new versions appear as the authors continue to refine and update their software. The command `old.packages()` tells you which packages on your computer have newer versions on CRAN, and `update.packages()` can be used to install the updates. Note that you might not want to automatically update a package when a new version appears. Sometimes the changes can be significant and you might not be ready to learn something new. 

Let's install the package `stringr` (if you did this in the last class, there's no need to do it again). I've noticed in class that some people had to restart their kernels (go to the "Kernel" tab in the notebook menu and select "Restart") after they installed the package. 

(Oh and **"kernel"** is a general term for a computing service. Here we are "shift-enter" sending commands to R that computes something for us and returns a result. We refer to this as an R kernel. Restarting it kills the program and starts it up again.)
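If you are not sure whether you installed the package last time, one option is to check before installing. This is only a sketch: the `repos` URL is an assumption (any CRAN mirror will do), and `requireNamespace()` simply reports whether the package can be found.

```r
# Install stringr only if it is not already available on this machine.
# The repos URL is an assumed CRAN mirror; substitute your preferred one.
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr", repos = "https://cran.r-project.org")
}

# TRUE once the package is available
have_stringr <- requireNamespace("stringr", quietly = TRUE)
have_stringr
```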

OK onto `stringr`...

In [None]:
install.packages('stringr', repos='http://cran.us.r-project.org')

Once a package is installed, we can load it with the `library()` command. This lets us take all the great work offered by the package author and use it. While you only have to install a package once, you have to call the `library()` command in every session you want to use its functionality. 

Here we load up `stringr`.

In [None]:
library(stringr)

**Simple string manipulations**

As we have seen beginning with our first morning, we often transform character strings in various ways to create a regular data set. It could require extracting substrings, concatenating strings together, or trimming off meaningless characters. The `stringr` library is good for this, providing a single naming convention for all its functions. We take you through some of what you can do.

First, in R, we create a string of characters by surrounding them with double or single quotes. Here we store two strings giving one the name `disdate` and the other the name `sent` (for district/date and sentence).

In [None]:
disdate = "District of South Carolina; March 29, 2004"
sent = "Life imprisonment; 10 years' supervised release"

Now, let's create a new string by concatenating several together. Here we concatenate `disdate` and `sent` to make one big string. Call it `several` and then print it out. Notice that `str_c()` glues the strings together with nothing in between.

In [None]:
several <- str_c(disdate,sent)
several

Look at the number of characters, or the "length" of each string...

In [None]:
str_length(several)

Often we want to pull out fixed substrings. With `str_sub()` you specify the start and end positions, and either can be negative, meaning we count backwards from the last character in the string.

The first 10 characters...

In [None]:
str_sub(disdate,1,10)

... the fifth character to the end...

In [None]:
str_sub(disdate,5)

... the fifth character from the right to the end...

In [None]:
str_sub(disdate,-5)

... and finally, character 12 up until the third from the right.

In [None]:
str_sub(disdate,12,-3)

Less often used, but important, you duplicate a string with, well, `str_dup()`.

In [None]:
str_dup("Term",10) 

We can remove so-called "white space" at the start or end of the string with `str_trim()` -- this is particularly useful when you are processing data "scraped" from a web site as HTML ignores blank spaces at the beginning or end of the text in a tag.

In [None]:
str_trim(" Term     ")

**Regular expressions**

The package `stringr` also includes some pretty elegant functions for defining and extracting patterns of characters in strings. They all use a mini-language called "regular expressions" to specify the patterns.

If you look over the sentences given out on the [commutations web page](https://www.justice.gov/pardon/obama-commutations), you'll see that some include fines. Looking over a few of them (search for "fine" on the page), you see that they have a common form. They are a dollar sign \$ followed by a series of digits that might include a comma. 

Regular expressions are a way of specifying patterns in text. I uploaded a document to our GitHub site that describes this mini language for patterns and we will go over it in a few minutes. For the moment, it's enough to know that regular expressions have syntax to specify matching a dollar sign, then any digit from 0 to 9, or a comma, (the so called "character class" defined by [0-9,]) that occurs one or more times (that's a +). 

A regular expression is just a string that specifies a pattern like this. In the box below we create a vector of four sentences, two with fines and two without. We also specify the regular expression string that defines a fine as we did in the paragraph above. We store the pattern in the string variable `fine`.

In [None]:
sents <- c("262 months' imprisonment; seven years' supervised release; $500 fine", 
          "262 months' imprisonment; five years' supervised release", 
          "180 months' imprisonment; five years' supervised release", 
          "Life imprisonment; five years' supervised release; $250 fine"
         )

fine <- "\\$[0-9,]+"

Given a pattern and vector of strings, we start by simply testing which strings contain the pattern. Think of this as a kind of search engine query. The function `str_detect()` returns a boolean vector of `TRUE`s and `FALSE`s, indicating whether or not a string contains a pattern. This will be handy in `filter()`, for example. 

The function `str_subset()` returns just the strings that have the pattern. 

In [None]:
str_detect(sents,fine)

So the first sentence has a fine, the second and third do not, and the fourth does -- TRUE, FALSE, FALSE, TRUE. Here are just the first and fourth strings with the pattern.

In [None]:
str_subset(sents,fine)

Again, the first and fourth strings refer to a fine and are correctly detected (whew). We can also determine the character number in each string where the pattern first appears and where the pattern ends. 

In [None]:
str_locate(sents,fine)

So, look at the first sentence. 

>262 months' imprisonment; seven years' supervised release; $500 fine

Count from the first character of the string, the 2. Count over 60 characters (I know, I know). So the "2" is the first, the "6" is the second, the "m" in "months'" is the fifth and the dollar sign is the 60th. The last zero in "500" is the 63rd. Do the same for the last sentence and make sure that everything lines up.

Now, if we know where the pattern starts and stops, we can extract just that data with a substring command. There is a built-in function for this called `str_extract()`.

In [None]:
str_extract(sents,fine)

Notice you get an NA or missing value when the pattern does not exist in the string. Hence sentences two and three have NA values. 
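Once the fines are extracted they are still strings. A minimal sketch of turning them into numbers follows; the third sentence (with a comma in the amount) is a made-up example, and the cleanup step with `str_remove_all()` is one possible approach, not the only one.

```r
library(stringr)

# Two sentences from above plus one hypothetical sentence with a comma in the fine
fine_sents <- c("262 months' imprisonment; seven years' supervised release; $500 fine",
                "262 months' imprisonment; five years' supervised release",
                "Life imprisonment; five years' supervised release; $1,250 fine")

fine <- "\\$[0-9,]+"

# Pull out the dollar amounts (NA where there is no match)
amounts <- str_extract(fine_sents, fine)

# Strip the dollar sign and commas, then convert to numbers; NA stays NA
fine_dollars <- as.numeric(str_remove_all(amounts, "[$,]"))
fine_dollars
```

The missing values carry through, so sentences without a fine stay `NA` rather than becoming zero.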

Next, we can replace the identified pattern with something else. Here we replace the fine with the word "No".

In [None]:
str_replace(sents,fine,"No")

Finally, we can count the number of occurrences of a pattern. Here we just count the number of zeroes. We can use much more elaborate patterns here.

In [None]:
sents

In [None]:
str_count(sents,"0")

Now, if we change gears a little and you look over the [commutations web page](https://www.justice.gov/pardon/obama-commutations), you'll see that there are fines as well as forfeitures and other penalties that are specified in dollars. So our regular expression is too loose if we are just looking for fines. We might want to follow our dollar-number/comma pattern with the word "fine".
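Here is a sketch of what that tighter pattern might look like. The example strings are made up in the style of the page, and allowing a period inside the character class (to cover amounts with cents) is an assumption on my part.

```r
library(stringr)

# Hypothetical penalty strings in the style of the commutations page
penalties <- c("five years' supervised release; $500 fine",
               "five years' supervised release; $3,634.75 restitution",
               "Life imprisonment; $1,000 fine")

# Dollar sign, then digits/commas/periods, a space, then the literal word "fine"
fine_only <- "\\$[0-9,.]+ fine"

is_fine <- str_detect(penalties, fine_only)
is_fine

found <- str_extract(penalties, fine_only)
found
```

The restitution amount no longer matches because the text after the number is not "fine".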

(We will need `dplyr` for the next section so let's load it up.)

In [None]:
library(dplyr)

**Regular Expressions**

Before we leave this, we will run through the basics of regular expressions. Each of the expressions in the accompanying PDF I've circulated can be used "as is" with the slight caveat that you have to treat backslashes gently. In regular expression land, a backslash "escapes" a special character, making it mean itself. So `$` means the end of the line in a regular expression, and `\$` means a literal dollar sign. Because R also uses the backslash as an escape character to define "character constants" like a tab, `"\t"`, or a newline, `"\n"`, we have to force R to read the backslash as a backslash. That means you have to, well, escape it with a backslash, making the backslash mean a backslash. Oy. Hence the double backslash in our pattern for fines. 

This sounds bad but it's an easy rule to remember. In R, we just double any backslash.
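To see what the doubling does, it can help to compare how R displays a string with what it actually stores. A small base-R sketch:

```r
pattern <- "\\$[0-9,]+"

# print() shows the string the way you typed it, escapes and all
print(pattern)

# writeLines() shows the characters actually stored: a single backslash
writeLines(pattern)

# The escaped dollar sign is two characters (a backslash and a dollar sign) ...
n_dollar <- nchar("\\$")
n_dollar

# ... while "\t" is one character, a tab
n_tab <- nchar("\t")
n_tab
```

So the regular expression engine receives one backslash, which is what the PDF's version of the pattern calls for.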

For practice, we are going to look at lines that come from a series of emails from Jeb Bush's time as governor of Florida, which he later released publicly. The site was taken down sometime last year but [here is a capture from the Internet Archive](https://web.archive.org/web/20150324042711/http://jebbushemails.com/home). We are going to use sentences pulled from these emails as a test case for our work on regular expressions. First, download the data from [our GitHub site](https://github.com/cocteau/D4D/raw/master/data/jeb.csv) and put it in the same folder as your notebook or just read it directly from the site as we do below.

In [None]:
jeb <- read.csv("https://github.com/cocteau/D4D/raw/master/data/jeb.csv")
head(jeb)

In [None]:
pattern <- "Elian"
filter(jeb,str_detect(sentences,pattern))

In [None]:
pattern = "^I hope"
filter(jeb,str_detect(sentences,pattern))

Let's now walk through [the PDF on our GitHub](https://github.com/cocteau/D4D/raw/master/miscPDF/D4D_Regular_Expressions.pdf) page describing regular expressions in more detail. If you want to try something out, you can just replace the `pattern` variable below.

In [None]:
pattern = ""
filter(jeb,str_detect(sentences,pattern))

**After detour to PDF**

One small note -- the functions we have seen so far take a string that represents a regular expression. If there are options we would like to include, like making a match no matter what the "case" of the letters, we wrap the pattern in a function called `regex()` that allows for options. Here is an example.

In [None]:
pattern = regex("miami", ignore_case = TRUE)
filter(jeb,str_detect(sentences,pattern))

**Back to the Clemency Initiative**

We left our web scraping exercise looking at [Obama's commutations](https://www.justice.gov/pardon/obama-commutations). We found the data spread across a number of tables that we glued together. Each row represented an `item` and its `description`. Remember there were five rows per commutation. 

In [None]:
commutations <- read.csv("https://github.com/cocteau/D4D/raw/master/data/commutations.csv")
head(commutations,10)

Execute the code below to get a sense of what we have in each field (or just read the web page). Actually, how do these two "experiences" of the data differ?

In [None]:
sample_n(commutations,10)

Again, here are all the `item` fields.

In [None]:
count(commutations,item)

Building on what we did in the last session, we can extract all the sentences. We can use `dplyr` to select just the rows where the `item` entry is `Sentence:`. Notice how we are still using text that is formatted to be read, as a document. It's `Sentence:` (the word sentence and then a colon) and not just `Sentence`.

In [None]:
sents <- filter(commutations,item=="Sentence:")
sample_n(sents,10)

Or we can use our new regular expression skills and pull items with the pattern `Sentence`. (We can leave off the colon as the string `Sentence` doesn't appear in the other items.)

In [None]:
sents <- filter(commutations,str_detect(item,"Sentence"))
sample_n(sents,10)

We can also skim our data set for fines. Ah but notice that the pattern we specified is a little too loose. We also get "$3,634.75 restitution". We need to decide if we want to keep that field or not. 

In [None]:
pattern <- "\\$[0-9,]+"
sample_n(filter(sents,str_detect(description,pattern)),10)

Now we keep just district and date of the sentence. We store it in `dd` which we see has 1713 entries. We also see that some tables don't have District/Date -- how many are missing?

In [None]:
dd <- filter(commutations,str_detect(item, "District/Date"))
dim(dd)

In [None]:
sample_n(dd,10)

Now, we are going to use `mutate()` to add a column to `dd`. It will be formed by extracting the date portion of each `description`. We pull the date part by looking for everything after a semicolon. Looking at the dates above, this seems like a good guess. 

Oh and in this case the regular expression specifies any characters up to a semicolon, followed by a space and then any characters to the end of the string. The trailing text is put into a group, and our `dates` column replaces the whole description string with just the characters in this group.

In [None]:
dd <- mutate(dd,dates=(str_replace(description,"^.+; (.+)$","\\1")))
head(dd)
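One thing to watch with this approach: when the pattern does not match, `str_replace()` returns the string unchanged. The sketch below uses one string from earlier and one hypothetical description with no semicolon. Using `str_match()` to get an `NA` instead is an alternative I am suggesting, not what the notebook does.

```r
library(stringr)

# The second entry is a hypothetical description with no semicolon
descs <- c("District of South Carolina; March 29, 2004",
           "Northern District of Texas")

date_pattern <- "^.+; (.+)$"

# str_replace() swaps the whole match for group 1; non-matching strings pass through
replaced <- str_replace(descs, date_pattern, "\\1")
replaced

# str_match() returns a matrix: column 1 is the full match, column 2 is group 1
matched <- str_match(descs, date_pattern)[, 2]
matched
```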

Change the above code and pull out the years of the convictions and have a look at how many occurred each year.

In [None]:
# Your code here



The `dates` column is made up of strings. It is hard for us to compare them as they are. That is, we would like to do things like see if there were periods of convictions that were higher than others. Right now, we just have strings. We could pull out the years as we did above, but even working with months is awkward because displays will list the months in alphabetical order.
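To make the alphabetical-order problem concrete, here is a small base-R sketch with made-up month names. One workaround, if all you have is month names, is a factor whose levels follow the built-in `month.name` vector.

```r
# Hypothetical month names, as plain strings
months <- c("March", "October", "January", "April")

# Sorting strings gives alphabetical order, not calendar order
alphabetical <- sort(months)
alphabetical

# A factor with levels in calendar order sorts the way we expect
calendar <- as.character(sort(factor(months, levels = month.name)))
calendar
```

This helps with months alone, but it does nothing for full dates, which is where a proper date class comes in.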

A new class to the rescue! The library `lubridate` takes dates and makes them more computable, if you will. Let's see.

In [None]:
library(lubridate)

In [None]:
d1 <- mdy("March 6, 2008")
class(d1)

This object lets us do a lot more with dates than just put them on a calendar. For example...

In [None]:
d2 <- mdy("October 8, 2008")
d2-d1

... or take one date...

In [None]:
d2

... and find the date 500 days before. 

In [None]:
d2-500

This kind of calculation could be horrible if we had to count squares on a calendar. What makes this direct is the way a moment in time can be represented computationally. Not only do we have a "character" view of the date that is readable, but we also have a "numeric" value. 

How would you describe a date as a number?

In [None]:
as.numeric(d2)
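The number R reports is a count of days since January 1, 1970. A short base-R sketch of that idea (using `as.Date()` rather than `lubridate`, so it runs on its own):

```r
d <- as.Date("2008-10-08")

# The numeric value is the number of days since 1970-01-01
n <- as.numeric(d)
n

# The origin itself is day zero
origin_value <- as.numeric(as.Date("1970-01-01"))
origin_value

# Going back from the number to the date recovers the original
round_trip <- as.Date(n, origin = "1970-01-01")
round_trip == d
```

Subtracting two dates, or subtracting 500 from one, is then just arithmetic on these day counts.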

Working with dates gives us every reason why we want to compute... we can create "objects" that encapsulate operations or concepts that respond to our data handling needs. So, for example, here is the week of the year that `d2` fell in...

In [None]:
week(d2)

... or the day of the week. 

In [None]:
wday(d2)
wday(d1)
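The day of the week comes back as a number, which with `lubridate`'s default settings counts from Sunday as 1. If you would rather see names, `wday()` and `month()` accept `label = TRUE`. A small sketch, using `ymd()` here only so the example stands alone:

```r
library(lubridate)

d2 <- ymd("2008-10-08")

# A number by default (Sunday = 1 unless the week start has been changed)
day_number <- wday(d2)
day_number

# An ordered factor with the day's name; the text depends on your locale
wday(d2, label = TRUE, abbr = FALSE)

# The same idea for months
month(d2, label = TRUE)
```

Because the labels are ordered factors, plots and tables list them in calendar order rather than alphabetically.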

There is a great web page and cheat sheet for `lubridate` at [this web site](https://lubridate.tidyverse.org/). Oh and I think I pointed you to [this one](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf) on the same site for `dplyr`.

Now, let's create the column of dates in `dd` not as characters but as date objects.

In [None]:
dd <- mutate(dd,dates=(mdy(str_replace(description,"^.+; (.+)$","\\1"))))
head(dd)

And look at what failed to parse...

In [None]:
filter(dd,is.na(dates))

Now we see that the date might involve multiple dates! We'll need to clean these up manually. For the moment, let's make a histogram of the dates. 

In [None]:
library(ggplot2)

In [None]:
ggplot(dd,aes(x=dates))+geom_histogram(bins=25,color="white")

In [None]:
# Execute this first!
#
# Our new data set has a lot of columns, so we want the
# notebook to display more... 30, say.

options(repr.matrix.max.cols=30)

**New from old**

We have been studying [the web page of Obama's commutations](https://www.justice.gov/pardon/obama-commutations). Our ultimate goal is to create a data frame that would let us operate on the data more conveniently. So how do we take the free text of [the commutations web page](https://www.justice.gov/pardon/obama-commutations) and systematically fill in a more structured data set? As we go through this process, it's a good idea to also consult the [Clemency Initiative](https://www.justice.gov/pardon/clemency-initiative) for federal inmates, its goals and what it accomplished under Obama. 

Here are some post-mortems of the project.

> http://thehill.com/homenews/administration/315107-obama-issues-final-round-of-sentence-commutations
<br><br>
https://www.washingtonpost.com/world/national-security/obama-grants-final-330-commutations-to-nonviolent-drug-offenders/2017/01/19/41506468-de5d-11e6-918c-99ede3c8cafa_story.html?utm_term=.5c57a8437c9e
<br><br>
https://www.justice.gov/pardon/clemency-statistics

Now, looking over the data, what columns would you like to extract? We have dates, great. But what else? 


Your ideas here



We now read in a version of the data that has been transformed from `commutations` with its `item` and `description` columns to something more friendly. Oh and since we are reading from a CSV, R will automatically take our lovely dates and read them as strings. If you download `newcomms2.csv` from our GitHub site and open it in a spreadsheet, you'll see that the date_1 column has entries like "2008-03-05", for example.

In [None]:
newcommutations = read.csv("https://github.com/cocteau/D4D/raw/master/data/newcomms2.csv",as.is=TRUE)

In [None]:
sample_n(newcommutations,10)

To turn the dates into date objects we use `ymd()` from `lubridate` (year-month-day, to match the format of our `date_1` etc. columns). 

In [None]:
newcommutations = mutate(newcommutations,date=ymd(date))
newcommutations = mutate(newcommutations,date_2=ymd(date_2))
newcommutations = mutate(newcommutations,date_3=ymd(date_3))

Above, we saw how we might use regular expressions to extract facts about the free text elements of our web page. Let's look at the states where people were convicted. This means we need a way to look at the `district_date` field for each inmate and extract the state name. 

The easiest regular expression is just a set of literals. So the expression "New Jersey" is asking us to match the character string "New Jersey" exactly - so an "N" then an "e" then a "w" and so on. 

As we have done many times before by now, we could use the `stringr` command `str_detect()` to create a boolean (TRUE/FALSE) vector that is TRUE if "New Jersey" is in the "district_date" field and FALSE otherwise. The command `filter()` can then be used with this boolean vector to select just the rows with "New Jersey" in the `district_date` field.

Here are just the commutations from New Jersey.

In [None]:
filter(newcommutations,str_detect(district_date,"New Jersey"))

If we want commutations from either New Jersey or Massachusetts, we could join the two words in our regular expression into a pair of alternatives, separating them with a vertical bar "|". The expression is "New Jersey|Massachusetts". 

In [None]:
filter(newcommutations,str_detect(district_date,"New Jersey|Massachusetts"))

The expression "New Jersey|Massachusetts" can be elaborated a little. We probably don't need this for state names, but if we were matching colors like "red" and "green" in text, we have to be careful because "green" would match "greenwich" and "red" would match "hundred". So if we want to match whole words, we could surround them with a "character class" that represents a word boundary. That's "\\b". We mentioned that there are also special character classes like "\\w" representing any word character. This would include "a-zA-Z" for example.

So, instead of "New Jersey|Massachusetts", we might use "\\bNew Jersey\\b|\\bMassachusetts\\b" which suggests looking for word boundaries around "New Jersey" or "Massachusetts". You can see the difference by looking at [regexper.com](https://regexper.com/#%5CbNew%20Jersey%5Cb%7C%5CbMassachusetts%5Cb). 
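The color example can be tried directly. A minimal sketch with a made-up vector of words (remember that in R each backslash in the pattern is doubled):

```r
library(stringr)

words <- c("red", "hundred", "green", "greenwich")

# Without word boundaries, every string contains "red" or "green"
loose <- str_detect(words, "red|green")
loose

# With word boundaries, only the standalone words match
strict <- str_detect(words, "\\bred\\b|\\bgreen\\b")
strict
```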

One last thing, because R uses the backslash for an escape character, we have to double all backslashes. So whatever works in regexper.com, we need to double the backslashes for entering them into R. Here's the "or" for our two states.

In [None]:
filter(newcommutations,str_detect(district_date,"\\bNew Jersey\\b|\\bMassachusetts\\b"))

OK, so if we want to look for any state, we don't want to have to keep typing state names. For this purpose, we can use a built-in data set in R called `state.name`. It is, well, a vector of strings, the names of the 50 states. 

We load this and a number of other data sets about the 50 states using the `data()` command. It is similar to `library()` but for the data sets that come with R. You can get a complete list using the command without any arguments. (To close the window list, click the "x" in the upper right corner below.)

In [None]:
data()

This is a pretty esoteric collection of data. It represents a mix of data that are important for data analysis as well for statistics education. There are "classic" data sets that instructors use in classes, and the most popular ones drift into languages like R.

Let's just look at the data on the states. While we have data sets like `state.abb` for abbreviations, we also have state names.

In [None]:
data(state)
state.name

In [None]:
state.abb

With a little work, we'd also see that there are some districts that aren't strictly in states. There's Puerto Rico, Guam and DC, for example. We can add these to our states.

In [None]:
state.name = c(state.name,"Puerto Rico","Guam","District of Columbia","U.S. Army Court Martial")

To be safe, we are going to make all our matches case insensitive for the state names. We then need to craft a regular expression that is essentially a series of "or" conditions -- arkansas or alabama or alaska... We will fashion this from `state.name` using the `stringr` command `str_c()`. It takes vectors of strings and glues them together. 

Here we stick the state names between two "\\\b" character classes and separate each state with a vertical bar. Again, the "\\\b" means a word boundary (like a space or some punctuation) and the vertical bar means "or". 
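Before running this on all the state names, it may help to see what `str_c()` does with a short vector. The three names and the sample district string below are just for illustration.

```r
library(stringr)

few <- c("Ohio", "New Jersey", "Utah")

# Without collapse, we get one wrapped string per name
wrapped <- str_c("\\b", few, "\\b")
wrapped

# With collapse = "|", the pieces are joined into a single "or" pattern
small_reg <- str_c("\\b", few, "\\b", collapse = "|")
writeLines(small_reg)

# Try it on a made-up district/date string, ignoring case
hit <- str_extract("district of new jersey; May 5, 1999",
                   regex(small_reg, ignore_case = TRUE))
hit
```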

The expression below can be visualized using [regexper.com](https://regexper.com/#%5Cbalabama%5Cb%7C%5Cbalaska%5Cb%7C%5Cbarizona%5Cb%7C%5Cbarkansas%5Cb%7C%5Cbcalifornia%5Cb%7C%5Cbcolorado%5Cb%7C%5Cbconnecticut%5Cb%7C%5Cbdelaware%5Cb%7C%5Cbflorida%5Cb%7C%5Cbgeorgia%5Cb%7C%5Cbhawaii%5Cb%7C%5Cbidaho%5Cb%7C%5Cbillinois%5Cb%7C%5Cbindiana%5Cb%7C%5Cbiowa%5Cb%7C%5Cbkansas%5Cb%7C%5Cbkentucky%5Cb%7C%5Cblouisiana%5Cb%7C%5Cbmaine%5Cb%7C%5Cbmaryland%5Cb%7C%5Cbmassachusetts%5Cb%7C%5Cbmichigan%5Cb%7C%5Cbminnesota%5Cb%7C%5Cbmississippi%5Cb%7C%5Cbmissouri%5Cb%7C%5Cbmontana%5Cb%7C%5Cbnebraska%5Cb%7C%5Cbnevada%5Cb%7C%5Cbnew%20hampshire%5Cb%7C%5Cbnew%20jersey%5Cb%7C%5Cbnew%20mexico%5Cb%7C%5Cbnew%20york%5Cb%7C%5Cbnorth%20carolina%5Cb%7C%5Cbnorth%20dakota%5Cb%7C%5Cbohio%5Cb%7C%5Cboklahoma%5Cb%7C%5Cboregon%5Cb%7C%5Cbpennsylvania%5Cb%7C%5Cbrhode%20island%5Cb%7C%5Cbsouth%20carolina%5Cb%7C%5Cbsouth%20dakota%5Cb%7C%5Cbtennessee%5Cb%7C%5Cbtexas%5Cb%7C%5Cbutah%5Cb%7C%5Cbvermont%5Cb%7C%5Cbvirginia%5Cb%7C%5Cbwashington%5Cb%7C%5Cbwest%20virginia%5Cb%7C%5Cbwisconsin%5Cb%7C%5Cbwyoming%5Cb).

In [None]:
reg = str_c("\\b",state.name,"\\b",collapse="|")
print(reg)

We then pass this to another `stringr` function called `str_extract()`. It will extract the data that matches our pattern. Notice that we again take the `district_date`.

In this case, a match means we have found one of the states' names. Here are all 945 state names found in the "district_date" field. 

In [None]:
str_extract(newcommutations$district_date,regex(reg,ignore_case=TRUE))

This is how we made a new column in our data set consisting of state names (that plus correcting one spelling error, "Wisconson"). We called the variables "state," "state_2" and "state_3" corresponding to the first, second and third districts associated with an inmate's prison sentence. I've commented out these lines of code since your data set already has these columns.

Just to give you a chance to check things out, the last line of code here creates a sample of the commutations again.

In [None]:
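# Note: these lines lowercase the text first, so they assume `reg` was built from
# lowercase state names, e.g. str_c("\\b",tolower(state.name),"\\b",collapse="|")
# newcommutations <- mutate(newcommutations,state=str_extract(tolower(district_date),reg))
# newcommutations <- mutate(newcommutations,state_2=str_extract(tolower(district_date_2),reg))
# newcommutations <- mutate(newcommutations,state_3=str_extract(tolower(district_date_3),reg))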

sample_n(newcommutations,10)

In [None]:
arrange(count(newcommutations,state),desc(n))

Next let's look at sentencing. We might try looking at sections of the U.S. Code mentioned in the "offense" fields. We can search for "§", the "section sign", and create a regular expression with just this character. How many of the inmates in our data set have this character in their offense?

In [None]:
# put your code here


In addition, we can look at the words in the offense listing. Here is a test for heroin or LSD offenses. 

In [None]:
druglist <- "\\blsd\\b|\\bheroin\\b"
drugs <- filter(newcommutations,str_detect(tolower(offense),druglist))
sample_n(drugs,5)

Here is a more elaborate list of drugs. This time, we use the "!" to turn our TRUEs into FALSEs and look for offenses that don't include one of the drug names. Let's try that out and read the sentences. 

Also, have a look at the regular expression on [regexper.com](https://regexper.com/#%5Cbcrack%5Cb%7C%5Cblsd%5Cb%7C%5Cbphencyclidine%5Cb%7C%5Cbnarcotic%5Cb%7C%5Cbdrug%5Cb%7C%5Cbcontrolled%20substance%5Cb%7C%5Cbheroin%5Cb%7C%5Cbcocaine%5Cb%7C%5Cbmari%28j%7Ch%29uana%5Cb%7C%5Cbmethamphetamine%5Cb)

In [None]:
druglist <- "\\bphencyclidine\\b|\\bnarcotic\\b|\\bdrug\\b|\\bcontrolled substance\\b|\\bheroin\\b|\\bcocaine\\b|\\bmari(j|h)uana\\b|\\bmethamphetamine\\b"
nodrugs <- filter(newcommutations,!(str_detect(tolower(offense),druglist)))
nodrugs

I came up with the list of drugs by looking at samples of the "left overs" like in the "nodrugs" data frame. This list could clearly be added to. Kenneth Isaacs, for example, was convicted of distributing hydromorphone, an opioid pain medication. 

See if a drug is mentioned among the "nodrugs" offenses and add it to the regular expression above. Regenerate the "nodrugs" data frame. What else needs to be added to our regular expression?

In [None]:
# Your code here



We can now create a column that has TRUE/FALSE whether a particular drug is mentioned. Here we add references to cocaine (is the word "cocaine" sufficient to find these offenses? Fix it if not.)

In [None]:
newcommutations <- mutate(newcommutations,cocaine=str_detect(offense,regex("cocaine|crack",ignore_case=TRUE)))
sample_n(newcommutations,5)

Next, we can make a breakdown of cocaine offenses by state. 

In [None]:
table(newcommutations$state,newcommutations$cocaine)

What do you observe? Are there any states that seem to have a different pattern than the others? If so, extract this state and tell me what offenses are frequent instead.

In [None]:
# Put your code here



In talking this over with Vice, they expressed interest in so-called ["851 enhancements"](https://www.avvo.com/legal-guides/ugc/21-u-s-c-851-federal-sentence-enhancements). Here is a description of 21 U.S.C. 851.

>A person who is charged in federal court may face an enhanced sentence if he or she has previously been convicted of a felony drug offense. 21 U.S.C. 851, also known as Section 851, is a subdivision of the Controlled Substances Act, which authorizes federal prosecutors to use a defendant's prior felony drug conviction to subject the defendant to an increased sentence in a current case. Federal prosecutors can use a prior drug conviction to enhance sentences in current drug, firearm or immigration cases. 

Create a data frame called `eight51` that is a subset of `newcommutations` that contains all the offenses that have a reference to section 851. Use `arrange()` to sort it by the dates in `date_1` so that the 851 enhancements are ordered in time from oldest to newest.

In [None]:
# Put your code here



Assuming you created a data frame called `eight51`, you can use the following code to tell you about the years that people were sentenced with this enhancement. 

In [None]:
count(eight51,year(date_1))

Now, tell me which states had this 851 indicator. 

Use an expression like the one above to find out what states had inmates receive the 851 enhancement.

In [None]:
# Put your code here

