In [1]:
library(dslabs) #loading library `dslabs`
data(murders) #dataset `murders`

In [2]:
#loading this library for functions
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




Key functions: `mutate()`, `filter()`, `select()` and the pipe operator `%>%`.

To change a data table by adding a new column, or changing an existing one, we use the `mutate()` function.

To filter the data by subsetting rows, we use the `filter()` function.

To subset the data by selecting specific columns, we use the `select()` function.

We can perform a series of operations by sending the results of one function to another function using the pipe operator, `%>%`.

In [3]:
head(murders)

,state,abb,region,population,total
,<chr>,<chr>,<fct>,<dbl>,<dbl>
1,Alabama,AL,South,4779736,135
2,Alaska,AK,West,710231,19
3,Arizona,AZ,West,6392017,232
4,Arkansas,AR,South,2915918,93
5,California,CA,West,37253956,1257
6,Colorado,CO,West,5029196,65


# Selection

In [4]:
#read the full documentation for detail
help(select)

In [5]:
args(select)

Tidyverse selections implement a dialect of R where operators make it easy to select variables:

`:` for selecting a range of consecutive variables.

`!` for taking the complement of a set of variables.

`&` and `|` for selecting the intersection or the union of two sets of variables.

`c()` for combining selections.
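
The cells in this section exercise `:`, `!`, `c()` and `|`. As a minimal sketch of the remaining operator, `&`, the block below uses a two-row data frame copied from the `head(murders)` output (the name `small` is introduced here only for illustration) together with the selection helper `ends_with()`.

```r
library(dplyr)

# Two rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska"),
  abb = c("AL", "AK"),
  region = c("South", "West"),
  population = c(4779736, 710231),
  total = c(135, 19),
  stringsAsFactors = FALSE
)

# `&` keeps only the columns matched by both helpers:
# names that start with "p" and also end with "n"
both <- small %>% select(starts_with("p") & ends_with("n"))
names(both)
```

Only `population` satisfies both conditions, so `both` has a single column.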

**`select()`**: choosing specific columns

In [6]:
#select column state, region, total
select(murders, state, region, total)

state,region,total
<chr>,<fct>,<dbl>
Alabama,South,135
Alaska,West,19
Arizona,West,232
Arkansas,South,93
California,West,1257
Colorado,West,65
Connecticut,Northeast,97
Delaware,South,38
District of Columbia,South,99
Florida,South,669


In [7]:
#equivalent using pipe
murders %>% select(state, region, total)

state,region,total
<chr>,<fct>,<dbl>
Alabama,South,135
Alaska,West,19
Arizona,West,232
Arkansas,South,93
California,West,1257
Colorado,West,65
Connecticut,Northeast,97
Delaware,South,38
District of Columbia,South,99
Florida,South,669


In [8]:
#select a range of consecutive columns, from abb through population
murders %>% select(abb:population)

abb,region,population
<chr>,<fct>,<dbl>
AL,South,4779736
AK,West,710231
AZ,West,6392017
AR,South,2915918
CA,West,37253956
CO,West,5029196
CT,Northeast,3574097
DE,South,897934
DC,South,601723
FL,South,19687653


In [9]:
#select columns that are not abb, region, population
murders %>% select(!(abb:population))

state,total
<chr>,<dbl>
Alabama,135
Alaska,19
Arizona,232
Arkansas,93
California,1257
Colorado,65
Connecticut,97
Delaware,38
District of Columbia,99
Florida,669


In [10]:
#select columns that are not abb, total
murders %>% select(!c(abb, total))

state,region,population
<chr>,<fct>,<dbl>
Alabama,South,4779736
Alaska,West,710231
Arizona,West,6392017
Arkansas,South,2915918
California,West,37253956
Colorado,West,5029196
Connecticut,Northeast,3574097
Delaware,South,897934
District of Columbia,South,601723
Florida,South,19687653


In [11]:
#select columns whose names start with 'p' or 't'
murders %>% select(starts_with('p') | starts_with('t'))

population,total
<dbl>,<dbl>
4779736,135
710231,19
6392017,232
2915918,93
37253956,1257
5029196,65
3574097,97
897934,38
601723,99
19687653,669


# Mutation: Adding, modifying and removing columns

**`mutate()`**

In [12]:
#adding column rate = total / population * 100000 (murders per 100,000 people)
#adding a column rank to rank from highest rate to lowest rate

mutate(murders, rate = total / population * 100000, rank = rank(-rate))

state,abb,region,population,total,rate,rank
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Alabama,AL,South,4779736,135,2.8244238,23
Alaska,AK,West,710231,19,2.675186,27
Arizona,AZ,West,6392017,232,3.6295273,10
Arkansas,AR,South,2915918,93,3.1893901,17
California,CA,West,37253956,1257,3.3741383,14
Colorado,CO,West,5029196,65,1.2924531,38
Connecticut,CT,Northeast,3574097,97,2.7139722,25
Delaware,DE,South,897934,38,4.2319369,6
District of Columbia,DC,South,601723,99,16.4527532,1
Florida,FL,South,19687653,669,3.3980688,13


In [13]:
#equivalent using pipe
murders %>% mutate(rate = total / population * 100000, rank = rank(-rate))

state,abb,region,population,total,rate,rank
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Alabama,AL,South,4779736,135,2.8244238,23
Alaska,AK,West,710231,19,2.675186,27
Arizona,AZ,West,6392017,232,3.6295273,10
Arkansas,AR,South,2915918,93,3.1893901,17
California,CA,West,37253956,1257,3.3741383,14
Colorado,CO,West,5029196,65,1.2924531,38
Connecticut,CT,Northeast,3574097,97,2.7139722,25
Delaware,DE,South,897934,38,4.2319369,6
District of Columbia,DC,South,601723,99,16.4527532,1
Florida,FL,South,19687653,669,3.3980688,13


<hr>

**`transmute()`**

`mutate()` always returns the new variables appended to a copy of the original data. If you want to return only the new variables, use `transmute()`.

In [14]:
#note: rate here is per 10,000 people, not per 100,000 as in the mutate() examples
murders %>% transmute(rate = total / population * 10000, rank = min_rank(desc(rate)))

rate,rank
<dbl>,<int>
0.28244238,23
0.2675186,27
0.36295273,10
0.31893901,17
0.33741383,14
0.12924531,38
0.27139722,25
0.42319369,6
1.64527532,1
0.33980688,13
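
A hedged aside: dplyr 1.0.0 and later also accept a `.keep` argument in `mutate()`, and `.keep = "none"` should give the same result as `transmute()`. The sketch below assumes such a dplyr version and uses six rows copied from the `head(murders)` output (`small` is an illustrative name), so the ranks are relative to these six states only.

```r
library(dplyr)

# Six rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Keep only the newly created columns (assumes dplyr >= 1.0.0 for `.keep`)
rates <- small %>%
  mutate(rate = total / population * 10000,
         rank = min_rank(desc(rate)),
         .keep = "none")

# The same computation with transmute(), for comparison
rates_t <- small %>%
  transmute(rate = total / population * 10000,
            rank = min_rank(desc(rate)))

rates
```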


# Filter

**`filter()`**

In [15]:
#select states whose region is 'Southeast' or 'West'
#(murders has no 'Southeast' region, so only 'West' rows match)
filter(murders, region %in% c('Southeast', 'West'))

state,abb,region,population,total
<chr>,<chr>,<fct>,<dbl>,<dbl>
Alaska,AK,West,710231,19
Arizona,AZ,West,6392017,232
California,CA,West,37253956,1257
Colorado,CO,West,5029196,65
Hawaii,HI,West,1360301,7
Idaho,ID,West,1567582,12
Montana,MT,West,989415,12
Nevada,NV,West,2700551,84
New Mexico,NM,West,2059179,67
Oregon,OR,West,3831074,36
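
`filter()` also accepts several conditions at once. As a small sketch on six rows copied from the `head(murders)` output (`small` is an illustrative name), conditions separated by commas must all hold, and `!=` excludes a value.

```r
library(dplyr)

# Six rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = c("South", "West", "West", "South", "West", "West"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND
west_high <- small %>% filter(region == "West", total > 100)

# `!=` keeps the rows that do not match
not_west <- small %>% filter(region != "West")

west_high$state
not_west$state
```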


# Pipe

In [16]:
#adding column rate (per 10,000 people) and rank
#filter states having rate < 1
#then output the values of column state, rate, rank

murders %>%
    mutate(rate = total / population * 10000, rank = rank(-rate)) %>%
    filter(rate < 1) %>%
    select(state, rate, rank)

state,rate,rank
<chr>,<dbl>,<dbl>
Alabama,0.28244238,23
Alaska,0.2675186,27
Arizona,0.36295273,10
Arkansas,0.31893901,17
California,0.33741383,14
Colorado,0.12924531,38
Connecticut,0.27139722,25
Delaware,0.42319369,6
Florida,0.33980688,13
Georgia,0.37903226,9



# Offset

**`lead()`** and **`lag()`**
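
A minimal sketch of the two offset functions on a short vector (the first four `total` values from `murders`): `lag()` shifts values down by one position, `lead()` shifts them up, and the gap is filled with `NA`. The calls are written as `dplyr::lag()` and `dplyr::lead()` because dplyr masks `stats::lag()`, as the attach message at the top of the notebook shows.

```r
library(dplyr)

x <- c(135, 19, 232, 93)

# Previous value for each position (first element has no predecessor)
prev <- dplyr::lag(x)

# Next value for each position (last element has no successor)
nxt <- dplyr::lead(x)

# A common use: change relative to the previous element
change <- x - dplyr::lag(x)

prev
nxt
change
```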

# Ranking

`min_rank`, `percent_rank`, `row_number`, `dense_rank`, `cume_dist`, `ntile`
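
The ranking functions differ mainly in how they treat ties. The sketch below applies each of them to a made-up vector with one tie (the values are chosen only for illustration), so the differences can be compared side by side.

```r
library(dplyr)

# Illustrative values with a tie at 20
x <- c(10, 20, 20, 40)

r_min <- min_rank(x)        # ties share the smallest rank, gaps follow
r_dense <- dense_rank(x)    # ties share a rank, no gaps
r_row <- row_number(x)      # ties broken by position
r_pct <- percent_rank(x)    # min_rank rescaled to [0, 1]
r_cume <- cume_dist(x)      # proportion of values <= each value
r_tile <- ntile(x, 2)       # split into 2 roughly equal buckets
r_desc <- min_rank(desc(x)) # rank from largest to smallest

data.frame(x, r_min, r_dense, r_row, r_pct, r_cume, r_tile, r_desc)
```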


# Aggregate

**`summarize`**

In [17]:
#summarize() expects functions that return a single value
s <- heights %>% summarize(mean = mean(height), standard_deviation = sd(height))
s

mean,standard_deviation
<dbl>,<dbl>
68.32301,4.078617


In [18]:
s$mean

# Group by

**`group_by`**, **`ungroup`**

In [19]:
heights %>% group_by(sex) %>% summarize(mean = mean(height), standard_deviation = sd(height))

`summarise()` ungrouping output (override with `.groups` argument)



sex,mean,standard_deviation
<fct>,<dbl>,<dbl>
Female,64.93942,3.760656
Male,69.31475,3.611024
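
The cell above shows `group_by()` followed by `summarize()`; `ungroup()` is the counterpart that removes the grouping again. The sketch below uses a toy data frame whose values are made up for illustration (it is not the `heights` dataset) and assumes dplyr 1.0.0 or later for the `.groups` argument named in the message above.

```r
library(dplyr)

# Toy data, values invented for illustration only (not the heights dataset)
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male"),
  height = c(64, 66, 68, 72),
  stringsAsFactors = FALSE
)

# Grouped mutate: each row is compared with its own group's mean,
# then ungroup() so later steps operate on the whole table again
centered <- toy %>%
  group_by(sex) %>%
  mutate(group_mean = mean(height), diff = height - group_mean) %>%
  ungroup()

# Dropping the grouping explicitly in summarize() (assumes dplyr >= 1.0.0)
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(mean = mean(height), .groups = "drop")

is_grouped_df(centered)
by_sex
```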


# Counting

**`n()`**
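
A short sketch of counting, using six rows copied from the `head(murders)` output (`small` is an illustrative name): `n()` returns the number of rows in the current group inside `summarize()`, and `count()` is a shortcut for the same grouped count.

```r
library(dplyr)

# Six rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = c("South", "West", "West", "South", "West", "West"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# n() counts the rows in each group
per_region <- small %>%
  group_by(region) %>%
  summarize(n_states = n(), total_murders = sum(total))

# count() gives the same counts in a column named n
counts <- small %>% count(region)

per_region
counts
```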

# Functions

measure of location: `mean`, `median`

measure of spread: `sd`, `mad`, `IQR`

measure of rank: `min`, `quantile`, `max`

measure of position: `first`, `nth`, `last`

count: `n()`, `n_distinct()`, `count()`
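
As a rough illustration of how several of these functions combine inside one `summarize()` call, the sketch below uses six rows copied from the `head(murders)` output (`small` and the output column names are illustrative choices).

```r
library(dplyr)

# Six rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = c("South", "West", "West", "South", "West", "West"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# One summary row mixing location, spread, rank, position and count functions
summary_row <- small %>%
  summarize(
    median_total = median(total),   # location
    iqr_total = IQR(total),         # spread
    smallest = min(total),          # rank
    largest = max(total),
    first_state = first(state),     # position
    second_state = nth(state, 2),
    last_state = last(state),
    n_regions = n_distinct(region)  # count
  )

summary_row
```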


# The Dot placeholder

In [20]:
#get the values of column `height` in dataframe `heights`
heights %>% .$height
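
The dot stands for the object arriving on the left-hand side of the pipe, so `.$height` extracts a column as a plain vector. dplyr's `pull()` does the same job; the sketch below compares the two on six rows copied from the `head(murders)` output (`small` is an illustrative name).

```r
library(dplyr)

# Six rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Dot placeholder: `.` is the data frame coming through the pipe
totals_dot <- small %>% .$total

# pull() extracts the same column as a vector
totals_pull <- small %>% pull(total)

identical(totals_dot, totals_pull)
```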

In [21]:
head(murders)

,state,abb,region,population,total
,<chr>,<chr>,<fct>,<dbl>,<dbl>
1,Alabama,AL,South,4779736,135
2,Alaska,AK,West,710231,19
3,Arizona,AZ,West,6392017,232
4,Arkansas,AR,South,2915918,93
5,California,CA,West,37253956,1257
6,Colorado,CO,West,5029196,65


In [22]:
#return the average murder rate per 100,000 people in the US
murders %>% summarize(rate = sum(total) / sum(population) * 100000) %>% .$rate

# Sorting Data Tables

**`arrange()`**: sort a data frame by column(s)

In [23]:
#sort by the total number of murders, in ascending order
murders %>% arrange(total) %>% head()

,state,abb,region,population,total
,<chr>,<chr>,<fct>,<dbl>,<dbl>
1,Vermont,VT,Northeast,625741,2
2,North Dakota,ND,North Central,672591,4
3,New Hampshire,NH,Northeast,1316470,5
4,Wyoming,WY,West,563626,5
5,Hawaii,HI,West,1360301,7
6,South Dakota,SD,North Central,814180,8


**`desc()`**: transform a vector into a format that will be sorted in descending order.

In [24]:
#sort the above example descending
murders %>% arrange(desc(total)) %>% head()

,state,abb,region,population,total
,<chr>,<chr>,<fct>,<dbl>,<dbl>
1,California,CA,West,37253956,1257
2,Texas,TX,South,25145561,805
3,Florida,FL,South,19687653,669
4,New York,NY,Northeast,19378102,517
5,Pennsylvania,PA,Northeast,12702379,457
6,Michigan,MI,North Central,9883640,413


You can sort by several columns; later columns break ties in earlier ones.

In [25]:
#sort by total descending, if tie sort by population ascending
murders %>% arrange(desc(total), population) %>% head()

,state,abb,region,population,total
,<chr>,<chr>,<fct>,<dbl>,<dbl>
1,California,CA,West,37253956,1257
2,Texas,TX,South,25145561,805
3,Florida,FL,South,19687653,669
4,New York,NY,Northeast,19378102,517
5,Pennsylvania,PA,Northeast,12702379,457
6,Michigan,MI,North Central,9883640,413


**`top_n()`**: return top results ranked by a given variable

In [26]:
#top 10 states by total murders (top_n() does not sort the result)
murders %>% top_n(10, total)

state,abb,region,population,total
<chr>,<chr>,<fct>,<dbl>,<dbl>
California,CA,West,37253956,1257
Florida,FL,South,19687653,669
Georgia,GA,South,9920000,376
Illinois,IL,North Central,12830632,364
Louisiana,LA,South,4533372,351
Michigan,MI,North Central,9883640,413
Missouri,MO,North Central,5988927,321
New York,NY,Northeast,19378102,517
Pennsylvania,PA,Northeast,12702379,457
Texas,TX,South,25145561,805
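
Because `top_n()` returns rows in their original order, a sorted top list needs an extra step. The sketch below shows two options on six rows copied from the `head(murders)` output (`small` is an illustrative name): `arrange()` followed by `head()`, and `slice_max()`, which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Six rows copied from the head(murders) output; `small` is an illustrative name
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Option 1: sort descending, then keep the first three rows
top3_arranged <- small %>% arrange(desc(total)) %>% head(3)

# Option 2: slice_max() selects and orders in one step (assumes dplyr >= 1.0.0)
top3_sliced <- small %>% slice_max(total, n = 3)

top3_arranged$state
top3_sliced$state
```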
