Get this document and a version with empty code chunks at the template repository on GitHub: <https://github.com/VT-Hydroinformatics/2-Programming-Basics>

## Introduction

We have messed around with plotting a bit and you've seen a little of what R can do. So now let's review or introduce you to some basics. Even if you have worked in R before, it is good to be reminded of and practice with this stuff, so stay tuned in!

This exercise covers most of the same principles as two chapters in R for Data Science:

Workflow: basics (<https://r4ds.hadley.nz/workflow-basics>)

Data transformation (<https://r4ds.hadley.nz/data-transform>)

## You can use R as a calculator

If you just type numbers and operators in, R will spit out the results

In [1]:
1 + 2

## You can create new objects using \<-

Yea yea, = does the same thing. But use \<-. We will call \<- assignment or assignment operator. When we are coding in R we use \<- to assign values to objects and = to set values for parameters in functions. Using \<- helps us differentiate between the two. Norms for formatting are important because they help us understand what code is doing, especially when stuff gets complex.

Oh, one more thing: Surround operators with spaces. Don't code like a gorilla.

x \<- 1 looks better than x\<-1 and if you disagree you are wrong. :)

You can assign single numbers or entire chunks of data using \<-

So if you had an object called my_data and wanted to copy it into my_new_data you could do:

my_new_data \<- my_data

You can then recall/print the values in an object by just typing the name by itself.

In the code chunk below, assign a 3 to the object "y" and then print it out.

In [2]:
y <- 3
y

If you want to assign multiple values, you have to put them in the function c(), where c means combine. R doesn't know what to do if you just give it a bunch of values separated by spaces or commas, but if you put them as arguments in the combine function, it'll make them into a vector.

Any time you need to use several values, even passing as an argument to a function, you have to put them in c() or it won't work.

In [3]:
a <- c(1,2,3,4)
a
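Here is a minimal sketch of why c() matters when you pass several values to a function. The object names and numbers are made up for illustration. Note that leaving out c() does not always produce an error; with mean() it quietly returns the wrong answer, because only the first value is treated as the data.

```r
# Several values passed to a function should be wrapped in c()
depths_cm <- c(10, 15, 20)      # made-up example values
mean_depth <- mean(depths_cm)   # mean() receives one vector with three values
mean_depth

# Without c(), mean() treats only the first number as the data and
# interprets the others as different arguments, so the result is not
# the mean of all three values
wrong_mean <- mean(10, 15, 20)
wrong_mean
```

Comparing mean_depth and wrong_mean in the Console is a quick way to convince yourself the two calls are not equivalent.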

When you are creating objects, try to give them meaningful names so you can remember what they are. You can't have spaces or operators that mean something else as part of a name. And remember, everything is case sensitive.

Assign the value 5.4 to water_pH and then try to recall it by typing "water_ph"

In [4]:
water_pH <- 5.4

#water_ph
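If you want to check case sensitivity without triggering an error, one option is the base R function exists(), which reports whether an object with exactly that name has been created. This is just a sketch; has_pH and has_ph are names invented for this example.

```r
water_pH <- 5.4

# exists() takes the object name as a string and returns TRUE or FALSE
has_pH <- exists("water_pH")   # the name we actually assigned
has_ph <- exists("water_ph")   # lowercase h: a different name to R

has_pH
has_ph
```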

You can also set objects equal to strings, or values that have letters in them. To do this you just have to put the value in quotes, otherwise R will think it is an object name and tell you it doesn't exist.

Try: name \<- "JP" and then name \<- JP

What happens if you forget the ending quotation mark?

Try: name \<- "JP

R can be cryptic with its error messages or other responses, but once you get used to them, you know exactly what is wrong when they pop up.

In [5]:
name <- "JP"
#name <- JP
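If you would like to read the error message without stopping your script, one way (a sketch, not something you need for this exercise) is to wrap the line in tryCatch(), a base R function that lets you handle errors yourself. This assumes you have not already created an object called JP.

```r
# Capture the error message as text instead of halting
msg <- tryCatch(
  {
    name <- JP      # no quotes, so R looks for an object called JP
    "no error"      # only reached if the line above succeeds
  },
  error = function(e) conditionMessage(e)
)

msg   # the message R would have printed for the error
```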

## Using functions

![](images/Function%20syntax.png)

As an example, let's try the seq() function, which creates a sequence of numbers.

In [6]:
seq(from = 1, to = 10, by = 1)

#or

seq(1, 10, 1)

#or

seq(1, 10)

#what does this do
seq(10,1)
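A few more variations on seq() may help show how named arguments work. This is a small sketch with made-up object names; length.out is another seq() argument that asks for a number of evenly spaced values instead of a step size.

```r
# Named arguments: from, to, and by
quarter_steps <- seq(from = 0, to = 1, by = 0.25)
quarter_steps

# When you name the arguments, their order does not matter
same_steps <- seq(by = 0.25, to = 1, from = 0)
identical(quarter_steps, same_steps)

# length.out: give the number of values you want instead of the step
four_values <- seq(from = 2, to = 20, length.out = 4)
four_values

# When from is larger than to, seq() counts down
countdown <- seq(10, 1)
countdown
```

Naming arguments takes more typing, but it makes the code easier to read and protects you from putting values in the wrong position.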

## Read in some data.

For the following demonstration we will use the RBI data from a sample of USGS gages we used last class. First we will load the tidyverse library. Everything we have done so far is in base R.

Important: read_csv() is the tidyverse csv reading function, the base R function is read.csv(). read.csv() will not read your data in as a tibble, which is the format used by tidyverse functions.

In [7]:
library(tidyverse)

rbi <- read_csv("Flashy_Dat_Subset.csv")

"package 'tidyverse' was built under R version 4.3.3"


"package 'readr' was built under R version 4.3.3"


"package 'dplyr' was built under R version 4.3.3"


"package 'forcats' was built under R version 4.3.3"


"package 'lubridate' was built under R version 4.3.3"


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.0
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Rows: 49 Columns: 26

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): STANAME, STATE, CLASS, AGGECOREGION
dbl (22): site_no, RBI, RBIrank, DRAIN_SQKM, HUC02, LAT_GAGE, LNG_GAGE, PPTA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Wait, hold up. What is a tibble?

Good question. It's a fancy way to store data that works well with tidyverse functions. Let's look at the rbi tibble.

In [8]:
head(rbi)

site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7,1,47.23739,-68.58264,ME,Ref,⋯,0.202,10.0,-2.49,0.269,-2.7,504.7,36.9,0.102,20,10
1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7,1,44.93694,-67.73611,ME,Ref,⋯,0.131,11.9,-0.85,0.123,-0.6,554.2,39.5,0.046,11,10
1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6,1,44.60797,-67.93524,ME,Ref,⋯,0.344,12.2,0.06,0.873,1.4,553.1,38.2,0.047,20,10
1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9,1,46.14306,-68.63361,ME,Ref,⋯,0.231,10.4,-2.13,0.216,-1.5,513.0,36.4,0.07,11,10
1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2,1,45.50097,-68.30596,ME,Ref,⋯,0.554,11.7,-1.49,0.251,-1.2,540.8,37.2,0.033,20,10
1031300,0.29718786,489,"Piscataquis River at Blanchard, Maine",304.4,1,45.26722,-69.58389,ME,Ref,⋯,0.431,11.0,-2.46,0.268,-1.7,495.8,40.2,0.03,13,10


Now read in the same data with read.csv() which will NOT read the data as a tibble. How is it different? Output each one in the Console.

Knowing the data type for each column is super helpful for a few reasons.... let's talk about them.

Types: int (integer), dbl (double), fct (factor), chr (character), lgl (logical)

In [9]:
rbi_NT <- read.csv("Flashy_Dat_Subset.csv")

head(rbi_NT)

,site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
,<int>,<dbl>,<int>,<chr>,<dbl>,<int>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
1,1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7,1,47.23739,-68.58264,ME,Ref,⋯,0.202,10.0,-2.49,0.269,-2.7,504.7,36.9,0.102,20,10
2,1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7,1,44.93694,-67.73611,ME,Ref,⋯,0.131,11.9,-0.85,0.123,-0.6,554.2,39.5,0.046,11,10
3,1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6,1,44.60797,-67.93524,ME,Ref,⋯,0.344,12.2,0.06,0.873,1.4,553.1,38.2,0.047,20,10
4,1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9,1,46.14306,-68.63361,ME,Ref,⋯,0.231,10.4,-2.13,0.216,-1.5,513.0,36.4,0.07,11,10
5,1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2,1,45.50097,-68.30596,ME,Ref,⋯,0.554,11.7,-1.49,0.251,-1.2,540.8,37.2,0.033,20,10
6,1031300,0.29718786,489,"Piscataquis River at Blanchard, Maine",304.4,1,45.26722,-69.58389,ME,Ref,⋯,0.431,11.0,-2.46,0.268,-1.7,495.8,40.2,0.03,13,10
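To compare the two readers without scrolling through all 26 columns, here is a small sketch that writes a tiny CSV to a temporary file and reads it both ways. The file and its three columns are a cut-down stand-in, not the course data, and the names base_df and tidy_df are invented for this example.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file (not Flashy_Dat_Subset.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("site_no,STATE,RBI",
             "1013500,ME,0.058",
             "1305500,NY,0.046"), tmp)

base_df <- read.csv(tmp)                          # base R: plain data frame
tidy_df <- read_csv(tmp, show_col_types = FALSE)  # readr: tibble

# class() shows what kind of object each function returned
class(base_df)
class(tidy_df)

# The guessed column types differ too: read.csv() uses integer for whole
# numbers, while read_csv() reads numbers as double
sapply(base_df, class)
sapply(tidy_df, class)
```

Both objects hold the same values; the difference is in how they print and in the column types each reader guesses.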


## Data wrangling in dplyr

If you forget syntax or what the following functions do, here is an excellent cheat sheet: <https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf>

We will demo five functions below:

-   **filter()** - returns rows that meet specified conditions
-   **arrange()** - reorders rows
-   **select()** - pulls out variables (columns)
-   **mutate()** - creates new variables (columns) or reformats existing ones
-   **summarize()** - collapses groups of values into summary stats

With all of these, the first argument is the data and then the arguments after that specify what you want the function to do.

![](images/dplyr%20functions.png)

## Filter

Write an expression that returns data in rbi for the state of Maine (ME)

Operators:\
== equal\
!= not equal\
\>=, \<= greater than or equal to, less than or equal to\
\>, \< greater than or less than\
%in% included in a list of values\
& and\
\| or

In [10]:
filter(rbi, STATE == "ME")

site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7,1,47.23739,-68.58264,ME,Ref,⋯,0.202,10.0,-2.49,0.269,-2.7,504.7,36.9,0.102,20,10
1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7,1,44.93694,-67.73611,ME,Ref,⋯,0.131,11.9,-0.85,0.123,-0.6,554.2,39.5,0.046,11,10
1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6,1,44.60797,-67.93524,ME,Ref,⋯,0.344,12.2,0.06,0.873,1.4,553.1,38.2,0.047,20,10
1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9,1,46.14306,-68.63361,ME,Ref,⋯,0.231,10.4,-2.13,0.216,-1.5,513.0,36.4,0.07,11,10
1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2,1,45.50097,-68.30596,ME,Ref,⋯,0.554,11.7,-1.49,0.251,-1.2,540.8,37.2,0.033,20,10
1031300,0.29718786,489,"Piscataquis River at Blanchard, Maine",304.4,1,45.26722,-69.58389,ME,Ref,⋯,0.431,11.0,-2.46,0.268,-1.7,495.8,40.2,0.03,13,10
1031500,0.3204495,545,"Piscataquis River near Dover-Foxcroft, Maine",769.0,1,45.17501,-69.3147,ME,Ref,⋯,0.773,11.5,-2.03,0.514,-1.2,512.5,39.3,0.025,20,10
1037380,0.31804018,537,"Ducktrap River near Lincolnville, Maine",39.0,1,44.32917,-69.06083,ME,Ref,⋯,0.209,12.7,1.55,0.236,1.7,586.6,35.9,0.041,11,10
1044550,0.24157998,360,"Spencer Stream near Grand Falls, Maine",499.8,1,45.31361,-70.24167,ME,Ref,⋯,0.661,9.5,-2.85,0.571,-2.3,465.4,41.2,0.075,10,10
1047000,0.34368775,608,"Carrabassett River near North Anson, Maine",909.1,1,44.8692,-69.9551,ME,Ref,⋯,1.325,12.0,-1.86,1.019,-0.6,503.1,39.4,0.052,20,10


**Multiple conditions**

How many gages are there in Maine with an RBI greater than 0.25?

In [11]:
filter(rbi, STATE == "ME" & RBI > 0.25)

site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1031300,0.2971879,489,"Piscataquis River at Blanchard, Maine",304.4,1,45.26722,-69.58389,ME,Ref,⋯,0.431,11.0,-2.46,0.268,-1.7,495.8,40.2,0.03,13,10
1031500,0.3204495,545,"Piscataquis River near Dover-Foxcroft, Maine",769.0,1,45.17501,-69.3147,ME,Ref,⋯,0.773,11.5,-2.03,0.514,-1.2,512.5,39.3,0.025,20,10
1037380,0.3180402,537,"Ducktrap River near Lincolnville, Maine",39.0,1,44.32917,-69.06083,ME,Ref,⋯,0.209,12.7,1.55,0.236,1.7,586.6,35.9,0.041,11,10
1047000,0.3436877,608,"Carrabassett River near North Anson, Maine",909.1,1,44.8692,-69.9551,ME,Ref,⋯,1.325,12.0,-1.86,1.019,-0.6,503.1,39.4,0.052,20,10
1054200,0.491654,805,"Wild River at Gilead, Maine",181.0,1,44.39044,-70.97964,ME,Ref,⋯,1.502,12.3,-1.46,0.712,-0.6,517.7,39.0,0.028,20,10
1055000,0.4500171,762,"Swift River near Roxbury, Maine",250.6,1,44.64275,-70.58878,ME,Ref,⋯,0.803,12.0,-1.43,0.407,-0.4,498.0,39.4,0.023,20,10
1057000,0.3258137,561,"Little Androscoggin River near South Paris, Maine",190.9,1,44.30399,-70.53968,ME,Ref,⋯,0.536,12.5,-0.49,0.297,0.0,559.3,36.2,0.029,20,10
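The other operators in the list work the same way. Below is a sketch using a small made-up table (the gages object and its values are invented, only the column names match rbi) that tries %in% and |, and uses nrow() to count the rows that pass a filter.

```r
library(dplyr)

# Small invented table with the same column names as rbi
gages <- tibble(
  STATE = c("ME", "ME", "NY", "VT", "NH"),
  RBI   = c(0.10, 0.30, 0.85, 0.25, 0.35)
)

# %in% keeps rows whose STATE matches any value in the vector
northern <- filter(gages, STATE %in% c("ME", "NH", "VT"))

# | keeps rows that meet either condition
flashy_or_vt <- filter(gages, RBI > 0.3 | STATE == "VT")

# nrow() answers "how many?" without counting rows by eye
n_maine_flashy <- nrow(filter(gages, STATE == "ME" & RBI > 0.25))
n_maine_flashy
```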


## Arrange

Arrange sorts by a column in your dataset.

Sort the rbi data by the RBI column in ascending and then descending order

In [12]:
arrange(rbi, RBI)

arrange(rbi, desc(RBI))

site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1305500,0.04639169,18,SWAN RIVER AT EAST PATCHOGUE NY,21.3,2,40.76704,-72.99372,NY,Non-ref,⋯,0.051,16.0,6.32,0.094,6.1,697.2,19.8,0.02,20,10
1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7,1,47.23739,-68.58264,ME,Ref,⋯,0.202,10.0,-2.49,0.269,-2.7,504.7,36.9,0.102,20,10
1306460,0.05872622,37,CONNETQUOT BK NR CENTRAL ISLIP NY,55.7,2,40.77204,-73.15872,NY,Non-ref,⋯,0.027,15.8,6.48,0.083,6.4,704.1,19.8,0.015,20,10
1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2,1,45.50097,-68.30596,ME,Ref,⋯,0.554,11.7,-1.49,0.251,-1.2,540.8,37.2,0.033,20,10
1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9,1,46.14306,-68.63361,ME,Ref,⋯,0.231,10.4,-2.13,0.216,-1.5,513.0,36.4,0.07,11,10
1117468,0.1719865,244,"BEAVER RIVER NEAR USQUEPAUG, RI",25.3,1,41.4926,-71.62812,RI,Ref,⋯,0.118,15.0,4.46,0.289,4.7,647.5,25.7,0.044,20,10
1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6,1,44.60797,-67.93524,ME,Ref,⋯,0.344,12.2,0.06,0.873,1.4,553.1,38.2,0.047,20,10
1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7,1,44.93694,-67.73611,ME,Ref,⋯,0.131,11.9,-0.85,0.123,-0.6,554.2,39.5,0.046,11,10
1162500,0.21330919,311,"PRIEST BROOK NEAR WINCHENDON, MA",49.7,1,42.68259,-72.11508,MA,Ref,⋯,0.179,13.7,0.76,0.367,0.4,567.4,31.0,0.027,20,10
1117370,0.22982547,338,QUEEN R AT LIBERTY RD AT LIBERTY RI,50.5,1,41.53899,-71.56867,RI,Ref,⋯,0.151,15.0,4.67,0.101,4.6,649.7,25.7,0.047,11,10


site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1311500,0.85601055,1017,VALLEY STREAM AT VALLEY STREAM NY,18.1,2,40.66371,-73.70458,NY,Non-ref,⋯,0.028,16.4,7.22,0.166,7.4,737.5,16.0,0.022,20,10
1054200,0.49165397,805,"Wild River at Gilead, Maine",181.0,1,44.39044,-70.97964,ME,Ref,⋯,1.502,12.3,-1.46,0.712,-0.6,517.7,39.0,0.028,20,10
1187300,0.48701984,800,"HUBBARD RIVER NR. WEST HARTLAND, CT.",53.9,1,42.03732,-72.93899,MA,Ref,⋯,0.36,13.5,0.97,0.143,0.9,581.9,31.9,0.014,20,10
1105600,0.4842607,797,"OLD SWAMP RIVER NEAR SOUTH WEYMOUTH, MA",12.7,1,42.19038,-70.94477,MA,Non-ref,⋯,0.033,15.0,4.71,0.027,4.8,654.2,23.9,0.064,20,10
1055000,0.45001714,762,"Swift River near Roxbury, Maine",250.6,1,44.64275,-70.58878,ME,Ref,⋯,0.803,12.0,-1.43,0.407,-0.4,498.0,39.4,0.023,20,10
1195100,0.43028357,744,"INDIAN RIVER NEAR CLINTON, CT.",14.8,1,41.30593,-72.5312,CT,Ref,⋯,0.157,15.7,4.28,0.21,4.6,657.6,23.8,0.015,20,10
1181000,0.41996286,732,"WEST BRANCH WESTFIELD RIVER AT HUNTINGTON, MA",243.5,1,42.23731,-72.89565,MA,Ref,⋯,0.853,14.4,1.19,0.203,1.3,596.7,31.8,0.038,20,10
1350000,0.41411511,721,SCHOHARIE CREEK AT PRATTSVILLE NY,612.5,2,42.31953,-74.43654,NY,Ref,⋯,0.91,13.5,-0.16,0.59,1.3,523.1,33.4,0.049,20,10
1121000,0.40433769,710,"MOUNT HOPE RIVER NEAR WARRENVILLE, CT.",70.3,1,41.84371,-72.16897,CT,Ref,⋯,0.373,15.0,2.65,0.234,2.7,619.1,26.8,0.021,20,10
1169000,0.39529735,688,"NORTH RIVER AT SHATTUCKVILLE, MA",230.6,1,42.63842,-72.72509,MA,Ref,⋯,0.548,13.5,0.79,0.326,1.3,564.4,34.3,0.022,20,10
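arrange() can also sort by more than one column: later columns break ties in earlier ones. A minimal sketch with an invented table (not the rbi data):

```r
library(dplyr)

# Invented values, same column names as rbi
gages <- tibble(
  STATE = c("NY", "ME", "NY", "ME"),
  RBI   = c(0.05, 0.49, 0.86, 0.06)
)

# Sort by state alphabetically, then by RBI from highest to lowest
# within each state
by_state <- arrange(gages, STATE, desc(RBI))
by_state
```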


## Select

There are too many columns! You will often want to use select() when you are manipulating the structure of your data and need to trim it down to only include what you will use.

Select Site name, state, and RBI from the rbi data

Note they come back in the order you list them in the function, not the order they had in the original data.

You can do a lot more with select, especially when you need to select a bunch of columns but don't want to type them all out. But we don't need to cover all that today. For a taste though, if you want to select a group of columns you can specify the first and last with a colon in between (first:last) and it'll return all of them. Select the rbi columns from site_no to DRAIN_SQKM.

In [13]:
select(rbi, STANAME, STATE, RBI)

select(rbi, site_no:DRAIN_SQKM)

STANAME,STATE,RBI
<chr>,<chr>,<dbl>
"Fish River near Fort Kent, Maine",ME,0.05837454
"Old Stream near Wesley, Maine",ME,0.20797008
"Narraguagus River at Cherryfield, Maine",ME,0.19805382
"Seboeis River near Shin Pond, Maine",ME,0.13151299
"Mattawamkeag River near Mattawamkeag, Maine",ME,0.11350485
"Piscataquis River at Blanchard, Maine",ME,0.29718786
"Piscataquis River near Dover-Foxcroft, Maine",ME,0.3204495
"Ducktrap River near Lincolnville, Maine",ME,0.31804018
"Spencer Stream near Grand Falls, Maine",ME,0.24157998
"Carrabassett River near North Anson, Maine",ME,0.34368775


site_no,RBI,RBIrank,STANAME,DRAIN_SQKM
<dbl>,<dbl>,<dbl>,<chr>,<dbl>
1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7
1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7
1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6
1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9
1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2
1031300,0.29718786,489,"Piscataquis River at Blanchard, Maine",304.4
1031500,0.3204495,545,"Piscataquis River near Dover-Foxcroft, Maine",769.0
1037380,0.31804018,537,"Ducktrap River near Lincolnville, Maine",39.0
1044550,0.24157998,360,"Spencer Stream near Grand Falls, Maine",499.8
1047000,0.34368775,608,"Carrabassett River near North Anson, Maine",909.1
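For another taste of what select() can do, here is a sketch of two shortcuts: starts_with() picks columns by the start of their name, and a minus sign drops a column. The gages table below is a cut-down stand-in with a few of the rbi column names.

```r
library(dplyr)

# Cut-down stand-in for rbi with four of its columns
gages <- tibble(
  site_no    = c(1013500, 1021480),
  STATE      = c("ME", "ME"),
  T_MAX_SITE = c(10.0, 11.9),
  T_MIN_SITE = c(-2.7, -0.6)
)

# Keep every column whose name begins with "T_"
temps <- select(gages, starts_with("T_"))
names(temps)

# Keep everything except STATE
no_state <- select(gages, -STATE)
names(no_state)
```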


## Mutate

Use mutate to add new columns based on existing ones. Common uses are to create a column of data in different units, or to calculate something based on two columns. You can also use it to just update a column, by naming the new column the same as the original one (but be careful because you'll lose the original one!). I commonly use this when I am changing the datatype of a column, say from a character to a factor or a string to a date.

Create a new column in rbi called T_RANGE by subtracting T_MIN_SITE from T_MAX_SITE

In [14]:
mutate(rbi, T_RANGE = T_MAX_SITE - T_MIN_SITE)

site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09,T_RANGE
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7,1,47.23739,-68.58264,ME,Ref,⋯,10.0,-2.49,0.269,-2.7,504.7,36.9,0.102,20,10,12.7
1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7,1,44.93694,-67.73611,ME,Ref,⋯,11.9,-0.85,0.123,-0.6,554.2,39.5,0.046,11,10,12.5
1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6,1,44.60797,-67.93524,ME,Ref,⋯,12.2,0.06,0.873,1.4,553.1,38.2,0.047,20,10,10.8
1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9,1,46.14306,-68.63361,ME,Ref,⋯,10.4,-2.13,0.216,-1.5,513.0,36.4,0.07,11,10,11.9
1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2,1,45.50097,-68.30596,ME,Ref,⋯,11.7,-1.49,0.251,-1.2,540.8,37.2,0.033,20,10,12.9
1031300,0.29718786,489,"Piscataquis River at Blanchard, Maine",304.4,1,45.26722,-69.58389,ME,Ref,⋯,11.0,-2.46,0.268,-1.7,495.8,40.2,0.03,13,10,12.7
1031500,0.3204495,545,"Piscataquis River near Dover-Foxcroft, Maine",769.0,1,45.17501,-69.3147,ME,Ref,⋯,11.5,-2.03,0.514,-1.2,512.5,39.3,0.025,20,10,12.7
1037380,0.31804018,537,"Ducktrap River near Lincolnville, Maine",39.0,1,44.32917,-69.06083,ME,Ref,⋯,12.7,1.55,0.236,1.7,586.6,35.9,0.041,11,10,11.0
1044550,0.24157998,360,"Spencer Stream near Grand Falls, Maine",499.8,1,45.31361,-70.24167,ME,Ref,⋯,9.5,-2.85,0.571,-2.3,465.4,41.2,0.075,10,10,11.8
1047000,0.34368775,608,"Carrabassett River near North Anson, Maine",909.1,1,44.8692,-69.9551,ME,Ref,⋯,12.0,-1.86,1.019,-0.6,503.1,39.4,0.052,20,10,12.6
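As a sketch of the datatype changes mentioned above, the example below overwrites two columns in place: a character column becomes a factor and a string becomes a date. The samples table and its dates are invented for illustration.

```r
library(dplyr)

# Invented table: both columns start out as character
samples <- tibble(
  STATE       = c("ME", "NY", "ME"),
  sample_date = c("2020-01-15", "2020-02-01", "2020-03-10")
)

# Reusing the column names on the left of = replaces the originals
converted <- mutate(samples,
                    STATE = as.factor(STATE),
                    sample_date = as.Date(sample_date))

class(converted$STATE)
class(converted$sample_date)
```

Because the result is stored in converted, the original samples table still has its character columns.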


When downloading data from the USGS through R, you have to enter the gage ID as a character, even though they are all made up of numbers. So to practice doing this, update the site_no column to be a character datatype

In [15]:
mutate(rbi, site_no = as.character(site_no))

site_no,RBI,RBIrank,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,CLASS,⋯,T_MAXSTD_BASIN,T_MAX_SITE,T_MIN_BASIN,T_MINSTD_BASIN,T_MIN_SITE,PET,SNOW_PCT_PRECIP,PRECIP_SEAS_IND,FLOWYRS_1990_2009,wy00_09
<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1013500,0.05837454,35,"Fish River near Fort Kent, Maine",2252.7,1,47.23739,-68.58264,ME,Ref,⋯,0.202,10.0,-2.49,0.269,-2.7,504.7,36.9,0.102,20,10
1021480,0.20797008,300,"Old Stream near Wesley, Maine",76.7,1,44.93694,-67.73611,ME,Ref,⋯,0.131,11.9,-0.85,0.123,-0.6,554.2,39.5,0.046,11,10
1022500,0.19805382,286,"Narraguagus River at Cherryfield, Maine",573.6,1,44.60797,-67.93524,ME,Ref,⋯,0.344,12.2,0.06,0.873,1.4,553.1,38.2,0.047,20,10
1029200,0.13151299,183,"Seboeis River near Shin Pond, Maine",444.9,1,46.14306,-68.63361,ME,Ref,⋯,0.231,10.4,-2.13,0.216,-1.5,513.0,36.4,0.07,11,10
1030500,0.11350485,147,"Mattawamkeag River near Mattawamkeag, Maine",3676.2,1,45.50097,-68.30596,ME,Ref,⋯,0.554,11.7,-1.49,0.251,-1.2,540.8,37.2,0.033,20,10
1031300,0.29718786,489,"Piscataquis River at Blanchard, Maine",304.4,1,45.26722,-69.58389,ME,Ref,⋯,0.431,11.0,-2.46,0.268,-1.7,495.8,40.2,0.03,13,10
1031500,0.3204495,545,"Piscataquis River near Dover-Foxcroft, Maine",769.0,1,45.17501,-69.3147,ME,Ref,⋯,0.773,11.5,-2.03,0.514,-1.2,512.5,39.3,0.025,20,10
1037380,0.31804018,537,"Ducktrap River near Lincolnville, Maine",39.0,1,44.32917,-69.06083,ME,Ref,⋯,0.209,12.7,1.55,0.236,1.7,586.6,35.9,0.041,11,10
1044550,0.24157998,360,"Spencer Stream near Grand Falls, Maine",499.8,1,45.31361,-70.24167,ME,Ref,⋯,0.661,9.5,-2.85,0.571,-2.3,465.4,41.2,0.075,10,10
1047000,0.34368775,608,"Carrabassett River near North Anson, Maine",909.1,1,44.8692,-69.9551,ME,Ref,⋯,1.325,12.0,-1.86,1.019,-0.6,503.1,39.4,0.052,20,10


## Summarize

Summarize will perform an operation on all of your data, or groups if you assign groups.

Use summarize to compute the mean, min, and max rbi

In [16]:
summarize(rbi, meanrbi = mean(RBI), maxrbi = max(RBI), minrbi = min(RBI))

meanrbi,maxrbi,minrbi
<dbl>,<dbl>,<dbl>
0.3156739,0.8560106,0.04639169


Now use the group_by() function to group by state and then summarize in the same way as above.

In [17]:
rbistate <- group_by(rbi, STATE)
summarize(rbistate, meanrbi = mean(RBI), maxrbi = max(RBI), minrbi = min(RBI))

STATE,meanrbi,maxrbi,minrbi
<chr>,<dbl>,<dbl>,<dbl>
CT,0.3663652,0.4302836,0.29470192
MA,0.3666678,0.4870198,0.21330919
ME,0.2690651,0.491654,0.05837454
NH,0.3363972,0.3684461,0.26472423
NY,0.3415242,0.8560106,0.04639169
RI,0.200906,0.2298255,0.1719865
VT,0.299268,0.3649163,0.23084529
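A common addition to a grouped summary is a count of how many rows went into each group, which dplyr provides through n(). A minimal sketch with an invented table:

```r
library(dplyr)

# Invented values, same column names as rbi
gages <- tibble(
  STATE = c("ME", "ME", "NY", "VT", "VT", "VT"),
  RBI   = c(0.1, 0.3, 0.8, 0.2, 0.3, 0.4)
)

grouped <- group_by(gages, STATE)

# n() takes no arguments and counts the rows in each group
state_summary <- summarize(grouped, n_gages = n(), meanrbi = mean(RBI))
state_summary
```

Knowing the count helps you judge how much to trust each mean: a state with one gage is a very different summary than a state with ten.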


## Multiple operations with pipes

The pipe operator \|\> allows you to perform multiple operations in a sequence without saving intermediate steps. Not only is this more efficient, but structuring operations with pipes is also more intuitive than nesting functions within functions (the other way you can do multiple operations). The \|\> pipe is built into base R (version 4.1.0 and later). If you see code elsewhere that has a %\>% pipe, that is the original pipe, from the magrittr package. The base R pipe was modeled on it and, for everything we do here, works the same!

**Let's say we want to tell R to make a PB&J sandwich by using the pbbread(), jbread(), and joinslices() functions and the data "ingredients". If we do this saving each step it would look like this:**

> sando \<- pbbread(ingredients)

> sando \<- jbread(sando)

> sando \<- joinslices(sando)

**If we nest the functions together we get this**

> joinslices(jbread(pbbread(ingredients)))

Efficient... but tough to read/interpret

**Using the pipe it would look like this**

> ingredients \|\>\
> pbbread() \|\>\
> jbread() \|\>\
> joinslices()

Much easier to follow!

**When you use the pipe, it basically takes whatever came out of the first function and puts it into the data argument for the next one**

**so rbi \|\> group_by(STATE) is the same as group_by(rbi, STATE)**

Take the group_by() and summarize() code from above and perform the operation using the pipe.

In [18]:
rbi |>
  group_by(STATE) |>
  summarize(meanrbi = mean(RBI), maxrbi = max(RBI), minrbi = min(RBI))

STATE,meanrbi,maxrbi,minrbi
<chr>,<dbl>,<dbl>,<dbl>
CT,0.3663652,0.4302836,0.29470192
MA,0.3666678,0.4870198,0.21330919
ME,0.2690651,0.491654,0.05837454
NH,0.3363972,0.3684461,0.26472423
NY,0.3415242,0.8560106,0.04639169
RI,0.200906,0.2298255,0.1719865
VT,0.299268,0.3649163,0.23084529
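Since the sandwich functions are imaginary, here is a sketch you can actually run that compares nesting and piping using real base R functions. The vector x is made up, and the native pipe needs R 4.1.0 or later.

```r
x <- c(1, 4, 9, 16)   # made-up values

# Nested: read from the inside out (sqrt, then mean, then round)
nested <- round(mean(sqrt(x)), 1)

# Piped: read top to bottom in the order the steps happen
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

# Both approaches give the same result
identical(nested, piped)
```

Notice that round(1) in the piped version still gets its extra argument; the pipe only fills in the first one.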


## Save your results to a new tibble

We have just been writing everything to the screen so we can see what we are doing... In order to save anything we do with these functions to work with it later, we just have to use the assignment operator (\<-) to store the data.

One kind of awesome thing about the assignment operator is that it works both ways...

x \<- 3 and 3 -\> x do the same thing (WHAT?!)

So you can do the assignment at the beginning or the end of your dplyr workings, whatever you like best.

Use the assignment operator to save the summary table you just made.

In [19]:
stateRBIs <- rbi |>
  group_by(STATE) |>
  summarize(meanrbi = mean(RBI), maxrbi = max(RBI), minrbi = min(RBI))

# Notice when you do this it doesn't output the result... 
# You can see what you did by clicking on stateRBIs in your environment panel
# or just type stateRBIs

stateRBIs

STATE,meanrbi,maxrbi,minrbi
<chr>,<dbl>,<dbl>,<dbl>
CT,0.3663652,0.4302836,0.29470192
MA,0.3666678,0.4870198,0.21330919
ME,0.2690651,0.491654,0.05837454
NH,0.3363972,0.3684461,0.26472423
NY,0.3415242,0.8560106,0.04639169
RI,0.200906,0.2298255,0.1719865
VT,0.299268,0.3649163,0.23084529


## What about NAs?

We will talk more about this when we discuss stats, but some operations will return NA (or fail outright) if there are NAs in the data. If appropriate, you can tell functions like mean() to ignore NAs with na.rm = TRUE. You can also use drop_na() if you're working with a tibble. But be aware that if you use that and save the result, drop_na() gets rid of the whole row, not just the NA. Because what would you replace it with.... an NA?

In [20]:
x <- c(1,2,3,4,NA)
mean(x, na.rm = TRUE)
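To see both behaviors side by side, here is a short sketch. The first part compares mean() with and without na.rm; the second uses drop_na() from the tidyr package on an invented three-row table to show that the whole row disappears.

```r
library(tidyr)

x <- c(1, 2, 3, 4, NA)

mean_with_na <- mean(x)                # the NA carries through to the result
mean_no_na   <- mean(x, na.rm = TRUE)  # the NA is ignored
mean_with_na
mean_no_na

# Invented table with one missing RBI value
gages <- tibble::tibble(
  STATE = c("ME", "NY", "VT"),
  RBI   = c(0.1, NA, 0.3)
)

# drop_na() removes every row that contains an NA in any column
complete <- drop_na(gages)
nrow(gages)
nrow(complete)
```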

## What are some things you think I'll ask you to do for the activity next class?