layout | element | title | language |
---|---|---|---|
page |
notes |
Working with Tabular Data |
R |
Remember to
download
surveys.csv
.download
species.csv
.download
portal_mammals.sqlite
.Consider removing the
dplyr
package so you can demonstrate installing it.
- Linux users: you may not want to do this because the source install is slow
- Main way that reusable code is shared in R
- Combination of code, data, and documentation
- R has a rich ecosystem of packages for data manipulation & analysis
- Download and install packages with the R console:
install.packages("dplyr")
- Using a package:
- Load all of the functions in the package:
library("dplyr")
- Load all of the functions in the package:
- Modern data manipulation library for R
- Does a lot of the same things we've learned to do in SQL.
- Start a new project (modeling good practice)
surveys <- read.csv("surveys.csv")
- Select:
select(surveys, year, month, day)
- Filter:
filter(surveys, species_id == "DS")
filter(surveys, species_id == "DS", year > 1995)
filter(surveys, species_id == "DS" | species_id == "DM")
- Mutate:
mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
Do [Exercise 2 - Shrub Volume Data Basics]({{ site.baseurl }}/exercises/Dplyr-shrub-volume-data-basics-R).
- Group by:
group_by(surveys, species_id)
- Different looking kind of
data.frame
- Source, grouping, and data type information
surveys_by_species <- group_by(surveys, species_id)
- Grouping with
summarize()
:summarize(surveys_by_species, avg_weight = mean(weight))
- Real data problem:
mean(weight)
whenweight
has missing values (NA
)- Returns
NA
mean(weight, na.rm=TRUE)
- Returns
Do [Exercise 3 - Shrub Volume Aggregation]({{ site.baseurl }}/exercises/Dplyr-shrub-volume-aggregation-R).
inner_join
indplyr
works likeJOIN
in SQL
species <- read.csv("species.csv")
combined <- inner_join(surveys, species, by = "species_id")
head(combined)
Do [Exercise 4 - Shrub Volume Join]({{ site.baseurl }}/exercises/Dplyr-shrub-volume-join-R).
- Combine a series of data manipulation actions
- Intermediate variables
- Step-wise approach
- Can get cumbersome with lots of variable objects in the environment
surveys_DS <- filter(surveys, species_id == "DS")
surveys_DS_by_yr <- group_by(surveys_DS, year)
avg_weight_DS_by_yr <- summarize(surveys_DS_by_yr,
avg_weight = mean(weight, na.rm=TRUE))
- Pipes:
- Operator:
%>%
- Operation:
%>%
takes the output of one command and passes it as input to the next commandx %>% f(y)
translates tof(x, y)
surveys %>% filter(species_id == "DS")
- Operator:
surveys %>%
filter(species_id == "DS") %>%
group_by(year) %>%
summarize(avg_weight = mean(weight, na.rm=TRUE))
Do [Exercise 5 - Fix the Code]({{ site.baseurl }}/exercises/Dplyr-fix-the-code-R).
- We can also use
dplyr
to access data directly from a database.- No need to export files from the database
- Lets the database do the heavy lifting
- Faster
- No RAM limits
- Need to install the
dbplyr
package
library(DBI)
portaldb <- dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite")
surveys <- tbl(portaldb, "surveys")
surveys
species <- tbl(portaldb, "species")
portal_data <- inner_join(surveys, species, by = "species_id") %>%
select(year, month, day, genus, species)
- Can also extract data directly using SQL
query <- "SELECT year, month, day, genus, species
FROM surveys JOIN species
USING(species_id)"
portal_data <- dbGetQuery(portaldb, query)
- Either of these runs the query in the database
Do [Exercise 6 - Links to Databases]({{ site.baseurl }}/exercises/Dplyr-link-to-databases-R).
- Speed example using Breeding Bird Survey of North America data
- ~85 million cells (>250 MB)
# Loading from SQLite completes instantly
bbs_sqlite <- dbConnect(RSQLite::SQLite(), "bbs.sqlite")
bbs_counts <- tbl(bbs_sqlite, "breed_bird_survey_counts")
bbs_counts
# Loading from csv takes 30 seconds
bbs_counts_csv <- read.csv("BBS_counts.csv")
- Queries and data manipulation functions return similar results with various
headings (
Source: SQL
) - Number of rows is unknown as shown by
??
- Queries and data manipulation results will remain in the external database.
- Use
collect()
to store results in a local data frame (# A tibble
).
portal_data <- inner_join(surveys, species, by = "species_id") %>%
select(year, month, day, genus, species) %>%
collect()
- If you want to store a table from R in the database use
copy_to()