reviewing chapter_3 #70

Merged (7 commits), Aug 15, 2017
Changes from 1 commit
146 changes: 75 additions & 71 deletions 03-attribute-operations.Rmd
@@ -2,78 +2,87 @@

## Prerequisites {-}

- This chapter requires **tidyverse** and **sf**:
- This chapter requires the packages **tidyverse** and **sf**:
Collaborator:

Great to be explicit, thanks for clarifying that for readers.


```{r, message=FALSE}
library(sf)
library(tidyverse)
```

- You must have loaded the `world` and `worldbank_df` data which are loaded automatically by the **spData** package:
- We will also make use of the `world` and `worldbank_df` data sets. Note that loading the **spData** package automatically attaches these data sets to your global environment:
Collaborator:

Again, great attention to detail in the description, thanks for that.


```{r, results='hide'}
library(spData)
```

## Introduction

Attribute data is non-spatial information associated with geographic data.
In the context of simple features, introduced in the previous chapter, this means a data frame with a column for each variable and one row per geographic feature stored in the `geom` list-column of `sf` objects.
Attribute data is non-spatial information, e.g., the name of a bus station, associated with geographic data, e.g. the coordinate of this bus station.
Collaborator:

Not keen on using acronyms such as i.e. or e.g. mid-text, especially when it's surrounded by enclosing commas - does not flow great. I propose the line is changed to the following:

Attribute data is non-spatial information associated with geographic (geometry) data.
A bus station, for example, could be represented by a field containing its name (attribute data), associated with its latitude and longitude position (geometry data).

Collaborator Author:

Ok, I'll remember that! Changed as requested.

Simple features (see previous chapter) store attribute data in a dataframe with each column corresponding to a variable and each row to one observation, e.g., a bus station.
Collaborator:

2 e.g.s in quick succession! I suggest a small change:

Simple features, described in the previous chapter, store attribute data in a data frame, with each column corresponding to a variable (such as 'name') and each row to one observation (such as an individual bus station).

Collaborator Author:

Thanks, incorporated that.

In addition, a special column, mostly named `geom` or `geometry`, stores the spatial information of an **sf**-object, e.g., the coordinate of the bus station.
Collaborator:

A third e.g.! Suggestion (we've already said that the geometry contains the coordinates but it should link to the next sentence):

In addition, a special column, usually named `geom` or `geometry`, stores the geometry data of **sf** objects.
For a bus station, that would likely be a single point representing its centroid.

Collaborator Author:

Ok, I will make sure to avoid using e.g. and i.e. :-). Changed that.

By contrast, a line or a polygon consists of multiple points.
Still, these points only correspond to one row in the attribute table.
This works because **sf** stores the geometry in the form of a list.
Each list element corresponds to one observation (row) in the attribute table.
But each list element can contain more than one coordinate pair if required, or even another list, as is the case for polygons with holes (see previous sections).
This structure enables multiple columns to represent a range of attributes for thousands of features (one row per feature).
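
A minimal sketch of this one-row-per-feature structure, assuming the `world` object loaded above (results not shown):

```{r, eval=FALSE}
# each feature occupies exactly one row and one geometry list element
nrow(world)                      # number of rows (features)
length(st_geometry(world))       # number of geometry list elements: the same
# but a single feature's geometry can contain many coordinate pairs
nrow(st_coordinates(world[1, ])) # coordinate pairs of the first feature alone
```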

There is a strong overlap between geographical and non-geographical operations:
non-spatial subset, aggregate and join each have their geographical equivalents.
The subsetting functions `[` from base R and `filter()` from the **tidyverse**, for example, can also be used for spatial subsetting: the skills are cross-transferable.
This chapter therefore provides the foundation for Chapter \@ref(spatial-data-operations), in terms of structure and input data.
The subsetting functions `[` from base R and `filter()` from the **tidyverse**, for example, are also applicable to spatial data: the skills are cross-transferable.
This chapter, therefore, provides the foundation for Chapter \@ref(spatial-data-operations) in terms of structure and input data.
Collaborator Author:

What exactly do you mean by foundation in terms of structure?

Collaborator:

I mean the structure of c4 mirrors that of c3. Does this sound any better?

This chapter therefore provides the basis for Chapter \@ref(spatial-data-operations).

You could also say something about it mirroring the structure if you can find the right form of words.

Collaborator Author:

Ok, I see. I have just clarified this.


As outlined in Chapter \@ref(spatial-class), support for simple features in R is provided by the **sf** package.
**sf** ensures simple feature objects work well with generic R functions such as `plot()` and `summary()`.
The reason for this is that simple features have their own class, which behave simultaneously as geographic data objects (e.g. plotting as maps) and square tables (e.g. with attribute columns referred to with the `$` operator).
As outlined in Chapter \@ref(spatial-class), **sf** provided the support for simple features in R.
Additionally, **sf** added methods to generic R functions such as `plot()` and `summary()` to work with simple features. To convince yourself run for example `methods("summary")` and/or `methods("plot")`.
<!--The reason for this is that simple features have their own class, which behave simultaneously as geographic data objects (e.g., plotting as maps) and square tables (e.g., with attribute columns referred to with the `$` operator).-->
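
As a quick illustration of that point (a sketch only; the exact listings depend on the packages installed):

```{r, eval=FALSE}
# generics for which methods are registered; with sf attached the output
# should include sf-specific entries such as plot.sf
methods("summary")
methods("plot")
```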
Collaborator Author (@jannes-m, Aug 10, 2017):

I am not sure what this means, can you please clarify (The reason for this is that simple features...)?

Collaborator:

I think that commented bit can safely be deleted: we discuss the fact that sf objects are also data frames at some length. Not sure "To convince yourself" is the best form of words - maybe this would be a more appropriate sentence to replace lines 35:37:

As outlined in Chapter \@ref(spatial-class), **sf** provided support for simple features in R and made them work with generic R functions such as `plot()` and `summary()` (as can be seen by executing `methods("summary")` and/or `methods("plot")`).

Collaborator Author:

Ok, I deleted the commented part and also adopted your wording. Thanks.


The trusty `data.frame` (and extensions to it such as the `tibble` class used in the tidyverse) is a workhorse for data analysis in R.
Extending this system to work with spatial data has many advantages,
meaning that all the accumulated know-how in the R community for handling data frames to be applied to geographic data which contain attributes.
The reliable `data.frame` (and modifications of it such as the `tibble` class used in the tidyverse) is the basis for data analysis in R.
Collaborator:

I would say modifications to it rather than modifications of it as the class is modified by an external force (the programmer). Otherwise I think this adjustment to the text is an improvement, thanks for that.

Collaborator Author:

Good point!

Extending this system to work with spatial data has many advantages.
The most important one is that the accumulated know-how in the R community for handling data frames is transferable to geographic attribute data.
Collaborator:

I would replace is transferable with can be transferred because it's still contingent on knowing how to program with data frames, hence the importance of learning about attribute data operations and reading this chapter.

Collaborator Author:

Another good point! This was just my habit of avoiding the passive voice. A good example of when that's no good...


Before proceeding to perform various attribute operations of a dataset, it is worth taking time to think about its basic parameters.
In this case, the `world` object contains 10 non-geographical columns (and one geometry list-column) with data for almost 200 countries.
This can be be checked using base R functions for working with tabular data such as `nrow()` and `ncol()`:
Before proceeding to perform various attribute operations on a dataset, it is advisable to explore its structure.
Collaborator:

Suggestion (more concise, informal and hopefully friendly):

Before proceeding to perform various attribute operations on a dataset, let's explore its structure.

Collaborator Author:

changed that.

To find out more about the structure of our use case dataset `world`, we use base R functions for working with tabular data such as `nrow()` and `ncol()`:

```{r}
dim(world) # it is a 2 dimensional object, with rows and columns
nrow(world) # how many rows?
ncol(world) # how many columns?
```

Our dataset contains ten non-geographical columns (and one geometry list-column) with almost 200 rows representing the world's countries.
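
To see which columns those are, and to confirm that `world` is simultaneously an `sf` object and a data frame, a short sketch (using the objects loaded above):

```{r, eval=FALSE}
names(world) # column names, including the geometry list-column
class(world) # "sf" alongside "data.frame"
```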

Extracting the attribute data of an `sf` object is the same as removing its geometry:

```{r}
world_df = st_set_geometry(world, NULL)
class(world_df)
```
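
A quick follow-up check (a sketch based on the objects created above): dropping the geometry removes exactly one column.

```{r, eval=FALSE}
ncol(world)    # includes the geometry list-column
ncol(world_df) # one column fewer: the geometry has been removed
```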

This can be useful if the geometry column causes problems, e.g. by occupying large amounts of RAM, or to focus attention on the non-spatial data.
This can be useful if the geometry column causes problems, e.g., by occupying large amounts of RAM, or to focus attention on the attribute data.
For most cases, however, there is no harm in keeping the geometry column because non-spatial data operations on `sf` objects act only on the attribute data.
For this reason, being good at working with attribute data in geographic data is the same being proficient at handling data frames in R.
For many applications, the most effective and intuitive way of working with data frames is with the **dplyr** package, as we will see in the next
For this reason, being good at working with attribute data of spatial objects is the same as being proficient at handling data frames in R.
For many applications, **dplyr** offers the most effective and intuitive approach to working with data frames, as we will see in the next
section.^[
Unlike objects of class `Spatial` defined by the **sp** package, `sf` objects are also compatible with **dplyr** and **data.table** packages, which provide fast and powerful functions for data manipulation (see [Section 6.7](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table) of @gillespie_efficient_2016).
Unlike objects of class `Spatial` of the **sp** package, `sf` objects are also compatible with the packages **dplyr** and **data.table** (at least in theory). Both packages provide fast and powerful functions for data manipulation (see [Section 6.7](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table) of @gillespie_efficient_2016).
Collaborator Author:

I am not sure if sf objects really work well with data.table; I guess they sometimes do, and sometimes not. Edzer also said at the UseR-conference that if somebody would like to see sf working with data.table, he is happy to include corresponding pull requests (he did the same with the tidyverse).

Collaborator:

Good point - maybe just delete the bit about data.table: there is no point mentioning it as we do not use it in the book and it could cause confusion. Suggest:

Unlike objects of class `Spatial` of the **sp** package, `sf` objects are also compatible with the **tidyverse** packages **dplyr** and **ggplot2**. The former provides fast and powerful functions for data manipulation (see [Section 6.7](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table) of @gillespie_efficient_2016) and the latter provides powerful plotting capabilities.

Collaborator Author:

Perfect. I have incorporated that. Thanks again.

This chapter focuses on **dplyr** because of its intuitive function names and ability to perform multiple chained operations using the pipe operator.]

## Attribute subsetting

Because simple feature objects are also data frames, you can use a wide range of functions (from base R and packages) for subsetting them, based on attribute data.
Because simple feature objects are also data frames (run `class(world)` to verify), you can use a wide range of functions (from base R and other packages) for subsetting them.
Base R subsetting functions include `[`, `subset()` and `$`.
**dplyr** subsetting functions include `select()`, `filter()`, and `pull()`.
Both sets of functions preserve the spatial components of the data.
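
A small sketch of that point, assuming the packages and data loaded earlier: subsetting with either approach returns an object that is still of class `sf`.

```{r, eval=FALSE}
class(world[1:3, ])             # base R subsetting keeps the sf class
class(filter(world, pop > 1e9)) # so does dplyr's filter()
```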

The `[` operator subsets rows and columns.
It requires two arguments, one for rows (observations) and one for columns (variables), and is appended to the object name, e.g. `object[rows, columns]`,
which can be either numeric, indicating position, or character, indicating row or column names.
Leaving an argument empty returns all, meaning `object[rows,]` returns just the rows of interest for all columns.
This functionality is demonstrated below (results not shown - try running this on your own computer to check the output is as expected):
The `[` operator can subset both rows and columns.
You use indices to specify the elements you wish to extract from an object, e.g., `object[i, j]` with `i` and `j` representing rows and columns.
<!-- you can also use `[`(world, 1:6, 1) -->
The indices can be either numeric, indicating position, or character strings, indicating row or column names.
Leaving `i` or `j` empty simply returns all rows or columns.
For instance, `object[1:5, ]` returns the first five rows and all columns.
Below, we demonstrate how to use base R subsetting (results not shown - try running this on your own computer to check the output is as expected):

```{r, eval=FALSE}
world[1:6,] # subset rows by position
world[1:6, ] # subset rows by position
Collaborator:

Great formatting fix. I only realised recently that this is good practice.

Collaborator Author:

Glad you like it. And it's good that we agree on a consistent coding style. Though again, I am always happy to adopt yours as well, such as `=` instead of `<-`. Consistency is the important thing.

```

```{r, eval=FALSE}
@@ -84,88 +93,92 @@
world[, 1:3] # subset columns by position
world[, c("name_long", "lifeExp")] # subset columns by name
```

The `[` subsetting operator also accepts `logical` vectors corresponding to some criteria which returns `TRUE` or `FALSE`.
The following code chunk, for example, creates a new object, `small_countries`, which only contains nations whose surface area is below 100,000 km^2^:
The `[` subsetting operator also accepts `logical` vectors consisting of `TRUE` and `FALSE` elements.
The following code chunk, for example, creates a new object, `small_countries`, which only contains nations whose surface area is smaller than 10,000 km^2^:

```{r}
sel_area = world$area_km2 < 10000
summary(sel_area)
small_countries = world[sel_area,]
small_countries = world[sel_area, ]
```

Note that we created the intermediary `sel_object` to illustrate the process and demonstrate that only 7 countries are 'small' by this definition.
Note that we created the intermediary `sel_area`, a logical vector, for illustration purposes, and to show that only seven countries match our query.
A more concise command that omits the intermediary object generates the same result:

```{r}
small_countries = world[world$area_km2 < 10000,]
small_countries = world[world$area_km2 < 10000, ]
```

Another way to generate the same result is with the base R function `subset()`:
The base R function `subset()` provides yet another way to achieve the same result:

```{r, eval=FALSE}
small_countries = subset(world, area_km2 < 10000)
```

The `$` operator retrieves a variable by its name and returns a vector:
You can use the `$` operator to select a specific variable by its name. The result is a vector:

```{r, eval=FALSE}
world$name_long
```

<!-- , after the package has been loaded: [or - it is a part of tidyverse] -->
**dplyr** makes working with data frames easier and is compatible with `sf` objects.
The main **dplyr** functions that help with attribute subsetting are `select()`, `slice()`, `filter()` and `pull()`.

The `select()` function picks columns by name or position.
Base R functions are essential, and we recommend that you have a working knowledge of them.
However, **dplyr** often makes working with data frames easier.
Moreover, **dplyr** is usually much faster than base R since it makes use of C++ in the background.
This comes in especially handy when working with large data sets.
As a special bonus, **dplyr** is compatible with `sf` objects.
The main **dplyr** subsetting functions are `select()`, `slice()`, `filter()` and `pull()`.

The `select()` function selects columns by name or position.
For example, you could select only two columns, `name_long` and `pop`, with the following command:

```{r}
world1 = select(world, name_long, pop)
head(world1, n = 2)
```

This function allows a range of columns to be selected using the `:` operator:
`select()` also allows subsetting of a range of columns with the help of the `:` operator:

```{r, eval=FALSE}
# all columns between name_long and pop (inclusive)
world2 = select(world, name_long:pop)
head(world2, n = 2)
```

Specific columns can be omitted using the `-` operator:
Omit specific columns with the `-` operator:

```{r, eval=FALSE}
# all columns except subregion and area_km2
world3 = select(world, -subregion, -area_km2)
head(world3, n = 2)
```

`select()` can be also used to both subset and rename columns in a single line, for example:
Conveniently, `select()` lets you subset and rename columns at the same time, for example:

```{r}
world4 = select(world, name_long, population = pop)
head(world4, n = 2)
```

This is more concise than the base R equivalent (which saves the result as an object called `world5` to avoid overriding the `world` dataset created previously):
This is more concise than the base R equivalent:

```{r, eval=FALSE}
world5 = world[c("name_long", "pop")] # subset columns by name
names(world5)[3] = "population" # rename column manually
world5 = world[, c("name_long", "pop")] # subset columns by name
names(world5)[2] = "population" # rename column manually
```

The `select()` function works with a number of special functions that help with more complicated selection, such as `contains()`, `starts_with()`, `num_range()`.
More details could be find on the function help page - `?select`.
The `select()` function works with a number of special functions that help with more advanced subsetting operations such as `contains()`, `starts_with()` and `num_range()`.
Find out more on the function's help page: `?select`.
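
A brief sketch of two of these helpers (hedged; the matches shown in the comments assume the `world` column names used elsewhere in this chapter, such as `area_km2` and `name_long`):

```{r, eval=FALSE}
select(world, contains("area"))    # columns whose names contain "area", e.g. area_km2
select(world, starts_with("name")) # columns whose names start with "name", e.g. name_long
```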

`slice()` is the equivalent of `select()` but work for rows.
`slice()` is the row-equivalent of `select()`.
The following code chunk, for example, selects the 3^rd^ to 5^th^ rows:

```{r, eval=FALSE}
slice(world, 3:5)
```

`filter()` is **dplyr**'s equivalent of base R's `subset()` function.
It keeps only rows matching given criteria, e.g. only countries with a very high average life expectancy:
It keeps only rows matching given criteria, e.g., only countries with a very high average life expectancy:

```{r, eval=FALSE}
# only countries with a life expectancy larger than 82 years
@@ -185,34 +198,24 @@
knitr::kable(data_frame(Symbol = operators, Name = operators_exp))
<!-- add warning about = vs == -->
<!-- add info about combination of &, |, ! -->
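
A possible sketch for those two notes (hedged; it assumes "Asia" is among the `continent` values in `world`): `==` tests for equality, whereas a single `=` performs assignment, and conditions can be combined with `&` (and), `|` (or) and `!` (not).

```{r, eval=FALSE}
world[world$continent == "Asia" & world$lifeExp > 75, ] # both conditions must hold
world[world$continent == "Asia" | world$lifeExp > 75, ] # at least one condition must hold
world[!(world$continent == "Asia"), ]                   # negation: all non-Asian countries
# note: use `==` for comparison; a single `=` means assignment and does not test anything
```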

The *pipe* operator (` %>% `), which passes the output of one function into the first argument of the next function, is commonly used in **dplyr** data analysis workflows.
This works because the fundamental **dplyr** functions (or 'verbs', like `select()`) all take a data frame object in and spit a data frame object out.
Finally, we would like to introduce the special *pipe* operator (` %>% `) of the **magrittr** package.
The *pipe* operator feeds ('pipes forward') the output of one function into the first argument of the next function.
Combining many functions together with pipes is called *chaining* or *piping*.
The advantage over base R for complex data processing operations is that this approach prevents nested functions and is easy to read because there is a clear order and modularity to the work (a piped command can be commented out, for example).

The example below shows yet another way of creating the renamed `world` dataset, using the pipe operator:

```{r}
world7 = world %>%
select(name_long, continent)
```

Note that this can also be written without the pipe operator because, in the above code, the `world` object is simply 'piped' into the first argument of `select()`.
The equivalent **dplyr** code without the pipe operator is:
For example, let us first take the `world` dataset, then select the two columns named `name_long` and `continent`, and then return only the first five rows.

```{r}
world8 = select(world, name_long, continent)
world %>%
select(name_long, continent) %>%
slice(1:5)
```

`pull()` retrieves a single variable by name or position and returns a vector:

```{r, eval=FALSE}
world %>%
pull(name_long)
```
The pipe operator supports an intuitive data analysis workflow (first do this, then do that, then ...).
It also lets you read this workflow from left to right, and avoids hard-to-read nesting, i.e., having to read workflows from the inside out, as is commonly the case when using base R.
Another advantage over the nesting approach is that you can easily comment out certain parts of a pipe.
**dplyr** works especially well with the pipe operator because its fundamental functions (or 'verbs', like `select()`) expect a data frame object as input and also return one.^[If you want **dplyr** to return a vector, use `pull()`.]
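
For comparison, a sketch of the same selection written as nested calls, which has to be read from the inside out:

```{r, eval=FALSE}
slice(select(world, name_long, continent), 1:5)
```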

<!--
Collaborator Author:

I was unsure if the subsequent two pipe examples are really needed.

Collaborator:

Remove them then ; )

Collaborator Author:

I deleted the two pipe examples.

The pipe operator can be used for many data processing tasks with attribute data.

```{r}
# 1,000,000,000 could be expressed as 1e9 in scientific notation
world %>%
@@ -227,14 +230,15 @@
world$pop[is.na(world$pop)] = 0 # set NAs to 0
world_few_rows = world[world$pop > 1e9,]
```

The ` %>% ` operator works the best for combining many operations.
For example, we want to (1) rename the `name_long` column into a `name` column, (2) picks only `name`, `subregion` and `gdpPercap` and (3) subset countries from "Eastern Asia" with gross domestic product per capita larger than 30,000$:
Overall, the ` %>% ` operator works best for combining many operations.^[However, note that too many pipes can also make your code harder to read and reproduce.]
For example, we would like to (1) rename the `name_long` column to `name`, (2) select only the columns `name`, `subregion` and `gdpPercap`, and (3) keep only countries from "Eastern Asia" with a gross domestic product per capita larger than 30,000$:

```{r}
world %>%
select(name = name_long, subregion, gdpPercap) %>%
filter(subregion == "Eastern Asia", gdpPercap > 30000)
filter(subregion == "Eastern Asia" & gdpPercap > 30000)
```
-->

## Attribute data aggregation
