reviewing chapter_3 #70

Merged
merged 7 commits into from Aug 15, 2017

Conversation

jannes-m
Collaborator

I have also started to review chapter_3. For more information, see specific comments.

The subsetting functions `[` from base R and `filter()` from the **tidyverse**, for example, can also be used for spatial subsetting: the skills are cross-transferable.
This chapter therefore provides the foundation for Chapter \@ref(spatial-data-operations), in terms of structure and input data.
The subsetting functions `[` from base R and `filter()` from the **tidyverse**, for example, are also applicable to spatial data: the skills are cross-transferable.
This chapter, therefore, provides the foundation for Chapter \@ref(spatial-data-operations) in terms of structure and input data.
Collaborator Author

What exactly do you mean by foundation in terms of structure?

Collaborator

I mean the structure of c4 mirrors that of c3. Does this sound any better?

This chapter therefore provides the basis for Chapter \@ref(spatial-data-operations).

You could also say something about it mirroring the structure if you can find the right form of words.

Collaborator Author

Ok, I see. I have just clarified this.
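
To illustrate the "cross-transferable" point made in the revised sentence, here is a minimal sketch showing that `[` and `filter()` behave the same way on an `sf` object as on a plain data frame. The object `pts` and its columns are invented for illustration and do not come from the chapter.

```r
library(sf)
library(dplyr)

# Three hypothetical point features with two attribute columns
pts = st_sf(
  name = c("a", "b", "c"),
  value = c(1, 5, 10),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

# Base R subsetting: keep rows where value exceeds 2
base_subset = pts[pts$value > 2, ]

# dplyr equivalent of the same subset
dplyr_subset = filter(pts, value > 2)

# Both results are still sf objects and keep their geometry column
class(base_subset)
class(dplyr_subset)
```

Both approaches should return the same two features, with the geometry column carried along.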

The reason for this is that simple features have their own class, which behave simultaneously as geographic data objects (e.g. plotting as maps) and square tables (e.g. with attribute columns referred to with the `$` operator).
As outlined in Chapter \@ref(spatial-class), **sf** provided the support for simple features in R.
Additionally, **sf** added methods to generic R functions such as `plot()` and `summary()` to work with simple features. To convince yourself run for example `methods("summary")` and/or `methods("plot")`.
<!--The reason for this is that simple features have their own class, which behave simultaneously as geographic data objects (e.g., plotting as maps) and square tables (e.g., with attribute columns referred to with the `$` operator).-->
Collaborator Author

@jannes-m Aug 10, 2017

I am not sure what this means, can you please clarify (The reason for this is that simple features...)?

Collaborator

I think that commented bit can safely be deleted: we discuss the fact that sf objects are also data frames at some length. Not sure "To convince yourself" is the best form of words - maybe this would be a more appropriate sentence to replace lines 35:37:

As outlined in Chapter \@ref(spatial-class), **sf** provided support for simple features in R and made them work with generic R functions such as `plot()` and `summary()` (as can be seen by executing `methods("summary")` and/or `methods("plot")`).

Collaborator Author

Ok, I deleted the commented part and also adopted your wording. Thanks.
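
For readers of this thread, a short sketch of the check mentioned in the adopted wording. It assumes only that **sf** is installed; the filtering with `grep()` is added here for convenience and is not part of the chapter text.

```r
library(sf)

# List the S3 methods currently registered for two generics
plot_methods = as.character(methods("plot"))
summary_methods = as.character(methods("summary"))

# Keep only the methods whose class name starts with "sf" (sf, sfc, sfg)
grep("\\.sf", plot_methods, value = TRUE)
grep("\\.sf", summary_methods, value = TRUE)
```

Loading **sf** registers these methods, which is why `plot()` and `summary()` can be called directly on simple feature objects.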

section.^[
Unlike objects of class `Spatial` defined by the **sp** package, `sf` objects are also compatible with **dplyr** and **data.table** packages, which provide fast and powerful functions for data manipulation (see [Section 6.7](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table) of @gillespie_efficient_2016).
Unlike objects of class `Spatial` of the **sp** package, `sf` objects are also compatible with the packages **dplyr** and **data.table** (at least in theory). Both packages provide fast and powerful functions for data manipulation (see [Section 6.7](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table) of @gillespie_efficient_2016).
Collaborator Author

I am not sure whether sf objects really work well with data.table; I guess they sometimes do and sometimes do not. Edzer also said at the useR! conference that if somebody would like to see sf working with data.table, he is happy to accept corresponding pull requests (he did the same with the tidyverse).

Collaborator

Good point - maybe just delete the bit about data.table: there is no point mentioning it as we do not use it in the book and it could cause confusion. Suggest:

Unlike objects of class `Spatial` of the **sp** package, `sf` objects are also compatible with the **tidyverse** packages **dplyr** and **ggplot2**. The former provides fast and powerful functions for data manipulation (see [Section 6.7](https://csgillespie.github.io/efficientR/data-carpentry.html#data-processing-with-data.table) of @gillespie_efficient_2016) and the latter provides powerful plotting capabilities.

Collaborator Author

Perfect. I have incorporated that. Thanks again.


<!--
Collaborator Author

I was unsure if the subsequent two pipe examples are really needed.

Collaborator

Remove them then ; )

Collaborator Author

I deleted the two pipe examples.

@Robinlovelace
Collaborator

Great you've started this - please polish anything else in c1 first though so we can merge that and reduce the number of PRs and increase my headspace.


## Attribute data aggregation

<!-- https://github.com/ropenscilabs/skimr ?? -->

As demonstrated in chapter \@ref(spatial-class), `summary()` provides a quick summary of the spatial and non-spatial components of spatial objects.
Enter the following command to for an overview of the `world` object and all its variables (result not shown):
<!-- As demonstrated in chapter \@ref(spatial-class), `summary()` provides a quick summary of the spatial and non-spatial components of spatial objects.
Collaborator Author

The comparison is a bit unfair. The summary function is more generic in nature and can be applied to a multitude of classes for summary statistics.
dplyr::summarize is basically an aggregation function; hence, a comparison with tapply, aggregate or by would be fairer. Ok, you can also use dplyr::summarize for summary statistics, but you can do the same with aggregate, etc. So I suggest either dropping the summary comparison or comparing with a base R aggregation function.

Collaborator

Agreed.

Collaborator Author

Ok, then I will delete the summary part. And since we explain aggregate later on, there is no need to mention it here, ok?

@@ -275,15 +255,15 @@ world_continents = world %>%
world_continents
```

`sf` objects are well-integrated with the **tidyverse**, as illustrated by the fact that the aggregated objects preserve the geometry of the original `world` object.
`sf` objects are well-integrated with the **tidyverse**, as illustrated by the fact that the aggregated objects preserve the geometry of the original `world` object.^[Such a spatial aggregation of polygon data is known as "to dissolve polygons" in the GIS world.]
Collaborator Author

Any idea why some borders are still preserved? The same happens if using a GIS in the background:

```r
test <- run_qgis("saga:polygondissolvebyattribute", POLYGONS = world, FIELD_1 = "continent",
                  DISSOLVED = "out.shp", BND_KEEP = "False", load_output = TRUE)
```

Hence, I guess the input geometry is somewhat unclean...

Collaborator Author

@jannes-m Aug 13, 2017

And though we perform attribute aggregation here, it is also a spatial operation (dissolving). Borrowing from this blog:

```r
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
# add an arbitrary grouping variable
nc_groups <- nc %>% 
  mutate(group = sample(LETTERS[1:3], nrow(.), replace = TRUE))
# average area by group
nc_mean_area <- nc_groups %>% 
  group_by(group) %>% 
  summarise(area_mean = mean(AREA))
# plot
ggplot(nc_mean_area) +
  geom_sf(aes(fill = area_mean)) +
  scale_fill_distiller("Area", palette = "Greens") +
  ggtitle("Mean area by group") +
  theme_bw()
```

Notice that in addition to the attribute data being aggregated, the geometries have been aggregated as well. All geometries in each group have been combined together and the boundaries between adjacent geometries dissolved. Internally, the function st_union() is used to achieve this.

So I suggest either pointing this out clearly or moving the entire aggregation subsection to chapter 4.

Collaborator

Yes that's a good plan - it's important to note that it does a spatial data operation 'under the hood' which is clever.

Collaborator Author

Ok, so what about adding this sentence:
What is more, under the hood sf is already doing a spatial aggregation of polygon data, which is known as 'dissolving polygons' in the GIS world - an operation we will explain in more detail in the next chapter.
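
The dissolving behaviour discussed in this thread can be checked with a small, deterministic sketch. It uses the `nc` dataset shipped with **sf**; the grouping variable `size` is an invented stand-in for `continent` and is not part of the chapter.

```r
library(sf)
library(dplyr)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Hypothetical grouping variable: counties above or below the median area
nc_groups = mutate(nc, size = ifelse(AREA > median(AREA), "large", "small"))

# Attribute aggregation; the geometries of each group are unioned as a side effect
nc_agg = nc_groups %>%
  group_by(size) %>%
  summarise(area_mean = mean(AREA))

nrow(nc)      # number of input features (counties)
nrow(nc_agg)  # one feature per group after aggregation
class(nc_agg)
```

The result is still an `sf` object with one row per group, which suggests that the geometries were combined alongside the attributes.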

Collaborator

@Robinlovelace left a comment

This is a great set of changes @jannes-m - thanks for the attention-to-detail. I suggest that after a few changes, based on my comments below, we merge this PR later today. Let me know when you think it's 'done' (as with c1 we can always revisit contents).

@@ -2,78 +2,88 @@

## Prerequisites {-}

- This chapter requires **tidyverse** and **sf**:
- This chapter requires the packages **tidyverse** and **sf**:
Collaborator

Great to be explicit, thanks for clarifying that for readers.


```{r, message=FALSE}
library(sf)
library(tidyverse)
```

- You must have loaded the `world` and `worldbank_df` data which are loaded automatically by the **spData** package:
- We will also make use of the `world` and `worldbank_df` data sets. Note that loading the **spData** package automatically attaches these data sets to your global environment:
Collaborator

Again, great attention to detail in the description, thanks for that.


```{r, results='hide'}
library(spData)
```

## Introduction

Attribute data is non-spatial information associated with geographic data.
In the context of simple features, introduced in the previous chapter, this means a data frame with a column for each variable and one row per geographic feature stored in the `geom` list-column of `sf` objects.
Attribute data is non-spatial information, e.g., the name of a bus station, associated with geographic data, e.g. the coordinate of this bus station.
Collaborator

Not keen on using abbreviations such as i.e. or e.g. mid-text, especially when they are surrounded by enclosing commas - it does not flow well. I propose the line is changed to the following:

Attribute data is non-spatial information associated with geographic (geometry) data.
A bus station, for example, could be represented by a field containing its name (attribute data), associated with its latitude and longitude position (geometry data).

Collaborator Author

Ok, I'll remember that! Changed as requested.

Attribute data is non-spatial information associated with geographic data.
In the context of simple features, introduced in the previous chapter, this means a data frame with a column for each variable and one row per geographic feature stored in the `geom` list-column of `sf` objects.
Attribute data is non-spatial information, e.g., the name of a bus station, associated with geographic data, e.g. the coordinate of this bus station.
Simple features (see previous chapter) store attribute data in a dataframe with each column corresponding to a variable and each row to one observation, e.g., a bus station.
Collaborator

2 e.g.s in quick succession! I suggest a small change:

Simple features, described in the previous chapter, store attribute data in a data frame, with each column corresponding to a variable (such as 'name') and each row to one observation (such as an individual bus station).

Collaborator Author

Thanks, incorporated that.

In the context of simple features, introduced in the previous chapter, this means a data frame with a column for each variable and one row per geographic feature stored in the `geom` list-column of `sf` objects.
Attribute data is non-spatial information, e.g., the name of a bus station, associated with geographic data, e.g. the coordinate of this bus station.
Simple features (see previous chapter) store attribute data in a dataframe with each column corresponding to a variable and each row to one observation, e.g., a bus station.
In addition, a special column, mostly named `geom` or `geometry`, stores the spatial information of an **sf**-object, e.g., the coordinate of the bus station.
Collaborator

A third e.g.! Suggestion (we've already said that the geometry contains the coordinates but it should link to the next sentence):

In addition, a special column, usually named `geom` or `geometry`, stores the geometry data of **sf** objects.
For a bus station, that would likely be a single point representing its centroid.

Collaborator Author

Ok, I will make sure to avoid using e.g. and i.e. :-). Changed that.
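
The bus station wording agreed above can be made concrete with a minimal sketch. The station names and coordinates are invented for illustration; only the **sf** constructors `st_sf()`, `st_sfc()` and `st_point()` are assumed.

```r
library(sf)

# Two hypothetical bus stations: one attribute column plus point geometries
bus_stations = st_sf(
  name = c("Central", "Harbour"),
  geometry = st_sfc(
    st_point(c(-1.55, 53.80)),
    st_point(c(-1.54, 53.79)),
    crs = 4326
  )
)

class(bus_stations)               # an sf object that is also a data frame
names(bus_stations)               # attribute column(s) plus the geometry column
attr(bus_stations, "sf_column")   # name of the special geometry column
nrow(bus_stations)                # one row per station
```

Each row is one observation (a station), `name` is attribute data, and the geometry list-column stores the point for each station.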

The trusty `data.frame` (and extensions to it such as the `tibble` class used in the tidyverse) is a workhorse for data analysis in R.
Extending this system to work with spatial data has many advantages,
meaning that all the accumulated know-how in the R community for handling data frames to be applied to geographic data which contain attributes.
The reliable `data.frame` (and modifications of it such as the `tibble` class used in the tidyverse) is the basis for data analysis in R.
Collaborator

I would say "modifications to it" rather than "modifications of it" as the class is modified by an external force (the programmer). Otherwise I think this adjustment to the text is an improvement, thanks for that.

Collaborator Author

Good point!

meaning that all the accumulated know-how in the R community for handling data frames to be applied to geographic data which contain attributes.
The reliable `data.frame` (and modifications of it such as the `tibble` class used in the tidyverse) is the basis for data analysis in R.
Extending this system to work with spatial data has many advantages.
The most important one is that the accumulated know-how in the R community for handling data frames is transferable to geographic attribute data.
Collaborator

I would replace "is transferable" with "can be transferred" because it's still contingent on knowing how to program with data frames, hence the importance of learning about attribute data operations and reading this chapter.

Collaborator Author

Another good point! This was just my habit of avoiding the passive voice. Good example when that's no good...

Before proceeding to perform various attribute operations of a dataset, it is worth taking time to think about its basic parameters.
In this case, the `world` object contains 10 non-geographical columns (and one geometry list-column) with data for almost 200 countries.
This can be be checked using base R functions for working with tabular data such as `nrow()` and `ncol()`:
Before proceeding to perform various attribute operations on a dataset, it is advisable to explore its structure.
Collaborator

Suggestion (more concise, informal and hopefully friendly):

Before proceeding to perform various attribute operations on a dataset, let's explore its structure.

Collaborator Author

changed that.


```{r, eval=FALSE}
world[1:6,] # subset rows by position
world[1:6, ] # subset rows by position
Collaborator

Great formatting fix. I only realised recently that this is good practice.

Collaborator Author

Glad you like it. And it's good that we agree on a consistent coding style. Though again, I am always happy to adopt yours as well, such as `=` instead of `<-`. Consistency is the important thing.

@@ -554,10 +536,11 @@ world %>%
```

## Removing spatial information
<!-- Shouln't that be part of chapter 2-->
Collaborator Author

Shouldn't removing spatial information be part of chapter 2?

Collaborator

Possibly.

Collaborator Author

Since we are dealing with attribute operations in chapter 3, removing spatial information (which is a column) is very well placed here. I made this comment when I was in a rush (not a good idea) and did so because in the back of my mind I had that there was something on the geometry column in chapter 2. But we can just put a reference there to chapter 3.
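
As a sketch of why removing spatial information fits among the attribute operations: dropping the geometry amounts to removing one special column. The object `pts` is invented for illustration. `st_set_geometry(x, NULL)` and the newer `st_drop_geometry()` are both **sf** functions; which one the chapter section uses is not shown in this diff.

```r
library(sf)

# Hypothetical sf object with one attribute column and a geometry column
pts = st_sf(
  name = c("a", "b"),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)))
)

# Two ways of removing the geometry column
pts_df1 = st_set_geometry(pts, NULL)
pts_df2 = st_drop_geometry(pts)

class(pts_df1)  # no longer an sf object
names(pts_df1)  # only the attribute column remains
```

Both calls should return a plain data frame containing only the attribute data.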

Collaborator

OK great, cheers for the feedback.

A new `sf` object will be a result of these joins.
However, the reverse order is also possible and will result in a `data.frame` object.
Most of the following join examples will have a `sf` object as the first argument and a `data.frame` object as the second argument which results in a new `sf` object.
However, the reverse order is also possible and will give you back a `data.frame` object.
This is mostly beyond the scope of this book, but we encourage you to try it.
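
A minimal sketch of the argument-order effect described in the paragraph above. The objects `countries` and `indicators` and the key column `iso` are invented stand-ins for `world` and `worldbank_df`, so that the example runs without **spData**.

```r
library(sf)
library(dplyr)

# Hypothetical sf object (stand-in for world)
countries = st_sf(
  iso = c("AA", "BB"),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)))
)

# Hypothetical attribute table (stand-in for worldbank_df)
indicators = data.frame(iso = c("AA", "BB"), hdi = c(0.7, 0.8))

# sf object first: the result keeps the sf class
sf_first = left_join(countries, indicators, by = "iso")

# data.frame first: the result is a data frame that merely contains a geometry column
df_first = left_join(indicators, countries, by = "iso")

class(sf_first)
class(df_first)
```

In the second case the geometry column is still present, but the object would need to be converted back (for example with `st_as_sf()`) before spatial methods apply to it.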

### Left joins

Collaborator Author

@jannes-m Aug 14, 2017

One could think about presenting just the two most important join types, the inner and left joins (supported by st_join), and leaving the rest as an exercise for the reader, mentioning again the join chapter in the R for Data Science book.

Collaborator

Yes I think that is a good plan - the inner and left are indeed the ones I use most. I think that would be a great exercise for the reader. @Nowosad sound like a plan?

Member

Yes. It makes perfect sense. I'm going to adjust it after the raster part in the second chapter is completed.

@jannes-m
Collaborator Author

Yes, it would be good to merge the PR to make sure that our branches do not diverge that much. I can still open a new PR when I am adding the raster stuff.

Merge remote-tracking branch 'upstream/master' into chapter_3

# Conflicts:
#	03-attribute-operations.Rmd
@Robinlovelace Robinlovelace merged commit dc12197 into geocompx:master Aug 15, 2017

3 participants