/
censusxy.Rmd
executable file
·216 lines (153 loc) · 14.9 KB
/
censusxy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "Using the censusxy Package"
author: "Branson Fox and Christopher Prener, Ph.D."
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{censusxy}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## Overview
The `censusxy` package is designed to provide easy access to the [U.S. Census Bureau Geocoding Tools](https://geocoding.geo.census.gov/geocoder/) in `R`.
### Motivation
There do not exist many packages for free or reproducible geocoding in the `R` environment. However, the Census Bureau Geocoding Tools allow for both unlimited free geocoding as well as an added level of reproducibility compared to commercial geocoders. Many geospatial workflows involve a large quantity of addresses, hence our core focus is on batch geocoding.
### Responsible Use
The U.S. Census Bureau makes their geocoding API available without any API key, and this package allows for virtually unlimited batch geocoding. Please use this package responsibly, as others will need use of this API for their research.
### Installation
### Installing censusxy
The easiest way to get `censusxy` is to install it from CRAN:
```r
install.packages("censusxy")
```
Alternatively, the development version of `censusxy` can be accessed from GitHub with `remotes`:
```r
# install.packages("remotes")
remotes::install_github("chris-prener/censusxy")
```
### Installing Suggested Dependencies
Since the package does not need `sf` for its basic functionality, it is a suggested dependency rather than a required one. However, many users will want to map these data as `sf` objects, and we therefore recommend users install `sf`. Windows and macOS users should be able to install `sf` without significant issues unless they are building from source. Linux users will need to install several open source spatial libraries to get `sf` itself up and running.
If you want to use these `sf`, you can either install it individually (faster) or install all of the suggested dependencies at once (slower, will also give you a number of other packages you may or may not want):
```r
## install sf only
install.packages("sf")
## install all suggested dependencies
install.packages("censusxy", dependencies = TRUE)
```
## Geocoding Considerations
### censusxy Performance
#### Server Availability
One of the biggest drawbacks of the Census Bureau's geocoders is their availability and speed. Users of `censusxy` have reported issues to us where `censusxy` appears not to work on their computers. Generally speaking, the issues are with servers on the Census Bureau's end. If you notice that the package does not seem to be working, take a break for a few hours and try again. You will hopefully find that your code, with no changes, executes correctly and returns geocoded data. If you notice continued issues, please do not hesitate to [open an issue for us to investigate](https://github.com/chris-prener/censusxy/issues/new/choose).
#### Processing Speed
Another performance issue with `censusxy` relates to speed. The issue, generally speaking, is not within the package itself but rather with external factors. These include your computer's processing power, the quality of your internet connection, and - most importantly - the speed that the Census Bureau's servers process your request. We have taken two steps to improve `censusxy`'s performance over using the Census Bureau's geocoding web interface. The first is that we extract unique addresses in your data and only send those for geocoding. Thus, a large geocoding job may shrink considerably if it contains repetitive addresses.
The second step we take is to break your data into "chunks" if you have more than 1,000 records. This is below the maximum number of rows that the Census Bureau accepts for geocoding. `censusxy` sends these as separate geocoding requests before putting your data back together when geocoding is completed. One of the biggest things you can do to improve the performance of `censusxy` is to utilize our *optional* parallel processing features. When parallel processing is used, these chunks of data can be sent out simultaneously for geocoding.
#### Data Quality
A third performance issue with `censusxy` relates to the quality of input data. Street addresses that are not formatted according to USPS standards may cause an entire "chunk" of data to be returned without being geocoded. If you notice many `NA` values for location, check a few with the single line geocoder:
```r
censusxy::cxy_single("3700 Lindell Blvd", "St. Louis", "MO", 63139)
```
If you get invalid data back, or an error like `The server, while working as a gateway to get a response needed to handle the request, got an invalid response`, there is something wrong with your specific address. One thing to watch for are non-ASCII characters, such as `Lindèll` in the example below:
```r
censusxy::cxy_single("3700 Lindèll Blvd", "St. Louis", "MO", 63139)
```
This will return the error described above or an `NA` for `cxy_status` if you are using `cxy_geocode()`.
If you get valid data back from the API with `cxy_single()`, it is more than likely that another address in the same "chunk" has a formatting problem that is causing errors. Pre-processing your data to eliminate problematic addresses (such as those that contain point of interest information or non-ASCII characters) can help eliminate this problem.
Another strategy is to break up your data into smaller chunks prior to sending it to be geocoded. Then, geocode these chunks individually to try and isolate the problematic addresses. For example, we can use the following `base` functions to create chunks of $n = 200$ observations:
```r
batches <- split(stl_homicides, (seq(nrow(stl_homicides))-1) %/% 200 )
```
These could then be passed individually to `cxy_geocode()` using `batches[[1]]` and so on.
### Picking a Suitable API
The Census Geocoder contains 4 primary functions, 3 for single address geocoding, and 1 for batch geocoding. For interactive use cases, a Shiny application for example, the single line geocoder is recommended. For large quantities of addresses, the batch endpoint is favorable. If your use case is locating coordinates within census geometries, only a single coordinate function is available for this task.
### Picking a Return Type
If you are interested in census geometries (composed of FIPS codes for state, county, tract and block), you should specify 'geographies' in the `return` argument. This also necessitates the use of a vintage.
### Consider the Benchmark and Vintage
Vintage is only important to consider if you would like Census geographies to accompany your coordinate data. It has no impact on geocoding coordinates (location). You can obtain a data frame of valid benchmarks and vintages with their respective functions. For vintages, you must supply the name or ID of the benchmark you have chosen.
```r
# return benchmarks
cxy_benchmarks()
# return vintages
cxy_vintages(benchmark = 'Public_AR_Current')
```
If you would like to return data including Census geographies, set `return` to `"geographies"`.
Setting a specific benchmark can also be used to reproduce geocoding results. Changes to the data the Census Bureau supplies for geocoding may mean that the specific coordinates returned for a given address are modified. Using the latest benchmark may mean that addresses geocoded earlier are returned at a different coordinate. Using benchmarks allows you replicate earlier geocodes.
If you are not concerned about Census geographies or being able to reproduce the exact coordinates for each supplied, the functions will default to the latest benchmark.
### Parallelization
This is an *optional* set of tools to take advantage of multiple threads if your computer has a multi-core processor. The `parallel` argument can be used to specify the number of cores you would like to dedicate to geocoding tasks. Once you execute the function, your batch geocoding request will be automatically distributed across your cores to return results faster. The function will not allow you to specify more cores than are available, and will instead default to the maximum number of available cores.
Since this is *optional*, there are additional dependencies that are not automatically installed. See the installation instructions above for details on ensuring your `R` environment is ready for parallel processing.
### Selection a Class and Output
There are two options for `class`, which governs the type of object `censusxy` returns. By default, a `"dataframe"` is returned. However, users may *optionally* set `class` to `"sf"`, which will return the results as an `sf` object. Doing so allows for an immediate preview of your data on a map and for these data to be exported in a geospatial format, like `.geojson` or `.shp`. However, returning an `sf` object will mean that `censusxy` can only return addresses successfully matched by the geocoder. A helpful message denoting how many rows were removed will print in the console.
Since returning an `sf` object is *optional*, there are additional dependencies that are not automatically installed. See the installation instructions above for details on ensuring your `R` environment is ready for parallel processing.
You may also specify `output` as `"simple"` or `"full"`. Simple returns only coordinates (and a GEOID if `return = "geographies"`) and this is suitable for most use cases. If you desire all of the raw output from the geocoder, please specify full instead.
### Setting a Timeout Parameter
The function contains an argument for timeout, which specifies how many minutes until the API query ends as an error. In this implementation, it is per 1000 addresses, not the whole batch size. It is set to default at 30 minutes, which should be appropriate for most internet speeds.
If a batch times out, the function will terminate, and you will lose any geocoding progress.
Be cautious that batches taking a long time may allow your computer to sleep, which may cause a batch to never return. Turning off sleep behavior for your operating system while geocoding is recommended.
## Usage
### Importing Data
If you plan on using the single address geocoding tools, your data do not need to be in any specific class. To use the batch geocoder, your data must be in a data.frame (or equivalent class). This package provides data on homicides in St Louis City between 2008-2018 as example data in two forms - a small table with 24 addresses (`stl_homicides_small`) and a larger table with more records (`stl_homicides`)
```r
# load package
library(censusxy)
# load sample data
df <- stl_homicides_small
```
### Parsing Addresses
For use of the batch API, your address data needs to be structured. Meaning, your data contains separate columns for street address, city, state and zip-code. You may find the [`postmastr`](https://github.com/slu-openGIS/postmastr) package useful for this task, though it remains early in its development life cycle. Only street address is mandatory, but omission of city, state or zip code drastically lowers the speed and accuracy of the batch geocoder.
### Batch Geocoding
In this example, we will use the included `stl_homicides_small` data, which we loaded above and assigned to the object `df`, show the full process for batch geocoding with `cxy_geocode()`:
```r
df_sf <- cxy_geocode(df, address = "street_address",
city = "city",
state = "state",
zip = "postal_code",
class = "dataframe")
```
Since we have selected `"dataframe"` for our `class`, all input addressed will be returned. If you use `class = "sf"` (and have `sf` installed), you will receive only matched addresses.
```r
df_sf <- cxy_geocode(df, address = "street_address",
city = "city",
state = "state",
zip = "postal_code",
class = "sf")
```
Though `sf` objects only returned matched addresses, one advantage of using this *optional* set of features is that output returned as an `sf` object can be previewed with a package like [`mapview`](https://cran.r-project.org/package=mapview):
```r
mapview::mapview(df_sf)
```
```{r exampleMap1, echo=FALSE, out.width = '100%'}
knitr::include_graphics("../man/figures/homicide_example.png")
```
Note that, like `sf` and the parallel processing dependencies, `mapview` is a suggested package and must be installed separately (or by using `install.packages("censusxy", dependencies = TRUE)`).
### Single Address Tools
If you have individual addresses you would like to geocode, there are two additional functions to be aware. `cxy_single()` can geocode a single address that has already been parsed into separate entries for `street`, `city`, `state`, and `zip`:
```r
cxy_single(street = "20 N Grand Blvd", city = "St. Louis", state = "MO", zip = 63103)
```
Alternatively, you can pass un-parsed addresses using `cxy_oneline()`. This example returns Census geography data along with the default location information:
```r
cxy_oneline(address = "3700 Lindell Blvd, St. Louis, MO 63108", return = "geographies", vintage = "Current_Current")
```
### Coordinate Pairs
Finally, if you know pairs of coordinates but would like to find out what Census geographies they fall within, you can use `cxy_geography()`:
```r
cxy_geography(lon = -90.23324, lat = 38.63593)
```
This function is useful for a small number of addresses, but returns a *significant* amount of data for each point location. If you have a large number of points and need a specific value like county or congressional district, using `tigris` to obtain those data and then `sf::st_intersection()` to perform a spatial join will be a more effective solution that performs this task faster and results in clearer output.
### Iteration
For a handful of addresses, you may want to iterate using these functions. Two examples using base R are provided here:
```r
addresses <- c("20 N Grand Blvd, St. Louis MO 63103", "3700 Lindell Blvd, St. Louis, MO 63108")
# with a loop
geocodes <- vector("list", length = 2)
for (i in seq_along(addresses)){
geocodes[[i]] <- cxy_oneline(addresses[i])
}
# with apply
geocodes <- lapply(addresses, cxy_oneline)
```
## Getting Help
* If you are new to `R` itself, welcome! Hadley Wickham's [*R for Data Science*](https://r4ds.had.co.nz) is an excellent way to get started with data manipulation in the tidyverse, which `censusxy` is designed to integrate seamlessly with.
* If you are new to spatial analysis in `R`, we strongly encourage you check out the excellent new [*Geocomputation in R*](https://geocompr.robinlovelace.net) by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow.
* If you have questions about using `censusxy`, you are encouraged to use the [RStudio Community forums](https://community.rstudio.com). Please create a [`reprex`](https://reprex.tidyverse.org/) before posting. Feel free to tag Chris (`@chris.prener`) in any posts about `censusxy`.
* If you think you've found a bug, please create a [`reprex`](https://reprex.tidyverse.org/index.html) and then open an issue on [GitHub](https://github.com/chris-prener/censusxy/issues/new/choose).