-
Notifications
You must be signed in to change notification settings - Fork 0
/
doc.Rmd
422 lines (302 loc) · 31.3 KB
/
doc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
---
title: "Approximate Geocoding"
subtitle: "Fuzzy-matched record linkage of geographic point data"
author:
- "J. Andrew Harris"
- "Cole Tanigawa-Lau"
# abstract: Social scientists have rapidly seized on the research potential of GIS data. However, the analyses that spatial data afford almost always require some amount of record linkage to join quantities of interest to the relevant spatial covariates. In locations with a still-nascent spatial data industry, researchers' typical problems managing messy data are compounded with the difficulties of working with imprecise or unstandardized GIS files. We outline methods of record linkage specialized to the context of spatial data and subsequently used to assign geographic coordinates to all Kenyan polling stations from 1997 to 2013. Through a generalizable example of this automated geocoding process, we provide what we hope to be a helpful guide for rigorous, systematic record linkage in the world of messy spatial data.
keywords: "geocoding, fuzzy matching, record linkage"
bibliography: references.bib
nocite: |
@sp, @maptools, @rgeos, @rgdal, @raster, @spatstat, @parallel, @matrixStats, @stringdist, @stringr, @purrr, @dplyr
output:
html_document:
toc: true
toc_float: true
number_sections: true
theme: "flatly"
---
<style>
body {
text-align: justify}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Motivation
Vast improvements in processing power have opened the potential of spatial data analysis to the average social science researcher, laying the foundations for empirical work that has incorporated increasingly complex analyses of spatial relationships. Both the public and private sectors in many highly developed countries have advanced their spatial data capabilities in parallel with technological innovations, producing high-quality GIS products and services to meet the rising demand. In the political science literature, these data have yielded numerous high-powered, micro-level studies of behavior and political phenomena.
Many developing countries do not have such robust support for researchers hoping to conduct spatial analyses. Development institutions have produced high-resolution raster interpolations on a near-global scale for quantities like population density, literacy, and mortality; however, key political covariates exist only at much higher levels of aggregation. Government-issued shapefiles for administrative boundaries tend to be of much lower quality and are subject to change over time, with or without a formal delimitation procedures. Accurate geographic coordinates are simply unavailable for many types of point data.
The dearth of readily available, high-quality spatial data in the developing world, we believe, dramatically increases the relative demand for such a product. These data are not wholly inaccessible. With experience admittedly limited to Kenya, we would conjecture that spatial data in fact do exist for many developing countries. The problem is that these data are distributed sparsely across various people, departments, and organizations to the extent that significant time and personal effort are required to collect the data files even before any matching processes begin. As long as researchers cannot simply download these data in prepackaged form, though, they remain an untapped resource for those willing to put in the hours required to link the mass of geocoded points, available from an inconvenient number of sources, to the lists of schools, health clinics, markets, administrative offices, and polling centers that social scientists hope to study.
The remainder of this document discusses in detail our process of matching location names to geographic points.
<!-- [Background] relates our methods to standardized strategies of record linkage. -->
[Example Data] describes the sample of polling stations available to download and the various points and polygons required for the linkage process. [Geocoding] explains the techniques we've found helpful in our matching method.
[Imputing points] discusses our process of stochastically imputing point coordinates for polling stations we could not accurately geocode.
We assume of the reader an intermediate-level understanding of the `R` programming langauge and basic familiarity with the S4 spatial data classes defined in the `sp` package [@base;@sp]. Some processes in this document are run in parallel on six cores. All operations are performed on a server running R version 3.4.0 (64-bit) under Ubuntu 14. Reproducing the results on a machine with less than 32GB of RAM or using the 32-bit version of R may require adjusting the code.
<!-- [^gmaps]: As an example, we suggest using Google Maps, an excellent geocoding tool, to search for a point in the United States of America using very little information (e.g., ). Searching for a fairly comparable point in Kenya while supplying much more data (e.g., ). -->
<!-- # Background -->
<!-- ## Literature on record linkage -->
<!-- Christen [@Christen2012] offers one of the very few contemporary surveys of the record linkage literature, describing complex recent advances in generalized matching techniques . . . -->
<!-- ## Applications in the social sciences -->
# Example Data
We demonstrate our methods using samples of the polling stations file and a set of geocoded points. We also rely on polygon shapefiles detailing several levels of Kenyan administrative boundaries at different points in time. Finally, we use a 2010 population density raster for Kenya from [WorldPop](www.worldpop.org) [@Linard2012]. All of these data are available to download from the [GitHub repository](github.com/coletl/geocode).
```{r packages, message=FALSE, echo=TRUE}
# Load packages
library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(raster)
library(spatstat)
library(parallel)
library(matrixStats)
library(stringdist)
library(stringr)
library(purrr)
library(dplyr)
# Load user-defined functions
source("code/functions.R")
# Set number of cores for parallel processing
cores <- 6
```
## Points
We use two sets of point-specific data. First, the polling stations file, `data/ps_samp.rds`, is a sample of polling stations scraped from voter rolls and electoral results. Barring linkage errors, we observe all polling stations active in the 1997, 2002, 2007, and 2013 elections in Kenya, as well as the 2005 and 2010 constitutional referenda. The data include not only the name of the active polling station in that year (e.g., “MURKISIAN PRIMARY SCHOOL”) but also important information on what administrative units govern the polling station. These regionally identifying variables allow us to index to a geographic area, or “block”, within which we roughly restrict our searches. (More on this below.)
Second, we have a mass of geocoded _match points_ in Kenya obtained from various local, foreign, public, and private sources. A sample of these is stored in `data/mp_samp.rds`. The accuracy of the coordinates and the consistency of formatting in the name strings vary by source. Each match point represents a potential location for a given polling station, but the match points file is, in reality, neither complete nor unambiguous. Therefore, while the procedure outlined here links these two files and yields a version of the polling stations file with geographic coordinates attached to each record, some of these coordinates cannot be matched to and must be imputed instead.
```{r load_pts, echo=TRUE, message=FALSE, warning=FALSE}
# Read points data
ps <- readRDS("data/ps_samp.rds")
mp <- readRDS("data/mp_samp.rds")
```
## Polygons
The polygon files included with this document represent several administrative levels. Redistricting and revisions in data quality have produced files that record changes in boundaries over time. In roughly decreasing order of scale, the units are as follows:
1. pre-2010 (or "old") constituencies,
2. post-2010 constituencies,
3. post-2010 constituency assembly wards ("wards" or "CAWs"),
4. pre-2010 wards (or "old wards"),
5. locations,
6. sublocations.
```{r load_polys, echo=TRUE, message=FALSE, warning=FALSE}
# Read polygons data
const_old <- readRDS("data/const_old.rds")
const13 <- readRDS("data/const13.rds")
caw <- readRDS("data/caw.rds")
caw_old <- readRDS("data/caw_old.rds")
sl <- readRDS("data/sl.rds")
```
Other regionally identifying information for polling stations, such as county, division, and province, are sometimes available, but they denote units that are too large to be helpful in narrowing down our searches.
Because the boundaries of these geographic units have changed over time, and because the Kenyan government has not always provided these data at a high quality, we cannot perfectly and consistently map polling stations to units like the ward, location, or sublocation. This difficulty is particularly pronounced along the borders of administrative units, where anecdotal evidence informs us that what the central Kenyan government may understand and treat as dividing lines are not as clear-cut to chiefs and other local administrators on the ground. Therefore, the borders delineated by the polygon shapefiles can be treated as only soft boundaries on point locations.
## Assessing data quality
Understanding the quality of the polygon shapefiles informs us, if roughly, of how much error we should expect at the borders of administrative units. We've found two attributes of shapefiles to be most informative with only a quick study of the data: boundary detail and border alignment.
### Boundary detail
First, the level of detail provided at the boundaries of shapefile features indicates the amount of time, money, and effort put into creating the data. Building shapefiles that accurately represent the geography on the ground is labor intensive. Shortcuts, just like lower-resolution raster files, produce unrealistically rectangular boundaries where the land's natural cuts and curves were too costly to map. These differences are most conspicuous along shorelines.
Below, we plot the overlay of three constituencies bordering Lake Victoria as represented in two different shapefiles.
```{r bound_plot, message=FALSE, warning=FALSE, echo=FALSE}
ex_const13 <- gUnaryUnion(const13[const13$const %in% c("MBITA", "SUBA", "NYATIKE"), ])
ex_sl <- gUnaryUnion(sl[ex_const13, ])
plot(ex_const13, col = "azure2")
plot(ex_sl, add = TRUE, col = "coral1")
```
Such exploratory work is most easily performed in an interactive GUI like QGIS or ArcGIS, but even this simple plot demonstrates the vast differences in detail offered by the two files: a clear coastline, rather than blockish figures, and distinct islands.
### Border alignment
A second indicator of a shapefile's quality can be seen at points where two features meet. From a distance, the two constituencies plotted below appear to be drawn somewhat roughly: suspiciously straight lines mark borders.
```{r border_align}
plot(const13[const13$constid %in% 239:240, ], col = "azure2")
```
Again, an interactive GUI might be helpful here to zoom in on the boundary between these to constituencies. However, we can make do by unioning the two features together.
```{r border_align2}
plot(gUnaryUnion(const13[const13$constid %in% 239:240, ]))
```
Despite converting the object into a single feature, we still observe a partial line delimiting one constituency from the other. That line is the result of a gap between the two features, a flaw in the shapefile.
### A note on coordinate reference systems{-}
Whenever possible, we use UTM zone 37S as our coordinate reference system. Working with a projected CRS allows us to use functions like `gBuffer()` seamlessly, defining buffer widths in easily understood units like meters. If you are working with a relatively small area (e.g., a single medium-sized country), we would recommend picking an appropriate projected CRS early in the data cleaning process.
# Geocoding
We now outline our methods of finding the closest geographic match for polling stations in Kenya. It is worth noting now that our objective is neither to find the closest string match nor to find as many schools or health clinics as possible. Rather, we aim to minimize geographic error, even at the cost of finding what is certainly the wrong point.
## Preprocessing
It would be difficult to overstate the importance of preprocessing. Time spent developing the consistency of the data early in the matching procedure can lead to enormous gains in accuracy down the road. Perhaps more importantly, this phase develops a familiarity with the data's idiosyncracies that becomes invaluable when tailoring matching strategies to a specific data set or goal. By its nature, however, preprocessing requires techniques that may vary widely depending on whatever quirks of the unmatched data require standardization. We therefore begin with data that has already gone through some cleaning. We describe only a few preprocessing steps that are helpful in the most general cases, omitting any discussion of the more idiosyncratic adjustments we made.
### Truncating strings
<!-- Looking at a small subset of the polling station names, we see that many of the polling stations are located at schools. -->
<!-- ```{r print_ps_samp} -->
<!-- set.seed(10) -->
<!-- sample_n(ps@data, 5) %>% dplyr::select(starts_with("name")) -->
<!-- ``` -->
Polling stations were sometimes recorded in the pattern `ALPHA PR. SCH`, but variations including `ALPHA PRY SCH`, `ALPHA PRIMARY SCHOOL`, and `ALPHA PR.SCH` abound. This lack of standardization is common for all types of polling station locations. One approach would be to standardize the abbreviation of `PRIMARY` and `SCHOOL` by replacing all permutations with a single string. However, the [fuzzy-matching algorithm](https://en.wikipedia.org/wiki/Jaro–Winkler_distance) we use not only inspects differences between strings but also normalizes that edit distance by the number of characters in each string. In the example below, we return two different string distance scores given inputs of essentially the same information.
```{r, jw_trunc}
stringdist("ALPHA PRIMARY SCHOOL", "BETA PRIMARY SCHOOL", method = "jw", p = 0.15)
stringdist("ALPHA PR. SCH", "BETA PR. SCH", method = "jw", p = 0.15)
```
For our purposes, it is more important to match polling stations to the correct geographic area, rather than matching primary schools strictly to primary schools.[^alpha_beta] We can therefore remove the nonessential information and store it separately. We do this by cataloging the most common types of polling stations, then truncating the strings to separate these labels from the information that identifies the town, village, or area in which the polling station is located. The result is a truncated polling station name (e.g., `ALPHA`) with classifying information (e.g., `SCH` and `PRI`) in separate columns. Conveniently, in Kenya, substrings in the `ALPHA` position almost always refer to a village or other small locality. Thus, when calculating the similarity of two strings (e.g., `ALPHA` and `BETA`), we concentrate on regionally identifying information and concentrate on geographic accuracy.
[^alpha_beta]: When geocoding `ALPHA PRIMARY SCHOOL`, we would consider `ALPHA SECONDARY SCHOOL` to be a much better match than `BETA PRIMARY SCHOOL`. We are most intent on assigning the correct geographic location (`ALPHA`), not the corrrect school type.
Classifying data for each polling station-year are stored in election-specific `type` and `subtype` columns. These columns align temporally with the election-specific polling station names from which these data were extracted. For example, a polling station active in 2013 will have columns for its truncated name (`ALPHA`), its type (`SCH`), and its subtype (`PRI`); separate columns will hold analogous information for any prior elections during which this polling station was active.
### Cleaning strings
To improve matching further, we clean the strings of both our point (polling station) and polygon (administrative boundary) data. Most of the cleaning will likely be idiosyncratic to the locale of the data, like correcting for inconsistencies in transliteration, pronunciation, or abbreviation.
One technique we've found particularly helpful is to order the elements of each string. This corrects the particularly troublesome problem of some strings following the pattern `SAMBURU EAST`, while the same constituency may be recorded in a second data set as `EAST SAMBURU`. The function `order_substr()`, defined in `code/functions.R`, splits each string of a character vector at any commas, whitespace, forward slashes, and hyphens.[^order_substr] The output is a vector with those substrings ordered alphabetically and collapsed with a single space. Alternative regular expressions for splitting can be included in the `split` parameter; other strings may be used to separate the alphabetized substrings via the `collapse` parameter.
[^order_substr]: For an actively maintained version of this function, please see [this Gist](https://gist.github.com/coletl/cac6e9a18485199e8d368101160ecff2).
```{r order_substr}
tuna <- c("tuna,skipjack", "tuna , bluefin", "yellow-fin - tuna", "tuna, albacore")
order_substr(tuna)
colors <- c("green/red", "yellow / blue", "orange purple")
order_substr(colors, collapse = "-", reverse = TRUE)
```
[^order_substr]: Cleaning with the `order_substr()` function works poorly when substrings are inconsistently separated by the split pattern. For example, the vector of tuna species above hyphenates yellow-fin but leaves bluefin as a single word. Therefore, with the default split patterns, `order_substr()` splits "yellow-fin" prior to alphabetizing the string. (In this particular case, a good fix may be to adjust the regex to split at hyphens only when they are surrounded by whitespace.)
## Linking Polling Stations to Match Points
We define a function for each of several matching steps. These steps differ primarily in terms of which lower-level polygons they use (e.g., sublocation, old ward, new ward). To cut down on processing time, we run these functions on multiple cores via `parallel::mclapply()`.[^nb_parallel]
[^nb_parallel]: If you are running this code on a machine with less than 32GB of RAM, we suggest decreasing the `mc.cores` parameter for the matching processes.
Each matching function follows the same basic framework:
1. Block geographically by loading the match points and lower-level administrative units (e.g., sublocations) corresponding to a polling station's constituency.
2. Search for the lower-level administrative units belonging to the polling station.
3. Calculate the string distance between truncated match point names and the truncated polling station name.
4. Build spatial measures of match quality (e.g., minimum geographic distance from match points to the polling station's lower-level administrative units).
5. Subset to the match points with a string distance score less than a _manually_ determined threshold.
6. Check type and subtype alignment to use as another quality measure.
7. Among the matches with the minimum string distance score, pick those with the best quality scores.
8. Throw out matches that are too far from the polling station's lower-level administrative boundaries.
Below, we describe the blocking and fuzzy matching steps in detail before walking through the code for a simplified matching function with a single polling station. We then use the example data sets to demonstrate the two functions we used first in our matching process, `match_lsl()` and `match_inf_loc13()`. Together, these functions geocode roughly 75 percent of Kenya's 1997-2013 polling stations. We discuss two techniques particular to these functions; their full definitions can be found in `code/functions.R`.
### Blocking with polygon boundaries
Our full data set of match points contains over 125,000 coordinates. Running matching algorithms for each of the nearly 26,000 polling stations against all match points would therefore be hugely expensive. To save time and RAM, we adapt a common record-linkage technique, blocking, to a spatial context: we split our matching data geographically and use constituency code identifiers as a blocking key.
We first buffer our constituency polygons by 500 meters to account for error in the spatial data. Next, we build a list in which each element is the subset of match points contained by each of the buffered constituencies. We use a convenience function defined in `code/functions.R`, to save each object in a new directory: `data/mp_constb/`. By default, the name of each file matches the name of the list element to which the file corresponds. We repeat this process to split the sublocation polygons into constituency blocks.
```{r mp_sl_constb, cache=TRUE, warning=FALSE}
# Buffer old and new constituencies
constb <- gBuffer(const_old, byid = TRUE, width = 500)
l_constb <- split(constb, constb$id)
const13b <- gBuffer(const13, byid = TRUE, width = 500)
l_const13b <- split(const13b, const13$constid)
# Add sublocation and location information to match points
row.names(mp) <- 1:nrow(mp)
mp_sl <- over(mp, sl)
mp@data <- data.frame(mp@data, mp_sl)
# Split match points by buffered constituency
mp_constb <- mclapply(l_constb, function(x) mp[x, ],
mc.cores = cores)
save_list(mp_constb, "data/mp_constb")
mp_const13b <- mclapply(l_const13b, function(x) mp[x, ],
mc.cores = cores)
save_list(mp_const13b, "data/mp_const13b")
# Split sublocations by buffered constituency
sl_constb <- mclapply(l_constb, function(x) sl[x, ],
mc.cores = cores)
save_list(sl_constb, "data/sl_constb")
sl_const13b <- mclapply(l_const13b, function(x) sl[x, ],
mc.cores = cores)
save_list(sl_const13b, "data/sl_const13b")
```
Each match points and sublocations subset is named for the old (i.e., pre-2013) constituency code in which the spatial features are contained. We use these constituency codes as blocking keys by searching for matches only within subsets corresponding to a polling station's constituency. For example, when searching for matches to a polling station assigned to constituency `067` in 2002, we load only the objects in `data/mp_constb/067.rds` and `data/sl_constb/067.rds`.
### Fuzzy string matching
We use the `stringdist` package to calculate [Jaro-Winkler distances](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) between the polling stations data and both the match points' and polygon's name strings [@stringdist]. The `stringdist` package supports a number of metrics for measuring string similarity. We've found the Jaro-Winkler distance to work best for our purposes. Results of `stringdist(a, b, method = "jw")` are scaled between 0 and 1, for which lower numbers reflect more similar strings.
Aside from supporting multiple string metrics, one advantage of `stringdist()` over `agrep()` is that it provides a sense of how the distance calculations change across comparisons. Inspecting the results by hand allows you to pick precise thresholds for the minimum acceptable string distance.
```{r stringdist}
set.seed(575)
# Sample polling stations to work with and pull out
# the names of their 2002 sublocations
sl_samp <- sample(ps$sl02[ps$sl02 != ""], 20)
# Compute string distances between polling stations' sublocations
# and the names contained in the sublocations shapefile
sd_l <- lapply(sl_samp, stringdist,
b = sl$SUBLOCATIO, method = "jw")
# Find sublocation strings with the minimum string distance score
# (In this example, we'll just take the first match)
sl_match <- sapply(sd_l, function(x) sl[x == min(x), ]$SUBLOCATIO[1])
sl_dist <- sapply(sd_l, min)
# Bind the results in a data.frame
df <- data.frame(sl_samp, sl_match, sl_dist,
stringsAsFactors = FALSE)
df2 <- arrange(df, desc(sl_dist))
head(df2, 10)
```
In the example above, fourteen of the twenty polling stations were associated with 2002 sublocations that had an exact name match in the sublocations shapefile. The string distances listed in the `sl_dist` column tell us a little about how `stringdist()` scored the sublocation names that weren't identical. For this particular sample, it looks like matches with a score less than 0.11 would be acceptable.[^sl_dup] We would encourage researchers to go through this exercise---with their full data set---each time they set a string distance threshold.
[^sl_dup]: Given that sublocation names are not always unique, we could be more confident in these matches if we had blocked on constituency.
### Walkthrough with one polling station
We'll now go through the steps outlined above in order to match a single polling station. This workflow demonstrates the core stages of our geocodeing procedure: spatial blocking, fuzzy string matching, and match evaluation. We simplify the exercise by using polygon data only from 2002. The point name strings have already been cleaned and truncated.
```{r walkthrough, warning = FALSE}
# Pick a polling station
tps <- sample_n(ps, 1)
# Blocking ----
# Use old constituency code as a blocking key
tconst <- const_old[const_old$id == tps$constid10, ]
tconstb <- gBuffer(tconst, width = 500)
# Subset match points and sublocations to this buffered constituency
tmp <- mp[tconstb, ]
tsl <- sl[tconstb, ]
# Fuzzy string matching, polygons then points ----
strd_poly_A <- stringdist(tps$sl07, tsl$SUBLOCATIO, method = "jw")
strd_poly_B <- stringdist(tps$loc07, tsl$LOCATION, method = "jw")
strd_mp <- stringdist(tps$trunc13, tmp$trunc, method = "jw")
# Index to the closest polygon and point matches
poly_ind_A <- which(strd_poly_A == min(strd_poly_A) & strd_poly_A < 0.08)
poly_ind_B <- which(strd_poly_B == min(strd_poly_B) & strd_poly_B < 0.08)
mp_ind <- which(strd_mp == min(strd_mp))
# Quality measures ----
# Calculate geographic distance between point and polygon matches
gdist_A <- gDistance(tmp[mp_ind, ], tsl[poly_ind_A, ], byid = TRUE)
gdist_B <- gDistance(tmp[mp_ind, ], tsl[poly_ind_B, ], byid = TRUE)
# Record the minimum distance for each potential point match
min_gdist_A <- colMins(gdist_A)
min_gdist_B <- colMins(gdist_B)
type_match <- tps$type13 == tmp$type[mp_ind]
subtype_match <- tps$subtype13 == tmp$subtype[mp_ind]
# Build final output
tmatch <- tmp[mp_ind, ]
tmatch@data <-
data.frame(tps, type_match, subtype_match, min_gdist_A, min_gdist_B,
mp_name = tmatch$name, mp_code = tmatch$code, mp_source = tmatch$source,
slmatch = tsl$SL_ID[poly_ind_A])
```
Our result here, `tmatch`, contains a single likely location for the polling station chosen. We've also included additional data on the quality of each match, as well as the ID number of the sublocation in which we matched the polling station.
## Running matching algorithms in parallel
The example above would constitute a single step in our matching algorithm. In our full record-linkage process, we iterate a similar procedure for all unmatched polling stations using progressively less precise blocking methods and appropriately adjusted string distance thresholds. The `code/functions.R` script contains the full code for our two most precise matching functions: `match_lsl()` and `match_inf_loc13()`. Please refer to that script for more detailed comments on how these functions differ from the example above. Below, we demonstrate how to apply these two functions to the full sample data set. We run these fairly intensive processes in parallel.[^nb_parallel] These two functions accept three arguments:
1) a single row of polling station data,
2) a constituency `sp` object,
3) a vector of urban constituency IDs (for adjusted quality measures), and
4) a vector of column names indicating which polygons to search for when matching.
Instead of returning a data frame, these functions output a list, though the data contained are essentially the same.
```{r match_lsl, cache=TRUE, warning=FALSE}
# Split polling stations into list
match_list <- split(ps, ps$uid)
# Find old urban constituency IDs
urb_county <- c("MOMBASA", "NAIROBI CITY", "NAKURU", "KISUMU")
urb_constid02 <- unique(filter(ps, county13 %in% urb_county)$constid02)
# Run most precise matching function on six cores
lsl_match <- mclapply(match_list,
match_lsl,
const = constb,
urb_const = urb_constid02,
polys = c("sl97", "loc97", "sl02", "loc02", "sl07", "loc07"),
mc.cores = cores)
# Find unmatched polling stations
found_lsl <- map(lsl_match, "match_mp")
no_match <- which(sapply(found_lsl, is.character))
```
### Inferring missing polygon data
In subsequent matching steps, lower levels of precision are typically due to using larger polygons to subset the match points or lower-quality, less-recent data. Polling stations may be missing the data that allow us to link them to relatively small polygons. In such cases, it may be possible to infer that polling station’s inclusion in a set of administrative boundaries, based on data from other polling stations. For example, polling stations newly created in 2013 are missing any information on the encompassing sublocation. However, we do observe the 2013 ward. Rather than begin by searching in an entire ward, we can instead infer sublocation data based on previously established polling stations in the same ward. We then look for matches within these smaller, high-quality polygons.
```{r inf13, cache=TRUE, warning=FALSE}
# Remove matched polling stations
ps2 <- filter(ps, uid %in% names(no_match))
# Build lists of potential locations and sublocations
ps3 <- group_by(ps2, wardid13) %>%
mutate(loc13 = na.omit(c(loc07, loc02, loc97)) %>% unique() %>% list(),
sl13 = na.omit(c(sl07, sl02, sl97)) %>% unique() %>% list()
) %>% ungroup()
# Find new urban constituency IDs
urb_constid13 <- unique(filter(ps, county13 %in% urb_county)$constid13)
match_list2 <- split(ps3, ps3$uid)
inf_loc13_match <- mclapply(match_list2,
match_inf_loc13,
const = const13b,
urb_const = urb_constid13,
polys = c("sl13", "loc13"),
mc.cores = cores)
```
# Imputing points
For any polling stations remaining unmatched after all steps, we resort to imputing points within the smallest geographic area possible. The `spatstat` package provides functions and classes to make this process simple, allowing us to use a population density raster as the probability density of the polling station locations. Below, we will continue with the previous example’s polling station, as if we had not found any point matches.
```{r impute}
# Index to matched sublocation
tsl <- sl[sl$SL_ID == tmatch$slmatch, ]
# Read raster (file cropped from full Kenya population density raster)
pd <- raster("data/pd_sl5109.tif")
# Generate point
ppp <- rpoint(n = 1, f = as.im(pd))
# Convert to spatial point object
imp_point <- as(ppp, "SpatialPoints")
```
# Conclusion
Social scientists have rapidly seized on the research potential of GIS data. However, the analyses that spatial data afford almost always require some amount of record linkage to join quantities of interest to the relevant spatial covariates. In locations with a still-nascent spatial data industry, researchers' typical problems managing messy data are compounded with the difficulties of working with imprecise or unstandardized GIS files.
We hope that this guide can offer a useful resource for those looking to geocode points rigorously and systematically. The methods presented here are far from perfect: errors may still occur at any stage of our linkage process. However, we have found this procedure to be both sufficiently accurate and scalable to relatively large data sets.
# Software