# Spatial Analysis and Regression

In this learning activity, we will be using a range of more advanced geoprocessing and spatial analysis tools. Buckle up, it will get a bit advanced!

The main new packages we will use are **{spdep}** and **{spatialreg}**, for working with our spatial weights matrices and running spatial models, respectively. 

One major issue in the United States is racial segregation; due to explicitly racist government policy like redlining, as well as more "neutral" policies like single-family zoning and school district catchment zones, many American communities are segregated by race. In this activity, we will examine racial segregation in New Jersey by mapping and modeling racial *entropy* scores, a method for measuring diversity. We will rely on **{tidycensus}** to download both spatial and demographic data from the U.S. Census. Ready? Let's get started!

In [None]:
# if needed, install pacman
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

pacman::p_load(
    tidyverse,      # tidy data management + graphing
    sf,             # spatial data management and processing
    tigris,         # downloading GIS files from Census
    tidycensus,     # downloading census data + GIS
    spdep,          # spatial weights and autocorrelation
    spatialreg,     # spatial regression models
    modelsummary    # creating regression tables
)

Here we are going to use the `get_decennial()` function in order to get higher-quality data of very small geographic units, like census tracts. Note the use of the `sumfile` argument, and that we are getting more detailed geographic information by setting cartographic boundary files to off using `cb = FALSE`.

We will use `vacancy.renter` as our measure of poverty (the decennial census does not include poverty questions). Here I am assuming that vacant rental properties mark undesirable locations, reflecting the problems of concentrated poverty. 

In [None]:

df <- get_decennial(state = "NJ",
              year = 2020,
              geography = "tract", 
              variables = c(median.age = "DP1_0073C",
                            totalpop = "DP1_0092C",
                            latino.pct = "DP1_0093P",
                            white.pct = "DP1_0105P",
                            black.pct = "DP1_0106P",
                            nat.am.pct = "DP1_0107P",
                            asian.pct = "DP1_0108P",
                            vacancy.owner = "DP1_0156C",
                            vacancy.renter = "DP1_0157C"),
              geometry = TRUE,
              sumfile = "dp",
              cb = FALSE)

As is normal, we need to pivot our dataframe to move multiple variables into columns instead of rows. Let's also create a new variable which includes the rest of the racial/ethnic categories in NJ. 

In [None]:
df.wide <- df |> pivot_wider(names_from = variable,
                             values_from = value) |>
                 mutate(other.pct = 100 - (latino.pct + white.pct + 
                                           black.pct + nat.am.pct + 
                                           asian.pct))

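Some of our racial/ethnic variables have missing data, which the Census codes as `NA`. In addition, a share of exactly zero produces `NaN` in the entropy formula, because `0 * log(1/0)` is undefined in R. We can use a simple loop to compute each group's entropy term and recode these cases. Since the same syntax is repeated over many variables, a loop is more efficient than writing each line out by hand.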

Note that in order for us to create variable names inside the loop, we need to use some special syntax inside of `mutate()`. The two `!!` marks before `var.out` tell R that `var.out` should be evaluated; it isn't the literal name of the new variable but an object that holds the name. Likewise, we can use the `.data[[]]` syntax to look up existing columns by the name stored in the `var.in` object. Lastly, in order for this to work, we need to use `:=` instead of `=` in the assignment.

Entropy scores are created by multiplying the proportion of observations in each category by the natural log of the inverse of that proportion. The formula below does this and then codes the missing results as zeros. 
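As a quick check on the entropy definition, here is a minimal base R sketch. The helper name `entropy_score()` and the example shares are made up for illustration; they are not part of the tutorial's data or of any package.

```r
# Entropy of one tract from its group shares (proportions that sum to 1).
# `entropy_score` is a hypothetical helper, not part of any package.
entropy_score <- function(p) {
  terms <- p * log(1 / p)        # 0 * log(1/0) evaluates to NaN in R
  terms[is.na(terms)] <- 0       # empty or missing groups contribute 0
  sum(terms)
}

even_split <- entropy_score(c(0.5, 0.5))   # two equally sized groups
one_group  <- entropy_score(c(1, 0))       # a completely homogeneous tract
six_even   <- entropy_score(rep(1/6, 6))   # six equally sized groups
```

A homogeneous tract scores zero, and the score rises as groups become more evenly represented, reaching `log(k)` when `k` groups are equal in size.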

In [None]:
for(i in c("latino", "white", "black", "nat.am", "asian", "other")){
  var.in <- paste0(i, ".pct")
  var.out <- paste0("entropy.", i)
  df.wide <- df.wide |> 
        mutate(!!var.out := (.data[[var.in]]/100)*log(1/(.data[[var.in]]/100)),
                                !!var.out := ifelse(is.na(.data[[var.out]]), 0, .data[[var.out]]))
}
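The dynamic naming used in this loop can be easier to see on a tiny example. The sketch below uses a made-up tibble (`toy`) and made-up column names, and only assumes that **{dplyr}** is installed.

```r
library(dplyr)

# A made-up table with two percentage columns
toy <- tibble(a.pct = c(50, 0, NA), b.pct = c(50, 100, 25))

for (grp in c("a", "b")) {
  var.in  <- paste0(grp, ".pct")     # name of an existing column, stored as a string
  var.out <- paste0("share.", grp)   # name of the column we want to create
  toy <- toy |>
    mutate(!!var.out := .data[[var.in]] / 100)  # !! and := create the column named in var.out
}

names(toy)
```

Each pass through the loop adds one new column whose name comes from `var.out`, computed from the column whose name is stored in `var.in`.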

Now, add up all the entropy scores to create the final measure.

In [None]:

df.wide <- df.wide |> 
        mutate(entropy = entropy.latino + 
        entropy.white + entropy.black + 
        entropy.nat.am + entropy.asian + entropy.other)


Let's map it!

In [None]:
ggplot(df.wide, aes(fill = entropy)) + 
    geom_sf(linewidth = .01) + 
    scale_fill_viridis_c(option = "A") +
    theme_void()

There are definitely spatial patterns here. More populous areas seem to be more diverse, although there appear to be pockets of low diversity in some of New Jersey's city centers (Trenton, Newark, Paterson, etc.). 

Before we do more, let's deal with the odd part of this map: tract shapes are clearly including parts of the ocean. We can remove that area using **{tigris}**'s `erase_water()` function. This can take several minutes to run. 

In [None]:
df.wide <- df.wide |> 
            st_transform(32111) |>    # project to NAD83 / New Jersey (units: meters)
            erase_water(year = 2020)  # removes water from polygon. 
                                      # year needs to match year from data source

In [None]:
ggplot(df.wide, aes(fill = entropy)) + 
    geom_sf(linewidth = .01) + 
    scale_fill_viridis_c(option = "A") +
    theme_void()

Much better!

Now, given the apparent relationship between diversity and population density, let's create a formal measure of each tract's population density. We can use `st_area()` from **{sf}** to measure the area. The function uses the units of your coordinate reference system; in this case, square meters. Below, we measure the area and then divide each tract's population by its area, scaling the area by 1,000 so that the result is in thousands of persons per square kilometer.  

In [None]:
df.wide$area = st_area(df.wide)
df.wide$popden = df.wide$totalpop / (df.wide$area / 1000) # thousands of persons per sq kilometer
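The unit arithmetic here is easy to get wrong, so a small check with made-up numbers may help. The tract area and population below are hypothetical.

```r
area_m2 <- 2.5e6     # hypothetical tract: 2.5 square kilometers, stored in square meters
pop     <- 5000      # hypothetical tract population

per_km2          <- pop / (area_m2 / 1e6)    # persons per square kilometer
thousand_per_km2 <- pop / (area_m2 / 1000)   # same divisor as the code above
```

Dividing the area in square meters by 1,000 (rather than 1,000,000) gives a density that is 1,000 times smaller than persons per square kilometer, that is, thousands of persons per square kilometer.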

In [None]:
ggplot(df.wide, aes(fill = as.numeric(popden))) +     # popden is a units object (from st_area()), so we convert to numeric
    geom_sf(linewidth = .01) + 
    scale_fill_viridis_c(option = "A", direction = -1) + # -1 reverses color scale
    theme_void()

## Spatial Weights

We can use a set of functions from **{spdep}** to create spatial weights matrices. Let's start with queen contiguity weights.

In [None]:
# here poly2nb() takes our polygons and creates a list of all neighbors, defined as queen contiguity
w_queen <- poly2nb(df.wide, queen = TRUE) 
# to use, we need to convert them to weights using nb2listw().
w_q <- nb2listw(w_queen,         # from previous function
                style = "W",     # W means row standardized
                zero.policy = TRUE) # allows tracts with no neighbors
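To see what contiguity and row standardization do, here is a small sketch on a made-up 3 x 3 grid, using `cell2nb()` from **{spdep}** instead of real tract polygons. The grid is only an illustration.

```r
library(spdep)

# Neighbor lists for a 3 x 3 grid of cells
nb_queen <- cell2nb(3, 3, type = "queen")  # shared edges or corners count
nb_rook  <- cell2nb(3, 3, type = "rook")   # only shared edges count

n_queen <- card(nb_queen)   # number of neighbors per cell
n_rook  <- card(nb_rook)

# Row-standardized weights: each cell's weights sum to 1
lw_queen <- nb2listw(nb_queen, style = "W")
row_sums <- sapply(lw_queen$weights, sum)
```

The center cell (cell 5) has eight queen neighbors but only four rook neighbors, and under `style = "W"` each of its queen neighbors receives a weight of 1/8.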

Creating distance or inverse distance weights is a little more complex. Below, we create two kinds. The first will include a list of all tracts within 20 kilometers of the tract. The second method will choose the 10 closest neighbors for each tract. In order for this to work, we need to use point data instead of polygon data. So we create polygon centroids using `st_centroid()`.

In [None]:
coords <- st_centroid(df.wide)

ggplot(coords) + geom_sf() + theme_void()

In [None]:
w_dist <- dnearneigh(coords,   # identifies all neighbors inside of range
            d1 = 0,            # minimum
            d2 = 20000)        # maximum = 20km

w_k <- knn2nb(knearneigh(coords, k = 10)) # creates neighbor list of the 10 nearest neighbors

Now we use `nbdists()` to measure the distance between neighbors.

In [None]:
w_dist1 <- nbdists(w_dist, coords) 
w_distk <- nbdists(w_k, coords)

Using `lapply()`, create the inverse distance, and then convert to the weights matrix using `nb2listw()`.

In [None]:

dists <- lapply(w_dist1, function(x) 1/(x))
w_inv <- nb2listw(w_dist, glist=dists, 
            style = "W", zero.policy = TRUE)

distsk10 <- lapply(w_distk, function(x) 1/(x))
w_invk10 <- nb2listw(w_k, glist=distsk10, 
            style = "W", zero.policy = TRUE)
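The same pipeline can be traced by hand on three made-up points along a line. The coordinates and the 2.5-unit distance band below are hypothetical, chosen so the weights are easy to verify.

```r
library(spdep)

# Three made-up points on a line at x = 0, 1, and 3
pts <- cbind(x = c(0, 1, 3), y = c(0, 0, 0))

nb_toy   <- dnearneigh(pts, d1 = 0, d2 = 2.5)   # neighbors within 2.5 units
d_toy    <- nbdists(nb_toy, pts)                # distances to each neighbor
inv_toy  <- lapply(d_toy, function(x) 1 / x)    # inverse distances
lw_toy   <- nb2listw(nb_toy, glist = inv_toy, style = "W")

w_middle <- lw_toy$weights[[2]]   # weights for the middle point
```

The middle point has neighbors at distances 1 and 2, so its inverse distances are 1 and 0.5; after row standardization the closer neighbor gets a weight of 2/3 and the farther one 1/3.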

## Spatial Autocorrelation Measures

Now that we have created our spatial weights, we can put them to use measuring spatial relationships. In the example below, I use Moran's I, but you could also use Geary's *c* and Getis-Ord *G* statistics. See [here](https://r-spatial.org/book/15-Measures.html#global-measures) for instructions for these statistics. 

Below, calculate Moran's I using each of our spatial weight matrices. Which spatial weight shows the strongest spatial autocorrelation?

In [None]:
moran.test(df.wide$entropy, listw = w_q, 
           alternative = "two.sided", 
           zero.policy = TRUE, 
           na.action = na.omit)

moran.test(df.wide$entropy, listw = w_inv, 
           alternative = "two.sided", 
           zero.policy = TRUE, 
           na.action = na.omit)
    
moran.test(df.wide$entropy, listw = w_invk10, 
           alternative = "two.sided",
           zero.policy = TRUE, 
           na.action = na.omit)
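To make the statistic less of a black box, the sketch below computes Moran's I by hand on a made-up 4 x 4 grid and compares it with the estimate from `moran.test()`. The grid and the variable `x` are invented for illustration.

```r
library(spdep)

nb <- cell2nb(4, 4, type = "queen")
lw <- nb2listw(nb, style = "W")

x <- rep(1:4, each = 4)        # made-up variable with a smooth gradient
W <- listw2mat(lw)             # weights as an ordinary matrix
z <- x - mean(x)               # deviations from the mean

# Moran's I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2)
I_manual <- (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
I_spdep  <- moran.test(x, lw)$estimate[["Moran I statistic"]]
```

Because neighboring cells have similar values of `x`, both calculations return the same positive value.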

Local measures of Moran's I can be created using `localmoran()`. Again, we use all three definitions of weights. 

In [None]:
localm_q <- localmoran(df.wide$entropy,listw = w_q)
localm_inv <- localmoran(df.wide$entropy,listw = w_inv)
localm_invk10 <- localmoran(df.wide$entropy,listw = w_invk10)

After storing these results, we can create new variables on our spatial dataframe that record the local Moran values that are "significant" (Pebesma and Bivand prefer the term "interesting"). Below we use a significance threshold of .01 and adjust the p-values for false discovery rates, according to guidance from Caldas de Castro and Singer (2006), using the `hotspot()` function. 

In [None]:
df.wide$hotspot_I_q <- hotspot(localm_q, 
            Prname = "Pr(z != E(Ii))",       # this is the name of the pvalue in the table produced by localmoran()
            cutoff = 0.01, 
            p.adjust = "fdr")

df.wide$hotspot_I_inv <- hotspot(localm_inv, 
            Prname = "Pr(z != E(Ii))", 
            cutoff = 0.01, 
            p.adjust = "fdr")
            
df.wide$hotspot_I_invk10 <- hotspot(localm_invk10, 
            Prname = "Pr(z != E(Ii))", 
            cutoff = 0.01, 
            p.adjust = "fdr")
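To get a feel for what a false discovery rate adjustment does, here is a sketch using base R's `p.adjust()` on a handful of made-up p-values. It illustrates the general adjustment named by `p.adjust = "fdr"`, not the internals of `hotspot()`.

```r
# Made-up p-values for five locations
p_raw <- c(0.001, 0.008, 0.02, 0.04, 0.5)

p_fdr <- p.adjust(p_raw, method = "fdr")   # Benjamini-Hochberg adjustment

n_raw <- sum(p_raw < 0.01)   # locations flagged before adjustment
n_fdr <- sum(p_fdr < 0.01)   # locations flagged after adjustment
```

With these values, two locations fall below .01 before adjustment but only one does afterwards, which is why adjusted maps tend to show fewer clusters.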

Let's graph them and examine the differences!

In [None]:
ggplot(df.wide, aes(fill = hotspot_I_q)) + 
    geom_sf(linewidth = 0.1) + 
    theme_void() + scale_fill_viridis_d()

ggplot(df.wide, aes(fill = hotspot_I_inv)) + 
    geom_sf(linewidth = 0.1) + 
    theme_void() + scale_fill_viridis_d()

ggplot(df.wide, aes(fill = hotspot_I_invk10)) + 
    geom_sf(linewidth = 0.1) + 
    theme_void() + scale_fill_viridis_d()

Clearly, the more expansive definition of neighbors (the 20 km distance weights) identifies a larger number of clusters. 

## Spatial Regression Models

A series of functions from **{spatialreg}** can be used to estimate spatial models. They are easy to fit. We use the same base specification for each model by storing the formula as `my.formula`. Here, we predict entropy scores by vacancy, median age, and population density. Then, we alter the spatial model by including lagged outcomes, lagged predictors, both, and lags in the error term. 

In [None]:
# store the base model for simplicity
my.formula <- formula(entropy ~ vacancy.renter + median.age + popden)

# OLS
m.lm <- lm(my.formula, data = df.wide)

# Spatial lag of x model: includes lags of Xs but not Y
m.slx <- lmSLX(my.formula, data = df.wide, listw = w_q)

# Spatial Lag Model or SAR; includes lag of outcome
m.slm <- lagsarlm(my.formula, data = df.wide, listw = w_q)

# Spatial Durbin model; includes WY lag and two WX covariates
m.sd1 <- lagsarlm(my.formula, data = df.wide, listw = w_q, Durbin = ~ vacancy.renter + median.age)

# This Durbin model includes all three WXs
m.sd2 <- lagsarlm(my.formula, data = df.wide, listw = w_q, Durbin = TRUE)

# Spatial Error Model; allows the error term to be spatially autocorrelated
m.sem <- errorsarlm(my.formula, data = df.wide, listw = w_q)

# Spatial Durbin Error Model; incorporates WXs with spatial error term
m.sdem <- errorsarlm(my.formula, data = df.wide, listw = w_q, Durbin = TRUE)

# Spatial Autoregressive Combined model; includes both a lag of Y and a spatially autocorrelated error term
m.sac <- sacsarlm(my.formula, data = df.wide, listw = w_q)
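The "lags" in these models are weighted averages of neighbors' values. As a rough illustration, the sketch below uses `lag.listw()` from **{spdep}** on a made-up 3 x 3 grid; the variable `x` is invented and has nothing to do with the census data.

```r
library(spdep)

nb <- cell2nb(3, 3, type = "rook")
lw <- nb2listw(nb, style = "W")   # row standardized, as in w_q

x  <- 1:9                         # made-up values, one per grid cell
wx <- lag.listw(lw, x)            # spatial lag: average of each cell's neighbors
```

With row-standardized weights, the lag for the center cell (cell 5) is the mean of its four rook neighbors (cells 2, 4, 6, and 8). A WX term in an SLX or Durbin model is this kind of neighbor average entered as an additional predictor.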

In [None]:
modelsummary(list("OLS" = m.lm, "SLM" = m.slm, "SEM" = m.sem, "SAC" = m.sac,
                  "SLX" = m.slx, "SD1" = m.sd1, "SD2" = m.sd2, "SDEM" = m.sdem),
             stars = TRUE,
             estimate = "{estimate}{stars}",
             statistic = "({std.error})",                  
             gof_map = c("nobs", "r.squared", "rmse", "aic")) 

One final step is to examine the direct and indirect effects of your X variables. We can do that with the `impacts()` function, although it requires a little bit of awkward processing first.

In [None]:
W <- as(w_q, "CsparseMatrix")
trw_q <- trW(W, type="mult")

impacts(m.slx)
impacts(m.sd2, tr = trw_q)
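For intuition about direct and indirect effects, here is a base R sketch for a simple spatial lag model without WX terms, following the usual definition based on the matrix (I - rho W)^-1 * beta. The three-unit weights matrix and the values of `rho` and `beta` are made up; this is not output from the models above.

```r
rho  <- 0.5   # made-up spatial lag coefficient
beta <- 2     # made-up slope for one predictor

# Row-standardized weights for three units on a line
W <- matrix(c(0,   1, 0,
              0.5, 0, 0.5,
              0,   1, 0), nrow = 3, byrow = TRUE)

S <- solve(diag(3) - rho * W) * beta   # effect of x in unit j on y in unit i

direct   <- mean(diag(S))      # average effect of a unit's own x
total    <- mean(rowSums(S))   # average effect of changing x everywhere
indirect <- total - direct     # the part that arrives through neighbors
```

With row-standardized weights the total effect equals `beta / (1 - rho)`, here 4, so a sizable share of the effect operates through neighbors even though the slope itself is 2.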

For which X is the largest portion of the effect coming indirectly (that is, from neighbors' values)? Speculate a bit. Why do you think that neighbors' levels of that variable would affect the racial diversity of a tract?  