# Advanced Spatial Analysis
# Module 10: Course Wrap Up

This week we will review the lessons learned in the previous modules and hold a workshop to practice any tricky concepts.

In [None]:
import geopandas as gpd
import pysal as ps
import statsmodels.api as sm
from scipy import stats
from statsmodels.tools.tools import add_constant

%matplotlib inline

In [None]:
# load the data
tracts = gpd.read_file('data/census_tracts_data.geojson')
tracts.shape

In [None]:
# choose a response variable and drop any rows in which it is null
response = 'med_home_value'
tracts = tracts.dropna(subset=[response])

In [None]:
# map the data
tracts.plot(column=response, scheme='quantiles')

In [None]:
tracts.columns

## 1. Statistical models

### 1a. Simple (bivariate) linear regression

OLS regression with a single predictor

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = 'med_household_income'
X = tracts[predictors].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a simple linear regression model with scipy
m, b, r, p, se = stats.linregress(x=X, y=y)
print('m={:.4f}, b={:.4f}, r^2={:.4f}, p={:.4f}'.format(m, b, r ** 2, p))

In [None]:
# estimate a simple linear regression model with statsmodels
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

This single predictor explains about half the variation of the response. To explain more, we need more predictors.
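
For intuition on what "explains about half the variation" means, here is a minimal sketch on synthetic data. The values and the names `x` and `y_demo` are made up for illustration and are not the tracts data. It shows that $r^2$ is one minus the ratio of the residual sum of squares to the total sum of squares, and that in the bivariate case this equals the squared Pearson correlation.

```python
import numpy as np

# synthetic data for illustration only (not the tracts data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_demo = 2.0 * x + rng.normal(size=200)

# fit slope and intercept by least squares
slope, intercept = np.polyfit(x, y_demo, deg=1)
resid = y_demo - (slope * x + intercept)

# r^2 as the share of the response's variation explained by the fitted line
ss_res = (resid ** 2).sum()
ss_tot = ((y_demo - y_demo.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# in the bivariate case this equals the squared Pearson correlation
r_pearson = np.corrcoef(x, y_demo)[0, 1]
print(round(r_squared, 4), round(r_pearson ** 2, 4))
```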

### 1b. Multiple regression

OLS regression with multiple predictors

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = ['med_household_income', 'pct_white']
X = tracts[predictors].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

#### Now add in more variables...

In [None]:
tracts.columns

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = ['med_household_income', 'pct_white', 'pct_single_family_home', 'pct_built_before_1940',
              'med_rooms_per_home', 'pct_bachelors_degree']
X = tracts[predictors].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

## 2. Spatial models

Basic types:

  - **Spatial heterogeneity**: account for systematic differences across space without explicitly modeling interdependency (non-spatial estimation)
    - spatial fixed effects (intercept varies for each spatial group)
    - spatial regimes (intercept and coefficients vary for each spatial group)
  - **Spatial dependence**: model interdependencies between observations through space
    - spatial lag model (spatially-lagged endogenous variable added as predictor; because of endogeneity, cannot use OLS to estimate)
    - spatial error model (spatial effects in error term)
    - spatial lag+error combo model

### 2a. Spatial fixed effects

Using dummy variables representing the counties into which our observations (tracts) are nested

In [None]:
# create a new dummy variable for each county, with 1 if tract is in this county and 0 if not
for county in tracts['COUNTYFP'].unique():
    new_col = f'dummy_county_{county}'
    tracts[new_col] = (tracts['COUNTYFP'] == county).astype(int)

In [None]:
# remove one dummy to prevent perfect collinearity with the constant term
# i.e., the full set of dummies sums to 1 in every row, duplicating the intercept column
county_dummies = [f'dummy_county_{county}' for county in tracts['COUNTYFP'].unique()]
county_dummies = county_dummies[1:]
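
To see why one dummy has to go, the following minimal sketch checks the rank of a design matrix with a constant plus the full set of dummies, and then with one dummy removed. The county codes and variable names here are toy values chosen for illustration.

```python
import numpy as np

# toy county codes for six observations (illustrative, not the real COUNTYFP values)
codes = np.array(['001', '001', '003', '003', '005', '005'])
labels = np.unique(codes)

# one dummy column per county: 1 if the row is in that county, else 0
dummies = (codes[:, None] == labels[None, :]).astype(float)
const = np.ones((len(codes), 1))

# constant + all dummies: the dummies sum to the constant column, so rank < columns
X_full = np.hstack([const, dummies])
# constant + all but the first dummy: full column rank
X_drop = np.hstack([const, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)
print(X_full.shape[1], rank_full)
print(X_drop.shape[1], rank_drop)
```

The omitted county becomes the reference category: each remaining dummy's coefficient is that county's intercept shift relative to it.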

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = ['med_household_income', 'pct_white', 'pct_single_family_home', 'pct_built_before_1940',
              'med_rooms_per_home', 'pct_bachelors_degree']
X = tracts[predictors + county_dummies].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

### 2b. Spatial regimes

Each spatial regime can have different model coefficients. Here, the regimes are counties. We'll take a subset of our data (all the tracts in the 3 counties that contain the most tracts). This subsection uses OLS for estimation, but you can also combine spatial regimes with spatial autoregression models (introduced later).
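
Conceptually, letting every coefficient vary by regime gives the same point estimates as fitting a separate OLS model on each regime's rows. The sketch below illustrates this on synthetic data with three hypothetical regimes. All names and values are assumptions for illustration, and it says nothing about how `spreg` computes standard errors or its regime diagnostics.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
groups = np.repeat(np.array([0, 1, 2]), n // 3)   # three hypothetical regimes
x = rng.normal(size=n)
true_slopes = np.array([1.0, -2.0, 0.5])
y_demo = 3.0 + true_slopes[groups] * x + rng.normal(scale=0.1, size=n)

# regimes design: a separate [constant, x] block of columns for each regime
base = np.column_stack([np.ones(n), x])
X_reg = np.hstack([base * (groups == g)[:, None] for g in range(3)])
beta_reg, *_ = np.linalg.lstsq(X_reg, y_demo, rcond=None)

# fitting each regime on its own rows gives the same coefficients
beta_sep = np.concatenate([
    np.linalg.lstsq(base[groups == g], y_demo[groups == g], rcond=None)[0]
    for g in range(3)
])

# one row per regime: [intercept, slope]
print(np.round(beta_reg.reshape(3, 2), 2))
```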

In [None]:
# pick the 3 counties with the most tracts as the regimes, and only estimate a regimes model for this subset
counties = tracts['COUNTYFP'].value_counts().index[:3]
mask = tracts['COUNTYFP'].isin(counties)

In [None]:
# create design matrix containing predictors (drop nulls), a response variable matrix, and a regimes vector
X = tracts.loc[mask, predictors].dropna()  # only take rows in the 3 counties
Y = tracts.loc[X.index][[response]]  # notice this is a matrix this time for pysal
regimes = tracts.loc[X.index]['COUNTYFP']  # define the regimes

In [None]:
# estimate spatial regimes model with OLS
olsr = ps.model.spreg.OLS_Regimes(y=Y.values, x=X.values, regimes=regimes.values, name_regimes='county',
                                  name_x=X.columns.tolist(), name_y=response, name_ds='tracts')
print(olsr.summary)

### 2c. Spatial diagnostics

So far we've seen two spatial heterogeneity models. Now we'll explore spatial dependence, starting by using queen-contiguity spatial weights to model spatial relationships between observations and OLS to check diagnostics.
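
As a reminder of what queen-contiguity weights encode, here is a minimal sketch that builds them by hand for a toy 3x3 grid (a stand-in for tract polygons, with hypothetical names such as `W_demo`), row-standardizes them, and computes a spatial lag.

```python
import numpy as np

# cells of a 3x3 grid, numbered 0..8 row by row (a toy stand-in for tract polygons)
side = 3
n = side * side
W_demo = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ri, ci = divmod(i, side)
        rj, cj = divmod(j, side)
        # queen contiguity: neighbors share an edge or a corner
        if i != j and abs(ri - rj) <= 1 and abs(ci - cj) <= 1:
            W_demo[i, j] = 1.0

# row-standardize so each row sums to 1 (what W.transform = 'r' requests)
W_std = W_demo / W_demo.sum(axis=1, keepdims=True)

# the spatial lag of a variable is then the average of its neighbors' values
values = np.arange(n, dtype=float)
lag = W_std @ values

print(W_demo.sum(axis=1))  # neighbor counts per cell
print(lag)
```

Corner cells have 3 neighbors, edge cells 5, and the center cell 8, so row-standardization keeps units with many neighbors from dominating the lag.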

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable matrix
predictors = ['med_household_income', 'pct_white', 'pct_single_family_home', 'pct_built_before_1940',
              'med_rooms_per_home', 'pct_bachelors_degree']
X = tracts[predictors].dropna()
Y = tracts.loc[X.index][[response]] #notice this is a matrix this time for pysal

In [None]:
# compute spatial weights from tract geometries (but only those tracts that appear in design matrix!)
W = ps.lib.weights.Queen.from_dataframe(tracts.loc[X.index])
W.transform = 'r'  # row-standardize the weights so each row sums to 1

In [None]:
# compute OLS spatial diagnostics to check the nature of spatial dependence
ols = ps.model.spreg.OLS(y=Y.values, x=X.values, w=W, spat_diag=True, moran=True)

In [None]:
# calculate moran's I (for the response) and its significance
mi = ps.explore.esda.Moran(y=Y.values.flatten(), w=W, two_tailed=True)
print(mi.I)
print(mi.p_sim)

In [None]:
# moran's I (for the residuals): moran's i, standardized i, p-value
ols.moran_res
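
For reference, Moran's *I* is $I = \frac{n}{S_0} \frac{z^\top W z}{z^\top z}$, where $z$ holds deviations from the mean and $S_0$ is the sum of all weights. The sketch below computes it by hand on a toy example of four units on a line. The helper `morans_i` and the data are illustrative assumptions, and the sketch leaves out the permutation-based inference behind `p_sim`.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for a 1-d array and a dense n x n spatial weights matrix."""
    z = values - values.mean()          # deviations from the mean
    s0 = weights.sum()                  # sum of all weights
    return (len(values) / s0) * (z @ weights @ z) / (z @ z)

# toy example: four units on a line, each neighboring the adjacent units
W_line = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)

clustered = np.array([1.0, 1.0, 5.0, 5.0])    # similar values sit together
alternating = np.array([1.0, 5.0, 1.0, 5.0])  # neighbors always differ

print(morans_i(clustered, W_line))
print(morans_i(alternating, W_line))
```

The clustered pattern gives a positive statistic and the alternating pattern a negative one, which is the sign convention to keep in mind when reading the results.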

#### Interpreting the results

A significant Moran's *I* suggests spatial autocorrelation, but doesn't tell us which alternative specification should be used. Lagrange Multiplier (LM) diagnostics can help with that. If one LM test is significant and the other isn't, then that tells us which model specification (spatial lag vs spatial error) to use:

In [None]:
# lagrange multiplier test for spatial lag model: stat, p
ols.lm_lag

In [None]:
# lagrange multiplier test for spatial error model: stat, p
ols.lm_error

#### Interpreting the results

If (and only if) both LM tests produce significant statistics, try the robust versions (each nonrobust LM test is sensitive to the presence of the other form of dependence):

In [None]:
# robust lagrange multiplier test for spatial lag model: stat, p
ols.rlm_lag

In [None]:
# robust lagrange multiplier test for spatial error model: stat, p
ols.rlm_error

#### So... which model specification to choose?

If neither LM test is significant: use regular OLS.

If only one LM test is significant: use that model spec.

If both LM tests are significant: run robust versions.

If only one robust LM test is significant: use that model spec.

If both robust LM tests are significant (this can often happen with large sample sizes):

  - first consider if the initial model specification is actually a good fit
  - if so, use the spatial specification corresponding to the larger robust-LM statistic
  - or consider a combo model
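
The decision rules above can be sketched as a small helper. This is only an illustration: the function name `choose_spec`, the `alpha=0.05` threshold, and the `'undetermined'` return value (for the case the rules do not cover, where both LM tests are significant but neither robust test is) are assumptions. It also skips the judgment steps of rechecking the initial specification or considering a combo model.

```python
def choose_spec(lm_lag, lm_error, rlm_lag, rlm_error, alpha=0.05):
    """Suggest a specification from (statistic, p-value) pairs of the LM tests."""
    lag_sig = lm_lag[1] < alpha
    err_sig = lm_error[1] < alpha

    # neither LM test is significant: stay with regular OLS
    if not lag_sig and not err_sig:
        return 'ols'
    # only one LM test is significant: use that specification
    if lag_sig != err_sig:
        return 'lag' if lag_sig else 'error'

    # both LM tests are significant: consult the robust versions
    rlag_sig = rlm_lag[1] < alpha
    rerr_sig = rlm_error[1] < alpha
    if rlag_sig != rerr_sig:
        return 'lag' if rlag_sig else 'error'
    if rlag_sig and rerr_sig:
        # both robust tests significant: pick the larger robust statistic
        return 'lag' if rlm_lag[0] > rlm_error[0] else 'error'
    # not covered by the rules above (assumption: flag it for manual review)
    return 'undetermined'


# made-up test results for illustration
print(choose_spec((10.0, 0.001), (12.0, 0.001), (8.0, 0.01), (1.0, 0.3)))
```

With the fitted `ols` object from earlier, a call would look like `choose_spec(ols.lm_lag, ols.lm_error, ols.rlm_lag, ols.rlm_error)`.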

### 2d. Spatial lag model

Use when the diagnostics indicate the presence of a spatial diffusion process.

Model specification:

$y = \rho W y + X \beta + u$

where $y$ is an $n \times 1$ vector of observations (response), $W$ is an $n \times n$ spatial weights matrix (thus $Wy$ is the spatially-lagged response), $\rho$ is the spatial autoregressive parameter to be estimated, $X$ is an $n \times k$ matrix of observations (exogenous predictors), $\beta$ is a $k \times 1$ vector of parameters (coefficients) to be estimated, and $u$ is an $n \times 1$ vector of errors.
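
Because $y$ appears on both sides, the specification implies the reduced form $y = (I - \rho W)^{-1}(X \beta + u)$. The sketch below simulates from that reduced form on a toy ring of five units and confirms the result satisfies the structural equation. The weights, `rho`, `beta`, and all data are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2

# toy row-standardized weights: units on a ring, each neighboring the two adjacent units
W_ring = np.zeros((n, n))
for i in range(n):
    W_ring[i, (i - 1) % n] = 0.5
    W_ring[i, (i + 1) % n] = 0.5

rho = 0.6                          # assumed spatial autoregressive parameter
beta = np.array([1.0, -0.5])       # assumed coefficients
X_demo = rng.normal(size=(n, k))
u = rng.normal(scale=0.1, size=n)

# reduced form: y = (I - rho W)^{-1} (X beta + u)
y_sim = np.linalg.solve(np.eye(n) - rho * W_ring, X_demo @ beta + u)

# y_sim satisfies the structural equation y = rho W y + X beta + u
rhs = rho * W_ring @ y_sim + X_demo @ beta + u
print(np.allclose(y_sim, rhs))
```

The reduced form also shows why $Wy$ is endogenous: every element of $y$ depends on every element of $u$, so the lagged response is correlated with the errors.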

In [None]:
# maximum-likelihood estimation with full matrix expression
mll = ps.model.spreg.ML_Lag(y=Y.values, x=X.values, w=W, method='full', name_w='queen',
                            name_x=X.columns.tolist(), name_y=response, name_ds='tracts')
print(mll.summary)

In [None]:
# the spatial autoregressive parameter estimate, rho
mll.rho

### 2e. Spatial error model

Use when the diagnostics indicate the presence of spatial error dependence.

Model specification:

$y = X \beta + u$

where $X$ is an $n \times k$ matrix of observations (exogenous predictors), $\beta$ is a $k \times 1$ vector of parameters (coefficients) to be estimated, and $u$ is an $n \times 1$ vector of errors. The errors $u$ follow a spatial autoregressive specification:

$u = \lambda Wu + \epsilon$

where $\lambda$ is a spatial autoregressive parameter to be estimated and $\epsilon$ is the vector of errors.
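
Solving for $u$ gives $u = (I - \lambda W)^{-1} \epsilon$. If we additionally assume $\epsilon$ is i.i.d. with variance $\sigma^2$ (an assumption made here for illustration), the errors have covariance $\sigma^2 (I - \lambda W)^{-1} (I - \lambda W)^{-\top}$. The sketch below computes this for a toy ring of five units with assumed values of `lam` and `sigma2`.

```python
import numpy as np

n = 5
# toy row-standardized ring weights: each unit neighbors the two adjacent units
W_ring = np.zeros((n, n))
for i in range(n):
    W_ring[i, (i - 1) % n] = 0.5
    W_ring[i, (i + 1) % n] = 0.5

lam = 0.6       # assumed spatial autoregressive parameter for the errors
sigma2 = 1.0    # assumed variance of the i.i.d. innovations epsilon

# u = (I - lam W)^{-1} epsilon, so Cov(u) = sigma2 * A A^T with A = (I - lam W)^{-1}
A = np.linalg.inv(np.eye(n) - lam * W_ring)
cov_u = sigma2 * A @ A.T

print(np.round(cov_u, 3))
```

The off-diagonal entries are nonzero and shrink with distance along the ring, which is the correlation across nearby errors that OLS standard errors ignore.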

In [None]:
# maximum-likelihood estimation with full matrix expression
mle = ps.model.spreg.ML_Error(y=Y.values, x=X.values, w=W, method='full', name_w='queen',
                              name_x=X.columns.tolist(), name_y=response, name_ds='tracts')
print(mle.summary)

In [None]:
# the spatial autoregressive parameter estimate, lambda
mle.lam

### 2f. Spatial lag+error combo model

Estimated with GMM (generalized method of moments). Essentially a spatial error model with endogenous explanatory variables.

Model specification:

$y = \rho W y + X \beta + u$

where $y$ is an $n \times 1$ vector of observations (response), $W$ is an $n \times n$ spatial weights matrix (thus $Wy$ is the spatially-lagged response), $\rho$ is the spatial autoregressive parameter to be estimated, $X$ is an $n \times k$ matrix of observations (exogenous predictors), $\beta$ is a $k \times 1$ vector of parameters (coefficients) to be estimated, and $u$ is an $n \times 1$ vector of errors.

The errors $u$ follow a spatial autoregressive specification:

$u = \lambda Wu + \epsilon$

where $\lambda$ is a spatial autoregressive parameter to be estimated and $\epsilon$ is the vector of errors.

In [None]:
# GMM estimation of the lag+error combo model, allowing for heteroskedasticity (the _Het variant)
gmc = ps.model.spreg.GM_Combo_Het(y=Y.values, x=X.values, w=W, name_w='queen', name_ds='tracts',
                                  name_x=X.columns.tolist(), name_y=response)
print(gmc.summary)