# Urban Informatics
# Module 11: Inferential Spatial Modeling

Statistical inference is the process of using a sample to *infer* the characteristics of an underlying population (from which this sample was drawn) through estimation and hypothesis testing. Contrast this with descriptive statistics, which focus simply on describing the characteristics of the sample itself.

Common goals of inferential statistics include:

  - parameter estimation and confidence intervals
  - hypothesis testing
  - prediction
  - model selection

**Theory and Models**

"Theories are structures of ideas that explain and interpret facts." -Stephen Jay Gould

To conduct statistical inference, we rely on *statistical models*: sets of assumptions plus mathematical relationships between variables, producing a formal representation of some theory. We are essentially trying to explain the process underlying the generation of our data. What is the probability distribution (the probabilities of occurrence of different possible outcome values of our response variable)?

***Spatial* inference** introduces explicit spatial relationships into the statistical modeling framework, as both theory-driven issues (e.g., spatial spillovers) and data-driven issues (e.g., the modifiable areal unit problem, or MAUP) could otherwise violate modeling assumptions.

Schools of statistical inference:

  - frequentist
    - frequentists think of probability as proportion of time some outcome occurs (relative frequency)
    - given lots of repeated trials, how likely is the observed outcome?
    - concepts: statistical hypothesis testing, *p*-values, confidence intervals
  - Bayesian
    - Bayesians think of probability as the degree of certainty an observer has about an outcome occurring (subjective probability)
    - probability as a measure of how much information the observer has about the real world, updated as that information changes
    - concepts: prior probability, likelihood, Bayes' rule, posterior probability
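
To make the contrast concrete, here is a minimal sketch on simulated coin flips (not the tract data). The true success probability of 0.3 and the flat Beta(1, 1) prior are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# simulate 1,000 Bernoulli trials with an assumed true success probability of 0.3
flips = rng.random(1000) < 0.3
k, n = int(flips.sum()), flips.size

# frequentist view: probability as the relative frequency of success
p_hat = k / n

# Bayesian view: start from a flat Beta(1, 1) prior and update it with the data
# (the Beta prior is conjugate to the binomial likelihood, so the posterior is also Beta)
prior_a, prior_b = 1, 1
posterior = stats.beta(prior_a + k, prior_b + n - k)
post_mean = posterior.mean()
lower, upper = posterior.interval(0.95)  # 95% credible interval

print(f'relative frequency: {p_hat:0.3f}')
print(f'posterior mean: {post_mean:0.3f} ({lower:0.3f} to {upper:0.3f})')
```

With this much data and a flat prior, the two answers nearly coincide. They diverge more when the sample is small or the prior is informative.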

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pysal as ps
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from statsmodels.tools.tools import add_constant

np.random.seed(0)
%matplotlib inline

In [None]:
# load the data
tracts_ca = gpd.read_file('data/census_tracts_data.geojson').set_index('index')

In [None]:
# LA, ventura, orange counties only (and drop offshore island tracts)
islands = ['06037599100', '06037599000', '06111980000', '06111990100', '06111003612']
tracts = tracts_ca[tracts_ca['COUNTYFP'].isin(['037', '059', '111'])].drop(index=islands)
tracts.shape

In [None]:
# project spatial geometries from lat-long to a meter-based projection for SoCal
utm_ca = '+proj=utm +zone=11 +ellps=WGS84 +datum=WGS84 +units=m +no_defs'
tracts = tracts.to_crs(utm_ca)

In [None]:
# map the data
ax = tracts.plot()

In [None]:
tracts.columns

## 1. Statistical inference: introduction

### 1a. Estimating population parameters

Here, our population = all tracts in Orange, Ventura, and LA counties

In [None]:
# descriptive stat of the population: average tract-level median income
tracts['med_household_income'].mean()

In [None]:
# descriptive stat of a simple random sample
n = 500
sample = tracts['med_household_income'].sample(n)
sample.mean()

How similar is our sample mean to our population mean? Is it a good estimate? We have calculated a "point estimate" of the population mean. Let's calculate an "interval estimate" of it instead.

In [None]:
# calculate confidence interval using t-distribution (bc population std dev is unknown)
sample = sample.dropna() #drop nulls
conf = 0.95 #confidence level
df = len(sample) - 1 #degrees of freedom
mean = sample.mean() #the sample's mean
sem = stats.sem(sample) #the standard error of the mean
lower, upper = stats.t.interval(conf, df, loc=mean, scale=sem)

# calculate the margin of error
moe = upper - sample.mean()

# display confidence interval
print(f'{lower:0.0f} – {upper:0.0f} ({conf*100:0.0f}% confidence interval)')
print(f'{mean:0.0f} ± {moe:0.0f} (at {conf*100:0.0f}% confidence level)')

We are 95% confident that this interval contains the true population parameter value. That is, if we were to repeat this process many times (sampling then computing CI), on average 95% of the CIs would contain the true population parameter value (and 5% wouldn't).
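
The repeated-sampling interpretation can be checked by simulation. The sketch below uses a synthetic, lognormally distributed "population" (an assumption standing in for the tract incomes) and counts how often the interval captures the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# synthetic right-skewed population standing in for tract-level incomes (assumed, not real data)
population = rng.lognormal(mean=11, sigma=0.5, size=20000)
true_mean = population.mean()

n, conf, trials = 500, 0.95, 1000
hits = 0
for _ in range(trials):
    # draw a simple random sample and compute its t-based confidence interval
    s = rng.choice(population, size=n, replace=False)
    lower, upper = stats.t.interval(conf, len(s) - 1, loc=s.mean(), scale=stats.sem(s))
    hits += int(lower <= true_mean <= upper)

# share of intervals that contain the true population mean
coverage = hits / trials
print(f'{coverage:0.1%} of intervals contain the true mean')
```

The coverage should land close to the nominal 95%, though it will not match exactly in any finite number of trials.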

In [None]:
# now it's your turn
# try different sample sizes and confidence levels: how do these change the confidence interval's size?


In [None]:
# now it's your turn
# randomly sample 100 tract-level median home values then calculate the mean and 99% confidence interval


### 1b. *t*-tests: difference in means

Is the difference between two groups statistically significant?

In [None]:
# choose a variable
var = 'med_home_value'

In [None]:
# create two data subsets
black_tracts = tracts[tracts['pct_black'] > 50]
group1 = black_tracts[var]
hispanic_tracts = tracts[tracts['pct_hispanic'] > 50]
group2 = hispanic_tracts[var]

In [None]:
# what are the probability distributions of these two data sets?
fig, ax = plt.subplots()
ax = group1.plot.kde(ls='--', c='k', alpha=0.5, lw=2, bw_method=0.7, ax=ax)
ax = group2.plot.kde(ls='-', c='k', alpha=0.5, lw=2, bw_method=0.7, ax=ax)
ax.set_xlim(left=0)
ax.set_ylim(bottom=0)
plt.show()

In [None]:
print(int(group1.mean()))
print(int(group2.mean()))

In [None]:
# calculate difference in means
diff = group1.mean() - group2.mean()
diff

In [None]:
# compute the t-stat and its p-value
t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=False, nan_policy='omit')
p_value

In [None]:
# is the difference in means statistically significant?
alpha = 0.05 #significance level
p_value < alpha
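
Setting `equal_var=False` requests Welch's *t*-test, which does not assume the two groups share a variance. As a sketch of what that computes, the following reproduces the statistic and *p*-value by hand on two simulated groups (the group sizes, means, and spreads are made up for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# two hypothetical groups with different sizes and variances
a = rng.normal(500_000, 150_000, size=80)
b = rng.normal(450_000, 90_000, size=300)

# squared standard error of each group's mean
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t-statistic and the Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# two-tailed p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# compare against scipy's implementation
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```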

In [None]:
# now it's your turn
# what is the difference in mean tract-level median home values in majority white vs majority black tracts?
# is it statistically significant?
# what if you randomly sample just 25 tracts from each group: is their difference significant?


## 2. Statistical models

Introduction to OLS linear regression.

A full course on regression covers far more than we can in today's quick overview. But in general you'd want to:

  - specify a model (or alternative models) based on theory
  - inspect candidate predictors' relationships with the response
  - inspect the predictors' relationships with each other (and reduce multicollinearity)
  - transform predictors for better linearity
  - identify and handle outlier observations
  - regression diagnostics

### 2a. Simple (bivariate) linear regression

OLS regression with a single predictor

In [None]:
# choose a response variable and drop any rows in which it is null
response = 'med_home_value'
tracts = tracts.dropna(subset=[response])

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = 'med_household_income'
X = tracts[predictors].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a simple linear regression model with scipy
m, b, r, p, se = stats.linregress(x=X, y=y)
print(f'm={m:.4f}, b={b:.4f}, r^2={r ** 2:.4f}, p={p:.4f}')

In [None]:
# estimate a simple linear regression model with statsmodels
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

This single predictor explains about half the variation of the response. To explain more, we need more predictors.
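
For intuition about what both libraries compute, here is a sketch that fits the same kind of model by least squares directly and derives $R^2$ from the residuals. The data are simulated (the coefficients and noise level are arbitrary assumptions), so the numbers say nothing about the tracts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# simulated predictor and response with an assumed linear relationship plus noise
x = rng.normal(70_000, 25_000, size=200)
y = 100_000 + 6 * x + rng.normal(0, 150_000, size=200)

# design matrix with a constant column, then solve the least-squares problem
A = np.column_stack([np.ones_like(x), x])
(b, m), *_ = np.linalg.lstsq(A, y, rcond=None)

# R-squared: share of the response's variance explained by the fitted line
resid = y - (b + m * x)
r_squared = 1 - resid.var() / y.var()

# scipy's linregress should agree on slope, intercept, and r^2
res = stats.linregress(x, y)
print(f'm={m:.4f}, b={b:.4f}, r^2={r_squared:.4f}')
print(f'm={res.slope:.4f}, b={res.intercept:.4f}, r^2={res.rvalue ** 2:.4f}')
```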

### 2b. Multiple regression

OLS regression with multiple predictors

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = ['med_household_income', 'pct_white']
X = tracts[predictors].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

#### Now add in more variables...

In [None]:
tracts.columns

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = ['pct_white', 'pct_built_before_1940', 'med_rooms_per_home', 'pct_bachelors_degree']
X = tracts[predictors].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

In [None]:
# now it's your turn
# try different sets of predictors to increase R-squared while keeping the total number of predictors relatively low and theoretically sound


### 2c. Standardized regression

*Beta coefficients* are the estimated regression coefficients when the response and predictors are standardized so that their variances equal 1. Thus, we can interpret these coefficients as how many standard deviations the response changes for each standard deviation increase in the predictor. Because this removes the variables' different units/scales of measurement, it tells us about "effect size": which predictors have greater effects on the response. However, it relies on the variables' distributions having similar shapes (otherwise the meaning of a std dev in one will differ from a std dev in another).

In [None]:
# estimate a standardized regression model
y_stdrd = pd.Series(stats.mstats.zscore(y), index=y.index, name=y.name)
X_stdrd = pd.DataFrame(stats.mstats.zscore(X), index=X.index, columns=X.columns)
Xc_stdrd = add_constant(X_stdrd)
model_stdrd = sm.OLS(y_stdrd, Xc_stdrd)
result_stdrd = model_stdrd.fit()
print(result_stdrd.summary())
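
Standardizing does not change the model's fit, only its units: each beta coefficient equals the raw coefficient multiplied by the ratio of the predictor's standard deviation to the response's. A minimal numpy sketch on simulated data (the `ols` helper and all values are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# two simulated predictors on very different scales, plus a noisy response
X = np.column_stack([rng.normal(50, 10, n), rng.normal(5, 1, n)])
y = 3 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 15, n)

def ols(X, y):
    """Return OLS coefficients (intercept first) via least squares."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# raw slopes, in the variables' original units
b = ols(X, y)[1:]

# slopes after z-scoring the response and the predictors
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
beta = ols(Xz, yz)[1:]

# the same betas recovered by rescaling the raw slopes
beta_from_raw = b * X.std(axis=0) / y.std()
print(beta)
print(beta_from_raw)
```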

### 2d. Diagnostics

Let's take a step back and think about some of the steps we might take prior to specifying the model.

In [None]:
# correlation matrix
# how well are predictors correlated with response... and with each other?
correlations = tracts[[response] + sorted(predictors)].corr()
correlations.round(2)

In [None]:
# visual correlation matrix via seaborn heatmap
# use vmin, vmax, center to set colorbar scale properly
sns.set(style='white')
ax = sns.heatmap(correlations, vmin=-1, vmax=1, center=0,
                 cmap=plt.cm.coolwarm, square=True, linewidths=1)
ax.set_ylim(plt.ylim()[0] + 0.5, plt.ylim()[1] - 0.5) #fix temporary mpl bug
plt.show()

In [None]:
# plot pairwise relationships with seaborn
grid = sns.pairplot(tracts[[response] + sorted(predictors)], markers='.')
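
Pairwise correlations can miss collinearity that involves several predictors at once. The variance inflation factor (VIF) addresses this by regressing each predictor on all the others: $VIF_j = 1 / (1 - R_j^2)$. The notebook imports statsmodels' `variance_inflation_factor` as `vif`; the sketch below instead computes the quantity by hand with numpy on simulated predictors, and `vif_scores` is a hypothetical helper name.

```python
import numpy as np

def vif_scores(X):
    """Compute the VIF of each column of X by regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # regress column j on a constant plus all remaining columns
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = ((target - fitted) ** 2).sum()
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        scores.append(1 / (1 - r2))
    return np.array(scores)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.3, size=500)  # strongly correlated with a
c = rng.normal(size=500)                 # unrelated to a and b

scores = vif_scores(np.column_stack([a, b, c]))
print(scores.round(2))
```

A VIF near 1 indicates little collinearity. Common rules of thumb treat values above roughly 5 or 10 as a warning sign, but these thresholds are conventions rather than tests.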

## 3. Spatial models

Basic types:

  - **Spatial heterogeneity**: account for systematic differences across space without explicitly modeling interdependency (non-spatial estimation)
    - spatial fixed effects (intercept varies for each spatial group)
    - spatial regimes (intercept and coefficients vary for each spatial group)
  - **Spatial dependence**: model interdependencies between observations through space
    - spatial lag model (spatially-lagged endogenous variable added as predictor; because of endogeneity, cannot use OLS to estimate)
    - spatial error model (spatial effects in error term)
    - spatial lag+error combo model

In section 3, we'll look at models that account for spatial heterogeneity. In section 4, we'll look at models that account for spatial dependence.

### 3a. Spatial fixed effects

Using dummy variables representing the counties into which our observations (tracts) are nested

In [None]:
# create a new dummy variable for each county, with 1 if tract is in this county and 0 if not
for county in tracts['COUNTYFP'].unique():
    new_col = f'dummy_county_{county}'
    tracts[new_col] = (tracts['COUNTYFP'] == county).astype(int)

In [None]:
# remove one dummy to prevent perfect collinearity with the constant term
# ie, the full set of dummies sums to 1 in every row, duplicating the intercept column
county_dummies = [f'dummy_county_{county}' for county in tracts['COUNTYFP'].unique()]
county_dummies = county_dummies[:-1]
county_dummies
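
To see why one dummy must be dropped, compare the rank of the design matrix with and without the full set. This sketch uses a tiny made-up vector of county codes rather than the tracts.

```python
import numpy as np

# hypothetical county codes for 8 observations
county = np.array(['037', '059', '111', '037', '059', '111', '037', '037'])
levels = np.unique(county)

# one 0/1 dummy column per county
dummies = (county[:, None] == levels[None, :]).astype(float)
const = np.ones((len(county), 1))

# constant + all dummies: the dummy columns sum to the constant column
full = np.hstack([const, dummies])

# constant + all but one dummy: the dropped county becomes the reference category
reduced = np.hstack([const, dummies[:, :-1]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # more columns than rank: rank deficient
print(reduced.shape[1], rank_reduced)  # columns equal rank: full column rank
```

With the reduced matrix, each remaining dummy's coefficient is interpreted relative to the dropped county.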

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
X = tracts[predictors + county_dummies].dropna()
y = tracts.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

### 3b. Spatial regimes

Each spatial regime can have different model coefficients. Here, the regimes are our 3 counties. This subsection just uses OLS for estimation, but you can also combine spatial regimes with spatial autoregression models (the latter is introduced later).

In [None]:
# create design matrix containing predictors (drop nulls), a response variable matrix, and a regimes vector
X = tracts[predictors].dropna() #only take rows with non-null observations
Y = tracts.loc[X.index][[response]] #notice this is a matrix (not a vector) this time for pysal
regimes = tracts.loc[X.index]['COUNTYFP'] #define the regimes
regimes.sample(5)

In [None]:
# estimate spatial regimes model with OLS
olsr = ps.model.spreg.OLS_Regimes(y=Y.values, x=X.values, regimes=regimes.values, name_regimes='county',
                                  name_x=X.columns.tolist(), name_y=response, name_ds='tracts')
print(olsr.summary)

## 4. Spatial Dependence

### 4a. Spatial diagnostics

So far we've seen two spatial heterogeneity models. Now we'll explore spatial dependence, starting by using queen-contiguity spatial weights to model spatial relationships between observations and OLS to check diagnostics.

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable matrix
X = tracts[predictors].dropna()
Y = tracts.loc[X.index][[response]] #notice this is a matrix this time for pysal

In [None]:
# compute spatial weights from tract geometries (but only those tracts that appear in design matrix!)
W = ps.lib.weights.Queen.from_dataframe(tracts.loc[X.index])
W.transform = 'r'
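
The `'r'` transform row-standardizes the weights so that each row sums to 1, which makes the spatial lag $Wy$ the *average* of each observation's neighbors. A toy numpy sketch with four units arranged in a line (a made-up layout, not the tract geometries):

```python
import numpy as np

# binary contiguity for 4 units in a line: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# row-standardize: divide each row by its number of neighbors
W = A / A.sum(axis=1, keepdims=True)

# spatial lag: each unit's value becomes the mean of its neighbors' values
y = np.array([10.0, 20.0, 30.0, 40.0])
lag = W @ y
print(lag)  # [20. 20. 30. 30.]
```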

In [None]:
# compute OLS spatial diagnostics to check the nature of spatial dependence
ols = ps.model.spreg.OLS(y=Y.values, x=X.values, w=W, spat_diag=True, moran=True)

In [None]:
# calculate moran's I (for the response) and its significance
mi = ps.explore.esda.Moran(y=Y, w=W, two_tailed=True)
print(mi.I)
print(mi.p_sim)

In [None]:
# moran's I (for the residuals): moran's i, standardized i, p-value
ols.moran_res
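
As a sketch of what Moran's *I* measures, the statistic can be computed directly as $I = \frac{n}{S_0} \frac{z^\top W z}{z^\top z}$, where $z$ is the mean-centered variable and $S_0$ is the sum of all weights. The example below uses a hypothetical 5×5 grid with rook contiguity and a smooth north-south gradient, then builds a one-sided pseudo *p*-value from random permutations; `morans_i` is an illustrative helper, not part of pysal.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I for values y and an n x n spatial weights matrix W."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

# rook-contiguity weights on a 5x5 grid
side = 5
n = side * side
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        i = r * side + c
        if r + 1 < side:  # neighbor in the next row
            W[i, i + side] = W[i + side, i] = 1
        if c + 1 < side:  # neighbor in the next column
            W[i, i + 1] = W[i + 1, i] = 1
W = W / W.sum(axis=1, keepdims=True)  # row-standardize

# a smooth gradient: every cell's value is its row number
y = np.repeat(np.arange(side), side).astype(float)
i_obs = morans_i(y, W)

# permutation test: shuffle values across locations to simulate spatial randomness
rng = np.random.default_rng(0)
perms = np.array([morans_i(rng.permutation(y), W) for _ in range(999)])
p_sim = (np.sum(perms >= i_obs) + 1) / (999 + 1)

print(round(i_obs, 2))  # 0.84
print(p_sim)
```

Values near +1 indicate that similar values cluster together, values near -1 indicate a checkerboard-like pattern, and values near the expectation of $-1/(n-1)$ suggest spatial randomness.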

#### Interpreting the results

A significant Moran's *I* suggests spatial autocorrelation, but doesn't tell us which alternative specification should be used. Lagrange Multiplier (LM) diagnostics can help with that. If one LM test is significant and the other isn't, then that tells us which model specification (spatial lag vs spatial error) to use:

In [None]:
# lagrange multiplier test for spatial lag model: stat, p
ols.lm_lag

In [None]:
# lagrange multiplier test for spatial error model: stat, p
ols.lm_error

#### Interpreting the results

If (and only if) both the LM tests produce significant statistics, try the robust versions (the nonrobust LM tests are sensitive to each other):

In [None]:
# robust lagrange multiplier test for spatial lag model: stat, p
ols.rlm_lag

In [None]:
# robust lagrange multiplier test for spatial error model: stat, p
ols.rlm_error

#### So... which model specification to choose?

If neither LM test is significant: use regular OLS.

If only one LM test is significant: use that model spec.

If both LM tests are significant: run robust versions.

If only one robust LM test is significant: use that model spec.

If both robust LM tests are significant (this can often happen with large sample sizes):

  - first consider if the initial model specification is actually a good fit
  - if so, use the spatial specification corresponding to the larger robust-LM statistic
  - or consider a combo model
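
The decision rules above can be sketched as a small function. `choose_spec` is a hypothetical helper, and it assumes each diagnostic is supplied as a `(statistic, p-value)` pair, matching the "stat, p" order noted in the cells above. The rules do not cover the case where both LM tests are significant but neither robust test is, so the sketch returns `'inconclusive'` there.

```python
def choose_spec(lm_lag, lm_error, rlm_lag, rlm_error, alpha=0.05):
    """Suggest a model spec from (statistic, p-value) pairs of LM diagnostics."""
    lag_sig = lm_lag[1] < alpha
    err_sig = lm_error[1] < alpha

    # neither LM test is significant: stay with regular OLS
    if not lag_sig and not err_sig:
        return 'ols'

    # only one LM test is significant: use that specification
    if lag_sig != err_sig:
        return 'lag' if lag_sig else 'error'

    # both LM tests are significant: consult the robust versions
    rlag_sig = rlm_lag[1] < alpha
    rerr_sig = rlm_error[1] < alpha
    if rlag_sig != rerr_sig:
        return 'lag' if rlag_sig else 'error'
    if not rlag_sig and not rerr_sig:
        return 'inconclusive'  # not covered by the rules above

    # both robust tests are significant: after rechecking the base model's fit,
    # pick the larger robust statistic (a combo model is the alternative)
    return 'lag' if rlm_lag[0] > rlm_error[0] else 'error'

# made-up diagnostic values for illustration
print(choose_spec((12.0, 0.001), (15.0, 0.0001), (1.0, 0.32), (4.5, 0.03)))  # error
```

With the fitted diagnostics from this notebook, the call would look like `choose_spec(ols.lm_lag, ols.lm_error, ols.rlm_lag, ols.rlm_error)`.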

### 4b. Spatial lag model

When the diagnostics indicate the presence of a spatial diffusion process.

Model specification:

$y = \rho W y + X \beta + u$

where $y$ is an $n \times 1$ vector of observations (response), $W$ is an $n \times n$ spatial weights matrix (thus $Wy$ is the spatially-lagged response), $\rho$ is the spatial autoregressive parameter to be estimated, $X$ is an $n \times k$ matrix of observations (exogenous predictors), $\beta$ is a $k \times 1$ vector of parameters (coefficients) to be estimated, and $u$ is an $n \times 1$ vector of errors.
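
Because $y$ appears on both sides, solving for it gives the reduced form $y = (I - \rho W)^{-1}(X \beta + u)$. The sketch below simulates data from this process on a hypothetical ring of six units, with arbitrary values assumed for $\rho$ and $\beta$. It shows that every observation depends on every other observation's predictors and errors, which is why $Wy$ is endogenous and OLS is not appropriate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 6, 0.5

# ring layout: each unit neighbors the units on either side of it
A = np.zeros((n, n))
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1
W = A / A.sum(axis=1, keepdims=True)  # row-standardize

# simulated predictors (with a constant), assumed coefficients, and errors
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
u = rng.normal(scale=0.1, size=n)

# reduced form: y = (I - rho W)^{-1} (X beta + u)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + u)

# the spatial multiplier spreads each unit's shock across all units
multiplier = np.linalg.inv(np.eye(n) - rho * W)
print(multiplier.round(3))
```

With row-standardized weights, each row of the multiplier sums to $1 / (1 - \rho)$, so a change that hits every unit equally is amplified by that factor once spillovers play out.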

In [None]:
# maximum-likelihood estimation with full matrix expression
mll = ps.model.spreg.ML_Lag(y=Y.values, x=X.values, w=W, method='full', name_w='queen',
                            name_x=X.columns.tolist(), name_y=response, name_ds='tracts')
print(mll.summary)

In [None]:
# the spatial autoregressive parameter estimate, rho
mll.rho

### 4c. Spatial error model

When the diagnostics indicate the presence of spatial error dependence.

Model specification:

$y = X \beta + u$

where $X$ is an $n \times k$ matrix of observations (exogenous predictors), $\beta$ is a $k \times 1$ vector of parameters (coefficients) to be estimated, and $u$ is an $n \times 1$ vector of errors. The errors $u$ follow a spatial autoregressive specification:

$u = \lambda Wu + \epsilon$

where $\lambda$ is a spatial autoregressive parameter to be estimated and $\epsilon$ is an $n \times 1$ vector of independent (spatially uncorrelated) errors.

In [None]:
# maximum-likelihood estimation with full matrix expression
mle = ps.model.spreg.ML_Error(y=Y.values, x=X.values, w=W, method='full', name_w='queen',
                              name_x=X.columns.tolist(), name_y=response, name_ds='tracts')
print(mle.summary)

In [None]:
# the spatial autoregressive parameter estimate, lambda
mle.lam

### 4d. Spatial lag+error combo model

Estimated with GMM (generalized method of moments). Essentially a spatial error model with endogenous explanatory variables.

Model specification:

$y = \rho W y + X \beta + u$

where $y$ is an $n \times 1$ vector of observations (response), $W$ is an $n \times n$ spatial weights matrix (thus $Wy$ is the spatially-lagged response), $\rho$ is the spatial autoregressive parameter to be estimated, $X$ is an $n \times k$ matrix of observations (exogenous predictors), $\beta$ is a $k \times 1$ vector of parameters (coefficients) to be estimated, and $u$ is an $n \times 1$ vector of errors.

The errors $u$ follow a spatial autoregressive specification:

$u = \lambda Wu + \epsilon$

where $\lambda$ is a spatial autoregressive parameter to be estimated and $\epsilon$ is an $n \times 1$ vector of independent (spatially uncorrelated) errors.

In [None]:
# GMM estimation of the spatial lag+error combo model (heteroskedasticity-robust variant)
gmc = ps.model.spreg.GM_Combo_Het(y=Y.values, x=X.values, w=W, name_w='queen', name_ds='tracts',
                                  name_x=X.columns.tolist(), name_y=response)
print(gmc.summary)

In [None]:
# now it's your turn
# with a new set of predictors, compute spatial diagnostics and estimate a new spatial model accordingly
