# Introduction to Spatial Regression in Python
Author: Zach Schira

Regression analysis allows you to model and predict a process based on its relationship to one or more independent (explanatory) variables. Often, however, a standard regression model is insufficient for data with a spatial dependency. This happens when the data are spatially autocorrelated, meaning that observations near each other tend to have similar values, which can violate the assumption of independent errors in standard regression. In these cases, you may want to use a spatial regression model instead.

Spatial regression analysis allows you to incorporate spatial dependencies into your model. This tutorial outlines how to apply these spatial regression methods with the [PySAL](https://pypi.python.org/pypi/PySAL) Python package.

## Objectives
- Perform ordinary linear regression analysis
- Measure the spatial autocorrelation of a dataset
- Perform spatial regression analysis

## Dependencies
- PySAL
- numpy

In [None]:
# This tutorial uses the PySAL 1.x API (pysal.open, pysal.spreg), so pin the version
!pip install "pysal<2.0"

In [59]:
import pysal
import numpy as np
from pysal.spreg import ols
from pysal.spreg import ml_error
from pysal.spreg import ml_lag

The pysal package contains many sample data files that can be used to demonstrate the package's abilities. For this example we will analyze home values in Columbus in relation to income and crime by neighborhood. First we will run an [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) linear regression model to analyze the relationship between these variables.

This first section of code reads the home values (dependent variable) into an array `y` and the income and crime values (independent variables) into a two-dimensional array `X`.

In [24]:
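# Open the attribute table of the Columbus sample data
f = pysal.open(pysal.examples.get_path("columbus.dbf"), 'r')
# Dependent variable: home values as an (n, 1) column vector
y = np.array(f.by_col['HOVAL']).reshape(-1, 1)
# Independent variables: income and crime as the columns of an (n, 2) array
X = np.array([f.by_col['INC'], f.by_col['CRIME']]).T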


Now that we have stored the values we are analyzing, we can fit the least squares model. This is done with the [pysal.spreg](http://pysal.readthedocs.io/en/v1.11.0/library/spreg/index.html) module. Our instance of `OLS`, named `ls`, has many useful attributes for reviewing the fit. In this case, we will use `ls.summary` to obtain a full summary of the results; for more specific results, see the other attributes described on the `pysal.spreg` page linked above.

In [43]:
ls = ols.OLS(y, X, name_y = 'home val', name_x = ['Income', 'Crime'], name_ds = 'Columbus')
print(ls.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :    Columbus
Weights matrix      :        None
Dependent Variable  :    home val                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
R-squared           :      0.3495
Adjusted R-squared  :      0.3212
Sum squared residual:   10647.015                F-statistic           :     12.3582
Sigma-square        :     231.457                Prob(F-statistic)     :   5.064e-05
S.E. of regression  :      15.214                Log likelihood        :    -201.368
Sigma-square ML     :     217.286                Akaike info criterion :     408.735
S.E of regression ML:     14.7406                Schwarz criterion     :     414.411

-----------------------------------------------------------------------------

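Looking at these results, we see that the [sum of squared residuals](https://en.wikipedia.org/wiki/Residual_sum_of_squares) is quite high (10647.015) and the R-squared is only 0.3495, so income and crime explain roughly 35% of the variance in home values. We want a model that fits our data better.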
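
To make the quantities in the summary concrete, here is a minimal sketch of how a least squares fit, its residual sum of squares, and its R-squared can be computed with plain numpy. It is not PySAL's implementation. The helper `ols_fit` and the synthetic `X_demo` and `y_demo` arrays are made up for illustration, and the sketch assumes a constant term is added to the regressors (the summary above reports three variables for two regressors, which suggests `OLS` does the same).

```python
import numpy as np

def ols_fit(y, X):
    """Least squares fit of y on X plus a constant (hypothetical helper).

    Returns the coefficients, the residual sum of squares and R-squared.
    """
    n = X.shape[0]
    Xc = np.hstack([np.ones((n, 1)), X])           # prepend a constant column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)  # minimizes ||y - Xc @ beta||^2
    resid = y - Xc @ beta
    rss = float((resid ** 2).sum())                # sum of squared residuals
    tss = float(((y - y.mean()) ** 2).sum())       # total sum of squares
    return beta, rss, 1.0 - rss / tss

# Synthetic data for illustration only; these are not the Columbus values
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(49, 2))
y_demo = 3.0 + X_demo @ np.array([[2.0], [-1.0]]) + rng.normal(scale=0.5, size=(49, 1))

beta_demo, rss_demo, r2_demo = ols_fit(y_demo, X_demo)
print(beta_demo.ravel(), rss_demo, r2_demo)
```

Because the residual sum of squares depends on the units of `y` and on the number of observations, R-squared is usually the easier of the two numbers to interpret on its own.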

Now we will check whether there is a spatial dependency in this data using a [Moran's I](https://en.wikipedia.org/wiki/Moran%27s_I) test. Our first step in that process is to create a spatial weights matrix. PySAL's example data includes a GAL file that we can read directly to create this matrix.

In [38]:
w = pysal.open(pysal.examples.get_path("columbus.gal")).read()
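
A spatial weights matrix records which observations count as neighbors of each other. The sketch below builds one by hand for a small regular grid, which may help when reading about weights objects such as `w`. It is illustrative only: `rook_weights` and `row_standardize` are hypothetical helpers, not PySAL functions, and the Columbus neighborhoods are irregular polygons rather than a grid.

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Binary weights for a regular grid: cells sharing an edge are neighbors."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if r + 1 < nrows:                  # neighbor directly below
                j = (r + 1) * ncols + c
                W[i, j] = W[j, i] = 1.0
            if c + 1 < ncols:                  # neighbor directly to the right
                j = r * ncols + (c + 1)
                W[i, j] = W[j, i] = 1.0
    return W

def row_standardize(W):
    """Scale each row to sum to 1; rows without neighbors stay all zero."""
    sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, sums, out=np.zeros_like(W), where=sums > 0)

W_grid = rook_weights(3, 3)
W_std = row_standardize(W_grid)
print(W_grid.sum(axis=1))   # neighbors per cell: corners 2, edges 3, center 4
```

Row standardization is a common choice because the product of the standardized matrix with a variable then gives the average of that variable over each observation's neighbors.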

Now we can create our instance of `Moran` using the data and the weights matrix. Looking at these results, we see that the observed value of I (about 0.18) is well above the value we would expect if there were no spatial dependency (about -0.02). This leads to a low p-value (about 0.015), so we reject the null hypothesis of no spatial autocorrelation at the 5% level and conclude that these home values are spatially dependent.

In [39]:
mi = pysal.Moran(y, w, two_tailed=False)
print('Observed I:', mi.I, '\nExpected I:', mi.EI, '\n   p-value:', mi.p_norm)

Observed I: 0.180093114317 
Expected I: -0.020833333333333332 
   p-value: 0.0149554171455


Now that we have found that there is spatial autocorrelation in this data, we can use a spatial regression model to see if that better fits our data. 

The `spreg` module has several different functions for creating a spatial regression model. In this example, we will use a spatial error model, but the implementation of a spatial lag model is almost identical. 
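
For reference, the two model families are commonly written as follows. This is the standard textbook notation rather than anything taken from the PySAL source: $W$ is the spatial weights matrix, $\varepsilon$ is independent noise, and the symbols $\rho$ (spatial lag coefficient) and $\lambda$ (spatial error coefficient) are introduced here for illustration.

```latex
\begin{aligned}
\text{Spatial lag:}   \quad & y = \rho W y + X\beta + \varepsilon \\
\text{Spatial error:} \quad & y = X\beta + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
```

In the lag model the neighbors' values of the dependent variable enter the regression directly, while in the error model the spatial dependence sits in the unexplained part of the model. Setting $\rho = 0$ or $\lambda = 0$ recovers ordinary least squares.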

To compare our spatial regression model to our original linear regression model, we can use the [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion#How_to_apply_AIC_in_practice). The AIC estimates the relative amount of information lost by a model, so it gives a quantitative way to compare candidate models fit to the same data. The lower the AIC, the better the model. As you can see in the following output, the AIC for the spatial error model (405.537) is lower than that of the original OLS model (408.735), meaning the spatial error model fits our data better by this criterion.
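
As a quick check on where these numbers come from, the sketch below applies the usual definitions, AIC = 2k - 2 ln L and Schwarz criterion = k ln(n) - 2 ln L, to the log likelihoods reported in the two summaries in this tutorial. The functions `aic` and `schwarz` are hypothetical helpers, and k = 3 (constant, income, crime) is an assumption inferred from the fact that it reproduces the reported values to within rounding.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood

def schwarz(log_likelihood, k, n):
    """Schwarz (Bayesian) information criterion for n observations."""
    return k * math.log(n) - 2 * log_likelihood

# Log likelihoods copied from the OLS and spatial error summaries
ols_loglik, err_loglik = -201.368, -199.769
k, n = 3, 49   # assumed parameter count and number of observations

ols_aic = aic(ols_loglik, k)
err_aic = aic(err_loglik, k)
print(ols_aic, err_aic)          # compare with 408.735 and 405.537
print(ols_aic - err_aic)         # positive: the spatial error model is preferred
print(schwarz(ols_loglik, k, n), schwarz(err_loglik, k, n))
```

With k held equal, the difference in AIC comes entirely from the improvement in log likelihood.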

In [60]:
spat_err = ml_error.ML_Error(y, X, w, name_y='home value', name_x=['income','crime'], name_w='columbus.gal', name_ds='columbus')
print(spat_err.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL)
-------------------------------------------------------------------
Data set            :    columbus
Weights matrix      :columbus.gal
Dependent Variable  :  home value                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
Pseudo R-squared    :      0.3495
Sigma-square ML     :     197.314                Log likelihood        :    -199.769
S.E of regression   :      14.047                Akaike info criterion :     405.537
                                                 Schwarz criterion     :     411.213

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
-----------------------------------------------------------

