**COURSE**: Geospatial Analysis, Department of Geosciences and Environment, Universidad Nacional de Colombia, Medellín campus <br/>
**Professor**: Edier Aristizábal (evaristizabalg@unal.edu.co) <br />
**Credits**: The content of this notebook is taken from the [Spatial Data Science for sustainable development tutorial](https://sustainability-gis.readthedocs.io/en/latest/lessons/L4/spatial_regression.html#spatially-lagged-endogenous-regressors-wy).

# Spatial regression

## Libraries

In [2]:
from pysal.model import spreg
from pysal.lib import weights
from pysal.explore import esda
from scipy import stats
import statsmodels.formula.api as sm
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import osmnx as ox
from pyrosm import OSM, get_data
sns.set(style="whitegrid")

In [7]:
!pip install hvplot


In [12]:
import hvplot.pandas

## Data

Let’s read the Airbnb data and OSM data for Austin, Texas:

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/daniel-codes/airbnb-austin-tx/master/listings_austin.csv")
data.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms',

In [4]:
# Read OSM data - get administrative boundaries

# define the place query
query = {'city': 'Austin'}

# get the boundaries of the place (add additional buffer around the query)
boundaries = ox.geocode_to_gdf(query, buffer_dist=5000)

# Let's check the boundaries on a map
boundaries.explore()

Let’s convert the Airbnb data into a GeoDataFrame based on the longitude and latitude columns and filter the data geographically based on the Austin boundaries:

In [5]:
# Create a GeoDataFrame
data["geometry"] = gpd.points_from_xy(data["longitude"], data["latitude"])
data = gpd.GeoDataFrame(data, crs="epsg:4326")

# Filter geographically
data = gpd.sjoin(data, boundaries[["geometry"]])
data = data.reset_index(drop=True)

# Check the first rows
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,geometry,index_right
0,2265,https://www.airbnb.com/rooms/2265,20180710171409,2018-07-10,Zen-East in the Heart of Austin,Zen East is situated in a vibrant & diverse mu...,This colorful and clean 1923 house was complet...,Zen East is situated in a vibrant & diverse mu...,none,,...,"{""Texas State""}",f,f,strict_14_with_grace_period,f,f,3,0.19,POINT (-97.71398 30.27750),0
1,5245,https://www.airbnb.com/rooms/5245,20180710171409,2018-07-10,"Green, Colorful, Clean & Cozy home",,Situated in a vibrant & diverse multicultural ...,Situated in a vibrant & diverse multicultural ...,none,,...,"{""Texas State""}",f,f,strict_14_with_grace_period,f,f,3,0.08,POINT (-97.71379 30.27577),0
2,5456,https://www.airbnb.com/rooms/5456,20180710171409,2018-07-10,"Walk to 6th, Rainey St and Convention Ctr",Fabulous location for walking to Convention Ce...,Cute Private Studio apartment located in Willo...,Fabulous location for walking to Convention Ce...,none,My neighborhood is ideally located if you want...,...,"{""Texas State""}",f,f,strict_14_with_grace_period,f,t,1,3.88,POINT (-97.73448 30.26112),0
3,5769,https://www.airbnb.com/rooms/5769,20180710171409,2018-07-10,NW Austin Room,,Looking for a comfortable inexpensive room to ...,Looking for a comfortable inexpensive room to ...,none,Quiet neighborhood with lots of trees and good...,...,"{""Texas State""}",f,f,moderate,t,t,1,2.3,POINT (-97.78370 30.45596),0
4,6413,https://www.airbnb.com/rooms/6413,20180710171409,2018-07-10,Gem of a Studio near Downtown,"Great studio apartment, perfect for couples or...","(License #114332) Large, contemporary studio a...","Great studio apartment, perfect for couples or...",none,Travis Heights is one of the oldest neighborho...,...,"{""Texas State""}",t,f,strict_14_with_grace_period,f,f,1,0.72,POINT (-97.73726 30.24829),0


In [None]:
data.hvplot(geo=True, tiles="OSM", alpha=0.5, width=600, height=600, hover_cols=["name"])

## Non-Spatial regression
Before introducing explicitly spatial methods, we will run a simple linear regression model. On the one hand, this will allow us to set out the main principles of hedonic modeling and how to interpret the coefficients, which is useful because the spatial models build on them. On the other hand, it will provide a baseline model that we can use to evaluate how meaningful the spatial extensions are.

Essentially, the core of a linear regression is to explain a given variable, here the price of a listing on AirBnb ($p_i$), as a linear function of a set of other characteristics we will collectively call $X_i$:

$\ln(p_i)=\alpha + \beta X_i + \epsilon_i$

For several reasons, it is common practice to introduce the price in logarithms, so we will do so here. Additionally, since this is a probabilistic model, we add an error term $\epsilon_i$ that is assumed to be well-behaved (i.i.d. normal).
 
For our example, we will consider the following set of explanatory features of each listed property:

In [17]:
explanatory_vars = ['host_listings_count', 'bathrooms', 'bedrooms', 'beds', 'guests_included']

Additionally, we are going to derive a new feature of a listing from the amenities variable. Let us construct a variable that takes 1 if the listed property has a pool and 0 otherwise:

In [13]:
def has_pool(a):
    if 'Pool' in a:
        return 1
    else:
        return 0
    
data['pool'] = data['amenities'].apply(has_pool)

Let’s calculate the logarithmic value from the price. Let’s first check our values:

In [14]:
data["price"].head()

0    $200.00
1    $125.00
2     $95.00
3     $40.00
4     $99.00
Name: price, dtype: object

As we can see, our values are represented as strings with a dollar sign. Before we can take a logarithmic value out of them, we need to remove the dollar sign and convert the values to floats:

In [15]:
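# Remove the dollar sign and the thousands separator (comma, e.g. 1,000.00) and convert to float.
# regex=False treats "$" as a literal character; as a regex it would be the end-of-string anchor.
data["price"] = data["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
# The small constant avoids log(0) for listings with a price of zero
data["log_price"] = np.log(data["price"] + 0.000001)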
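
To see why the dollar sign has to be treated as a literal character, here is a minimal sketch on a few made-up price strings (the values are hypothetical, not taken from the dataset). It assumes only pandas and NumPy:

```python
import numpy as np
import pandas as pd

# Made-up price strings in the same format as the Airbnb `price` column
raw = pd.Series(["$200.00", "$1,250.00", "$40.00"])

# regex=False makes "$" and "," plain characters rather than regex syntax
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)

# Natural logarithm of the cleaned prices
log_clean = np.log(clean)
print(clean.tolist())
print(log_clean.round(3).tolist())
```

If the first replacement ran with regex=True in a pandas version that interprets single characters as patterns, "$" would match the end of the string, nothing would be removed, and the conversion to float would fail.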

Do we have any missing values in our dependent or explanatory variables?

In [18]:
all_model_attributes = ["price"] + explanatory_vars
has_nans = False
for attr in all_model_attributes:
    if data[attr].hasnans:
        has_nans = True
print("Has missing values:", has_nans)

Has missing values: True


As we can see, there are missing values, so let’s remove them before continuing:

In [19]:
data = data.dropna(subset=all_model_attributes).copy()

To run the model, we can use the spreg module in PySAL, which implements a standard OLS routine, but is particularly well suited for regressions on spatial data. At this point, we are ready to fit the regression:

In [21]:
m1 = spreg.OLS(data[['log_price']].values, data[explanatory_vars].values, 
                  name_y = 'log_price', name_x = explanatory_vars)

To get a quick glimpse of the results, we can print its summary:

In [22]:
print(m1.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:       11166
Mean dependent var  :      5.0455                Number of Variables   :           6
S.D. dependent var  :      1.1277                Degrees of Freedom    :       11160
R-squared           :      0.2898
Adjusted R-squared  :      0.2895
Sum squared residual:   10084.906                F-statistic           :    910.6550
Sigma-square        :       0.904                Prob(F-statistic)     :           0
S.E. of regression  :       0.951                Log likelihood        :  -15275.331
Sigma-square ML     :       0.903                Akaike info criterion :   30562.662
S.E of regression ML:      0.9504                Schwarz criterion     :   30606.586

-----------------------------------------------------------------------------

Results are largely unsurprising, but nonetheless reassuring. Both an extra bedroom and an extra bathroom increase the final price by around 30%. Accounting for those, an extra bed pushes the price up by about 2%. Neither the number of guests included nor the total number of listings the host has shows a significant effect on the final price.
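
Because the dependent variable is in logarithms, a coefficient $\beta$ translates into a percentage change in price of $100\,(e^{\beta}-1)$ for a one-unit increase in the regressor. The sketch below illustrates that conversion. The value 0.265 is the rounded bathrooms coefficient from the spatial lag output shown later in this notebook, used here only as an illustrative number, and 0.02 is a hypothetical coefficient standing in for "about 2%":

```python
import math

def pct_effect(coef):
    """Percent change in price implied by a coefficient in a log-price model."""
    return 100 * (math.exp(coef) - 1)

bath = pct_effect(0.265)  # illustrative coefficient for bathrooms
beds = pct_effect(0.02)   # hypothetical small coefficient

print(f"bathrooms: {bath:.1f}%  beds: {beds:.2f}%")
```

For small coefficients the percentage is close to 100 times the coefficient itself, while for larger ones the exponential correction matters.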

Including a spatial weights object in the regression buys you an extra bit: the summary provides diagnostics for spatial dependence. These are a series of statistics that test whether the residuals of the regression are spatially correlated, against the null of a random distribution over space. If the null is rejected, a key assumption of OLS, independently distributed error terms, is violated. Depending on the structure of the spatial pattern, different strategies have been defined within the spatial econometrics literature to deal with it. The main summary from the diagnostics for spatial dependence is that there is clear evidence to reject the null of spatial randomness in the residuals, hence an explicitly spatial approach is warranted.
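
One widely used statistic for spatial correlation is Moran's I. As a rough illustration of the idea, and not of the exact diagnostics PySAL reports, the sketch below computes Moran's I by hand with NumPy for two sets of synthetic "residuals" on a hypothetical 10 x 10 grid: one with a smooth spatial trend and one where the same values are shuffled at random. The grid, the rook neighbors, and the helper names are all assumptions made for this example:

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Binary rook-contiguity matrix for the cells of a regular grid."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if r + 1 < nrows:   # neighbor in the next row
                W[i, i + ncols] = W[i + ncols, i] = 1.0
            if c + 1 < ncols:   # neighbor in the next column
                W[i, i + 1] = W[i + 1, i] = 1.0
    return W

def morans_i(e, W):
    """Moran's I: (n / sum of weights) * (e'We) / (e'e), with e centered."""
    e = e - e.mean()
    return (len(e) / W.sum()) * (e @ W @ e) / (e @ e)

W = rook_weights(10, 10)
rows, cols = np.divmod(np.arange(100), 10)

# Residuals with a smooth trend across the grid: similar values sit together
clustered = (rows + cols).astype(float)

# The same values assigned to random locations: no spatial structure
rng = np.random.default_rng(0)
shuffled = rng.permutation(clustered)

i_clustered = morans_i(clustered, W)
i_shuffled = morans_i(shuffled, W)
print(i_clustered, i_shuffled)
```

The clustered residuals give a strongly positive value, while the shuffled ones stay close to zero, which is the pattern the null of spatial randomness predicts.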

## Spatially lagged exogenous regressors (WX)

The first and most straightforward way to introduce space is by **spatially lagging one of the explanatory variables**. Mathematically, this can be expressed as follows:  
$\ln(p_i)=\alpha+\beta X_i + \theta \sum_j w_{ij} X'_j + \epsilon_i$  
where $\ln(p_i)$ is our dependent variable (logarithmic price), $X'_i$ is a subset of $X_i$ (although it could encompass all of the explanatory variables), and $w_{ij}$ is the $ij$-th cell of a spatial weights matrix $W$. Because $W$ assigns non-zero values only to spatial neighbors, if $W$ is row-standardized (customary in this context), then $\sum_j w_{ij} X'_j$ captures the average value of $X'$ in the surroundings of location $i$. This is what we call the spatial lag of $X'_i$. Also, since it is a spatial transformation of an explanatory variable, the standard estimation approach, OLS, is sufficient: spatially lagging the variables does not violate any of the assumptions on which OLS relies.
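
The difference between a row-standardized and a binary weights matrix can be seen in a tiny worked example. The sketch below uses a hypothetical three-location neighbor matrix and made-up values; as with k-nearest-neighbor weights, the matrix does not need to be symmetric:

```python
import numpy as np

# Hypothetical binary neighbor matrix for three locations (1 = neighbor)
W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)

# Made-up values of an explanatory variable at each location
x = np.array([10.0, 20.0, 60.0])

# Binary weights: the lag is the SUM of the neighbors' values
lag_sum = W @ x

# Row-standardized weights: each row sums to 1, so the lag is the neighbors' MEAN
W_row = W / W.sum(axis=1, keepdims=True)
lag_mean = W_row @ x

print(lag_sum)   # [80. 10. 30.]
print(lag_mean)  # [40. 10. 15.]
```

With a 0/1 variable such as a pool indicator, the binary version counts the neighbors that have a pool, and the row-standardized version gives the share of neighbors that do.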
 
 Usually, we will want to spatially lag variables that we think may affect the price of a house in a given location. For example, one could think that pools represent a visual amenity. If that is the case, then listed properties surrounded by other properties with pools might, everything else equal, be more expensive. To calculate the number of pools surrounding each property, we can build an alternative weights matrix that we do not row-standardize:

In [23]:
# Create weights (KNN weights are binary by default, so the lag below is a count)
w_pool = weights.KNN.from_dataframe(data, k=8)
# Assign spatial lag based on the pool values
lagged = data.assign(w_pool=weights.spatial_lag.lag_spatial(w_pool, data['pool'].values))
lagged.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,geometry,index_right,pool,log_price,w_pool
0,2265,https://www.airbnb.com/rooms/2265,20180710171409,2018-07-10,Zen-East in the Heart of Austin,Zen East is situated in a vibrant & diverse mu...,This colorful and clean 1923 house was complet...,Zen East is situated in a vibrant & diverse mu...,none,,...,strict_14_with_grace_period,f,f,3,0.19,POINT (-97.71398 30.27750),0,0,5.298317,0.0
1,5245,https://www.airbnb.com/rooms/5245,20180710171409,2018-07-10,"Green, Colorful, Clean & Cozy home",,Situated in a vibrant & diverse multicultural ...,Situated in a vibrant & diverse multicultural ...,none,,...,strict_14_with_grace_period,f,f,3,0.08,POINT (-97.71379 30.27577),0,0,4.828314,0.0
2,5456,https://www.airbnb.com/rooms/5456,20180710171409,2018-07-10,"Walk to 6th, Rainey St and Convention Ctr",Fabulous location for walking to Convention Ce...,Cute Private Studio apartment located in Willo...,Fabulous location for walking to Convention Ce...,none,My neighborhood is ideally located if you want...,...,strict_14_with_grace_period,f,t,1,3.88,POINT (-97.73448 30.26112),0,0,4.553877,1.0
3,5769,https://www.airbnb.com/rooms/5769,20180710171409,2018-07-10,NW Austin Room,,Looking for a comfortable inexpensive room to ...,Looking for a comfortable inexpensive room to ...,none,Quiet neighborhood with lots of trees and good...,...,moderate,t,t,1,2.3,POINT (-97.78370 30.45596),0,0,3.688879,2.0
5,6448,https://www.airbnb.com/rooms/6448,20180710171409,2018-07-10,Secluded Studio in 78704 (Zilker),Our garage apartment provides a private space ...,Stay in our lovely 1 bedroom garage apartment ...,Our garage apartment provides a private space ...,none,The neighborhood is fun and funky (but quiet)!...,...,strict_14_with_grace_period,f,f,1,2.21,POINT (-97.76503 30.26027),0,0,4.859812,0.0


And now we can run the model, which has the same setup as m1, with the exception that it includes the number of AirBnb properties with pools surrounding each house:

In [25]:
# Add pool to the explanatory variables
extended_vars = explanatory_vars + ["pool", "w_pool"]

m2 = spreg.OLS(lagged[['log_price']].values, lagged[extended_vars].values, 
               name_y = 'log_price', name_x = extended_vars)

In [26]:
print(m2.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:       11166
Mean dependent var  :      5.0455                Number of Variables   :           8
S.D. dependent var  :      1.1277                Degrees of Freedom    :       11158
R-squared           :      0.2987
Adjusted R-squared  :      0.2983
Sum squared residual:    9957.626                F-statistic           :    679.0391
Sigma-square        :       0.892                Prob(F-statistic)     :           0
S.E. of regression  :       0.945                Log likelihood        :  -15204.420
Sigma-square ML     :       0.892                Akaike info criterion :   30424.840
S.E of regression ML:      0.9443                Schwarz criterion     :   30483.405

-----------------------------------------------------------------------------

Results are largely consistent with the original model. Incidentally, the number of pools surrounding a property does not appear to have any significant effect on its price. This could be for a host of reasons: maybe AirBnb customers do not value the number of pools surrounding a property where they are looking to stay; or maybe they do, but our dataset only captures pools in other AirBnb properties, which is not necessarily a good proxy for the number of pools in the immediate surroundings of a given property.

## Spatially lagged endogenous regressors (WY)

In a similar way to how we have included the spatial lag of an explanatory variable, one could think the prices of houses surrounding a given property also enter its own price function. In math terms, this implies the following:  
$\ln(p_i)=\alpha+\beta X_i + \theta \sum_j w_{ij} \ln(p_j) + \epsilon_i$  
This is essentially what we call a spatial lag model in spatial econometrics. Two calls for caution:
* Unlike before, this specification does violate some of the assumptions on which OLS relies. In particular, it includes an endogenous variable, $\sum_j w_{ij} \ln(p_j)$, on the right-hand side. This means we need a different estimation method to obtain reliable coefficients. The technical details go well beyond the scope of this tutorial, but we can offload them to PySAL and use the GM_Lag class, which estimates this model by spatial two-stage least squares.

* A more conceptual gotcha: you might be tempted to read the equation above as the effect of the price in neighboring locations $j$ on that of location $i$. This is not quite the right interpretation. Instead, we need to realize this is all assumed to be a “joint decision”: rather than some houses setting their price first and that having a subsequent effect on others, what the equation models is an interdependent process by which each owner sets her own price taking into account the price that will be set in neighboring locations. This might read a bit like a technical subtlety and, to some extent, it is; but it is important to keep it in mind when you are interpreting the results.
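
To give some intuition for the estimation problem in the first point, the sketch below simulates data from a spatial lag model on a hypothetical 20 x 20 grid and estimates it with a bare-bones two-stage least squares written in NumPy. This is not the GM_Lag implementation: the grid, the parameter values, the helper name, and the choice of instruments (the regressor's first and second spatial lags) are all assumptions made for illustration.

```python
import numpy as np

def rook_weights_row_std(nrows, ncols):
    """Row-standardized rook-contiguity matrix for a regular grid."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if r + 1 < nrows:
                W[i, i + ncols] = W[i + ncols, i] = 1.0
            if c + 1 < ncols:
                W[i, i + 1] = W[i + 1, i] = 1.0
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
W = rook_weights_row_std(20, 20)
n = W.shape[0]

# Simulate y = theta * W y + alpha + beta * x + eps through the reduced form
theta_true, alpha, beta = 0.5, 1.0, 2.0
x = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = np.linalg.solve(np.eye(n) - theta_true * W, alpha + beta * x + eps)

Wy = W @ y
ones = np.ones(n)

# Instruments: the exogenous regressors plus spatial lags of x
Q = np.column_stack([ones, x, W @ x, W @ W @ x])

# Stage 1: project the endogenous lag W y onto the instruments
Wy_hat = Q @ np.linalg.lstsq(Q, Wy, rcond=None)[0]

# Stage 2: regress y on the exogenous regressors and the fitted lag
Z_hat = np.column_stack([ones, x, Wy_hat])
alpha_hat, beta_hat, theta_hat = np.linalg.lstsq(Z_hat, y, rcond=None)[0]

# Naive OLS on the observed lag, for comparison only
theta_ols = np.linalg.lstsq(np.column_stack([ones, x, Wy]), y, rcond=None)[0][2]

print("true theta:", theta_true)
print("2SLS theta:", round(theta_hat, 3), " OLS theta:", round(theta_ols, 3))
```

The reduced form used to simulate `y` also reflects the second point: all values are determined jointly by solving one system, rather than one location's price being set before its neighbors'.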

Let us see how you would run this using PySAL:


In [27]:
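variables = explanatory_vars + ["pool"]
# NOTE: `w` is not defined earlier in this notebook. As an assumption, we build
# row-standardized KNN weights with k=8, the same k used for w_pool above.
w = weights.KNN.from_dataframe(data, k=8)
w.transform = "R"
m3 = spreg.GM_Lag(data[['log_price']].values, data[variables].values, 
                  w=w,
                  name_y = 'ln(price)', name_x = variables)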

In [28]:
print(m3.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   ln(price)                Number of Observations:       11166
Mean dependent var  :      5.0455                Number of Variables   :           8
S.D. dependent var  :      1.1277                Degrees of Freedom    :       11158
Pseudo R-squared    :      0.3318
Spatial Pseudo R-squared:  0.2993

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       2.9215085       0.1286871      22.7024171       0.0000000
 host_listings_count       0.0000096       0.0000725       0.1317448       0.8951861
           bathrooms       0.2649284       0.0174426      1

As we can see, results are again very similar for all the other variables. It is also very clear that the estimate of the spatial lag of price is statistically significant. This is evidence of spatial interaction between property owners when they set their prices.

## Prediction performance of spatial models
Even if we are not interested in the interpretation of the model to learn more about how alternative factors determine the price of an AirBnb property, spatial econometrics can be useful. In a purely predictive setting, the use of explicitly spatial models is likely to improve accuracy in cases where space plays a key role in the data generating process. To have a quick look at this issue, we can use the mean squared error (MSE), a standard metric of accuracy in the machine learning literature, to evaluate whether explicitly spatial models are better than traditional, non-spatial ones:

In [29]:
from sklearn.metrics import mean_squared_error as mse

# In-sample MSE of each model's fitted values against the observed log price
mses = pd.Series({'OLS': mse(data["log_price"], m1.predy.flatten()),
                  'OLS+W': mse(data["log_price"], m2.predy.flatten()),
                  'Lag': mse(data["log_price"], m3.predy_e.flatten())
                  })
mses.sort_values()

Lag      0.891062
OLS+W    0.891781
OLS      0.903180
dtype: float64

We can see that adding the pool indicator and the number of surrounding pools (OLS+W) slightly reduces the in-sample MSE relative to plain OLS, from about 0.903 to 0.892, and that including the spatial lag of price lowers it marginally further, to about 0.891.
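
For reference, the MSE is simply the average of the squared differences between observed and fitted values. A minimal sketch with made-up numbers (not taken from the models above):

```python
import numpy as np

# Hypothetical observed and fitted log prices for three listings
y_obs = np.array([5.0, 4.5, 4.0])
y_fit = np.array([5.5, 4.5, 3.0])

# Mean of the squared errors: (0.25 + 0 + 1) / 3
mse_manual = np.mean((y_obs - y_fit) ** 2)
print(round(mse_manual, 4))
```

Since all three models here are evaluated on the same data used to fit them, the comparison measures in-sample fit; a held-out set would be needed to assess out-of-sample prediction.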