# 7CUSMSDA Practical Week 7 :  Spatial Autocorrelation & Spatial Regression
<a href="#This Week's Overview">This Week's Overview</a>

<a href="#Learn Outcomes">Learn Outcomes</a> 

<a href='#Get prepared'>Get prepared</a>

<a href='#Spatial Weights'>Spatial Weights</a>
  - <a href='#Contiguity Based Weights'>Contiguity Based Weights</a>
  - <a href='#Distance Based Weights'>Distance Based Weights</a>
  - <a href='#Kernel Weights'>Kernel Weights</a>

<a href='#Spatial Lag'>Spatial Lag</a>
  - <a href='#Spatial Similarity'>Spatial Similarity</a>
  - <a href='#Moran Plot'>Moran Plot</a>
  - <a href='#Global spatial autocorrelation'>Global spatial autocorrelation</a>
  - <a href='#Local spatial autocorrelation'>Local spatial autocorrelation</a>

<a href='#Spatial Regression'>Spatial Regression</a>
  - <a href='#Spatial Lag model'>Spatial Lag model</a>
  - <a href='#Spatial Error model'>Spatial Error model</a>
  - <a href='#Prediction performance of spatial models'>Prediction performance of spatial models</a>
  - <a href='#GWR Prediction'>GWR Prediction</a>

- <a href='#Task 1'>Task 1</a>
- <a href='#Task 2'>Task 2</a>
- <a href='#Task 3'>Task 3</a>
- <a href='#Task 4'>Task 4</a>
- <a href='#Task 5'>Task 5</a>
- <a href='#Task 6 (Optional)'>Task 6 (Optional)</a>
- <a href='#Task 7'>Task 7</a>
- <a href='#Task 8'>Task 8</a>
- <a href='#Task 9'>Task 9</a>
- <a href='#Task 10'>Task 10</a>



# <a id="This Week's Overview">This Week's Overview</a>
This practical builds your confidence with spatial weights through three main types: the widely used contiguity-based weights, distance-based weights, and kernel weights. You will use functions from `PySAL` and `libpysal` to explore each type and compare the results. Building on spatial weights, the concepts of `Spatial Lag` and `Global Spatial Autocorrelation` are then presented using variables on London housing, with detailed explanations of the process and the corresponding visualizations.

Moran's I can only tell us whether global spatial autocorrelation exists; it cannot identify where the clusters are. In other words, to explore the spatial instability caused by particular areas departing from the general pattern, we need local measures. We will use `Local Indicators of Spatial Association (LISAs)` in `PySAL` to classify the observations in our dataset into four groups, each based on the Moran plot and called "quadrants":
- high values surrounded by high values (HH), in what we call `hot spots`.
- low values nearby other low values (LL), in what we call `cold spots`.
- high values among low values (HL), in what we call `spatial outliers`.
- low values among high values (LH), in what we call `spatial outliers`. 

Spatial regression models are then introduced, with both the Spatial Lag model and the Spatial Error model for comparison, and GWR (geographically weighted regression) prediction closes the practical.

# <a id="Learn Outcomes">Learn Outcomes</a>
You will practise the concepts delivered in the lecture:
- Spatial weights (Contiguity based, Distance based, kernel weights)
- Spatial Autocorrelation (Global & Local)
- Spatial Regression models (Lag & Error)
- GWR prediction

You will further explore the functions provided in `PySAL` and `Libpysal`.

# <a id="Get prepared">Get prepared</a>

In [5]:
import sys
import os
sys.path.append(os.path.abspath('..'))
import geopandas as gpd
import seaborn as sns
import libpysal as lps
from libpysal.weights import KNN
import pysal as ps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pysal.viz as viz

from pysal.model.spreg import ols
from pysal.model.spreg import ML_Error
from pysal.model.spreg import ML_Lag

# GWR prediction libraries
from pysal.model.mgwr.sel_bw import Sel_BW
from pysal.model.mgwr.gwr import GWR
# from pysal.contrib.glm.family import Gaussian
from scipy.stats import pearsonr

import warnings
warnings.simplefilter('ignore')

We will stick to a single dataset, so please copy the **shapefile data "borough_airbnb_housing"** we produced last week into your "data" folder for this week.

In [6]:
# read in your data and get the headers presented
gdf=gpd.read_file('data/lsoa_IMD_airbnb_housing.shp')
gdf.head()

Unnamed: 0,Code,Area,Year,Value,Measure,objectid,lsoa11nmw,st_areasha,st_lengths,IMD_Rand,...,TotDec,DepChi,Pop16_59,Pop60_,WorkPop,Mean Price,Small Host,Multiple L,Property C,geometry
0,E01000001,City of London 001A,Year ending Dec 2017,1204928.0,Mean,1,City of London 001A,133320.768872,2291.846072,29199,...,656,465,715.0,343907.41983,3682.43942,148.444444,8.0,10.0,18.0,"POLYGON ((532095.563 181577.351, 532095.125 18..."
1,E01000002,City of London 001B,Year ending Dec 2017,991549.0,Mean,2,City of London 001B,226191.27299,2433.960112,30379,...,580,394,619.75,583474.041779,3910.38724,200.4,8.0,2.0,10.0,"POLYGON ((532267.728 181643.781, 532262.875 18..."
2,E01000003,City of London 001C,Year ending Dec 2017,913007.0,Mean,3,City of London 001C,57302.966538,1142.359799,14915,...,759,445,804.0,147839.506081,1834.93132,139.428571,5.0,2.0,7.0,"POLYGON ((532105.312 182010.574, 532104.872 18..."
3,E01000006,Barking and Dagenham 016A,Year ending Dec 2017,354300.0,Mean,5,Barking and Dagenham 016A,144195.846857,1935.510354,14486,...,1297,221,1284.5,372257.321186,3108.610781,44.2,5.0,0.0,5.0,"POLYGON ((544817.826 184346.261, 544815.791 18..."
4,E01000007,Barking and Dagenham 015A,Year ending Dec 2017,230380.0,Mean,6,Barking and Dagenham 015A,198134.809244,2824.036914,7256,...,1424,105,1404.0,511543.283051,4537.675635,62.0,10.0,4.0,14.0,"POLYGON ((544175.331 184526.180, 544175.880 18..."


### <a id='Task 1'>Task 1</a>

To get a rough idea of how the number of Airbnb listings is distributed across LSOAs in London, we can plot a quantile map. Recall how to draw a simple quantile map for the variable `'Property C'`, and set `cmap` to `'coolwarm'`.

**Hint**: scheme to be 'quantiles'.

In [None]:
# your code here
gdf.plot(column='Property C', alpha=0.8, cmap='coolwarm', scheme='quantiles')

In [None]:
gdf.columns.values

However, as we are going to calculate spatial weights by distance today, we need to be more careful about the **CRS**: we need to reset it to EPSG 27700 again. You should be more familiar with the steps by now.

In [None]:
# Set up figure and axis
f, ax = plt.subplots(1, figsize=(12,10))
# Plot Number of Airbnbs
# Quickly transform to OSGB CRS and plot 
gdf.plot(column='Property C', scheme='Quantiles', legend=True, ax=ax)
# Remove axis frame
ax.set_axis_off()
# Change background color of the figure
f.set_facecolor('0.8')
# set up the title
f.suptitle('Amount of Airbnbs in LSOAs', size=25)
plt.show()

Similar maps but stretched, right? Once you are prepared, let's start our spatial calculation.
## <a id='Spatial Weights'>Spatial Weights</a>

Spatial weights are mathematical structures that represent spatial relationships, and they are a crucial component of spatial analysis. Generally, a spatial weights object is an $n \times n$ matrix measuring the potential spatial relationship between each pair of observations in a spatial dataset of $n$ locations with varied geometries. The spatial relationships between these geometries can be based on criteria such as:
- Contiguity Based Weights
- Distance Based Weights (both geospatial distance and general distance)
- Kernel Weights

In a spatial weights matrix, geographical space is encoded in numerical form for statistical use. The diagonal elements $w_{ii}$ are set to zero, while the remaining cells $w_{ij}$ measure the potential interaction between each pair of locations $i$ and $j$.

We are going to use `PySAL` to create, manipulate and analyze spatial weights matrices of different types in the following sections. For further details see the Spatial Weights [API](https://pysal.readthedocs.io/en/latest/api.html).

### <a id='Contiguity Based Weights'>Contiguity Based Weights</a>

Contiguity weights can be built from a dataframe with a geometry column or from a file representation such as a shapefile. In this section we use contiguity to define neighboring observations: we use the `weights.Contiguity` module in `PySAL` to construct and manipulate spatial weights matrices based on contiguity criteria, and `weights.Contiguity` in `libpysal` to plot the result.

Three contiguity weights will be compared: **Queen**, **Rook** and **Bishop**.
#### Queen contiguity weight
This commonly used weight type builds a queen contiguity matrix for our data, reflecting whether a polygon shares an **edge** or a **vertex** with another polygon. For a pair of LSOAs to be considered neighbours under this $W$, they need to "touch" each other to some degree. As the weights are symmetric, if LSOA $A$ neighbors LSOA $B$, then both $w_{AB} = 1$ and $w_{BA} = 1$.

We will begin with the `GeoDataFrame` and pass it on to the queen contiguity weights builder in `PySAL` (`ps.lib.weights.Queen.from_dataframe`). 

In [7]:
# Create the spatial weights matrix
w_queen = ps.lib.weights.Queen.from_dataframe(gdf)
w_queen.n

4201

In [None]:
print ('%.4f'%w_queen.pct_nonzero) # percentage of non zero queen weights

In [None]:
w_queen.histogram # frequency of n neighbors 

In [None]:
gdf.loc[gdf.Area=='Westminster 018A']

The weights use the same index as the dataframe, so let's check which LSOAs are neighbors of observation `Westminster 018A`, which has index `4000`, and how much they are "weighted".

In [None]:
w_queen[4000]

Can you get the list of names for the neighbours? For example, present the index and name of the target LSOA and of its neighbours in one table. You should get a list including LSOAs in Westminster and Camden.

In [None]:
target_lsoa = [4000]
target_lsoa.extend(w_queen.neighbors[4000])
gdf.loc[target_lsoa]

Let us row-standardize it to make sure every row of the matrix sums up to one, and check the neighbours for Westminster 018A again.

In [None]:
# Row standardize the matrix
w_queen.transform = 'R'
w_queen[4000]

What do you get this time? The weight given to each neighbour has changed from 1.0 to 0.2! Think about the reason: after row-standardizing, the weights of all neighbours of Westminster 018A must sum to 1.
We call `w_queen.full()` to get a full, dense matrix describing all of the pairwise relationships:

In [None]:
wqmatrix, ids = w_queen.full()
wqmatrix

In [None]:
# count the non-zero cells per row: after row-standardizing, every row sums to 1
n_neighbors = (wqmatrix > 0).sum(axis=1) # how many neighbors each region has

In [None]:
n_neighbors[4000]
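
To see why each weight became 0.2, here is a minimal sketch of row-standardisation on a toy matrix. The four areas and their links are invented for illustration and are not taken from the LSOA data; the point is only that each row is divided by its own sum.

```python
import numpy as np

# Hypothetical binary contiguity matrix for 4 areas (1 = neighbours, 0 = not).
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# Row-standardise: divide each row by its own sum.
W_std = W / W.sum(axis=1, keepdims=True)

# Every row now sums to 1, so row sums no longer count neighbours;
# counting the non-zero cells does.
row_sums = W_std.sum(axis=1)
n_neighbors = (W_std > 0).sum(axis=1)

print(W_std[1])      # area 1 has three neighbours, each weighted 1/3
print(row_sums)
print(n_neighbors)
```

An area with five neighbours would, by the same logic, give each of them a weight of 1/5 = 0.2.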

We can now get a direct picture of the queen-adjacent neighbors of Westminster 018A.

In [None]:
queen_neighs=w_queen.neighbors[4000]
q=gdf.loc[queen_neighs].plot(edgecolor='grey', facecolor='w')
title=plt.title('Queen Adjacency for Westminster 018A')

However, to visualize the queen contiguity links between neighbors, we call `libpysal` directly, which shows the contiguity more clearly when plotting.

In [None]:
w_queen_1 = lps.weights.Queen.from_dataframe(gdf)
type(w_queen_1)

You may find that the returned type of the same weights matrix is different, because you have called another library, `libpysal`. Let's have a look at the contiguity by calling `plot`.

**!! Be patient**, it takes time to get the plot!

In [None]:
ax = gdf.plot(edgecolor='grey', facecolor='w', figsize=(20, 12))
f,ax = w_queen_1.plot(gdf, ax=ax, 
                   edge_kws=dict(color='r', linestyle=':', linewidth=1),
                   node_kws=dict(marker=''))

Think about the rationale for plotting the queen weights from `libpysal` rather than those from `pysal`. What happens if you switch to the latter? What do you get, and why?

#### Rook contiguity weights

Rook weights define neighbors as those sharing an edge on their respective borders. At finer scales, the rook neighbors of an observation may be different from the queen neighbors, depending on the configuration of both targeted observation and 'neighbors'. This time we use `.from_shapefile` function to get the rook neighbors.

In [None]:
w_rook = ps.lib.weights.Rook.from_shapefile('data/lsoa_IMD_airbnb_housing.shp')

### <a id='Task 2'>Task 2</a>
Since we now have a new spatial weights matrix built with Rook rather than Queen, we need to repeat the same steps to get the corresponding values:

In [None]:
# total number of rows or columns
w_rook.n

In [None]:
# Percentage of nonzero neighbor counts
w_rook.pct_nonzero  

In [None]:
# histogram
w_rook.histogram  

In [None]:
# get the indices for neighbors of Westminster 018A indexed at 4000
w_rook.neighbors[4000]  

In [None]:
# get the neighboring lsoa names for Westminster 018A indexed at 4000
gdf['Area'][w_rook.neighbors[4000]]

So Westminster 018A has 5 rook neighbor LSOAs in Westminster and Camden; the same as queen neighbors at lsoa level.

In [None]:
# plot Westminster 018A and 5 rook adjacent neighbors
# your code here
rook_neighs=w_rook.neighbors[4000]
r=gdf.loc[rook_neighs].plot(edgecolor='grey', facecolor='w')
title=plt.title('Rook Adjacency')

### <a id='Task 3'>Task 3</a>
Similarly, we can call `libpysal` to visualize the spatial weights.
    
**Be patient again!**

In [None]:
w_rook_1 = lps.weights.Rook.from_shapefile('data/lsoa_IMD_airbnb_housing.shp')
type(w_rook_1)

In [None]:
# plot out the contiguity 
ax = gdf.plot(edgecolor='grey', facecolor='w', figsize=(20, 12))
f,ax = w_rook_1.plot(gdf, ax=ax, 
                   edge_kws=dict(color='r', linestyle=':', linewidth=1),
                   node_kws=dict(marker=''))

To test whether the Queen and Rook weights give the same result:

In [None]:
(w_queen.pct_nonzero == w_rook.pct_nonzero) and (w_queen.n == w_rook.n)

#### Bishop contiguity weights

Bishop weighting only considers polygons as neighbors when they share vertices but no edge. It is not directly available from `PySAL`, but we can construct it with the `w_difference` function.

In [None]:
w_bishop = ps.lib.weights.w_difference(w_queen, w_rook, constrained=False, silence_warnings=True)
w_bishop.histogram

From the histogram we can tell how many bishop neighbors the LSOAs in this dataset have, that is, neighbors that share only a vertex without sharing any edge. To check a single LSOA such as Westminster 018A we can look it up in `w_bishop`, or simply call `islands` to list the observations with no bishop neighbors at all.

In [None]:
print (w_bishop.islands)
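
The three contiguity rules are easiest to compare on a regular grid. The sketch below is a hand-rolled illustration (the helper `grid_neighbors` is hypothetical, not a `PySAL` function): it builds rook and queen neighbours for a 3x3 grid and derives bishop neighbours as the set difference, which is the same idea as `w_difference` above.

```python
from itertools import product

def grid_neighbors(nrows, ncols, rule):
    """Return {cell_id: sorted neighbour ids} for a regular grid.

    rule: 'rook' (shared edge), 'bishop' (shared corner only) or 'queen' (either).
    Cells are numbered row by row, starting at 0.
    """
    rook_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    bishop_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    steps = {"rook": rook_steps, "bishop": bishop_steps,
             "queen": rook_steps + bishop_steps}[rule]
    neighbors = {}
    for r, c in product(range(nrows), range(ncols)):
        ids = [(r + dr) * ncols + (c + dc) for dr, dc in steps
               if 0 <= r + dr < nrows and 0 <= c + dc < ncols]
        neighbors[r * ncols + c] = sorted(ids)
    return neighbors

rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")
# Bishop neighbours are the queen neighbours that are not rook neighbours.
bishop = {i: sorted(set(queen[i]) - set(rook[i])) for i in queen}

print(rook[4])        # centre cell, edge-sharing cells: [1, 3, 5, 7]
print(bishop[4])      # centre cell, corner-only cells: [0, 2, 6, 8]
print(len(queen[4]))  # 8
```

Real polygons such as LSOAs rarely meet at a single point, so bishop neighbours tend to be much rarer there than on a grid.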

### <a id='Task 4 (Optional)'>Task 4 (Optional)</a>
Get both Rook weights and Queen weights plotted. 

In [None]:
# plot them, your code here
f,ax = plt.subplots(1,2,figsize=(10, 6), subplot_kw=dict(aspect='equal'))
# plot the rook, set the title and axis 
w_rook_1.plot(gdf, ax=ax[0], edge_kws=dict(color='b', linestyle='-', linewidth=1), node_kws=dict(marker=''))
w_queen_1.plot(gdf, ax=ax[1], 
                   edge_kws=dict(color='r', linestyle=':', linewidth=1),
                   node_kws=dict(marker=''))

###  <a id='Distance Based Weights'>Distance Based Weights</a>
Besides contiguity, we can also define neighbors by distance, which is a more general approach. We can use the [`weights.Distance` module](https://pysal.readthedocs.io/en/latest/library/weights/Distance.html) in `PySAL`. However, recalling what we did in Week 4 and Week 5, distance measurements depend on the CRS, so we need to ensure that the shapefile has been projected correctly.
#### k-nearest neighbor weights
We use k-nearest neighbor criterion to define neighbors for target observation. For example, we set $k=4$ for a trial.

In [None]:
w_knn = ps.lib.weights.KNN.from_dataframe(gdf, k=4)

In [None]:
w_knn.histogram

In [None]:
w_knn.s0

We could also use this function to call all the neighbors' list:

In [None]:
listnei = w_knn.reweight(p=1, inplace=False)
print (listnei.neighbors)

In [None]:
from libpysal.weights import KNN
w_knn_1 = KNN.from_dataframe(gdf, k=4)

In [None]:
# get it plotted again
ax = gdf.plot(edgecolor='grey', facecolor='w')
f,ax = w_knn_1.plot(gdf, ax=ax, 
        edge_kws=dict(color='r', linestyle=':', linewidth=1),
        node_kws=dict(marker=''))
ax.set_axis_off()
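
To make the k-nearest-neighbour rule concrete, here is a small pure-Python sketch on five invented centroids (the coordinates and the helper `knn_neighbors` are assumptions for illustration, not part of `PySAL`). It shows two properties worth remembering: every observation has exactly $k$ neighbours, and the relation is not necessarily symmetric.

```python
from math import dist

# Hypothetical centroid coordinates (in metres) for five areas.
points = {0: (0, 0), 1: (1, 0), 2: (0, 2), 3: (5, 5), 4: (6, 5)}

def knn_neighbors(points, k):
    """Return {id: ids of the k nearest other points, nearest first}."""
    result = {}
    for i, p in points.items():
        others = sorted((j for j in points if j != i),
                        key=lambda j: dist(p, points[j]))
        result[i] = others[:k]
    return result

knn2 = knn_neighbors(points, k=2)

# With binary weights every observation has exactly k neighbours, so s0 = n * k.
s0 = sum(len(nbrs) for nbrs in knn2.values())

# Pairs (i, j) where j is a neighbour of i but i is not a neighbour of j.
asymmetric = [(i, j) for i, nbrs in knn2.items() for j in nbrs if i not in knn2[j]]

print(knn2[3])      # [4, 2]
print(s0)           # 10
print(asymmetric)   # [(3, 2), (4, 2)]
```

Areas 3 and 4 are far from the rest, so they have to reach out to area 2 for their second neighbour, while area 2 finds both of its neighbours closer by.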

The $k$ value can be adjusted by calling `reweight`, which changes the weights object without re-constructing the KDTree used for the nearest-neighbor queries. For example, to change $k$ to 5:

In [None]:
w_knn_1.reweight(k=5)

### <a id='Task 5'>Task 5</a>
Let's take Westminster 018A as an example again, and compare the outputs for $k=5$ with $k=4$.

In [None]:
# get the w_knn neighbors list for Westminster 018A
# hint: simply call its index
w_knn[4000]

In [None]:
# use the reweight function to reset the k as 5
w_knn_r = w_knn.reweight(k=5, inplace=False)
# new list of neighbors for Westminster 018A
w_knn_r[4000]

In [None]:
# new s0 value
w_knn_r.s0

In [None]:
set(w_knn_r.neighbors[4000]) == set([744, 4001, 4153, 743, 4024])

In [None]:
w_knn_r.weights[4000]

#### Distance Band Weights (Optional)
You may have noticed already that with $knn$ weights every target observation gets the same number of neighbors. If we instead use a distance band (threshold) to define neighbors, keeping those that fall within the threshold distance, the number of neighbors varies. Distance band weights can be generated from an array, a dataframe, a shapefile, specified values, etc. Here we try an example from a dataframe:

In [None]:
w_thresh=ps.lib.weights.DistanceBand.from_dataframe(gdf, 0.8)

In [None]:
w_thresh.weights[4000]
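
As a rough sketch of the distance band idea, the pure-Python example below uses four invented centroids in a projected CRS (so distances are in metres). The helper `distance_band` is hypothetical and treats the threshold as inclusive, which is an assumption made for this illustration only.

```python
from math import dist

# Hypothetical centroids in a projected CRS, so distances are in metres.
points = {0: (0, 0), 1: (300, 0), 2: (0, 400), 3: (2000, 2000)}

def distance_band(points, threshold):
    """Return {id: ids of all other points within `threshold` (inclusive)}."""
    return {i: [j for j in points
                if j != i and dist(points[i], points[j]) <= threshold]
            for i in points}

band = distance_band(points, threshold=450)
islands = [i for i, nbrs in band.items() if not nbrs]

print(band)      # neighbour counts now differ between observations
print(islands)   # [3]: too far from everything else

# A threshold far smaller than the spacing of the points leaves only islands.
tiny = distance_band(points, threshold=0.8)
print(all(len(nbrs) == 0 for nbrs in tiny.values()))   # True
```

The threshold is expressed in the units of the CRS, so it is worth checking that the value you pass makes sense for the spacing of your polygons.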

### <a id='Kernel Weights'>Kernel Weights</a> <font color='red'>Don't need to run the code, your laptop will crash!</font> 

We have used kernels several times already. Kernel weights combine the thresholds described above with continuous numeric weights: neighbors are defined by continuous distance-based weights using kernel densities. Given a bandwidth, a continuous kernel function is evaluated to give a weight between 0 and 1, and many kernel functions can be used:

In [None]:
w_kernel = ps.lib.weights.Kernel.from_dataframe(gdf)
w_kernel.neighbors

![kw_1](kw_1.png)

In [None]:
w_kernel.weights[4000]

![kw_2](kw_2.png)

In [None]:
gdf.loc[w_kernel.neighbors[4000] + [4000]]

![kw_3](kw_3.png)

In [None]:
w_kernel.bandwidth[0:6]

![kw_4](kw_4.png)
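
Since the kernel cells above are too heavy to run, here is a lightweight sketch of the idea on four invented points. It assumes a triangular kernel with one fixed bandwidth, where the weight is $1 - d/\text{bandwidth}$ for pairs closer than the bandwidth; the helper name and the choice to leave out the self-weight are simplifications for illustration.

```python
from math import dist

# Hypothetical points along a line (units are arbitrary).
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (3.0, 0.0), 3: (10.0, 0.0)}

def triangular_kernel_weights(points, bandwidth):
    """Weight = 1 - d / bandwidth for pairs closer than the bandwidth, else no link."""
    weights = {}
    for i, p in points.items():
        weights[i] = {}
        for j, q in points.items():
            d = dist(p, q)
            if i != j and d < bandwidth:
                weights[i][j] = 1 - d / bandwidth
    return weights

w = triangular_kernel_weights(points, bandwidth=4.0)

print(w[0])   # {1: 0.75, 2: 0.25}: closer neighbours get larger weights
print(w[3])   # {}: nothing within the bandwidth
```

Unlike binary contiguity or KNN weights, the weight here decays smoothly with distance and drops to zero at the bandwidth.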

The next cells handle nonplanar geometries with `fuzzy_contiguity`; try to compare the output with that of the Queen weights.

In [None]:
w_fuzzy_non = lps.weights.fuzzy_contiguity(gdf)
len(w_fuzzy_non.islands)

In [None]:
ax = gdf.plot(edgecolor='grey', facecolor='w')
f,ax = w_fuzzy_non.plot(gdf, ax=ax, 
        edge_kws=dict(color='r', linestyle=':', linewidth=1),
        node_kws=dict(marker=''))
ax.set_title('Nonplanar Weights')

![kw_5](kw_5.png)

## <a id='Spatial Lag'>Spatial Lag</a>
The `spatial lag` is the product of the spatial weights matrix and a given variable. If $W$ is row-standardized, the result is the average value of the variable in the neighborhood of each observation.

In [None]:
gdf.info()

We will use the variable for the average Airbnb listing price (`Mean Price`) as an example to interpret the concept of `spatial lag`. First, let's get a general idea of its spatial distribution.

In [None]:
pr = ps.viz.mapclassify.Quantiles(gdf['Mean Price'], k=5)
f, ax = plt.subplots(1, figsize=(20, 12))
gdf.assign(cl_pr=pr.yb).plot(column='cl_pr', categorical=True, k=5, cmap='OrRd', 
                                      linewidth=0.1, ax=ax, edgecolor='white', legend=True)

plt.title('Average Airbnb Listing Price Quintiles')
plt.show()

In [None]:
# get the spatial lag for listing price using queen weight
gdf['w_price'] = ps.lib.weights.lag_spatial(w_queen, gdf['Mean Price'])
# list out the name, listing price, and spatial lag for Westminster 018A
gdf[['Area', 'Mean Price', 'w_price']].loc[[4000]]

### <a id='Task 6'>Task 6</a>
To interpret the spatial lag (`w_price`), take Westminster 018A as an example. Its average listing price is about £255, and it is surrounded by neighboring LSOAs whose average listing prices vary dramatically. We can check the lag value by querying the spatial weights matrix to find the neighbors:

In [None]:
# get the queen spatial weight neighbors for Westminster 018A  
w_queen[4000]

In [None]:
# check the neighbors' average Airbnb listing price
# (index with the list of neighbor ids rather than the {id: weight} dict)
nei_price = gdf.loc[w_queen.neighbors[4000], 'Mean Price']
nei_price

In [None]:
# get the average value for neighboring price
nei_price.mean()
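
The same check can be reproduced by hand on toy numbers. In this sketch the four areas, their links and their prices are all invented; it only illustrates that with row-standardised weights the spatial lag $Wy$ equals the mean of the neighbours' values.

```python
import numpy as np

# Hypothetical binary contiguity among 4 areas, and their mean prices.
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
price = np.array([100.0, 80.0, 60.0, 40.0])

W_std = W / W.sum(axis=1, keepdims=True)   # row-standardise
lag = W_std @ price                        # spatial lag = W y

# Area 1 neighbours areas 0, 2 and 3, so its lag is the mean of their prices.
neighbour_mean = price[[0, 2, 3]].mean()

print(lag)
print(np.isclose(lag[1], neighbour_mean))   # True
```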

For some of the techniques below, it makes more sense to operate on the standardized version of a variable rather than the raw one. Standardizing means subtracting the average value from each observation of the column and dividing by the standard deviation.

Can you work out the standardized value, or $z$-score, of the Airbnb listing price below? To explore the spatial pattern of the standardized values further, we also need to create its spatial lag:

In [None]:
# your code here
gdf['price_std'] = (gdf['Mean Price'] - gdf['Mean Price'].mean()) / gdf['Mean Price'].std()
gdf['w_price_std'] = ps.lib.weights.lag_spatial(w_queen, gdf['price_std'])
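
A quick sanity check of the standardisation on invented prices: the result should have mean 0 and standard deviation 1. The sketch uses `ddof=1` so that NumPy matches the sample standard deviation that pandas' `.std()` computes by default.

```python
import numpy as np

# Hypothetical listing prices.
prices = np.array([50.0, 70.0, 90.0, 110.0, 180.0])

# z-score: subtract the mean, divide by the (sample) standard deviation.
z = (prices - prices.mean()) / prices.std(ddof=1)

print(z.round(3))
print(np.isclose(z.mean(), 0.0), np.isclose(z.std(ddof=1), 1.0))
```

Positive values are above-average prices and negative values are below average, which is what makes the quadrants of the Moran plot easy to read later on.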

## <a id='Spatial Autocorrelation'>Spatial Autocorrelation</a>

Do you still remember `CSR`, the complete spatial randomness we covered in Week 4? Under CSR, a variable follows no discernible pattern over space, so the location of an observation gives no information about its value. In other words, taking our **Airbnb listing price quintiles** plot as an example, there should be no visible clustering of similar values on the map. However, we can easily spot the colour gradient from central to inner and then outer London, which indicates that the pattern is not CSR.

By contrast, **spatial autocorrelation** can be defined as the "absence of spatial randomness": for a given dataset, the *similarity in values* among observations is related to their *locational similarity*, so the value at a target observation is related to the values at neighboring locations for the same variable. In the following, we generate measures of spatial similarity and attribute similarity respectively, which are widely combined into measures of spatial autocorrelation.

### <a id='Spatial Similarity'>Spatial Similarity</a>
Spatial weights measure spatial similarity, as we have seen in previous sections. Here we only use queen contiguity as an example:

In [None]:
W_queen = ps.lib.weights.Queen.from_shapefile('data/lsoa_IMD_airbnb_housing.shp')
W_queen.transform = 'r' # row-standardize the contiguity weights

The spatial lag is a derived variable that pairs attribute similarity with spatial similarity.
For LSOA $i$ the spatial lag is defined as: $\text{MeanPriceLag}_i = \sum_j w_{i,j}\,\text{MeanPrice}_j$, where $j$ indexes the neighboring LSOAs of LSOA $i$.

In [None]:
price_Lag = ps.lib.weights.lag_spatial(W_queen, gdf['Mean Price']) #spatial lag of the variable
price_LagQ5 = ps.viz.mapclassify.Quantiles(price_Lag, k=5) # let's say k=5 for example

In [None]:
f, ax = plt.subplots(1, figsize=(10, 8))
gdf.assign(cl_lag=price_LagQ5.yb).plot(column='cl_lag', categorical=True, k=5, cmap='coolwarm', linewidth=0.1, ax=ax, edgecolor='white', legend=True)
plt.title('Airbnb Mean Price Lag Quintiles')
plt.show()

Any differences? Does the quintile map of the spatial lag show spatially enhanced attribute similarity? YES! However, to make any statement about the relationship between the mean Airbnb listing price in an LSOA and the spatial lag of that price, we need one more step: statistical measures of spatial autocorrelation. We can use the Moran scatterplot for a preliminary visualization.

### <a id='Moran Plot'>Moran Plot</a>

The Moran scatterplot is an ordinary scatter plot that is widely used to visualize spatial autocorrelation, with the variable of interest on the x axis and its spatial lag on the y axis.

In [None]:
price=gdf['Mean Price']
b,a = np.polyfit(price, price_Lag, 1)
f, ax = plt.subplots(1, figsize=(10, 8))
plt.plot(price, price_Lag, '.', color='firebrick')

# dashed vertical line at the mean Airbnb listing price
plt.vlines(price.mean(), price_Lag.min(), price_Lag.max(), linestyle='--')
# dashed horizontal line at the mean of the lagged listing price
plt.hlines(price_Lag.mean(), price.min(), price.max(), linestyle='--')

# red line of best fit using global I as slope
plt.plot(price, a + b*price, 'r')
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of Airbnb Listing Price')
plt.xlabel('Airbnb Mean Price')
plt.show()

We can also use `seaborn`'s `regplot` function to plot the standardized values of the variable of interest against its spatial lag. Here we use the standardized values generated in the Spatial Lag section:

In [None]:
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(10, 8))
# Plot values
sns.regplot(x='price_std', y='w_price_std', data=gdf, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='r', alpha=0.5)
plt.axhline(0, c='r', alpha=0.5)
# Display
plt.show()

The figure above displays the relationship between the standardized airbnb listing price (`price_std`) and its spatial lag (`w_price_std`) in neighboring lsoas. The linear fit line is the best linear fit to the scatter plot representing the relationship between the two variables.

It is obvious from the plot that the two variables are positively related, which leads us to the two main types of spatial autocorrelation (SA):
- **Positive spatial autocorrelation**: similar values tend to group together in nearby locations. Generally, high values tend to be surrounded by high values, and low values by low values, so the main pattern is clustered.
- **Negative spatial autocorrelation**: similar values tend to be dispersed and further apart from each other. Generally, high values tend to be surrounded by low values, and low values by high values, so the main pattern is dispersed.

We also distinguish two main classes of SA, (1) **Global spatial autocorrelation** and (2) **Local spatial autocorrelation**, and use Exploratory Spatial Data Analysis (`ESDA`) tools for the analysis, e.g. spatial queries, statistical inference, choropleths, etc.

### <a id='Global spatial autocorrelation'>Global spatial autocorrelation</a>
Global spatial autocorrelation considers the overall geographical pattern of the values, measures the trend statistically as a statement about the degree of clustering, and summarizes the result numerically for further visualization. It helps to answer questions about the geographical distribution of values, such as whether similar values are found next to each other more often than expected. We start with the rationale of global spatial autocorrelation in the binary case, and then practise with Moran's I statistic.

We can split the values into "low listing price" and "high listing price" at the median to convert the variable into a binary one.

In [None]:
gdf['Mean Price'].median()

In [None]:
binary = gdf['Mean Price']> gdf['Mean Price'].median()
sum(binary)

Among over 4000 LSOAs, 2100 have an average price above the median (£73) and the rest are at or below it.

In [None]:
labels = ['Low Price', 'High Price']
binary = [labels[i] for i in 1*binary] 
gdf['binary'] = binary

In [None]:
fig = plt.figure(figsize=(12,10))
ax = plt.gca()
gdf.plot(column='binary', cmap='binary', edgecolor='grey', legend=True, ax=ax)

Does the plot remind you of the lecture material on counting joins between neighbors, which distinguishes three types of join?
- BB (Black-Black)
- WW (White-White)
- BW (Black-White, or White-Black) 

The joins are reflected in our binary spatial weights object `W_queen`. Given that about half of the LSOAs are black polygons in this case, how many BB joins should we expect if the colours were randomly assigned? We can work out the join count statistics as below:

In [None]:
from pysal.explore import esda 
binary = 1 * (gdf['Mean Price']> gdf['Mean Price'].median()) # convert back to binary
W_queen = lps.weights.Queen.from_dataframe(gdf)
W_queen.transform = 'b'
np.random.seed(12345)
jc = esda.join_counts.Join_Counts(binary, W_queen)
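
Before exploring the `jc` object, it may help to count joins by hand once. The sketch below uses an invented chain of six areas (each area neighbours the next one) and a hypothetical helper `count_joins`; it compares a clustered colouring with an alternating one.

```python
# Hypothetical chain of six areas: each area neighbours the next one.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

def count_joins(black, neighbors):
    """Return (BB, WW, BW) join counts; black[i] is 1 for black, 0 for white."""
    bb = ww = bw = 0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            if i < j:                       # count each join only once
                if black[i] and black[j]:
                    bb += 1
                elif not black[i] and not black[j]:
                    ww += 1
                else:
                    bw += 1
    return bb, ww, bw

clustered = count_joins([1, 1, 1, 0, 0, 0], neighbors)
alternating = count_joins([1, 0, 1, 0, 1, 0], neighbors)

print(clustered)     # (2, 2, 1): mostly like-with-like joins
print(alternating)   # (0, 0, 5): only black-white joins
```

Both colourings have three black and three white areas, so the difference in join counts comes purely from the spatial arrangement, which is exactly what the join count statistic tests.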

### <a id='Task 7'>Task 7</a>
Feel free to explore the numbers of BB, WW and BW joins by calling `jc.bb`, `jc.ww` and `jc.bw`.

In [None]:
jc.bb

In [None]:
jc.bw

In [None]:
jc.ww

In [None]:
sns.kdeplot(jc.sim_bb, shade=True)
plt.vlines(jc.bb, 0, 1, color='r')
plt.vlines(jc.mean_bb, 0,1)
plt.xlabel('BB Counts')

The black vertical line marks the mean BB count from the synthetic realizations, and the density plot shows the distribution of simulated BB counts. The red line is our observed count, which is far higher than the mean. So let's check the pseudo p-value for this statistic:

In [None]:
jc.p_sim_bb

So what will you conclude from the value? Write it down below:

** ------------------------------------------ ** 

Since this is below conventional significance levels, we would reject the null of complete spatial randomness in favor of spatial autocorrelation in airbnb listing price in London. 

To summarize, we created a binary variable for the Airbnb listing price in London and ran a join count analysis, which disregards the information in the original values. Now let's turn back to the real data and test for spatial autocorrelation.

In [None]:
mi = esda.moran.Moran(gdf['Mean Price'], W_queen) # call moran function
mi.I # print out the moran's I value

### <a id='Task 8'>Task 8</a>
Plot the statistic against a reference distribution under the null of CSR. 
    
**Hint**: similar to what we did in join count analysis. Call `seaborn`'s kdeplot function. But the expected value should be EI for $mi$.

In [None]:
sns.kdeplot(mi.sim, shade=True)
plt.vlines(mi.I, 0, 40, color='r')
plt.vlines(mi.EI, 0, 40)
plt.xlabel("Moran's I")

In [None]:
# Check the statistical significance
mi.p_sim

This is just 0.1% (you may get 0.002, 0.003, ...; it differs slightly every time) and would be considered statistically significant. It means that if we reallocate the same values randomly over space to produce a new map, only about 0.1% of such random maps would show a Moran's $I$ as large as the one from our real data, while about 99.9% would show a smaller value. As $I$ can also be interpreted as the slope of the Moran plot, the Airbnb listing price in London is more concentrated than it would be under a CSR process; the pattern is statistically significant and has spatial structure.
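
The permutation logic behind the pseudo p-value can be sketched in a few lines of NumPy. Everything here is a simplified illustration: the chain of eight areas and their values are invented, `morans_i` is a hand-written helper, and the p-value is a one-sided version of the idea rather than a copy of what `esda` does internally.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I for values y and a weights matrix W."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

# Hypothetical chain of 8 areas (each neighbours the next), row-standardised.
n = 8
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1
W = W / W.sum(axis=1, keepdims=True)

y = np.arange(1.0, n + 1)            # smooth gradient: strongly clustered values

rng = np.random.default_rng(12345)
observed = morans_i(y, W)
# Shuffle the same values over the areas 999 times and recompute I each time.
sims = np.array([morans_i(rng.permutation(y), W) for _ in range(999)])

# Pseudo p-value: share of maps (999 random ones plus the observed one)
# whose I is at least as large as the observed I.
p_sim = ((sims >= observed).sum() + 1) / (999 + 1)
expected = -1 / (n - 1)              # expected I under the null

print(round(observed, 4), round(expected, 4), p_sim)
```

Because the observed map is counted as one of the 1000, the smallest value `p_sim` can take with 999 permutations is 0.001, which is why results such as 0.001 or 0.002 are so common.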

Besides importing `esda` as above, we can also compute Moran's I by calling the function directly as `ps.explore.esda.Moran`.

In [None]:
I_price = ps.explore.esda.Moran(gdf['Mean Price'].values, w_queen)  # Moran's I
I_price.I, I_price.p_sim  #value of statistic, inference on Moran's I

Thus, the $I$ statistic is $0.3478$ for this data, and has a very small $p$ value.

In [None]:
b # I is same as the slope of the line in the scatterplot

In [None]:
I_price.sim[0:5]
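
Why is $I$ the same as the slope of the Moran scatterplot? With row-standardised weights, the slope of the least-squares line of the spatial lag on the variable reduces to the formula for Moran's $I$. The sketch below checks this numerically on four invented areas; the matrix and prices are assumptions for illustration only.

```python
import numpy as np

# Hypothetical contiguity among 4 areas, row-standardised.
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

y = np.array([100.0, 80.0, 60.0, 40.0])
lag = W @ y

# Slope of the best-fit line in the Moran scatterplot (lag against y).
slope, intercept = np.polyfit(y, lag, 1)

# Moran's I from its definition.
z = y - y.mean()
moran_i = (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

print(np.isclose(slope, moran_i))   # True
```

The equality relies on the row-standardisation: with binary weights the slope and $I$ would generally differ.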

Let us visualize the distribution with a KDE plot again, with a rug showing all of the simulated points and a vertical line denoting the observed value of the statistic.

In [None]:
sns.kdeplot(I_price.sim, shade=True)
plt.vlines(I_price.sim, 0, 40)
plt.vlines(I_price.I, 0, 40, 'r')
plt.xlim([-0.1, 0.4])

If our $I$ statistic were close to its expected value, `I_price.EI`, the plot might look like this:

In [None]:
sns.kdeplot(I_price.sim, shade=True)
plt.vlines(I_price.sim, 0, 10)
plt.vlines(I_price.EI+.01, 0, 40, 'r')
plt.xlim([-0.1, 0.1])

We can now conclude that the pattern of Airbnb listing prices is not spatially random, but instead shows significant spatial association.

### <a id='Local spatial autocorrelation'>Local spatial autocorrelation</a>
PySAL implements Local Indicators of Spatial Association (LISAs) for Moran's $I$ and for Getis and Ord's $G$, which we can use to detect hotspots.

We use the local Moran's $I$ to test for spatial autocorrelation at each location. It measures the spatial autocorrelation of an attribute $y$ observed over $n$ spatial units. To calculate Moran's $I$ we first need a spatial weights matrix $W$ (which can, for example, be read in from a GAL file). To build $W$, we need to work out which polygons neighbour each other (e.g. Queen-style contiguity neighbours, Rook's case neighbours, etc.).

Read more in R.Bivand (2017) "Finding Neighbours".

Normally, we use a Rook or a Queen contiguity weight matrix. Rook weights consider observations as neighbours only when they share an edge, while Queen contiguity weights treat polygons as neighbours when they share either an edge or a vertex. The two may differ, depending on how an observation and its nearby polygons are configured.
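
The difference between the two criteria is easiest to see on a regular grid. The following is a small sketch in plain Python; `grid_neighbours` is a hypothetical helper written for this illustration and is not part of PySAL.

```python
def grid_neighbours(rows, cols, queen=False):
    """Return {cell: sorted list of neighbouring cells} for a regular grid."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # shared edges (rook)
    if queen:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]    # shared corners as well
    neighbours = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            neighbours[cell] = sorted(
                (r + dr) * cols + (c + dc)
                for dr, dc in steps
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            )
    return neighbours

rook = grid_neighbours(3, 3)
queen = grid_neighbours(3, 3, queen=True)
print(len(rook[4]), len(queen[4]))   # centre cell of the 3x3 grid
print(rook[0], queen[0])             # corner cell
```

On irregular polygons such as LSOAs the two criteria differ less often than on a grid, because polygons that touch only at a single vertex are comparatively rare.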

Instead of a single $I$ statistic, we have an *array* of local $I_i$ statistics, stored in the `.Is` attribute, and p-values from the simulation are in `p_sim`.

In [None]:
lisa = ps.explore.esda.Moran_Local(gdf['Mean Price'].values, w_queen, permutations=999)
lisa.Is

In [None]:
lisa.q   # quadrant in the Moran scatterplot: 1 HH, 2 LH, 3 LL, 4 HL

In [None]:
lisa.p_sim

In [None]:
(lisa.p_sim < 0.05).sum()
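
To make the link between the local and the global statistic concrete, here is a minimal sketch of one common form of the local Moran statistic on a toy chain of four areas. The data are made up, and the scaling constant is an assumption of this sketch: the values stored in `.Is` may be scaled slightly differently.

```python
import numpy as np

# Toy example: 4 areas on a line with row-standardised weights
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

y = np.array([10.0, 12.0, 20.0, 25.0])   # made-up values
z = y - y.mean()
lag_z = W @ z                            # average deviation among each area's neighbours

# One common form: I_i = z_i * (Wz)_i / m2, with m2 = sum(z^2) / n
m2 = (z @ z) / len(z)
local_i = z * lag_z / m2

# With row-standardised weights, the local values average to the global statistic
global_i = (z @ W @ z) / (z @ z)
print(local_i.round(3), round(local_i.mean(), 4), round(global_i, 4))
```

A positive $I_i$ means an area and its neighbours deviate from the mean in the same direction (HH or LL); a negative value indicates a spatial outlier (HL or LH).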

Next, we draw a Moran scatterplot with the statistically significant LISA values highlighted. We want to plot the statistically significant LISA values in a different color from the others. To do this, first find all of the statistically significant LISAs. Since the $p$-values are in the same order as the $I_i$ statistics, we can do this in the following way.

In [None]:
gdf['lag_price'] = ps.lib.weights.lag_spatial(w_queen, gdf['Mean Price'])
sigs = gdf['Mean Price'][lisa.p_sim <= .05]
W_sigs = gdf['lag_price'][lisa.p_sim <= .05]
insigs = gdf['Mean Price'][lisa.p_sim > .05]
W_insigs = gdf['lag_price'][lisa.p_sim > .05]

Then, since we have a lot of points, we can make the points with a statistically insignificant LISA value lighter using the `alpha` keyword. In addition, we plot the statistically significant points as dark red triangles.

In [None]:
b,a = np.polyfit(gdf['Mean Price'], gdf['lag_price'], 1)
moran=ps.explore.esda.Moran(gdf['Mean Price'].values, w_queen)

fig, ax=plt.subplots(1, figsize=(14,10))
plt.plot(sigs, W_sigs, '^', color='firebrick')
plt.plot(insigs, W_insigs, '.k', alpha=.2)
# dashed vertical line at the mean listing price
plt.vlines(gdf['Mean Price'].mean(), gdf['lag_price'].min(), gdf['lag_price'].max(), linestyle='--')
# dashed horizontal line at the mean of the spatially lagged price
plt.hlines(gdf['lag_price'].mean(), gdf['Mean Price'].min(), gdf['Mean Price'].max(), linestyle='--')

# red line of best fit using global I as slope
plt.plot(gdf['Mean Price'], a + b*gdf['Mean Price'], 'r')
plt.text(s='$I = %.3f$' % moran.I, x=1400, y=500, fontsize=14)
plt.text(600, 500, "HH", fontsize=15, color='r')
plt.text(600, 0, "HL", fontsize=15, color='r')
plt.text(50, 500, "LH", fontsize=15, color='r')
plt.text(50, 0, "LL", fontsize=15, color='r')
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of Price')
plt.xlabel('Airbnb Listing Price')

After measuring both global and local spatial autocorrelation, let's visualize the results on a map of London.

In [None]:
from pysal.viz.splot.esda import lisa_cluster
fig, ax=plt.subplots(1, figsize=(14,10))
fig = lisa_cluster(lisa, gdf, ax=ax)
plt.title("LISA Cluster Map")
plt.show()

So far, only high values surrounded by high values have been highlighted. However, we can distinguish the specific types of local spatial association reflected in the four quadrants of the Moran scatterplot as follows:

In [None]:
sig = lisa.p_sim < 0.05
hotspot = sig * lisa.q==1
coldspot = sig * lisa.q==3
doughnut = sig * lisa.q==2
diamond = sig * lisa.q==4
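
The masks in the cell above rely on operator precedence: `sig * lisa.q==1` is evaluated as `(sig * lisa.q) == 1`. The sketch below uses made-up numbers to show how quadrant codes can be derived and that the multiplication form matches the more explicit `sig & (q == 1)`. The quadrant rule is an assumption that follows the 1 HH, 2 LH, 3 LL, 4 HL convention.

```python
import numpy as np

# Made-up standardised values, their spatial lags, and pseudo p-values
z     = np.array([ 1.2, -0.8,  0.9, -1.1,  0.3])
lag_z = np.array([ 0.7,  0.5, -0.6, -0.9,  0.1])
p_sim = np.array([0.01, 0.03, 0.20, 0.02, 0.60])

# Quadrant codes: 1 HH, 2 LH, 3 LL, 4 HL
q = np.select(
    [(z > 0) & (lag_z > 0), (z <= 0) & (lag_z > 0), (z <= 0) & (lag_z <= 0)],
    [1, 2, 3],
    default=4,
)

sig = p_sim < 0.05
hot = sig & (q == 1)      # explicit boolean form
hot_alt = sig * q == 1    # parsed as (sig * q) == 1, which yields the same mask
print(q, hot, np.array_equal(hot, hot_alt))
```

The last observation falls in the HH quadrant but is not significant, so it is excluded from the hotspot mask in both forms.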

In [None]:
# list the mean prices of the areas that are hotspots
gdf['Mean Price'][hotspot]

In [None]:
spots = ['n.sig.', 'hot spot']
labels = [spots[i] for i in hotspot*1]

from matplotlib import colors
hmap = colors.ListedColormap(['red', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
gdf.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

### <a id='Task 9'>Task 9</a>
Extract the information for the cold spots, doughnuts and diamonds by yourself, and plot each of them separately.

In [None]:
spots = ['n.sig.', 'cold spot']
labels = [spots[i] for i in coldspot*1]

from matplotlib import colors
hmap = colors.ListedColormap(['blue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
gdf.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

In [None]:
spots = ['n.sig.', 'doughnut']
labels = [spots[i] for i in doughnut*1]

from matplotlib import colors
hmap = colors.ListedColormap(['lightblue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
gdf.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

In [None]:
spots = ['n.sig.', 'diamond']
labels = [spots[i] for i in diamond*1]

from matplotlib import colors
hmap = colors.ListedColormap(['orange', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
gdf.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

With LISAs in `PySAL`, we classify the observations into four groups according to their own value and the values of their neighbours. This lets us explore concentration patterns and identify cases where similar values cluster together (**HH, LL**) or where dissimilar values sit side by side (**HL, LH**). The mechanism is similar to that of the global Moran's $I$, but applied to each observation. This tool is widely used to identify clusters in space, and it provides suggestive evidence about the processes that might be at work, e.g. the identification of spatial clusters of groups of people, or the delineation of areas with a particularly high or low frequency of a certain activity.

Now let us pull out areas with statistically significant spatial clustering (at the 5% level):

In [None]:
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(12, 8))
# Plot building blocks
gdf.plot(ax=ax, facecolor='1', linewidth=0.1)
# Plot HH clusters
hh = gdf.loc[(hotspot) & (sig==True), 'geometry']
hh.plot(ax=ax, color='red', linewidth=0.1, edgecolor='w')
# Plot LL clusters
ll = gdf.loc[(coldspot) & (sig==True), 'geometry']
ll.plot(ax=ax, color='blue', linewidth=0.1, edgecolor='w')
# Plot LH clusters
lh = gdf.loc[(doughnut) & (sig==True), 'geometry']
lh.plot(ax=ax, color='#83cef4', linewidth=0.1, edgecolor='w')
# Plot HL clusters
hl = gdf.loc[(diamond) & (sig==True), 'geometry']
hl.plot(ax=ax, color='orange', linewidth=0.1, edgecolor='w')
# Non-significant
ns = gdf.loc[sig!=True, 'geometry']
ns.plot(ax=ax, color='0.75', linewidth=0.1, edgecolor='w')
# Style and draw
f.suptitle('LISA for Airbnb Listing Price', size=20)
f.set_facecolor('w')
plt.show()

## <a id='Spatial Regression'>Spatial Regression</a>
We need to go back to last week's OLS regression configuration first, and use Moran's I tool to check for spatial autocorrelation.

In [8]:
from pysal.model.spreg import ols
from pysal.model.spreg import ml_error
from pysal.model.spreg import ml_lag

In [10]:
# read the .dbf file from your shapefile data
f = ps.lib.io.open('data/lsoa_IMD_airbnb_housing.dbf','r')
# Read in the listing_price (dependent variable) into an array y
y = np.array(f.by_col['Mean Price'])
y.shape = (len(y),1)
# Collect the independent variables into a two-dimensional array X (one column per variable).
# Feel free to change the independent variables
X= []
X.append(f.by_col['Value']) # average house price
X.append(f.by_col['IncScore']) # Income score in 2019
X.append(f.by_col['EduScore']) # Education score in 2019
X.append(f.by_col['BHSScore']) # Barrier to Housing Services score in 2019
X.append(f.by_col['IMDScore']) # Deprivation index in 2019
X.append(f.by_col['Property C']) # number of airbnbs
X = np.array(X).T

In [None]:
mi = ps.explore.esda.moran.Moran(gdf['Mean Price'], w_queen, two_tailed=False)
print("The Statistic Moran's I is: "+str("%.4f"%mi.I),
      "\nThe Expected Value for Statistic I is: "+str("%.4f"%mi.EI),
      "\nThe Significance Test Value is: "+str("%.4f"%mi.p_norm))

### <a id='Spatial Lag model'>Spatial Lag model</a>
In a similar way to how we have included the spatial lag, one could think that the Airbnb listing prices surrounding a given property also enter its own price function. Recall the Spatial Lag model from the lecture, and use the `ML_Lag` class in `pysal.model.spreg` to estimate this model.

In [11]:
import pysal.model.spreg as psms

spat_lag = psms.ML_Lag(y,X,w_queen, name_y='Airbnb_price', 
                       name_x=['house_price', 'income_score','education_score', 'barrier_score', 'IMD', 'Num_Airbnb'], 
                       name_w='w_queen', name_ds='lsoa_airbnb_housing')
print(spat_lag.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :lsoa_airbnb_housing
Weights matrix      :     w_queen
Dependent Variable  :Airbnb_price                Number of Observations:        4201
Mean dependent var  :     87.6554                Number of Variables   :           8
S.D. dependent var  :     71.2105                Degrees of Freedom    :        4193
Pseudo R-squared    :      0.3039
Spatial Pseudo R-squared:  0.2929
Sigma-square ML     :    3529.299                Log likelihood        :  -23131.647
S.E of regression   :      59.408                Akaike info criterion :   46279.294
                                                 Schwarz criterion     :   46330.039

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
----------------------

In [None]:
print("{0:.6f}".format(spat_lag.rho)) #estimate of spatial autoregressive coefficient
print(np.around(spat_lag.betas, decimals=4)) #array of estimated coefficients
print("{0:.6f}".format(spat_lag.mean_y)) #Mean of dependent variable
print("{0:.6f}".format(spat_lag.std_y))#Standard deviation of dependent variable
print(np.around(np.diag(spat_lag.vm1), decimals=4)) #Diagonal of the variance covariance matrix (k+2 x k+2) - includes sigma2
print(np.around(np.diag(spat_lag.vm), decimals=4)) #Diagonal of the variance covariance matrix (k+1 x k+1) - includes rho
print("{0:.6f}".format(spat_lag.sig2))#Sigma squared used in computations
print("{0:.6f}".format(spat_lag.logll)) #maximized log-likelihood (including constant terms)

As we can see, the results are again very similar for all the other variables. It is also very clear that the estimate of the spatial lag of price is statistically significant. This points to evidence of spatial interaction between property owners when they set their prices.
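
To build intuition for what the spatial lag term does, the sketch below simulates data from a spatial lag process on a toy chain of six areas. Everything here is assumed for illustration: the weights, the coefficient `rho`, and the `beta` values are made up and unrelated to the estimates above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Chain of n areas with row-standardised weights (toy structure, not the London data)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)

rho = 0.4                                   # assumed spatial autoregressive coefficient
beta = np.array([2.0, 0.5])                 # assumed intercept and slope
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=0.1, size=n)

# Spatial lag model: y = rho * W y + X beta + eps, hence y = (I - rho W)^{-1} (X beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# Check that the simulated y satisfies the structural equation
residual = y - (rho * W @ y + X @ beta + eps)
print(np.abs(residual).max())
```

Because `y` appears on both sides of the equation, a shock in one area feeds into its neighbours and back again, which is why the model is solved through the inverse of `I - rho * W` rather than fitted by plain OLS.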

### <a id='Spatial Error model'>Spatial Error model</a>

Now, we can use a spatial error model to account for spatial non-independence. The **spreg** module in PySAL has several different functions for creating a spatial regression model. 

Reference:
[1] Anselin, L. (1988) "Spatial Econometrics: Methods and Models".
    Kluwer Academic Publishers. Dordrecht.

In [None]:
spat_err = psms.ML_Error(y,X,w_queen, name_y='Airbnb_price', 
                       name_x=['house_price', 'income_score','education_score', 'barrier_score', 'IMD', 'Num_Airbnb'], 
                       name_w='w_queen', name_ds='lsoa_airbnb_housing')
print(spat_err.summary)

In [None]:
print("{0:.6f}".format(spat_err.lam)) #estimate of spatial autoregressive coefficient
print(np.around(spat_err.betas, decimals=4)) #array of estimated coefficients
print("{0:.6f}".format(spat_err.mean_y)) #Mean of dependent variable
print("{0:.6f}".format(spat_err.std_y))#Standard deviation of dependent variable
print(np.diag(spat_err.vm)) #Diagonal of the variance covariance matrix (k+1 x k+1) - includes lambda
print("{0:.6f}".format(spat_err.sig2[0][0]))#Sigma squared used in computations
print("{0:.6f}".format(spat_err.logll)) #maximized log-likelihood (including constant terms)
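
For reference when comparing the two sets of estimates, the standard textbook forms of the two models can be written side by side. This is a summary in generic notation, not output from `spreg`: $W$ is the spatial weights matrix, $\rho$ corresponds to `rho`, $\lambda$ corresponds to `lam`, and $\varepsilon$ is an independent error term.

$$
\begin{aligned}
\text{Spatial lag:} \quad & y = \rho W y + X\beta + \varepsilon \\
\text{Spatial error:} \quad & y = X\beta + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
$$

In the lag model the neighbours' prices enter the price equation directly, whereas in the error model the spatial dependence sits only in the unexplained part $u$.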

### <a id='Task 10'>Task 10</a>
Discuss your interpretation of the results from the Spatial Lag Model and the Spatial Error Model with your neighbors, compare them with last week's OLS regression results, and summarize your conclusions below:

**----------------------------------------------------------------**

### <a id='Prediction performance of spatial models'>Prediction performance of spatial models</a>
We can use the mean squared error (MSE), a standard metric of accuracy in the machine learning literature, to evaluate whether explicitly spatial models are better than traditional, non-spatial ones:

In [12]:
m1 = psms.OLS(y, X, name_y='Airbnb_price', 
              name_x=['house_price', 'income_score','education_score', 'barrier_score', 'IMD', 'Num_Airbnb'], 
              name_ds='lsoa_airbnb_housing')
sl = psms.ML_Lag(y, X, w_queen, name_y='Airbnb_price', 
                       name_x=['house_price', 'income_score','education_score', 'barrier_score', 'IMD', 'Num_Airbnb'], 
                       name_w='w_queen', name_ds='lsoa_airbnb_housing')
se = psms.ML_Error(y, X, w_queen, name_y='Airbnb_price', 
                       name_x=['house_price', 'income_score','education_score', 'barrier_score', 'IMD', 'Num_Airbnb'], 
                       name_w='w_queen', name_ds='lsoa_airbnb_housing')

In [13]:
from sklearn.metrics import mean_squared_error as mse

mses = pd.Series({'OLS': mse(y, m1.predy.flatten()), 
                  'SL': mse(y, sl.predy.flatten()), 
                  'SE': mse(y, se.predy.flatten())
                    })
mses.sort_values()

SL     3529.299498
OLS    3624.876483
SE     3630.180809
dtype: float64

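The spatial lag model (SL) achieves the lowest MSE, about 3529 against about 3625 for OLS, a reduction of roughly 2.6%. Including the spatial lag of price therefore improves the accuracy of the model, although only marginally, while the spatial error model (SE) does not improve on OLS by this measure.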
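
The MSE itself is simple to compute by hand. The sketch below uses made-up prices and a hypothetical helper `mse_by_hand`; it flattens both inputs first, because subtracting an `(n,)` array from an `(n, 1)` column vector would otherwise broadcast to an `(n, n)` matrix.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up prices and predictions, for illustration only
observed = [[80.0], [120.0], [95.0]]     # column vector, shaped like y above
predicted = [85.0, 110.0, 95.0]
print(mse_by_hand(observed, predicted))
```

Note that these are in-sample errors: each model is evaluated on the same data it was fitted to.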

### <a id='GWR Prediction'>GWR Prediction</a>
Geographically weighted regression (**GWR**) can fit Gaussian, Poisson, and logistic models; fitting a model returns a 'GWR Results' object. Now let's compare the maps *before* and *after* applying the GWR prediction.

In [None]:
vmin, vmax = np.min(gdf['Mean Price']), np.max(gdf['Mean Price']) 
ax = gdf.plot('Mean Price', vmin=vmin, vmax=vmax, figsize=(8,8), cmap='Reds')
ax.set_title('Mean Price')
fig = ax.get_figure()
cax = fig.add_axes([1.0, 0.3, 0.02, 0.4]) # the position and size of colormap legend bar
sm_price = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap='Reds')
sm_price._A = []
fig.colorbar(sm_price, cax=cax)

In [18]:
# Prep data into design matrix and coordinates

# Dependent variable
y = gdf['Mean Price']
y = np.array(y).reshape(-1,1) # reshape into an (n, 1) column vector

In [17]:
#Design matrix - covariates - intercept added automatically
house_price = np.array(gdf.Value).reshape(-1,1)
income_score = np.array(gdf.IncScore).reshape(-1,1)
educa_score = np.array(gdf.EduScore).reshape(-1,1)
barri_score = np.array(gdf.BHSScore).reshape(-1,1)
imd_score = np.array(gdf.IMDScore).reshape(-1,1)
num_airbnb = np.array(gdf['Property C']).reshape(-1,1)

X = np.hstack([house_price, income_score, educa_score, barri_score, imd_score, num_airbnb])
labels = ['Intercept', 'house_price', 'income_score', 'educa_score', 'barri_score', 'imd_score', 'num_airbnb']

# standardization
X_s = (X - X.mean(axis=0)) / X.std(axis=0)
y_s = (y - y.mean(axis=0)) / y.std(axis=0)

In [19]:
#Coordinates for calibration points
def getXY(pt):
    return (pt.x, pt.y)
centroidseries = gdf['geometry'].centroid
u,v = [list(t) for t in zip(*map(getXY, centroidseries))]

coords = list(zip(u,v))

In [20]:
#Prepare dataset inputs
g_y = gdf['Mean Price'].values.reshape((-1,1))
g_X = gdf[['Value', 'IncScore','EduScore', 'BHSScore', 'IMDScore', 'Property C']].values

g_coords = list(zip(u,v))

# Standardise our data to have a mean of 0 and a standard deviation of 1
g_X = (g_X - g_X.mean(axis=0)) / g_X.std(axis=0)

g_y = g_y.reshape((-1,1))

g_y = (g_y - g_y.mean(axis=0)) / g_y.std(axis=0)
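
As a quick check of what the standardisation does, the sketch below applies the same column-wise formula to a small made-up matrix whose columns are on very different scales.

```python
import numpy as np

# Made-up covariates on very different scales (e.g. a price and a score)
X_toy = np.array([[200.0, 0.10],
                  [350.0, 0.25],
                  [500.0, 0.40]])
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_std.mean(axis=0).round(6), X_toy_std.std(axis=0).round(6))
```

After standardisation the coefficients are expressed per standard deviation of each covariate, which makes their magnitudes comparable across variables.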

In [21]:
from pysal.model.mgwr.sel_bw import Sel_BW

# Select bandwidth for kernel
bw = Sel_BW(g_coords, 
            g_y, # Dependent variable
            g_X, # Independent variables
            fixed=False, # True for fixed bandwidth and False for adaptive bandwidth
            spherical=True) # True for spherical coordinates (long-lat), False for projected coordinates
# calculate the optimum bandwidth for our local regression
bw.search(bw_min=2)

1295.0
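
With an adaptive bandwidth, the value returned by the search is a number of nearest neighbours rather than a distance. The sketch below illustrates the idea with a bisquare kernel; `adaptive_bisquare_weights` is a hypothetical helper, and details such as whether the focal point counts as its own neighbour are assumptions that may differ from `mgwr`'s implementation.

```python
import numpy as np

def adaptive_bisquare_weights(coords, i, k):
    """Bisquare weights around point i; the bandwidth is the distance to its k-th nearest point."""
    coords = np.asarray(coords, dtype=float)
    d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))   # Euclidean distances to point i
    h = np.sort(d)[k - 1]                                  # point i itself counts as nearest (d = 0)
    w = (1 - (d / h) ** 2) ** 2
    w[d >= h] = 0.0                                        # no influence at or beyond the bandwidth
    return w

# Five made-up points on a line, one unit apart
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
w = adaptive_bisquare_weights(pts, i=0, k=3)
print(w)
```

In dense parts of the map the k-th neighbour is close, so the kernel is narrow; in sparse parts it widens automatically, which is the main appeal of an adaptive bandwidth.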

In [22]:
from pysal.model.mgwr.gwr import GWR
#Instantiate GWR model and then estimate parameters and diagnostics using fit method
model = GWR(coords, y, X, bw.bw[0])
results = model.fit()
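
Conceptually, GWR fits a separate weighted least squares regression at every location. The sketch below shows that idea on simulated data whose true slope grows from west to east. It is a simplification under stated assumptions: it uses a fixed Gaussian kernel rather than the kernel and bandwidth selected above, and `local_fit` is a hypothetical helper, not an `mgwr` function.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Made-up locations along a west-east line (projected units)
coords = np.column_stack([np.linspace(0, 10, n), np.zeros(n)])
x = rng.normal(size=n)
X_sim = np.column_stack([np.ones(n), x])   # intercept column added by hand in this sketch
# The true slope grows from west to east, which one global slope cannot capture
y_sim = 1.0 + (0.5 + 0.2 * coords[:, 0]) * x + rng.normal(scale=0.1, size=n)

def local_fit(i, bandwidth):
    """Weighted least squares centred on observation i, using a Gaussian kernel."""
    d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    XtW = X_sim.T * w                      # equivalent to X_sim.T @ np.diag(w)
    return np.linalg.solve(XtW @ X_sim, XtW @ y_sim)

print("west:", local_fit(0, 2.0))          # intercept and slope near the western end
print("east:", local_fit(n - 1, 2.0))      # intercept and slope near the eastern end
```

With a very large bandwidth all weights approach one and every local fit collapses to the global OLS estimate, so the bandwidth controls how local the model is.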

In [None]:
#Results in a set of mappable results 
results.params.shape

In [None]:
results.params

In [None]:
print (results.bse[0:10, 1])
print (results.tvalues[0:10, 1])

In [None]:
#Map Parameter estimates and T-vals for each covariate
for param in range(results.params.shape[1]):
    gdf[str(param)] = results.params[:,param]
    vmin, vmax = np.min(gdf[str(param)]), np.max(gdf[str(param)]) 
    ax = gdf.plot(str(param), vmin=vmin, vmax=vmax, figsize=(8,8), cmap='YlOrRd')
    ax.set_title(labels[param] + ' Estimates')
    fig = ax.get_figure()
    cax = fig.add_axes([1.0, 0.3, 0.03, 0.4])
    sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap='YlOrRd')
    sm._A = []
    fig.colorbar(sm, cax=cax)
    
    gdf[str(param)] = results.tvalues[:,param]
    vmin, vmax = np.min(gdf[str(param)]), np.max(gdf[str(param)]) 
    ax = gdf.plot(str(param), vmin=vmin, vmax=vmax, figsize=(8,8), cmap='Greys')
    ax.set_title(labels[param] + ' t-vals')
    fig = ax.get_figure()
    cax = fig.add_axes([1.0, 0.3, 0.02, 0.4])
    sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap='Greys')
    sm._A = []
    fig.colorbar(sm, cax=cax)

In [None]:
print (len(results.localR2))
print (np.mean(results.localR2))
print (results.localR2)

In [None]:
#Map local R-square values which is a weighted R-square at each observation location

gdf['localR2'] = results.localR2
vmin, vmax = np.min(gdf['localR2']), np.max(gdf['localR2']) 
ax = gdf.plot('localR2', vmin=vmin, vmax=vmax, figsize=(8,8), cmap='PuBuGn')
ax.set_title('Local R-Squared')
fig = ax.get_figure()
cax = fig.add_axes([1.0, 0.3, 0.02, 0.4])
sm = plt.cm.ScalarMappable(norm=plt.Normalize(vmin=vmin, vmax=vmax), cmap='PuBuGn')
sm._A = []
fig.colorbar(sm, cax=cax)

## Credits!

#### Contributors:
The following individual(s) have contributed to these teaching materials: Yijing Li (yijing.li@kcl.ac.uk).

#### License
These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).