# Update and Proximity

In [13]:
from sqlalchemy import create_engine
import geopandas as gpd
import requests
import shutil
import folium
import json
from pathlib import Path

In [14]:
engine_str = "postgresql+psycopg2://docker:docker@0.0.0.0:25432/restaurants"
engine = create_engine(engine_str)

In [15]:
%load_ext sql
%sql $engine.url

'Connected: docker@restaurants'

We are using a new dataset: NYC subway stations, retrieved from [NYC Open Data](https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49).

In [3]:
# NYC Open Data automatically generates the layer name
layer = "geo_export_4accb246-9c7e-4ce4-92bd-300470ade60a"

The first attempt to load the layer with `ogr2ogr` failed with this error:

```
ERROR:  numeric field overflow
```

As a workaround, we'll set the layer creation option `precision=NO`, so that `ogr2ogr` creates float columns instead of `numeric` columns, which have a fixed precision.

In [32]:
%%bash -s "$layer"
unzip -o "../data/Subway Stations.zip" -d "../data/Subway Stations" > /dev/null

ogr2ogr -f "PostgreSQL" \
 PG:"host='0.0.0.0' port='25432' user='docker' password='docker' dbname='restaurants'" "../data/Subway Stations" $1 \
 -nlt PROMOTE_TO_MULTI \
 -nln "nyc_subway_stations" \
 -lco GEOMETRY_NAME=geom \
 -lco precision=NO \
 -a_srs EPSG:4326 \
 -overwrite 

Confirm that the spatial index was created:

In [84]:
%%sql
SELECT tablename, indexname, indexdef FROM pg_indexes WHERE tablename = 'nyc_subway_stations';

 * postgresql+psycopg2://docker:***@0.0.0.0:25432/restaurants
2 rows affected.


tablename,indexname,indexdef
nyc_subway_stations,nyc_subway_stations_pkey,CREATE UNIQUE INDEX nyc_subway_stations_pkey ON public.nyc_subway_stations USING btree (ogc_fid)
nyc_subway_stations,nyc_subway_stations_geom_geom_idx,CREATE INDEX nyc_subway_stations_geom_geom_idx ON public.nyc_subway_stations USING gist (geom)


### Replace exceptional values with Spatial Average
In the mappluto dataset, there are some properties with extremely low or missing `assesstot` values. Let's update these to the average value of properties within 800 meters. The geometry of the mappluto table is in `EPSG:2263`, a projection in feet, so 800 meters becomes 2624.67 feet.
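The conversion is easy to check. This is a small sketch with constant names of my own choosing; it assumes the linear unit of `EPSG:2263` is the US survey foot and also shows the international foot for comparison, since the two differ by far less than a foot at this radius.

```python
# Convert the 800 m search radius to feet for use with EPSG:2263 geometries.
RADIUS_M = 800

US_SURVEY_FOOT_M = 1200 / 3937   # definition of the US survey foot in meters
INTL_FOOT_M = 0.3048             # definition of the international foot in meters

radius_ftus = RADIUS_M / US_SURVEY_FOOT_M
radius_ft = RADIUS_M / INTL_FOOT_M

# Both values round to the 2624.67 used in the query.
print(round(radius_ftus, 2), round(radius_ft, 2))
```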

In [56]:
%%sql
SELECT assesstot FROM mappluto_queens
WHERE assesstot < 1000 OR assesstot IS NULL
LIMIT 5;

 * postgresql+psycopg2://docker:***@0.0.0.0:25432/restaurants
5 rows affected.


assesstot
""
""
""
0.0
""


In [72]:
%%sql 
EXPLAIN ANALYZE
WITH mpq_fill AS
(
    SELECT mpq.objectid as objectid, AVG(mpqb.assesstot) as avg_assesstot
    FROM mappluto_queens as mpq
    JOIN mappluto_queens as mpqb
    ON ST_DWithin(mpq.geom, mpqb.geom, 2624.67)
    WHERE mpq.assesstot IS NULL OR mpq.assesstot < 1000
    GROUP BY mpq.objectid
)
UPDATE mappluto_queens mpq
SET assesstot = mpq_fill.avg_assesstot
FROM mpq_fill
WHERE (mpq.assesstot IS NULL OR mpq.assesstot < 1000) AND mpq.objectid = mpq_fill.objectid;

 * postgresql+psycopg2://docker:***@0.0.0.0:25432/restaurants
25 rows affected.


QUERY PLAN
Update on mappluto_queens mpq (cost=1833107.77..1845911.81 rows=0 width=0) (actual time=372.375..372.378 rows=0 loops=1)
-> Hash Join (cost=1833107.77..1845911.81 rows=13 width=50) (actual time=332.585..354.343 rows=1 loops=1)
Hash Cond: (mpq.objectid = mpq_fill.objectid)
-> Seq Scan on mappluto_queens mpq (cost=0.00..12798.50 rows=2113 width=10) (actual time=304.566..326.320 rows=1 loops=1)
Filter: ((assesstot IS NULL) OR (assesstot < '1000'::double precision))
Rows Removed by Filter: 324247
-> Hash (cost=1833081.35..1833081.35 rows=2113 width=48) (actual time=27.958..27.960 rows=1 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 33kB
-> Subquery Scan on mpq_fill (cost=1833033.81..1833081.35 rows=2113 width=48) (actual time=27.951..27.954 rows=1 loops=1)
-> HashAggregate (cost=1833033.81..1833060.22 rows=2113 width=12) (actual time=27.945..27.948 rows=1 loops=1)


After filling, only one NULL record remains. Since `AVG` ignores NULLs, I suspect this record has no neighbors with a non-NULL `assesstot` within 800 meters.

In [71]:
%sql SELECT objectid, assesstot FROM mappluto_queens WHERE assesstot is NULL;

 * postgresql+psycopg2://docker:***@0.0.0.0:25432/restaurants
1 rows affected.


objectid,assesstot
485288,
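To make the fill logic and the leftover NULL easier to reason about, here is a minimal pure-Python sketch of the same idea. The parcels, coordinates and helper names are hypothetical and not taken from `mappluto_queens`; `None` stands in for NULL. Like the SQL above, it averages every non-NULL neighbor within the radius, including other low values.

```python
import math

# Hypothetical toy parcels: (objectid, x_ft, y_ft, assesstot).
parcels = [
    (1, 0.0, 0.0, None),            # missing, has valued neighbors
    (2, 500.0, 0.0, 200_000.0),
    (3, 0.0, 1000.0, 400_000.0),
    (4, 50_000.0, 50_000.0, None),  # missing, far from any valued parcel
    (5, 50_100.0, 50_000.0, None),  # missing, only neighbor is also missing
]

RADIUS_FT = 2624.67
THRESHOLD = 1000


def needs_fill(value):
    """Mirror of the WHERE clause: NULL or below the threshold."""
    return value is None or value < THRESHOLD


def neighborhood_average(target, rows, radius):
    """Average assesstot of rows within radius of target, skipping None like SQL AVG."""
    _, tx, ty, _ = target
    values = [
        v for _, x, y, v in rows
        if v is not None and math.hypot(x - tx, y - ty) <= radius
    ]
    return sum(values) / len(values) if values else None


filled = {
    row[0]: neighborhood_average(row, parcels, RADIUS_FT)
    for row in parcels
    if needs_fill(row[3])
}
print(filled)
```

Parcel 1 receives the mean of its two valued neighbors, while parcels 4 and 5 stay `None` because nothing with a value lies within the radius, which is the situation suspected for the remaining record.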


### Spatial Proximity

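Find the five closest fast food restaurants to each subway station entrance, including restaurant names and distances to the station.

The `<->` operator orders rows by 2D distance in the units of the geometry's CRS, so the more expensive geography-based `ST_Distance` is computed only for the five rows kept per station. Since `geom` is in `EPSG:4326`, that ordering is in degrees and can differ slightly from the order of `distance_m`.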

In [86]:
%%sql 
SELECT 
sb.ogc_fid as sb_id,
sb.name as sb_name,
rt.id as rt_id,
rt.name as rt_name,
ST_Distance(geography(sb.geom), geography(rt.geom)) distance_m
FROM nyc_subway_stations sb
CROSS JOIN LATERAL
(
    SELECT 
        id,
        name,
        geom
    FROM restaurants
    ORDER BY sb.geom <-> geom
    LIMIT 5
) as rt
LIMIT 25;


 * postgresql+psycopg2://docker:***@0.0.0.0:25432/restaurants
25 rows affected.


sb_id,sb_name,rt_id,rt_name,distance_m
1,Astor Pl,23591,MCD,206.19760499
1,Astor Pl,42707,TCB,213.52621397
1,Astor Pl,29384,MCD,229.74810986
1,Astor Pl,41069,TCB,580.74329365
1,Astor Pl,14730,KFC,505.68644509
2,Canal St,20117,MCD,79.73184965
2,Canal St,6849,BKG,111.40386844
2,Canal St,20185,MCD,505.98773468
2,Canal St,30144,MCD,552.49421209
2,Canal St,49448,WDY,942.42202175
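The per-station "order by distance, keep the first k" pattern of the lateral join can be sketched in plain Python. This is only an illustration under stated assumptions: the station and restaurant coordinates are made up, `planar_degrees` imitates how `<->` ranks `EPSG:4326` geometries, and `haversine_m` is a spherical stand-in for the geography distance rather than what PostGIS computes internally.

```python
import heapq
import math

# Hypothetical (name, lon, lat) records; not rows from the tables above.
station = ("Example St", -73.9910, 40.7300)
restaurants = [
    ("A", -73.9905, 40.7302),
    ("B", -73.9950, 40.7300),
    ("C", -73.9910, 40.7335),
    ("D", -73.9800, 40.7300),
    ("E", -73.9912, 40.7299),
    ("F", -74.0100, 40.7500),
]


def planar_degrees(lon1, lat1, lon2, lat2):
    """2D distance in CRS units (degrees), used here only for ranking."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


def haversine_m(lon1, lat1, lon2, lat2, radius_m=6_371_008.8):
    """Great-circle distance in meters on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def k_nearest(station, candidates, k=5):
    """Rank candidates by planar degrees, then report meters for the k kept."""
    _, slon, slat = station
    nearest = heapq.nsmallest(
        k, candidates, key=lambda r: planar_degrees(slon, slat, r[1], r[2])
    )
    return [(name, haversine_m(slon, slat, lon, lat)) for name, lon, lat in nearest]


result = k_nearest(station, restaurants, k=4)
for name, dist_m in result:
    print(name, round(dist_m, 1))
```

In this toy data, C is ranked ahead of B in degrees even though B is closer in meters, because a degree of longitude is shorter than a degree of latitude at New York's latitude. That may be why `distance_m` is not strictly increasing within a station in the output above.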


Test the performance of the query over all stations:

In [87]:
%%sql 
EXPLAIN ANALYZE
SELECT 
sb.ogc_fid as sb_id,
sb.name as sb_name,
rt.id as rt_id,
rt.name as rt_name,
ST_Distance(geography(sb.geom), geography(rt.geom)) distance_m
FROM nyc_subway_stations sb
CROSS JOIN LATERAL
(
    SELECT 
        id,
        name,
        geom
    FROM restaurants
    ORDER BY sb.geom <-> geom
    LIMIT 5
) as rt


 * postgresql+psycopg2://docker:***@0.0.0.0:25432/restaurants
13 rows affected.


QUERY PLAN
Nested Loop (cost=65238.04..30916807.44 rows=2365 width=34) (actual time=159.702..15683.617 rows=2365 loops=1)
-> Seq Scan on nyc_subway_stations sb (cost=0.00..26.73 rows=473 width=58) (actual time=0.014..0.328 rows=473 loops=1)
-> Limit (cost=65238.04..65238.05 rows=5 width=48) (actual time=32.862..32.863 rows=5 loops=473)
-> Sort (cost=65238.04..65363.04 rows=50002 width=48) (actual time=32.859..32.860 rows=5 loops=473)
Sort Key: ((sb.geom <-> restaurants.geom))
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on restaurants (cost=0.00..64407.52 rows=50002 width=48) (actual time=0.104..24.209 rows=50002 loops=473)
Planning Time: 0.099 ms
JIT:
Functions: 6
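One way to read this plan is to multiply the per-loop time of the inner subquery by its loop count and compare it with the total. The figures below are copied from the plan above; the variable names are mine.

```python
# Figures copied from the EXPLAIN ANALYZE output above.
loops = 473                 # one inner execution per subway station
inner_ms_per_loop = 32.863  # actual time of the Limit node, per loop
total_ms = 15683.617        # actual time of the top-level Nested Loop

inner_total_ms = loops * inner_ms_per_loop
share = inner_total_ms / total_ms

print(f"inner subquery: {inner_total_ms:.0f} ms ({share:.0%} of total)")
```

Nearly all of the runtime goes to the inner subquery, which scans all 50002 restaurants sequentially and sorts them once per station instead of using an index for the `<->` ordering. One possible cause is that `restaurants` has no GiST index on `geom`; the notebook does not show that table's indexes, so this is a guess.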
