# ANT404 Lab #5 - Redshift Spatial Processing

In this lab you will use Redshift's new spatial processing capabilities, including using Redshift Spectrum to query external data and cast it into the `GEOMETRY` data type. Example source data has been shared with your test account.


### Background on Redshift Spatial Processing

Amazon Redshift recently announced support for spatial data with the addition of a new polymorphic data type `GEOMETRY`. This capability enables you to store, retrieve, and process spatial data so you can enhance your business insights by integrating spatial data into your analytical queries.

The new data type supports multiple geometric shapes such as Point, Linestring, Polygon, MultiPoint, MultiLinestring, MultiPolygon, and GeometryCollection. 
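
To make the shape names above concrete, here is a small sketch that pairs each one with a sample in well-known text (WKT), the text format used later in this lab. The coordinates are made-up illustration values, and `wkt_type` is a hypothetical helper, not a Redshift function.

```python
# Illustrative WKT samples, one per shape type listed above.
# All coordinates except the first point are arbitrary example values.
WKT_SAMPLES = {
    "Point": "POINT(13.377704 52.516431)",
    "Linestring": "LINESTRING(0 0, 1 1)",
    "Polygon": "POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",
    "MultiPoint": "MULTIPOINT((0 0), (1 1))",
    "MultiLinestring": "MULTILINESTRING((0 0, 1 1), (2 2, 3 3))",
    "MultiPolygon": "MULTIPOLYGON(((0 0, 1 0, 1 1, 0 0)))",
    "GeometryCollection": "GEOMETRYCOLLECTION(POINT(0 0), LINESTRING(0 0, 1 1))",
}


def wkt_type(wkt: str) -> str:
    """Return the leading type keyword of a WKT string (hypothetical helper)."""
    return wkt.split("(", 1)[0].strip().upper()


for shape, wkt in WKT_SAMPLES.items():
    print(f"{shape:20s} {wkt_type(wkt):20s} {wkt}")
```

Note that a polygon ring repeats its first point at the end to close the shape.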

You can add `GEOMETRY` columns to Redshift tables and write SQL queries that span spatial and non-spatial data. Redshift also adds over 40 new spatial SQL functions to construct geometric shapes and to import, export, access, and process spatial data.

You can also seamlessly extend spatial processing to your data lake by integrating external tables in spatial queries and casting data into `GEOMETRY` during query execution.



## 1. Check for credentials file
Check for the credentials created in the `START_HERE` notebook.

In [None]:
%%bash
cat ant404-lab.creds

## 2. Set local variables from credentials file
Run this cell to import the credentials created in the `START_HERE` notebook into this notebook. Later cells rely on these variables.

In [None]:
import simplejson
with open("ant404-lab.creds") as fh:
    creds = simplejson.loads(fh.read())
username=creds["user_name"]
password=creds["password"]
host_name=creds["host_name"]
port_num=creds["port_num"]
db_name=creds["db_name"]

# Example Account and Region values for this lab
log_account=123456789101
region="us-east-1"

%set_env username={username}
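
Because later cells rely on these variables, it can help to confirm that the credentials file contains every key before you continue. The sketch below is one possible check; `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the sample content is invented rather than taken from the real `ant404-lab.creds` file.

```python
import json

# Keys that the cell above reads from the credentials file.
REQUIRED_KEYS = ("user_name", "password", "host_name", "port_num", "db_name")


def missing_keys(creds: dict) -> list:
    """Return the required keys that are absent or empty in creds."""
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


# Hypothetical sample content, not the real lab credentials.
sample = json.loads(
    '{"user_name": "ant404", "host_name": "example-host", "db_name": "dev"}'
)
print(missing_keys(sample))
```

An empty list means every key the notebook needs is present.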


## 3. Connect to your Redshift cluster

You will use the `sqlalchemy` and `ipython-sql` Python libraries to manage the Redshift connection. 

This cell creates a `%sql` element so we can use the connection in other cells in the notebook.

-------
_**Note:** Please ignore the pink warning message that says: "UserWarning: The psycopg2 wheel package will be renamed from release 2.8"._ **Look for** `Connected: ant404@dev` in the `Out [ ]` section below the warning.

In [None]:
import sqlalchemy
import psycopg2
import simplejson

%reload_ext sql
%config SqlMagic.displaylimit = 25

# Build the SQLAlchemy connection URL; the f-string also works if port_num is stored as a number.
connect_to_db = f'postgresql+psycopg2://{username}:{password}@{host_name}:{port_num}/{db_name}'
%sql $connect_to_db
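
A connection URL built by plain concatenation can break if the password contains characters such as `@` or `/`. The sketch below shows one way to escape the user name and password with the standard library's `quote_plus`. `build_redshift_url` is a hypothetical helper, and the host, port, and password are placeholder values, not this lab's settings.

```python
from urllib.parse import quote_plus


def build_redshift_url(user, password, host, port, database):
    """Build a SQLAlchemy URL, escaping characters that are special in URLs."""
    return (
        "postgresql+psycopg2://"
        f"{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
url = build_redshift_url("ant404", "p@ss/word", "example-host", 5439, "dev")
print(url)
```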

## 4. Create tables with the `GEOMETRY` spatial data type.
The data sets in this lab are based on public Airbnb listing data for Berlin published by Inside Airbnb at [insideairbnb.com](http://insideairbnb.com/).

### 4.1. `accommodations`
This table stores accommodations. Each row has a geolocation (`shape`) holding the longitude and latitude coordinates, plus descriptive and business data such as the name of the listing and how often it has been reviewed over time.

In [None]:
%%sql
DROP TABLE IF EXISTS public.accommodations;
CREATE TABLE public.accommodations (
       id                             INTEGER   PRIMARY KEY
     , shape                          GEOMETRY
     , name                           VARCHAR(100)
     , host_name                      VARCHAR(100)
     , neighbourhood_group            VARCHAR(100)
     , neighbourhood                  VARCHAR(100)
     , room_type                      VARCHAR(100)
     , price                          SMALLINT
     , minimum_nights                 SMALLINT
     , number_of_reviews              SMALLINT
     , last_review                    DATE
     , reviews_per_month              NUMERIC(8,2)
     , calculated_host_listings_count SMALLINT
     , availability_365               SMALLINT
);
SELECT * FROM information_schema.tables WHERE table_schema = 'public' AND table_name = 'accommodations';

### 4.2. `zipcode`
A zip code has a polygon and additional meta-data such as name, alias, id, and type.

In [None]:
%%sql
DROP TABLE IF EXISTS public.zipcode;
CREATE TABLE public.zipcode (
       ogc_field     INTEGER IDENTITY(0,1)
     , wkb_geometry  GEOMETRY
     , gml_id        VARCHAR
     , spatial_name  VARCHAR
     , spatial_alias VARCHAR
     , spatial_type  VARCHAR
);
SELECT * FROM information_schema.tables WHERE table_schema = 'public' AND table_name = 'zipcode';

### 4.3. `attractions`
This table stores the geolocation (longitude and latitude) of attractions in the city.

In [None]:
%%sql
DROP TABLE IF EXISTS public.attractions;
CREATE TABLE public.attractions (
       name     VARCHAR
     , address  VARCHAR
     , lat      FLOAT
     , lon      FLOAT
     , gps_lat  VARCHAR
     , gps_lon  VARCHAR
);
SELECT * FROM information_schema.tables WHERE table_schema = 'public' AND table_name = 'attractions';

## 5. Load data from S3 into Redshift local tables

### 5.1. `accommodations`

In [None]:
%%sql
COPY public.accommodations 
FROM 's3://redshift-managed-loads-datasets-us-east-1/dataset=spatial/size=None/table=accommodations/accommodations.csv' 
DELIMITER ';' 
IGNOREHEADER 1 
CREDENTIALS 'aws_iam_role=arn:aws:iam::080945919444:role/mod-27c4c61fae3b42fe-RedshiftClusterRole-1GBP75PRR61RG'
GZIP
;
SELECT COUNT(*) FROM public.accommodations;

### 5.2. `zipcode`

In [None]:
%%sql
COPY public.zipcode 
FROM 's3://redshift-managed-loads-datasets-us-east-1/dataset=spatial/size=None/table=zipcode/zipcode.csv.gz' 
DELIMITER ';' 
IGNOREHEADER 1
EXPLICIT_IDS
CREDENTIALS 'aws_iam_role=arn:aws:iam::080945919444:role/mod-27c4c61fae3b42fe-RedshiftClusterRole-1GBP75PRR61RG'
GZIP
;
SELECT COUNT(*) FROM public.zipcode;

### 5.3. `attractions`

In [None]:
%sql SELECT * FROM stl_load_errors ORDER BY query DESC LIMIT 5;

In [None]:
%%sql
COPY public.attractions 
FROM 's3://redshift-managed-loads-datasets-us-east-1/dataset=spatial/size=None/table=attraction_coordinates/BerlinAttractionCoordinates.txt.gz'
DELIMITER '|' 
IGNOREHEADER 1 
CREDENTIALS 'aws_iam_role=arn:aws:iam::080945919444:role/mod-27c4c61fae3b42fe-RedshiftClusterRole-1GBP75PRR61RG'
GZIP
;
SELECT COUNT(*) FROM public.attractions;

## 6. Querying spatial data

After your tables are created and filled with data, you can query them using the same SELECT statements that you use to query other Amazon Redshift tables.

### 6.1. 
Get the number of listings stored in `accommodations` whose spatial reference system is WGS84. This spatial reference system has the unique spatial reference identifier (SRID) 4326.

In [None]:
%%sql
SELECT count(*) 
FROM public.accommodations 
WHERE ST_SRID(shape) = 4326
----------
-- 22,248
;

### 6.2. 
Fetch each geometry object in well-known text (WKT) format together with additional attributes such as the zip code of the area. Also verify that this data is stored in WGS84, which uses the spatial reference ID (SRID) 4326. Spatial data must be stored in the same spatial reference system to be interoperable.

In [None]:
%%sql
SELECT ogc_field
     , spatial_name
     , spatial_type
     , ST_SRID(wkb_geometry)
     , ST_AsText(wkb_geometry) 
FROM public.zipcode 
ORDER BY spatial_name
---------------------------------------------
-- 0 | 10115 | Polygon | 4326 | POLYGON((...))
-- 4 | 10117 | Polygon | 4326 | POLYGON((...))
-- 8 | 10119 | Polygon | 4326 | POLYGON((...))
;

### 6.3.
Select the polygon of Berlin Mitte (zip code 10117) in GeoJSON format, along with its dimension and the number of points in the polygon.

In [None]:
%%sql
SELECT ogc_field
     , spatial_name
     , ST_AsGeoJSON(wkb_geometry)
     , ST_Dimension(wkb_geometry)
     , ST_NumPoints(wkb_geometry)
FROM public.zipcode 
WHERE spatial_name='10117'
----------------------------------------------------------------------
-- 4 | 10117 | {"type":"Polygon", "coordinates":[[[...]]]} | 2 | 331
;
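
To see what the GeoJSON output looks like once it reaches Python, the sketch below parses a GeoJSON polygon and counts its coordinate pairs. It is only a rough analogue of the point count in the query above; the polygon is a made-up unit square, `count_points` is a hypothetical helper, and whether Redshift counts points in exactly this way is an assumption.

```python
import json

# Hypothetical GeoJSON polygon (a closed unit square), not real zip code data.
geojson_text = '{"type":"Polygon","coordinates":[[[0,0],[1,0],[1,1],[0,1],[0,0]]]}'


def count_points(geometry: dict) -> int:
    """Count coordinate pairs across all rings of a GeoJSON Polygon."""
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    return sum(len(ring) for ring in geometry["coordinates"])


polygon = json.loads(geojson_text)
print(count_points(polygon))
```

The square has four corners but five coordinate pairs, because the ring repeats its first point to close the shape.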

### 6.4. 
How many accommodations are near the Brandenburg Gate, within a Euclidean distance of 0.01 degrees? At the latitude of the Brandenburg Gate, 0.01 degrees of longitude is roughly 677 meters; the same distance in latitude is longer in meters. The location used below is the exact position of the Brandenburg Gate.

In [None]:
%%sql
SELECT count(*) 
FROM public.accommodations 
WHERE ST_DWithin(shape, ST_GeomFromText('POINT(13.377704 52.516431)', 4326), 0.01)
-------
-- 137
;
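
The distance of 0.01 in the query above is measured in degrees, not meters. The sketch below converts it to approximate meters, assuming a spherical Earth on which one degree of latitude spans about 111.32 km. `METERS_PER_DEGREE` and `degrees_to_meters` are hypothetical names introduced for illustration.

```python
import math

# Assumption: spherical Earth, one degree of latitude is about 111.32 km.
METERS_PER_DEGREE = 111_320.0


def degrees_to_meters(distance_deg, latitude_deg):
    """Return (east_west, north_south) meters for a distance given in degrees."""
    north_south = distance_deg * METERS_PER_DEGREE
    # Meridians converge toward the poles, so a degree of longitude shrinks
    # by the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


east_west, north_south = degrees_to_meters(0.01, 52.516431)
print(f"east-west: {east_west:.0f} m, north-south: {north_south:.0f} m")
```

Under these assumptions, 0.01 degrees is about 677 m east-west but about 1113 m north-south, so the search area is not a true circle on the ground.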

### 6.5.
Get the rough location of the Brandenburg Gate from accommodations whose names say they are nearby. This requires a subquery, shown below, and returns a slightly different count because the reference position differs from the exact location and lies closer to residential areas.

In [None]:
%%sql
WITH poi(loc) as (
  SELECT st_astext(shape) FROM accommodations WHERE name LIKE '%brandenburg gate%'
)
SELECT count(*) 
FROM accommodations a, poi p
WHERE st_dwithin(a.shape, ST_GeomFromText(p.loc, 4326), 0.01)
-----
-- 240
;

### 6.6.
Find all accommodations around the Brandenburg Gate, ordered by price in descending order.

In [None]:
%%sql
SELECT name, price, ST_AsText(shape) 
FROM public.accommodations
WHERE ST_DWithin(shape, ST_GeomFromText('POINT(13.377704 52.516431)', 4326), 0.01)
ORDER BY price DESC
-----
-- 28 BED ROOM / 8 RO. APARTMENT/ HOSTEL               | 899 | POINT(...)
-- "Luxurious suite directly in the Sony Center Mitte" | 480 | POINT(...)
;

### 6.7. 
Find the most expensive accommodation (the query filters on its price of 9000) and show which zip code it is in.

In [None]:
%%sql
SELECT 
  a.price, a.name, ST_AsText(a.shape), 
  z.spatial_name, ST_AsText(z.wkb_geometry) 
FROM accommodations a, zipcode z 
WHERE price = 9000 AND ST_Within(a.shape, z.wkb_geometry)
--------------------------------------------------------------------------------
-- 9000 | Ueber den Dächern Berlins Zentrum | POINT(..) | 10777 | POLYGON((...))
;
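
The query above asks whether a point lies inside a zip code polygon. The sketch below illustrates the idea with the classic ray casting test. It is a teaching sketch only and not how Redshift implements `ST_Within`; `point_in_polygon` is a hypothetical helper, and the square is an invented area rather than a real zip code.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: toggle 'inside' each time a ray going right crosses an edge."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Invented square area given as (longitude, latitude) corners.
square = [(13.0, 52.0), (14.0, 52.0), (14.0, 53.0), (13.0, 53.0)]
print(point_in_polygon(13.377704, 52.516431, square))  # True
print(point_in_polygon(12.5, 52.5, square))  # False
```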

### 6.8. 
Select all accommodations that are offered at the average price.

In [None]:
%%sql
SELECT a.price, a.name
     , ST_AsText(a.shape)
     , z.spatial_name
     , ST_AsText(z.wkb_geometry) 
FROM accommodations a, zipcode z 
WHERE ST_Within(a.shape, z.wkb_geometry) 
  AND price = (SELECT AVG(price) FROM accommodations)
-----
-- 67 | Apartment 'Falcon' | POINT(...) | 10170 | POLYGON ((...))
;

### 6.9. 
Find the number of accommodations listed in Berlin grouped by zip codes and sort by amount of supply to find the hot spots.

In [None]:
%%sql
SELECT z.spatial_name as zip, count(*) as numAccommodations 
FROM public.accommodations a, public.zipcode z
WHERE ST_Within(a.shape, z.wkb_geometry)
GROUP BY zip 
ORDER BY numAccommodations DESC
-----
-- 10245 | 872
-- 10247 | 832
-- 10437 | 733
;
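
The query above combines a spatial join with a `GROUP BY`. The sketch below mirrors that logic in plain Python on invented data: each listing is assigned to the area that contains it, and the areas are then ranked by listing count. Rectangular boxes stand in for the real zip code polygons, and all names and coordinates are hypothetical.

```python
from collections import Counter

# Hypothetical areas as (min_lon, min_lat, max_lon, max_lat) boxes.
zip_boxes = {
    "zip_a": (13.0, 52.0, 13.5, 52.5),
    "zip_b": (13.5, 52.0, 14.0, 52.5),
}
# Hypothetical listing locations as (lon, lat).
listings = [(13.1, 52.1), (13.2, 52.4), (13.7, 52.3), (15.0, 52.0)]


def zip_for(lon, lat):
    """Return the name of the first box containing the point, or None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in zip_boxes.items():
        if min_lon < lon < max_lon and min_lat < lat < max_lat:
            return name
    return None


# Listings outside every box are dropped, as in an inner join.
assigned = (zip_for(lon, lat) for lon, lat in listings)
counts = Counter(name for name in assigned if name is not None)
print(counts.most_common())
```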

## 7. Running spatial queries on S3 data with Amazon Redshift Spectrum
Redshift Spectrum does not currently offer native support for spatial data types, but you can still use spatial functions and predicates on data stored on S3.

The following section describes how to use a delimited text file stored on S3 and transform its longitude and latitude values on the fly so you can combine it with Redshift spatial processing.

### 7.1. Create `spatial` external schema

In [None]:
%%sql
/* -- Escape autocommit with */END;/* -- */
CREATE EXTERNAL SCHEMA IF NOT EXISTS ant404_spatial 
FROM DATA CATALOG 
DATABASE 'spatial' 
IAM_ROLE 'arn:aws:iam::080945919444:role/mod-27c4c61fae3b42fe-RedshiftClusterRole-1GBP75PRR61RG'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
SELECT * FROM svv_external_schemas WHERE schemaname = 'ant404_spatial';

### 7.2. Create `geoname` external table

In [None]:
%%sql
/* -- Escape autocommit with */END;/* -- */
CREATE EXTERNAL TABLE ant404_spatial.geoname (
       geonameid       INT
     , name            VARCHAR(200)
     , asciiname       VARCHAR(200)
     , alternatenames  VARCHAR(2048)
     , latitude        FLOAT
     , longitude       FLOAT
     , fclass          CHAR(1)
     , fcode           VARCHAR(10)
     , country         VARCHAR(2)
     , cc2             VARCHAR(60)
     , admin1          VARCHAR(20)
     , admin2          VARCHAR(80)
     , admin3          VARCHAR(20)
     , admin4          VARCHAR(20)
     , population      BIGINT
     , elevation       INT
     , gtopo30         INT
     , timezone        VARCHAR(40)
     , moddate         DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE 
LOCATION 's3://redshift-managed-loads-datasets-us-east-1/dataset=spatial/size=None/table=geoname/'
TABLE PROPERTIES ('compression_type'='gzip')
;
SELECT * FROM svv_external_tables WHERE schemaname = 'ant404_spatial' AND tablename = 'geoname';

### 7.3. Transform `FLOAT` coordinates to `GEOMETRY` type.

In [None]:
%%sql
SELECT ST_GeomFromText('POINT('||longitude||' '||latitude||')', 4326) 
FROM ant404_spatial.geoname
LIMIT 25; 
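
The query above builds a WKT point by concatenating longitude and latitude, in that order. The sketch below does the same string construction in Python and adds range checks that catch the common mistake of swapping the two values. `point_wkt` is a hypothetical helper, not part of any Redshift client library.

```python
def point_wkt(longitude, latitude):
    """Format a WKT point; WKT expects x (longitude) before y (latitude)."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    return f"POINT({longitude} {latitude})"


print(point_wkt(13.377704, 52.516431))
```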

### 7.4. Get points of interest around Brandenburger Gate

In [None]:
%%sql
SELECT *
     , ST_GeomFromText('POINT('||longitude||' '||latitude||')', 4326) 
FROM ant404_spatial.geoname
WHERE ST_DWithin(ST_GeomFromText('POINT('||longitude||' '||latitude||')', 4326), 
                 ST_GeomFromText('POINT(13.377704 52.516431)', 4326), 
                 0.01)
LIMIT 25;
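
The filter above keeps rows whose point lies within 0.01 degrees of the Brandenburg Gate. The sketch below reproduces that check in Python as a plain Euclidean distance on longitude and latitude. `dwithin` is a hypothetical helper, the rows are invented, and treating the threshold as inclusive is an assumption.

```python
import math


def dwithin(a, b, distance):
    """True if the Euclidean distance between two (lon, lat) points is at most distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= distance


gate = (13.377704, 52.516431)  # position used in the query above
# Invented rows as (name, longitude, latitude).
rows = [("near", 13.3800, 52.5170), ("far", 13.4500, 52.5200)]

nearby = [name for name, lon, lat in rows if dwithin((lon, lat), gate, 0.01)]
print(nearby)
```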

### Further Info on Redshift Spatial Processing
* Redshift Documentation: [Querying Spatial Data in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/geospatial-overview.html)
* Redshift Documentation: [Spatial Functions](https://docs.aws.amazon.com/redshift/latest/dg/geospatial-functions.html)
* AWS Blog: ["Using Spatial Data with Amazon Redshift"](https://aws.amazon.com/blogs/aws/using-spatial-data-with-amazon-redshift/)