# Amazon Redshift - Query S3 Data With Redshift Spectrum

TODO: Describe scenario

<img src="img/c3-12.png" width="90%" align="left">


Amazon Redshift Spectrum queries data directly in S3, using the same SQL syntax as Amazon Redshift. You can also run queries that span both the frequently accessed data stored locally in Amazon Redshift and your full datasets stored cost-effectively in S3.

To use Redshift Spectrum, your cluster needs authorization to access the data catalog in Amazon Athena and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role that is attached to your cluster.

To use this capability from your Amazon SageMaker notebook:

* Register your Athena database `dsoaws` with Redshift Spectrum
* Query your data in Amazon S3

In [None]:
import boto3

# Connect to Redshift
redshift = boto3.client('redshift')

# Get region 
session = boto3.session.Session()
region_name = session.region_name


## Set Up the Redshift Connection via SQLAlchemy
https://pypi.org/project/SQLAlchemy/

In [None]:
!pip install -q SQLAlchemy==1.3.13

In [None]:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import pandas as pd

In [None]:
# Redshift configuration parameters
redshift_cluster_identifier = 'dsoaws'

database_name = 'dsoaws'
database_name_athena = 'dsoaws'

master_user_name = 'dsoaws'
master_user_pw = '<password>'

redshift_port = '5439'

schema = 'redshift'
schema_athena = 'athena'

table_name_tsv = 'amazon_reviews_tsv'


In [None]:
# Set Redshift endpoint address & IAM Role
response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)

redshift_endpoint_address = response['Clusters'][0]['Endpoint']['Address']
iam_role = response['Clusters'][0]['IamRoles'][0]['IamRoleArn']

print(redshift_endpoint_address)
print(iam_role)
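
The cell above indexes `['IamRoles'][0]` directly, which fails with an `IndexError` if no role is attached to the cluster. A minimal sketch of a more defensive lookup follows; the helper name `first_iam_role_arn` and the sample response values are hypothetical and only mirror the fields read above.

```python
def first_iam_role_arn(response):
    """Return the ARN of the first IAM role attached to the first cluster
    in a describe_clusters response, or raise a descriptive error."""
    clusters = response.get('Clusters', [])
    if not clusters:
        raise ValueError('No cluster found in the describe_clusters response')
    roles = clusters[0].get('IamRoles', [])
    if not roles:
        raise ValueError('The cluster has no IAM role attached; Redshift Spectrum needs one')
    return roles[0]['IamRoleArn']

# Hypothetical response fragment, shaped like the fields used in the cell above
sample_response = {
    'Clusters': [{
        'Endpoint': {'Address': 'example-host', 'Port': 5439},
        'IamRoles': [{'IamRoleArn': 'arn:aws:iam::123456789012:role/ExampleRole'}],
    }]
}

print(first_iam_role_arn(sample_response))
```

In the notebook this could be called as `iam_role = first_iam_role_arn(response)` with the real `response` from `describe_clusters`.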

In [None]:
# Connect to Redshift database engine
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name))
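
The connection string above inserts the password verbatim, so a password containing characters such as `@` or `/` would produce a malformed URL. One possible workaround, sketched here with only the standard library, is to percent-encode the credentials first. The helper name `build_redshift_url`, the example password, and the host are hypothetical.

```python
from urllib.parse import quote

def build_redshift_url(user, password, host, port, database):
    """Build a postgresql:// URL with percent-encoded credentials."""
    return 'postgresql://{}:{}@{}:{}/{}'.format(
        quote(user, safe=''),      # encode every reserved character
        quote(password, safe=''),
        host, port, database)

# Hypothetical values for illustration only
example_url = build_redshift_url('dsoaws', 'p@ss/word', 'example-host', '5439', 'dsoaws')
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the hand-formatted URL.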


In [None]:
# Configure the SQLAlchemy session factory
# (named Session to avoid shadowing the boto3 `session` created earlier)
Session = sessionmaker()
Session.configure(bind=engine)
s = Session()

## Register Athena Database `dsoaws` with Redshift Spectrum to access the data directly in S3 


In [None]:
statement = """CREATE EXTERNAL SCHEMA IF NOT EXISTS {} FROM DATA CATALOG 
                DATABASE '{}' 
                IAM_ROLE '{}'
                REGION '{}'
                CREATE EXTERNAL DATABASE IF NOT EXISTS""".format(schema_athena, database_name_athena, iam_role, region_name)


print(statement)

In [None]:
s.execute(statement)
s.commit()
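
After the external schema is created, it can be useful to confirm that the Athena tables are visible from Redshift. The sketch below builds a query against the Redshift system view `svv_external_tables`. It assumes the schema name `athena` from the configuration cell, and whether `amazon_reviews_tsv` appears depends on the table having been registered in the Athena database beforehand.

```python
schema_athena = 'athena'  # same value as in the configuration cell

# List the external tables that Redshift Spectrum can see in the schema
statement = """SELECT schemaname, tablename, location
                FROM svv_external_tables
                WHERE schemaname = '{}'""".format(schema_athena)

print(statement)
```

The statement can then be run like the other queries in this notebook, for example with `pd.read_sql_query(statement, engine)`.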

## Run a sample query

In [None]:
statement = """SELECT product_category, COUNT(star_rating) AS count_star_rating
                FROM {}.{}
                GROUP BY product_category
                ORDER BY count_star_rating DESC""".format(schema_athena, table_name_tsv)

print(statement)

In [None]:
df = pd.read_sql_query(statement, engine)
df.head(5)

### TODO: Add query across S3 and Redshift using Spectrum
Query Redshift (last 5 years) and S3 (before that). Reminder: TSV does not have `year`.

In [None]:
statement = """SELECT COUNT(a.review_id) AS athena_count,
                        COUNT(r.review_id) AS redshift_count
                FROM {0}.{2} AS a
                JOIN {1}.{2} AS r
                    ON a.review_id = r.review_id""".format(schema_athena, schema, table_name_tsv)

print(statement)
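
Because the query above is an inner join on `review_id`, both counts are taken over the same matched rows and will be equal. To compare the sizes of the two sources independently, one option is a `UNION ALL` of two separate counts. This is only a sketch: it assumes that both `athena.amazon_reviews_tsv` and `redshift.amazon_reviews_tsv` exist with a `review_id` column, and it does not implement the year-based split described in the TODO.

```python
schema = 'redshift'
schema_athena = 'athena'
table_name_tsv = 'amazon_reviews_tsv'

# One row per source, each with its own independent count
statement = """SELECT '{0}' AS source, COUNT(review_id) AS review_count
                FROM {0}.{2}
                UNION ALL
                SELECT '{1}' AS source, COUNT(review_id) AS review_count
                FROM {1}.{2}""".format(schema_athena, schema, table_name_tsv)

print(statement)
```

As with the earlier cells, the result could be loaded with `pd.read_sql_query(statement, engine)`.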