# Create Athena Database Schema

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. 

Athena is based on Presto and supports various standard data formats, including CSV, JSON, and Avro, as well as columnar formats such as Apache Parquet and Apache ORC.

Presto is an open source, distributed SQL query engine built for fast analytic queries against data of any size. It queries data where it is stored, without the need to move the data. Query execution runs in parallel over a memory-based architecture, which makes Presto fast.


<img src="img/athena_setup.png" width="60%" align="left">

In [None]:
import boto3
import sagemaker

# Get region 
session = boto3.session.Session()
region = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

In [None]:
ingest_create_athena_db_passed = False

In [None]:
%store -r s3_public_path_tsv

In [None]:
try:
    s3_public_path_tsv
except NameError:
    print('*****************************************************************************')
    print('[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************')
    print('[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************')
    print('*****************************************************************************')

In [None]:
print(s3_public_path_tsv)

In [None]:
%store -r s3_private_path_tsv

In [None]:
try:
    s3_private_path_tsv
except NameError:
    print('*****************************************************************************')
    print('[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************')
    print('[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************')
    print('*****************************************************************************')

In [None]:
print(s3_private_path_tsv)

# Import PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
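
Because PyAthena follows DB API 2.0, working with it uses the same connect, cursor, execute, fetch sequence as any other PEP 249 driver. The minimal sketch below shows that sequence using the standard library's `sqlite3` module as a stand-in, so it runs without AWS access. The `demo` table and its contents are made up for illustration; only the call pattern is meant to carry over to Athena.

```python
import sqlite3

# sqlite3 stands in for Athena here: both modules follow PEP 249,
# so the connect -> cursor -> execute -> fetch sequence is the same.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical table, used only to have something to query
cursor.execute("CREATE TABLE demo (name TEXT)")
cursor.execute("INSERT INTO demo VALUES ('dsoaws')")
cursor.execute("SELECT name FROM demo")

# cursor.description holds one 7-item tuple per result column; item 0 is the column name
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(column_names, rows)

conn.close()
```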

In [None]:
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.util import as_pandas

# Create Athena Database

In [None]:
# Set Athena database name
database_name = 'dsoaws'

Note: Athena uses a data catalog service to store the metadata for the databases and tables that we create. For example, a table's schema information (the table name, the column names, and the data type of each column) is saved as metadata in the data catalog.

Athena natively supports the AWS Glue Data Catalog service. When we run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the AWS Glue Data Catalog.
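
One way to confirm this from the Glue side is to ask the Glue Data Catalog directly whether the database exists. The helper below is a sketch, not part of the original notebook: `database_in_glue` is a hypothetical name, and it assumes the caller passes in a boto3 Glue client whose credentials allow `glue:GetDatabase`.

```python
def database_in_glue(glue_client, name):
    """Return True if the Glue Data Catalog has a database called `name`."""
    try:
        glue_client.get_database(Name=name)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        # Glue raises this error when no database with that name exists
        return False

# Usage, assuming boto3 is configured as in the cells above:
# glue = boto3.client('glue', region_name=region)
# print(database_in_glue(glue, database_name))
```

Taking the client as a parameter keeps the helper easy to exercise with a stub when no AWS account is available.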

In [None]:
# Set S3 staging directory -- Athena writes query results to this location
s3_staging_dir = 's3://{0}/athena/staging'.format(bucket)

In [None]:
# SQL statement to execute
statement = 'CREATE DATABASE IF NOT EXISTS {}'.format(database_name)
print(statement)
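
The statement above is built with plain string formatting, which is fine for a hard-coded name. If the database name ever comes from user input, it may be worth validating it first. The sketch below is one possible approach: `create_database_statement` is a hypothetical helper, and the allowed-character pattern is a conservative assumption rather than Athena's exact naming rule.

```python
import re

# Conservative pattern (an assumption, possibly stricter than Athena requires):
# lowercase letters, digits, and underscores only.
_SAFE_NAME = re.compile(r'[a-z0-9_]+')

def create_database_statement(name):
    """Build a CREATE DATABASE statement, rejecting names outside the safe pattern."""
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError('Unsafe database name: {!r}'.format(name))
    return 'CREATE DATABASE IF NOT EXISTS {}'.format(name)

print(create_database_statement('dsoaws'))
```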

In [None]:
# Execute statement using connection cursor
cursor = connect(region_name=region, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Verify The Database Has Been Created Successfully

In [None]:
statement = 'SHOW DATABASES'
cursor.execute(statement)

df_show = as_pandas(cursor)
df_show.head(5)

In [None]:
if database_name in df_show.values:
    ingest_create_athena_db_passed = True

In [None]:
%store ingest_create_athena_db_passed

# Store Variables for the Next Notebooks

In [None]:
%store

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();