[Amazon Athena](https://aws.amazon.com/athena/)

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is based on [Presto](https://prestodb.io/) and supports a range of standard data formats, including CSV, JSON, and Avro, as well as columnar formats such as [Apache Parquet](https://parquet.apache.org/documentation/latest/) and [Apache ORC](https://orc.apache.org/docs/).

Presto is an open-source, distributed SQL query engine designed for fast analytic queries against data of any size. It queries data where it is stored, without moving it first. Queries run in parallel across worker nodes and are processed in memory rather than writing intermediate results to disk, which keeps query latency low.

Make sure the "AmazonAthenaFullAccess" managed policy is attached to your IAM role, and that Athena is set up in your AWS Region (us-east-1, us-east-2, etc.).

In [22]:
import boto3
import sagemaker

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set Athena database name
database_name = 'coeaws'

### Install PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
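To make "DB API 2.0 compliant" concrete, here is a minimal sketch of the connect / cursor / execute / fetch pattern that PEP 249 defines. It uses the standard library's `sqlite3` module as a stand-in for Athena so it runs without AWS credentials; the `run_query` helper and the `demo` table are hypothetical names for illustration only.

```python
import sqlite3


def run_query(connection, sql):
    """Run one statement on a PEP 249 connection and return (columns, rows)."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        # cursor.description is None for statements that return no rows
        if cursor.description is None:
            return [], []
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        cursor.close()


# sqlite3 stands in for Athena here (assumption: any PEP 249 driver behaves alike)
conn = sqlite3.connect(":memory:")
run_query(conn, "CREATE TABLE demo (id INTEGER, name TEXT)")
run_query(conn, "INSERT INTO demo VALUES (1, 'athena'), (2, 'presto')")
columns, rows = run_query(conn, "SELECT id, name FROM demo ORDER BY id")
print(columns, rows)
```

The same helper should accept a connection returned by `pyathena.connect(...)`, since it relies only on the cursor methods that PEP 249 specifies.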

In [23]:
# Install PyAthena
!pip install -q --upgrade pip
!pip install -q PyAthena==1.8.0

In [24]:
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.util import as_pandas

### Create Athena Database

Note: Athena stores the metadata for the databases and tables we create in a data catalog service. For example, a table's schema (the column names and the data type of each column), together with the table name, is saved as metadata in the data catalog.

Athena natively supports the [AWS Glue Data Catalog service](https://aws.amazon.com/glue/). When we run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the [AWS Glue Data Catalog](https://aws.amazon.com/glue/features/).
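One way to confirm that a database created through Athena has landed in the Glue Data Catalog is to list the catalog's databases directly. The sketch below is an assumption-laden illustration: `list_glue_databases` is a hypothetical helper, and it expects a client object shaped like the one returned by `boto3.client("glue")`, whose `get_databases` call returns a `DatabaseList` and an optional `NextToken`.

```python
def list_glue_databases(glue_client):
    """Return the names of all databases visible in the Glue Data Catalog."""
    names = []
    kwargs = {}
    while True:
        response = glue_client.get_databases(**kwargs)
        names.extend(db["Name"] for db in response.get("DatabaseList", []))
        # Keep paging until the service stops returning a continuation token
        token = response.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


# Usage in the notebook (requires AWS credentials and Glue read permissions):
#   import boto3
#   glue = boto3.client("glue", region_name=region_name)
#   print(list_glue_databases(glue))
```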

In [25]:
# Set S3 staging directory -- this is a temporary directory used for Athena queries
s3_staging_dir = 's3://{0}/athena/staging'.format(bucket)
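If the staging location is built in more than one place, a small helper can keep the format consistent and catch an empty bucket name early. This is only a sketch: `build_staging_dir` and its default prefix are hypothetical, chosen to reproduce the path used above.

```python
def build_staging_dir(bucket, prefix="athena/staging"):
    """Build the S3 URI where Athena writes query results."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name, got {!r}".format(bucket))
    # Strip stray slashes so the URI never contains '//' after the bucket
    return "s3://{}/{}".format(bucket, prefix.strip("/"))


print(build_staging_dir("example-bucket"))
```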

In [26]:
# SQL statement to execute
statement = 'CREATE DATABASE IF NOT EXISTS {}'.format(database_name)
print(statement)

CREATE DATABASE IF NOT EXISTS coeaws
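Because the database name is formatted straight into the SQL string, it can help to validate it first. The sketch below assumes a conservative naming rule (lowercase letters, digits, and underscores only); `create_database_statement` is a hypothetical helper, not part of PyAthena.

```python
import re


def create_database_statement(name):
    """Return a CREATE DATABASE statement after checking the name is safe."""
    # Assumed rule: lowercase letters, digits, and underscores only
    if not re.fullmatch(r"[a-z0-9_]+", name):
        raise ValueError("unsupported database name: {!r}".format(name))
    return "CREATE DATABASE IF NOT EXISTS {}".format(name)


print(create_database_statement("coeaws"))
```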


In [27]:
# Execute statement using connection cursor
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7f80d2d75630>

If the cell above fails with a permissions error, you may need to add a policy that allows the `StartQueryExecution` action, such as the one below.
Also check that "AmazonAthenaFullAccess" is attached to your IAM role and that Athena is set up in your AWS Region (us-east-1, us-east-2, etc.).

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "jsr4",
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```
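The same kind of policy can be generated from Python, which makes it easier to review or extend. Athena API actions use the `athena:` prefix. The sketch below assumes that a notebook running queries also needs `athena:GetQueryExecution` and `athena:GetQueryResults` to poll for and read results; the `athena_query_policy` helper and its default `Sid` are hypothetical, and S3 and Glue permissions are not covered here.

```python
import json


def athena_query_policy(sid="AthenaRunQueries"):
    """Build an IAM policy document that allows running Athena queries."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    # Assumed extras: needed to poll status and fetch results
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": ["*"],
            }
        ],
    }


policy_json = json.dumps(athena_query_policy(), indent=4)
print(policy_json)
```

The printed JSON can be pasted into the IAM console's policy editor. Narrowing `Resource` to a specific workgroup ARN would be a tighter choice than `"*"`.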

In [28]:
statement = 'SHOW DATABASES'
cursor.execute(statement)

df_show = as_pandas(cursor)
df_show.head(5)

|   | database_name |
|---|---------------|
| 0 | coeaws        |
| 1 | default       |
