# ADS-508-01-SP23 Team 8: Final Project

Much of the code is adapted from Fregly, C., & Barth, A. (2021). *Data science on AWS: Implementing end-to-end, continuous AI and machine learning pipelines*. O’Reilly.

# Import PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
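Because PyAthena follows PEP 249, working with it looks like working with any other DB API 2.0 driver: open a connection, get a cursor, execute a statement, and fetch rows. The sketch below illustrates that pattern with the standard-library `sqlite3` module as a stand-in, so it runs without AWS credentials. The helper name `run_query` and the demo table contents are assumptions made for illustration; with PyAthena the connection would come from `pyathena.connect(...)` instead.

```python
import sqlite3


def run_query(conn, sql):
    """Run one statement through the PEP 249 cursor interface.

    Returns (column_names, rows). Both are empty for statements that
    produce no result set, such as DDL.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # cursor.description is None when the statement returns no rows
        if cursor.description is None:
            return [], []
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        cursor.close()


# Stand-in connection (assumption): an in-memory SQLite database, not Athena
demo_conn = sqlite3.connect(":memory:")
run_query(demo_conn, "CREATE TABLE grad_outcomes (dbn TEXT, cohort TEXT)")
run_query(demo_conn, "INSERT INTO grad_outcomes VALUES ('01M448', '2001')")
columns, rows = run_query(demo_conn, "SELECT dbn, cohort FROM grad_outcomes")
print(columns, rows)
```

The rest of this notebook uses `pd.read_sql`, which wraps the same connection and cursor calls and returns a DataFrame.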

In [1]:
!pip install --disable-pip-version-check -q PyAthena==2.1.0


In [2]:
import boto3
from botocore.exceptions import ClientError
import sagemaker
import pandas as pd
from pyathena import connect

In [3]:
session = boto3.session.Session()
region = session.region_name
sagemaker_session = sagemaker.Session()
def_bucket = sagemaker_session.default_bucket()
bucket = 'sagemaker-us-east-ads508-sp23-t8'

s3 = boto3.Session().client(service_name="s3", region_name=region)

In [4]:
setup_s3_bucket_passed = False
ingest_create_athena_db_passed = False
ingest_create_athena_table_tsv_passed = False

In [5]:
print(f"Default bucket: {def_bucket}")
print(f"Public T8 bucket: {bucket}")

Default bucket: sagemaker-us-east-1-657724983756
Public T8 bucket: sagemaker-us-east-ads508-sp23-t8


# Verify S3 Bucket Creation

In [6]:
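%%bash

# The Python variable `bucket` is not visible inside a %%bash cell, so ${bucket}
# would expand to an empty string. List all buckets and look for the team bucket by name.
aws s3 ls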

2023-03-16 17:05:02 aws-athena-query-results-657724983756-us-east-1
2023-03-02 16:56:48 sagemaker-studio-657724983756-5nh7ydsouq7
2023-03-02 17:25:41 sagemaker-studio-657724983756-7yc8bp8xk0b
2023-03-02 17:01:51 sagemaker-us-east-1-657724983756
2023-03-17 05:19:31 sagemaker-us-east-ads508-sp23-t8


In [7]:
response = None

try:
    response = s3.head_bucket(Bucket=bucket)
    print(response)
    setup_s3_bucket_passed = True
except ClientError as e:
    # `response` is still None when head_bucket raises, so report the region instead
    print(f"[ERROR] Cannot find bucket {bucket} in {region} due to {e}.")

{'ResponseMetadata': {'RequestId': 'Z03JAW8B657MYYZ8', 'HostId': '3AYSWSLAruekdXWawFp9Hc2MNIvv+4uNRmxByTGx+cXa7blTh8Dd3cdh/Z2YXNT9XVnFfIQBZDQ=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '3AYSWSLAruekdXWawFp9Hc2MNIvv+4uNRmxByTGx+cXa7blTh8Dd3cdh/Z2YXNT9XVnFfIQBZDQ=', 'x-amz-request-id': 'Z03JAW8B657MYYZ8', 'date': 'Sat, 18 Mar 2023 18:50:31 GMT', 'x-amz-bucket-region': 'us-east-1', 'x-amz-access-point-alias': 'false', 'content-type': 'application/xml', 'server': 'AmazonS3'}, 'RetryAttempts': 0}}
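When `head_bucket` fails, the error code carried by the exception usually says whether the bucket is missing or merely inaccessible. The helper below is a minimal sketch of that distinction. The function name `explain_head_bucket_error` is hypothetical, and it assumes the exception exposes a botocore-style `response` dictionary in which `head_bucket` failures carry the HTTP status (`"404"` or `"403"`) as the error code.

```python
def explain_head_bucket_error(error, bucket_name):
    """Turn a botocore-style ClientError from head_bucket into a readable hint."""
    # ClientError stores the parsed service response on `.response`
    response = getattr(error, "response", None) or {}
    code = str(response.get("Error", {}).get("Code", ""))
    if code == "404":
        return f"Bucket {bucket_name} does not exist."
    if code == "403":
        return f"Access to bucket {bucket_name} is forbidden for these credentials."
    return f"Unexpected error for bucket {bucket_name}: {error}"
```

Inside the `except ClientError as e:` branch, `print(explain_head_bucket_error(e, bucket))` would replace the generic message.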


In [8]:
%store setup_s3_bucket_passed

Stored 'setup_s3_bucket_passed' (bool)


# Create Athena Database

In [9]:
database_name = "ads508_t8"

Note: The databases and tables that we create in Athena use a data catalog service to store metadata about the data. For example, the schema (each table's name plus the name and data type of every column) is saved as metadata in the data catalog.

Athena natively supports the AWS Glue Data Catalog service. When we run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the AWS Glue Data Catalog.
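One way to confirm that the metadata really lands in the Glue Data Catalog is to read it back through the Glue API. The sketch below pages through `get_tables` for one database and collects each table's column names and types. The function name `list_catalog_tables` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS access; in the notebook it would be a `boto3.client("glue")` instance whose role has Glue read permissions.

```python
def list_catalog_tables(glue_client, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one Glue database."""
    tables = {}
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type")) for c in columns]
        # Glue returns NextToken only when more pages remain
        token = page.get("NextToken")
        if not token:
            return tables
        kwargs["NextToken"] = token


# Usage in the notebook (assumes AWS credentials with Glue read access):
# import boto3
# glue = boto3.client("glue", region_name=region)
# print(list_catalog_tables(glue, database_name))
```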

In [10]:
# Set S3 staging directory -- this is a temporary directory used for Athena queries
s3_staging_dir = f"s3://{bucket}/athena/staging"
s3_raw_data_path = f"s3://{bucket}/raw_data/grad_outcomes"
print(s3_raw_data_path)
s3_db_tbls_dir = f"s3://{bucket}/athena/db_tbls"

s3://sagemaker-us-east-ads508-sp23-t8/raw_data/grad_outcomes


In [11]:
conn = connect(region_name=region,
               s3_staging_dir=s3_staging_dir)

In [12]:
create_db_stmnt = f"CREATE DATABASE IF NOT EXISTS {database_name}"
print(create_db_stmnt)

CREATE DATABASE IF NOT EXISTS ads508_t8


In [13]:
pd.read_sql(create_db_stmnt,
            conn)

# Verify The Database Has Been Created Successfully

In [14]:
show_db_stmnt = "SHOW DATABASES"

df_show = pd.read_sql(show_db_stmnt,
                      conn)
df_show.head(5)

Unnamed: 0,database_name
0,ads508_t8
1,default
2,dsoaws


In [15]:
if database_name in df_show.values:
    ingest_create_athena_db_passed = True

In [16]:
%store ingest_create_athena_db_passed

Stored 'ingest_create_athena_db_passed' (bool)


# Define a Function to Create Tables in an Existing Database

In [18]:
def create_athena_tbl_tsv(conn=None,
                          db=None,
                          tbl_name=None,
                          fields='',
                          s3_path=None,
                          comp='',
                          skip=''):
    # Update the notebook-level flag; without `global` the assignment below would be local
    global ingest_create_athena_table_tsv_passed

    # Join only the non-empty table properties so the clause stays valid SQL
    tbl_props = ', '.join(p for p in (comp, skip) if p)
    tbl_props_clause = f"TBLPROPERTIES ({tbl_props})" if tbl_props else ''

    # SQL statements to execute
    drop_tbl_stmnt = f"""DROP TABLE IF EXISTS {db}.{tbl_name}"""

    create_tbl_stmnt = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{tbl_name}({fields})
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' LOCATION '{s3_path}'
    {tbl_props_clause}"""

    print(f'Create table statement:\n{create_tbl_stmnt}')

    pd.read_sql(drop_tbl_stmnt,
                conn)

    pd.read_sql(create_tbl_stmnt,
                conn)

    # Verify the table has been created successfully
    show_tbl_stmnt = f"SHOW TABLES in {db}"

    df_show = pd.read_sql(show_tbl_stmnt,
                          conn)
    print(df_show.head(5))

    # Always assign the flag so the print below cannot hit an unbound name
    ingest_create_athena_table_tsv_passed = tbl_name in df_show.values

    print(f'\nTable {tbl_name} found in {db}: {ingest_create_athena_table_tsv_passed}')
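The function above builds its DDL by interpolating the database and table names straight into the SQL text, so a malformed name only surfaces as an Athena error after the `DROP TABLE` has already run. A small guard can reject bad names first. The sketch below is one possible check, not part of the original notebook: the helper names are hypothetical, and the pattern (lowercase letters, digits, and underscores, not starting with a digit) is a deliberately conservative assumption rather than Athena's full naming rule.

```python
import re

# Conservative assumption: lowercase letters, digits, underscores; no leading digit
_IDENTIFIER_PATTERN = re.compile(r"[a-z_][a-z0-9_]*")


def check_identifier(name):
    """Return `name` unchanged if it looks like a safe SQL identifier, else raise."""
    if not isinstance(name, str) or not _IDENTIFIER_PATTERN.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name


def qualified_table(db, tbl_name):
    """Build `db.table` from two validated identifiers."""
    return f"{check_identifier(db)}.{check_identifier(tbl_name)}"


print(qualified_table("ads508_t8", "grad_outcomes"))
```

Calling `qualified_table(db, tbl_name)` at the top of `create_athena_tbl_tsv` would fail fast on names containing spaces, quotes, or semicolons.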

# Create Athena Table from Local TSV File - `grad_outcomes`

#### Dataset columns

- `demographic`
- `dbn`
- `school_name`
- `cohort`
- `total_cohort`
- `total_grads_n`
- `total_grads_perc_cohort`
- `total_regents_n`
- `total_regents_perc_cohort`
- `total_regents_perc_grads`
- `advanced_regents_n`
- `advanced_regents_perc_cohort`
- `advanced_regents_perc_grads`
- `regents_wo_advanced_n`
- `regents_wo_advanced_perc_cohort`
- `regents_wo_advanced_perc_grads`
- `local_n`
- `local_perc_cohort`
- `local_perc_grads`
- `still_enrolled_n`
- `still_enrolled_perc_cohort`
- `dropped_out_n`
- `dropped_out_perc_cohort`


In [19]:
table_name_grad_outcomes = 'grad_outcomes'
field_list_grad_outcomes = """
demographic string,
dbn string,
school_name string,
cohort string,
total_cohort string,
total_grads_n string,
total_grads_perc_cohort string,
total_regents_n string,
total_regents_perc_cohort string,
total_regents_perc_grads string,
advanced_regents_n string,
advanced_regents_perc_cohort string,
advanced_regents_perc_grads string,
regents_wo_advanced_n string,
regents_wo_advanced_perc_cohort string,
regents_wo_advanced_perc_grads string,
local_n string,
local_perc_cohort string,
local_perc_grads string,
still_enrolled_n string,
still_enrolled_perc_cohort string,
dropped_out_n string,
dropped_out_perc_cohort string
"""

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=table_name_grad_outcomes,
                      fields=field_list_grad_outcomes,
                      s3_path=s3_raw_data_path,
                      comp='',
                      skip="'skip.header.line.count'='1'")

Create table statement:

    CREATE EXTERNAL TABLE IF NOT EXISTS ads508_t8.grad_outcomes(demographic string,
    dbn string,
    school_name string,
    cohort string,
    total_cohort string,
    total_grads_n string,
    total_grads_perc_cohort string,
    total_regents_n string,
    total_regents_perc_cohort string,
    total_regents_perc_grads string,
    advanced_regents_n string,
    advanced_regents_perc_cohort string,
    advanced_regents_perc_grads string,
    regents_wo_advanced_n string,
    regents_wo_advanced_perc_cohort string,
    regents_wo_advanced_perc_grads string,
    local_n string,
    local_perc_cohort string,
    local_perc_grads string,
    still_enrolled_n string,
    still_enrolled_perc_cohort string,
    dropped_out_n string,
    dropped_out_perc_cohort string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION 's3://sagemaker-us-east-ads508-sp23-t8/raw_data/grad_outcomes'
    TBLPROPERTIES ('skip.header.line.count'='1')
  

# Run A Sample Query

In [20]:
# DBN (school identifier) to filter on; the variable name now matches the value
dbn_01M448 = "01M448"

statement = f"""SELECT * FROM {database_name}.{table_name_grad_outcomes}
WHERE dbn = '{dbn_01M448}' LIMIT 20"""

print(statement)

SELECT * FROM ads508_t8.grad_outcomes
WHERE dbn = '01M448' LIMIT 20
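The statement above places the DBN value directly inside the SQL string. PyAthena cursors also accept pyformat-style parameters (`%(name)s`), which leaves quoting of the value to the driver. The sketch below shows that variant. The function name `fetch_rows_for_dbn` is hypothetical, and the cursor is passed in so the function can be exercised with any object that offers `execute` and `fetchall`; in the notebook it would be `conn.cursor()`.

```python
def fetch_rows_for_dbn(cursor, qualified_table, dbn, limit=20):
    """Fetch up to `limit` rows for one school, binding the DBN as a parameter."""
    # The table name cannot be a bound parameter, so it is still interpolated;
    # int() keeps the LIMIT value numeric.
    sql = f"SELECT * FROM {qualified_table} WHERE dbn = %(dbn)s LIMIT {int(limit)}"
    cursor.execute(sql, {"dbn": dbn})
    return cursor.fetchall()


# Usage in the notebook (assumes the Athena connection `conn` created earlier):
# rows = fetch_rows_for_dbn(conn.cursor(), "ads508_t8.grad_outcomes", "01M448")
```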


In [21]:
df = pd.read_sql(statement,
                 conn)
df.head(5)

Unnamed: 0,demographic,dbn,school_name,cohort,total_cohort,total_grads_n,total_grads_perc_cohort,total_regents_n,total_regents_perc_cohort,total_regents_perc_grads,...,regents_wo_advanced_n,regents_wo_advanced_perc_cohort,regents_wo_advanced_perc_grads,local_n,local_perc_cohort,local_perc_grads,still_enrolled_n,still_enrolled_perc_cohort,dropped_out_n,dropped_out_perc_cohort
0,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2001,64,46,71.9,32,50.0,69.6,...,25,39.1,54.3,14,21.9,30.4,10,15.6,6,9.4
1,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2002,52,33,63.5,19,36.5,57.6,...,11,21.2,33.3,14,26.9,42.4,16,30.8,1,1.9
2,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2003,87,67,77.0,39,44.8,58.2,...,28,32.2,41.8,28,32.2,41.8,9,10.3,11,12.6
3,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2004,112,75,67.0,36,32.1,48.0,...,30,26.8,40.0,39,34.8,52.0,33,29.5,4,3.6
4,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2005,121,64,52.9,35,28.9,54.7,...,31,25.6,48.4,29,24.0,45.3,41,33.9,11,9.1


In [22]:
if not df.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]


# Review the New Athena Table in the Glue Catalog

In [23]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="top" href="https://console.aws.amazon.com/glue/home?region={}#">AWS Glue Catalog</a></b>'.format(
            region
        )
    )
)

# Store Variables for the Next Notebooks

In [24]:
%store

Stored variables and their in-db values:
ingest_create_athena_db_passed                    -> True
ingest_create_athena_table_tsv_passed             -> True
s3_private_path_tsv                               -> 's3://sagemaker-us-east-1-657724983756/amazon-revi
s3_public_path_tsv                                -> 's3://amazon-reviews-pds/tsv'
setup_dependencies_passed                         -> True
setup_iam_roles_passed                            -> True
setup_instance_check_passed                       -> True
setup_s3_bucket_passed                            -> True


# Release Resources

In [25]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [26]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}

<IPython.core.display.Javascript object>