# ADS-508-01-SP23 Team 8: Final Project

Much of the code is adapted from Fregly, C., & Barth, A. (2021). *Data science on AWS: Implementing end-to-end, continuous AI and machine learning pipelines*. O’Reilly.

## Import PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
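Because PyAthena follows PEP 249, it exposes the same `connect` / `cursor` / `execute` / `fetchall` flow as other compliant drivers. The sketch below illustrates that flow with the standard library's `sqlite3` module as a stand-in, an assumption made so that the example runs without AWS credentials. The `fetch_rows` helper, the `demo_schools` table, and its single row are hypothetical and not part of this project. Parameter placeholder styles differ between drivers, so the helper is only meant to show the shape of the calls.

```python
import sqlite3


def fetch_rows(conn, sql, params=()):
    """Run a query through the PEP 249 cursor interface and return all rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql, params)
        return cursor.fetchall()
    finally:
        cursor.close()


# Stand-in connection: sqlite3 is used here instead of pyathena.connect(...)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_schools (dbn TEXT, school_name TEXT)")
demo_conn.execute("INSERT INTO demo_schools VALUES ('X1', 'Example High School')")

rows = fetch_rows(demo_conn, "SELECT dbn, school_name FROM demo_schools")
print(rows)
```

In this notebook the same pattern is hidden behind `pd.read_sql`, which accepts a PEP 249 connection and drives the cursor for us.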

In [2]:
!pip install --disable-pip-version-check -q PyAthena==2.1.0

In [3]:
import boto3
from botocore.exceptions import ClientError
import sagemaker
import pandas as pd
from pyathena import connect
from IPython.display import display, HTML

In [4]:
session = boto3.session.Session()
region = session.region_name
sagemaker_session = sagemaker.Session()
def_bucket = sagemaker_session.default_bucket()
bucket = 'sagemaker-us-east-ads508-sp23-t8'

s3 = session.client(service_name="s3", region_name=region)

In [5]:
setup_s3_bucket_passed = False
ingest_create_athena_db_passed = False
ingest_create_athena_table_tsv_passed = False

In [6]:
print(f"Default bucket: {def_bucket}")
print(f"Public T8 bucket: {bucket}")

Default bucket: sagemaker-us-east-1-657724983756
Public T8 bucket: sagemaker-us-east-ads508-sp23-t8


## Verify S3 Bucket Creation

In [7]:
%%bash

aws s3 ls s3://${bucket}/

2023-03-16 17:05:02 aws-athena-query-results-657724983756-us-east-1
2023-03-02 16:56:48 sagemaker-studio-657724983756-5nh7ydsouq7
2023-03-02 17:25:41 sagemaker-studio-657724983756-7yc8bp8xk0b
2023-03-02 17:01:51 sagemaker-us-east-1-657724983756
2023-03-17 05:19:31 sagemaker-us-east-ads508-sp23-t8


In [8]:
response = None

try:
    response = s3.head_bucket(Bucket=bucket)
    print(response)
    setup_s3_bucket_passed = True
except ClientError as e:
    # `response` is still None when head_bucket raises, so report only the bucket and the error
    print(f"[ERROR] Cannot find bucket {bucket} due to {e}.")

{'ResponseMetadata': {'RequestId': 'V061TB7SR6YNC3P6', 'HostId': 'BQAAxPQqWy4KqwduMMojY5BCXMH1z9r0GJNlcgMDvr3sThvUFH/b/DdIvorSLifuY/xj/xCQ3qk=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'BQAAxPQqWy4KqwduMMojY5BCXMH1z9r0GJNlcgMDvr3sThvUFH/b/DdIvorSLifuY/xj/xCQ3qk=', 'x-amz-request-id': 'V061TB7SR6YNC3P6', 'date': 'Sat, 18 Mar 2023 21:04:30 GMT', 'x-amz-bucket-region': 'us-east-1', 'x-amz-access-point-alias': 'false', 'content-type': 'application/xml', 'server': 'AmazonS3'}, 'RetryAttempts': 0}}
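When `head_bucket` fails, the `ClientError` carries an error payload whose code helps tell a missing bucket from a permissions problem. The following is a minimal sketch of how that payload could be interpreted; `explain_head_bucket_error` is a hypothetical helper, and the sample payload is made up for illustration rather than captured from this account.

```python
def explain_head_bucket_error(error_response):
    """Map the error payload of a failed head_bucket call to a short message."""
    code = str(error_response.get("Error", {}).get("Code", ""))
    if code in ("404", "NoSuchBucket"):
        return "bucket does not exist"
    if code in ("403", "AccessDenied"):
        return "access to the bucket is denied"
    return f"unexpected error code {code!r}"


# Inside the notebook's except block this could be used as:
#     except ClientError as e:
#         print(explain_head_bucket_error(e.response))
sample_payload = {"Error": {"Code": "404", "Message": "Not Found"}}
print(explain_head_bucket_error(sample_payload))
```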


In [9]:
%store setup_s3_bucket_passed

Stored 'setup_s3_bucket_passed' (bool)


## Create Athena Database

In [10]:
database_name = "ads508_t8"

Note: The databases and tables that we create in Athena use a data catalog service to store their metadata. For example, a table's name and its schema (the column names and the data type of each column) are saved as metadata in the data catalog.

Athena natively supports the AWS Glue Data Catalog service. When we run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the AWS Glue Data Catalog.
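One way to confirm that these metadata entries exist is to read them back from the Glue Data Catalog. The sketch below assumes a boto3 Glue client is passed in; `list_catalog_tables` is a hypothetical helper that is not part of the original notebook. The client is injected instead of created inside the function so that the logic can be exercised without AWS access.

```python
def list_catalog_tables(glue_client, database):
    """Return {table_name: [column names]} for one Glue Data Catalog database."""
    tables = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [col["Name"] for col in columns]
        # Follow pagination until the service stops returning a token
        token = page.get("NextToken")
        if not token:
            return tables
        kwargs["NextToken"] = token


# Possible usage in this notebook (requires AWS credentials):
#     glue = boto3.client("glue", region_name=region)
#     list_catalog_tables(glue, database_name)
```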

In [11]:
# Set S3 staging directory -- this is a temporary directory used for Athena queries
s3_staging_dir = f"s3://{bucket}/athena/staging"
print(s3_staging_dir)

s3://sagemaker-us-east-ads508-sp23-t8/athena/staging


In [12]:
conn = connect(region_name=region,
               s3_staging_dir=s3_staging_dir)

In [13]:
create_db_stmnt = f"CREATE DATABASE IF NOT EXISTS {database_name}"
print(create_db_stmnt)

CREATE DATABASE IF NOT EXISTS ads508_t8


In [14]:
pd.read_sql(create_db_stmnt,
            conn)

### Verify The Database Has Been Created Successfully

In [15]:
show_db_stmnt = "SHOW DATABASES"

df_show = pd.read_sql(show_db_stmnt,
                      conn)
df_show.head(5)

Unnamed: 0,database_name
0,ads508_t8
1,default
2,dsoaws


In [16]:
if database_name in df_show.values:
    ingest_create_athena_db_passed = True

In [17]:
%store ingest_create_athena_db_passed

Stored 'ingest_create_athena_db_passed' (bool)


## Define Custom Function to Create Tables in Existing Database

In [18]:
def create_athena_tbl_tsv(conn=None,
                          db=None,
                          tbl_name=None,
                          fields='',
                          s3_path=None,
                          delim=',',
                          ret='',
                          comp='',
                          skip=''):
    # Update the notebook-level flag instead of creating a function-local copy
    global ingest_create_athena_table_tsv_passed

    # Join the non-empty table properties with a comma so the clause stays
    # valid when both `comp` and `skip` are supplied
    tbl_props = ', '.join(prop for prop in (comp, skip) if prop)

    # SQL statements to execute
    drop_tbl_stmnt = f"""DROP TABLE IF EXISTS {db}.{tbl_name}"""

    create_tbl_stmnt = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{tbl_name}({fields})
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delim}' LINES TERMINATED BY '{ret}\\n' LOCATION '{s3_path}'
    TBLPROPERTIES ({tbl_props})"""

    print(f'Create table statement:\n{create_tbl_stmnt}')

    # Drop any existing table so that schema changes take effect, then recreate it
    pd.read_sql(drop_tbl_stmnt,
                conn)

    pd.read_sql(create_tbl_stmnt,
                conn)
    
    # Verify that the table has been created successfully
    show_tbl_stmnt = f"SHOW TABLES in {db}"

    df_show = pd.read_sql(show_tbl_stmnt,
                          conn)
    display(df_show.head(5))

    # True only if the new table is listed in the database
    ingest_create_athena_table_tsv_passed = bool(tbl_name in df_show.values)

    print(f'\nDataframe contains records: {ingest_create_athena_table_tsv_passed}')
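Keeping SQL construction separate from execution makes the statement easy to inspect and test without an Athena connection. The sketch below is one possible way to do that, not the function used in this notebook; `build_create_table_sql` and the `example-bucket` path are hypothetical names introduced for illustration.

```python
def build_create_table_sql(db, tbl_name, columns, s3_path, delim=",", props=None):
    """Build a CREATE EXTERNAL TABLE statement for a delimited text file."""
    if not columns:
        raise ValueError("at least one column is required")
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    stmt = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{tbl_name} (\n  {column_sql}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delim}' "
        f"LINES TERMINATED BY '\\n'\n"
        f"LOCATION '{s3_path}'"
    )
    if props:
        # Properties are comma-separated, so any number of them stays valid
        prop_sql = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
        stmt += f"\nTBLPROPERTIES ({prop_sql})"
    return stmt


sql = build_create_table_sql(
    db="ads508_t8",
    tbl_name="grad_outcomes",
    columns=[("dbn", "string"), ("school_name", "string")],
    s3_path="s3://example-bucket/raw_data/grad_outcomes",  # hypothetical bucket
    delim="\\t",
    props={"skip.header.line.count": "1"},
)
print(sql)
```

Passing the table properties as a dictionary avoids having to hand-assemble the comma-separated `TBLPROPERTIES` clause at each call site.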

## Create Athena Table from Local TSV File - `2005-2010_Graduation_Outcomes_-_School_Level.tsv`

### Dataset columns

- `demographic`: ,
- `dbn`: ,
- `school_name`: ,
- `cohort`: ,
- `total_cohort`: ,
- `total_grads_n`: ,
- `total_grads_perc_cohort`: ,
- `total_regents_n`: ,
- `total_regents_perc_cohort`: ,
- `total_regents_perc_grads`: ,
- `advanced_regents_n`: ,
- `advanced_regents_perc_cohort`: ,
- `advanced_regents_perc_grads`: ,
- `regents_wo_advanced_n`: ,
- `regents_wo_advanced_perc_cohort`: ,
- `regents_wo_advanced_perc_grads`: ,
- `local_n`: ,
- `local_perc_cohort`: ,
- `local_perc_grads`: ,
- `still_enrolled_n`: ,
- `still_enrolled_perc_cohort`: ,
- `dropped_out_n`: ,
- `dropped_out_perc_cohort`: 

In [19]:
grd_tbl_name = 'grad_outcomes'
grd_field_list = """
mdemographic string,
dbn string,
school_name string,
cohort string,
total_cohort string,
total_grads_n string,
total_grads_perc_cohort string,
total_regents_n string,
total_regents_perc_cohort string,
total_regents_perc_grads string,
advanced_regents_n string,
advanced_regents_perc_cohort string,
advanced_regents_perc_grads string,
regents_wo_advanced_n string,
regents_wo_advanced_perc_cohort string,
regents_wo_advanced_perc_grads string,
local_n string,
local_perc_cohort string,
local_perc_grads string,
still_enrolled_n string,
still_enrolled_perc_cohort string,
dropped_out_n string,
dropped_out_perc_cohort string
"""
grd_s3_raw_data_path = f"s3://{bucket}/raw_data/grad_outcomes"
print(grd_s3_raw_data_path)

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=grd_tbl_name,
                      fields=grd_field_list,
                      s3_path=grd_s3_raw_data_path,
                      delim='\\t',
                      comp='',
                      skip="'skip.header.line.count'='1'")

s3://sagemaker-us-east-ads508-sp23-t8/raw_data/grad_outcomes
Create table statement:

    CREATE EXTERNAL TABLE IF NOT EXISTS ads508_t8.grad_outcomes(
mdemographic string,
dbn string,
school_name string,
cohort string,
total_cohort string,
total_grads_n string,
total_grads_perc_cohort string,
total_regents_n string,
total_regents_perc_cohort string,
total_regents_perc_grads string,
advanced_regents_n string,
advanced_regents_perc_cohort string,
advanced_regents_perc_grads string,
regents_wo_advanced_n string,
regents_wo_advanced_perc_cohort string,
regents_wo_advanced_perc_grads string,
local_n string,
local_perc_cohort string,
local_perc_grads string,
still_enrolled_n string,
still_enrolled_perc_cohort string,
dropped_out_n string,
dropped_out_perc_cohort string
)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION 's3://sagemaker-us-east-ads508-sp23-t8/raw_data/grad_outcomes'
    TBLPROPERTIES ('skip.header.line.count'='1')


Unnamed: 0,tab_name
0,grad_outcomes
1,hs_info



Dataframe contains records: True


### Run A Sample Query

In [20]:
grd_dbn_id01 = "01M448"

grd_select_dbn_stmnt = f"""SELECT * FROM {database_name}.{grd_tbl_name}
WHERE dbn = '{grd_dbn_id01}' LIMIT 20"""

print(grd_select_dbn_stmnt)

SELECT * FROM ads508_t8.grad_outcomes
WHERE dbn = '01M448' LIMIT 20


In [21]:
grd_df01_s01 = pd.read_sql(grd_select_dbn_stmnt,
                           conn)
grd_df01_s01.head(5)

Unnamed: 0,mdemographic,dbn,school_name,cohort,total_cohort,total_grads_n,total_grads_perc_cohort,total_regents_n,total_regents_perc_cohort,total_regents_perc_grads,...,regents_wo_advanced_n,regents_wo_advanced_perc_cohort,regents_wo_advanced_perc_grads,local_n,local_perc_cohort,local_perc_grads,still_enrolled_n,still_enrolled_perc_cohort,dropped_out_n,dropped_out_perc_cohort
0,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2001,64,46,71.9,32,50.0,69.6,...,25,39.1,54.3,14,21.9,30.4,10,15.6,6,9.4
1,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2002,52,33,63.5,19,36.5,57.6,...,11,21.2,33.3,14,26.9,42.4,16,30.8,1,1.9
2,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2003,87,67,77.0,39,44.8,58.2,...,28,32.2,41.8,28,32.2,41.8,9,10.3,11,12.6
3,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2004,112,75,67.0,36,32.1,48.0,...,30,26.8,40.0,39,34.8,52.0,33,29.5,4,3.6
4,Total Cohort,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,2005,121,64,52.9,35,28.9,54.7,...,31,25.6,48.4,29,24.0,45.3,41,33.9,11,9.1


In [22]:
if not grd_df01_s01.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]
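Every column in `grad_outcomes` is declared as `string`, so numeric analysis needs a conversion step at query time. The sketch below builds a query that converts selected columns with `TRY_CAST`, which yields `NULL` instead of failing when a value cannot be converted. `build_typed_select` is a hypothetical helper, and the target types chosen here are assumptions about the data rather than something the original notebook specifies.

```python
def build_typed_select(db, tbl_name, casts, limit=20):
    """Build a SELECT that converts string columns to the given SQL types."""
    if not casts:
        raise ValueError("at least one column is required")
    select_parts = [
        f"TRY_CAST({col} AS {dtype}) AS {col}" for col, dtype in casts.items()
    ]
    column_sql = ", ".join(select_parts)
    return f"SELECT {column_sql} FROM {db}.{tbl_name} LIMIT {int(limit)}"


typed_stmnt = build_typed_select(
    db="ads508_t8",
    tbl_name="grad_outcomes",
    casts={"total_cohort": "integer", "total_grads_perc_cohort": "double"},
    limit=5,
)
print(typed_stmnt)
```

The resulting statement could be passed to `pd.read_sql` in the same way as the sample query above.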


## Create Athena Table from Local TSV File - `2014_-_2015_DOE_High_School_Directory.tsv`

### Dataset columns

- `dbn`: ,
- `school_name`: ,
- `borough`: ,
- `building_code`: ,
- `phone_number`: ,
- `fax_number`: ,
- `grade_span_min`: ,
- `grade_span_max`: ,
- `expgrade_span_min`: ,
- `expgrade_span_max`: ,
- `bus`: ,
- `subway`: ,
- `primary_address_line_1`: ,
- `city`: ,
- `state_code`: ,
- `postcode`: ,
- `website`: ,
- `total_students`: ,
- `campus_name`: ,
- `school_type`: ,
- `overview_paragraph`: ,
- `program_highlights`: ,
- `language_classes`: ,
- `advancedplacement_courses`: ,
- `online_ap_courses`: ,
- `online_language_courses`: ,
- `extracurricular_activities`: ,
- `psal_sports_boys`: ,
- `psal_sports_girls`: ,
- `psal_sports_coed`: ,
- `school_sports`: ,
- `partner_cbo`: ,
- `partner_hospital`: ,
- `partner_highered`: ,
- `partner_cultural`: ,
- `partner_nonprofit`: ,
- `partner_corporate`: ,
- `partner_financial`: ,
- `partner_other`: ,
- `addtl_info1`: ,
- `addtl_info2`: ,
- `start_time`: ,
- `end_time`: ,
- `se_services`: ,
- `ell_programs`: ,
- `school_accessibility_description`: ,
- `number_programs`: ,
- `priority01`: ,
- `priority02`: ,
- `priority03`: ,
- `priority04`: ,
- `priority05`: ,
- `priority06`: ,
- `priority07`: ,
- `priority08`: ,
- `priority09`: ,
- `priority10`: ,
- `location_1`: ,
- `community_board`: ,
- `council_district`: ,
- `census_tract`: ,
- `bin`: ,
- `bbl`: ,
- `nta`: 

In [23]:
hsi_tbl_name = 'hs_info'
hsi_field_list = """
dbn string,
school_name string,
borough string,
building_code string,
phone_number string,
fax_number string,
grade_span_min string,
grade_span_max string,
expgrade_span_min string,
expgrade_span_max string,
bus string,
subway string,
primary_address_line_1 string,
city string,
state_code string,
postcode string,
website string,
total_students string,
campus_name string,
school_type string,
overview_paragraph string,
program_highlights string,
language_classes string,
advancedplacement_courses string,
online_ap_courses string,
online_language_courses string,
extracurricular_activities string,
psal_sports_boys string,
psal_sports_girls string,
psal_sports_coed string,
school_sports string,
partner_cbo string,
partner_hospital string,
partner_highered string,
partner_cultural string,
partner_nonprofit string,
partner_corporate string,
partner_financial string,
partner_other string,
addtl_info1 string,
addtl_info2 string,
start_time string,
end_time string,
se_services string,
ell_programs string,
school_accessibility_description string,
number_programs string,
priority01 string,
priority02 string,
priority03 string,
priority04 string,
priority05 string,
priority06 string,
priority07 string,
priority08 string,
priority09 string,
priority10 string,
location_1 string,
community_board string,
council_district string,
census_tract string,
bin string,
bbl string,
nta string
"""
hsi_s3_raw_data_path = f"s3://{bucket}/raw_data/hs_dir"
print(hsi_s3_raw_data_path)

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=hsi_tbl_name,
                      fields=hsi_field_list,
                      s3_path=hsi_s3_raw_data_path,
                      delim='\\t',
                      comp='',
                      skip="'skip.header.line.count'='1'")

s3://sagemaker-us-east-ads508-sp23-t8/raw_data/hs_dir
Create table statement:

    CREATE EXTERNAL TABLE IF NOT EXISTS ads508_t8.hs_info(
dbn string,
school_name string,
borough string,
building_code string,
phone_number string,
fax_number string,
grade_span_min string,
grade_span_max string,
expgrade_span_min string,
expgrade_span_max string,
bus string,
subway string,
primary_address_line_1 string,
city string,
state_code string,
postcode string,
website string,
total_students string,
campus_name string,
school_type string,
overview_paragraph string,
program_highlights string,
language_classes string,
advancedplacement_courses string,
online_ap_courses string,
online_language_courses string,
extracurricular_activities string,
psal_sports_boys string,
psal_sports_girls string,
psal_sports_coed string,
school_sports string,
partner_cbo string,
partner_hospital string,
partner_highered string,
partner_cultural string,
partner_nonprofit string,
partner_corporate string,
partner_financial

Unnamed: 0,tab_name
0,grad_outcomes
1,hs_info



Dataframe contains records: True


### Run A Sample Query

In [24]:
hsi_dbn_id01 = "01M448"

hsi_select_dbn_stmnt = f"""SELECT * FROM {database_name}.{hsi_tbl_name}
WHERE dbn = '{hsi_dbn_id01}' LIMIT 20"""

print(hsi_select_dbn_stmnt)

SELECT * FROM ads508_t8.hs_info
WHERE dbn = '01M448' LIMIT 20


In [25]:
hsi_df01_s01 = pd.read_sql(hsi_select_dbn_stmnt,
                           conn)
hsi_df01_s01.head(5)

Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,priority08,priority09,priority10,location_1,community_board,council_district,census_tract,bin,bbl,nta
0,01M448,University Neighborhood High School,Manhattan,M446,212-962-4341,212-267-5611,9,12,,,...,,,,"""200 Monroe Street",,,,,,


In [26]:
if not hsi_df01_s01.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]


## Create Athena Table from Local CSV File - `nyc_census_tracts.csv`

### Dataset columns

- `censustract`: ,
- `county`: ,
- `borough`: ,
- `totalpop`: ,
- `men`: ,
- `women`: ,
- `hispanic`: ,
- `white`: ,
- `black`: ,
- `native`: ,
- `asian`: ,
- `citizen`: ,
- `income`: ,
- `incomeerr`: ,
- `incomepercap`: ,
- `incomepercaperr`: ,
- `poverty`: ,
- `childpoverty`: ,
- `professional`: ,
- `service`: ,
- `office`: ,
- `construction`: ,
- `production`: ,
- `drive`: ,
- `carpool`: ,
- `transit`: ,
- `walk`: ,
- `othertransp`: ,
- `workathome`: ,
- `meancommute`: ,
- `employed`: ,
- `privatework`: ,
- `publicwork`: ,
- `selfemployed`: ,
- `familywork`: ,
- `unemployment`: 

In [28]:
cen_tbl_name = 'census'
cen_field_list = """
censustract string,
county string,
borough string,
totalpop string,
men string,
women string,
hispanic string,
white string,
black string,
native string,
asian string,
citizen string,
income string,
incomeerr string,
incomepercap string,
incomepercaperr string,
poverty string,
childpoverty string,
professional string,
service string,
office string,
construction string,
production string,
drive string,
carpool string,
transit string,
walk string,
othertransp string,
workathome string,
meancommute string,
employed string,
privatework string,
publicwork string,
selfemployed string,
familywork string,
unemployment string
"""
cen_s3_raw_data_path = f"s3://{bucket}/raw_data/census"
print(cen_s3_raw_data_path)

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=cen_tbl_name,
                      fields=cen_field_list,
                      s3_path=cen_s3_raw_data_path,
                      comp='',
                      skip="'skip.header.line.count'='1'")

s3://sagemaker-us-east-ads508-sp23-t8/raw_data/census
Create table statement:

    CREATE EXTERNAL TABLE IF NOT EXISTS ads508_t8.crime(
censustract string,
county string,
borough string,
totalpop string,
men string,
women string,
hispanic string,
white string,
black string,
native string,
asian string,
citizen string,
income string,
incomeerr string,
incomepercap string,
incomepercaperr string,
poverty string,
childpoverty string,
professional string,
service string,
office string,
construction string,
production string,
drive string,
carpool string,
transit string,
walk string,
othertransp string,
workathome string,
meancommute string,
employed string,
privatework string,
publicwork string,
selfemployed string,
familywork string,
unemployment string
)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION 's3://sagemaker-us-east-ads508-sp23-t8/raw_data/census'
    TBLPROPERTIES ('skip.header.line.count'='1')


Unnamed: 0,tab_name
0,crime
1,grad_outcomes
2,hs_info



Dataframe contains records: True


### Run A Sample Query

In [29]:
cen_borough_id01 = "Bronx"

cen_select_dbn_stmnt = f"""SELECT * FROM {database_name}.{cen_tbl_name}
WHERE borough = '{cen_borough_id01}' LIMIT 20"""

print(cen_select_dbn_stmnt)

SELECT * FROM ads508_t8.crime
WHERE borough = 'Bronx' LIMIT 20


In [30]:
cen_df01_s01 = pd.read_sql(cen_select_dbn_stmnt,
                           conn)
cen_df01_s01.head(5)

Unnamed: 0,censustract,county,borough,totalpop,men,women,hispanic,white,black,native,...,walk,othertransp,workathome,meancommute,employed,privatework,publicwork,selfemployed,familywork,unemployment
0,36005000100,Bronx,Bronx,7703,7133,570,29.9,6.1,60.9,0.2,...,,,,,0,,,,,
1,36005000200,Bronx,Bronx,5403,2659,2744,75.8,2.3,16.0,0.0,...,2.9,0.0,0.0,43.0,2308,80.8,16.2,2.9,0.0,7.7
2,36005000400,Bronx,Bronx,5915,2896,3019,62.7,3.6,30.7,0.0,...,1.4,0.5,2.1,45.0,2675,71.7,25.3,2.5,0.6,9.5
3,36005001600,Bronx,Bronx,5879,2558,3321,65.1,1.6,32.4,0.0,...,8.6,1.6,1.7,38.8,2120,75.0,21.3,3.8,0.0,8.7
4,36005001900,Bronx,Bronx,2591,1206,1385,55.4,9.0,29.0,0.0,...,3.0,2.4,6.2,45.4,1083,76.8,15.5,7.7,0.0,19.2


In [31]:
if not cen_df01_s01.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]


## Create Athena Table from Local TSV File - `NYPD_Complaint_Data_Historic (1).csv`

### Dataset columns

- `cmplnt_num`: ,
- `cmplnt_fr_dt`: ,
- `cmplnt_fr_tm`: ,
- `cmplnt_to_dt`: ,
- `cmplnt_to_tm`: ,
- `addr_pct_cd`: ,
- `rpt_dt`: ,
- `ky_cd`: ,
- `ofns_desc`: ,
- `pd_cd`: ,
- `pd_desc`: ,
- `crm_atpt_cptd_cd`: ,
- `law_cat_cd`: ,
- `boro_nm`: ,
- `loc_of_occur_desc`: ,
- `prem_typ_desc`: ,
- `juris_desc`: ,
- `jurisdiction_code`: ,
- `parks_nm`: ,
- `hadevelopt`: ,
- `housing_psa`: ,
- `x_coord_cd`: ,
- `y_coord_cd`: ,
- `susp_age_group`: ,
- `susp_race`: ,
- `susp_sex`: ,
- `transit_district`: ,
- `latitude`: ,
- `longitude`: ,
- `lat_lon`: ,
- `patrol_boro`: ,
- `station_name`: ,
- `vic_age_group`: ,
- `vic_race`: ,
- `vic_sex`: 

In [23]:
cri_tbl_name = 'crime'
cri_field_list = """
cmplnt_num string,
cmplnt_fr_dt string,
cmplnt_fr_tm string,
cmplnt_to_dt string,
cmplnt_to_tm string,
addr_pct_cd string,
rpt_dt string,
ky_cd string,
ofns_desc string,
pd_cd string,
pd_desc string,
crm_atpt_cptd_cd string,
law_cat_cd string,
boro_nm string,
loc_of_occur_desc string,
prem_typ_desc string,
juris_desc string,
jurisdiction_code string,
parks_nm string,
hadevelopt string,
housing_psa string,
x_coord_cd string,
y_coord_cd string,
susp_age_group string,
susp_race string,
susp_sex string,
transit_district string,
latitude string,
longitude string,
lat_lon string,
patrol_boro string,
station_name string,
vic_age_group string,
vic_race string,
vic_sex string
"""
cri_s3_raw_data_path = f"s3://{bucket}/raw_data/crime"
print(cri_s3_raw_data_path)

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=cri_tbl_name,
                      fields=cri_field_list,
                      s3_path=cri_s3_raw_data_path,
                      delim='\\t',
                      comp="'compressionType'='gzip'",
                      skip="'skip.header.line.count'='1'")

### Run A Sample Query

In [24]:
cri_law_cat_cd01 = "MISDEMEANOR"

cri_select_dbn_stmnt = f"""SELECT * FROM {database_name}.{cri_tbl_name}
WHERE law_cat_cd = '{cri_law_cat_cd01}' LIMIT 20"""

print(cri_select_dbn_stmnt)


In [25]:
cri_df01_s01 = pd.read_sql(cri_select_dbn_stmnt,
                           conn)
cri_df01_s01.head(5)


In [26]:
if not cri_df01_s01.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]


## Create Athena Table from Local TSV File - `Evictions.tsv`

### Dataset columns

- `court_index_number`: ,
- `docket_number`: ,
- `eviction_address`: ,
- `eviction_apartment_number`: ,
- `executed_date`: ,
- `marshal_first_name`: ,
- `marshal_last_name`: ,
- `residential_or_commercial`: ,
- `borough`: ,
- `eviction_postcode`: ,
- `ejectment`: ,
- `eviction_or_legal_possession`: ,
- `latitude`: ,
- `longitude`: ,
- `community_board`: ,
- `council_district`: ,
- `census_tract`: ,
- `bin`: ,
- `bbl`: ,
- `nta`: 

In [23]:
evi_tbl_name = 'evictions'
evi_field_list = """
court_index_number string,
docket_number string,
eviction_address string,
eviction_apartment_number string,
executed_date string,
marshal_first_name string,
marshal_last_name string,
residential_or_commercial string,
borough string,
eviction_postcode string,
ejectment string,
eviction_or_legal_possession string,
latitude string,
longitude string,
community_board string,
council_district string,
census_tract string,
bin string,
bbl string,
nta string
"""
evi_s3_raw_data_path = f"s3://{bucket}/raw_data/evictions"
print(evi_s3_raw_data_path)

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=evi_tbl_name,
                      fields=evi_field_list,
                      s3_path=evi_s3_raw_data_path,
                      delim='\\t',
                      comp='',
                      skip="'skip.header.line.count'='1'")

### Run A Sample Query

In [24]:
evi_borough01 = "BRONX"

evi_select_dbn_stmnt = f"""SELECT * FROM {database_name}.{evi_tbl_name}
WHERE borough = '{evi_borough01}' LIMIT 20"""

print(evi_select_dbn_stmnt)


In [25]:
evi_df01_s01 = pd.read_sql(evi_select_dbn_stmnt,
                           conn)
evi_df01_s01.head(5)


In [26]:
if not evi_df01_s01.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]


## Create Athena Table from Local TSV File - `NYC _Jobs.tsv`

### Dataset columns

- `job_id`: ,
- `agency`: ,
- `posting_type`: ,
- `num_of_positions`: ,
- `business_title`: ,
- `civil_service_title`: ,
- `title_classification`: ,
- `title_code_no`: ,
- `level`: ,
- `job_category`: ,
- `fulltime_or_parttime_indicator`: ,
- `career_level`: ,
- `salary_range_from`: ,
- `salary_range_to`: ,
- `salary_frequency`: ,
- `work_location`: ,
- `division_or_work_unit`: ,
- `job_description`: ,
- `minimum_qual_requirements`: ,
- `preferred_skills`: ,
- `additional_information`: ,
- `to_apply`: ,
- `hours_or_shift`: ,
- `work_location_1`: ,
- `recruitment_contact`: ,
- `residency_requirement`: ,
- `posting_date`: ,
- `post_until`: ,
- `posting_updated`: ,
- `process_date`: 

In [23]:
job_tbl_name = 'jobs'
job_field_list = """
job_id string,
agency string,
posting_type string,
num_of_positions string,
business_title string,
civil_service_title string,
title_classification string,
title_code_no string,
level string,
job_category string,
fulltime_or_parttime_indicator string,
career_level string,
salary_range_from string,
salary_range_to string,
salary_frequency string,
work_location string,
division_or_work_unit string,
job_description string,
minimum_qual_requirements string,
preferred_skills string,
additional_information string,
to_apply string,
hours_or_shift string,
work_location_1 string,
recruitment_contact string,
residency_requirement string,
posting_date string,
post_until string,
posting_updated string,
process_date string
"""
job_s3_raw_data_path = f"s3://{bucket}/raw_data/jobs"
print(job_s3_raw_data_path)

create_athena_tbl_tsv(conn=conn,
                      db=database_name,
                      tbl_name=job_tbl_name,
                      fields=job_field_list,
                      s3_path=job_s3_raw_data_path,
                      delim='\\t',
                      comp='',
                      skip="'skip.header.line.count'='1'")

### Run A Sample Query

In [24]:
job_agency01 = "HOUSING"

job_select_dbn_stmnt = f"""SELECT * FROM {database_name}.{job_tbl_name}
WHERE agency LIKE '%{job_agency01}%' LIMIT 20"""

print(job_select_dbn_stmnt)


In [25]:
job_df01_s01 = pd.read_sql(job_select_dbn_stmnt,
                           conn)
job_df01_s01.head(5)


In [26]:
if not job_df01_s01.empty:
    print("[OK]")
else:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")

[OK]


## Review the New Athena Table in the Glue Catalog

In [32]:
display(
    HTML(
        f'<b>Review <a target="top" href="https://console.aws.amazon.com/glue/home?region={region}#">AWS Glue Catalog</a></b>'
    )
)

## Store Variables for the Next Notebooks

In [33]:
%store

Stored variables and their in-db values:
ingest_create_athena_db_passed                    -> True
ingest_create_athena_table_tsv_passed             -> True
s3_private_path_tsv                               -> 's3://sagemaker-us-east-1-657724983756/amazon-revi
s3_public_path_tsv                                -> 's3://amazon-reviews-pds/tsv'
setup_dependencies_passed                         -> True
setup_iam_roles_passed                            -> True
setup_instance_check_passed                       -> True
setup_s3_bucket_passed                            -> True


## Release Resources

In [34]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [35]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}
