<a href="https://colab.research.google.com/github/davidandw190/pytorch-deep-learning-workspace/blob/main/ml-ops/dc-fares/01_data_exploratation_and_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ML-Ops | DC Taxi Fares**

## 01 - Data Exploration & Preparation

### The `BUCKET_ID` file

This notebook assumes that you have a backup copy of the `BUCKET_ID` file created in the prior notebook. Place it in the current directory before proceeding; its contents are reused later in this notebook and in the other notebooks.

In [None]:
import os
from pathlib import Path
assert Path('BUCKET_ID').exists(), "Place the BUCKET_ID file in the current directory before proceeding"

BUCKET_ID = Path('BUCKET_ID').read_text().strip()
os.environ['BUCKET_ID'] = BUCKET_ID
os.environ['BUCKET_ID']

### Configure AWS Credentials

Modify the contents of the next cell to specify your AWS credentials as strings.

If you see the following exception:

`TypeError: str expected, not NoneType`

it means that a credential was left as `None` instead of being replaced with a string.
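
The exception comes from `os.environ` itself, which only accepts string values. As a minimal sketch (the helper name `set_env_checked` is hypothetical and not part of this notebook), you could validate a value before assigning it, so that a forgotten placeholder fails with a clearer message:

```python
import os

def set_env_checked(name, value, env=os.environ):
    """Hypothetical helper: store value under name only if it is a non-empty string."""
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"{name} must be a non-empty string, got {value!r}")
    env[name] = value.strip()

# Assigning None directly is what triggers the TypeError quoted above.
try:
    os.environ['EXAMPLE_PLACEHOLDER'] = None
except TypeError as err:
    print(f"TypeError: {err}")
```

The `env` parameter defaults to the real environment; passing a plain dictionary instead makes the helper easy to try out without touching your credentials.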

In [None]:
import os
# *** REPLACE None in the next 2 lines with your AWS key values ***
os.environ['AWS_ACCESS_KEY_ID'] = None
os.environ['AWS_SECRET_ACCESS_KEY'] = None

Run the next cell to validate your credentials.


In [None]:
%%bash
aws sts get-caller-identity

### Specify the region

Replace `None` in the next cell with your AWS region name as a string, for example `eu-north-1`.

In [None]:
# *** REPLACE None in the next line with your AWS region ***
os.environ['AWS_DEFAULT_REGION'] = None

In [None]:
%%bash
echo $AWS_DEFAULT_REGION

### Downloading a tiny sample

Download a tiny sample of the dataset from https://gist.github.com/osipov/1fc0265f8f829d9d9eee8393657423a9 and save it as `trips_sample.csv`. You will use this file to learn how to work with the Athena interface.

In [None]:
%%bash
wget -q https://gist.githubusercontent.com/osipov/1fc0265f8f829d9d9eee8393657423a9/raw/9957c1f09cdfa64f8b8d89cfec532a0e150d5178/trips_sample.csv
ls -ltr trips_sample.csv
cat trips_sample.csv

### Uploading `trips_sample.csv` to S3 bucket

In [None]:
%%bash
aws s3 cp trips_sample.csv s3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/samples/trips_sample.csv
aws s3 ls s3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/samples/trips_sample.csv

### Creating an Athena workgroup

In [None]:
%%bash
aws athena delete-work-group --work-group dc_taxi_athena_workgroup --recursive-delete-option 2> /dev/null
aws athena create-work-group --name dc_taxi_athena_workgroup \
--configuration "ResultConfiguration={OutputLocation=s3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/athena},EnforceWorkGroupConfiguration=false,PublishCloudWatchMetricsEnabled=false"

### Querying Athena and Reporting on Query Status

In [None]:
%%bash
SQL="
CREATE EXTERNAL TABLE IF NOT EXISTS dc_taxi_db.dc_taxi_csv_sample_strings(
        fareamount STRING,
        origin_block_latitude STRING,
        origin_block_longitude STRING,
        destination_block_latitude STRING,
        destination_block_longitude STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/samples/'
TBLPROPERTIES ('skip.header.line.count'='1');"

ATHENA_QUERY_ID=$(aws athena start-query-execution \
--work-group dc_taxi_athena_workgroup \
--query 'QueryExecutionId' \
--output text \
--query-string "$SQL")

echo "$SQL"

echo "$ATHENA_QUERY_ID"
# Poll once per second until the query leaves the QUEUED and RUNNING states
until aws athena get-query-execution \
--query 'QueryExecution.Status.State' \
--output text \
--query-execution-id "$ATHENA_QUERY_ID" | grep -v -E "QUEUED|RUNNING";
do
printf '.'
sleep 1;
done
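
The shell loop above polls the query state once per second. The same pattern can be sketched in Python; this is an illustration, not the notebook's method. The `wait_for_query` name and its `get_state` callable are hypothetical, which lets the sketch run without AWS access, while the state names are the ones Athena reports for a query execution.

```python
import time

# States in which Athena has finished with the query
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_query(get_state, poll_seconds=1.0, max_polls=600, sleep=time.sleep):
    """Call get_state() until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)  # still QUEUED or RUNNING
    raise TimeoutError(f"query still running after {max_polls} polls")

# Stand-in for the real status call, replaying a fixed sequence of states
states = iter(["QUEUED", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_query(lambda: next(states), sleep=lambda seconds: None)
print(final_state)
```

Assuming boto3 is installed and configured, `get_state` could wrap `client.get_query_execution(QueryExecutionId=...)` and read `['QueryExecution']['Status']['State']` from the response.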

### Downloading and Previewing a Utility Script to Query Athena

The script is downloaded as `utils.sh` and is loaded in the upcoming cells using the `source utils.sh` command.

In [None]:
%%bash
wget -q https://raw.githubusercontent.com/osipov/smlbook/master/utils.sh
ls -l utils.sh

### Outputting Athena Query to a Text Table


In [None]:
%%bash
source utils.sh
SQL="
SELECT

origin_block_latitude || ' , ' || origin_block_longitude
    AS origin,

destination_block_latitude || '  , ' || destination_block_longitude
    AS destination

FROM
    dc_taxi_db.dc_taxi_csv_sample_strings
"
athena_query_to_table "$SQL" "ResultSet.Rows[*].[Data[0].VarCharValue,Data[1].VarCharValue]"

### Outputting Athena Query to JSON for a Pandas DataFrame



In [None]:
%%bash
source utils.sh ; athena_query_to_pandas """
SELECT

origin_block_latitude || ' , ' || origin_block_longitude
    AS origin,

destination_block_latitude || '  , ' || destination_block_longitude
    AS destination

FROM
    dc_taxi_db.dc_taxi_csv_sample_strings
"""

### Creating a Utility Function to Read AWS CLI JSON as Pandas

Note that the `utils.sh` script saves the output from Athena to `/tmp/awscli.json`.

In [None]:
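import pandas as pd
def awscli_to_df():
  """Load the Athena result saved by utils.sh into a DataFrame."""
  json_df = pd.read_json('/tmp/awscli.json')
  rows = json_df[0].tolist()
  # The first row holds the column names, so use it as the header and drop it from the data
  df = pd.DataFrame(rows, index = json_df.index, columns = rows[0]).drop(0, axis = 0)
  return df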
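
`awscli_to_df` treats the first row of the saved result as the header and the remaining rows as data. Here is a minimal sketch of that header-promotion step on an in-memory list, assuming rows shaped like the two-column queries above. The helper name `rows_to_df` is hypothetical and the coordinate strings are made-up values, not taken from the dataset:

```python
import pandas as pd

# Made-up rows: the first one carries the column names
rows = [
    ["origin", "destination"],
    ["38.90 , -77.03", "38.91  , -77.04"],
    ["38.89 , -77.01", "38.92  , -77.05"],
]

def rows_to_df(rows):
    """Hypothetical helper: use rows[0] as column names and the rest as data."""
    header, *data = rows
    return pd.DataFrame(data, columns=header)

df = rows_to_df(rows)
print(df)
```

Unlike `awscli_to_df`, this sketch numbers the data rows from 0, because the header row is removed before the DataFrame is built.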

In [None]:
awscli_to_df()

### Applying Athena schema-on-read with columns as DOUBLE

In [None]:
%%bash
source utils.sh ; athena_query "
CREATE EXTERNAL TABLE IF NOT EXISTS dc_taxi_db.dc_taxi_csv_sample_double(
        fareamount DOUBLE,
        origin_block_latitude DOUBLE,
        origin_block_longitude DOUBLE,
        destination_block_latitude DOUBLE,
        destination_block_longitude DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/samples/'
TBLPROPERTIES ('skip.header.line.count'='1');"

In [None]:
%%bash
source utils.sh ; athena_query_to_pandas "
SELECT ROUND(MAX(fareamount) - MIN(fareamount), 2)
FROM dc_taxi_db.dc_taxi_csv_sample_double
"

In [None]:
awscli_to_df()
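
Schema-on-read means that the CSV file in S3 stays unchanged while each table definition decides how its text is interpreted. As a rough local analogy in pandas (not how Athena works internally), the same CSV text can be read once as strings and once as floating-point numbers. The rows below are made-up values; only the column names come from the table definitions above:

```python
import io
import pandas as pd

# Made-up sample rows; only the column names come from the table definitions
csv_text = """fareamount,origin_block_latitude,origin_block_longitude,destination_block_latitude,destination_block_longitude
6.25,38.90,-77.03,38.91,-77.04
15.5,38.89,-77.01,38.92,-77.05
"""

# Same text, two schemas applied at read time
as_strings = pd.read_csv(io.StringIO(csv_text), dtype=str)
as_doubles = pd.read_csv(io.StringIO(csv_text), dtype=float)

# Arithmetic such as the fare range only makes sense on the numeric reading
fare_range = round(as_doubles["fareamount"].max() - as_doubles["fareamount"].min(), 2)
print(as_strings.dtypes["fareamount"], as_doubles.dtypes["fareamount"], fare_range)
```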