# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to configure the session with notebook commands ("magics").


In [7]:
%iam_role arn:aws:iam::212430227630:role/LabRole
%region us-east-1
%number_of_workers 2

%idle_timeout 30
%glue_version 4.0
%worker_type G.1X

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.5 
Current iam_role is arn:aws:iam::212430227630:role/LabRole
iam_role has been set to arn:aws:iam::212430227630:role/LabRole.
Previous region: us-east-1
Setting new region to: us-east-1
Region is set to: us-east-1
Previous number of workers: None
Setting new number of workers to: 2
Current idle_timeout is None minutes.
idle_timeout has been set to 30 minutes.
Setting Glue version to: 4.0
Previous worker type: None
Setting new worker type to: G.1X


#### Run this cell to set up and start your interactive session.


In [10]:
%load_ext autoreload
%autoreload 2

In [1]:
import sys
import boto3

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 2
Idle Timeout: 30
Session ID: 3cf2b110-5d25-4963-92e0-3b5b2aaa4d3c
Applying the following default arguments:
--glue_kernel_version 1.0.5
--enable-glue-datacatalog true
Waiting for session 3cf2b110-5d25-4963-92e0-3b5b2aaa4d3c to get into ready status...
Session 3cf2b110-5d25-4963-92e0-3b5b2aaa4d3c has been created.



## Save Raw data to Silver - Commodities


In [2]:
from datetime import datetime, timedelta, timezone

import pyspark.sql.functions as F




### Set AWS Storage parameters


In [3]:
BUCKET_NAME = "cryptoengineer"
BRONZE_PREFIX = "datalake/bronze/commodities"




### Load job parameters

In [6]:
glue_client = boto3.client("glue")

if '--WORKFLOW_NAME' in sys.argv and '--WORKFLOW_RUN_ID' in sys.argv:
    print("Running in Glue Workflow")
    
    glue_args = getResolvedOptions(
        sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID']
    )
    
    print("Reading the workflow parameters")
    workflow_args = glue_client.get_workflow_run_properties(
        Name=glue_args['WORKFLOW_NAME'], RunId=glue_args['WORKFLOW_RUN_ID']
    )["RunProperties"]

    
    time_frame = int(workflow_args['time_frame'])
    symbols = workflow_args['symbols']

else:
    try:
        print("Running as Job")
        args = getResolvedOptions(sys.argv,
                                  ['JOB_NAME',
                                   'time_frame',
                                   'symbols'
                                   ])

        time_frame = int(args['time_frame'])
        symbols = args['symbols']
    except Exception:
        # Job arguments were not supplied: fall back to defaults
        time_frame = 24
        symbols = "BZUSD"


Running as Job


In [7]:
print("Time Frame: ", time_frame)
print("Symbols: ", symbols)

Time Frame:  120
Symbols:  BZUSD
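
The parameters arrive as raw strings, so it can help to validate them once before they are used. The following is a minimal sketch, not part of the original job: `normalize_params` is a hypothetical helper, and it assumes that several symbols would be passed as a comma-separated string (the notebook only ever shows the single value `BZUSD`; `CLUSD` in the usage note is purely illustrative).

```python
from typing import List, Tuple


def normalize_params(raw_time_frame: str, raw_symbols: str) -> Tuple[int, List[str]]:
    """Hypothetical helper: validate the two job parameters.

    Assumes symbols arrive as a comma-separated string, e.g. "BZUSD,CLUSD".
    """
    time_frame = int(raw_time_frame)  # raises ValueError on non-numeric input
    if time_frame <= 0:
        raise ValueError(f"time_frame must be a positive number of hours, got {time_frame}")

    # Split, trim whitespace, upper-case, and drop empty entries.
    symbols = [s.strip().upper() for s in raw_symbols.split(",") if s.strip()]
    if not symbols:
        raise ValueError("at least one symbol is required")
    return time_frame, symbols


print(normalize_params("120", "BZUSD"))
```

Called as `normalize_params("24", " bzusd, clusd ,")`, the helper returns `24` and the cleaned list `["BZUSD", "CLUSD"]`.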


#### Set the start and end dates for the data you want to load

In [8]:
# Start and end dates (UTC), derived from a single timestamp
now = datetime.now(timezone.utc)
start_date = (now - timedelta(hours=time_frame)).strftime("%Y-%m-%d")
end_date = now.strftime("%Y-%m-%d")

print("Start date: ", start_date, " End date: ", end_date)

Start date:  2024-08-28  End date:  2024-09-02
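
The date window is easier to test when the current time can be injected instead of read from the clock. A minimal sketch under that assumption follows; `date_window` is a hypothetical helper name, and the fixed timestamp is chosen only to reproduce the window printed above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def date_window(time_frame_hours: int, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start_date, end_date) as YYYY-MM-DD strings in UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    start = now - timedelta(hours=time_frame_hours)
    return start.strftime("%Y-%m-%d"), now.strftime("%Y-%m-%d")


# Fixed timestamp chosen for illustration.
fixed_now = datetime(2024, 9, 2, 10, 0, tzinfo=timezone.utc)
print(date_window(120, fixed_now))  # ('2024-08-28', '2024-09-02')
```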


## Load the Bronze/Raw data for the time frame and symbol

In [9]:
path = f"s3://{BUCKET_NAME}/{BRONZE_PREFIX}"
print("Path:", path)

Path: s3://cryptoengineer/datalake/raw/commodities


In [10]:
df = (
    spark
    .read
    .parquet(path)
    .filter(F.col("load_date").between(start_date, end_date))
)




In [21]:
print("Records: ", df.count())

Records:  87610
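
The filter on `load_date` relies on two properties: `between` is inclusive on both ends, and `YYYY-MM-DD` strings sort in the same order as the dates they represent. The plain-Python sketch below illustrates both, assuming `load_date` is stored in that ISO format (as the sample rows suggest). `in_window` is a hypothetical helper, not a Spark API.

```python
from datetime import date


def in_window(load_date: str, start_date: str, end_date: str) -> bool:
    """Inclusive range check on ISO date strings, mirroring Column.between."""
    return start_date <= load_date <= end_date


dates = ["2024-08-27", "2024-08-28", "2024-08-30", "2024-09-02", "2024-09-03"]
kept = [d for d in dates if in_window(d, "2024-08-28", "2024-09-02")]
print(kept)  # ['2024-08-28', '2024-08-30', '2024-09-02']

# The string comparison agrees with a real date comparison for every sample.
lower, upper = date.fromisoformat("2024-08-28"), date.fromisoformat("2024-09-02")
for d in dates:
    assert (lower <= date.fromisoformat(d) <= upper) == in_window(d, "2024-08-28", "2024-09-02")
```

Both boundary dates are kept, so a window of 120 hours can span six calendar days of `load_date` values.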


In [12]:
df.show(5)

+-------------------+-----+-----+-----+-----+------+----+-----+---+--------+----------+-------------+------+---------+------+--------------------+-----+----------+
|           datetime| open|  low| high|close|volume|year|month|day|    time|      date|base_currency|source|frequency|symbol|          audit_time| type| load_date|
+-------------------+-----+-----+-----+-----+------+----+-----+---+--------+----------+-------------+------+---------+------+--------------------+-----+----------+
|2023-09-29 14:00:00|95.34|95.32|95.34|95.33|     7|2023|   09| 29|14:00:00|2023-09-29|          USD|   FMP|    15min| BZUSD|2024-08-30 09:59:...|FOREX|2024-08-30|
|2023-09-29 13:45:00|95.32|95.32|95.32|95.32|     2|2023|   09| 29|13:45:00|2023-09-29|          USD|   FMP|    15min| BZUSD|2024-08-30 09:59:...|FOREX|2024-08-30|
|2023-09-29 12:45:00|95.33|95.33|95.33|95.33|     1|2023|   09| 29|12:45:00|2023-09-29|          USD|   FMP|    15min| BZUSD|2024-08-30 09:59:...|FOREX|2024-08-30|
|2023-09-29 12:1

### Temporary: Set type to Commodities

In [20]:
df = df.withColumn('type', F.lit('COMMODITIES'))




In [None]:
"""
df = df.withColumn('type', F.lit('COMMODITIES'))
(
    df
    .groupBy('source','type','symbol','year','frequency')
    .count().alias('count')
    .filter(F.col('frequency') == "15min")
    .show(50)
)
"""

## Find and remove duplicate rows

In [None]:
"""
(
    df
    .groupBy('symbol','year','datetime','source','frequency','type')
    .count().alias('count')
    .filter(F.col('count') > 1)
    .show()
)
"""

In [24]:
df = (
    df
    .dropDuplicates(['symbol','year','datetime','source','frequency','type'])
)




In [25]:
print("After dropDuplicates: ", df.count())

After dropDuplicates:  87036
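
`dropDuplicates` keeps one row per key but does not guarantee which one. If the most recently loaded row should win, the rule can be stated explicitly. The sketch below shows that rule on plain Python dictionaries rather than a Spark DataFrame; `keep_latest` is a hypothetical helper, the choice of `audit_time` as the tie-breaker is an assumption, and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

KEY_COLUMNS = ("symbol", "year", "datetime", "source", "frequency", "type")


def keep_latest(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per key: the one with the greatest audit_time."""
    latest: Dict[Tuple, dict] = {}
    for row in rows:
        key = tuple(row[column] for column in KEY_COLUMNS)
        current = latest.get(key)
        if current is None or row["audit_time"] > current["audit_time"]:
            latest[key] = row
    return list(latest.values())


# Illustrative rows only; values are made up for this sketch.
base = {"symbol": "BZUSD", "year": 2023, "source": "FMP",
        "frequency": "15min", "type": "COMMODITIES"}
rows = [
    {**base, "datetime": "2023-09-29 14:00:00", "close": 95.33, "audit_time": "2024-08-30 09:59:00"},
    {**base, "datetime": "2023-09-29 14:00:00", "close": 95.35, "audit_time": "2024-08-31 09:59:00"},
    {**base, "datetime": "2023-09-29 13:45:00", "close": 95.32, "audit_time": "2024-08-30 09:59:00"},
]
deduplicated = keep_latest(rows)
print(len(deduplicated))  # 2
```

In Spark the same rule is usually written with a window function ordered by the tie-breaker column; that translation is not shown here.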


## Append the batch data to the Silver table

Set the destination Silver table

In [27]:
BUCKET_NAME = "cryptoengineer"
SILVER_PREFIX = "datalake/silver/commodities"




In [28]:
path = f"s3://{BUCKET_NAME}/{SILVER_PREFIX}"
print("Path:", path)

Path: s3://cryptoengineer/datalake/silver/commodities


In [29]:
(
    df
    .repartition("year")
    .write
    .format("parquet")
    .mode("append")
    .partitionBy(['symbol','year'])
    .save(path)
)
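
With `partitionBy(['symbol','year'])`, Spark writes Hive-style directories named `column=value` under the target path. The sketch below builds the directory expected for one partition, which can be handy when checking the output in S3. `partition_path` is a hypothetical helper and does not touch S3.

```python
def partition_path(base_path: str, **partitions) -> str:
    """Build the Hive-style directory for one partition, e.g. .../symbol=BZUSD/year=2023."""
    parts = [f"{column}={value}" for column, value in partitions.items()]
    return "/".join([base_path.rstrip("/")] + parts)


base = "s3://cryptoengineer/datalake/silver/commodities"
print(partition_path(base, symbol="BZUSD", year=2023))
# s3://cryptoengineer/datalake/silver/commodities/symbol=BZUSD/year=2023
```

Because the write uses `mode("append")`, each run adds new files to these directories. Rerunning the job over an overlapping date window would therefore write the same rows again.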


