# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using it, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to configure the session with notebook commands ("magics").


In [7]:
%iam_role arn:aws:iam::212430227630:role/LabRole
%region us-east-1
%number_of_workers 2

%idle_timeout 30
%glue_version 4.0
%worker_type G.1X

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.5 
Current iam_role is arn:aws:iam::212430227630:role/LabRole
iam_role has been set to arn:aws:iam::212430227630:role/LabRole.
Previous region: us-east-1
Setting new region to: us-east-1
Region is set to: us-east-1
Previous number of workers: None
Setting new number of workers to: 2
Current idle_timeout is None minutes.
idle_timeout has been set to 30 minutes.
Setting Glue version to: 4.0
Previous worker type: None
Setting new worker type to: G.1X


#### Run this cell to set up and start your interactive session.


In [10]:
%load_ext autoreload
%autoreload 2

In [1]:
import sys
import boto3

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 2
Idle Timeout: 30
Session ID: 1d6ff7e8-85b3-4a6e-8106-76cac3db6d51
Applying the following default arguments:
--glue_kernel_version 1.0.5
--enable-glue-datacatalog true
Waiting for session 1d6ff7e8-85b3-4a6e-8106-76cac3db6d51 to get into ready status...
Session 1d6ff7e8-85b3-4a6e-8106-76cac3db6d51 has been created.



## Save Raw data to Silver - FOREX


In [2]:
from datetime import datetime, timedelta, timezone

import pyspark.sql.functions as F




### Set AWS Storage parameters


In [3]:
BUCKET_NAME = "cryptoengineer"
PREFIX_BRONZE = "datalake/bronze/forex"




### Load job parameters

In [7]:
glue_client = boto3.client("glue")

if '--WORKFLOW_NAME' in sys.argv and '--WORKFLOW_RUN_ID' in sys.argv:
    print("Running in Glue Workflow")
    
    glue_args = getResolvedOptions(
        sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID']
    )
    
    print("Reading the workflow parameters")
    workflow_args = glue_client.get_workflow_run_properties(
        Name=glue_args['WORKFLOW_NAME'], RunId=glue_args['WORKFLOW_RUN_ID']
    )["RunProperties"]

    
    time_frame = int(workflow_args['time_frame'])
    symbols = workflow_args['symbols']

else:
    try:
        print("Running as Job")
        args = getResolvedOptions(sys.argv,
                                  ['JOB_NAME',
                                   'time_frame',
                                   'symbols'
                                   ])

        time_frame = int(args['time_frame'])
        symbols = args['symbols']
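    except Exception:
        # Job arguments are missing (for example, in an interactive session), so fall back to defaults
        time_frame = 24
        symbols = "USDEUR"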


Running as Job
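
The fallback logic above can be illustrated without Glue. The following is a minimal standard-library sketch, not the `getResolvedOptions` implementation: `resolve_params` and `DEFAULTS` are hypothetical names, and the sketch assumes arguments arrive as `--name value` pairs.

```python
from typing import Dict, List

# Hypothetical defaults, mirroring the fallback values used in the cell above
DEFAULTS = {"time_frame": "24", "symbols": "USDEUR"}


def resolve_params(argv: List[str], defaults: Dict[str, str] = DEFAULTS) -> Dict[str, str]:
    """Read `--name value` pairs from argv, falling back to defaults."""
    params = dict(defaults)  # copy so the defaults are never mutated
    # Walk over adjacent (flag, value) pairs
    for flag, value in zip(argv, argv[1:]):
        name = flag[2:] if flag.startswith("--") else None
        if name in params and not value.startswith("--"):
            params[name] = value
    return params


# No arguments supplied: the defaults apply
print(resolve_params(["script.py"]))
# Arguments supplied: they override the defaults
print(resolve_params(["script.py", "--time_frame", "120", "--symbols", "USDEUR"]))
```

Values stay strings here, as they do in the notebook, so `time_frame` still needs an explicit `int(...)` conversion before it is used in date arithmetic.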


In [8]:
print("Time Frame: ", time_frame)
print("Symbols: ", symbols)

Time Frame:  120
Symbols:  USDEUR


#### Set the start and end dates for the data you want to load

In [9]:
# Date window: from `time_frame` hours ago until now (UTC)
now = datetime.now(timezone.utc)
start_date = (now - timedelta(hours=time_frame)).strftime("%Y-%m-%d")
end_date = now.strftime("%Y-%m-%d")

print("Start date: ", start_date, " End date: ", end_date)

Start date:  2024-08-28  End date:  2024-09-02
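
Note that `time_frame` is expressed in hours, while the resulting window is truncated to calendar dates. A small sketch of that calculation follows; the helper name `date_window` and the fixed clock reading are assumptions made for illustration, chosen so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def date_window(time_frame_hours: int, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start_date, end_date) as YYYY-MM-DD strings in UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    start = now - timedelta(hours=time_frame_hours)
    return start.strftime("%Y-%m-%d"), now.strftime("%Y-%m-%d")


# Fixed, assumed clock reading so the example is reproducible
example_now = datetime(2024, 9, 2, 0, 5, tzinfo=timezone.utc)
print(date_window(120, now=example_now))
```

Taking a single clock reading and deriving both dates from it avoids the (unlikely) case where the two values straddle midnight.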


## Load the Bronze/Raw data for the time frame and symbol

In [10]:
path = f"s3://{BUCKET_NAME}/{PREFIX_BRONZE}"
print("Path:", path)

Path: s3://cryptoengineer/datalake/bronze/forex


In [17]:
df = (
    spark
    .read
    .parquet(path)
    .filter(F.col("load_date").between(start_date, end_date))
)
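
`Column.between` is inclusive at both ends, so the filter keeps rows whose `load_date` equals `start_date` or `end_date`. As a plain-Python sketch (the helper `dates_in_window` is hypothetical and not part of the notebook), the calendar dates covered by the window printed earlier can be listed like this:

```python
from datetime import date, timedelta
from typing import List


def dates_in_window(start_date: str, end_date: str) -> List[str]:
    """List every calendar date from start_date to end_date, both included."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


window = dates_in_window("2024-08-28", "2024-09-02")
print(len(window), window[0], window[-1])
```

A 120-hour time frame therefore touches six `load_date` values rather than five, because both boundary dates are read in full.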




In [18]:
print("Records: ", df.count())

Records:  742890


In [13]:
df.show(5)

+-------------------+-------+-------+-------+-------+------+----+-----+---+--------+----------+-------------+------+---------+------+--------------------+-----+----------+
|           datetime|   open|    low|   high|  close|volume|year|month|day|    time|      date|base_currency|source|frequency|symbol|          audit_time| type| load_date|
+-------------------+-------+-------+-------+-------+------+----+-----+---+--------+----------+-------------+------+---------+------+--------------------+-----+----------+
|2024-09-01 19:45:00|0.90537| 0.9051|0.90543| 0.9051|   301|2024|   09| 01|19:45:00|2024-09-01|          USD|   FMP|    15min|USDEUR|2024-09-02 00:01:...|FOREX|2024-09-02|
|2024-09-01 19:30:00|0.90526|  0.905|0.90538|0.90534|   285|2024|   09| 01|19:30:00|2024-09-01|          USD|   FMP|    15min|USDEUR|2024-09-02 00:01:...|FOREX|2024-09-02|
|2024-09-01 19:15:00|0.90517|0.90517|0.90537|0.90529|   274|2024|   09| 01|19:15:00|2024-09-01|          USD|   FMP|    15min|USDEUR|2024-09

In [14]:
df.printSchema()

root
 |-- datetime: string (nullable = true)
 |-- open: double (nullable = true)
 |-- low: double (nullable = true)
 |-- high: double (nullable = true)
 |-- close: double (nullable = true)
 |-- volume: long (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- date: string (nullable = true)
 |-- base_currency: string (nullable = true)
 |-- source: string (nullable = true)
 |-- frequency: string (nullable = true)
 |-- symbol: string (nullable = true)
 |-- audit_time: timestamp (nullable = true)
 |-- type: string (nullable = true)
 |-- load_date: date (nullable = true)


## Remove duplicate rows

In [25]:
# Check for duplicate keys (disabled: remove the triple quotes to run)
"""
(
    df
    .groupBy('symbol','year','datetime','source','frequency','type')
    .count()
    .filter(F.col('count') > 1)
    .show()
)
"""

+------+----+--------+------+---------+----+-----+
|symbol|year|datetime|source|frequency|type|count|
+------+----+--------+------+---------+----+-----+
+------+----+--------+------+---------+----+-----+


In [24]:
df = (
    df
    .dropDuplicates(['symbol','year','datetime','source','frequency','type'])
)




In [26]:
print("After dropDuplicates: ", df.count())

After dropDuplicates:  730420
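
The two counts imply that 742890 - 730420 = 12470 rows were dropped. Rows count as duplicates when they agree on the listed key columns, even if other columns such as `audit_time` differ, and Spark does not promise which of the matching rows survives. The sketch below shows the idea in plain Python on made-up sample rows; `drop_duplicates` is a hypothetical helper that keeps the first occurrence, which is a simplification of Spark's behaviour.

```python
from typing import Dict, Iterable, List, Sequence


def drop_duplicates(rows: Iterable[Dict], keys: Sequence[str]) -> List[Dict]:
    """Keep one row per distinct combination of the key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:  # first occurrence wins in this sketch
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample rows: the first two share a key but differ in audit_time
sample = [
    {"symbol": "USDEUR", "datetime": "2024-09-01 19:45:00", "audit_time": "a"},
    {"symbol": "USDEUR", "datetime": "2024-09-01 19:45:00", "audit_time": "b"},
    {"symbol": "USDEUR", "datetime": "2024-09-01 19:30:00", "audit_time": "c"},
]
print(len(drop_duplicates(sample, ["symbol", "datetime"])))
```

If it matters which copy is kept (for example, the most recently loaded one), a window function ordered by `audit_time` would be a more explicit choice than `dropDuplicates`.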


## Checking correct values

In [None]:
"""
(
    df
    .groupBy('source','type','symbol','year','frequency')
    .count()
    #.filter(F.col('count') > 1)
    .show()
)
"""
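
The disabled check above groups rows and counts them per combination of columns, so an unexpected value (a misspelled symbol, an unknown source) shows up as a small extra group. A plain-Python sketch of the same idea, using a hypothetical helper `group_counts` and made-up rows:

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def group_counts(rows: Iterable[Dict], keys: Sequence[str]) -> Dict[Tuple, int]:
    """Count rows per distinct combination of the key columns."""
    return dict(Counter(tuple(row[k] for k in keys) for row in rows))


# Made-up rows: the lowercase symbol stands in for a data-quality problem
rows = [
    {"source": "FMP", "symbol": "USDEUR", "year": "2024"},
    {"source": "FMP", "symbol": "USDEUR", "year": "2024"},
    {"source": "FMP", "symbol": "usdeur", "year": "2024"},
]
for group, count in group_counts(rows, ["source", "symbol", "year"]).items():
    print(group, count)
```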

## Append the batch data to the Silver table

Set the destination Silver table

In [29]:
BUCKET_NAME = "cryptoengineer"
PREFIX_SILVER = "datalake/silver/forex"




In [30]:
path = f"s3://{BUCKET_NAME}/{PREFIX_SILVER}"
print("Path:", path)

Path: s3://cryptoengineer/datalake/silver/forex


In [31]:
(
    df
    .repartition("year")
    .write
    .format("parquet")
    .mode("append")
    .partitionBy(['symbol','year'])
    .save(path)
)
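
`partitionBy` writes Hive-style directories, one level per partition column in the order given, so the files for this batch land under `symbol=<value>/year=<value>/`. The sketch below only builds such a path as a string to show the layout; `partition_path` is a hypothetical helper and does not touch S3.

```python
def partition_path(base_path: str, **partitions: str) -> str:
    """Build a Hive-style partition directory: base/col1=val1/col2=val2."""
    parts = [f"{column}={value}" for column, value in partitions.items()]
    return "/".join([base_path.rstrip("/")] + parts)


print(partition_path("s3://cryptoengineer/datalake/silver/forex", symbol="USDEUR", year="2024"))
```

Because the write mode is `append`, each run adds new files to these directories. Re-running the notebook for an overlapping time frame would write the same rows again, so downstream readers may need their own de-duplication.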


