# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


#### Run this cell to set up and start your interactive session.


In [7]:
%iam_role arn:aws:iam::212430227630:role/LabRole
%region us-east-1
%number_of_workers 2

%idle_timeout 30
%glue_version 4.0
%worker_type G.1X

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.5 
Current iam_role is arn:aws:iam::212430227630:role/LabRole
iam_role has been set to arn:aws:iam::212430227630:role/LabRole.
Previous region: us-east-1
Setting new region to: us-east-1
Region is set to: us-east-1
Previous number of workers: None
Setting new number of workers to: 2
Current idle_timeout is None minutes.
idle_timeout has been set to 30 minutes.
Setting Glue version to: 4.0
Previous worker type: None
Setting new worker type to: G.1X


In [10]:
%extra_py_files s3://cryptoengineer/gluejobs-py-modules/load.py, s3://cryptoengineer/gluejobs-py-modules/storage.py
%additional_python_modules yfinance

Extra py files to be included:
s3://cryptoengineer/gluejobs-py-modules/load.py
s3://cryptoengineer/gluejobs-py-modules/storage.py
Additional python modules to be included:
yfinance


In [13]:
%load_ext autoreload
%autoreload 2

In [1]:
import sys
import boto3

from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 2
Idle Timeout: 30
Session ID: 3c4cf63f-6a07-4a5b-a920-ca77494c104d
Applying the following default arguments:
--glue_kernel_version 1.0.5
--enable-glue-datacatalog true
--extra-py-files s3://cryptoengineer/gluejobs-py-modules/load.py,s3://cryptoengineer/gluejobs-py-modules/storage.py
--additional-python-modules yfinance
Waiting for session 3c4cf63f-6a07-4a5b-a920-ca77494c104d to get into ready status...
Session 3c4cf63f-6a07-4a5b-a920-ca77494c104d has been created.



## HISTORICAL LOAD

### Load modules

In [2]:
from datetime import datetime, timedelta, timezone

import load




### Set AWS storage parameters

In [3]:
BUCKET_NAME = "cryptoengineer"
PREFIX = "datalake/bronze/commodities"




## Load job parameters

In [35]:
glue_client = boto3.client("glue")

if '--WORKFLOW_NAME' in sys.argv and '--WORKFLOW_RUN_ID' in sys.argv:
    print("Running in Glue Workflow")
    
    glue_args = getResolvedOptions(
        sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID']
    )
    
    print("Reading the workflow parameters")
    workflow_args = glue_client.get_workflow_run_properties(
        Name=glue_args['WORKFLOW_NAME'], RunId=glue_args['WORKFLOW_RUN_ID']
    )["RunProperties"]

    
    base = workflow_args['base']
    start_date = workflow_args['start_date']
    end_date = workflow_args['end_date']
    symbols = workflow_args['symbols']
    api_key = workflow_args['api_key']

else:
    try:
        args = getResolvedOptions(sys.argv,
                                  ['JOB_NAME',
                                   'base',
                                   'start_date',
                                   'end_date',
                                   'symbols',
                                   'api_key'])
        base = args['base']
        start_date = args['start_date']
        end_date = args['end_date']
        symbols = args['symbols']
        api_key = args['api_key']
        print("Running as Job")
    except Exception:
        # getResolvedOptions raises when the job arguments are missing,
        # which is the case in an interactive notebook session.
        print("Running as Notebook")
        base = 'USD'
        start_date = '2023-07-01'
        end_date = '2024-08-29'
        symbols = "GCUSD"
        api_key = ""


Running as Notebook
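
The cell above falls back in three steps: workflow run properties, then job arguments, then notebook defaults. The sketch below reproduces that order in plain Python so it can be tried without the Glue runtime. It is only an illustration, not how `getResolvedOptions` is implemented; `parse_job_args`, `resolve_params`, and `NOTEBOOK_DEFAULTS` are hypothetical names introduced here.

```python
from typing import Dict, List, Optional, Tuple

# Notebook defaults copied from the cell above.
NOTEBOOK_DEFAULTS = {
    "base": "USD",
    "start_date": "2023-07-01",
    "end_date": "2024-08-29",
    "symbols": "GCUSD",
    "api_key": "",
}


def parse_job_args(argv: List[str]) -> Dict[str, str]:
    """Collect `--name value` pairs from an argv-style list (hypothetical helper)."""
    parsed: Dict[str, str] = {}
    i = 0
    while i < len(argv) - 1:
        token = argv[i]
        if token.startswith("--"):
            parsed[token[2:]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return parsed


def resolve_params(
    argv: List[str], workflow_props: Optional[Dict[str, str]] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (mode, params): workflow properties first, then job args, then defaults."""
    names = list(NOTEBOOK_DEFAULTS)
    if workflow_props is not None and all(n in workflow_props for n in names):
        return "workflow", {n: workflow_props[n] for n in names}
    job_args = parse_job_args(argv)
    if all(n in job_args for n in names):
        return "job", {n: job_args[n] for n in names}
    return "notebook", dict(NOTEBOOK_DEFAULTS)


mode, params = resolve_params(["script.py"])
print(mode, params["symbols"])  # notebook GCUSD
```

Keeping the resolution in one function makes each mode easy to exercise with a hand-built argument list before deploying the job.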


In [None]:
print("base: ", base)
print("Start Date: ", start_date)
print("End Date: ", end_date)
print("Symbols: ", symbols)
# Do not echo the secret into notebook output or job logs
print("API Key: ", "***" if api_key else "(not set)")


## Load the historical rates at 15-minute frequency

In [38]:
df = load.load_historical_freq_rates(base=base,
                                      start_date=start_date,
                                      end_date=end_date,
                                      freq='15min',
                                      symbol=symbols,
                                      api_key=api_key,
                                      source='FMP'
)


Year:  2023  Month: 7
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  0
Year:  2023  Month: 8
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  0
Year:  2023  Month: 8
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  0
Year:  2023  Month: 9
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  460
Year:  2023  Month: 10
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  2508
Year:  2023  Month: 11
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  4203
Year:  2023  Month: 12
https://financialmodelingprep.com/api/v3/historical-chart/15min
Reading month
API read successful
Read  6017
Year:  2024  Month: 1
https://financialmodelingprep.com/api/v3/histor
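
The log above suggests that `load.load_historical_freq_rates` requests the data one calendar month at a time. The source of `load.py` is not shown here, so the following is only a sketch of how a date range could be split into monthly windows; `month_windows` is a hypothetical helper and not part of the `load` module.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def month_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (window_start, window_end) per calendar month, clipped to [start, end]."""
    current = start
    while current <= end:
        # First day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # The window ends on the last day of the month or on `end`, whichever is earlier.
        window_end = min(next_month - timedelta(days=1), end)
        yield current, window_end
        current = next_month


# Same range as the notebook defaults: 2023-07-01 to 2024-08-29.
windows = list(month_windows(date(2023, 7, 1), date(2024, 8, 29)))
print(len(windows))  # 14
print(windows[-1])   # (datetime.date(2024, 8, 1), datetime.date(2024, 8, 29))
```

Requesting bounded windows keeps each API response small and makes it possible to retry a single month if one call fails.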

In [39]:
print("Records: ", len(df))

Records:  21704


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21704 entries, 0 to 21703
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    21704 non-null  object 
 1   open    21704 non-null  float64
 2   low     21704 non-null  float64
 3   high    21704 non-null  float64
 4   close   21704 non-null  float64
 5   volume  21704 non-null  int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 1017.5+ KB


In [41]:
df.head(5)

                  date    open     low    high   close  volume
0  2023-09-29 16:45:00  1864.7  1864.6  1865.6  1864.6     755
1  2023-09-29 16:30:00  1864.8  1864.2  1866.1  1864.8    1095
2  2023-09-29 16:15:00  1864.4  1864.1  1865.2  1864.9     723
3  2023-09-29 16:00:00  1864.3  1863.8  1864.5  1864.3    1052
4  2023-09-29 15:45:00  1866.0  1864.2  1866.2  1864.4    1567
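
The preview shows that `date` arrives as strings (dtype `object`), ordered newest first, on a 15-minute grid. A quick check of those three properties might look like the sketch below. It uses only the standard library and the five sample values from the output above; `parse_bar_time` and `on_15min_grid` are hypothetical helpers.

```python
from datetime import datetime

# Sample values copied from the df.head(5) output above.
sample = [
    "2023-09-29 16:45:00",
    "2023-09-29 16:30:00",
    "2023-09-29 16:15:00",
    "2023-09-29 16:00:00",
    "2023-09-29 15:45:00",
]


def parse_bar_time(value: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into a naive datetime."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def on_15min_grid(ts: datetime) -> bool:
    """True when the timestamp falls exactly on a quarter-hour boundary."""
    return ts.minute % 15 == 0 and ts.second == 0


parsed = [parse_bar_time(v) for v in sample]
all_on_grid = all(on_15min_grid(ts) for ts in parsed)
newest_first = all(a > b for a, b in zip(parsed, parsed[1:]))
print(all_on_grid, newest_first)  # True True
```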


### Set the schema

In [42]:
freq = '15min'
symbol = symbols
source = 'FMP'




In [43]:
df = load.set_schema_table(df, symbol, source, freq, base)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21704 entries, 0 to 21703
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   datetime       21704 non-null  object             
 1   open           21704 non-null  float64            
 2   low            21704 non-null  float64            
 3   high           21704 non-null  float64            
 4   close          21704 non-null  float64            
 5   volume         21704 non-null  int64              
 6   year           21704 non-null  object             
 7   month          21704 non-null  object             
 8   day            21704 non-null  object             
 9   time           21704 non-null  object             
 10  date           21704 non-null  object             
 11  base_currency  21704 non-null  object             
 12  source         21704 non-null  object             
 13  frequency      21704 non-null  object         
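
The listing shows `year`, `month`, `day`, `time`, and `date` stored as string (`object`) columns next to `datetime`. How `load.set_schema_table` derives them is not shown, so the sketch below is an assumption: it splits one timestamp string into zero-padded string fields. `derive_date_fields` is a hypothetical helper.

```python
from datetime import datetime
from typing import Dict


def derive_date_fields(timestamp: str) -> Dict[str, str]:
    """Split a 'YYYY-MM-DD HH:MM:SS' string into string-typed date parts."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "datetime": timestamp,
        "year": ts.strftime("%Y"),
        "month": ts.strftime("%m"),   # zero-padded, e.g. "09" (an assumption)
        "day": ts.strftime("%d"),
        "time": ts.strftime("%H:%M:%S"),
        "date": ts.strftime("%Y-%m-%d"),
    }


row = derive_date_fields("2023-09-29 16:45:00")
print(row["year"], row["month"], row["day"], row["time"], row["date"])
# 2023 09 29 16:45:00 2023-09-29
```

Zero-padded strings sort correctly as text, which matters if any of these fields is later used as a partition key.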

## Save a backup copy as CSV

In [44]:
# Set the path to the S3 location
path = f"s3://{BUCKET_NAME}/historic_bck/GCUSD.csv"
# If we do not partition by symbol
print("Path:", path)

Path: s3://cryptoengineer/historic_bck/GCUSD.csv
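
The backup path above hard-codes `GCUSD`, so it would be wrong if the job ran with a different `symbols` value. One possible approach, assuming a single symbol per run, is to build the key from the parameter. `backup_csv_path` is a hypothetical helper, not part of the `load` or `storage` modules.

```python
def backup_csv_path(bucket: str, symbol: str, folder: str = "historic_bck") -> str:
    """Build the S3 URI of the CSV backup for one symbol."""
    # Reject empty symbols and anything that would create nested keys.
    if not symbol or "/" in symbol:
        raise ValueError(f"invalid symbol: {symbol!r}")
    return f"s3://{bucket}/{folder}/{symbol}.csv"


print(backup_csv_path("cryptoengineer", "GCUSD"))
# s3://cryptoengineer/historic_bck/GCUSD.csv
```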


In [45]:
df.to_csv(path, header=True, index=False)




## Save dataframe to raw in parquet format

In [46]:
# Set the path to the S3 location
path = f"s3://{BUCKET_NAME}/{PREFIX}"
# If we do not partition by symbol
print("Path:", path)

Path: s3://cryptoengineer/datalake/bronze/commodities


In [47]:
(
    spark.createDataFrame(df)
    .repartition("load_date")
    .write
    .format("parquet")
    .mode("append")
    .partitionBy(['load_date'])
    .save(path)
)
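
With `partitionBy(['load_date'])`, Spark writes each distinct `load_date` value into its own Hive-style `load_date=<value>/` directory under `path`. The sketch below only computes where such a partition would land; `partition_prefix` is a hypothetical helper, and the `load_date` value is illustrative because that column is not visible in the truncated schema listing.

```python
def partition_prefix(base_path: str, column: str, value: str) -> str:
    """Return the Hive-style directory for one partition value."""
    return f"{base_path.rstrip('/')}/{column}={value}/"


# Base path built from BUCKET_NAME and PREFIX as defined earlier in the notebook.
base_path = "s3://cryptoengineer/datalake/bronze/commodities"
# "2024-08-29" is an illustrative load_date, not a value taken from the data.
print(partition_prefix(base_path, "load_date", "2024-08-29"))
# s3://cryptoengineer/datalake/bronze/commodities/load_date=2024-08-29/
```

Because the write uses `mode("append")`, rerunning the cell with the same `load_date` adds more files to the same directory rather than replacing them, so a rerun can duplicate rows.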

