# HISTORICAL DATA LOAD
##### This notebook processes the .csv files that KRAKEN makes publicly available every quarter, consolidating the information generated on its platform.


####  Run this cell to set up and start your interactive session.


In [7]:
#%help

In [1]:
%region us-east-1
%number_of_workers 2
%idle_timeout 30
%worker_type G.1X
%glue_version 4.0

BUCKET = 'cryptoengineer'

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.5 
Previous region: us-east-1
Setting new region to: us-east-1
Region is set to: us-east-1
Previous number of workers: None
Setting new number of workers to: 2
Current idle_timeout is None minutes.
idle_timeout has been set to 30 minutes.
Previous worker type: None
Setting new worker type to: G.1X
Setting Glue version to: 4.0
Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 2
Idle Timeout: 30
Session ID: 497b675d-6627-4dcf-bbb0-62682d1ca4dc
Applying the following default arguments:
--glue_kernel_version 1.0.5
--enable-glue-datacatalog true
Waiting for session 497b675d-6627-4dcf-bbb0-6268

In [4]:
# Import the required libraries
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, IntegerType, TimestampType, DateType
from pyspark.sql.functions import col, from_unixtime, lit, regexp_replace, current_date, min as spark_min, max as spark_max
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
import os

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)




In [5]:
# Define a function that determines which files to process
def list_s3_files(bucket_name, folder_prefix):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=folder_prefix)
    
    files = []
    for page in page_iterator:
        if 'Contents' in page:
            for obj in page['Contents']:
                # Check if the object is a file and not a directory
                if not obj['Key'].endswith('/'):
                    # Extract the file name from the full S3 key
                    file_name = os.path.basename(obj['Key'])
                    files.append(file_name)
    return files
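
To check the filtering logic without calling AWS, the core of `list_s3_files` can be pulled into a pure function that takes pages shaped like `list_objects_v2` responses. The name `file_names_from_pages` and the sample pages are hypothetical, introduced only for illustration; this is a minimal sketch, not a replacement for the function above.

```python
import os


def file_names_from_pages(pages):
    """Return base file names from an iterable of list_objects_v2-style pages."""
    names = []
    for page in pages:
        # A page with no matching objects carries no 'Contents' key
        for obj in page.get('Contents', []):
            key = obj['Key']
            # Keys ending in '/' are folder placeholders, not files
            if not key.endswith('/'):
                names.append(os.path.basename(key))
    return names


# Hand-written pages that mimic the shape of the S3 response (made-up sample)
sample_pages = [
    {'Contents': [
        {'Key': 'datalake/historic_data/cryptos/'},
        {'Key': 'datalake/historic_data/cryptos/1INCHUSD_15.csv'},
    ]},
    {},  # a page without objects
    {'Contents': [{'Key': 'datalake/historic_data/cryptos/XBTUSDC_15.csv'}]},
]

print(file_names_from_pages(sample_pages))
```

Because only the base name is kept, two objects with the same file name under different sub-prefixes would be indistinguishable in the result; that is harmless here as long as all files sit directly under one prefix.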




In [6]:
# List the CSV files to process
files = list_s3_files(BUCKET, 'datalake/historic_data/cryptos')

print('Files to process: ')
for file in files:
    print(file)

Files to process: 
1INCHUSD_15.csv
XBTUSDC_15.csv


In [7]:
# Create the schema
historic_schema = StructType([
    StructField('TIMESTAMP', StringType(), True),
    StructField('OPEN', DoubleType(), True),
    StructField('HIGH', DoubleType(), True),
    StructField('LOW', DoubleType(), True),
    StructField('CLOSE', DoubleType(), True),
    StructField('VOLUME', DoubleType(), True),
    StructField('TRADES', IntegerType(), True),
    StructField('ORIGIN', StringType(), True),
    StructField('LOAD_DATE', DateType(), True),
    StructField('SYMBOL', StringType(), True),
    StructField('DATETIME', TimestampType(), True),
    StructField('YEAR', IntegerType(), True)
])

# Create the target DataFrame
historic_df = spark.createDataFrame([], historic_schema)
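
The schema lists twelve fields, yet the last five (ORIGIN through YEAR) are filled in later with `withColumn`, which suggests the raw files carry only the first seven columns. Assuming that layout, the following plain-Python sketch shows how one header-less line maps onto those seven columns. `RAW_COLUMNS` and `parse_raw_line` are hypothetical names, and the sample line is rebuilt from the 1INCHUSD row shown in the notebook output.

```python
import csv
import io

# Names and Python types of the columns assumed to be present in the raw CSV
RAW_COLUMNS = [
    ('TIMESTAMP', str),
    ('OPEN', float),
    ('HIGH', float),
    ('LOW', float),
    ('CLOSE', float),
    ('VOLUME', float),
    ('TRADES', int),
]


def parse_raw_line(line):
    """Parse one header-less CSV line into a dict keyed by column name."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(RAW_COLUMNS):
        raise ValueError(f'expected {len(RAW_COLUMNS)} fields, got {len(values)}')
    return {name: cast(value) for (name, cast), value in zip(RAW_COLUMNS, values)}


# Sample line rebuilt from the first 1INCHUSD row in the notebook output
row = parse_raw_line('1628609400,2.965,3.121,2.764,2.776,6725.61545935,33')
print(row)
```

The sketch rejects lines with an unexpected field count, whereas Spark's CSV reader in its default permissive mode fills missing trailing columns with nulls.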




In [8]:
# Iterate over all the files
for file in files:
    print('Processing file: ' + file)
    # Read the file
    file_df = (
        spark.read
        .format("csv")
        .schema(historic_schema)
        .option('header', 'false')
        .load('s3://' + BUCKET + '/datalake/historic_data/cryptos/' + file)

        # Basic transformations
        .withColumn('origin', lit('historic'))
        .withColumn('load_date', current_date())
        # Escape the dot and anchor the pattern so only the literal suffix is stripped
        .withColumn('symbol', regexp_replace(lit(file), r'_15\.csv$', ''))
        .withColumn('datetime', from_unixtime(col('timestamp')).cast('timestamp'))
        .withColumn('year', col('datetime').substr(1, 4).cast('int'))

    )
    # union() matches columns by position; unionAll() is its deprecated alias
    historic_df = historic_df.union(file_df)

Processing file: 1INCHUSD_15.csv
Processing file: XBTUSDC_15.csv
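
The per-row derivations in the loop can be checked outside Spark with a small plain-Python sketch. `derive_fields` is a hypothetical helper, and the UTC interpretation of the epoch is an assumption about the Spark session time zone. In a regular expression an unescaped `.` matches any character, so the pattern here escapes it and anchors it to the end of the name.

```python
import re
from datetime import datetime, timezone


def derive_fields(file_name, epoch_seconds):
    """Mirror the symbol, datetime and year derivations for a single row."""
    # Remove only a literal '_15.csv' suffix
    symbol = re.sub(r'_15\.csv$', '', file_name)
    # Interpret the epoch in UTC (assumption: the Spark session time zone is UTC)
    dt = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return {'symbol': symbol, 'datetime': dt, 'year': dt.year}


fields = derive_fields('1INCHUSD_15.csv', '1628609400')
print(fields['symbol'], fields['datetime'].isoformat(), fields['year'])
```

For the sample inputs this yields the symbol `1INCHUSD`, the instant 2021-08-10 15:30:00 UTC, and the year 2021, which agrees with the first row displayed by `historic_df.show(5)`.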


In [9]:
historic_df.printSchema()

root
 |-- TIMESTAMP: string (nullable = true)
 |-- OPEN: double (nullable = true)
 |-- HIGH: double (nullable = true)
 |-- LOW: double (nullable = true)
 |-- CLOSE: double (nullable = true)
 |-- VOLUME: double (nullable = true)
 |-- TRADES: integer (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- LOAD_DATE: date (nullable = true)
 |-- SYMBOL: string (nullable = true)
 |-- DATETIME: timestamp (nullable = true)
 |-- YEAR: integer (nullable = true)


In [10]:
historic_df.show(5)

+----------+-----+-----+-----+-----+-------------+------+--------+----------+--------+-------------------+----+
| TIMESTAMP| OPEN| HIGH|  LOW|CLOSE|       VOLUME|TRADES|  ORIGIN| LOAD_DATE|  SYMBOL|           DATETIME|YEAR|
+----------+-----+-----+-----+-----+-------------+------+--------+----------+--------+-------------------+----+
|1628609400|2.965|3.121|2.764|2.776|6725.61545935|    33|historic|2024-08-23|1INCHUSD|2021-08-10 15:30:00|2021|
|1628610300|2.793|  2.8| 2.77| 2.77|1306.77452192|    13|historic|2024-08-23|1INCHUSD|2021-08-10 15:45:00|2021|
|1628611200|2.756|2.833|2.756| 2.76|4855.65736797|    17|historic|2024-08-23|1INCHUSD|2021-08-10 16:00:00|2021|
|1628612100|2.748|2.754|2.748|2.754|      29.6403|     3|historic|2024-08-23|1INCHUSD|2021-08-10 16:15:00|2021|
|1628613000| 2.69|2.708| 2.69|2.699|    276.09198|     4|historic|2024-08-23|1INCHUSD|2021-08-10 16:30:00|2021|
+----------+-----+-----+-----+-----+-------------+------+--------+----------+--------+-------------------+----+

In [6]:
# Persist the data
(
    historic_df
        .write
        .format('parquet')
        #.partitionBy('symbol','year')
        .partitionBy('LOAD_DATE')
        .mode('append')
        .save('s3://' + BUCKET + '/datalake/bronze/cryptos')
)

print('Data saved to s3://' + BUCKET + '/datalake/bronze/cryptos')

Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 2
Idle Timeout: 30
Session ID: 497b675d-6627-4dcf-bbb0-62682d1ca4dc
Applying the following default arguments:
--glue_kernel_version 1.0.5
--enable-glue-datacatalog true


Following exception encountered while creating session: An error occurred (AlreadyExistsException) when calling the CreateSession operation: Session already created, sessionId=497b675d-6627-4dcf-bbb0-62682d1ca4dc 

Error message: Session already created, sessionId=497b675d-6627-4dcf-bbb0-62682d1ca4dc 

Traceback (most recent call last):
  File "/home/jupyter-user/.local/lib/python3.9/site-packages/aws_glue_interactive_sessions_kernel/glue_kernel_utils/KernelGateway.py", line 100, in create_session
    response = self.glue_client.create_session(
  File "/home/jupyter-user/.local/lib/python3.9/site-packages/botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/jupyter-user/.local/lib/python3.9/site-packages/botocore/client.py", line 1009, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.AlreadyExistsException: An error occurred (AlreadyExistsException) when calling the CreateSession o