### 🧱 Step 1: Initialize SparkSession

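We start by creating the Spark session that processes the crypto data. The builder in this cell sets no Delta Lake options, so Delta support has to come from the environment (for example Microsoft Fabric) or from a local Delta Lake setup.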


In [9]:
# Initialize Spark with cleaner logging
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Transform and Save Delta") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")  # 👈 Hide warnings like SparkUI port binding, etc.
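For a local run, one way to get a Delta-capable session is the `delta-spark` pip package and its `configure_spark_with_delta_pip` helper. The sketch below assumes `delta-spark` is installed in a version matching the local PySpark; the names `DELTA_SESSION_CONF` and `build_delta_session` are illustrative and not part of this project.

```python
# Settings that Delta Lake documents for enabling it on a plain Spark session.
DELTA_SESSION_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="Transform and Save Delta"):
    """Return a SparkSession with the Delta extension and catalog configured.

    Assumes the `delta-spark` package is installed (an assumption, not
    something this notebook sets up).
    """
    # Imported lazily so the cell loads even where pyspark/delta-spark are missing.
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_SESSION_CONF.items():
        builder = builder.config(key, value)

    # The helper adds the matching Delta jars to spark.jars.packages.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

Because `getOrCreate()` reuses a running session, these settings only take effect if `build_delta_session()` is called before any other session exists in the kernel; otherwise restart the kernel first.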


### 🧬 Step 2: Convert Pandas to Spark DataFrame

We fetch Bitcoin/USD market data as a pandas DataFrame with the `get_market_data` helper from `../src/api_utils`, then convert it into a distributed Spark DataFrame for further processing.


In [10]:
import sys
sys.path.append("../src")

from api_utils import get_market_data

btc_df = get_market_data("bitcoin", "usd", 30)

# Convert to Spark
btc_spark_df = spark.createDataFrame(btc_df)

btc_spark_df.printSchema()
btc_spark_df.show(5)


root
 |-- timestamp: timestamp (nullable = true)
 |-- price: double (nullable = true)

+--------------------+-----------------+
|           timestamp|            price|
+--------------------+-----------------+
|2025-04-10 19:05:...|79529.67606306932|
|2025-04-10 20:09:...|79724.89465979146|
|2025-04-10 21:08:...|79878.62357184727|
|2025-04-10 22:05:...|79709.36877328751|
|2025-04-10 23:04:...|  79715.508089801|
+--------------------+-----------------+
only showing top 5 rows
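The conversion above relies on Spark inferring the schema from pandas dtypes. A minimal sketch of an alternative is to pass an explicit schema string to `createDataFrame`, so that a change in the upstream data is caught at conversion time. The helper names (`EXPECTED_COLUMNS`, `schema_string`, `missing_columns`, `to_spark`) are hypothetical; the column names and types are taken from the schema printed above.

```python
# Column name -> Spark SQL type, matching the schema printed above.
EXPECTED_COLUMNS = {"timestamp": "timestamp", "price": "double"}


def schema_string(columns=EXPECTED_COLUMNS):
    """Build a datatype string such as 'timestamp: timestamp, price: double'."""
    return ", ".join(f"{name}: {dtype}" for name, dtype in columns.items())


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`."""
    return [name for name in expected if name not in actual]


def to_spark(spark, pandas_df):
    """Convert a pandas DataFrame to Spark using the explicit schema."""
    missing = missing_columns(list(pandas_df.columns))
    if missing:
        raise ValueError(f"pandas DataFrame is missing columns: {missing}")
    # Select only the expected columns, in schema order, before converting.
    return spark.createDataFrame(pandas_df[list(EXPECTED_COLUMNS)], schema=schema_string())
```

With this in place, `btc_spark_df = to_spark(spark, btc_df)` would replace the plain `spark.createDataFrame(btc_df)` call.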



### 🧹 Step 3: Data Cleaning and Type Casting

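We cast each column explicitly so the types are fixed for analysis and storage in Delta format. The schema inferred in Step 2 is already `timestamp` and `double`, so the casts leave the output unchanged.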

In [11]:
from pyspark.sql.functions import col

btc_spark_df_clean = btc_spark_df.select(
    col("timestamp").cast("timestamp"),
    col("price").cast("double")
)

btc_spark_df_clean.printSchema()
btc_spark_df_clean.show(5)


root
 |-- timestamp: timestamp (nullable = true)
 |-- price: double (nullable = true)

+--------------------+-----------------+
|           timestamp|            price|
+--------------------+-----------------+
|2025-04-10 19:05:...|79529.67606306932|
|2025-04-10 20:09:...|79724.89465979146|
|2025-04-10 21:08:...|79878.62357184727|
|2025-04-10 22:05:...|79709.36877328751|
|2025-04-10 23:04:...|  79715.508089801|
+--------------------+-----------------+
only showing top 5 rows
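The cell above only casts types. If actual cleaning is wanted, a minimal sketch could look like the following. The rules themselves (drop rows with nulls, keep one row per timestamp, require a positive price) are assumptions for illustration, and `clean_prices` is a hypothetical helper, not part of this project.

```python
def clean_prices(df):
    """Drop incomplete, duplicate, and non-positive price rows from a Spark DataFrame."""
    return (
        df.dropna(subset=["timestamp", "price"])  # remove rows missing either field
        .dropDuplicates(["timestamp"])            # keep one row per timestamp
        .filter("price > 0")                      # SQL-expression filter on price
    )
```

It would be applied as `btc_spark_df_clean = clean_prices(btc_spark_df_clean)`. Note that `dropDuplicates` does not guarantee which of several rows with the same timestamp is kept.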



### 💾 Step 4: Save Cleaned Data to Delta Lake

Finally, we write the transformed DataFrame in Delta format to the relative path `Tables/bitcoin`, which in Microsoft Fabric resolves to the `Tables` area of the notebook's default lakehouse.
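Outside Fabric the Delta write can fail when the `delta` data source is not installed. One option, sketched here under assumptions, is to fall back to plain Parquet so the rest of the pipeline still produces output. `write_with_fallback` and the Parquet path `Files/bitcoin_parquet` are hypothetical; only `Tables/bitcoin` comes from this notebook.

```python
def write_with_fallback(df, delta_path="Tables/bitcoin", parquet_path="Files/bitcoin_parquet"):
    """Try a Delta write; on any failure, write Parquet instead.

    Returns the name of the format that was actually written.
    """
    try:
        df.write.format("delta").mode("overwrite").save(delta_path)
        return "delta"
    except Exception as exc:
        # Keep only the first line of the error; Spark stack traces are long.
        message = str(exc)
        first_line = message.splitlines()[0] if message else type(exc).__name__
        print(f"Delta write failed ({first_line}); falling back to Parquet.")
        df.write.format("parquet").mode("overwrite").save(parquet_path)
        return "parquet"
```

Parquet lacks Delta's transaction log and time travel, so this is a stopgap for local experiments rather than an equivalent.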


In [12]:
# 🪣 Try saving to Delta (only works on Fabric or local Delta setup)
try:
    btc_spark_df_clean.write \
        .format("delta") \
        .mode("overwrite") \
        .save("Tables/bitcoin")

    print("✅ Data written to Delta successfully.")

except Exception as e:
    print("⚠️ Could not write to Delta format. This step requires Microsoft Fabric or Delta Lake setup.\n")
    print("Error details:\n", e)

⚠️ Could not write to Delta format. This step requires Microsoft Fabric or Delta Lake setup.

Error details:
 An error occurred while calling o134.save.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: delta. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:725)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697)
	at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:873)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:260)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:243)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.intern