# 📥 Notebook 01: Ingest Crypto Data from CoinGecko API

This notebook ingests historical market data for selected cryptocurrencies (Bitcoin in the cells below) using the public CoinGecko API. It loads the result into a Spark DataFrame, ready for further transformation and Delta Lake storage.


## 🧪 Step 1: Initialize the Spark Session

This step creates the PySpark session used by the rest of the notebook. The Bitcoin price data itself is fetched in Step 2.


In [5]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("Crypto Ingestion") \
    .getOrCreate()


## 🔌 Step 2: Import API Utility Function

We use the custom helper `get_market_data` from `src/api_utils.py` to retrieve 30 days of historical Bitcoin prices in USD. The helper returns a pandas DataFrame with `timestamp` and `price` columns.


In [6]:
import sys
sys.path.append("../src")

from api_utils import get_market_data

# Get Bitcoin data (returns Pandas DataFrame)
btc_df = get_market_data("bitcoin", "usd", 30)
btc_df.head()


|   | timestamp | price |
|---|---|---|
| 0 | 2025-04-09 21:05:41.946 | 83004.702761 |
| 1 | 2025-04-09 22:04:42.421 | 83315.439591 |
| 2 | 2025-04-09 23:04:59.794 | 83058.347759 |
| 3 | 2025-04-10 00:04:50.148 | 82595.42668 |
| 4 | 2025-04-10 01:04:28.190 | 82182.124707 |
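
The source of `get_market_data` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might do. It assumes the helper calls CoinGecko's `/coins/{id}/market_chart` endpoint, whose JSON response contains a `prices` list of `[epoch_milliseconds, price]` pairs. The names `build_market_chart_url`, `parse_prices`, and `fetch_market_rows` are hypothetical and are not part of `src/api_utils.py`. The sketch uses only the standard library and returns plain row dictionaries rather than a pandas DataFrame.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public CoinGecko v3 base URL; the real helper may differ.
BASE_URL = "https://api.coingecko.com/api/v3"


def build_market_chart_url(coin_id, vs_currency, days):
    """Build the URL for the /coins/{id}/market_chart endpoint."""
    query = urlencode({"vs_currency": vs_currency, "days": days})
    return f"{BASE_URL}/coins/{coin_id}/market_chart?{query}"


def parse_prices(payload):
    """Turn the 'prices' list of [epoch_ms, price] pairs into row dicts."""
    rows = []
    for epoch_ms, price in payload["prices"]:
        # CoinGecko timestamps are milliseconds since the Unix epoch.
        ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
        rows.append({"timestamp": ts, "price": float(price)})
    return rows


def fetch_market_rows(coin_id, vs_currency, days, timeout=10):
    """Hypothetical stand-in for get_market_data; performs the HTTP call."""
    url = build_market_chart_url(coin_id, vs_currency, days)
    with urlopen(url, timeout=timeout) as response:
        return parse_prices(json.load(response))


# Offline example: a hand-written payload in the assumed response shape.
sample = {"prices": [[1744232741946, 83004.702761],
                     [1744236282421, 83315.439591]]}
rows = parse_prices(sample)
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with the same two columns as `btc_df`. The timestamps in this sketch are timezone-aware (UTC), whereas the output above shows naive timestamps, so the real helper may handle time zones differently.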


## 🔄 Step 3: Convert to Spark DataFrame

Now we convert the Pandas DataFrame to a Spark DataFrame for distributed processing.


In [7]:
# Convert to Spark DataFrame
btc_spark_df = spark.createDataFrame(btc_df)

# Display first 5 rows
btc_spark_df.show(5)

# Optional: Inspect schema
btc_spark_df.printSchema()


+--------------------+-----------------+
|           timestamp|            price|
+--------------------+-----------------+
|2025-04-09 21:05:...| 83004.7027605787|
|2025-04-09 22:04:...|83315.43959137528|
|2025-04-09 23:04:...|83058.34775924393|
|2025-04-10 00:04:...|82595.42668033838|
|2025-04-10 01:04:...|82182.12470713933|
+--------------------+-----------------+
only showing top 5 rows

root
 |-- timestamp: timestamp (nullable = true)
 |-- price: double (nullable = true)

