# Vehicle Maintenance Data Pipeline Setup
## Environment Configuration and Storage Setup

This notebook performs the initial setup for the data pipeline:
1. Initialize a Spark session and pull in the Delta Lake dependency
2. Read the GCS credentials from Databricks secrets
3. Mount the cloud storage bucket (GCS)
4. Create the Delta Lake database structure

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # Namespaced to avoid shadowing builtins such as sum, min, max
from delta.tables import DeltaTable
import os
import json

# Initialize Spark session with Delta Lake support
spark = SparkSession.builder \
    .appName("VehicleMaintenanceETL") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
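
On Databricks a `spark` session already exists and Delta Lake ships with the runtime, so the builder settings above mainly matter for local runs. The following is a minimal sketch of separating the two cases, assuming the `DATABRICKS_RUNTIME_VERSION` environment variable is set on Databricks clusters. The helper names `on_databricks` and `delta_builder_options` are hypothetical and not part of the original notebook.

```python
import os


def on_databricks(env=None):
    """Return True when the Databricks runtime marker variable is set."""
    env = os.environ if env is None else env
    return bool(env.get("DATABRICKS_RUNTIME_VERSION"))


def delta_builder_options(env=None):
    """Return the extra SparkSession configs needed to enable Delta Lake.

    On Databricks nothing is added. Elsewhere, the same three settings
    used in the cell above are returned.
    """
    if on_databricks(env):
        return {}
    return {
        "spark.jars.packages": "io.delta:delta-core_2.12:1.0.0",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }
```

Each returned pair can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.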

## Cloud Storage Configuration

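Mount the GCS bucket using service account credentials. Before running this cell, make sure you have:
1. Created a GCS bucket for the project
2. Set up a service account with appropriate permissions on that bucket
3. Downloaded the service account key JSON file and stored it in the Databricks secret scope `vehicle-maintenance` under the key `gcs-service-account`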
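
Before mounting, it can help to confirm that the stored key really is a service account key. The sketch below assumes the secret holds the full contents of the key JSON file. `validate_service_account_key` is a hypothetical helper, and the field list reflects the fields normally present in a Google Cloud service account key file.

```python
import json

# Fields expected in a Google Cloud service account key file
REQUIRED_KEY_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
)


def validate_service_account_key(raw_json):
    """Parse a service account key and return it as a dict.

    Raises ValueError if the text is not valid JSON, is not a JSON object,
    lacks a required field, or is not of type "service_account".
    """
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Key is not valid JSON: {exc}") from exc
    if not isinstance(key, dict):
        raise ValueError("Key must be a JSON object")
    missing = [field for field in REQUIRED_KEY_FIELDS if not key.get(field)]
    if missing:
        raise ValueError(f"Key is missing fields: {', '.join(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']}")
    return key
```

In the notebook this would be called on the value returned by `dbutils.secrets.get(...)`, so a misconfigured secret fails early with a readable message instead of a mount error.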

In [None]:
# GCS configuration
storage_bucket = "vehicle-maintenance-data"  # Replace with your bucket name
mount_point = "/mnt/vehicle-data"

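# Read the service account key from Databricks secrets.
# Assumption: the secret holds the full contents of the key JSON file. The
# "json.keyfile" property expects a file path, so the fields are passed individually.
service_account = json.loads(
    dbutils.secrets.get(scope="vehicle-maintenance", key="gcs-service-account")
)

# Mount GCS bucket (property names follow the GCS Hadoop connector; verify
# them against your Databricks Runtime version)
try:
    dbutils.fs.mount(
        source=f"gs://{storage_bucket}",
        mount_point=mount_point,
        extra_configs={
            "google.cloud.auth.service.account.enable": "true",
            "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
            "fs.gs.project.id": service_account["project_id"],
            "fs.gs.auth.service.account.email": service_account["client_email"],
            "fs.gs.auth.service.account.private.key.id": service_account["private_key_id"],
            "fs.gs.auth.service.account.private.key": service_account["private_key"],
        }
    )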
    print(f"Successfully mounted {storage_bucket} to {mount_point}")
except Exception as e:
    if "already mounted" in str(e):
        print(f"Bucket {storage_bucket} already mounted to {mount_point}")
    else:
        raise  # Re-raise unexpected errors unchanged
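
Matching on the error message works, but it depends on the exact wording of the exception. A sketch of an alternative is to inspect the existing mounts first. It assumes each entry returned by `dbutils.fs.mounts()` exposes `mountPoint` and `source` attributes. `find_mount` and `needs_mount` are hypothetical helpers, and the `MountInfo` namedtuple is only a stand-in so the example runs outside Databricks.

```python
from collections import namedtuple


def find_mount(mounts, mount_point):
    """Return the source URI mounted at mount_point, or None if not mounted."""
    for mount in mounts:
        if mount.mountPoint == mount_point:
            return mount.source
    return None


def needs_mount(mounts, mount_point, expected_source):
    """Return True if mount_point is free, False if it already maps to expected_source.

    Raises RuntimeError if the mount point is taken by a different source.
    """
    existing = find_mount(mounts, mount_point)
    if existing is None:
        return True
    if existing != expected_source:
        raise RuntimeError(
            f"{mount_point} is already mounted from {existing}, expected {expected_source}"
        )
    return False


# Stand-in for dbutils.fs.mounts() (assumption: real entries have the same attributes)
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])
sample_mounts = [MountInfo("/mnt/vehicle-data", "gs://vehicle-maintenance-data")]
print(needs_mount(sample_mounts, "/mnt/vehicle-data", "gs://vehicle-maintenance-data"))
```

In the notebook, `needs_mount(dbutils.fs.mounts(), mount_point, f"gs://{storage_bucket}")` would decide whether to call `dbutils.fs.mount` at all, and a mount point that is already used by another bucket is reported rather than silently accepted.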

## Create Delta Lake Database Structure

Initialize the database and create the folder structure for our medallion architecture:
- Bronze: Raw data ingestion
- Silver: Cleaned and validated data
- Gold: Business-ready aggregated data

In [None]:
# Create database if not exists
spark.sql("CREATE DATABASE IF NOT EXISTS vehicle_maintenance")
spark.sql("USE vehicle_maintenance")

# Create delta table locations
delta_base_path = f"{mount_point}/delta"
bronze_path = f"{delta_base_path}/bronze"
silver_path = f"{delta_base_path}/silver"
gold_path = f"{delta_base_path}/gold"

# Create directories if they don't exist (mkdirs succeeds for existing paths too)
for path in [bronze_path, silver_path, gold_path]:
    dbutils.fs.mkdirs(path)
    print(f"Ensured directory exists: {path}")

# Create example bronze table schema
vehicle_maintenance_schema = """
    vehicle_id STRING,
    maintenance_date TIMESTAMP,
    service_type STRING,
    mileage LONG,
    cost DOUBLE,
    technician STRING,
    notes STRING,
    parts_used ARRAY<STRING>,
    ingestion_timestamp TIMESTAMP
"""

# Create bronze table
spark.sql(f"""
CREATE TABLE IF NOT EXISTS vehicle_maintenance.bronze_maintenance (
    {vehicle_maintenance_schema}
)
USING DELTA
LOCATION '{bronze_path}/maintenance'
""")

print("Database and table structure created successfully")
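
The same `CREATE TABLE` pattern will likely repeat for the silver and gold layers. As a sketch, the statement can be generated by a small helper. `build_create_table_sql` and the table name `silver_maintenance` are hypothetical and not defined elsewhere in this notebook.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_create_table_sql(database, table, schema, location):
    """Build a CREATE TABLE ... USING DELTA statement for an external table.

    database and table must be plain identifiers, and location must not
    contain a single quote, because the values are interpolated directly.
    """
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Invalid identifier: {name!r}")
    if "'" in location:
        raise ValueError("Location must not contain a single quote")
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} (\n"
        f"    {schema.strip()}\n"
        ")\n"
        "USING DELTA\n"
        f"LOCATION '{location}'"
    )


# Example: a statement for a hypothetical silver table
example_sql = build_create_table_sql(
    "vehicle_maintenance",
    "silver_maintenance",
    "vehicle_id STRING,\n    maintenance_date TIMESTAMP,\n    cost DOUBLE",
    "/mnt/vehicle-data/delta/silver/maintenance",
)
print(example_sql)
```

The resulting string can be passed to `spark.sql(...)` in the same way as the bronze statement above.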