# Lakehouse Initialization

This notebook initializes the Databricks Lakehouse. It:
- Selects the `workspace` catalog
- Creates the bronze, silver, and gold schemas
- Creates a volume for raw CSV source files
- Copies the CSV files from the cloned repository into that volume

## Select Catalog

In [0]:
%sql
USE CATALOG workspace;

## Create Lakehouse Schemas

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS bronze
COMMENT 'Bronze layer: raw ingested data';

CREATE SCHEMA IF NOT EXISTS silver
COMMENT 'Silver layer: cleaned and transformed data';

CREATE SCHEMA IF NOT EXISTS gold
COMMENT 'Gold layer: business-ready data';
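
If the layer list needs to be reused from Python, the same three statements can be generated from a mapping. This is a minimal sketch rather than part of the notebook: `LAYERS` and `schema_statements` are hypothetical names, and the `spark.sql` call is left commented out because `spark` only exists inside a Databricks session.

```python
# Layer names and comments are copied from the SQL cell above.
LAYERS = {
    "bronze": "Bronze layer: raw ingested data",
    "silver": "Silver layer: cleaned and transformed data",
    "gold": "Gold layer: business-ready data",
}


def schema_statements(layers):
    """Return one CREATE SCHEMA statement per (name, comment) pair."""
    return [
        f"CREATE SCHEMA IF NOT EXISTS {name} COMMENT '{comment}'"
        for name, comment in layers.items()
    ]


statements = schema_statements(LAYERS)
for statement in statements:
    print(statement)
    # In a Databricks notebook, where `spark` is predefined (assumption):
    # spark.sql(statement)
```

Keeping `IF NOT EXISTS` in every statement means the cell can be re-run safely, just like the SQL version.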

## Create Volume

In [0]:
%sql
CREATE VOLUME IF NOT EXISTS workspace.bronze.raw_sources
COMMENT 'Volume for raw source files (CSV)';
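
Files in a Unity Catalog volume are addressed through a path of the form `/Volumes/<catalog>/<schema>/<volume>/...`, which is how the three-part name `workspace.bronze.raw_sources` maps to the directory used later in this notebook. The helper below is a small illustration of that mapping; `volume_path` is a hypothetical name, not a Databricks API.

```python
import posixpath


def volume_path(catalog, schema, volume, *parts):
    """Build the /Volumes/<catalog>/<schema>/<volume>/... path for a volume."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# The volume created above, and one sub-directory inside it.
volume_base = volume_path("workspace", "bronze", "raw_sources")
crm_dir = volume_path("workspace", "bronze", "raw_sources", "source_crm")

print(volume_base)
print(crm_dir)
```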

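## Move Data from GitHub to Volume

Since this project uses **Unity Catalog Volumes** for data ingestion, the CSV files must be copied from the cloned GitHub repository into the designated Volume.

The script below locates the repository's `datasets` folder by walking up from the current working directory, then copies every `.csv` file in `source_crm` and `source_erp` into matching sub-directories of the Volume, so the project works "out of the box" for anyone who clones it.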

In [0]:
%python
import os
import shutil

def find_datasets_path():
    """
    Walk up from the current working directory and return the first
    'datasets' directory found, or None if there is none.
    """
    curr = os.getcwd()
    while True:
        possible_path = os.path.join(curr, "datasets")
        if os.path.isdir(possible_path):
            return possible_path
        parent = os.path.dirname(curr)
        if parent == curr:  # reached the filesystem root
            return None
        curr = parent

# Identify source and destination
datasets_local_path = find_datasets_path()
volume_base = "/Volumes/workspace/bronze/raw_sources"

if datasets_local_path:
    try:
        # Iterate through the source folders and copy their CSV files
        for folder in ["source_crm", "source_erp"]:
            src_dir = os.path.join(datasets_local_path, folder)
            dest_dir = os.path.join(volume_base, folder)

            # Ensure the destination sub-directory exists in the Volume
            os.makedirs(dest_dir, exist_ok=True)
            
            print(f"Syncing folder: {folder}...")
            for file_name in os.listdir(src_dir):
                if file_name.endswith(".csv"):
                    src_file = os.path.join(src_dir, file_name)
                    dest_file = os.path.join(dest_dir, file_name)
                    
                    # Using shutil.copy to bypass WorkspaceLocalFileSystem restrictions
                    shutil.copy(src_file, dest_file)
                    print(f"  ✅ Copied: {file_name}")
        
        print(f"\nSuccess! All datasets have been copied to: {volume_base}")
        
    except Exception as e:
        print(f"❌ Error during data seeding: {e}")
else:
    print("❌ Critical Error: 'datasets' folder not found in the repository path!")
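
Because the script above only prints errors, a follow-up check can confirm that every CSV actually arrived. The sketch below compares file names between a source and a destination directory; `missing_csvs` is a hypothetical helper, and the demo uses a temporary directory with placeholder file names so that it runs anywhere, not only on Databricks.

```python
import os
import tempfile


def missing_csvs(src_dir, dest_dir):
    """Return sorted names of .csv files in src_dir that are absent from dest_dir."""
    src = {name for name in os.listdir(src_dir) if name.endswith(".csv")}
    dest = set(os.listdir(dest_dir)) if os.path.isdir(dest_dir) else set()
    return sorted(src - dest)


# Demo with placeholder files: two CSVs in the source, only one copied.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    dest = os.path.join(tmp, "dest")
    os.makedirs(src)
    os.makedirs(dest)
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(src, name), "w").close()
    open(os.path.join(dest, "a.csv"), "w").close()

    demo_missing = missing_csvs(src, dest)
    print(demo_missing)
```

In the notebook, the same function could be called once per folder, for example with `os.path.join(datasets_local_path, "source_crm")` and `os.path.join(volume_base, "source_crm")`; an empty list means nothing is missing.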