# Running Instructions

## Data Location
All project data is stored in the **data submissions container** under `/FileStore/tables/paris_project/deployment/`.

## Quick Start - Main Experiment Only
To run the main experiment with pre-processed data:
1. Download only the **Diamond layer** data from `deployment/diamond/`
2. Download the `utils.py` file to your Databricks workspace directory
3. Run the notebook: **`5. Local Model Training and Evaluation`**

## Full Replication - Complete Pipeline
To replicate the entire data processing pipeline:
1. Download the complete **deployment directory** (includes Bronze, Silver, Gold, and Diamond layers)
2. Download the `utils.py` file to your Databricks workspace directory
3. Run notebooks **sequentially in order 1-5**:
   * Notebook 1: Data Collection and Integration
   * Notebook 2: Data Cleaning and Feature Selection
   * Notebook 3: Paris-Specific Features
   * Notebook 4: Training the Global Model
   * Notebook 5: Main Experiment - Local Model Training and Evaluation

Running the full pipeline writes the generated data files under `/FileStore/tables/paris_project/` and leaves the original deployment files untouched. To run the experiment on the files you generated, change the load path at the very beginning of Notebook 5.
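
Switching the load path for Notebook 5 could look roughly like the sketch below. `DEPLOYMENT_BASE`, `GENERATED_BASE`, and `diamond_paths` are hypothetical names, not identifiers taken from the notebook, and the sketch assumes the generated output keeps the same `diamond/` sub-directory and file names as the deployment copy.

```python
# Hypothetical names: none of these identifiers come from Notebook 5 itself.
DEPLOYMENT_BASE = "/FileStore/tables/paris_project/deployment"  # untouched originals
GENERATED_BASE = "/FileStore/tables/paris_project"  # pipeline output (layout assumed)


def diamond_paths(base):
    """Return the Diamond-layer train/test paths under the given base directory."""
    base = base.rstrip("/")
    return {
        "train": f"{base}/diamond/local_train_with_global_pred_v7.parquet",
        "test": f"{base}/diamond/test_set_with_global_pred_v7.parquet",
    }


# Point this at GENERATED_BASE to use the files produced by your own pipeline run.
paths = diamond_paths(DEPLOYMENT_BASE)
```

Keeping the base directory in one variable means the switch is a one-line change rather than an edit to every read call.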

## Important Notes
* The `utils.py` file must be in the same directory as the experiment notebooks
* For quick experimentation, use the Diamond layer data with Notebook 5
* For full reproducibility, run the complete pipeline (Notebooks 1-5)
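
One way to confirm the first note before running a notebook is to check that `utils` can be imported. This is only a sketch: it assumes the notebook's directory is on the Python import path, and `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name is found on the import path."""
    return importlib.util.find_spec(name) is not None


if not module_available("utils"):
    print("utils.py not found: place it next to the experiment notebooks.")
```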

-----
Enjoy Paris! Bon voyage 🗼

# Data Flow

In [0]:
def get_parquet_file_count_and_size(path_str):
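    """
    Inspects the storage layout of a given path and returns a tuple
    (parquet file count, total size in bytes) for the parquet files
    located directly under it. Returns (0, 0) if no parquet files are
    found or the path cannot be listed.
    """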
    try:
        files = dbutils.fs.ls(path_str)
        parquet_files = [f for f in files if f.name.endswith(".parquet")]
        file_count = len(parquet_files)
        
        if file_count == 0:
            print(f"Path: {path_str} -> No parquet files found directly (might be nested directories).")
            return 0,0
        
        file_sizes = [f.size for f in parquet_files] 
        total_size_bytes = sum(file_sizes)
        
        return file_count, total_size_bytes

    except Exception as e:
        print(f"Error accessing {path_str}: {e}")
        return 0,0
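
The function above only inspects files directly under `path_str`, as its "nested directories" message hints. A minimal sketch of a recursive variant follows. It rests on several assumptions: the listing function is passed in as `ls` (on Databricks this would presumably be `dbutils.fs.ls`), each entry exposes `.path`, `.name`, and `.size`, and directory entries have names ending in `/`. `count_parquet_recursive` and `average_size_mb` are hypothetical helper names.

```python
from collections import namedtuple


def count_parquet_recursive(path_str, ls):
    """Return (file_count, total_size_bytes) for all parquet files under path_str.

    `ls` is any callable that lists a directory, e.g. dbutils.fs.ls (assumed).
    """
    file_count, total_bytes = 0, 0
    for entry in ls(path_str):
        if entry.name.endswith("/"):  # assumption: directory names end with "/"
            sub_count, sub_bytes = count_parquet_recursive(entry.path, ls)
            file_count += sub_count
            total_bytes += sub_bytes
        elif entry.name.endswith(".parquet"):
            file_count += 1
            total_bytes += entry.size
    return file_count, total_bytes


def average_size_mb(file_count, total_bytes):
    """Average parquet file size in MB, or 0.0 when there are no files."""
    if file_count == 0:
        return 0.0
    return total_bytes / file_count / (1024 * 1024)


# Tiny in-memory stand-in for a file system, for illustration only.
Entry = namedtuple("Entry", ["path", "name", "size"])
fake_fs = {
    "/root": [
        Entry("/root/a.parquet/", "a.parquet/", 0),
        Entry("/root/_SUCCESS", "_SUCCESS", 0),
    ],
    "/root/a.parquet/": [
        Entry("/root/a.parquet/part-0.parquet", "part-0.parquet", 1024 * 1024),
        Entry("/root/a.parquet/part-1.parquet", "part-1.parquet", 3 * 1024 * 1024),
    ],
}
count, size = count_parquet_recursive("/root", fake_fs.__getitem__)
```

Passing the listing function as an argument keeps the helper usable outside Databricks; in a notebook the call would be `count_parquet_recursive(path, dbutils.fs.ls)`.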

## Data Flow Architecture - Medallion Layers

This project follows a **medallion architecture** with multiple data quality layers:

### 🥉 **Bronze Layer**
Raw data ingestion zone - stores data in its original format as received from source systems. Minimal transformations, preserving data lineage.

### 🥈 **Silver Layer** 
Cleaned and validated data - applies data quality rules, deduplication, standardization, and basic transformations on target columns. Data is split into training and test sets.

### 🥇 **Gold Layer**
Aggregated and enriched data - contains business-level aggregations, metrics, and feature engineering for analytics and ML. Includes Paris-specific geographic features (metros, monuments).

### 💎 **Diamond Layer**
Production-ready datasets with model predictions - contains the local training and test sets from the Gold layer, extended with global model prediction columns. Ready for final experiment runs and deployment.

In [0]:
# Verify the final deployment structure
deployment_base = "/FileStore/tables/paris_project/deployment"

print("\n" + "="*60)
print("DEPLOYMENT DIRECTORY STRUCTURE")
print("="*60 + "\n")

# Check each layer
layers_to_check = ['bronze', 'silver', 'gold', 'diamond']

for layer in layers_to_check:
    layer_path = f"{deployment_base}/{layer}"
    try:
        items = dbutils.fs.ls(layer_path)
        print(f"\n📁 {layer.upper()} ({len(items)} items):")
        for item in items:
            n_files, size = get_parquet_file_count_and_size(item.path)
            size_mb = size / (1024 * 1024)
            print(f"  • {item.name} - {n_files} files, {size_mb:.2f} MB")
    except Exception as e:
        # Include the underlying error so a missing layer is distinguishable from other failures
        print(f"\n⚠️ {layer.upper()}: Not found or empty ({e})")

print("\n" + "="*60)


DEPLOYMENT DIRECTORY STRUCTURE


📁 BRONZE (4 items):
  • global_org_df.parquet/ - 8 files, 13932.58 MB
  • local_org_df.parquet/ - 1 files, 54.36 MB
  • paris_metros_org.parquet/ - 1 files, 0.01 MB
  • paris_monuments_org.parquet/ - 1 files, 0.00 MB

📁 SILVER (3 items):
  • global_train_v2.parquet/ - 8 files, 9026.16 MB
  • local_train_pool_v2.parquet/ - 1 files, 30.15 MB
  • test_set_v2.parquet/ - 1 files, 7.44 MB

📁 GOLD (5 items):
  • global_train_features_v4.parquet/ - 4 files, 53.04 MB
  • local_train_pool_v4.parquet/ - 1 files, 1.77 MB
  • local_train_pool_v5_with_paris_features.parquet/ - 1 files, 6.39 MB
  • test_set_v4.parquet/ - 1 files, 0.51 MB
  • test_set_v5_with_paris_features.parquet/ - 1 files, 1.77 MB

📁 DIAMOND (2 items):
  • local_train_with_global_pred_v7.parquet/ - 1 files, 8.13 MB
  • test_set_with_global_pred_v7.parquet/ - 1 files, 2.16 MB

