# Quote Math Validator

This notebook validates the pricing math used by the form-builder quote engine.

The PHP `QuoteEngine` and JavaScript `quote.js` share the same deterministic formula:

```
subtotal  = (base_rate * complexity_multiplier) + addon_total
range_low  = round(subtotal * 0.9)
range_high = round(subtotal * 1.2)
```
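
As a quick worked example, the formula can be written as a small Python helper. This is only a sketch: `quote_range` is a hypothetical name that does not appear in the PHP or JavaScript code. The sample inputs (base rate 9500, multiplier 2.0, addons totalling 3300) correspond to the `ai_web_app` / `complex` case with `branding` and `api_integration` used in the inference section.

```python
def quote_range(base_rate, complexity_multiplier, addon_total):
    """Return (subtotal, range_low, range_high) for one quote.

    Hypothetical helper: a direct transcription of the three formula lines.
    """
    subtotal = base_rate * complexity_multiplier + addon_total
    range_low = round(subtotal * 0.9)
    range_high = round(subtotal * 1.2)
    return subtotal, range_low, range_high


# 9500 * 2.0 + (1800 + 1500) = 22300
subtotal, low, high = quote_range(9500, 2.0, 3300)
print(subtotal, low, high)
```

The subtotal comes back as a float because the multiplier is a float; the two range values are integers because Python's `round()` with one argument returns an `int`.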

**Goals**
1. Load `quote_math_validation.csv`: more than 2,500 correct rows plus roughly 360 intentionally broken rows.
2. Re-implement the formula in Python and assert it matches the CSV ground truth.
3. Train a lightweight `DecisionTreeClassifier` to detect rows where the stored numbers are wrong.
4. Expose a single `predict_quote(...)` function that returns `True` when the math checks out.

All packages used (`pandas`, `numpy`, `scikit-learn`) are available on the Kaggle free tier with no GPU required.

## 1. Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report,
)
from sklearn.preprocessing import LabelEncoder

# Print library versions so results can be reproduced later.
print("pandas", pd.__version__, "| numpy", np.__version__,
      "| scikit-learn", sklearn.__version__)

## 2. Load and Explore the Dataset

`quote_math_validation.csv` was generated by `gen_data.py`, which:
- Enumerates every `service × complexity` pair
- Pairs each with no addons, each single addon, every 2-addon combination, every 3-addon combination, and a sample of 4- and 5-addon combinations
- Adds ~15% intentionally miscalculated rows (wrong multiplier, wrong range_low, or wrong range_high) labelled `math_correct = 0`
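
Goal 2 (re-implementing the formula and comparing it with the stored label) can be sketched as a vectorised check. The helper name `formula_matches` and the two-row `toy` frame are illustrative assumptions; the column names are the ones the later cells of this notebook read from the CSV. Whether every `math_correct = 0` row in the real file fails this check depends on how `gen_data.py` injects its errors, which is not shown here.

```python
import pandas as pd


def formula_matches(frame: pd.DataFrame) -> pd.Series:
    """Return a boolean Series: True where the stored numbers match the formula."""
    expected = (
        frame["base_rate"] * frame["complexity_multiplier"] + frame["addon_total"]
    )
    return (
        (frame["subtotal"] == expected.round())
        & (frame["range_low"] == (expected * 0.9).round())
        & (frame["range_high"] == (expected * 1.2).round())
    )


# Toy stand-in for the CSV: one correct row, one row with a wrong range_high.
toy = pd.DataFrame({
    "base_rate": [1500, 3500],
    "complexity_multiplier": [1.0, 1.4],
    "addon_total": [0, 500],
    "subtotal": [1500, 5400],
    "range_low": [1350, 4860],
    "range_high": [1800, 9999],
    "math_correct": [1, 0],
})

toy["formula_ok"] = formula_matches(toy)

# Rows where the recomputed result disagrees with the stored label.
mismatch = toy[toy["formula_ok"] != toy["math_correct"].astype(bool)]
print(toy[["subtotal", "range_low", "range_high", "math_correct", "formula_ok"]])
print("Rows disagreeing with the label:", len(mismatch))
```

On the real data the same call would be `formula_matches(df)` once `df` has been loaded; any rows left in `mismatch` would point at either a generator bug or a formula divergence.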

In [None]:
from pathlib import Path
import subprocess, sys

KAGGLE_BASE = Path("/kaggle/input/datasets/crissymoon/quote-math-validation")
KAGGLE_CSV  = KAGGLE_BASE / "quote_math_validation.csv"
KAGGLE_GEN  = KAGGLE_BASE / "gen_data.py"
LOCAL_BASE  = Path("/kaggle/working")
LOCAL_CSV   = LOCAL_BASE / "quote_math_validation.csv"

# Priority: Kaggle input dataset -> already-generated working copy -> regenerate now
if KAGGLE_CSV.exists():
    CSV_PATH = KAGGLE_CSV
elif LOCAL_CSV.exists():
    CSV_PATH = LOCAL_CSV
else:
    if not KAGGLE_GEN.exists():
        raise FileNotFoundError(
            f"CSV not found.\nExpected: {KAGGLE_CSV}\n"
            "Attach the dataset 'crissymoon/quote-math-validation' to this notebook."
        )
    print("Regenerating CSV from gen_data.py ...")
    # Assumption: gen_data.py writes quote_math_validation.csv into its working directory.
    subprocess.run([sys.executable, str(KAGGLE_GEN)], check=True, cwd=LOCAL_BASE)
    if not LOCAL_CSV.exists():
        raise FileNotFoundError(f"gen_data.py ran but did not create {LOCAL_CSV}")
    CSV_PATH = LOCAL_CSV

df = pd.read_csv(CSV_PATH)
print("CSV loaded from:", CSV_PATH)
print("Shape:", df.shape)
print("\nClass balance:")
print(df["math_correct"].value_counts())
print("\nError type distribution:")
print(df["error_type"].value_counts())
df.head(8)

## 3. Data Preprocessing

In [None]:
assert df.isnull().sum().sum() == 0, "Unexpected nulls — re-run gen_data.py"

le_service    = LabelEncoder().fit(df["service_type"])
le_complexity = LabelEncoder().fit(df["complexity"])

df["service_enc"]    = le_service.transform(df["service_type"])
df["complexity_enc"] = le_complexity.transform(df["complexity"])

print("Service classes   :", list(le_service.classes_))
print("Complexity classes:", list(le_complexity.classes_))

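# Addon flag columns only: "addon_total" also starts with "addon_" but is already
# listed in NUM_COLS, and including it twice would create a duplicate feature column.
ADDON_COLS = [c for c in df.columns
              if c.startswith("addon_") and c != "addon_total"]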
NUM_COLS   = ["base_rate", "complexity_multiplier", "addon_total",
              "subtotal", "range_low", "range_high"]
CAT_COLS   = ["service_enc", "complexity_enc"]
FEATURES   = CAT_COLS + NUM_COLS + ADDON_COLS
TARGET     = "math_correct"

X = df[FEATURES]
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTrain: {len(X_train)}  Test: {len(X_test)}")
print(f"Train positive rate: {y_train.mean():.3f}")

## 4. Feature Engineering

The numeric columns already encode everything the formula uses, but we add three derived features that make the decision boundary trivially learnable for the tree:

| Derived feature | Formula |
|---|---|
| `expected_subtotal` | `base_rate * complexity_multiplier + addon_total` |
| `low_delta` | `range_low - round(expected_subtotal * 0.9)` |
| `high_delta` | `range_high - round(expected_subtotal * 1.2)` |

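A correct row has `low_delta == 0` and `high_delta == 0`. The model learns this quickly, but the exercise is still useful for catching float-rounding edge cases or JS/PHP divergence you might introduce later. One such edge case is the rounding mode: Python's `round()` and pandas' `.round()` send exact halves to the nearest even integer, whereas PHP's `round()` sends them away from zero and JavaScript's `Math.round()` sends them toward positive infinity, so a product ending in exactly `.5` can differ between layers.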
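
The rounding-mode difference is easy to see on exact halves. The sketch below compares Python's built-in `round()` with a hypothetical `round_half_up` helper built on the standard `decimal` module. For the positive amounts a quote produces, half-away-from-zero gives the same result as both PHP's `round()` and JavaScript's `Math.round()`.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to the nearest integer, sending exact halves away from zero.

    Hypothetical helper; goes through str() so 2.5 becomes Decimal('2.5').
    """
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Built-in round() uses round-half-to-even; the helper always rounds halves up.
for x in (0.5, 1.5, 2.5):
    print(f"x={x}  round(x)={round(x)}  round_half_up(x)={round_half_up(x)}")
```

If a future rate change ever produces a subtotal whose 0.9 or 1.2 multiple lands on a half, swapping a helper like this into the Python check would keep it aligned with the other two layers.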

In [None]:
def add_derived(frame):
    """Add expected_subtotal, low_delta, high_delta to a copy of the frame."""
    f = frame.copy()
    f["expected_subtotal"] = (
        f["base_rate"] * f["complexity_multiplier"] + f["addon_total"]
    )
    f["low_delta"]  = f["range_low"]  - (f["expected_subtotal"] * 0.9).round()
    f["high_delta"] = f["range_high"] - (f["expected_subtotal"] * 1.2).round()
    return f

X_train_fe = add_derived(X_train)
X_test_fe  = add_derived(X_test)

FEATURES_FE = FEATURES + ["expected_subtotal", "low_delta", "high_delta"]

print("Features:", FEATURES_FE)
print("Train shape:", X_train_fe[FEATURES_FE].shape)

## 5. Build the Lightweight Model

A shallow `DecisionTreeClassifier` (`max_depth=5`) is used because:
- The pricing formula is entirely deterministic — a tree of depth 2–3 should already reach near-100% accuracy.
- It is fast and interpretable, and its predictions are plain threshold comparisons on the input features, so it adds no floating-point arithmetic of its own.
- It can be serialised to a tiny JSON/dict for use inside the PHP or JS layer if needed.
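
The interpretability and serialisation points above can be sketched on a toy tree. Everything in this block is illustrative: the six-row dataset, `tree_to_dict`, and `predict_from_dict` are hypothetical and not part of the quote engine. The sketch reads the fitted structure from scikit-learn's `tree_` attribute (`children_left`, `children_right`, `feature`, `threshold`, `value`) and prints the rules with `export_text`.

```python
import json

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURE_NAMES = ["low_delta", "high_delta"]
# Toy data: rows with both deltas at zero are "correct" (1), the rest "wrong" (0).
X_toy = np.array([[0, 0], [0, 0], [0, 0], [5, 0], [0, -3], [2, 2]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
print(export_text(toy_tree, feature_names=FEATURE_NAMES))


def tree_to_dict(clf, feature_names):
    """Convert a fitted tree into nested dicts that json.dumps can serialise."""
    t = clf.tree_

    def walk(node):
        if t.children_left[node] == -1:  # -1 marks a leaf in scikit-learn trees
            best = int(np.argmax(t.value[node][0]))
            return {"leaf": True, "class": int(clf.classes_[best])}
        return {
            "leaf": False,
            "feature": feature_names[t.feature[node]],
            "threshold": float(t.threshold[node]),
            "left": walk(int(t.children_left[node])),
            "right": walk(int(t.children_right[node])),
        }

    return walk(0)


def predict_from_dict(node, row):
    """Walk the dict: go left when the feature value is <= the threshold."""
    while not node["leaf"]:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["class"]


payload = json.dumps(tree_to_dict(toy_tree, FEATURE_NAMES))
restored = json.loads(payload)
print(payload)
print(predict_from_dict(restored, {"low_delta": 0, "high_delta": 0}))
```

The `predict_from_dict` loop is small enough to port to PHP or JavaScript by hand. Note that a tree stores thresholds between observed values rather than an exact `== 0` test, so it approximates the deterministic rule rather than replacing it.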

In [None]:
model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=2,
    random_state=42,
    class_weight="balanced",
)

print("Model:", model)

## 6. Train the Model

In [None]:
model.fit(X_train_fe[FEATURES_FE], y_train)

train_acc = accuracy_score(y_train, model.predict(X_train_fe[FEATURES_FE]))
print(f"Training accuracy : {train_acc:.4f}")
print(f"Tree depth used   : {model.get_depth()}")
print(f"Leaf nodes        : {model.get_n_leaves()}")

## 7. Evaluate the Model

In [None]:
y_pred = model.predict(X_test_fe[FEATURES_FE])

print("Test set results")
print("----------------")
print(classification_report(y_test, y_pred, target_names=["wrong (0)", "correct (1)"]))

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm,
    index=["actual: wrong", "actual: correct"],
    columns=["pred: wrong", "pred: correct"],
)
print("Confusion matrix:")
print(cm_df)

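# precision/recall/F1 use pos_label=1 by default, so they score the "correct" class;
# per-class figures for the "wrong" class are in the classification report above.
print(f"\nAccuracy : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall   : {recall_score(y_test, y_pred):.4f}")
print(f"F1       : {f1_score(y_test, y_pred):.4f}")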

In [None]:
# Feature importance
importance = pd.Series(model.feature_importances_, index=FEATURES_FE)
importance_sorted = importance[importance > 0].sort_values(ascending=False)

print("Feature importances (non-zero only):")
print(importance_sorted.to_string())

## 8. Run Inference on New Form Inputs

`predict_quote(...)` mirrors the `QuoteEngine::calculate()` PHP signature.  
Pass the raw form values and it returns `True` if the numbers check out, `False` otherwise.

In [None]:
BASE_RATES = {
    "web_design": 1500, "web_development": 3500, "ecommerce": 4500,
    "software": 7500, "ai_web_app": 9500, "ai_native_app": 14000,
}
COMPLEXITY_MULTIPLIERS = {
    "simple": 1.0, "moderate": 1.4, "complex": 2.0, "custom": 2.8,
}
ADDON_RATES = {
    "seo_basic": 500, "seo_advanced": 1200, "copywriting": 800,
    "branding": 1800, "maintenance": 1200, "hosting_setup": 350,
    "api_integration": 1500, "automation": 2200,
}
ADDONS_LIST = list(ADDON_RATES.keys())


def predict_quote(
    service_type: str,
    complexity: str,
    addons: list,
    claimed_subtotal: int,
    claimed_range_low: int,
    claimed_range_high: int,
) -> bool:
    """
    Returns True if the claimed figures match the QuoteEngine formula exactly.

    Parameters mirror the PHP QuoteEngine::calculate() method.
    The call also runs the trained model as a secondary check; both must agree.
    """
    base        = BASE_RATES[service_type]
    multiplier  = COMPLEXITY_MULTIPLIERS[complexity]
    # Addon names missing from ADDON_RATES are ignored rather than raising.
    addon_total = sum(ADDON_RATES[a] for a in addons if a in ADDON_RATES)

    # Unrounded subtotal, computed once and reused for the model features below.
    raw_subtotal      = base * multiplier + addon_total
    expected_subtotal = round(raw_subtotal)
    expected_low      = round(expected_subtotal * 0.9)
    expected_high     = round(expected_subtotal * 1.2)

    # Deterministic rule check (no ML needed for production)
    rule_ok = (
        claimed_subtotal   == expected_subtotal and
        claimed_range_low  == expected_low      and
        claimed_range_high == expected_high
    )

    # ML model check (secondary): build one feature row shaped like the training data
    addon_flags = {f"addon_{a}": int(a in addons) for a in ADDONS_LIST}
    row = {
        "service_enc"            : le_service.transform([service_type])[0],
        "complexity_enc"         : le_complexity.transform([complexity])[0],
        **addon_flags,
        "base_rate"              : base,
        "complexity_multiplier"  : multiplier,
        "addon_total"            : addon_total,
        "subtotal"               : claimed_subtotal,
        "range_low"              : claimed_range_low,
        "range_high"             : claimed_range_high,
        "expected_subtotal"      : raw_subtotal,
        "low_delta"              : claimed_range_low  - round(raw_subtotal * 0.9),
        "high_delta"             : claimed_range_high - round(raw_subtotal * 1.2),
    }
    row_df   = pd.DataFrame([row])[FEATURES_FE]
    model_ok = bool(model.predict(row_df)[0])

    return rule_ok and model_ok


# --- Test cases ---
cases = [
    # Correct: AI Web App, complex, branding + api_integration
    dict(service_type="ai_web_app", complexity="complex",
         addons=["branding", "api_integration"],
         claimed_subtotal=22300, claimed_range_low=20070, claimed_range_high=26760),

    # Correct: web_design, simple, no addons
    dict(service_type="web_design", complexity="simple",
         addons=[], claimed_subtotal=1500, claimed_range_low=1350, claimed_range_high=1800),

    # Wrong: web_development, moderate, wrong range_high
    dict(service_type="web_development", complexity="moderate",
         addons=["seo_basic"],
         claimed_subtotal=5400, claimed_range_low=4860, claimed_range_high=9999),
]

print(f"{'service':<20} {'complexity':<10} {'addons':<25} {'result'}")
print("-" * 75)
for c in cases:
    ok = predict_quote(**c)
    print(f"{c['service_type']:<20} {c['complexity']:<10} {str(c['addons']):<25} {'CORRECT' if ok else 'WRONG'}")