# 🦆 Data Quality for Your ML Pipeline with DuckGuard

**Bad data is one of the most common reasons ML models fail in production.** Before you train, validate.

This notebook shows how to profile, validate, and debug data quality issues in under 30 seconds using [DuckGuard](https://github.com/XDataHubAI/duckguard), a pytest-like data quality library powered by DuckDB.

**Works with:** CSV, Parquet, S3, Snowflake, Databricks, BigQuery, and 15+ sources.

[![GitHub](https://img.shields.io/github/stars/XDataHubAI/duckguard?style=social)](https://github.com/XDataHubAI/duckguard)
[![PyPI](https://img.shields.io/pypi/v/duckguard.svg)](https://pypi.org/project/duckguard/)
[![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://xdatahubai.github.io/duckguard/)

## 1. Install & Setup

In [None]:
!pip install -q duckguard

## 2. Create Sample Data

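We'll create a small e-commerce dataset with intentional quality issues of the kind you'd find in production data, including a missing `customer_id`, a negative quantity, an outlier order of 500 units, an invalid status, and a malformed email.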

In [None]:
import csv, os

ORDERS_CSV = """order_id,customer_id,product_name,quantity,unit_price,subtotal,tax,shipping,total_amount,status,country,email,phone,created_at,ship_date
ORD001,CUST001,Widget Pro,2,29.99,59.98,5.40,4.99,70.37,shipped,US,alice@example.com,555-0101,2024-01-15,2024-01-17
ORD002,CUST002,Gadget Plus,1,49.99,49.99,4.50,0.00,54.49,delivered,US,bob@example.com,555-0102,2024-01-15,2024-01-18
ORD003,,Widget Pro,-3,29.99,-89.97,-8.10,4.99,-93.08,pending,UK,charlie@example.com,+44-20-7946-0958,2024-01-16,
ORD004,CUST004,Super Gizmo,1,199.99,199.99,18.00,0.00,217.99,shipped,US,,555-0104,2024-01-16,2024-01-19
ORD005,CUST005,Widget Pro,500,29.99,14995.00,1349.55,4.99,16349.54,pending,CA,eve@example.com,555-0105,2024-01-17,
ORD006,CUST006,Gadget Plus,2,49.99,99.98,9.00,4.99,113.97,INVALID,US,frank@example.com,555-0106,2024-01-17,2024-01-20
ORD007,CUST007,Basic Widget,1,9.99,9.99,0.90,4.99,15.88,delivered,US,grace@example,555-0107,2024-01-18,2024-01-20
ORD008,CUST008,Premium Bundle,3,99.99,299.97,27.00,0.00,326.97,shipped,DE,hans@example.de,+49-30-12345678,2024-01-18,2024-01-22
ORD009,CUST009,Widget Pro,1,29.99,29.99,2.70,4.99,37.68,delivered,US,ivan@example.com,,2024-01-19,2024-01-21
ORD010,CUST010,Super Gizmo,2,199.99,399.98,36.00,0.00,435.98,pending,JP,jun@example.jp,+81-3-1234-5678,2024-01-19,
"""

with open("orders.csv", "w") as f:
    f.write(ORDERS_CSV.strip())

print("✅ Created orders.csv with 10 rows (and intentional quality issues)")

## 3. Connect & Profile

DuckGuard auto-detects file types, column types, and semantic types (email, phone, PII, etc.).

In [None]:
from duckguard import connect, AutoProfiler, SemanticAnalyzer

orders = connect("orders.csv")
print(f"Rows: {orders.row_count}")
print(f"Columns: {orders.columns}")

In [None]:
# Full auto-profile with quality scoring
profiler = AutoProfiler()
profile = profiler.profile(orders)

print(f"\n📊 Quality Grade: {profile.overall_quality_grade} ({profile.overall_quality_score:.1f}/100)\n")
print(f"{'Column':<20} {'Type':<12} {'Nulls %':<10} {'Unique %':<10} {'Grade'}")
print("-" * 62)
for col in profile.columns:
    print(f"{col.name:<20} {str(col.dtype):<12} {col.null_percent:<10.1f} {col.unique_percent:<10.1f} {col.quality_grade}")
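
To make the `Nulls %` and `Unique %` columns concrete, here is a minimal standard-library sketch that computes both for three columns copied from the sample data. The `column_stats` helper and its definitions (an empty string counts as null; uniqueness is distinct non-null values over total rows) are assumptions for illustration. DuckGuard's profiler may define these metrics differently.

```python
# Values copied from orders.csv above; "" marks a missing cell.
columns = {
    "customer_id": ["CUST001", "CUST002", "", "CUST004", "CUST005",
                    "CUST006", "CUST007", "CUST008", "CUST009", "CUST010"],
    "status": ["shipped", "delivered", "pending", "shipped", "pending",
               "INVALID", "delivered", "shipped", "delivered", "pending"],
    "country": ["US", "US", "UK", "US", "CA", "US", "US", "DE", "US", "JP"],
}


def column_stats(values):
    """Return (null_percent, unique_percent) for one column (hypothetical helper)."""
    total = len(values)
    non_null = [v for v in values if v != ""]
    null_percent = 100.0 * (total - len(non_null)) / total
    # Assumed definition: distinct non-null values relative to all rows.
    unique_percent = 100.0 * len(set(non_null)) / total
    return null_percent, unique_percent


stats = {name: column_stats(values) for name, values in columns.items()}
for name, (null_pct, unique_pct) in stats.items():
    print(f"{name:<12} nulls={null_pct:5.1f}%  unique={unique_pct:5.1f}%")
```

Under these definitions a low `Unique %` on `status` or `country` is expected for categorical columns, while anything below 100% on an identifier column deserves a closer look.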

## 4. PII Detection

Before sharing data or training models, check for personally identifiable information.

In [None]:
analysis = SemanticAnalyzer().analyze(orders)

print("🔒 PII Detection Results:\n")
for col in analysis.columns:
    if col.is_pii:
        print(f"  ⚠️  {col.name}: {col.semantic_type.value} (confidence: {col.confidence:.0%})")

if analysis.pii_columns:
    print(f"\n  Found PII in {len(analysis.pii_columns)} columns: {analysis.pii_columns}")
    print("  → Consider masking these before training!")
else:
    print("  No PII detected.")
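
One plausible way a semantic type and its confidence could be derived is by pattern-matching the values of a column. The sketch below is only an illustration of that idea on the sample `email` column; the regular expression and the `email_confidence` helper are assumptions, not DuckGuard's actual detection logic.

```python
import re

# Simple email shape: something@something.tld (an illustrative heuristic only).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# The email column from orders.csv above; "" marks a missing cell.
emails = ["alice@example.com", "bob@example.com", "charlie@example.com", "",
          "eve@example.com", "frank@example.com", "grace@example",
          "hans@example.de", "ivan@example.com", "jun@example.jp"]


def email_confidence(values):
    """Share of non-empty values that look like an email address."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    return sum(bool(EMAIL_RE.match(v)) for v in non_empty) / len(non_empty)


confidence = email_confidence(emails)
malformed = [v for v in emails if v and not EMAIL_RE.match(v)]
print(f"email confidence: {confidence:.0%}, malformed: {malformed}")
```

A column can still be classified as PII when a few values are malformed, which is why a confidence below 100% is worth reading as a hint that the column also has a validity problem.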

## 5. Validate: The Core

Validations read like pytest assertions. Each one returns a `ValidationResult` with details on what failed and why.

In [None]:
# Null checks
result = orders.customer_id.is_not_null()
print(f"customer_id not null: {'✅' if result.passed else '❌'}")
if not result.passed:
    print(f"  → {result.summary()}")

# Uniqueness
result = orders.order_id.is_unique()
print(f"order_id unique: {'✅' if result.passed else '❌'}")

# Range checks
result = orders.quantity.between(1, 100)
print(f"quantity in [1, 100]: {'✅' if result.passed else '❌'}")
if not result.passed:
    print(f"  → {result.summary()}")

# Enum checks
result = orders.status.isin(["pending", "shipped", "delivered", "cancelled"])
print(f"status valid: {'✅' if result.passed else '❌'}")
if not result.passed:
    print(f"  → {result.summary()}")
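
As a cross-check on what the cell above should report, the same four rules can be written in plain Python against columns copied from the sample data. This is a sketch under stated assumptions: `failing_rows` is a hypothetical helper, empty strings are treated as nulls, and the range check is assumed to be inclusive on both ends.

```python
from collections import Counter

# Four columns copied from orders.csv above; "" marks a missing cell.
order_id = [f"ORD{i:03d}" for i in range(1, 11)]
customer_id = ["CUST001", "CUST002", "", "CUST004", "CUST005",
               "CUST006", "CUST007", "CUST008", "CUST009", "CUST010"]
quantity = [2, 1, -3, 1, 500, 2, 1, 3, 1, 2]
status = ["shipped", "delivered", "pending", "shipped", "pending",
          "INVALID", "delivered", "shipped", "delivered", "pending"]


def failing_rows(values, is_valid):
    """Return 1-based row numbers whose value fails the predicate."""
    return [i for i, v in enumerate(values, start=1) if not is_valid(v)]


counts = Counter(order_id)
allowed = {"pending", "shipped", "delivered", "cancelled"}

null_failures = failing_rows(customer_id, lambda v: v != "")
unique_failures = failing_rows(order_id, lambda v: counts[v] == 1)
range_failures = failing_rows(quantity, lambda v: 1 <= v <= 100)
enum_failures = failing_rows(status, lambda v: v in allowed)

print(null_failures, unique_failures, range_failures, enum_failures)
```

On this data only the uniqueness rule passes: row 3 has no `customer_id`, rows 3 and 5 have out-of-range quantities, and row 6 carries the `INVALID` status.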

## 6. Row-Level Error Debugging

Don't just know *that* something failed; see *exactly which rows* and *why*.

In [None]:
result = orders.quantity.between(1, 100)

if not result.passed:
    print("Failed rows:")
    for row in result.failed_rows:
        print(f"  Row {row.row_number}: quantity={row.value} - {row.reason}")
    print(f"\nFailed values: {result.get_failed_values()}")
    print(f"Failed row indices: {result.get_failed_row_indices()}")

## 7. Quality Scoring

Get a composite quality score across 4 dimensions: completeness, uniqueness, validity, and consistency.

In [None]:
score = orders.score()

print(f"Overall Grade: {score.grade}")
print(f"Overall Score: {score.overall:.1f}/100")
print()
print(f"  Completeness: {score.completeness:.1f}%  (non-null values)")
print(f"  Uniqueness:   {score.uniqueness:.1f}%  (distinct values in key columns)")
print(f"  Validity:     {score.validity:.1f}%  (values passing type/range checks)")
print(f"  Consistency:  {score.consistency:.1f}%  (consistent formatting)")
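
For intuition about the completeness dimension, the figure can be worked out by hand for the sample file. The arithmetic below assumes that empty CSV cells are read as nulls and that every cell carries equal weight; the score DuckGuard reports may be weighted differently.

```python
# Missing cells per column, counted by hand from orders.csv above.
missing = {"customer_id": 1, "email": 1, "phone": 1, "ship_date": 3}

rows, cols = 10, 15
total_cells = rows * cols
missing_cells = sum(missing.values())

# Completeness as the share of cells that hold a value.
completeness = 100.0 * (total_cells - missing_cells) / total_cells
print(f"Completeness: {completeness:.1f}%")
```

Six empty cells out of 150 leave 96% of the table populated, so completeness alone would look healthy even though the validity checks above fail.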

## 8. Anomaly Detection

DuckGuard provides seven built-in methods: z-score, IQR, modified z-score, percent change, baseline, KS-test, and seasonal.

In [None]:
from duckguard import detect_anomalies

report = detect_anomalies(orders, method="zscore", columns=["quantity", "total_amount"])

print(f"Anomalies found: {report.has_anomalies}")
print(f"Count: {report.anomaly_count}")
for a in report.anomalies:
    status = "🚨 ANOMALY" if a.is_anomaly else "✅ Normal"
    print(f"  {a.column}: score={a.score:.2f} → {status}")
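
The choice of method matters on small samples. With n values and a population standard deviation, no z-score can exceed sqrt(n - 1), so on a 10-row table a threshold of 3 can never fire, however extreme the outlier. The standard-library sketch below compares a textbook z-score with the IQR rule on the sample `quantity` column. The threshold of 3 and the 1.5 × IQR fences are conventional defaults assumed here, not values taken from DuckGuard.

```python
import statistics

# The quantity column from orders.csv above.
quantity = [2, 1, -3, 1, 500, 2, 1, 3, 1, 2]

# z-score: distance from the mean in units of (population) standard deviation.
mean = statistics.fmean(quantity)
std = statistics.pstdev(quantity)
z_scores = [(q - mean) / std for q in quantity]
z_outliers = [q for q, z in zip(quantity, z_scores) if abs(z) > 3]

# IQR rule: flag values more than 1.5 * IQR outside the first/third quartile.
q1, _, q3 = statistics.quantiles(quantity, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [q for q in quantity if q < low or q > high]

max_abs_z = max(abs(z) for z in z_scores)
print(f"max |z| = {max_abs_z:.3f}, z-score outliers: {z_outliers}")
print(f"IQR fences = ({low}, {high}), IQR outliers: {iqr_outliers}")
```

Here the order of 500 units inflates the mean and standard deviation enough to hide itself from the z-score test, while the quartile-based rule flags both 500 and -3. For small or heavily skewed tables, a robust method such as IQR or modified z-score is usually the safer starting point.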

## 9. Auto-Suggest Validation Rules

DuckGuard can analyze your data and generate YAML rules automatically.

In [None]:
from duckguard import generate_rules

yaml_rules = generate_rules(orders, dataset_name="orders")
print(yaml_rules)

## 10. Use with Any Source

Everything above works the same way on any supported data source, not just CSV:

```python
# Parquet files (local or cloud)
data = connect("s3://my-bucket/data.parquet")

# Snowflake
data = connect("snowflake://account/db", table="orders")

# Databricks
data = connect("databricks://workspace.databricks.com", table="orders")

# BigQuery
data = connect("bigquery://project", table="orders")

# Delta Lake
data = connect("delta://path/to/delta_table")

# pandas DataFrame
data = connect(your_dataframe)
```

Install connectors as needed: `pip install "duckguard[snowflake]"` or `pip install "duckguard[all]"` (the quotes keep shells such as zsh from interpreting the brackets).

## 11. In Your pytest Suite

DuckGuard validation results can be used directly in `assert` statements, so they work natively in pytest:

```python
# tests/test_data_quality.py
from duckguard import connect

def test_orders():
    orders = connect("s3://warehouse/orders.parquet")
    assert orders.row_count > 0
    assert orders.order_id.is_not_null()
    assert orders.order_id.is_unique()
    assert orders.total_amount.between(0, 10000)
```

Run with `pytest` to make data quality checks part of CI/CD.
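
The `assert` pattern above works when a result object is truthy on success and falsy on failure. The sketch below shows one common way to get that behavior in Python, by defining `__bool__`. `CheckResult` and `between` are hypothetical stand-ins written for this illustration; they are not DuckGuard's classes or its implementation.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical stand-in for a validation result; not DuckGuard's class."""
    passed: bool
    message: str = ""

    def __bool__(self) -> bool:
        # `assert result` succeeds exactly when the check passed.
        return self.passed


def between(values, low, high) -> CheckResult:
    """Check that every value lies in the inclusive range [low, high]."""
    bad = [v for v in values if not low <= v <= high]
    return CheckResult(passed=not bad, message=f"{len(bad)} value(s) out of range")


ok = between([2, 1, 3], 1, 100)
failed = between([2, -3, 500], 1, 100)

assert ok  # passes silently, as it would inside a pytest test
print(bool(failed), failed.message)
```

Keeping the details on the result object means a failing `assert` can still surface a useful message, for example via `assert result, result.message`.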

---

## Summary

| What | How |
|------|-----|
| Install | `pip install duckguard` |
| Connect | `connect("file.csv")` / `connect("snowflake://...")` |
| Validate | `assert data.col.is_not_null()` |
| Profile | `AutoProfiler().profile(data)` |
| Score | `data.score()` → A/B/C/D/F |
| Anomalies | `detect_anomalies(data, method="zscore")` |
| PII | `SemanticAnalyzer().analyze(data)` |
| Rules | `generate_rules(data)` → YAML |

**3 lines of code. 10x faster than Great Expectations. Works with any data source.**

📚 [Full Docs](https://xdatahubai.github.io/duckguard/) · ⭐ [GitHub](https://github.com/XDataHubAI/duckguard) · 📦 [PyPI](https://pypi.org/project/duckguard/)