# DuckGuard - Data Quality in 60 Seconds

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/XDataHubAI/duckguard/blob/main/examples/colab_quickstart.ipynb)
[![PyPI](https://img.shields.io/pypi/v/duckguard.svg)](https://pypi.org/project/duckguard/)

**DuckGuard** is a Python-native data quality tool built on DuckDB. The project describes it as 10x faster than pandas-based tools.

Features:
- Quality Scoring (A-F grades)
- YAML-based Rules
- Semantic Type Detection (PII, emails, etc.)
- Data Contracts
- Anomaly Detection

In [None]:
# Install DuckGuard
!pip install duckguard -q
print("DuckGuard installed!")

In [None]:
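# Create sample data with deliberate defects: a missing customer_id,
# a malformed email, a negative amount, and an unrecognized status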
import pandas as pd

df = pd.DataFrame({
    'order_id': ['ORD-001', 'ORD-002', 'ORD-003', 'ORD-004', 'ORD-005'],
    'customer_id': ['CUST-001', 'CUST-002', None, 'CUST-004', 'CUST-005'],
    'email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'invalid-email', 'alice@example.com'],
    'amount': [99.99, 149.50, 75.00, -10.00, 200.00],
    'status': ['delivered', 'shipped', 'pending', 'unknown', 'delivered']
})

df.to_csv('orders.csv', index=False)
print("Sample data created!")
df

## 1. Connect and Explore

In [None]:
from duckguard import connect

# Connect to data
orders = connect("orders.csv")

print(f"Rows: {orders.row_count}")
print(f"Columns: {orders.columns}")

## 2. Quality Score

In [None]:
# Get instant quality score
result = orders.score()

print(f"Quality Score: {result.overall:.1f}/100")
print(f"Grade: {result.grade}")
print("\nDimensions:")
print(f"  Completeness: {result.completeness:.1f}")
print(f"  Uniqueness: {result.uniqueness:.1f}")
print(f"  Validity: {result.validity:.1f}")

## 3. Column Statistics

In [None]:
# Check column quality
print(f"customer_id null %: {orders.customer_id.null_percent:.1f}%")
print(f"order_id unique %: {orders.order_id.unique_percent:.1f}%")
print(f"amount min: {orders.amount.min}")
print(f"amount max: {orders.amount.max}")
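
To make the numbers above easier to sanity-check, here is a minimal standard-library sketch that computes the same four statistics by hand on the sample values. The helper names `null_percent` and `unique_percent` are hypothetical, and the definitions (nulls divided by rows, distinct non-null values divided by rows) are assumptions; DuckGuard's exact formulas are not shown in this notebook.

```python
# Hand-computed column statistics for the sample orders (standard library only).
# Assumption: null % = nulls / rows and unique % = distinct non-null values / rows.
# These helpers are illustrative and are not part of DuckGuard.
customer_ids = ["CUST-001", "CUST-002", None, "CUST-004", "CUST-005"]
order_ids = ["ORD-001", "ORD-002", "ORD-003", "ORD-004", "ORD-005"]
amounts = [99.99, 149.50, 75.00, -10.00, 200.00]


def null_percent(values):
    """Share of missing (None) entries, as a percentage of all rows."""
    return 100.0 * sum(v is None for v in values) / len(values)


def unique_percent(values):
    """Share of distinct non-null values, as a percentage of all rows."""
    distinct = {v for v in values if v is not None}
    return 100.0 * len(distinct) / len(values)


print(f"customer_id null %: {null_percent(customer_ids):.1f}%")
print(f"order_id unique %: {unique_percent(order_ids):.1f}%")
print(f"amount min: {min(amounts)}")
print(f"amount max: {max(amounts)}")
```

If the library's figures differ from these, the likeliest cause is a different denominator (for example, unique values divided by non-null rows rather than all rows).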

## 4. Semantic Type Detection (PII)

In [None]:
from duckguard import detect_types_for_dataset
from duckguard.semantic import SemanticAnalyzer

# Detect semantic types
types = detect_types_for_dataset(orders)
for col, sem_type in types.items():
    print(f"{col}: {sem_type.value if sem_type else 'generic'}")

# Check for PII
analysis = SemanticAnalyzer().analyze(orders)
if analysis.pii_columns:
    print(f"\n⚠️ PII detected in: {analysis.pii_columns}")
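
As a rough illustration of what pattern-based semantic detection can look like, the sketch below labels a column as email-like when enough of its values match a simple regular expression. Everything here is an assumption for illustration: the pattern, the 0.8 threshold, and the function names are hypothetical, and DuckGuard's own detection logic may work quite differently.

```python
import re

# Deliberately simple pattern: local@domain.tld. This is an illustrative
# assumption, not a full RFC 5322 validator and not DuckGuard's pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_match_ratio(values):
    """Fraction of non-null values that look like an email address."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(bool(EMAIL_RE.match(v)) for v in non_null) / len(non_null)


def looks_like_email_column(values, min_ratio=0.8):
    """Label a column as email-like when enough values match (hypothetical threshold)."""
    return email_match_ratio(values) >= min_ratio


emails = ["john@example.com", "jane@example.com", "bob@example.com",
          "invalid-email", "alice@example.com"]
order_ids = ["ORD-001", "ORD-002", "ORD-003", "ORD-004", "ORD-005"]

print(f"email column match ratio: {email_match_ratio(emails):.2f}")
print(f"email column looks like email: {looks_like_email_column(emails)}")
print(f"order_id column looks like email: {looks_like_email_column(order_ids)}")
```

A ratio threshold below 1.0 lets a column still be recognized as email (and therefore as potential PII) even when it contains a few malformed entries such as `invalid-email`.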

## 5. YAML Rules

In [None]:
from duckguard import load_rules_from_string, execute_rules

yaml_rules = """
dataset: orders
rules:
  - order_id is not null
  - order_id is unique
  - customer_id null_percent < 50
  - amount >= 0
  - status in ['pending', 'shipped', 'delivered']
"""

rules = load_rules_from_string(yaml_rules)
result = execute_rules(rules, dataset=orders)

print(f"Passed: {result.passed_count}/{result.total_checks}")
print("\nResults:")
for r in result.results:
    status = "✓" if r.passed else "✗"
    print(f"  {status} {r.check.expression}")
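
For readers who want to see what the five rules mean in plain terms, here is a standard-library sketch that evaluates them by hand on the sample rows. It assumes each column rule must hold for every row; it is a reading of the rule text, not a description of how DuckGuard parses or executes rules.

```python
# Plain-Python reading of the five YAML rules, applied to the sample rows.
# Assumption: a column rule passes only if it holds for every row.
rows = [
    {"order_id": "ORD-001", "customer_id": "CUST-001", "amount": 99.99, "status": "delivered"},
    {"order_id": "ORD-002", "customer_id": "CUST-002", "amount": 149.50, "status": "shipped"},
    {"order_id": "ORD-003", "customer_id": None, "amount": 75.00, "status": "pending"},
    {"order_id": "ORD-004", "customer_id": "CUST-004", "amount": -10.00, "status": "unknown"},
    {"order_id": "ORD-005", "customer_id": "CUST-005", "amount": 200.00, "status": "delivered"},
]
ALLOWED_STATUSES = {"pending", "shipped", "delivered"}

order_ids = [r["order_id"] for r in rows]
null_customers = sum(r["customer_id"] is None for r in rows)

# Rule text -> whether it holds on the sample data
checks = {
    "order_id is not null": all(v is not None for v in order_ids),
    "order_id is unique": len(set(order_ids)) == len(order_ids),
    "customer_id null_percent < 50": 100.0 * null_customers / len(rows) < 50,
    "amount >= 0": all(r["amount"] >= 0 for r in rows),
    "status in ['pending', 'shipped', 'delivered']": all(
        r["status"] in ALLOWED_STATUSES for r in rows
    ),
}

passed = sum(checks.values())
print(f"Passed: {passed}/{len(checks)}")
for rule, ok in checks.items():
    print(f"  {'✓' if ok else '✗'} {rule}")
```

Under this reading, the two failures come from row `ORD-004`, which carries both the negative amount and the `unknown` status.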

## 6. Anomaly Detection

In [None]:
from duckguard import detect_anomalies

report = detect_anomalies(orders, method="zscore", threshold=2.0)

print(f"Anomalies found: {report.anomaly_count}")
for a in report.anomalies:
    if a.is_anomaly:
        print(f"  ⚠️ {a.column}: {a.message}")
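
To show what a z-score check computes, the sketch below scores the sample `amount` values by hand using the population standard deviation. One caveat worth knowing: with n values and the population standard deviation, no |z| can exceed sqrt(n - 1), which is exactly 2.0 for five rows, so in this calculation even the -10.00 amount is not flagged at a threshold of 2.0. DuckGuard's implementation is not shown here and may compute or report things differently.

```python
import statistics

# Textbook z-score on the sample amounts (an illustrative sketch only;
# whether DuckGuard uses population or sample deviation is not shown).
amounts = [99.99, 149.50, 75.00, -10.00, 200.00]
THRESHOLD = 2.0

mean = statistics.fmean(amounts)
std = statistics.pstdev(amounts)  # population standard deviation
z_scores = [(x - mean) / std for x in amounts]

# Values whose distance from the mean exceeds THRESHOLD standard deviations
flagged = [x for x, z in zip(amounts, z_scores) if abs(z) > THRESHOLD]

for x, z in zip(amounts, z_scores):
    print(f"{x:>8.2f}  z = {z:+.2f}")
print(f"Flagged at |z| > {THRESHOLD}: {flagged}")
```

On tiny samples like this one, a lower threshold or a larger dataset is needed before z-scores can separate outliers from ordinary spread.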

## 7. Data Contracts

In [None]:
from duckguard import generate_contract, validate_contract

# Generate contract from data
contract = generate_contract(orders, name="orders_contract")

print(f"Contract: {contract.name}")
print("Schema:")
for field in contract.schema:
    print(f"  {field.name}: {field.type.value}")

# Validate
result = validate_contract(contract, orders)
print(f"\nValid: {result.is_valid}")
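
The idea behind a data contract can be sketched without any library: infer the expected type of each column from known-good rows, then check later rows against that schema. The functions `infer_schema` and `validate` below are hypothetical helpers written for this illustration; they are not DuckGuard's API, and a real contract typically covers more than column types.

```python
# Minimal data-contract sketch: column name -> expected Python type.
# infer_schema and validate are hypothetical helpers, not DuckGuard functions.
def infer_schema(rows):
    """Record the type of the first non-null value seen in each column."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            if value is not None and name not in schema:
                schema[name] = type(value)
    return schema


def validate(rows, schema):
    """Return a list of human-readable violations (empty means valid)."""
    violations = []
    for i, row in enumerate(rows):
        for name in sorted(schema.keys() - row.keys()):
            violations.append(f"row {i}: missing column {name}")
        for name, value in row.items():
            expected = schema.get(name)
            if expected is None:
                violations.append(f"row {i}: unexpected column {name}")
            elif value is not None and not isinstance(value, expected):
                violations.append(
                    f"row {i}: {name} expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


rows = [
    {"order_id": "ORD-001", "customer_id": "CUST-001", "amount": 99.99},
    {"order_id": "ORD-002", "customer_id": "CUST-002", "amount": 149.50},
    {"order_id": "ORD-003", "customer_id": None, "amount": 75.00},
]
schema = infer_schema(rows)
print(f"Valid: {not validate(rows, schema)}")

# A later batch where amount arrives as text breaks the contract
drifted = rows + [{"order_id": "ORD-006", "customer_id": "CUST-006", "amount": "12.50"}]
print(validate(drifted, schema))
```

Data validated against a contract generated from that same data passes by construction; the contract becomes useful when later batches drift from it.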

## Next Steps

- **GitHub**: https://github.com/XDataHubAI/duckguard
- **PyPI**: https://pypi.org/project/duckguard/
- **Full docs**: see `examples/getting_started.ipynb` in the GitHub repository

```bash
# CLI usage
duckguard check data.csv
duckguard discover data.csv --output rules.yaml
duckguard anomaly data.csv
```