# Data Quality with DuckGuard - 10x Faster Than Pandas

**DuckGuard** is a Python-native data quality tool built on DuckDB.

In this notebook, we'll use DuckGuard on a small Titanic-style sample dataset to:
1. Compute an instant quality score
2. Inspect column statistics
3. Detect semantic types and PII
4. Validate the data with YAML rules
5. Detect anomalies
6. Auto-generate rules

[![GitHub](https://img.shields.io/badge/GitHub-XDataHubAI%2Fduckguard-blue)](https://github.com/XDataHubAI/duckguard)
[![PyPI](https://img.shields.io/pypi/v/duckguard.svg)](https://pypi.org/project/duckguard/)

In [None]:
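# Install DuckGuard into the current kernel's environment
%pip install duckguard -q

# Confirm the install by reading the installed package version
from importlib.metadata import version
print(f"DuckGuard {version('duckguard')} installed")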

In [None]:
import pandas as pd

from duckguard import connect, detect_anomalies
from duckguard import load_rules_from_string, execute_rules
from duckguard.semantic import SemanticAnalyzer

# To run against the real Kaggle Titanic data instead, point connect()
# at the competition file (adjust the path as needed):
# data = connect("/kaggle/input/titanic/train.csv")

# For this demo, we build a small synthetic Titanic-style sample
df = pd.DataFrame({
    'PassengerId': range(1, 101),
    'Survived': [0, 1] * 50,
    'Pclass': [1, 2, 3, 1, 2] * 20,
    'Name': [f'Passenger {i}' for i in range(1, 101)],
    'Sex': ['male', 'female'] * 50,
    'Age': [25, 30, None, 45, 22] * 20,
    'Fare': [50.0, 75.5, 10.0, 200.0, 15.5] * 20,
    'Embarked': ['S', 'C', 'Q', 'S', None] * 20
})
df.to_csv('titanic_sample.csv', index=False)

data = connect('titanic_sample.csv')
print(f"Loaded {data.row_count} rows, {data.column_count} columns")

## 1. Instant Quality Score

In [None]:
quality = data.score()

print("=" * 50)
print("DATA QUALITY REPORT")
print("=" * 50)
print(f"\nOverall Score: {quality.overall:.1f}/100")
print(f"Grade: {quality.grade}")
print("\nDimension Scores:")
print(f"  Completeness: {quality.completeness:.1f}")
print(f"  Uniqueness: {quality.uniqueness:.1f}")
print(f"  Validity: {quality.validity:.1f}")
print(f"  Consistency: {quality.consistency:.1f}")
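
To make the completeness dimension concrete, here is a minimal sketch that computes the share of non-null values per column by hand. The `completeness` helper is hypothetical and the formula is an assumption for illustration. DuckGuard's actual scoring may weight or combine things differently.

```python
# Hypothetical completeness metric: percentage of non-null entries.
# Illustrative assumption only, not DuckGuard's actual scoring formula.

# Three columns copied from the sample data; two of them contain nulls.
columns = {
    "PassengerId": list(range(1, 101)),
    "Age": [25, 30, None, 45, 22] * 20,
    "Embarked": ["S", "C", "Q", "S", None] * 20,
}


def completeness(values):
    """Return the percentage of entries that are not None."""
    non_null = sum(v is not None for v in values)
    return 100.0 * non_null / len(values)


per_column = {name: completeness(vals) for name, vals in columns.items()}
for name, pct in per_column.items():
    print(f"{name}: {pct:.1f}% complete")
```

Since one value in five is missing in `Age` and `Embarked`, both come out at 80% complete, which is a useful sanity check against the reported score.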

## 2. Column Statistics

In [None]:
# Check for missing values
print("Missing Values:")
for col in data.columns:
    null_pct = getattr(data, col).null_percent
    if null_pct > 0:
        print(f"  {col}: {null_pct:.1f}% null")

print("\nFare Statistics:")
print(f"  Min: {data.Fare.min}")
print(f"  Max: {data.Fare.max}")
print(f"  Mean: {data.Fare.mean:.2f}")
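
Because the sample data is synthetic, we can cross-check these numbers independently. The sketch below recomputes the Fare statistics with the Python standard library only. It does not call DuckGuard, so it only shows what the values should be for this particular sample.

```python
from statistics import mean

# Same Fare values as in the sample DataFrame built earlier.
fares = [50.0, 75.5, 10.0, 200.0, 15.5] * 20

fare_min = min(fares)
fare_max = max(fares)
fare_mean = mean(fares)

print(f"Min: {fare_min}")        # 10.0
print(f"Max: {fare_max}")        # 200.0
print(f"Mean: {fare_mean:.2f}")  # 70.20
```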

## 3. Semantic Type Detection

In [None]:
from duckguard import detect_types_for_dataset

types = detect_types_for_dataset(data)
print("Detected Semantic Types:")
for col, sem_type in types.items():
    type_name = sem_type.value if sem_type else "generic"
    print(f"  {col}: {type_name}")

# Check for PII
analyzer = SemanticAnalyzer()
analysis = analyzer.analyze(data)
if analysis.pii_columns:
    print(f"\n⚠️ Potential PII in: {analysis.pii_columns}")
else:
    print("\n✓ No obvious PII detected")
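
One common way to spot likely PII is to look for telltale tokens in column names. The sketch below shows that idea with a hypothetical hint list and a hypothetical `looks_like_pii` helper. It is not how `SemanticAnalyzer` works internally, and a real detector would normally inspect values as well as names.

```python
import re

# Hypothetical name hints, chosen for illustration only.
PII_NAME_HINTS = ("name", "email", "phone", "ssn", "address")


def looks_like_pii(column_name):
    """Flag a column whose name contains a PII-related token."""
    # Split camelCase ("PassengerId" -> "Passenger Id"), then tokenize.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", column_name)
    tokens = re.findall(r"[a-z]+", spaced.lower())
    return any(tok in PII_NAME_HINTS for tok in tokens)


sample_columns = ["PassengerId", "Survived", "Pclass", "Name",
                  "Sex", "Age", "Fare", "Embarked"]
flagged = [c for c in sample_columns if looks_like_pii(c)]
print(flagged)  # ['Name']
```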

## 4. YAML Validation Rules

In [None]:
rules_yaml = """
dataset: titanic
rules:
  - PassengerId is not null
  - PassengerId is unique
  - Survived in [0, 1]
  - Pclass in [1, 2, 3]
  - Age >= 0
  - Fare >= 0
  - Sex in ['male', 'female']
"""

rules = load_rules_from_string(rules_yaml)
result = execute_rules(rules, dataset=data)

print(f"Results: {result.passed_count}/{result.total_checks} passed")
print(f"Quality Score: {result.quality_score:.1f}%")
print("\nDetails:")
for r in result.results:
    status = "✓ PASS" if r.passed else "✗ FAIL"
    print(f"  [{status}] {r.check.expression}")
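
It can help to read each rule as a predicate over the rows. The sketch below spells out four of the rules above in plain Python. The null handling in `non_negative` is an assumption (nulls are treated as "not a violation" of a range rule). Check how DuckGuard treats nulls before relying on `Age >= 0` for a column with missing values.

```python
# Two rows in the same shape as the demo data, one with a null Age.
rows = [
    {"PassengerId": 1, "Survived": 0, "Pclass": 1, "Age": 25, "Fare": 50.0},
    {"PassengerId": 3, "Survived": 0, "Pclass": 3, "Age": None, "Fare": 10.0},
]


def non_negative(value):
    # Assumption: a null does not violate a range rule; nullness is
    # covered separately by an explicit "is not null" rule.
    return value is None or value >= 0


# Each rule expression maps to a function that checks all rows.
checks = {
    "PassengerId is not null":
        lambda rs: all(r["PassengerId"] is not None for r in rs),
    "PassengerId is unique":
        lambda rs: len({r["PassengerId"] for r in rs}) == len(rs),
    "Survived in [0, 1]":
        lambda rs: all(r["Survived"] in (0, 1) for r in rs),
    "Age >= 0":
        lambda rs: all(non_negative(r["Age"]) for r in rs),
}

outcomes = {rule: check(rows) for rule, check in checks.items()}
for rule, passed in outcomes.items():
    print(f"[{'PASS' if passed else 'FAIL'}] {rule}")
```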

## 5. Anomaly Detection

In [None]:
report = detect_anomalies(data, method="iqr", threshold=1.5)

print("Anomaly Detection (IQR method):")
print(f"Columns checked: {report.statistics.get('columns_checked', 0)}")
print(f"Anomalies found: {report.anomaly_count}")

for a in report.anomalies:
    status = "⚠️ ANOMALY" if a.is_anomaly else "✓ OK"
    print(f"  {status} {a.column}: {a.message}")
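
For reference, the IQR rule flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`, with `k` matching the `threshold=1.5` passed above. The sketch below applies that rule to the sample Fare column using only the standard library. Quartile conventions vary between tools, so DuckGuard's fences may differ slightly from these.

```python
from statistics import quantiles

# Same Fare values as in the sample DataFrame built earlier.
fares = [50.0, 75.5, 10.0, 200.0, 15.5] * 20

# Quartiles with the standard library's default method.
q1, _median, q3 = quantiles(fares, n=4)
iqr = q3 - q1

threshold = 1.5
lower = q1 - threshold * iqr
upper = q3 + threshold * iqr

# Values outside the fences are treated as anomalies.
outliers = [f for f in fares if f < lower or f > upper]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {len(outliers)} values, distinct: {sorted(set(outliers))}")
```

On this sample the fences are -74.5 and 165.5, so only the 200.0 fares (20 rows) fall outside them.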

## 6. Auto-Generate Rules

In [None]:
from duckguard import generate_rules

# Auto-generate rules from data
generated = generate_rules(data, dataset_name="titanic")
print("Auto-Generated Rules:")
print(generated)
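
To give a feel for how rules can be inferred from data, here is a minimal sketch that proposes rules in the same expression style as the YAML above. The `infer_rules` helper and its `max_categories` cutoff are hypothetical. This is not `generate_rules`' actual algorithm, which may use different heuristics.

```python
def infer_rules(columns, max_categories=3):
    """Propose simple rules from observed values (illustrative heuristic)."""
    rules = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) == len(values):
            # No nulls observed, so propose a not-null rule.
            rules.append(f"{name} is not null")
            if len(set(values)) == len(values):
                rules.append(f"{name} is unique")
        distinct = sorted(set(present))
        if 1 < len(distinct) <= max_categories:
            # Few distinct values, so treat the column as categorical.
            rules.append(f"{name} in {distinct}")
    return rules


columns = {
    "PassengerId": list(range(1, 101)),
    "Survived": [0, 1] * 50,
    "Pclass": [1, 2, 3, 1, 2] * 20,
    "Age": [25, 30, None, 45, 22] * 20,
}

inferred = infer_rules(columns)
for rule in inferred:
    print(f"  - {rule}")
```

Inferred rules describe the data as it happens to be, not as it should be, so review them before adopting them as a contract. Here `Age` gets no rule at all because it has nulls and too many distinct values for a category rule.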

## Summary

DuckGuard provides:
- **Instant quality scores** with A-F grades
- **Semantic type detection** including PII
- **YAML-based validation rules**
- **Anomaly detection** with multiple methods
- **Fast execution** on DuckDB, which the project describes as 10x faster than pandas-based tools (not benchmarked in this notebook)

### Install & Learn More
```bash
pip install duckguard
```

- GitHub: https://github.com/XDataHubAI/duckguard
- PyPI: https://pypi.org/project/duckguard/