# Polars Data Quality and Validation - Comprehensive Guide

This notebook covers data quality assessment and validation techniques in Polars.

## What You'll Learn:
- Data profiling and statistical summaries
- Schema validation and enforcement
- Value range checks and constraints
- Data type validation
- Outlier detection methods
- Data quality scoring
- Assertions and testing
- Building data quality pipelines

In [None]:
import polars as pl
import numpy as np
from datetime import datetime, date, timedelta
from typing import Dict, List, Any

print(f"Polars version: {pl.__version__}")

---
# Part 1: Data Profiling

Understanding your data through statistical summaries.

## 1.1 Basic Statistical Summary

In [None]:
# Create sample dataset
np.random.seed(42)
n = 1000

df = pl.DataFrame({
    'customer_id': range(1, n+1),
    'age': np.random.randint(18, 80, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'purchase_amount': np.random.exponential(100, n),
    'satisfaction_score': np.random.randint(1, 11, n),
    'is_premium': np.random.choice([True, False], n, p=[0.3, 0.7]),
    'signup_date': [date(2024, 1, 1) + timedelta(days=int(np.random.randint(0, 365))) for _ in range(n)]
})

print("Sample dataset:")
print(df.head())
print(f"\nShape: {df.shape}")

In [None]:
# Basic describe() - statistical summary
print("Statistical summary:")
print(df.describe())

## 1.2 Comprehensive Data Profile

In [None]:
def profile_dataframe(df: pl.DataFrame) -> pl.DataFrame:
    """Generate comprehensive profile of DataFrame."""
    
    profiles = []
    
    for col in df.columns:
        col_data = df[col]
        dtype = col_data.dtype
        
        profile = {
            'column': col,
            'dtype': str(dtype),
            'count': len(col_data),
            'null_count': col_data.null_count(),
            'null_pct': round(col_data.null_count() / len(col_data) * 100, 2),
            'unique_count': col_data.n_unique(),
            'unique_pct': round(col_data.n_unique() / len(col_data) * 100, 2)
        }
        
        # Add statistics for numeric columns
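        # Cover every integer and float width so Int8/Int16/UInt8/UInt16 columns are profiled too
        if dtype in [pl.Int8, pl.Int16, pl.Int32, pl.Int64,
                     pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64,
                     pl.Float32, pl.Float64]: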
            profile.update({
                'mean': round(float(col_data.mean()), 2) if col_data.mean() is not None else None,
                'std': round(float(col_data.std()), 2) if col_data.std() is not None else None,
                'min': float(col_data.min()) if col_data.min() is not None else None,
                'q25': float(col_data.quantile(0.25)) if col_data.quantile(0.25) is not None else None,
                'median': float(col_data.median()) if col_data.median() is not None else None,
                'q75': float(col_data.quantile(0.75)) if col_data.quantile(0.75) is not None else None,
                'max': float(col_data.max()) if col_data.max() is not None else None
            })
        else:
            profile.update({
                'mean': None,
                'std': None,
                'min': None,
                'q25': None,
                'median': None,
                'q75': None,
                'max': None
            })
        
        profiles.append(profile)
    
    return pl.DataFrame(profiles)

profile = profile_dataframe(df)
print("Comprehensive data profile:")
print(profile)

## 1.3 Distribution Analysis

In [None]:
# Value counts for categorical columns
# pl.len() counts rows per group; it replaces the deprecated no-argument pl.count()
# (assumes a Polars release that provides pl.len())
print("Premium member distribution:")
print(df.group_by('is_premium').agg(pl.len().alias('count')).sort('is_premium'))

print("\nSatisfaction score distribution:")
satisfaction_dist = df.group_by('satisfaction_score').agg(
    pl.len().alias('count')
).sort('satisfaction_score')
print(satisfaction_dist)

In [None]:
# Quantile analysis for numeric columns
quantile_analysis = df.select([
    pl.col('age').quantile(0.01).alias('age_p01'),
    pl.col('age').quantile(0.05).alias('age_p05'),
    pl.col('age').quantile(0.25).alias('age_p25'),
    pl.col('age').quantile(0.50).alias('age_median'),
    pl.col('age').quantile(0.75).alias('age_p75'),
    pl.col('age').quantile(0.95).alias('age_p95'),
    pl.col('age').quantile(0.99).alias('age_p99'),
])

print("Age quantiles:")
print(quantile_analysis)

---
# Part 2: Schema Validation

## 2.1 Expected Schema Definition

In [None]:
# Define expected schema
expected_schema = {
    'customer_id': pl.Int64,
    'age': pl.Int64,
    'income': pl.Float64,
    'purchase_amount': pl.Float64,
    'satisfaction_score': pl.Int64,
    'is_premium': pl.Boolean,
    'signup_date': pl.Date
}

print("Expected schema:")
for col, dtype in expected_schema.items():
    print(f"  {col}: {dtype}")

In [None]:
def validate_schema(df: pl.DataFrame, expected_schema: Dict[str, pl.DataType]) -> Dict[str, Any]:
    """Validate DataFrame schema against expected schema.
    
    Returns:
        Dictionary with validation results and errors
    """
    actual_schema = df.schema
    errors = []
    warnings = []
    
    # Check missing columns
    missing_cols = set(expected_schema.keys()) - set(actual_schema.keys())
    if missing_cols:
        errors.append(f"Missing columns: {missing_cols}")
    
    # Check extra columns
    extra_cols = set(actual_schema.keys()) - set(expected_schema.keys())
    if extra_cols:
        warnings.append(f"Extra columns: {extra_cols}")
    
    # Check type mismatches
    for col in expected_schema:
        if col in actual_schema:
            if actual_schema[col] != expected_schema[col]:
                errors.append(
                    f"Type mismatch in '{col}': expected {expected_schema[col]}, "
                    f"got {actual_schema[col]}"
                )
    
    is_valid = len(errors) == 0
    
    return {
        'is_valid': is_valid,
        'errors': errors,
        'warnings': warnings
    }

# Validate
validation = validate_schema(df, expected_schema)

if validation['is_valid']:
    print("✅ Schema validation passed!")
else:
    print("❌ Schema validation failed!")
    for error in validation['errors']:
        print(f"  ERROR: {error}")

if validation['warnings']:
    for warning in validation['warnings']:
        print(f"  WARNING: {warning}")

## 2.2 Column Presence Validation

In [None]:
def validate_required_columns(df: pl.DataFrame, required_cols: List[str]) -> Dict[str, Any]:
    """Check that all required columns are present."""
    missing = set(required_cols) - set(df.columns)
    
    return {
        'is_valid': len(missing) == 0,
        'missing_columns': list(missing)
    }

# Test
required = ['customer_id', 'age', 'income', 'email']  # 'email' is missing
result = validate_required_columns(df, required)

if result['is_valid']:
    print("✅ All required columns present")
else:
    print(f"❌ Missing required columns: {result['missing_columns']}")

---
# Part 3: Value Range Validation

## 3.1 Numeric Range Checks

In [None]:
def validate_numeric_range(df: pl.DataFrame, col: str, min_val: float, max_val: float) -> Dict[str, Any]:
    """Validate numeric column is within expected range."""
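    # Comparisons against null evaluate to null, so null values are not counted as violations here
    out_of_range = df.filter(
        (pl.col(col) < min_val) | (pl.col(col) > max_val)
    )
    
    # min()/max() return None for an empty or all-null column, so guard before float()
    col_min, col_max = df[col].min(), df[col].max()
    actual_min = float(col_min) if col_min is not None else None
    actual_max = float(col_max) if col_max is not None else None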
    
    return {
        'is_valid': len(out_of_range) == 0,
        'violations': len(out_of_range),
        'expected_range': (min_val, max_val),
        'actual_range': (actual_min, actual_max),
        'violation_examples': out_of_range.head(5) if len(out_of_range) > 0 else None
    }

# Validate age range
age_validation = validate_numeric_range(df, 'age', 18, 100)

print("Age range validation:")
if age_validation['is_valid']:
    print("✅ All ages within valid range")
else:
    print(f"❌ {age_validation['violations']} values out of range")
    print(f"Expected: {age_validation['expected_range']}")
    print(f"Actual: {age_validation['actual_range']}")

In [None]:
# Validate satisfaction score
satisfaction_validation = validate_numeric_range(df, 'satisfaction_score', 1, 10)

print("Satisfaction score validation:")
if satisfaction_validation['is_valid']:
    print("✅ All scores within valid range (1-10)")
else:
    print(f"❌ {satisfaction_validation['violations']} scores out of range")

## 3.2 Date Range Validation

In [None]:
def validate_date_range(df: pl.DataFrame, col: str, min_date: date, max_date: date) -> Dict[str, Any]:
    """Validate date column is within expected range."""
    out_of_range = df.filter(
        (pl.col(col) < min_date) | (pl.col(col) > max_date)
    )
    
    return {
        'is_valid': len(out_of_range) == 0,
        'violations': len(out_of_range),
        'expected_range': (min_date, max_date),
        'violation_examples': out_of_range.head(5) if len(out_of_range) > 0 else None
    }

# Validate signup_date
date_validation = validate_date_range(
    df, 'signup_date', 
    date(2024, 1, 1), 
    date.today()
)

print("Signup date validation:")
if date_validation['is_valid']:
    print("✅ All dates within valid range")
else:
    print(f"❌ {date_validation['violations']} dates out of range")

## 3.3 Categorical Value Validation

In [None]:
def validate_categorical(df: pl.DataFrame, col: str, valid_values: List[Any]) -> Dict[str, Any]:
    """Validate categorical column only contains valid values."""
    invalid = df.filter(~pl.col(col).is_in(valid_values))
    unique_actual = df[col].unique().to_list()
    
    return {
        'is_valid': len(invalid) == 0,
        'violations': len(invalid),
        'valid_values': valid_values,
        'actual_values': unique_actual,
        'invalid_values': list(set(unique_actual) - set(valid_values)),
        'violation_examples': invalid.head(5) if len(invalid) > 0 else None
    }

# Example: Add status column and validate
df_with_status = df.with_columns(
    pl.when(pl.col('satisfaction_score') >= 8)
      .then(pl.lit('satisfied'))
      .when(pl.col('satisfaction_score') >= 5)
      .then(pl.lit('neutral'))
      .otherwise(pl.lit('unsatisfied'))
      .alias('status')
)

status_validation = validate_categorical(
    df_with_status, 
    'status', 
    ['satisfied', 'neutral', 'unsatisfied']
)

print("Status validation:")
if status_validation['is_valid']:
    print("✅ All status values valid")
    print(f"Values: {status_validation['actual_values']}")
else:
    print(f"❌ {status_validation['violations']} invalid values")
    print(f"Invalid values: {status_validation['invalid_values']}")

---
# Part 4: Outlier Detection

## 4.1 IQR Method (Interquartile Range)

In [None]:
def detect_outliers_iqr(df: pl.DataFrame, col: str, multiplier: float = 1.5) -> pl.DataFrame:
    """Detect outliers using IQR method.
    
    Outliers are values outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]
    """
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    
    outliers = df.filter(
        (pl.col(col) < lower_bound) | (pl.col(col) > upper_bound)
    )
    
    print(f"Outlier detection for '{col}' (IQR method):")
    print(f"  Q1: {q1:.2f}")
    print(f"  Q3: {q3:.2f}")
    print(f"  IQR: {iqr:.2f}")
    print(f"  Lower bound: {lower_bound:.2f}")
    print(f"  Upper bound: {upper_bound:.2f}")
    print(f"  Outliers found: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
    
    return outliers

# Detect outliers in income
income_outliers = detect_outliers_iqr(df, 'income')
print("\nOutlier examples:")
print(income_outliers.head())

## 4.2 Z-Score Method

In [None]:
def detect_outliers_zscore(df: pl.DataFrame, col: str, threshold: float = 3.0) -> pl.DataFrame:
    """Detect outliers using Z-score method.
    
    Outliers are values with |Z-score| > threshold (typically 3)
    """
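    mean = df[col].mean()
    std = df[col].std()
    
    # A constant, single-row, or all-null column has no usable spread; fail loudly
    # instead of producing inf/NaN Z-scores
    if std is None or std == 0:
        raise ValueError(f"Cannot compute Z-scores for '{col}': standard deviation is {std}")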
    
    df_with_zscore = df.with_columns(
        ((pl.col(col) - mean) / std).alias('z_score')
    )
    
    outliers = df_with_zscore.filter(
        pl.col('z_score').abs() > threshold
    )
    
    print(f"Outlier detection for '{col}' (Z-score method):")
    print(f"  Mean: {mean:.2f}")
    print(f"  Std Dev: {std:.2f}")
    print(f"  Threshold: {threshold}")
    print(f"  Outliers found: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
    
    return outliers

# Detect outliers in purchase_amount
purchase_outliers = detect_outliers_zscore(df, 'purchase_amount')
print("\nOutlier examples:")
print(purchase_outliers.select(['customer_id', 'purchase_amount', 'z_score']).head())

## 4.3 Percentile Method

In [None]:
def detect_outliers_percentile(df: pl.DataFrame, col: str, lower_pct: float = 0.01, upper_pct: float = 0.99) -> pl.DataFrame:
    """Detect outliers using percentile method.
    
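    Outliers are values below the lower_pct quantile or above the upper_pct
    quantile. Both arguments are fractions in [0, 1], e.g. 0.01 for the 1st
    percentile, so with the defaults roughly 2% of rows are flagged by construction.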
    """
    lower_bound = df[col].quantile(lower_pct)
    upper_bound = df[col].quantile(upper_pct)
    
    outliers = df.filter(
        (pl.col(col) < lower_bound) | (pl.col(col) > upper_bound)
    )
    
    print(f"Outlier detection for '{col}' (Percentile method):")
    print(f"  Lower bound (percentile {lower_pct*100:g}): {lower_bound:.2f}")
    print(f"  Upper bound (percentile {upper_pct*100:g}): {upper_bound:.2f}")
    print(f"  Outliers found: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
    
    return outliers

# Detect extreme age values
age_outliers = detect_outliers_percentile(df, 'age', 0.01, 0.99)
print("\nOutlier examples:")
print(age_outliers.head())

---
# Part 5: Data Quality Scoring

In [None]:
def calculate_quality_score(df: pl.DataFrame) -> Dict[str, float]:
    """Calculate overall data quality score (0-100).
    
    Factors:
    - Completeness (null percentage)
    - Uniqueness (duplicate percentage)
    - Consistency (type correctness)
    """
    total_cells = len(df) * len(df.columns)
    
    # Completeness: Percentage of non-null values
    null_cells = df.null_count().sum_horizontal()[0]
    completeness = (1 - null_cells / total_cells) * 100
    
    # Uniqueness: For columns that should be unique (e.g., customer_id)
    # Here we check if customer_id has duplicates
    if 'customer_id' in df.columns:
        duplicates = df['customer_id'].is_duplicated().sum()
        uniqueness = (1 - duplicates / len(df)) * 100
    else:
        uniqueness = 100
    
    # Consistency: Check if numeric columns are numeric, etc.
    # Simplified: assuming schema is correct
    consistency = 100
    
    # Overall score (weighted average)
    overall = (completeness * 0.4 + uniqueness * 0.3 + consistency * 0.3)
    
    return {
        'completeness': round(completeness, 2),
        'uniqueness': round(uniqueness, 2),
        'consistency': round(consistency, 2),
        'overall': round(overall, 2)
    }

quality = calculate_quality_score(df)

print("Data Quality Score:")
for metric, score in quality.items():
    print(f"  {metric.capitalize()}: {score}/100")

if quality['overall'] >= 90:
    print("\n✅ Excellent data quality!")
elif quality['overall'] >= 70:
    print("\n⚠️ Good data quality, some improvements needed")
else:
    print("\n❌ Poor data quality, significant cleanup required")

---
# Part 6: Assertions and Testing

## 6.1 Data Assertions

In [None]:
def assert_no_nulls(df: pl.DataFrame, columns: List[str] = None):
    """Assert that specified columns have no null values."""
    cols_to_check = columns if columns else df.columns
    
    for col in cols_to_check:
        null_count = df[col].null_count()
        assert null_count == 0, f"Column '{col}' has {null_count} null values"
    
    print(f"✅ No nulls in {cols_to_check}")

def assert_unique(df: pl.DataFrame, column: str):
    """Assert that column has only unique values."""
    duplicates = df[column].is_duplicated().sum()
    assert duplicates == 0, f"Column '{column}' has {duplicates} duplicates"
    print(f"✅ Column '{column}' has only unique values")

def assert_range(df: pl.DataFrame, column: str, min_val, max_val):
    """Assert that all values are within specified range."""
    violations = df.filter(
        (pl.col(column) < min_val) | (pl.col(column) > max_val)
    )
    assert len(violations) == 0, f"Column '{column}' has {len(violations)} values outside [{min_val}, {max_val}]"
    print(f"✅ All '{column}' values in range [{min_val}, {max_val}]")

# Run assertions
try:
    assert_no_nulls(df, ['customer_id', 'age'])
    assert_unique(df, 'customer_id')
    assert_range(df, 'age', 18, 100)
    assert_range(df, 'satisfaction_score', 1, 10)
    print("\n✅ All assertions passed!")
except AssertionError as e:
    print(f"\n❌ Assertion failed: {e}")

## 6.2 Data Quality Tests

In [None]:
class DataQualityTest:
    """Data quality test suite."""
    
    def __init__(self, df: pl.DataFrame):
        self.df = df
        self.results = []
    
    def test_no_duplicates(self, column: str) -> bool:
        """Test: No duplicates in column."""
        duplicates = self.df[column].is_duplicated().sum()
        passed = duplicates == 0
        self.results.append({
            'test': f'no_duplicates_{column}',
            'passed': passed,
            'message': f"Found {duplicates} duplicates" if not passed else "OK"
        })
        return passed
    
    def test_no_nulls(self, column: str) -> bool:
        """Test: No nulls in column."""
        nulls = self.df[column].null_count()
        passed = nulls == 0
        self.results.append({
            'test': f'no_nulls_{column}',
            'passed': passed,
            'message': f"Found {nulls} nulls" if not passed else "OK"
        })
        return passed
    
    def test_value_range(self, column: str, min_val, max_val) -> bool:
        """Test: Values within expected range."""
        violations = self.df.filter(
            (pl.col(column) < min_val) | (pl.col(column) > max_val)
        )
        passed = len(violations) == 0
        self.results.append({
            'test': f'range_{column}',
            'passed': passed,
            'message': f"Found {len(violations)} out of range" if not passed else "OK"
        })
        return passed
    
    def test_data_types(self, expected_schema: Dict[str, pl.DataType]) -> bool:
        """Test: Correct data types."""
        mismatches = []
        for col, expected_type in expected_schema.items():
            if col in self.df.schema:
                if self.df.schema[col] != expected_type:
                    mismatches.append(col)
        
        passed = len(mismatches) == 0
        self.results.append({
            'test': 'data_types',
            'passed': passed,
            'message': f"Type mismatches: {mismatches}" if not passed else "OK"
        })
        return passed
    
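    def run_all_tests(self, expected_schema: Dict[str, pl.DataType]) -> pl.DataFrame:
        """Run all tests and return results."""
        # Reset so repeated runs do not accumulate duplicate result rows
        self.results = []
        
        # Run tests
        self.test_no_duplicates('customer_id')
        self.test_no_nulls('customer_id')
        self.test_no_nulls('age')
        self.test_value_range('age', 18, 100)
        self.test_value_range('satisfaction_score', 1, 10)
        # The schema is passed in explicitly instead of being read from a global variable
        self.test_data_types(expected_schema)
        
        return pl.DataFrame(self.results)

# Run test suite
tester = DataQualityTest(df)
test_results = tester.run_all_tests(expected_schema)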

print("Data Quality Test Results:")
print(test_results)

# Summary
total_tests = len(test_results)
passed_tests = test_results.filter(pl.col('passed')).height
print(f"\n{passed_tests}/{total_tests} tests passed")

if passed_tests == total_tests:
    print("✅ All tests passed!")
else:
    print("❌ Some tests failed")
    print("\nFailed tests:")
    print(test_results.filter(~pl.col('passed')))

---
# Part 7: Complete Data Quality Pipeline

In [None]:
def comprehensive_data_quality_report(df: pl.DataFrame, 
                                       expected_schema: Dict[str, pl.DataType] = None) -> Dict:
    """Generate comprehensive data quality report."""
    
    report = {
        'dataset_info': {
            'rows': len(df),
            'columns': len(df.columns),
            'total_cells': len(df) * len(df.columns)
        },
        'profile': profile_dataframe(df),
        'quality_score': calculate_quality_score(df),
        'issues': []
    }
    
    # Check for issues
    # 1. Null values
    for col in df.columns:
        null_count = df[col].null_count()
        if null_count > 0:
            report['issues'].append({
                'type': 'nulls',
                'column': col,
                'count': null_count,
                'percentage': round(null_count / len(df) * 100, 2)
            })
    
    # 2. Duplicates in key columns
    if 'customer_id' in df.columns:
        dup_count = df['customer_id'].is_duplicated().sum()
        if dup_count > 0:
            report['issues'].append({
                'type': 'duplicates',
                'column': 'customer_id',
                'count': dup_count
            })
    
    # 3. Schema validation (if expected schema provided)
    if expected_schema:
        schema_result = validate_schema(df, expected_schema)
        if not schema_result['is_valid']:
            for error in schema_result['errors']:
                report['issues'].append({
                    'type': 'schema',
                    'message': error
                })
    
    return report

# Generate comprehensive report
report = comprehensive_data_quality_report(df, expected_schema)

print("=" * 60)
print("COMPREHENSIVE DATA QUALITY REPORT")
print("=" * 60)

print("\n📊 Dataset Info:")
for key, value in report['dataset_info'].items():
    print(f"  {key}: {value:,}")

print("\n🎯 Quality Score:")
for metric, score in report['quality_score'].items():
    print(f"  {metric.capitalize()}: {score}/100")

print("\n⚠️ Issues Found:")
if report['issues']:
    for i, issue in enumerate(report['issues'], 1):
        print(f"  {i}. {issue}")
else:
    print("  ✅ No issues found!")

print("\n📈 Column Profile:")
print(report['profile'])

print("\n" + "=" * 60)

---
# Summary

## Data Quality Dimensions:

| Dimension | What to Check | Methods |
|-----------|---------------|----------|
| **Completeness** | Missing values | null_count(), fill strategies |
| **Uniqueness** | Duplicates | is_duplicated(), unique() |
| **Validity** | Value ranges, types | Range checks, schema validation |
| **Consistency** | Data type correctness | Type validation, format checks |
| **Accuracy** | Outliers, anomalies | IQR, Z-score, percentile methods |

## Best Practices:

1. ✅ **Profile early** - Understand data before processing
2. ✅ **Define expectations** - Schema, ranges, constraints
3. ✅ **Automate validation** - Test suite for data pipelines
4. ✅ **Score quality** - Track quality over time
5. ✅ **Document issues** - Log all data quality problems
6. ✅ **Use assertions** - Fail fast on bad data
7. ✅ **Monitor outliers** - Multiple detection methods
8. ✅ **Validate schema** - Type safety prevents errors

## Common Patterns:
```python
# Profile data
df.describe()
df.null_count()

# Validate schema
assert df.schema == expected_schema

# Check ranges
assert df.filter((pl.col('age') < 0) | (pl.col('age') > 120)).height == 0

# Detect outliers
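# One quantile per call; passing a list to Series.quantile is not supported in all Polars versions
q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)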
iqr = q3 - q1
outliers = df.filter((pl.col('value') < q1 - 1.5*iqr) | (pl.col('value') > q3 + 1.5*iqr))
```

## Data Quality Pipeline:
1. **Profile** - Understand data distribution
2. **Validate** - Check schema and constraints
3. **Clean** - Fix issues (nulls, duplicates)
4. **Monitor** - Track quality metrics
5. **Alert** - Notify on quality degradation

---
# Practice Exercises

In [None]:
# Exercise 1: Create a custom validation function
# TODO: Write a function to validate email format using regex


In [None]:
# Exercise 2: Build a data quality dashboard
# TODO: Create a summary showing all quality metrics


In [None]:
# Exercise 3: Implement custom outlier detection
# TODO: Combine IQR and Z-score methods for robust detection
