## **Feature:** Data Validation

**Names:** Gia Bao Ngo
### **What it does**
Provides comprehensive data validation capabilities including email format validation, phone number validation, numeric range validation, cross-column consistency checks, categorical value validation, and generates detailed validation reports with data quality scoring.

In [1]:
# Load dotenv
import os
from dotenv import load_dotenv
load_dotenv()

# Get API Key
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    print("OpenAI API Key not found")

# Import libraries
from pathlib import Path
import pandas as pd
import numpy as np
# Additional imports for validation
import math
import re
import datetime
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')

# Langchain imports
from langchain_openai import ChatOpenAI  
from langchain.schema import HumanMessage, SystemMessage

### **Helper Functions**
- `validate_email_format(series)` - Check email format validity using regex patterns
- `validate_phone_format(series, country_code=None)` - Phone number validation with international support
- `validate_numeric_ranges(df, column, min_val=None, max_val=None)` - Range validation with boundary checks
- `check_data_consistency(df)` - Cross-column consistency checks
- `validate_categorical_values(df, column, allowed_values)` - Check against allowed value lists
- `generate_validation_report(df, rules_dict)` - Comprehensive validation report with quality scoring

In [2]:
def validate_email_format(series):
    """
    Check email format validity using regex patterns.
    
    Parameters:
    - series: pandas Series containing email addresses
    
    Returns:
    - Dictionary with validation results and statistics
    """
    # Comprehensive email regex pattern
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    
    # Remove null values for validation
    non_null_series = series.dropna()
    total_count = len(series)
    non_null_count = len(non_null_series)
    null_count = total_count - non_null_count
    
    if non_null_count == 0:
        return {
            'valid_emails': 0,
            'invalid_emails': 0,
            'null_count': null_count,
            'total_count': total_count,
            'validity_rate': 0.0,
            'invalid_samples': [],
            'common_issues': []
        }
    
    # Convert to string and validate
    email_strings = non_null_series.astype(str).str.strip().str.lower()
    valid_mask = email_strings.str.match(email_pattern, na=False)
    
    valid_count = valid_mask.sum()
    invalid_count = non_null_count - valid_count
    validity_rate = (valid_count / non_null_count) * 100
    
    # Get invalid email samples
    invalid_emails = email_strings[~valid_mask].head(10).tolist()
    
    # Analyze common issues
    common_issues = []
    if invalid_count > 0:
        invalid_series = email_strings[~valid_mask]
        
        # Check for missing @ symbol
        missing_at = invalid_series.str.contains('@', na=False).sum()
        if missing_at < len(invalid_series):
            common_issues.append(f"Missing @ symbol: {len(invalid_series) - missing_at} cases")
        
        # Check for missing domain
        has_at = invalid_series.str.contains('@', na=False)
        if has_at.any():
            at_emails = invalid_series[has_at]
            missing_domain = at_emails.str.split('@').str[1].str.contains(r'\.', na=False).sum()
            if missing_domain < len(at_emails):
                common_issues.append(f"Missing domain extension: {len(at_emails) - missing_domain} cases")
        
        # Check for whitespace issues
        has_whitespace = invalid_series.str.contains(r'\s', na=False).sum()
        if has_whitespace > 0:
            common_issues.append(f"Contains whitespace: {has_whitespace} cases")
        
        # Check for multiple @ symbols
        multiple_at = invalid_series.str.count('@') > 1
        if multiple_at.any():
            common_issues.append(f"Multiple @ symbols: {multiple_at.sum()} cases")
    
    print(f"=== EMAIL VALIDATION RESULTS ===")
    print(f"Total records: {total_count}")
    print(f"Non-null records: {non_null_count}")
    print(f"Valid emails: {valid_count}")
    print(f"Invalid emails: {invalid_count}")
    print(f"Null emails: {null_count}")
    print(f"Validity rate: {validity_rate:.1f}%")
    
    if invalid_emails:
        print(f"\nSample invalid emails:")
        for email in invalid_emails[:5]:
            print(f"  {email}")
    
    if common_issues:
        print(f"\nCommon issues found:")
        for issue in common_issues:
            print(f"  {issue}")
    
    return {
        'valid_emails': valid_count,
        'invalid_emails': invalid_count,
        'null_count': null_count,
        'total_count': total_count,
        'validity_rate': validity_rate,
        'invalid_samples': invalid_emails,
        'common_issues': common_issues
    }

In [3]:
def validate_phone_format(series, country_code=None):
    """
    Phone number validation with international support using regex patterns.
    
    Parameters:
    - series: pandas Series containing phone numbers
    - country_code: country code for validation (e.g., 'US', 'GB', None for international)
    
    Returns:
    - Dictionary with validation results and statistics
    """
    # Remove null values for validation
    non_null_series = series.dropna()
    total_count = len(series)
    non_null_count = len(non_null_series)
    null_count = total_count - non_null_count
    
    if non_null_count == 0:
        return {
            'valid_phones': 0,
            'invalid_phones': 0,
            'null_count': null_count,
            'total_count': total_count,
            'validity_rate': 0.0,
            'invalid_samples': [],
            'common_issues': []
        }
    
    # Convert to string and clean
    phone_strings = non_null_series.astype(str).str.strip()
    
    # Define phone patterns
    patterns = {
        'US': [
            r'^\+1[2-9]\d{2}[2-9]\d{2}\d{4}$',  # +1XXXXXXXXXX
            r'^1[2-9]\d{2}[2-9]\d{2}\d{4}$',    # 1XXXXXXXXXX
            r'^[2-9]\d{2}[2-9]\d{2}\d{4}$',     # XXXXXXXXXX
            r'^\([2-9]\d{2}\)\s?[2-9]\d{2}-\d{4}$',  # (XXX) XXX-XXXX
            r'^[2-9]\d{2}-[2-9]\d{2}-\d{4}$',   # XXX-XXX-XXXX
            r'^[2-9]\d{2}\.[2-9]\d{2}\.\d{4}$'  # XXX.XXX.XXXX
        ],
        'international': [
            r'^\+\d{1,3}\d{4,14}$',  # +CCXXXXXXXXX (country code + number)
            r'^\d{7,15}$'            # Basic number validation
        ]
    }
    
    # Choose patterns based on country code
    if country_code == 'US':
        validation_patterns = patterns['US']
    else:
        validation_patterns = patterns['international']
    
    # Clean phone numbers for validation
    cleaned_phones = phone_strings.str.replace(r'[\s\-\(\)\.]', '', regex=True)
    
    valid_count = 0
    invalid_phones = []
    common_issues = []
    
    for phone in cleaned_phones:
        is_valid = False
        for pattern in validation_patterns:
            # For pattern matching, use original format
            original_phone = phone_strings[cleaned_phones == phone].iloc[0] if len(phone_strings[cleaned_phones == phone]) > 0 else phone
            
            if re.match(pattern, phone) or re.match(pattern, original_phone):
                is_valid = True
                break
        
        if is_valid:
            valid_count += 1
        else:
            if len(invalid_phones) < 10:
                invalid_phones.append(original_phone if 'original_phone' in locals() else phone)
    
    invalid_count = non_null_count - valid_count
    validity_rate = (valid_count / non_null_count) * 100
    
    # Analyze common issues
    if invalid_count > 0:
        invalid_series = phone_strings[~phone_strings.isin([p for p in phone_strings if any(re.match(pat, p.replace(r'[\s\-\(\)\.]', '')) or re.match(pat, p) for pat in validation_patterns)])]
        
        # Check for too short numbers
        too_short = cleaned_phones.str.len() < 7
        if too_short.any():
            common_issues.append(f"Too short (< 7 digits): {too_short.sum()} cases")
        
        # Check for too long numbers
        too_long = cleaned_phones.str.len() > 15
        if too_long.any():
            common_issues.append(f"Too long (> 15 digits): {too_long.sum()} cases")
        
        # Check for non-numeric characters
        has_letters = phone_strings.str.contains(r'[a-zA-Z]', na=False)
        if has_letters.any():
            common_issues.append(f"Contains letters: {has_letters.sum()} cases")
        
        # Check for missing country code (international)
        if country_code != 'US':
            missing_plus = ~phone_strings.str.startswith('+')
            if missing_plus.any():
                common_issues.append(f"Missing country code (+): {missing_plus.sum()} cases")
    
    print(f"=== PHONE VALIDATION RESULTS ===")
    print(f"Country code: {country_code or 'International'}")
    print(f"Total records: {total_count}")
    print(f"Non-null records: {non_null_count}")
    print(f"Valid phones: {valid_count}")
    print(f"Invalid phones: {invalid_count}")
    print(f"Null phones: {null_count}")
    print(f"Validity rate: {validity_rate:.1f}%")
    
    if invalid_phones:
        print(f"\nSample invalid phones:")
        for phone in invalid_phones[:5]:
            print(f"  {phone}")
    
    if common_issues:
        print(f"\nCommon issues found:")
        for issue in common_issues:
            print(f"  {issue}")
    
    return {
        'valid_phones': valid_count,
        'invalid_phones': invalid_count,
        'null_count': null_count,
        'total_count': total_count,
        'validity_rate': validity_rate,
        'invalid_samples': invalid_phones,
        'common_issues': common_issues
    }

In [4]:
def validate_numeric_ranges(df, column, min_val=None, max_val=None):
    """
    Range validation with boundary checks for numeric columns.
    
    Parameters:
    - df: pandas DataFrame
    - column: column name to validate
    - min_val: minimum allowed value (None = no minimum)
    - max_val: maximum allowed value (None = no maximum)
    
    Returns:
    - Dictionary with validation results
    """
    if column not in df.columns:
        print(f"Error: Column '{column}' not found in DataFrame")
        return None
    
    series = df[column]
    
    # Check if column is numeric
    if not pd.api.types.is_numeric_dtype(series):
        print(f"Error: Column '{column}' is not numeric (dtype: {series.dtype})")
        return None
    
    total_count = len(series)
    non_null_series = series.dropna()
    non_null_count = len(non_null_series)
    null_count = total_count - non_null_count
    
    if non_null_count == 0:
        return {
            'column': column,
            'total_count': total_count,
            'null_count': null_count,
            'valid_count': 0,
            'invalid_count': 0,
            'validity_rate': 0.0,
            'out_of_range_samples': []
        }
    
    # Apply range validation
    valid_mask = pd.Series([True] * non_null_count, index=non_null_series.index)
    
    below_min_count = 0
    above_max_count = 0
    out_of_range_samples = []
    
    if min_val is not None:
        below_min_mask = non_null_series < min_val
        below_min_count = below_min_mask.sum()
        valid_mask &= ~below_min_mask
        
        # Get samples of values below minimum
        below_min_values = non_null_series[below_min_mask].head(5).tolist()
        out_of_range_samples.extend([f"Below min ({min_val}): {val}" for val in below_min_values])
    
    if max_val is not None:
        above_max_mask = non_null_series > max_val
        above_max_count = above_max_mask.sum()
        valid_mask &= ~above_max_mask
        
        # Get samples of values above maximum
        above_max_values = non_null_series[above_max_mask].head(5).tolist()
        out_of_range_samples.extend([f"Above max ({max_val}): {val}" for val in above_max_values])
    
    valid_count = valid_mask.sum()
    invalid_count = non_null_count - valid_count
    validity_rate = (valid_count / non_null_count) * 100
    
    # Statistics
    series_stats = {
        'min': non_null_series.min(),
        'max': non_null_series.max(),
        'mean': non_null_series.mean(),
        'std': non_null_series.std()
    }
    
    print(f"=== NUMERIC RANGE VALIDATION: {column} ===")
    print(f"Range constraints: {min_val if min_val is not None else 'No min'} to {max_val if max_val is not None else 'No max'}")
    print(f"Total records: {total_count}")
    print(f"Non-null records: {non_null_count}")
    print(f"Valid values: {valid_count}")
    print(f"Invalid values: {invalid_count}")
    print(f"Null values: {null_count}")
    print(f"Validity rate: {validity_rate:.1f}%")
    
    if min_val is not None and below_min_count > 0:
        print(f"Values below minimum ({min_val}): {below_min_count}")
    if max_val is not None and above_max_count > 0:
        print(f"Values above maximum ({max_val}): {above_max_count}")
    
    print(f"\nData statistics:")
    print(f"  Actual range: {series_stats['min']:.2f} to {series_stats['max']:.2f}")
    print(f"  Mean: {series_stats['mean']:.2f}")
    print(f"  Std Dev: {series_stats['std']:.2f}")
    
    if out_of_range_samples:
        print(f"\nOut-of-range samples:")
        for sample in out_of_range_samples[:10]:
            print(f"  {sample}")
    
    return {
        'column': column,
        'total_count': total_count,
        'null_count': null_count,
        'valid_count': valid_count,
        'invalid_count': invalid_count,
        'validity_rate': validity_rate,
        'below_min_count': below_min_count,
        'above_max_count': above_max_count,
        'out_of_range_samples': out_of_range_samples,
        'statistics': series_stats
    }

In [5]:
def check_data_consistency(df):
    """
    Cross-column consistency checks to identify logical inconsistencies in data.
    
    Parameters:
    - df: pandas DataFrame
    
    Returns:
    - Dictionary with consistency check results
    """
    consistency_issues = []
    total_checks = 0
    failed_checks = 0
    
    print(f"=== DATA CONSISTENCY CHECK ===")
    print(f"DataFrame shape: {df.shape}")
    
    # Check 1: Date consistency (if date columns exist)
    date_columns = df.select_dtypes(include=['datetime64']).columns.tolist()
    if len(date_columns) >= 2:
        for i in range(len(date_columns)):
            for j in range(i+1, len(date_columns)):
                col1, col2 = date_columns[i], date_columns[j]
                
                # Check if there are logical date relationships
                if 'start' in col1.lower() and 'end' in col2.lower():
                    total_checks += 1
                    invalid_dates = df[df[col1] > df[col2]].dropna(subset=[col1, col2])
                    if len(invalid_dates) > 0:
                        failed_checks += 1
                        consistency_issues.append({
                            'type': 'date_logic',
                            'description': f'{col1} after {col2}',
                            'count': len(invalid_dates),
                            'sample_indices': invalid_dates.index[:5].tolist()
                        })
                        print(f"❌ Date logic issue: {len(invalid_dates)} records where {col1} > {col2}")
    
    # Check 2: Numeric relationships
    numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()
    
    # Look for potential percentage columns that should sum to 100
    potential_percentage_cols = [col for col in numeric_columns if 'percent' in col.lower() or '%' in col]
    if len(potential_percentage_cols) >= 2:
        total_checks += 1
        # Check if rows sum to approximately 100
        row_sums = df[potential_percentage_cols].sum(axis=1)
        tolerance = 5  # 5% tolerance
        invalid_sums = df[(row_sums < (100 - tolerance)) | (row_sums > (100 + tolerance))].dropna(subset=potential_percentage_cols)
        
        if len(invalid_sums) > 0:
            failed_checks += 1
            consistency_issues.append({
                'type': 'percentage_sum',
                'description': f'Percentage columns {potential_percentage_cols} do not sum to ~100%',
                'count': len(invalid_sums),
                'sample_indices': invalid_sums.index[:5].tolist()
            })
            print(f"❌ Percentage sum issue: {len(invalid_sums)} records with percentage sums outside 95-105%")
    
    # Check 3: Age consistency (if birth date and age columns exist)
    age_cols = [col for col in df.columns if 'age' in col.lower()]
    birth_cols = [col for col in date_columns if 'birth' in col.lower() or 'dob' in col.lower()]
    
    if age_cols and birth_cols and len(age_cols) > 0 and len(birth_cols) > 0:
        for age_col in age_cols:
            for birth_col in birth_cols:
                if pd.api.types.is_numeric_dtype(df[age_col]):
                    total_checks += 1
                    # Calculate expected age from birth date
                    current_date = datetime.datetime.now()
                    expected_ages = (current_date - df[birth_col]).dt.days / 365.25
                    age_diff = abs(df[age_col] - expected_ages)
                    
                    # Allow 1 year tolerance
                    invalid_ages = df[age_diff > 1].dropna(subset=[age_col, birth_col])
                    if len(invalid_ages) > 0:
                        failed_checks += 1
                        consistency_issues.append({
                            'type': 'age_birth_mismatch',
                            'description': f'{age_col} inconsistent with {birth_col}',
                            'count': len(invalid_ages),
                            'sample_indices': invalid_ages.index[:5].tolist()
                        })
                        print(f"❌ Age-birth date mismatch: {len(invalid_ages)} records with >1 year difference")
    
    # Check 4: Geographic consistency (if state/country columns exist)
    geo_columns = [col for col in df.columns if any(geo_term in col.lower() for geo_term in ['state', 'country', 'city', 'zip', 'postal'])]
    
    # Check for impossible zip codes vs states (US example)
    zip_cols = [col for col in geo_columns if 'zip' in col.lower() or 'postal' in col.lower()]
    state_cols = [col for col in geo_columns if 'state' in col.lower()]
    
    if zip_cols and state_cols:
        for zip_col in zip_cols:
            for state_col in state_cols:
                total_checks += 1
                # Basic zip code format check for US
                us_states = ['CA', 'NY', 'TX', 'FL', 'IL', 'PA', 'OH', 'GA', 'NC', 'MI']  # Sample states
                state_zip_ranges = {
                    'CA': [(90000, 96999)],
                    'NY': [(10000, 14999)],
                    'TX': [(73000, 79999), (75000, 79999)],
                    'FL': [(32000, 34999)],
                }
                
                # This is a simplified check - in practice you'd have a complete mapping
                inconsistent_zip_state = 0
                for state, zip_ranges in state_zip_ranges.items():
                    state_data = df[df[state_col].str.upper() == state]
                    if len(state_data) > 0:
                        for zip_range in zip_ranges:
                            zip_nums = pd.to_numeric(state_data[zip_col].astype(str).str[:5], errors='coerce')
                            invalid_zips = state_data[~((zip_nums >= zip_range[0]) & (zip_nums <= zip_range[1]))].dropna(subset=[zip_col])
                            inconsistent_zip_state += len(invalid_zips)
                
                if inconsistent_zip_state > 0:
                    failed_checks += 1
                    consistency_issues.append({
                        'type': 'geographic_mismatch',
                        'description': f'Zip codes inconsistent with states',
                        'count': inconsistent_zip_state,
                        'sample_indices': []
                    })
                    print(f"❌ Geographic inconsistency: ~{inconsistent_zip_state} zip-state mismatches detected")
    
    # Check 5: Duplicate ID checks
    id_columns = [col for col in df.columns if any(id_term in col.lower() for id_term in ['id', 'key', 'uuid'])]
    for id_col in id_columns:
        if id_col.lower() in ['id', 'user_id', 'customer_id', 'primary_key']:
            total_checks += 1
            duplicates = df[df[id_col].duplicated()].dropna(subset=[id_col])
            if len(duplicates) > 0:
                failed_checks += 1
                consistency_issues.append({
                    'type': 'duplicate_ids',
                    'description': f'Duplicate values in ID column {id_col}',
                    'count': len(duplicates),
                    'sample_indices': duplicates.index[:5].tolist()
                })
                print(f"❌ Duplicate IDs: {len(duplicates)} duplicate values in {id_col}")
    
    # Summary
    consistency_rate = ((total_checks - failed_checks) / total_checks * 100) if total_checks > 0 else 100
    
    print(f"\n=== CONSISTENCY SUMMARY ===")
    print(f"Total consistency checks performed: {total_checks}")
    print(f"Checks passed: {total_checks - failed_checks}")
    print(f"Checks failed: {failed_checks}")
    print(f"Consistency rate: {consistency_rate:.1f}%")
    
    if consistency_issues:
        print(f"\nIssues found:")
        for issue in consistency_issues:
            print(f"  • {issue['description']}: {issue['count']} records")
    else:
        print("✅ No consistency issues detected!")
    
    return {
        'total_checks': total_checks,
        'passed_checks': total_checks - failed_checks,
        'failed_checks': failed_checks,
        'consistency_rate': consistency_rate,
        'issues': consistency_issues
    }

In [6]:
def validate_categorical_values(df, column, allowed_values):
    """
    Check categorical values against a list of allowed values.
    
    Parameters:
    - df: pandas DataFrame
    - column: column name to validate
    - allowed_values: list/set of allowed categorical values
    
    Returns:
    - Dictionary with validation results
    """
    if column not in df.columns:
        print(f"Error: Column '{column}' not found in DataFrame")
        return None
    
    series = df[column]
    total_count = len(series)
    non_null_series = series.dropna()
    non_null_count = len(non_null_series)
    null_count = total_count - non_null_count
    
    if non_null_count == 0:
        return {
            'column': column,
            'total_count': total_count,
            'null_count': null_count,
            'valid_count': 0,
            'invalid_count': 0,
            'validity_rate': 0.0,
            'invalid_values': [],
            'allowed_values': list(allowed_values)
        }
    
    # Convert allowed values to set for faster lookup
    allowed_set = set(allowed_values)
    
    # Convert series values to string for comparison
    series_values = non_null_series.astype(str)
    
    # Find valid and invalid values
    valid_mask = series_values.isin([str(val) for val in allowed_set])
    valid_count = valid_mask.sum()
    invalid_count = non_null_count - valid_count
    validity_rate = (valid_count / non_null_count) * 100
    
    # Get unique invalid values
    invalid_values = series_values[~valid_mask].unique().tolist()
    invalid_counts = series_values[~valid_mask].value_counts().head(10).to_dict()
    
    # Get unique values in the column
    unique_values = series_values.unique()
    unique_count = len(unique_values)
    
    # Suggest potential matches for invalid values (fuzzy matching)
    suggestions = {}
    if invalid_values:
        from difflib import get_close_matches
        for invalid_val in invalid_values[:5]:  # Only check top 5 invalid values
            matches = get_close_matches(str(invalid_val), [str(v) for v in allowed_values], n=3, cutoff=0.6)
            if matches:
                suggestions[invalid_val] = matches
    
    print(f"=== CATEGORICAL VALIDATION: {column} ===")
    print(f"Total records: {total_count}")
    print(f"Non-null records: {non_null_count}")
    print(f"Valid values: {valid_count}")
    print(f"Invalid values: {invalid_count}")
    print(f"Null values: {null_count}")
    print(f"Validity rate: {validity_rate:.1f}%")
    print(f"Unique values found: {unique_count}")
    print(f"Allowed values: {len(allowed_values)}")
    
    if invalid_values:
        print(f"\nInvalid values found:")
        for invalid_val, count in list(invalid_counts.items())[:10]:
            print(f"  '{invalid_val}': {count} occurrences")
        
        if suggestions:
            print(f"\nPossible corrections:")
            for invalid_val, matches in suggestions.items():
                print(f"  '{invalid_val}' → {matches}")
    
    # Show allowed values if reasonable number
    if len(allowed_values) <= 20:
        print(f"\nAllowed values: {sorted(list(allowed_set))}")
    else:
        print(f"\nAllowed values (sample): {sorted(list(allowed_set))[:10]}... and {len(allowed_values)-10} more")
    
    return {
        'column': column,
        'total_count': total_count,
        'null_count': null_count,
        'valid_count': valid_count,
        'invalid_count': invalid_count,
        'validity_rate': validity_rate,
        'invalid_values': invalid_values,
        'invalid_counts': invalid_counts,
        'unique_count': unique_count,
        'allowed_values': list(allowed_values),
        'suggestions': suggestions
    }

In [7]:
def generate_validation_report(df, rules_dict):
    """
    Generate a comprehensive validation report based on provided rules.
    
    Parameters:
    - df: pandas DataFrame
    - rules_dict: dictionary with validation rules
        Example: {
            'email_columns': ['email', 'contact_email'],
            'phone_columns': ['phone', 'mobile'],
            'range_rules': {'age': {'min': 0, 'max': 150}, 'salary': {'min': 0}},
            'categorical_rules': {'status': ['active', 'inactive'], 'category': ['A', 'B', 'C']},
            'required_columns': ['id', 'name', 'email']
        }
    
    Returns:
    - Dictionary with comprehensive validation report
    """
    print(f"=== COMPREHENSIVE VALIDATION REPORT ===")
    print(f"DataFrame shape: {df.shape}")
    print(f"Validation started at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    validation_results = {
        'metadata': {
            'shape': df.shape,
            'total_records': len(df),
            'total_columns': len(df.columns),
            'validation_timestamp': datetime.datetime.now().isoformat()
        },
        'column_validations': {},
        'consistency_check': {},
        'data_quality_score': 0.0,
        'summary': {
            'total_issues': 0,
            'critical_issues': 0,
            'warnings': 0,
            'recommendations': []
        }
    }
    
    total_validation_points = 0
    passed_validation_points = 0
    
    # 1. Required columns check
    if 'required_columns' in rules_dict:
        print(f"\n--- Required Columns Check ---")
        required_cols = rules_dict['required_columns']
        missing_required = [col for col in required_cols if col not in df.columns]
        
        if missing_required:
            print(f"❌ Missing required columns: {missing_required}")
            validation_results['summary']['critical_issues'] += len(missing_required)
        else:
            print(f"✅ All required columns present")
        
        validation_results['required_columns'] = {
            'required': required_cols,
            'missing': missing_required,
            'status': 'pass' if not missing_required else 'fail'
        }
    
    # 2. Email validation
    if 'email_columns' in rules_dict:
        print(f"\n--- Email Validation ---")
        for email_col in rules_dict['email_columns']:
            if email_col in df.columns:
                email_results = validate_email_format(df[email_col])
                validation_results['column_validations'][email_col] = {
                    'type': 'email',
                    'results': email_results
                }
                total_validation_points += email_results['total_count']
                passed_validation_points += email_results['valid_emails']
                
                if email_results['validity_rate'] < 90:
                    validation_results['summary']['warnings'] += 1
                if email_results['validity_rate'] < 70:
                    validation_results['summary']['critical_issues'] += 1
    
    # 3. Phone validation
    if 'phone_columns' in rules_dict:
        print(f"\n--- Phone Validation ---")
        country_code = rules_dict.get('phone_country_code', None)
        for phone_col in rules_dict['phone_columns']:
            if phone_col in df.columns:
                phone_results = validate_phone_format(df[phone_col], country_code)
                validation_results['column_validations'][phone_col] = {
                    'type': 'phone',
                    'results': phone_results
                }
                total_validation_points += phone_results['total_count']
                passed_validation_points += phone_results['valid_phones']
                
                if phone_results['validity_rate'] < 90:
                    validation_results['summary']['warnings'] += 1
                if phone_results['validity_rate'] < 70:
                    validation_results['summary']['critical_issues'] += 1
    
    # 4. Numeric range validation
    if 'range_rules' in rules_dict:
        print(f"\n--- Numeric Range Validation ---")
        for col, range_rule in rules_dict['range_rules'].items():
            if col in df.columns:
                min_val = range_rule.get('min')
                max_val = range_rule.get('max')
                range_results = validate_numeric_ranges(df, col, min_val, max_val)
                
                if range_results:
                    validation_results['column_validations'][col] = {
                        'type': 'numeric_range',
                        'results': range_results
                    }
                    total_validation_points += range_results['total_count']
                    passed_validation_points += range_results['valid_count']
                    
                    if range_results['validity_rate'] < 95:
                        validation_results['summary']['warnings'] += 1
                    if range_results['validity_rate'] < 85:
                        validation_results['summary']['critical_issues'] += 1
    
    # 5. Categorical validation
    if 'categorical_rules' in rules_dict:
        print(f"\n--- Categorical Validation ---")
        for col, allowed_values in rules_dict['categorical_rules'].items():
            if col in df.columns:
                cat_results = validate_categorical_values(df, col, allowed_values)
                
                if cat_results:
                    validation_results['column_validations'][col] = {
                        'type': 'categorical',
                        'results': cat_results
                    }
                    total_validation_points += cat_results['total_count']
                    passed_validation_points += cat_results['valid_count']
                    
                    if cat_results['validity_rate'] < 95:
                        validation_results['summary']['warnings'] += 1
                    if cat_results['validity_rate'] < 80:
                        validation_results['summary']['critical_issues'] += 1
    
    # 6. Data consistency check
    print(f"\n--- Data Consistency Check ---")
    consistency_results = check_data_consistency(df)
    validation_results['consistency_check'] = consistency_results
    
    if consistency_results['failed_checks'] > 0:
        validation_results['summary']['warnings'] += consistency_results['failed_checks']
    
    # 7. Calculate overall data quality score
    if total_validation_points > 0:
        validation_results['data_quality_score'] = (passed_validation_points / total_validation_points) * 100
    else:
        validation_results['data_quality_score'] = 100.0
    
    # 8. Generate recommendations
    recommendations = []
    
    if validation_results['data_quality_score'] < 80:
        recommendations.append("Data quality score is below 80%. Consider comprehensive data cleaning.")
    
    if validation_results['summary']['critical_issues'] > 0:
        recommendations.append(f"Address {validation_results['summary']['critical_issues']} critical data quality issues.")
    
    if validation_results['summary']['warnings'] > 5:
        recommendations.append("Multiple validation warnings detected. Review data entry processes.")
    
    validation_results['summary']['recommendations'] = recommendations
    validation_results['summary']['total_issues'] = validation_results['summary']['critical_issues'] + validation_results['summary']['warnings']
    
    # Final summary
    print(f"\n=== VALIDATION SUMMARY ===")
    print(f"Data Quality Score: {validation_results['data_quality_score']:.1f}%")
    print(f"Critical Issues: {validation_results['summary']['critical_issues']}")
    print(f"Warnings: {validation_results['summary']['warnings']}")
    print(f"Total Issues: {validation_results['summary']['total_issues']}")
    
    if recommendations:
        print(f"\nRecommendations:")
        for i, rec in enumerate(recommendations, 1):
            print(f"{i}. {rec}")
    
    quality_rating = "Excellent" if validation_results['data_quality_score'] >= 95 else \
                    "Good" if validation_results['data_quality_score'] >= 85 else \
                    "Fair" if validation_results['data_quality_score'] >= 70 else "Poor"
    
    print(f"\nOverall Data Quality Rating: {quality_rating}")
    
    return validation_results

In [8]:
helper_docs = """ Helper functions available:
- validate_email_format(series): Check email format validity using regex patterns. Returns dict with validation results.
- validate_phone_format(series, country_code=None): Phone number validation with international support. Returns dict with validation results.
- validate_numeric_ranges(df, column, min_val=None, max_val=None): Range validation with boundary checks. Returns dict with validation results.
- check_data_consistency(df): Cross-column consistency checks (date logic, percentages, age-birth, geographic, duplicate IDs). Returns dict with consistency results.
- validate_categorical_values(df, column, allowed_values): Check against allowed value lists with fuzzy matching suggestions. Returns dict with validation results.
- generate_validation_report(df, rules_dict): Comprehensive validation report with data quality scoring. Returns detailed validation report.

Examples:
- "Validate email addresses" -> email_results = validate_email_format(df['email'])
- "Check phone numbers" -> phone_results = validate_phone_format(df['phone'], 'US')
- "Validate age range" -> age_results = validate_numeric_ranges(df, 'age', min_val=0, max_val=150)
- "Check data consistency" -> consistency = check_data_consistency(df)
- "Validate status values" -> status_results = validate_categorical_values(df, 'status', ['active', 'inactive'])
- "Generate validation report" -> report = generate_validation_report(df, rules_dict)

Rules dictionary format:
rules = {
    'email_columns': ['email', 'contact_email'],
    'phone_columns': ['phone', 'mobile'],
    'phone_country_code': 'US',
    'range_rules': {'age': {'min': 0, 'max': 150}, 'salary': {'min': 0}},
    'categorical_rules': {'status': ['active', 'inactive'], 'category': ['A', 'B', 'C']},
    'required_columns': ['id', 'name', 'email']
}
"""

# **MAIN FEATURE FUNCTION**

In [9]:
def validation(df, user_query):
    """
    Main function that gets called by the main router.
    MUST take (df, user_query) and return df
    """
    
    # Create message chain
    messages = []
    messages.append(SystemMessage(content=helper_docs))
    messages.append(SystemMessage(content=f"""
    You are a data cleaning agent focused on data validation and quality checking.
    
    Dataset info: Shape: {df.shape}, Sample: {df.head(3).to_string()}

    Libraries available:
    - pd (pandas), np (numpy)
    - math, re, datetime
    - sklearn.preprocessing
    - All helper functions listed above
    
    Rules:
    - Return only executable Python code, no explanations, no markdown blocks
    - Use helper functions for validation tasks - they print detailed results automatically
    - ASSUME "df" IS ALREADY DEFINED
    - For validation queries, use appropriate helper functions that print results
    - Most validation functions return dictionaries with results - you can store these in variables if needed
    - ALWAYS assign the result back to df only when modifying the DataFrame
    - In order to generate a response/message to the user use print statements
    print("message")
    - Write a detailed print message to summarise actions taken and validation results
    
    Common query patterns:
    - "Validate email addresses" or "Check email format" -> validate_email_format(df['email_column'])
    - "Check phone numbers" or "Validate phone format" -> validate_phone_format(df['phone_column'], 'US')
    - "Check age range" or "Validate ages" -> validate_numeric_ranges(df, 'age', min_val=0, max_val=150)
    - "Check data consistency" or "Find inconsistencies" -> check_data_consistency(df)
    - "Validate status values" or "Check categories" -> validate_categorical_values(df, 'status', ['active', 'inactive'])
    - "Generate validation report" -> create rules_dict and use generate_validation_report(df, rules_dict)
    - "Find data quality issues" -> check_data_consistency(df) or generate comprehensive validation
    
    For comprehensive validation, create a rules dictionary like:
    rules = {{
        'email_columns': ['email'],
        'phone_columns': ['phone'],
        'range_rules': {{'age': {{'min': 0, 'max': 150}}}},
        'categorical_rules': {{'status': ['active', 'inactive']}},
        'required_columns': ['id', 'name']
    }}
    Then use: generate_validation_report(df, rules)
    """))
    messages.append(HumanMessage(content=f"User request: {user_query}"))
    
    # Call LLM with message chain
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
    response = llm.invoke(messages)
    generated_code = response.content.strip()
    
    # Execute code
    try:
        original_df = df.copy()
        # Create local namespace with our variables
        local_vars = {
            'df': df.copy(),
            'original_df': original_df,
            'pd': pd,
            'np': np,
            'math': math,
            're': re,
            'datetime': datetime,
            'validate_email_format': validate_email_format,
            'validate_phone_format': validate_phone_format,
            'validate_numeric_ranges': validate_numeric_ranges,
            'check_data_consistency': check_data_consistency,
            'validate_categorical_values': validate_categorical_values,
            'generate_validation_report': generate_validation_report,
            'print': print
        }
        
        exec(generated_code, globals(), local_vars)
        return local_vars['df']
    except Exception as e:
        print(f"Error: {e}")
        print(f"Generated Code:{generated_code}")
        return original_df

# **Testing**

In [None]:
# # Create sample data with various validation issues for testing
# test_data = {
#     'id': [1, 2, 3, 4, 5, 5, 7, 8, 9, 10],  # Duplicate ID at index 5
#     'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown', 'Charlie Wilson', 
#              'Diana Davis', 'Eve Miller', 'Frank Garcia', 'Grace Rodriguez', 'Henry Martinez'],
#     'email': ['john.doe@email.com', 'jane.smith@email.com', 'bob@invalid', 'alice.brown@company.com',
#               'charlie.wilson@email.com', 'diana.davis@email.com', 'eve@', 'frank.garcia@email.com',
#               'grace.rodriguez@email.com', 'henry martinez@email.com'],  # Various email issues
#     'phone': ['(555) 123-4567', '555-987-6543', '123', '+1-555-246-8135', '555.369.2580',
#               '(555) 147-2583', 'abc-def-ghij', '+1-555-789-0123', '555-456-7890', '555-321-9876'],
#     'age': [25, 35, -5, 45, 200, 30, 28, 42, 33, 29],  # Invalid ages: -5, 200
#     'status': ['active', 'inactive', 'pending', 'active', 'inactive', 
#                'active', 'unknown', 'active', 'inactive', 'expired'],  # Invalid: pending, unknown, expired
#     'salary': [50000, 75000, 60000, 85000, -10000, 95000, 70000, 80000, 65000, 55000],  # Invalid: -10000
#     'birth_date': ['1998-01-15', '1988-05-22', '2030-03-10', '1978-12-05', '1824-07-18',  # Invalid dates
#                    '1993-11-30', '1995-08-14', '1981-04-27', '1990-09-12', '1994-06-08'],
#     'percentage_a': [30, 40, 50, 45, 35, 25, 55, 60, 40, 30],
#     'percentage_b': [70, 50, 30, 40, 45, 75, 35, 30, 50, 60],  # Some don't sum to 100%
# }

# test_df = pd.DataFrame(test_data)
# # Convert birth_date to datetime
# test_df['birth_date'] = pd.to_datetime(test_df['birth_date'], errors='coerce')

# print("Test DataFrame created:")
# print(f"Shape: {test_df.shape}")
# print("\\nSample data:")
# print(test_df.head())
# print("\\nData types:")
# print(test_df.dtypes)

Test DataFrame created:
Shape: (10, 10)
\nSample data:
   id            name                     email            phone  age  \
0   1        John Doe        john.doe@email.com   (555) 123-4567   25   
1   2      Jane Smith      jane.smith@email.com     555-987-6543   35   
2   3     Bob Johnson               bob@invalid              123   -5   
3   4     Alice Brown   alice.brown@company.com  +1-555-246-8135   45   
4   5  Charlie Wilson  charlie.wilson@email.com     555.369.2580  200   

     status  salary birth_date  percentage_a  percentage_b  
0    active   50000 1998-01-15            30            70  
1  inactive   75000 1988-05-22            40            50  
2   pending   60000 2030-03-10            50            30  
3    active   85000 1978-12-05            45            40  
4  inactive  -10000 1824-07-18            35            45  
\nData types:
id                       int64
name                    object
email                   object
phone                   object
ag

In [None]:
# # Test individual validation functions
# print("=== TESTING EMAIL VALIDATION ===")
# email_results = validate_email_format(test_df['email'])

# print("\\n=== TESTING PHONE VALIDATION ===")
# phone_results = validate_phone_format(test_df['phone'], 'US')

# print("\\n=== TESTING NUMERIC RANGE VALIDATION ===")
# age_results = validate_numeric_ranges(test_df, 'age', min_val=0, max_val=150)

# print("\\n=== TESTING CATEGORICAL VALIDATION ===")
# status_results = validate_categorical_values(test_df, 'status', ['active', 'inactive'])

# print("\\n=== TESTING DATA CONSISTENCY CHECK ===")
# consistency_results = check_data_consistency(test_df)

=== TESTING EMAIL VALIDATION ===
=== EMAIL VALIDATION RESULTS ===
Total records: 10
Non-null records: 10
Valid emails: 7
Invalid emails: 3
Null emails: 0
Validity rate: 70.0%

Sample invalid emails:
  bob@invalid
  eve@
  henry martinez@email.com

Common issues found:
  Missing domain extension: 2 cases
  Contains whitespace: 1 cases
\n=== TESTING PHONE VALIDATION ===
=== PHONE VALIDATION RESULTS ===
Country code: US
Total records: 10
Non-null records: 10
Valid phones: 6
Invalid phones: 4
Null phones: 0
Validity rate: 60.0%

Sample invalid phones:
  (555) 123-4567
  123
  (555) 147-2583
  abc-def-ghij

Common issues found:
  Too short (< 7 digits): 1 cases
  Contains letters: 1 cases
\n=== TESTING NUMERIC RANGE VALIDATION ===
=== NUMERIC RANGE VALIDATION: age ===
Range constraints: 0 to 150
Total records: 10
Non-null records: 10
Valid values: 8
Invalid values: 2
Null values: 0
Validity rate: 80.0%
Values below minimum (0): 1
Values above maximum (150): 1

Data statistics:
  Actual rang

In [12]:
# # Test comprehensive validation report
# validation_rules = {
#     'email_columns': ['email'],
#     'phone_columns': ['phone'],
#     'phone_country_code': 'US',
#     'range_rules': {
#         'age': {'min': 0, 'max': 150},
#         'salary': {'min': 0}
#     },
#     'categorical_rules': {
#         'status': ['active', 'inactive']
#     },
#     'required_columns': ['id', 'name', 'email']
# }

# print("\\n=== TESTING COMPREHENSIVE VALIDATION REPORT ===")
# report = generate_validation_report(test_df, validation_rules)

In [None]:
# # Test the main validation function with various queries
# print("\\n=== TESTING MAIN VALIDATION FUNCTION ===")

# # Test query 1: Validate email addresses
# query1 = "Validate email addresses in the dataset"
# result1 = validation(test_df, query1)

# print("\\n" + "="*50)

# # Test query 2: Generate validation report
# query2 = "Generate a comprehensive validation report for data quality"
# result2 = validation(test_df, query2)

# print("\\n" + "="*50)

# # Test query 3: Check data consistency
# query3 = "Check for data consistency issues"
# result3 = validation(test_df, query3)

\n=== TESTING MAIN VALIDATION FUNCTION ===


Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')


=== EMAIL VALIDATION RESULTS ===
Total records: 10
Non-null records: 10
Valid emails: 7
Invalid emails: 3
Null emails: 0
Validity rate: 70.0%

Sample invalid emails:
  bob@invalid
  eve@
  henry martinez@email.com

Common issues found:
  Missing domain extension: 2 cases
  Contains whitespace: 1 cases


Failed to send compressed multipart ingest: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')


=== COMPREHENSIVE VALIDATION REPORT ===
DataFrame shape: (10, 10)
Validation started at: 2025-09-13 17:14:01

--- Required Columns Check ---
✅ All required columns present

--- Email Validation ---
=== EMAIL VALIDATION RESULTS ===
Total records: 10
Non-null records: 10
Valid emails: 7
Invalid emails: 3
Null emails: 0
Validity rate: 70.0%

Sample invalid emails:
  bob@invalid
  eve@
  henry martinez@email.com

Common issues found:
  Missing domain extension: 2 cases
  Contains whitespace: 1 cases

--- Phone Validation ---
=== PHONE VALIDATION RESULTS ===
Country code: International
Total records: 10
Non-null records: 10
Valid phones: 8
Invalid phones: 2
Null phones: 0
Validity rate: 80.0%

Sample invalid phones:
  123
  abc-def-ghij

Common issues found:
  Too short (< 7 digits): 1 cases
  Contains letters: 1 cases
  Missing country code (+): 8 cases

--- Numeric Range Validation ---
=== NUMERIC RANGE VALIDATION: age ===
Range constraints: 0 to 150
Total records: 10
Non-null records: 10

Failed to send compressed multipart ingest: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')


=== DATA CONSISTENCY CHECK ===
DataFrame shape: (10, 10)
❌ Percentage sum issue: 8 records with percentage sums outside 95-105%
❌ Age-birth date mismatch: 9 records with >1 year difference
❌ Age-birth date mismatch: 10 records with >1 year difference
❌ Age-birth date mismatch: 10 records with >1 year difference
❌ Duplicate IDs: 1 duplicate values in id

=== CONSISTENCY SUMMARY ===
Total consistency checks performed: 5
Checks passed: 0
Checks failed: 5
Consistency rate: 0.0%

Issues found:
  • Percentage columns ['percentage_a', 'percentage_b'] do not sum to ~100%: 8 records
  • age inconsistent with birth_date: 9 records
  • percentage_a inconsistent with birth_date: 10 records
  • percentage_b inconsistent with birth_date: 10 records
  • Duplicate values in ID column id: 1 records


Failed to send compressed multipart ingest: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')
