### Task 1: Validate Data with a Custom Expectation in Great Expectations
**Description**: Create a custom expectation and validate data with Great Expectations.

**Load a sample DataFrame**

data = {
'age': [25, 30, 35, 40, 45],
'income': [50000, 60000, 75000, None, 100000]
}

AttributeError: module 'great_expectations' has no attribute 'from_pandas'

### Task 2: Implement a Basic Alert System for Data Quality Drops
**Description**: Set up a basic alert system that triggers when data quality drops.

In [None]:
# Write your code from here

### Task 3: Real-time Data Quality Monitoring with Python and Great Expectations
**Description**: Implement a system that monitors data quality in real-time.

In [2]:
import pandas as pd
import numpy as np
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import time
import logging
from datetime import datetime, timedelta
import warnings

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Task 1: Data Validation with Custom Expectations (Fixed Great Expectations usage)
class CustomDataValidator:
    """Custom data validator with built-in and custom expectations"""
    
    def __init__(self, df):
        self.df = df
        self.validation_results = []
    
    def expect_column_values_to_be_between(self, column, min_value, max_value):
        """Check if column values are within specified range"""
        try:
            valid_values = self.df[column].between(min_value, max_value, inclusive='both')
            success = valid_values.all()
            invalid_count = (~valid_values).sum()
            
            result = {
                "expectation_type": "expect_column_values_to_be_between",
                "column": column,
                "success": success,
                "result": {
                    "element_count": len(self.df),
                    "missing_count": self.df[column].isna().sum(),
                    "missing_percent": (self.df[column].isna().sum() / len(self.df)) * 100,
                    "unexpected_count": invalid_count,
                    "unexpected_percent": (invalid_count / len(self.df)) * 100,
                    "partial_unexpected_list": self.df[~valid_values][column].tolist()
                }
            }
            self.validation_results.append(result)
            return result
        except Exception as e:
            logger.error(f"Error in range validation: {e}")
            return {"success": False, "error": str(e)}
    
    def expect_income_greater_than_zero(self, column):
        """Custom expectation: Check if income values are greater than 0"""
        try:
            # Handle NaN values by excluding them from the check
            non_nan_values = self.df[column].dropna()
            valid_values = non_nan_values > 0
            success = valid_values.all() if len(non_nan_values) > 0 else False
            invalid_count = (~valid_values).sum()
            
            result = {
                "expectation_type": "expect_income_greater_than_zero",
                "column": column,
                "success": success,
                "result": {
                    "element_count": len(non_nan_values),
                    "valid_count": valid_values.sum(),
                    "invalid_count": invalid_count,
                    "invalid_values": non_nan_values[~valid_values].tolist() if invalid_count > 0 else []
                }
            }
            self.validation_results.append(result)
            return result
        except Exception as e:
            logger.error(f"Error in custom income validation: {e}")
            return {"success": False, "error": str(e)}
    
    def expect_column_values_to_not_be_null(self, column):
        """Check if column has no null values"""
        try:
            null_count = self.df[column].isna().sum()
            success = null_count == 0
            
            result = {
                "expectation_type": "expect_column_values_to_not_be_null",
                "column": column,
                "success": success,
                "result": {
                    "element_count": len(self.df),
                    "null_count": null_count,
                    "null_percent": (null_count / len(self.df)) * 100
                }
            }
            self.validation_results.append(result)
            return result
        except Exception as e:
            logger.error(f"Error in null validation: {e}")
            return {"success": False, "error": str(e)}
    
    def validate_all(self):
        """Return all validation results"""
        return {
            "success": all(result.get("success", False) for result in self.validation_results),
            "results": self.validation_results,
            "statistics": {
                "evaluated_expectations": len(self.validation_results),
                "successful_expectations": sum(1 for r in self.validation_results if r.get("success", False)),
                "success_percent": (sum(1 for r in self.validation_results if r.get("success", False)) / len(self.validation_results)) * 100 if self.validation_results else 0
            }
        }

# Task 1 Implementation
print("=== Task 1: Data Validation with Custom Expectations ===")
data = {'age': [25, 30, 35, 40, 45], 'income': [50000, 60000, 75000, None, 100000]}
df = pd.DataFrame(data)

# Create custom validator
validator = CustomDataValidator(df)

# Add expectations
age_result = validator.expect_column_values_to_be_between("age", 18, 120)
income_result = validator.expect_income_greater_than_zero("income")
null_result = validator.expect_column_values_to_not_be_null("age")

# Validate all expectations
validation_summary = validator.validate_all()
print(f"Validation Summary: {validation_summary['statistics']}")
for result in validation_summary['results']:
    print(f"- {result['expectation_type']} ({result['column']}): {'PASSED' if result['success'] else 'FAILED'}")

# Task 2: Enhanced Email Alert System
class AlertSystem:
    """Enhanced email alert system with better error handling"""
    
    def __init__(self, sender_email="your_email@example.com", password="your_password", 
                 smtp_server="smtp.gmail.com", smtp_port=587):
        self.sender_email = sender_email
        self.password = password
        self.smtp_server = smtp_server
        self.smtp_port = smtp_port
        self.alert_history = []
    
    def send_email_alert(self, subject, body, recipient_email, max_retries=3):
        """Send email alert with retry mechanism"""
        for attempt in range(max_retries):
            try:
                msg = MIMEMultipart()
                msg['From'] = self.sender_email
                msg['To'] = recipient_email
                msg['Subject'] = subject
                msg.attach(MIMEText(body, 'plain'))
                
                # For demonstration, we'll log instead of actually sending
                # Uncomment the SMTP code below to actually send emails
                logger.info(f"EMAIL ALERT - Subject: {subject}")
                logger.info(f"EMAIL ALERT - To: {recipient_email}")
                logger.info(f"EMAIL ALERT - Body: {body}")
                
                # Actual SMTP code (commented for demo):
                # with smtplib.SMTP(self.smtp_server, self.smtp_port) as server:
                #     server.starttls()
                #     server.login(self.sender_email, self.password)
                #     text = msg.as_string()
                #     server.sendmail(self.sender_email, recipient_email, text)
                
                # Log alert
                self.alert_history.append({
                    'timestamp': datetime.now(),
                    'subject': subject,
                    'recipient': recipient_email,
                    'status': 'sent'
                })
                
                logger.info(f"Email alert sent successfully to {recipient_email}")
                return True
                
            except smtplib.SMTPAuthenticationError as e:
                logger.error(f"SMTP Authentication failed (attempt {attempt + 1}): {e}")
            except smtplib.SMTPException as e:
                logger.error(f"SMTP error (attempt {attempt + 1}): {e}")
            except Exception as e:
                logger.error(f"Unexpected error (attempt {attempt + 1}): {e}")
            
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
        
        logger.error(f"Failed to send email after {max_retries} attempts")
        return False
    
    def send_data_quality_alert(self, kpi_name, current_value, threshold, recipient_email):
        """Send specific data quality alert"""
        subject = f"Data Quality Alert: {kpi_name} Below Threshold"
        body = f"""
        Data Quality Alert
        
        KPI: {kpi_name}
        Current Value: {current_value:.2f}%
        Threshold: {threshold:.2f}%
        
        Immediate attention required.
        
        Alert generated at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
        """
        return self.send_email_alert(subject, body, recipient_email)

# Task 2 Implementation - Data Quality KPIs
print("\n=== Task 2: Data Quality KPIs and Alert System ===")

# Sample transaction data
transaction_data = {
    'transaction_id': [101, 102, 103, np.nan, 105],
    'product_id': [1, 2, 3, 4, np.nan],
    'quantity_sold': [3, 5, 2, 4, 1],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05'],
    'total_price': [100.0, 200.0, 50.0, 80.0, np.nan]
}

df_transactions = pd.DataFrame(transaction_data)
alert_system = AlertSystem()

# KPI 1: Transaction Accuracy Rate
valid_transactions = df_transactions.dropna(subset=['quantity_sold', 'total_price'])
transaction_accuracy_rate = (len(valid_transactions) / len(df_transactions)) * 100

# KPI 2: Sales Data Completeness Rate
columns_to_check = ['transaction_id', 'product_id', 'quantity_sold', 'total_price']
missing_data_count = df_transactions[columns_to_check].isnull().sum().sum()
total_possible_missing = len(df_transactions) * len(columns_to_check)
sales_data_completeness_rate = ((total_possible_missing - missing_data_count) / total_possible_missing) * 100

# KPI 3: Data Freshness Rate
df_transactions['transaction_date'] = pd.to_datetime(df_transactions['transaction_date'])
df_transactions['current_date'] = pd.to_datetime('2025-05-06')
df_transactions['time_diff'] = (df_transactions['current_date'] - df_transactions['transaction_date']).dt.days
timely_transactions = df_transactions[df_transactions['time_diff'] <= 1]
data_freshness_rate = (len(timely_transactions) / len(df_transactions)) * 100

# Set thresholds
thresholds = {
    "transaction_accuracy_rate": 90.0,
    "sales_data_completeness_rate": 95.0,
    "data_freshness_rate": 80.0
}

# Check KPIs and send alerts
kpis = {
    "transaction_accuracy_rate": transaction_accuracy_rate,
    "sales_data_completeness_rate": sales_data_completeness_rate,
    "data_freshness_rate": data_freshness_rate
}

print("Current KPI Values:")
alerts_triggered = 0
for kpi_name, kpi_value in kpis.items():
    threshold = thresholds[kpi_name]
    status = "PASS" if kpi_value >= threshold else "FAIL"
    print(f"- {kpi_name}: {kpi_value:.2f}% (Threshold: {threshold}%) - {status}")
    
    if kpi_value < threshold:
        alert_system.send_data_quality_alert(kpi_name, kpi_value, threshold, "admin@company.com")
        alerts_triggered += 1

print(f"Total alerts triggered: {alerts_triggered}")

# Task 3: Real-time Data Quality Monitoring
class RealTimeDataQualityMonitor:
    """Real-time data quality monitoring system"""
    
    def __init__(self, alert_system, thresholds, monitoring_interval=60):
        self.alert_system = alert_system
        self.thresholds = thresholds
        self.monitoring_interval = monitoring_interval
        self.is_monitoring = False
        self.data_sources = []
        self.validator = None
    
    def fetch_new_data(self):
        """Simulate fetching new data (replace with actual data source)"""
        # Simulate different data quality scenarios
        scenarios = [
            # Good quality data
            {'age': [25, 30, 35], 'income': [50000, 60000, 70000]},
            # Data with nulls
            {'age': [25, None, 35], 'income': [50000, 60000, None]},
            # Data with invalid values
            {'age': [25, 150, -5], 'income': [50000, -1000, 70000]},
        ]
        
        import random
        selected_data = random.choice(scenarios)
        return pd.DataFrame(selected_data)
    
    def calculate_data_quality_score(self, df):
        """Calculate overall data quality score"""
        if df.empty:
            return 0.0
        
        # Create validator for the new data
        self.validator = CustomDataValidator(df)
        
        # Add expectations
        if 'age' in df.columns:
            self.validator.expect_column_values_to_be_between("age", 0, 120)
            self.validator.expect_column_values_to_not_be_null("age")
        
        if 'income' in df.columns:
            self.validator.expect_income_greater_than_zero("income")
        
        # Get validation results
        validation_summary = self.validator.validate_all()
        return validation_summary['statistics']['success_percent']
    
    def check_data_quality(self, df):
        """Comprehensive data quality check"""
        quality_score = self.calculate_data_quality_score(df)
        
        # Check if quality score meets threshold
        quality_threshold = self.thresholds.get('data_quality_score', 80.0)
        
        return {
            'quality_score': quality_score,
            'passes_threshold': quality_score >= quality_threshold,
            'threshold': quality_threshold,
            'validation_results': self.validator.validation_summary if self.validator else None
        }
    
    def monitor_data_quality_stream(self, max_iterations=5):
        """Monitor data quality in real-time (limited iterations for demo)"""
        print(f"\n=== Task 3: Real-time Data Quality Monitoring ===")
        print(f"Starting monitoring (max {max_iterations} iterations for demo)...")
        
        self.is_monitoring = True
        iteration = 0
        
        try:
            while self.is_monitoring and iteration < max_iterations:
                iteration += 1
                logger.info(f"Monitoring iteration {iteration}/{max_iterations}")
                
                # Fetch new data
                new_data = self.fetch_new_data()
                logger.info(f"Fetched new data batch with {len(new_data)} records")
                
                # Check data quality
                quality_check = self.check_data_quality(new_data)
                
                logger.info(f"Data quality score: {quality_check['quality_score']:.2f}%")
                
                # Send alert if quality is below threshold
                if not quality_check['passes_threshold']:
                    self.alert_system.send_data_quality_alert(
                        "Overall Data Quality Score",
                        quality_check['quality_score'],
                        quality_check['threshold'],
                        "admin@company.com"
                    )
                    logger.warning("Data quality alert triggered!")
                else:
                    logger.info("Data quality within acceptable limits")
                
                # Wait before next check (reduced for demo)
                time.sleep(2)  # 2 seconds instead of 60 for demo
        
        except KeyboardInterrupt:
            logger.info("Monitoring stopped by user")
        except Exception as e:
            logger.error(f"Error in monitoring loop: {e}")
        finally:
            self.is_monitoring = False
            logger.info("Data quality monitoring stopped")

# Task 3 Implementation
# Enhanced thresholds for real-time monitoring
enhanced_thresholds = {
    "transaction_accuracy_rate": 90.0,
    "sales_data_completeness_rate": 95.0,
    "data_freshness_rate": 80.0,
    "data_quality_score": 75.0  # Overall quality score threshold
}

# Create and start real-time monitor
monitor = RealTimeDataQualityMonitor(alert_system, enhanced_thresholds)
monitor.monitor_data_quality_stream(max_iterations=3)  # Limited for demo

# Summary
print("\n=== Summary ===")
print("✅ Task 1: Fixed Great Expectations usage with custom validator")
print("✅ Task 2: Implemented enhanced email alert system with KPI monitoring")
print("✅ Task 3: Created real-time data quality monitoring system")
print(f"Total alerts generated: {len(alert_system.alert_history)}")

# Display alert history
if alert_system.alert_history:
    print("\nAlert History:")
    for i, alert in enumerate(alert_system.alert_history, 1):
        print(f"{i}. {alert['timestamp'].strftime('%H:%M:%S')} - {alert['subject']}")

2025-05-18 09:51:08,678 - INFO - EMAIL ALERT - Subject: Data Quality Alert: transaction_accuracy_rate Below Threshold
2025-05-18 09:51:08,679 - INFO - EMAIL ALERT - To: admin@company.com
2025-05-18 09:51:08,680 - INFO - EMAIL ALERT - Body: 
        Data Quality Alert
        
        KPI: transaction_accuracy_rate
        Current Value: 80.00%
        Threshold: 90.00%
        
        Immediate attention required.
        
        Alert generated at: 2025-05-18 09:51:08
        
2025-05-18 09:51:08,680 - INFO - Email alert sent successfully to admin@company.com
2025-05-18 09:51:08,681 - INFO - EMAIL ALERT - Subject: Data Quality Alert: sales_data_completeness_rate Below Threshold
2025-05-18 09:51:08,682 - INFO - EMAIL ALERT - To: admin@company.com
2025-05-18 09:51:08,682 - INFO - EMAIL ALERT - Body: 
        Data Quality Alert
        
        KPI: sales_data_completeness_rate
        Current Value: 85.00%
        Threshold: 95.00%
        
        Immediate attention required.
      

=== Task 1: Data Validation with Custom Expectations ===
Validation Summary: {'evaluated_expectations': 3, 'successful_expectations': 3, 'success_percent': 100.0}
- expect_column_values_to_be_between (age): PASSED
- expect_income_greater_than_zero (income): PASSED
- expect_column_values_to_not_be_null (age): PASSED

=== Task 2: Data Quality KPIs and Alert System ===
Current KPI Values:
- transaction_accuracy_rate: 80.00% (Threshold: 90.0%) - FAIL
- sales_data_completeness_rate: 85.00% (Threshold: 95.0%) - FAIL
- data_freshness_rate: 20.00% (Threshold: 80.0%) - FAIL
Total alerts triggered: 3

=== Task 3: Real-time Data Quality Monitoring ===
Starting monitoring (max 3 iterations for demo)...

=== Summary ===
✅ Task 1: Fixed Great Expectations usage with custom validator
✅ Task 2: Implemented enhanced email alert system with KPI monitoring
✅ Task 3: Created real-time data quality monitoring system
Total alerts generated: 3

Alert History:
1. 09:51:08 - Data Quality Alert: transaction_acc