### Task 1: Automated Data Profiling

**Steps**:
1. Using Pandas-Profiling (now published as `ydata-profiling`)
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.
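
The steps above can be sketched without either library, which may help when Pandas-Profiling or Great Expectations is not installed or its API differs between releases. This is a minimal stand-in using plain pandas, not the API of either tool: `CSV_TEXT`, `profile_columns`, the `expect_*` helpers, and `validate` are all hypothetical names introduced for illustration. With `ydata-profiling` installed, the profiling half would roughly correspond to `ProfileReport(df[columns]).to_file("report.html")`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for an existing file on disk
CSV_TEXT = """id,age,salary,department
1,34,52000,IT
2,45,61000,HR
3,29,,Finance
4,51,75000,IT
5,38,58000,HR
"""


def profile_columns(df, columns=None):
    """Summarise the selected columns and correlate the numeric ones."""
    subset = df if columns is None else df[columns]
    summary = pd.DataFrame({
        "dtype": subset.dtypes.astype(str),
        "missing": subset.isna().sum(),
        "distinct": subset.nunique(),
    })
    correlations = subset.select_dtypes(include="number").corr()
    return summary, correlations


# A hand-rolled "expectation suite": (name, check) pairs, each check returns True on success
def expect_not_null(column):
    return (f"{column} is not null", lambda df: bool(df[column].notna().all()))


def expect_in_set(column, allowed):
    return (f"{column} in {sorted(allowed)}",
            lambda df: bool(df[column].dropna().isin(allowed).all()))


def expect_between(column, low, high):
    return (f"{low} <= {column} <= {high}",
            lambda df: bool(df[column].dropna().between(low, high).all()))


def validate(df, suite):
    """Run every expectation in the suite and return {name: passed}."""
    return {name: check(df) for name, check in suite}


df = pd.read_csv(io.StringIO(CSV_TEXT))

# Profile a specific subset of columns, including their correlations
summary, correlations = profile_columns(df, columns=["age", "salary"])
print(summary)
print(correlations.round(3))

# Build a suite with multiple expectations and validate the data against it
suite = [
    expect_not_null("salary"),
    expect_in_set("department", {"IT", "HR", "Finance"}),
    expect_between("age", 18, 80),
]
results = validate(df, suite)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

Keeping each expectation as a named check makes it easy to append more of them to `suite` and to see which one failed.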

In [1]:
import pandas as pd
import numpy as np
import great_expectations as ge
from great_expectations.core import ExpectationSuite
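# The legacy Dataset API (great_expectations.dataset) is missing from newer
# releases; fall back to None so the rest of the module still imports
try:
    from great_expectations.dataset import PandasDataset
except ImportError:
    PandasDataset = None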
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import logging
import smtplib
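# The standard-library classes are spelled MIMEText / MIMEMultipart;
# alias them to the names used further down in send_alert
from email.mime.text import MIMEText as MimeText
from email.mime.multipart import MIMEMultipart as MimeMultipart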
import warnings
from typing import Dict, List, Tuple, Optional, Any
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('data_quality.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

class DataQualityMonitor:
    """
    Comprehensive data quality monitoring system with automated profiling,
    real-time monitoring, and AI-powered anomaly detection.
    """
    
    def __init__(self, alert_threshold: float = 0.95):
        """
        Initialize the data quality monitor.
        
        Args:
            alert_threshold: Threshold below which alerts are triggered
        """
        self.alert_threshold = alert_threshold
        self.anomaly_model = None
        self.scaler = StandardScaler()
        self.baseline_metrics = {}
        
    def check_accuracy(self, df: pd.DataFrame, column_name: str, 
                      expected_values: List[Any]) -> Dict[str, Any]:
        """
        Vectorized accuracy check using pandas operations.
        
        Args:
            df: Input DataFrame
            column_name: Column to check
            expected_values: List of expected values
            
        Returns:
            Dictionary containing accuracy metrics
        """
        try:
            if column_name not in df.columns:
                raise ValueError(f"Column '{column_name}' not found in DataFrame")
            
            # Vectorized operation - much faster than loops
            valid_mask = df[column_name].isin(expected_values)
            accuracy_score = valid_mask.sum() / len(df)
            
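            # Use Great Expectations for detailed validation when the legacy
            # Dataset API is available; otherwise skip this step
            ge_result = None
            if PandasDataset is not None:
                df_ge = PandasDataset(df)
                ge_result = df_ge.expect_column_values_to_be_in_set(
                    column_name, expected_values
                )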
            
            result = {
                'accuracy_score': accuracy_score,
                'valid_count': valid_mask.sum(),
                'total_count': len(df),
                'invalid_values': df[~valid_mask][column_name].unique().tolist(),
                'ge_result': ge_result
            }
            
            logger.info(f"Accuracy check for {column_name}: {accuracy_score:.3f}")
            return result
            
        except Exception as e:
            logger.error(f"Error in accuracy check: {str(e)}")
            return {'error': str(e), 'accuracy_score': 0.0}
    
    def check_completeness(self, df: pd.DataFrame) -> Dict[str, Any]:
        """
        Vectorized completeness check for all columns.
        
        Args:
            df: Input DataFrame
            
        Returns:
            Dictionary containing completeness metrics
        """
        try:
            # Vectorized operations for missing data calculation
            missing_counts = df.isnull().sum()
            total_counts = len(df)
            completeness_ratios = 1 - (missing_counts / total_counts)
            
            result = {
                'missing_counts': missing_counts.to_dict(),
                'completeness_ratios': completeness_ratios.to_dict(),
                'overall_completeness': completeness_ratios.mean(),
                'columns_with_missing': missing_counts[missing_counts > 0].index.tolist()
            }
            
            logger.info(f"Overall completeness: {result['overall_completeness']:.3f}")
            return result
            
        except Exception as e:
            logger.error(f"Error in completeness check: {str(e)}")
            return {'error': str(e), 'overall_completeness': 0.0}
    
    def check_consistency(self, df: pd.DataFrame) -> Dict[str, Any]:
        """
        Enhanced consistency check including duplicates and data type consistency.
        
        Args:
            df: Input DataFrame
            
        Returns:
            Dictionary containing consistency metrics
        """
        try:
            # Check for duplicate rows
            duplicate_count = df.duplicated().sum()
            duplicate_percentage = (duplicate_count / len(df)) * 100
            
            # Check data type consistency
            dtype_consistency = {}
            for col in df.columns:
                # Check if column has mixed data types
                if df[col].dtype == 'object':
                    unique_types = set(type(x).__name__ for x in df[col].dropna())
                    dtype_consistency[col] = len(unique_types) == 1
                else:
                    dtype_consistency[col] = True
            
            result = {
                'duplicate_count': duplicate_count,
                'duplicate_percentage': duplicate_percentage,
                'dtype_consistency': dtype_consistency,
                'consistency_score': 1 - (duplicate_percentage / 100)
            }
            
            logger.info(f"Consistency check - Duplicates: {duplicate_count}")
            return result
            
        except Exception as e:
            logger.error(f"Error in consistency check: {str(e)}")
            return {'error': str(e), 'consistency_score': 0.0}
    
    def check_uniqueness(self, df: pd.DataFrame, column_name: str) -> Dict[str, Any]:
        """
        Enhanced uniqueness check with additional metrics.
        
        Args:
            df: Input DataFrame
            column_name: Column to check
            
        Returns:
            Dictionary containing uniqueness metrics
        """
        try:
            if column_name not in df.columns:
                raise ValueError(f"Column '{column_name}' not found in DataFrame")
            
            unique_count = df[column_name].nunique()
            total_count = len(df)
            uniqueness_ratio = unique_count / total_count
            
            # Check for potential ID columns (high uniqueness)
            is_potential_id = uniqueness_ratio > 0.9
            
            result = {
                'unique_count': unique_count,
                'total_count': total_count,
                'uniqueness_ratio': uniqueness_ratio,
                'is_potential_id': is_potential_id,
                # Report only values that occur more than once (top 10 by frequency)
                'duplicate_values': df[column_name].value_counts().loc[lambda s: s > 1].head(10).to_dict()
            }
            
            logger.info(f"Uniqueness check for {column_name}: {uniqueness_ratio:.3f}")
            return result
            
        except Exception as e:
            logger.error(f"Error in uniqueness check: {str(e)}")
            return {'error': str(e), 'uniqueness_ratio': 0.0}
    
    def check_timeliness(self, df: pd.DataFrame, date_column: str) -> Dict[str, Any]:
        """
        Enhanced timeliness check with additional temporal metrics.
        
        Args:
            df: Input DataFrame
            date_column: Date column to check
            
        Returns:
            Dictionary containing timeliness metrics
        """
        try:
            if date_column not in df.columns:
                raise ValueError(f"Column '{date_column}' not found in DataFrame")
            
            # Convert to datetime if not already
            if not pd.api.types.is_datetime64_any_dtype(df[date_column]):
                date_series = pd.to_datetime(df[date_column], errors='coerce')
            else:
                date_series = df[date_column]
            
            # Calculate temporal metrics
            min_date = date_series.min()
            max_date = date_series.max()
            date_range_days = (max_date - min_date).days
            
            # Check for future dates
            current_date = pd.Timestamp.now()
            future_mask = date_series > current_date
            future_dates = future_mask.sum()
            
            # Check for reasonable date range (not too old or too far in the future)
            reasonable_min = pd.Timestamp('1900-01-01')
            reasonable_max = pd.Timestamp('2030-12-31')
            unreasonable_mask = ((date_series < reasonable_min) | 
                                 (date_series > reasonable_max))
            unreasonable_dates = unreasonable_mask.sum()
            
            # Count each problematic row once: a date past reasonable_max is also
            # a future date, so adding the two counts would double-count it
            problem_dates = (future_mask | unreasonable_mask).sum()
            
            result = {
                'min_date': min_date,
                'max_date': max_date,
                'date_range_days': date_range_days,
                'future_dates_count': future_dates,
                'unreasonable_dates_count': unreasonable_dates,
                'null_dates_count': date_series.isnull().sum(),
                'timeliness_score': 1 - problem_dates / len(df)
            }
            
            logger.info(f"Timeliness check for {date_column}: {result['timeliness_score']:.3f}")
            return result
            
        except Exception as e:
            logger.error(f"Error in timeliness check: {str(e)}")
            return {'error': str(e), 'timeliness_score': 0.0}
    
    def generate_profile_report(self, df: pd.DataFrame, 
                               output_path: Optional[str] = None) -> Dict[str, Any]:
        """
        Generate a comprehensive data profile report.
        
        Args:
            df: Input DataFrame
            output_path: Optional path to save the report
            
        Returns:
            Dictionary containing profile metrics
        """
        try:
            # Basic statistics
            profile = {
                'basic_info': {
                    'shape': df.shape,
                    'memory_usage': df.memory_usage(deep=True).sum(),
                    'data_types': df.dtypes.to_dict()
                },
                'completeness': self.check_completeness(df),
                'consistency': self.check_consistency(df),
                'statistical_summary': df.describe(include='all').to_dict()
            }
            
            # Add correlation matrix for numeric columns
            numeric_cols = df.select_dtypes(include=[np.number]).columns
            if len(numeric_cols) > 1:
                profile['correlations'] = df[numeric_cols].corr().to_dict()
            
            # Save report if path provided
            if output_path:
                with open(output_path, 'w') as f:
                    json.dump(profile, f, indent=2, default=str)
                logger.info(f"Profile report saved to {output_path}")
            
            return profile
            
        except Exception as e:
            logger.error(f"Error generating profile report: {str(e)}")
            return {'error': str(e)}
    
    def setup_email_alerts(self, smtp_server: str, smtp_port: int, 
                          email: str, password: str, recipients: List[str]):
        """
        Configure email alerts for data quality issues.
        
        Args:
            smtp_server: SMTP server address
            smtp_port: SMTP server port
            email: Sender email address
            password: Sender email password
            recipients: List of recipient email addresses
        """
        self.email_config = {
            'smtp_server': smtp_server,
            'smtp_port': smtp_port,
            'email': email,
            'password': password,
            'recipients': recipients
        }
        logger.info("Email alerts configured successfully")
    
    def send_alert(self, subject: str, message: str):
        """
        Send email alert for data quality issues.
        
        Args:
            subject: Email subject
            message: Email message content
        """
        try:
            if not hasattr(self, 'email_config'):
                logger.warning("Email configuration not set up")
                return
            
            msg = MimeMultipart()
            msg['From'] = self.email_config['email']
            msg['To'] = ', '.join(self.email_config['recipients'])
            msg['Subject'] = subject
            
            msg.attach(MimeText(message, 'plain'))
            
            # The context manager closes the connection even if login or sending fails
            with smtplib.SMTP(self.email_config['smtp_server'],
                              self.email_config['smtp_port']) as server:
                server.starttls()
                server.login(self.email_config['email'], self.email_config['password'])
                server.sendmail(self.email_config['email'],
                                self.email_config['recipients'], msg.as_string())
            
            logger.info(f"Alert sent: {subject}")
            
        except Exception as e:
            logger.error(f"Failed to send email alert: {str(e)}")
    
    def train_anomaly_detector(self, df: pd.DataFrame, 
                             contamination: float = 0.1) -> None:
        """
        Train an Isolation Forest model for anomaly detection.
        
        Args:
            df: Training DataFrame
            contamination: Expected proportion of outliers
        """
        try:
            # Select numeric columns for anomaly detection
            numeric_cols = df.select_dtypes(include=[np.number]).columns
            
            if len(numeric_cols) == 0:
                raise ValueError("No numeric columns found for anomaly detection")
            
            # Prepare data
            X = df[numeric_cols].fillna(df[numeric_cols].median())
            X_scaled = self.scaler.fit_transform(X)
            
            # Train Isolation Forest
            self.anomaly_model = IsolationForest(
                contamination=contamination,
                random_state=42,
                n_estimators=100
            )
            self.anomaly_model.fit(X_scaled)
            
            # Store baseline metrics
            self.baseline_metrics = {
                'mean': df[numeric_cols].mean().to_dict(),
                'std': df[numeric_cols].std().to_dict(),
                'median': df[numeric_cols].median().to_dict()
            }
            
            logger.info("Anomaly detection model trained successfully")
            
        except Exception as e:
            logger.error(f"Error training anomaly detector: {str(e)}")
    
    def detect_anomalies(self, df: pd.DataFrame) -> Dict[str, Any]:
        """
        Detect anomalies in the data using the trained model.
        
        Args:
            df: DataFrame to check for anomalies
            
        Returns:
            Dictionary containing anomaly detection results
        """
        try:
            if self.anomaly_model is None:
                raise ValueError("Anomaly detection model not trained")
            
            # Use the columns and medians recorded at training time so the
            # features line up with what the scaler and model were fitted on
            train_medians = pd.Series(self.baseline_metrics['median'])
            X = df[train_medians.index].fillna(train_medians)
            X_scaled = self.scaler.transform(X)
            
            # Predict anomalies
            anomaly_predictions = self.anomaly_model.predict(X_scaled)
            anomaly_scores = self.anomaly_model.decision_function(X_scaled)
            
            # Calculate anomaly statistics
            anomaly_mask = anomaly_predictions == -1
            anomaly_count = anomaly_mask.sum()
            anomaly_percentage = (anomaly_count / len(df)) * 100
            
            result = {
                'anomaly_count': anomaly_count,
                'anomaly_percentage': anomaly_percentage,
                'anomaly_indices': df.index[anomaly_mask].tolist(),
                'anomaly_scores': anomaly_scores.tolist(),
                'threshold_exceeded': anomaly_percentage > (100 - self.alert_threshold * 100)
            }
            
            # Trigger alert if threshold exceeded
            if result['threshold_exceeded']:
                self.send_alert(
                    "Data Quality Alert: High Anomaly Rate",
                    f"Detected {anomaly_percentage:.2f}% anomalies in the data"
                )
            
            logger.info(f"Anomaly detection completed: {anomaly_percentage:.2f}% anomalies")
            return result
            
        except Exception as e:
            logger.error(f"Error in anomaly detection: {str(e)}")
            return {'error': str(e)}
    
    def custom_outlier_detection(self, df: pd.DataFrame, 
                               column: str, method: str = 'iqr') -> Dict[str, Any]:
        """
        Custom outlier detection using statistical methods.
        
        Args:
            df: Input DataFrame
            column: Column to check for outliers
            method: Method to use ('iqr', 'zscore', 'modified_zscore')
            
        Returns:
            Dictionary containing outlier detection results
        """
        try:
            if column not in df.columns:
                raise ValueError(f"Column '{column}' not found in DataFrame")
            
            if not pd.api.types.is_numeric_dtype(df[column]):
                raise ValueError(f"Column '{column}' is not numeric")
            
            data = df[column].dropna()
            
            if method == 'iqr':
                Q1 = data.quantile(0.25)
                Q3 = data.quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                outliers = (data < lower_bound) | (data > upper_bound)
                
            elif method == 'zscore':
                z_scores = np.abs((data - data.mean()) / data.std())
                outliers = z_scores > 3
                
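            elif method == 'modified_zscore':
                median = data.median()
                mad = np.median(np.abs(data - median))
                if mad == 0:
                    # At least half the values equal the median, so the score
                    # is undefined; flag nothing rather than divide by zero
                    outliers = pd.Series(False, index=data.index)
                else:
                    modified_z_scores = 0.6745 * (data - median) / mad
                    outliers = np.abs(modified_z_scores) > 3.5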
                
            else:
                raise ValueError(f"Unknown method: {method}")
            
            outlier_count = outliers.sum()
            outlier_percentage = (outlier_count / len(data)) * 100
            
            result = {
                'method': method,
                'outlier_count': outlier_count,
                'outlier_percentage': outlier_percentage,
                'outlier_values': data[outliers].tolist(),
                'outlier_indices': data[outliers].index.tolist()
            }
            
            logger.info(f"Outlier detection ({method}) for {column}: {outlier_percentage:.2f}%")
            return result
            
        except Exception as e:
            logger.error(f"Error in custom outlier detection: {str(e)}")
            return {'error': str(e)}
    
    def comprehensive_quality_check(self, df: pd.DataFrame, 
                                  config: Dict[str, Any]) -> Dict[str, Any]:
        """
        Perform comprehensive data quality assessment.
        
        Args:
            df: Input DataFrame
            config: Configuration dictionary with check parameters
            
        Returns:
            Comprehensive quality assessment results
        """
        try:
            results = {
                'timestamp': datetime.now().isoformat(),
                'data_shape': df.shape,
                'checks': {}
            }
            
            # Accuracy checks
            if 'accuracy_checks' in config:
                for check in config['accuracy_checks']:
                    column = check['column']
                    expected_values = check['expected_values']
                    results['checks'][f'accuracy_{column}'] = self.check_accuracy(
                        df, column, expected_values
                    )
            
            # Completeness check
            results['checks']['completeness'] = self.check_completeness(df)
            
            # Consistency check
            results['checks']['consistency'] = self.check_consistency(df)
            
            # Uniqueness checks
            if 'uniqueness_checks' in config:
                for column in config['uniqueness_checks']:
                    results['checks'][f'uniqueness_{column}'] = self.check_uniqueness(
                        df, column
                    )
            
            # Timeliness checks
            if 'timeliness_checks' in config:
                for column in config['timeliness_checks']:
                    results['checks'][f'timeliness_{column}'] = self.check_timeliness(
                        df, column
                    )
            
            # Anomaly detection
            if hasattr(self, 'anomaly_model') and self.anomaly_model is not None:
                results['checks']['anomalies'] = self.detect_anomalies(df)
            
            # Custom outlier detection
            if 'outlier_checks' in config:
                for check in config['outlier_checks']:
                    column = check['column']
                    method = check.get('method', 'iqr')
                    results['checks'][f'outliers_{column}_{method}'] = \
                        self.custom_outlier_detection(df, column, method)
            
            # Calculate overall quality score
            scores = []
            for check_name, check_result in results['checks'].items():
                if 'error' not in check_result:
                    if 'accuracy_score' in check_result:
                        scores.append(check_result['accuracy_score'])
                    elif 'overall_completeness' in check_result:
                        scores.append(check_result['overall_completeness'])
                    elif 'consistency_score' in check_result:
                        scores.append(check_result['consistency_score'])
                    elif 'timeliness_score' in check_result:
                        scores.append(check_result['timeliness_score'])
            
            results['overall_quality_score'] = np.mean(scores) if scores else 0.0
            
            # Trigger alerts if quality is below threshold
            if results['overall_quality_score'] < self.alert_threshold:
                self.send_alert(
                    "Data Quality Alert: Quality Below Threshold",
                    f"Overall quality score: {results['overall_quality_score']:.3f}"
                )
            
            logger.info(f"Comprehensive quality check completed. Score: {results['overall_quality_score']:.3f}")
            return results
            
        except Exception as e:
            logger.error(f"Error in comprehensive quality check: {str(e)}")
            return {'error': str(e)}


# Example usage and testing
if __name__ == "__main__":
    # Create sample data with quality issues
    np.random.seed(42)
    
    sample_data = {
        'ID': range(1, 1001),
        'Age': np.random.randint(18, 80, 1000),
        'Salary': np.random.normal(60000, 15000, 1000),
        'Department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], 1000),
        'JoiningDate': pd.date_range('2020-01-01', '2023-12-31', periods=1000),
        'Performance': np.random.choice(['Good', 'Average', 'Excellent'], 1000)
    }
    
    # Introduce some quality issues
    sample_df = pd.DataFrame(sample_data)
    
    # Add missing values
    sample_df.loc[50:60, 'Salary'] = None
    sample_df.loc[100:105, 'Department'] = None
    
    # Add duplicates
    sample_df = pd.concat([sample_df, sample_df.iloc[:5]], ignore_index=True)
    
    # Add outliers
    sample_df.loc[200:205, 'Salary'] = 200000
    sample_df.loc[300:302, 'Age'] = 150
    
    # Initialize monitor
    monitor = DataQualityMonitor(alert_threshold=0.9)
    
    # Generate profile report
    profile = monitor.generate_profile_report(sample_df, 'data_profile.json')
    print("Profile report generated")
    
    # Train anomaly detector
    monitor.train_anomaly_detector(sample_df)
    
    # Define quality check configuration
    quality_config = {
        'accuracy_checks': [
            {
                'column': 'Department',
                'expected_values': ['IT', 'HR', 'Finance', 'Marketing']
            },
            {
                'column': 'Performance',
                'expected_values': ['Good', 'Average', 'Excellent']
            }
        ],
        'uniqueness_checks': ['ID'],
        'timeliness_checks': ['JoiningDate'],
        'outlier_checks': [
            {'column': 'Salary', 'method': 'iqr'},
            {'column': 'Age', 'method': 'zscore'}
        ]
    }
    
    # Perform comprehensive quality check
    quality_results = monitor.comprehensive_quality_check(sample_df, quality_config)
    
    # Print results
    print(f"\nOverall Quality Score: {quality_results['overall_quality_score']:.3f}")
    print(f"Data Shape: {quality_results['data_shape']}")
    
    # Setup email alerts (example configuration)
    # monitor.setup_email_alerts(
    #     smtp_server='smtp.gmail.com',
    #     smtp_port=587,
    #     email='your_email@gmail.com',
    #     password='your_password',
    #     recipients=['recipient@example.com']
    # )
    
    print("\nData quality monitoring system initialized successfully!")


### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: This example assumes integration with an existing monitoring system.
        - Setting up the alert involves creating a data source and an alert rule in Grafana.
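
As a rough sketch of the first two steps, the snippet below logs an error for every quality metric that falls below a threshold and builds (but does not send) a plain-text email describing the failures. The function names, the logger name, the example scores, and the `example.com` addresses are all assumptions made for illustration; actually sending the message would still require an SMTP connection via `smtplib`.

```python
import logging
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("quality_alerts")  # hypothetical logger name


def check_and_alert(results, threshold=0.9):
    """Log an ERROR for each metric below the threshold and return the failing names."""
    failed = [name for name, score in results.items() if score < threshold]
    for name in failed:
        logger.error("Quality drop: %s = %.3f (threshold %.2f)",
                     name, results[name], threshold)
    return failed


def build_alert_email(sender, recipients, failed, results):
    """Build a plain-text alert message; sending is left to the caller."""
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Data Quality Alert: {len(failed)} metric(s) below threshold"
    body = "\n".join(f"{name}: {results[name]:.3f}" for name in failed)
    msg.attach(MIMEText(body, "plain"))
    return msg


# Hypothetical scores, e.g. taken from a quality check run
scores = {"completeness": 0.97, "consistency": 0.82, "timeliness": 0.99}
failed = check_and_alert(scores)
if failed:
    message = build_alert_email("monitor@example.com", ["team@example.com"],
                                failed, scores)
    print(message["Subject"])
```

Separating message construction from delivery keeps the alert logic testable without a mail server.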

In [None]:
# Write your code from here

### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Write a simple custom, rule-based function for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.
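
One possible shape for these three steps is sketched below, assuming scikit-learn and NumPy are available. The synthetic data, the `iqr_outliers` and `monitor_batch` helpers, and the `max_anomaly_rate` parameter are hypothetical choices for illustration; feature scaling is skipped because the synthetic features already share a scale.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def iqr_outliers(values, k=1.5):
    """Rule-based check: mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def monitor_batch(model, batch, max_anomaly_rate=0.2):
    """Score a new batch with an already-fitted model and flag a high anomaly rate."""
    predictions = model.predict(batch)  # -1 marks an anomaly, 1 an inlier
    rate = float(np.mean(predictions == -1))
    return {"anomaly_rate": rate, "alert": rate > max_anomaly_rate}


# Step 1: train an Isolation Forest on synthetic "known good" data
rng = np.random.default_rng(42)
train = rng.normal(0, 1, size=(500, 2))
model = IsolationForest(contamination=0.05, random_state=42).fit(train)

# Step 2: the custom rule-based function on a single column
mask = iqr_outliers([10, 11, 12, 11, 10, 95])
print("IQR outlier mask:", mask.tolist())

# Step 3: monitor new batches with the pre-trained model
normal_batch = rng.normal(0, 1, size=(100, 2))   # same distribution as training
shifted_batch = rng.normal(8, 1, size=(100, 2))  # drifted far from training data
normal_report = monitor_batch(model, normal_batch)
shifted_report = monitor_batch(model, shifted_batch)
print("normal batch:", normal_report)
print("shifted batch:", shifted_report)
```

The model is fitted once and only `predict` is called inside `monitor_batch`, which mirrors how a pre-trained model would be reused on each incoming batch.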

In [None]:
# Write your code from here