### Task 1: Automated Data Profiling

**Steps**:
1. Using Pandas-Profiling (now published as ydata-profiling)
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.
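
The two customization steps under Pandas-Profiling come down to choosing which columns to profile and which correlation measures to compute. As a rough illustration of what the report then contains, the sketch below builds a per-column summary and a Pearson correlation matrix for a column subset in plain pandas. It is a stand-in, not the profiling library's API; the helper name `mini_profile` and the sample columns are made up for the example.

```python
import pandas as pd

def mini_profile(df, columns=None):
    """Return (summary, correlations) for the selected columns.

    Hypothetical helper: a plain-pandas stand-in for what a profile
    report computes, not part of any profiling library.
    """
    subset = df if columns is None else df[list(columns)]
    summary = pd.DataFrame({
        "dtype": subset.dtypes.astype(str),
        "missing": subset.isna().sum(),
        "unique": subset.nunique(),
    })
    # Pearson correlations are only defined for numeric columns
    correlations = subset.select_dtypes("number").corr(method="pearson")
    return summary, correlations

# Made-up sample data
sample = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "income": [50000, 60000, 70000, 80000],
    "city": ["A", "B", "A", None],
})

summary, correlations = mini_profile(sample, columns=["age", "income"])
print(summary)
print(correlations)
```

With the profiling package installed, the same column subset can be passed straight to the report constructor, for example `ProfileReport(df[["age", "income"]])`.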

In [1]:
import pandas as pd
# pandas-profiling was renamed to ydata-profiling; the old package fails to
# import under pydantic 2.x (PydanticImportError: `BaseSettings` has moved).
# Install with: pip install ydata-profiling
from ydata_profiling import ProfileReport

# Load CSV data
df = pd.read_csv('data.csv')

# Generate profile report
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

# Save the report as an HTML file
profile.to_file("profile_report.html")
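
The Great Expectations steps (create a suite, add several expectations, validate the data) are not covered by the cell above. Because the Great Expectations API differs considerably between releases, the sketch below does not use it. It is a plain-pandas stand-in that only illustrates the idea of a suite of named checks and a validation result with an overall `success` flag. All function and column names here are hypothetical.

```python
import pandas as pd

# Each expectation is a (description, check) pair; a check takes the
# DataFrame and returns True when the data meets the expectation.
def expect_not_null(column):
    return (f"{column} has no missing values",
            lambda df: bool(df[column].notna().all()))

def expect_between(column, low, high):
    return (f"{column} is between {low} and {high}",
            lambda df: bool(df[column].dropna().between(low, high).all()))

def validate(df, suite):
    """Run every expectation and report per-check and overall success."""
    results = [{"expectation": name, "success": check(df)} for name, check in suite]
    return {"success": all(r["success"] for r in results), "results": results}

# A suite with multiple expectations (column names are made up)
suite = [
    expect_not_null("age"),
    expect_between("age", 0, 120),
    expect_not_null("income"),
]

data = pd.DataFrame({"age": [25, 30, 150], "income": [50000, None, 75000]})
report = validate(data, suite)
print(report["success"])
for result in report["results"]:
    print(result)
```

The overall `success` value is a single boolean, which is the shape of input an alerting function typically expects.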

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: this example assumes integration with a monitoring system.
        - Alert setup would involve creating a data source and an alert rule in Grafana.
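
A "quality drop" suggests comparing recent results against a threshold rather than alerting on every single failure. The sketch below is one possible way to do that with the standard library: it keeps a rolling window of check results and logs an error when the pass rate falls below a minimum. The class name, window size, and threshold are assumptions made for illustration; a pass rate like this is also the kind of value a dashboard alert rule could watch.

```python
import logging
from collections import deque

logger = logging.getLogger("data_quality")

class QualityDropMonitor:
    """Track recent check results and flag a drop in the pass rate.

    Hypothetical helper; window and threshold values are assumptions.
    """

    def __init__(self, window=20, min_pass_rate=0.9):
        self.results = deque(maxlen=window)  # oldest results fall out automatically
        self.min_pass_rate = min_pass_rate

    def record(self, passed):
        """Store one result; return True if an alert was raised."""
        self.results.append(bool(passed))
        pass_rate = sum(self.results) / len(self.results)
        if pass_rate < self.min_pass_rate:
            logger.error("Data quality drop: pass rate %.0f%% over last %d checks",
                         pass_rate * 100, len(self.results))
            return True
        return False

monitor = QualityDropMonitor(window=5, min_pass_rate=0.75)
alerts = [monitor.record(passed) for passed in [True, True, True, True, False, False]]
print(alerts)  # only the last check pushes the pass rate (3 of 5) below 75%
```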

In [2]:
import logging
import smtplib
from email.message import EmailMessage

# Setup logging configuration
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    handlers=[
                        logging.FileHandler("data_quality.log"),
                        logging.StreamHandler()
                    ])

def alert_on_failure(success):
    if not success:
        logging.error("Data Quality Alert: One or more data quality checks failed!")
        send_email_alert(
            subject="Data Quality Alert: Quality Drop Detected",
            body="One or more data quality checks have failed. Please investigate immediately.",
            to_emails=["recipient@example.com"]  # Change to your recipient emails
        )
    else:
        logging.info("All data quality checks passed successfully.")

def send_email_alert(subject, body, to_emails):
    email_user = 'your_email@example.com'       # Your email
    email_password = 'your_email_password'      # Your email password or app-specific password
    email_server = 'smtp.example.com'            # SMTP server e.g., smtp.gmail.com
    email_port = 587                             # SMTP TLS port
    
    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = email_user
    msg['To'] = ', '.join(to_emails)
    msg.set_content(body)
    
    try:
        with smtplib.SMTP(email_server, email_port) as server:
            server.starttls()
            server.login(email_user, email_password)
            server.send_message(msg)
        logging.info("Email alert sent successfully.")
    except (smtplib.SMTPException, OSError) as e:  # SMTP errors and network failures
        logging.error(f"Failed to send email: {e}")

# Example function to simulate data quality validation
def validate_data_quality():
    # Placeholder for actual validation logic
    # Return True if data quality passes, False if it fails
    # For demo, we simulate a failure
    return False

if __name__ == "__main__":
    # Run data quality check
    data_quality_passed = validate_data_quality()
    
    # Trigger alerting based on the check result
    alert_on_failure(data_quality_passed)


2025-05-21 07:39:40,146 - ERROR - Data Quality Alert: One or more data quality checks failed!
2025-05-21 07:39:40,648 - ERROR - Failed to send email: [Errno -2] Name or service not known
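
The run above fails at the SMTP step because `smtp.example.com` is a placeholder host, so the email content is never exercised. One possible way to check it without a mail server is to separate message construction from delivery. The sketch below assumes a hypothetical helper, `build_alert_message`, that is not part of the original cell, and the second recipient address is made up.

```python
from email.message import EmailMessage

def build_alert_message(subject, body, sender, to_emails):
    """Build the alert email without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(to_emails)
    msg.set_content(body)
    return msg

msg = build_alert_message(
    subject="Data Quality Alert: Quality Drop Detected",
    body="One or more data quality checks have failed.",
    sender="your_email@example.com",
    to_emails=["recipient@example.com", "oncall@example.com"],
)
print(msg["To"])
print(msg.get_content())
```

`send_email_alert` could then pass the returned message to `server.send_message`, leaving only the network call inside the `try` block.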


### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Use simple, custom function-based AI logic for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.
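
For the custom function step, a fixed cut-off is one option. Another common rule, sketched below under the assumption that the column is numeric, is the interquartile range (IQR) test: flag values more than `k` IQRs below the first quartile or above the third. The function name and the sample values are made up for the example.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return index labels of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series.index[mask].tolist()

# Made-up incomes with one extreme value at position 6
incomes = pd.Series([50000, 52000, 55000, 58000, 60000, 61000, 500000])
print(iqr_outliers(incomes))  # → [6]
```

Unlike a hard-coded threshold, the bounds adapt to the data, although on very small or widely spread samples they can become too loose to flag anything.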

In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample dataset with two features: age and income (income has missing values replaced)
data = np.array([
    [25, 50000],
    [30, 60000],
    [35, 75000],
    [40, 0],       # Missing income replaced with 0 as a placeholder
    [45, 100000],
    [50, 120000],
    [55, 130000]
])

# Convert to DataFrame for easier handling
df = pd.DataFrame(data, columns=['age', 'income'])

# Train Isolation Forest model
def train_isolation_forest(data):
    model = IsolationForest(contamination=0.15, random_state=42)
    model.fit(data)
    return model

# Custom simple AI logic for outlier detection (e.g., income < threshold)
def simple_ai_outlier_detection(df, income_threshold=20000):
    # Return indices of rows considered outliers based on income threshold
    return df.index[df['income'] < income_threshold].tolist()

# Monitoring function using the pre-trained model
def monitor_data_quality(df, model):
    preds = model.predict(df)  # -1 for anomaly, 1 for normal
    anomalies = df[preds == -1]
    return anomalies

if __name__ == "__main__":
    # Train model on the data
    isolation_forest_model = train_isolation_forest(df)

    # Detect anomalies using Isolation Forest
    anomalies_if = monitor_data_quality(df, isolation_forest_model)
    print("Anomalies detected by Isolation Forest:")
    print(anomalies_if)

    # Detect anomalies using simple AI logic
    anomalies_simple = simple_ai_outlier_detection(df)
    print("\nAnomalies detected by simple AI logic (income < 20,000):")
    print(df.loc[anomalies_simple])


Anomalies detected by Isolation Forest:
   age  income
3   40       0

Anomalies detected by simple AI logic (income < 20,000):
   age  income
3   40       0
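
The cell above trains and scores the model in the same run. "Pre-trained" usually means the model was fitted earlier and is loaded when new data arrives. A minimal sketch of that split, assuming `joblib` (installed alongside scikit-learn) for persistence, a temporary directory as the storage location, and a made-up new batch:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import IsolationForest

# Training side: fit once on reference data and save the model
reference = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55],
    "income": [50000, 60000, 75000, 0, 100000, 120000, 130000],
})
model = IsolationForest(contamination=0.15, random_state=42).fit(reference)

# Hypothetical storage location; a real pipeline would use a stable path
model_path = os.path.join(tempfile.mkdtemp(), "isolation_forest.joblib")
joblib.dump(model, model_path)

# Monitoring side: load the saved model and score a new batch
def monitor_batch(batch, path):
    """Return the rows of `batch` that the saved model flags as anomalies."""
    loaded = joblib.load(path)
    return batch[loaded.predict(batch) == -1]

new_batch = pd.DataFrame({"age": [33, 41], "income": [70000, 5000000]})
print(monitor_batch(new_batch, model_path))
```

Saved models should only be loaded from trusted locations, and ideally with the same scikit-learn version that produced them.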
