## Architecture to Monitor Data Quality Over Time

**Description**: Design a monitoring system in Python that checks and logs data quality metrics (accuracy, completeness) for a dataset over time.

**Steps to follow:**
1. Implement a Scheduled Script:
    - Use the third-party `schedule` library (`pip install schedule`) to run a script periodically.
2. Script to Calculate Metrics:
    - For simplicity, use a function `calculate_quality_metrics()` that calculates and logs metrics such as the missing rate or the mismatch rate.
3. Store Logs:
    - Use Python's `logging` library to save these metrics over time.
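
The steps above name two metrics: the missing rate (completeness) and the mismatch rate (accuracy). A minimal sketch of how each could be computed with pandas follows. The helper names `missing_rate` and `mismatch_rate`, the set of allowed values, and the sample frame are assumptions made for illustration and are not part of the exercise.

```python
import pandas as pd


def missing_rate(df):
    """Share of cells that are null across the whole frame."""
    if df.size == 0:  # avoid dividing by zero on an empty frame
        return 0.0
    return float(df.isnull().sum().sum() / df.size)


def mismatch_rate(df, column, allowed_values):
    """Share of non-null entries in `column` that are not in `allowed_values`."""
    present = df[column].dropna()
    if present.empty:
        return 0.0
    return float((~present.isin(allowed_values)).mean())


# Hypothetical sample: one unexpected Gender code and two missing cells
sample = pd.DataFrame({
    "Gender": ["F", "M", "X", None],
    "Age": [25, None, 35, 40],
})

overall_missing = missing_rate(sample)
gender_mismatch = mismatch_rate(sample, "Gender", {"F", "M"})
print(f"Overall missing rate: {overall_missing:.2%}")
print(f"Gender mismatch rate: {gender_mismatch:.2%}")
```

Treating null entries as missing rather than mismatched keeps the two metrics from double-counting the same cell; other definitions are possible.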

In [None]:
# Write your code from here

In [2]:
import logging
import time
from datetime import datetime

import pandas as pd
import schedule  # third-party package: pip install schedule

# Sample dataset
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
    'Gender': ['F', 'M', 'M', 'F']
})

# Configure logging
logging.basicConfig(
    filename='data_quality.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

# Function to calculate quality metrics
def calculate_quality_metrics(df):
    """Log the row count, missing-value count, and overall missing rate of df."""
    total_rows = len(df)
    total_cells = df.size  # rows * columns
    missing_values = int(df.isnull().sum().sum())
    # Guard against division by zero on an empty DataFrame
    missing_rate = missing_values / total_cells if total_cells else 0.0

    # Log metrics
    logging.info(f"Total Rows: {total_rows}")
    logging.info(f"Missing Values: {missing_values}")
    logging.info(f"Missing Rate: {missing_rate:.2%}")
    print(f"Logged metrics: Total Rows={total_rows}, Missing Values={missing_values}, Missing Rate={missing_rate:.2%}")

# Function to run the monitoring script
def monitor_data_quality():
    print(f"Running data quality check at {datetime.now()}")
    calculate_quality_metrics(data)

# Schedule the script to run every minute
schedule.every(1).minutes.do(monitor_data_quality)

# Keep the script running (this loop blocks until the kernel or process is interrupted)
print("Starting the data quality monitoring system...")
while True:
    schedule.run_pending()
    time.sleep(1)

In [3]:
import schedule  # third-party; raises ModuleNotFoundError until installed (pip install schedule)

# Job function, unchanged from the previous cell
def monitor_data_quality():
    print(f"Running data quality check at {datetime.now()}")
    calculate_quality_metrics(data)

# Drop any jobs left over from earlier runs, then schedule the check every minute
schedule.clear()
schedule.every(1).minutes.do(monitor_data_quality)

# Run the scheduled task a limited number of times for demonstration
print("Starting the data quality monitoring system (manual execution)...")
for _ in range(3):
    time.sleep(60)  # Sleep first: the job only comes due one minute after it is scheduled
    schedule.run_pending()  # Runs the job, so the check executes 3 times in total

ModuleNotFoundError: No module named 'schedule'
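
The `ModuleNotFoundError` above indicates that the third-party `schedule` package is not installed in this environment. Where installing it is not an option, a bounded loop built on the standard library's `time.sleep` can stand in for it. The sketch below is a swapped-in alternative, not the `schedule` API; the helper `run_periodically` and the `demo_job` function are hypothetical names used only for illustration.

```python
import time


def run_periodically(job, interval_seconds, runs, sleep=time.sleep):
    """Call `job` exactly `runs` times, pausing `interval_seconds` between calls."""
    results = []
    for i in range(runs):
        results.append(job())
        if i < runs - 1:  # no need to wait after the final run
            sleep(interval_seconds)
    return results


# Hypothetical stand-in for the monitoring job: records how often it was called
calls = []

def demo_job():
    calls.append(len(calls) + 1)
    return calls[-1]


# A short interval keeps the demonstration quick
outcome = run_periodically(demo_job, interval_seconds=0.01, runs=3)
print(outcome)
```

In this notebook, the equivalent call would be `run_periodically(monitor_data_quality, 60, 3)`. Unlike `schedule`, this loop does not correct for the time the job itself takes, so the interval drifts slightly on long-running checks.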

In [4]:
# Display the summary of the dataset
print("Dataset Summary:")
data.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# Display the first few rows of the dataset
print("\nDataset Preview:")
print(data.head())

Dataset Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3 non-null      object 
 1   Age     3 non-null      float64
 2   Gender  4 non-null      object 
dtypes: float64(1), object(2)
memory usage: 224.0+ bytes

Dataset Preview:
      Name   Age Gender
0    Alice  25.0      F
1      Bob   NaN      M
2  Charlie  35.0      M
3     None  40.0      F
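
The summary above shows that the missing values are spread unevenly across columns, which a single overall missing rate hides. A minimal sketch of a per-column breakdown follows; the variable names `sample_data` and `column_missing` are assumptions for illustration, and the frame simply repeats the sample dataset so the block runs on its own.

```python
import pandas as pd

# Same values as the sample dataset used in this notebook
sample_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
    'Gender': ['F', 'M', 'M', 'F']
})

# isnull() yields booleans; their column-wise mean is the fraction of missing cells
column_missing = sample_data.isnull().mean()

for column, rate in column_missing.items():
    print(f"{column}: {rate:.2%} missing")
```

Logging these per-column rates alongside the overall figure would make it easier to see which field degrades over time.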
