## Data Quality Framework Implementation

**Description**: Implement a simple data quality measurement framework using ISO 8000 principles to assess key dimensions in a dataset.
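
To make the scoring idea concrete before the full listing: each check produces a proportion between 0 and 1, and the overall score is the weighted average of those proportions. The sketch below assumes made-up scores and weights; `weighted_overall_score` is a hypothetical helper name used only for this illustration.

```python
# Hypothetical per-check results: (score between 0 and 1, weight).
# These numbers are illustrative, not measured from any dataset.
checks = {
    "completeness": (0.90, 0.5),
    "validity": (0.80, 0.3),
    "uniqueness": (1.00, 0.2),
}


def weighted_overall_score(scored_checks):
    """Return the weighted average of (score, weight) pairs, or 0.0 if there is no weight."""
    total_weight = sum(weight for _, weight in scored_checks.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weight for score, weight in scored_checks.values())
    return weighted_sum / total_weight


overall = weighted_overall_score(checks)
print(f"Overall score: {overall:.2f}")
```

Dividing by the total weight means the weights do not have to sum to 1, which is convenient when checks are added or removed later.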

In [1]:
"""
Conceptual Data Quality Framework (Inspired by ISO 8000)

This framework provides a structure for assessing data quality based on key dimensions.
It's implemented in plain Python with pandas, kept simple for clarity and adaptability.

Key Dimensions (Examples - Expand as needed):
    - Completeness
    - Accuracy
    - Consistency
    - Timeliness
    - Validity
    - Uniqueness

Functions:
    - calculate_completeness(data, column_name):
        - Calculates the proportion of non-null values in a column.
        - Returns a value between 0 and 1 (1 indicates 100% complete).

    - calculate_accuracy(data, column_name, expected_values):
        - Compares the values in a column against a set of expected values.
        - Returns the proportion of values that match the expected values.
        - Handles numeric and categorical data.

    - calculate_consistency(data, column_name_1, column_name_2, relationship_function):
        - Checks if the relationship between two columns holds true.
        - relationship_function is a function that defines the expected relationship
          (e.g., lambda x, y: x < y).
        - Returns the proportion of rows where the relationship is satisfied.

    - calculate_timeliness(data, date_column_name, reference_date):
        - Assesses how up-to-date the data is.
        - Returns the proportion of dates on or before reference_date (an example rule;
          swap in a time-window rule as needed).

    - calculate_validity(data, column_name, validation_function):
        - Checks if the values in a column conform to a specific format or rule.
        - validation_function is a function that returns True if a value is valid, False otherwise.
        - Returns the proportion of valid values.

    - calculate_uniqueness(data, column_name):
        - Calculates the proportion of unique values in a column.
        - Returns a value between 0 and 1 (1 indicates all values are unique).

    - assess_data_quality(data, quality_checks):
        - Applies a set of quality checks to the data.
        - quality_checks is a dictionary where keys are dimension names
          (e.g., "Completeness") and values are lists of tuples in one of two shapes:
          - (column_name, check_function, threshold_or_argument, weight)
          - (column_name_1, column_name_2, check_function, argument, weight)
            - column_name: The name of the column to check.
            - check_function: The function to use for the check
              (e.g., calculate_completeness).
            - threshold_or_argument: For checks that take only a column
              (completeness, uniqueness), the minimum acceptable score
              (e.g., 0.9 for 90%). For checks that need an extra input
              (expected values, a validation function, a reference date),
              that input.
            - weight: The importance of this check (used for the overall score).
        - Executes each check and stores the result.
        - Calculates an overall data quality score as the weighted average
          of the check scores.
        - Returns a dictionary of results, including the overall score.

Main Program:
    - Load the dataset.  (e.g., from CSV, database)
    - Define the data quality checks (using the structure described above).
    - Call assess_data_quality() to get the results.
    - Print or log the results, including the overall data quality score.
    - (Optional) Generate a report or dashboard.
"""

import pandas as pd

# -----------------------------------------------------------------------------
# Helper Functions
# -----------------------------------------------------------------------------

def calculate_completeness(data, column_name):
    """
    Calculates the completeness of a column.

    Args:
        data:  The dataset (e.g., Pandas DataFrame).
        column_name: The name of the column to check.

    Returns:
        The completeness score (0 to 1).
    """
    # 1. Get the number of non-null values in the column.
    non_null_count = data[column_name].notnull().sum()
    # 2. Get the total number of rows.
    total_count = len(data)
    # 3. Calculate completeness.
    if total_count == 0:
        return 0.0  # Handle empty data
    completeness = non_null_count / total_count
    return completeness

def calculate_accuracy(data, column_name, expected_values):
    """
    Calculates the accuracy of a column against a set of expected values.

    Args:
        data: The dataset.
        column_name: The name of the column.
        expected_values: A set or list of valid values.

    Returns:
        The accuracy score (0 to 1).
    """
    # 1. Get the values from the specified column.
    column_values = data[column_name].tolist()
    # 2. Count the values that appear in expected_values (missing values never match).
    valid_count = sum(1 for value in column_values if value in expected_values)
    # 3. Calculate accuracy.
    total_count = len(column_values)
    if total_count == 0:
        return 0.0
    accuracy = valid_count / total_count
    return accuracy

def calculate_consistency(data, column_name_1, column_name_2, relationship_function):
    """
    Checks if the relationship between two columns holds.

    Args:
        data: The dataset.
        column_name_1: The name of the first column.
        column_name_2: The name of the second column.
        relationship_function: A function that takes values from the two columns
            and returns True if the relationship is satisfied, False otherwise.
            Example:  lambda x, y: x < y

    Returns:
        The consistency score (0 to 1).
    """
    # 1. Get the values from the two columns.
    values_1 = data[column_name_1].tolist()
    values_2 = data[column_name_2].tolist()

    # 2. Check the relationship for each pair of values.
    consistent_count = 0
    for v1, v2 in zip(values_1, values_2):
        if relationship_function(v1, v2):
            consistent_count += 1
    # 3. Calculate consistency
    total_count = len(values_1)
    if total_count == 0:
        return 0.0
    consistency = consistent_count / total_count
    return consistency

def calculate_timeliness(data, date_column_name, reference_date):
    """
    Calculates the timeliness of data in a date column.

    Args:
        data: The dataset.
        date_column_name: The name of the date column.
        reference_date: The date to compare against (e.g., current date).

    Returns:
        The timeliness score (0 to 1, or a time-based metric).
    """
    # 1. Get the date values from the specified column.
    date_values = data[date_column_name]
    # 2. Count how many dates are "timely".
    timely_count = 0
    for date_value in date_values:
        # Example condition: the date is on or before the reference date.
        if date_value <= reference_date:
            timely_count += 1
    # 3. Calculate timeliness.
    total_count = len(date_values)
    if total_count == 0:
        return 0.0
    timeliness = timely_count / total_count
    return timeliness

def calculate_validity(data, column_name, validation_function):
    """
    Checks if values in a column are valid according to a given function.

    Args:
        data: The dataset.
        column_name: The name of the column to check.
        validation_function: A function that takes a value and returns True
            if it's valid, False otherwise.  Example:  lambda x: isinstance(x, int)

    Returns:
        The validity score (0 to 1).
    """
    # 1. Get the values from the column.
    column_values = data[column_name].tolist()
    # 2. Count the number of valid values.
    valid_count = sum(1 for value in column_values if validation_function(value))
    # 3. Calculate validity.
    total_count = len(column_values)
    if total_count == 0:
        return 0.0
    validity = valid_count / total_count
    return validity

def calculate_uniqueness(data, column_name):
    """
    Calculates the uniqueness of values in a column.

    Args:
        data: The dataset.
        column_name: The name of the column to check.

    Returns:
        The uniqueness score (0 to 1).
    """
    # 1. Get the number of unique values in the column.
    #    nunique() ignores nulls, so missing values lower the score.
    unique_count = data[column_name].nunique()
    # 2. Get the total number of rows.
    total_count = len(data)
    # 3. Calculate uniqueness.
    if total_count == 0:
        return 0.0
    uniqueness = unique_count / total_count
    return uniqueness

# -----------------------------------------------------------------------------
# Main Function
# -----------------------------------------------------------------------------

def assess_data_quality(data, quality_checks):
    """
    Assesses the data quality of a dataset based on defined checks.

    Args:
        data: The dataset (e.g., Pandas DataFrame).
        quality_checks: A dictionary defining the checks to perform.
            Example:
            {
                "Completeness": [
                    ("customer_id", calculate_completeness, 0.95, 0.2),  # col, func, threshold, weight
                    ("product_id", calculate_completeness, 0.98, 0.3)
                ],
                "Accuracy": [
                    ("age", calculate_accuracy, set(range(0, 120)), 0.1),
                    ("city", calculate_accuracy, ["New York", "London", "Paris"], 0.15)
                ],
                "Consistency": [
                    ("order_date", "ship_date", calculate_consistency, lambda x, y: x <= y, 0.2)
                ],
                "Validity": [
                    ("email", calculate_validity, lambda x: "@" in x and "." in x, 0.1)
                ],
                "Uniqueness": [
                    ("customer_id", calculate_uniqueness, 1.0, 0.1)
                ]
            }

    Returns:
        A dictionary of results and the overall data quality score.
        Example:
        {
            "Completeness": {
                "customer_id": 0.97,
                "product_id": 0.99
            },
            "Accuracy": {
                "age": 0.99,
                "city": 0.95
            },
            "Consistency": {
                "order_date-ship_date": 0.98
            },
            "Validity":{
                "email": 0.92
            },
            "Uniqueness":{
                "customer_id": 1.0
            },
            "overall_score": 0.975
        }
    """
    results = {}
    weighted_sum = 0.0
    total_weight = 0.0

    for dimension, checks in quality_checks.items():
        results[dimension] = {}
        for check in checks:
            threshold = None
            if len(check) == 5:
                # Two-column check: (column_1, column_2, function, argument, weight)
                column_name, column_name_2, check_function, argument, weight = check
                label = f"{column_name}-{column_name_2}"
                score = check_function(data, column_name, column_name_2, argument)
            else:
                # Single-column check: (column, function, third element, weight)
                column_name, check_function, third, weight = check
                label = column_name
                if isinstance(third, (int, float)):
                    # A numeric third element is the minimum acceptable score.
                    threshold = third
                    score = check_function(data, column_name)
                else:
                    # Anything else (expected values, a validation function,
                    # a reference date) is passed through to the check function.
                    score = check_function(data, column_name, third)

            score = float(score)  # Store plain Python floats in the results
            results[dimension][label] = score
            if threshold is not None and score < threshold:
                # Consider logging instead of printing.
                print(f"Data Quality Alert: {dimension} check on {label} failed. Score: {score:.2f} < Threshold: {threshold:.2f}")

            weighted_sum += score * weight
            total_weight += weight

    # Weighted average of all check scores; avoid division by zero.
    overall_score = weighted_sum / total_weight if total_weight > 0 else 0.0
    results["overall_score"] = overall_score
    return results

# -----------------------------------------------------------------------------
# Example Usage
# -----------------------------------------------------------------------------

if __name__ == "__main__":
    # 1. Load the data.
    data = pd.DataFrame({
        "customer_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "product_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, None],
        "age": [25, 30, 22, 35, 28, 40, 27, 31, 24, 120],
        "city": ["New York", "London", "Paris", "Tokyo", "New York", "London", "Berlin", "Tokyo", "Paris", "Sydney"],
        "order_date": ["2024-01-10", "2024-01-15", "2024-01-20", "2024-02-01", "2024-02-05", "2024-02-10", "2024-02-15", "2024-03-01", "2024-03-05", "2024-03-10"],
        "ship_date": ["2024-01-12", "2024-01-17", "2024-01-22", "2024-02-03", "2024-02-07", "2024-02-12", "2024-02-17", "2024-03-03", "2024-03-07", "2024-03-12"],
        "email": ["alice@example.com", "bob@invalid", "charlie@example.net", "david@test.org", "eve@example.com",
                  "frank@test.com", "grace.com", "henry@example.io", "ivy@test.co.uk", "jack@example.info"]
    })

    # 2. Define the data quality checks.
    quality_checks = {
        "Completeness": [
            ("customer_id", calculate_completeness, 1.0, 0.2),
            ("product_id", calculate_completeness, 0.9, 0.3)
        ],
        "Accuracy": [
            ("age", calculate_accuracy, set(range(0, 120)), 0.1),
            ("city", calculate_accuracy, ["New York", "London", "Paris", "Tokyo", "Berlin", "Sydney"], 0.15)
        ],
        "Consistency": [
            ("order_date", "ship_date", calculate_consistency, lambda x, y: x <= y, 0.2)
        ],
        "Validity": [
            ("email", calculate_validity, lambda x: "@" in x and "." in x, 0.1)
        ],
        "Uniqueness": [
            ("customer_id", calculate_uniqueness, 1.0, 0.1)
        ]
    }

    # Convert date columns to datetime objects so the date checks compare
    # real dates rather than strings.
    data["order_date"] = pd.to_datetime(data["order_date"])
    data["ship_date"] = pd.to_datetime(data["ship_date"])
    reference_date = pd.to_datetime("2024-03-01")

    # Timeliness: share of orders placed on or before the reference date.
    quality_checks["Timeliness"] = [("order_date", calculate_timeliness, reference_date, 0.1)]

    # 3. Assess data quality.
    results = assess_data_quality(data, quality_checks)

    # 4. Print the results.
    print(results)
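
The optional reporting step mentioned in the framework outline can be sketched by flattening the nested results dictionary into a table. This is a minimal illustration, assuming a results dictionary in the shape documented for `assess_data_quality()`; `results_to_frame` and the numbers in `example_results` are hypothetical and exist only for this sketch.

```python
import pandas as pd

# Illustrative results in the shape documented for assess_data_quality();
# the numbers are made up for this sketch.
example_results = {
    "Completeness": {"customer_id": 1.0, "product_id": 0.9},
    "Validity": {"email": 0.8},
    "overall_score": 0.9,
}


def results_to_frame(results):
    """Flatten the nested results dictionary into one row per check."""
    rows = [
        {"dimension": dimension, "check": check_name, "score": score}
        for dimension, checks in results.items()
        if isinstance(checks, dict)  # skips the scalar "overall_score" entry
        for check_name, score in checks.items()
    ]
    return pd.DataFrame(rows, columns=["dimension", "check", "score"])


report = results_to_frame(example_results)
# Lowest scores first, so the weakest checks are easy to spot.
print(report.sort_values("score").to_string(index=False))
print(f"Overall score: {example_results['overall_score']:.2f}")
```

A flat table like this is easier to sort, filter, or export than the nested dictionary, which is why it may be a convenient starting point for a report or dashboard.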


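
The check tuples in the listing above are positional, and the third element is sometimes a threshold and sometimes an argument for the check function, which makes argument mix-ups easy. One possible alternative, sketched here under stated assumptions, is to name each field explicitly. `QualityCheck`, `run_checks`, `share_non_null`, and `share_in` are hypothetical names introduced for this illustration and are not part of the framework above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

import pandas as pd


@dataclass
class QualityCheck:
    """Hypothetical named alternative to the positional check tuples."""
    dimension: str
    columns: Tuple[str, ...]            # one or two column names
    function: Callable[..., float]      # called as function(data, *columns, *args)
    args: Tuple[Any, ...] = ()          # extra inputs, e.g. expected values
    threshold: Optional[float] = None   # minimum acceptable score, if any
    weight: float = 1.0


def run_checks(data, checks):
    """Run each check; return (scores by label, failed labels, weighted overall score)."""
    scores, failed = {}, []
    weighted_sum = 0.0
    total_weight = 0.0
    for check in checks:
        score = float(check.function(data, *check.columns, *check.args))
        label = f"{check.dimension}:{'-'.join(check.columns)}"
        scores[label] = score
        if check.threshold is not None and score < check.threshold:
            failed.append(label)
        weighted_sum += score * check.weight
        total_weight += check.weight
    overall = weighted_sum / total_weight if total_weight > 0 else 0.0
    return scores, failed, overall


# Small stand-in check functions so the sketch runs on its own.
def share_non_null(data, column):
    return data[column].notna().mean()


def share_in(data, column, allowed):
    return data[column].isin(allowed).mean()


frame = pd.DataFrame({"id": [1, 2, None, 4], "city": ["Paris", "Rome", "Paris", "Oslo"]})
checks = [
    QualityCheck("Completeness", ("id",), share_non_null, threshold=0.9, weight=0.5),
    QualityCheck("Accuracy", ("city",), share_in, args=(["Paris", "Rome"],), weight=0.5),
]
scores, failed, overall = run_checks(frame, checks)
print(scores, failed, overall)
```

Because the threshold and the extra arguments live in separate named fields, a check can carry both at once, which the positional tuples cannot express.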