## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using different metrics. You will explore examples where you assess completeness, accuracy, and consistency.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.
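
The steps above can be sketched on a small table. This is a minimal sketch, assuming a hypothetical `customers` DataFrame; the column names and values are illustrative and are not part of the activity's data. It lists the columns that contain missing values and then computes one dataset-level completeness figure as non-missing cells divided by all cells.

```python
import pandas as pd

# Hypothetical customer data, for illustration only
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "city": ["Pune", None, "Delhi", "Mumbai"],
    "phone": ["555-0101", "555-0102", None, None],
})

# Count missing values per column and keep the columns that have any
missing_counts = customers.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Dataset-level completeness: non-missing cells divided by all cells
non_missing = int(customers.notna().sum().sum())
total = customers.size
completeness_pct = non_missing / total * 100

print(f"Columns with missing values: {columns_with_missing}")
print(f"Completeness: {completeness_pct:.2f}%")
```

A per-column breakdown, `customers.notna().mean() * 100`, is often more useful than the single figure because it shows which fields drag the score down.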

In [1]:
import pandas as pd
import numpy as np

# Sample Data (Replace with actual data)
data = {
    'name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'age': [25, 30, 35, None, 22],
    'email': ['alice@example.com', 'bob@example', 'charlie@domain.com', 'eve@domain.com', None],
    'score': [88, 92, 95, None, 85]
}

# Create DataFrame
df = pd.DataFrame(data)

# Function to check if the DataFrame is empty
def check_empty(df):
    if df.empty:
        raise ValueError("The input DataFrame is empty!")

# Function to calculate completeness
def completeness(df):
    """Calculate completeness (percentage of non-missing values) for each column."""
    try:
        check_empty(df)
        # notnull() marks present values, so the column mean is the share of non-missing entries
        return df.notnull().mean() * 100
    except Exception as e:
        print(f"Error in completeness calculation: {e}")
        return None

# Function to calculate accuracy
def accuracy(df, column, reference_values):
    """Check if the values in the specified column are within the reference values."""
    try:
        check_empty(df)
        # Ensure the column exists
        if column not in df.columns:
            raise ValueError(f"Column '{column}' not found in the DataFrame!")
        
        # Compare each value in the column with the reference values
        valid_values = df[column].isin(reference_values)
        return valid_values.mean() * 100
    except Exception as e:
        print(f"Error in accuracy calculation: {e}")
        return None

# Function to calculate consistency
def consistency(df, column, min_value, max_value):
    """Check if the values in the specified column are between min_value and max_value."""
    try:
        check_empty(df)
        if column not in df.columns:
            raise ValueError(f"Column '{column}' not found in the DataFrame!")
        
        # Ensure the values are between min_value and max_value
        valid_values = df[column].apply(lambda x: min_value <= x <= max_value if pd.notnull(x) else True)
        return valid_values.mean() * 100
    except Exception as e:
        print(f"Error in consistency calculation: {e}")
        return None

# Function to calculate overall data quality score based on completeness, accuracy, and consistency
def calculate_data_quality(df, accuracy_reference_values, consistency_min, consistency_max):
    completeness_score = completeness(df)
    accuracy_score = accuracy(df, 'email', accuracy_reference_values)
    consistency_score = consistency(df, 'age', consistency_min, consistency_max)
    
    # Calculate an overall score (a simple average of the three metrics).
    # completeness_score is a per-column Series, so each score is tested with
    # `is not None`; a membership test such as `None not in [...]` compares None
    # with the Series element-wise and raises
    # "ValueError: The truth value of a Series is ambiguous".
    scores = [completeness_score, accuracy_score, consistency_score]
    if all(score is not None for score in scores):
        # Reduce the per-column completeness to one number before averaging
        overall_score = (completeness_score.mean() + accuracy_score + consistency_score) / 3
        print(f"Overall Data Quality Score: {overall_score:.2f}%")
    else:
        print("Data Quality Scores could not be calculated due to errors.")

    return completeness_score, accuracy_score, consistency_score

# Example: Calculate and print metrics
print("Calculating data quality metrics...")

accuracy_reference_values = ['alice@example.com', 'charlie@domain.com', 'eve@domain.com']
consistency_min = 18  # Minimum valid age
consistency_max = 100  # Maximum valid age

completeness_score, accuracy_score, consistency_score = calculate_data_quality(df, accuracy_reference_values, consistency_min, consistency_max)

print(f"\nCompleteness Score:\n{completeness_score}")
print(f"\nAccuracy Score (Email Column): {accuracy_score}%")
print(f"\nConsistency Score (Age Column between {consistency_min} and {consistency_max}): {consistency_score}%")

### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.
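
One possible way to approach these steps is sketched here. It assumes two hypothetical DataFrames, `sales` and `reference`, that share an `order_id` key and an `amount` column; these names and values are invented for illustration. The rows are aligned on the key with `merge`, and accuracy is the share of aligned rows whose values agree.

```python
import pandas as pd

# Hypothetical main dataset and trusted reference copy (illustrative values)
sales = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "amount": [250.0, 99.5, 40.0, 310.0],
})
reference = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "amount": [250.0, 99.5, 45.0, 310.0],
})

# Align rows on the key column; suffixes distinguish the two sources
merged = sales.merge(
    reference, on="order_id", how="inner", suffixes=("_main", "_ref")
)

# A row counts as accurate when the main value equals the reference value
matches = merged["amount_main"] == merged["amount_ref"]
accuracy_pct = matches.mean() * 100

print(f"Accuracy (amount): {accuracy_pct:.2f}%")
```

Two design choices are worth checking against your own data. An inner join silently drops rows whose key is absent from the reference, so they do not lower the score. Missing values never compare equal, so a missing `amount` on either side counts as a mismatch.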

In [None]:
# Write your code from here


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.
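
A rule-based check of this kind might look like the following. It is a minimal sketch, assuming a hypothetical `contacts` DataFrame and an assumed target format of `NNN-NNN-NNNN`; substitute whatever format your contact list is expected to follow.

```python
import pandas as pd

# Hypothetical contact list, for illustration only
contacts = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dee"],
    "phone": ["555-123-4567", "5551234567", "555-987-6543", None],
})

# Assumed rule: three digits, dash, three digits, dash, four digits
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

# fullmatch requires the whole string to fit the pattern;
# na=False makes missing phone numbers count as inconsistent in this sketch
is_consistent = contacts["phone"].str.fullmatch(PHONE_PATTERN, na=False).astype(bool)

# Consistency score: consistent entries divided by total entries
consistency_pct = is_consistent.mean() * 100

print(contacts.loc[~is_consistent, ["name", "phone"]])
print(f"Consistency (phone format): {consistency_pct:.2f}%")
```

Counting missing values as inconsistent is a choice made for this sketch. If missing values are already penalized by the completeness score, filter them out first with `dropna()` so they are not counted twice.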

In [None]:
# Write your code from here
