## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for sample datasets using three metrics: completeness, accuracy, and consistency. Each task works through one metric on a small example.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [1]:
import pandas as pd

# Sample dataset: customer information
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Email': ['alice@example.com', None, 'charlie@example.com', 'david@example.com', 'eve@example.com'],
    'Age': [25, 30, 35, None, 28]
}

# Load data into DataFrame
df = pd.DataFrame(data)

# Display the dataset
print("Dataset:\n", df)

# Identify columns with missing values
missing_counts = df.isnull().sum()
print("\nMissing values per column:\n", missing_counts)

# Calculate completeness score for each column
completeness_score = (df.notnull().sum() / len(df)) * 100
print("\nCompleteness score (percentage of non-missing values) per column:\n", completeness_score)

# Calculate overall completeness score (average across all columns)
overall_completeness = completeness_score.mean()
print(f"\nOverall Completeness Score: {overall_completeness:.2f}%")


Dataset:
    CustomerID   Name                Email   Age
0           1  Alice    alice@example.com  25.0
1           2    Bob                 None  30.0
2           3   None  charlie@example.com  35.0
3           4  David    david@example.com   NaN
4           5    Eve      eve@example.com  28.0

Missing values per column:
 CustomerID    0
Name          1
Email         1
Age           1
dtype: int64

Completeness score (percentage of non-missing values) per column:
 CustomerID    100.0
Name           80.0
Email          80.0
Age            80.0
dtype: float64

Overall Completeness Score: 85.00%
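The completeness calculation above can be wrapped in a reusable helper. The following is a minimal sketch, not part of the original activity: the function name `completeness_report`, the `blank_is_missing` flag, and the choice to treat empty or whitespace-only strings as missing are all assumptions made for illustration.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame, blank_is_missing: bool = True):
    """Return (per-column completeness %, overall cell-level completeness %).

    Hypothetical helper: optionally counts blank strings as missing values.
    """
    present = df.notna()
    if blank_is_missing:
        # Assumption: empty or whitespace-only strings carry no information.
        is_blank = df.apply(
            lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
        ).astype(bool)
        present = present & ~is_blank
    per_column = present.mean() * 100
    # Overall score: non-missing cells divided by all cells in the table.
    overall = float(present.to_numpy().mean() * 100)
    return per_column, overall


# Small illustrative dataset (made up for this sketch)
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Name": ["Alice", "", None, "David"],
})
per_column, overall = completeness_report(sample)
print(per_column)
print(f"Overall: {overall:.2f}%")
```

In this sample, `Name` has one blank string and one `None`, so only two of its four values count as present. Whether blanks should count as missing depends on the dataset; pass `blank_is_missing=False` to reproduce the plain `notnull()` behavior used in the task.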


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [2]:
# Compare a main dataset against a trusted reference dataset
import pandas as pd

# Main dataset (possibly with some errors)
main_data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Laptop', 'Camera'],
    'Quantity': [1, 2, 1, 1, 3]
}

# Reference dataset (assumed to be accurate)
reference_data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Laptop', 'Camera'],
    'Quantity': [1, 2, 2, 1, 3]  # Note: Quantity for OrderID 103 differs
}

# Load into DataFrames
df_main = pd.DataFrame(main_data)
df_ref = pd.DataFrame(reference_data)

# Display datasets
print("Main Dataset:\n", df_main)
print("\nReference Dataset:\n", df_ref)

# Merge datasets on 'OrderID' to compare key columns
merged = pd.merge(df_main, df_ref, on='OrderID', suffixes=('_main', '_ref'))

# Check accuracy for 'Product' and 'Quantity'
merged['Product_match'] = merged['Product_main'] == merged['Product_ref']
merged['Quantity_match'] = merged['Quantity_main'] == merged['Quantity_ref']

# Calculate accuracy percentage for each column
product_accuracy = merged['Product_match'].mean() * 100
quantity_accuracy = merged['Quantity_match'].mean() * 100

print(f"\nAccuracy for 'Product': {product_accuracy:.2f}%")
print(f"Accuracy for 'Quantity': {quantity_accuracy:.2f}%")

# Overall accuracy (average of both)
overall_accuracy = (product_accuracy + quantity_accuracy) / 2
print(f"\nOverall Accuracy Score: {overall_accuracy:.2f}%")


Main Dataset:
    OrderID     Product  Quantity
0      101      Laptop         1
1      102      Tablet         2
2      103  Smartphone         1
3      104      Laptop         1
4      105      Camera         3

Reference Dataset:
    OrderID     Product  Quantity
0      101      Laptop         1
1      102      Tablet         2
2      103  Smartphone         2
3      104      Laptop         1
4      105      Camera         3

Accuracy for 'Product': 100.00%
Accuracy for 'Quantity': 80.00%

Overall Accuracy Score: 90.00%
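One caveat with the comparison above is that an inner merge silently drops any main-dataset row that has no counterpart in the reference, which can inflate the score. The sketch below shows one possible way to handle that, under the assumption that unmatched rows should count as inaccurate. The helper name `column_accuracy` and the sample orders are hypothetical, and the key column is assumed to be unique in both datasets.

```python
import pandas as pd


def column_accuracy(main: pd.DataFrame, ref: pd.DataFrame, key: str, columns: list[str]) -> pd.Series:
    """Percentage of rows in `main` whose values agree with `ref`, per column.

    Hypothetical helper: rows of `main` without a reference match count as wrong.
    """
    merged = main.merge(
        ref,
        on=key,
        how="left",                # keep every row of the main dataset
        suffixes=("_main", "_ref"),
        indicator=True,            # adds a '_merge' column: 'both' or 'left_only'
        validate="one_to_one",     # assumption: the key is unique on both sides
    )
    has_ref = merged["_merge"] == "both"
    scores = {}
    for col in columns:
        # Note: NaN never equals NaN, so values missing on both sides count as mismatches.
        matches = (merged[f"{col}_main"] == merged[f"{col}_ref"]) & has_ref
        scores[col] = matches.mean() * 100
    return pd.Series(scores)


# Illustrative data: order 106 has no reference row, order 103 differs
main_orders = pd.DataFrame({"OrderID": [101, 102, 103, 106], "Quantity": [1, 2, 1, 4]})
ref_orders = pd.DataFrame({"OrderID": [101, 102, 103, 104], "Quantity": [1, 2, 2, 1]})

scores = column_accuracy(main_orders, ref_orders, key="OrderID", columns=["Quantity"])
print(scores)
```

Here two of the four main rows agree with the reference. An inner merge on the same data would drop order 106 and report two matches out of three instead.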


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [3]:
# Check phone numbers against a single expected format
import pandas as pd
import re

# Sample dataset with a 'Phone' column
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Phone': ['123-456-7890', '1234567890', '123-4567-890', '123-456-7890', '12-3456-7890']
}

df = pd.DataFrame(data)

# Define a regex pattern for consistent phone number format: e.g., XXX-XXX-XXXX
phone_pattern = re.compile(r'^\d{3}-\d{3}-\d{4}$')

# Check if each phone number matches the pattern
df['is_consistent'] = df['Phone'].apply(lambda x: bool(phone_pattern.match(x)))

# Calculate consistency score: ratio of consistent phone numbers
consistency_score = df['is_consistent'].mean()

print(df)
print(f"\nConsistency Score (phone format): {consistency_score:.2f}")


      Name         Phone  is_consistent
0    Alice  123-456-7890           True
1      Bob    1234567890          False
2  Charlie  123-4567-890          False
3    David  123-456-7890           True
4      Eve  12-3456-7890          False

Consistency Score (phone format): 0.40
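The `apply`-based check above raises an error if the column contains a missing value, because `re.match` expects a string. As a hedged alternative, the sketch below uses the vectorized `Series.str.fullmatch` with `na=False`, so missing entries are simply counted as inconsistent. The helper name `format_consistency` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Expected format XXX-XXX-XXXX; fullmatch anchors the pattern at both ends
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"


def format_consistency(values: pd.Series, pattern: str) -> float:
    """Ratio of entries that fully match `pattern`.

    Hypothetical helper: missing values are treated as inconsistent.
    """
    is_consistent = values.str.fullmatch(pattern, na=False)
    return float(is_consistent.mean())


# Illustrative data with one missing phone number
phones = pd.Series(["123-456-7890", "1234567890", None, "123-456-7890"])
score = format_consistency(phones, PHONE_PATTERN)
print(f"Consistency Score (phone format): {score:.2f}")
```

Two of the four entries match the expected format here. If missing values should be excluded from the denominator instead, one option is to call `values.dropna()` before matching; which choice is right depends on whether completeness is already scored separately.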
