## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for small sample datasets using three metrics: completeness, accuracy, and consistency. Each task computes its score as a ratio and reports it as a percentage.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [4]:
import pandas as pd

# Step 1: Load a sample dataset
data = {
    'CustomerID': [101, 102, 103, 104, None],
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva'],
    'Email': ['alice@email.com', None, 'charlie@email.com', 'dan@email.com', 'eva@email.com']
}

df = pd.DataFrame(data)

# Step 2: Identify missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Step 3: Calculate completeness score
total_cells = df.size
total_missing = df.isnull().sum().sum()
total_non_missing = total_cells - total_missing
completeness_score = total_non_missing / total_cells

# Step 4: Output
print(f"\nTotal cells: {total_cells}")
print(f"Total missing values: {total_missing}")
print(f"Completeness Score: {completeness_score:.2%}")


Missing values per column:
CustomerID    1
Name          1
Email         1
dtype: int64

Total cells: 15
Total missing values: 3
Completeness Score: 80.00%
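The overall score hides where the gaps are. As a minimal sketch (not part of the original exercise), the same sample data can be broken down per column, and the share of fully populated rows can be reported alongside it. The names `column_completeness` and `complete_row_share` are illustrative.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, None],
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva'],
    'Email': ['alice@email.com', None, 'charlie@email.com',
              'dan@email.com', 'eva@email.com'],
})

# notna() yields a boolean frame; the mean of each boolean column is
# the fraction of non-missing values in that column
column_completeness = df.notna().mean()

# Share of rows in which every column is populated
complete_row_share = df.notna().all(axis=1).mean()

print(column_completeness.map('{:.2%}'.format))
print(f"Fully complete rows: {complete_row_share:.2%}")
```

On this data each column is 80% complete, but only two of the five rows (40%) have no missing value at all, so the row-level figure is the stricter of the two.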


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [5]:
import pandas as pd

# Step 1: Create sample data
main_data = {
    'ProductID': [101, 102, 103, 104],
    'Quantity': [5, 3, 2, 8],
    'Price': [100, 200, 150, 300]
}

reference_data = {
    'ProductID': [101, 102, 103, 104],
    'Quantity': [5, 3, 2, 10],      # Last entry differs from main_data
    'Price': [100, 200, 150, 300]
}

# Step 2: Load datasets
df_main = pd.DataFrame(main_data)
df_ref = pd.DataFrame(reference_data)

# Step 3: Compare key columns
comparison_columns = ['Quantity', 'Price']

# Merge on unique key (ProductID)
df_merged = pd.merge(df_main, df_ref, on='ProductID', suffixes=('_main', '_ref'))

# Step 4: Count cell-level matches for the selected columns
correct_matches = 0
total_checks = len(df_merged) * len(comparison_columns)

for col in comparison_columns:
    correct_matches += (df_merged[f"{col}_main"] == df_merged[f"{col}_ref"]).sum()

accuracy_score = correct_matches / total_checks

# Step 5: Output
print(f"Total checks: {total_checks}")
print(f"Correct matches: {correct_matches}")
print(f"Accuracy Score: {accuracy_score:.2%}")


Total checks: 8
Correct matches: 7
Accuracy Score: 87.50%
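The default inner merge used above drops any product that is missing from the reference, so such rows would never be counted as errors. The sketch below is one possible variation, assuming `ProductID` is unique in both frames: it keeps every main row with a left merge, flags unmatched rows with `indicator=True`, and reports accuracy per column. The names `per_column_accuracy` and `unmatched_rows` are illustrative.

```python
import pandas as pd

df_main = pd.DataFrame({
    'ProductID': [101, 102, 103, 104],
    'Quantity': [5, 3, 2, 8],
    'Price': [100, 200, 150, 300],
})
df_ref = pd.DataFrame({
    'ProductID': [101, 102, 103, 104],
    'Quantity': [5, 3, 2, 10],
    'Price': [100, 200, 150, 300],
})
comparison_columns = ['Quantity', 'Price']

# how='left' keeps every main row; indicator=True adds a '_merge' column
# marking rows with no counterpart in the reference; validate raises an
# error if ProductID is duplicated on either side
merged = df_main.merge(df_ref, on='ProductID', how='left',
                       suffixes=('_main', '_ref'),
                       indicator=True, validate='one_to_one')
unmatched_rows = int((merged['_merge'] == 'left_only').sum())

# Per-column accuracy: share of rows where main equals reference.
# Unmatched rows hold NaN on the reference side, and a comparison with
# NaN is False, so they count as mismatches instead of being dropped.
per_column_accuracy = pd.Series({
    col: (merged[f'{col}_main'] == merged[f'{col}_ref']).mean()
    for col in comparison_columns
})

print(per_column_accuracy.map('{:.2%}'.format))
print(f"Rows missing from reference: {unmatched_rows}")
```

Here `Quantity` matches in three of four rows (75%) and `Price` in all four (100%), which averages to the 87.50% reported above.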


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [6]:
import pandas as pd
import re

# Step 1: Create sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva'],
    'Phone': ['+1-123-456-7890', '123-456-7890', '+1-987-654-3210', '+1-111-222-3333', '9876543210']
}

df = pd.DataFrame(data)

# Step 2: Define a regex pattern for valid phone format
pattern = r'^\+1-\d{3}-\d{3}-\d{4}$'

# Step 3: Apply consistency check
df['IsConsistent'] = df['Phone'].apply(lambda x: bool(re.fullmatch(pattern, str(x))))

# Step 4: Calculate consistency score
total_entries = len(df)
consistent_entries = df['IsConsistent'].sum()
consistency_score = consistent_entries / total_entries

# Step 5: Output
print("Phone number consistency check:")
print(df[['Name', 'Phone', 'IsConsistent']])
print(f"\nTotal entries: {total_entries}")
print(f"Consistent entries: {consistent_entries}")
print(f"Consistency Score: {consistency_score:.2%}")


Phone number consistency check:
      Name            Phone  IsConsistent
0    Alice  +1-123-456-7890          True
1      Bob     123-456-7890         False
2  Charlie  +1-987-654-3210          True
3    Diana  +1-111-222-3333          True
4      Eva       9876543210         False

Total entries: 5
Consistent entries: 3
Consistency Score: 60.00%
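The same check can also be written without `apply`. The sketch below uses the pandas string accessor `Series.str.fullmatch` (available in pandas 1.1 and later); treating missing phone numbers as inconsistent via `na=False` is an assumption of this example, not something the exercise specifies.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva'],
    'Phone': ['+1-123-456-7890', '123-456-7890', '+1-987-654-3210',
              '+1-111-222-3333', '9876543210'],
})

# fullmatch requires the whole string to match, so ^ and $ are not needed
pattern = r'\+1-\d{3}-\d{3}-\d{4}'

# na=False marks missing phone numbers as inconsistent (an assumption)
is_consistent = df['Phone'].str.fullmatch(pattern, na=False)

# The mean of a boolean Series is the fraction of True values
consistency_score = is_consistent.mean()

# Listing the failing rows shows what needs cleaning
inconsistent_rows = df.loc[~is_consistent, ['Name', 'Phone']]

print(f"Consistency Score: {consistency_score:.2%}")
print(inconsistent_rows)
```

Both failing entries (Bob and Eva) are plausible phone numbers in a different format, so a low consistency score here points to a normalization step rather than to wrong data.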
