## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using three metrics: completeness, accuracy, and consistency. Each task below works through one metric with an example dataset.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values, expressed as a percentage.
    - E.g., a dataset with customer information.

In [3]:
# Write your code from here

In [4]:
import pandas as pd

# Sample dataset with missing values
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'Email': ['alice@example.com', None, 'charlie@example.com', 'david@example.com', 'eve@example.com']
}

df = pd.DataFrame(data)

# Calculate completeness score
total_values = df.size
non_missing_values = df.count().sum()
completeness_score = (non_missing_values / total_values) * 100

print(f"Completeness Score: {completeness_score:.2f}%")

Completeness Score: 80.00%
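
The cell above reports one overall score but does not show which columns contain the missing values (the second step in the list). A minimal sketch of that step follows; it rebuilds the same sample data so that it runs on its own, and the names `missing_per_column` and `columns_with_missing` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Same sample data as the cell above, rebuilt so this block is self-contained
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'Email': ['alice@example.com', None, 'charlie@example.com',
              'david@example.com', 'eve@example.com'],
})

# Count missing values per column, then keep only columns that have any
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print("Columns with missing values:")
print(columns_with_missing)

# Per-column completeness: share of non-missing values, as a percentage
column_completeness = df.notna().mean() * 100
print("\nCompleteness per column (%):")
print(column_completeness.round(2))
```

Looking at completeness per column helps explain an overall score: here `Age` has the most gaps (2 of 5 values missing), while `CustomerID` is fully populated.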


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.
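
One possible way to carry out these steps is sketched below. It assumes both datasets share a key column and that the reference dataset is the source of truth. The sales data, the column names (`OrderID`, `Product`, `Amount`), and the cell-level definition of accuracy are all illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical main dataset with sales information (illustrative values)
main_df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Amount': [1200.0, 25.0, 45.0, 300.0, 80.0],
})

# Hypothetical reference dataset, treated here as the source of truth
reference_df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset'],
    'Amount': [1200.0, 20.0, 45.0, 300.0, 80.0],
})

key_column = 'OrderID'                 # column used to align records
check_columns = ['Product', 'Amount']  # key columns for the accuracy check

# Align rows on the key so values are compared record by record
merged = main_df.merge(reference_df, on=key_column, suffixes=('_main', '_ref'))

# For each checked column, flag the cells that equal the reference value
matches = pd.DataFrame({
    col: merged[f'{col}_main'] == merged[f'{col}_ref'] for col in check_columns
})

# Accuracy = matching cells / compared cells, as a percentage
accuracy_score = matches.to_numpy().mean() * 100
print(f"Accuracy Score: {accuracy_score:.2f}%")
print("\nAccuracy per column (%):")
print(matches.mean() * 100)
```

With these illustrative values, 8 of the 10 compared cells match the reference. Note that the default inner merge silently drops records whose key is missing from either dataset, and that `==` treats two missing values as unequal, so real data may need extra handling for both cases.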

In [5]:
# Write your code from here


In [6]:
# Display the completeness score and dataset summary
print(f"Completeness Score: {completeness_score:.2f}%")
print("\nDataset Summary:")
# df.info() prints the summary itself and returns None, so it is not wrapped in print()
df.info()

Completeness Score: 80.00%

Dataset Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   CustomerID  5 non-null      int64  
 1   Name        4 non-null      object 
 2   Age         3 non-null      float64
 3   Email       4 non-null      object 
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.
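
A rule-based check along these lines might look like the sketch below. The contact list, the `Phone` column name, and the `NNN-NNN-NNNN` format rule are assumptions chosen for illustration; a real dataset would need whatever rule matches its own conventions.

```python
import pandas as pd

# Hypothetical contact list (illustrative values)
contacts = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Phone': ['555-123-4567', '5551234567', '555-987-6543', None, '555-12-34567'],
})

# Assumed rule: a consistent phone number has the form NNN-NNN-NNNN
phone_pattern = r'\d{3}-\d{3}-\d{4}'

# fullmatch requires the whole string to fit the pattern;
# na=False counts missing phone numbers as inconsistent
is_consistent = contacts['Phone'].str.fullmatch(phone_pattern, na=False)

# Consistency = consistent entries / total entries, as a percentage
consistency_score = is_consistent.mean() * 100
print(f"Consistency Score: {consistency_score:.2f}%")

# List the entries that break the rule so they can be reviewed
print("\nInconsistent entries:")
print(contacts.loc[~is_consistent, ['Name', 'Phone']])
```

In this sample, two of the five entries follow the assumed format. Whether a missing value should count as inconsistent or be excluded from the denominator is a design choice; this sketch counts it as inconsistent.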

In [7]:
# Write your code from here


In [None]:
# Visualize the completeness of each column in the dataset
import matplotlib.pyplot as plt

# Calculate the percentage of non-missing values for each column
column_completeness = (df.count() / len(df)) * 100

# Plot the completeness percentages
plt.figure(figsize=(8, 5))
column_completeness.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Column-wise Completeness Percentage', fontsize=14)
plt.ylabel('Completeness (%)', fontsize=12)
plt.xlabel('Columns', fontsize=12)
plt.ylim(0, 100)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()