### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns like `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values containing `@`.
    - Uniqueness: Count of distinct entries in the `Email` column.
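
The steps above mention a CSV, so here is a minimal sketch of the same three metrics computed on data read with `pd.read_csv`. The CSV text and the names `csv_text`, `df_csv`, `completeness_pct`, `validity_pct`, and `unique_emails` are illustrative assumptions; in practice a file path would replace the in-memory string.

```python
import io

import pandas as pd

# Illustrative CSV content (an assumption); pass a real file path to read_csv instead.
csv_text = """Name,Email,Age
Alice,alice@example.com,25
Bob,bob@example.com,30
Charlie,,
,charlie@example.com,35
"""

df_csv = pd.read_csv(io.StringIO(csv_text))

# Completeness: share of non-null values per column, as a percentage.
completeness_pct = df_csv.notna().mean() * 100

# Validity: share of rows whose Email contains a literal '@' (missing counts as invalid).
validity_pct = df_csv["Email"].str.contains("@", regex=False, na=False).mean() * 100

# Uniqueness: number of distinct non-null Email values.
unique_emails = df_csv["Email"].nunique()

print(completeness_pct)
print(validity_pct)
print(unique_emails)
```

Empty CSV fields are read as missing values, so each column here has one gap out of four rows.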

In [None]:
# Write your code from here

In [1]:
import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Email': ['alice@example.com', 'bob@example.com', None, 'charlie@example.com'],
    'Age': [25, 30, None, 35]
}
df = pd.DataFrame(data)

# Completeness: Percentage of non-null values for each column
completeness = df.notnull().mean() * 100

# Validity: Percentage of rows whose Email contains '@' (missing emails count as invalid)
validity = df['Email'].str.contains('@', regex=False, na=False).mean() * 100

# Uniqueness: Count of distinct non-null entries in the Email column
uniqueness = df['Email'].nunique()

# Display metrics
print("Completeness (%):")
print(completeness)
print("\nValidity (%):")
print(validity)
print("\nUniqueness (count):")
print(uniqueness)

Completeness (%):
Name     75.0
Email    75.0
Age      75.0
dtype: float64

Validity (%):
75.0

Uniqueness (count):
3


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
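1. Formula: Simple average of the metrics defined in Task 1 (mean completeness, validity, and uniqueness), with the uniqueness count first converted to a percentage of the row count.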
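
As a worked example of this formula, the sketch below plugs in the metric values printed in Task 1. The variable names are illustrative assumptions, and dividing the distinct-email count by the row count is one possible way to put uniqueness on a percentage scale.

```python
# Metric values copied from the Task 1 output.
completeness_by_column = {"Name": 75.0, "Email": 75.0, "Age": 75.0}
validity_pct = 75.0
distinct_emails = 3
row_count = 4

# Average completeness across columns.
mean_completeness = sum(completeness_by_column.values()) / len(completeness_by_column)

# Convert the uniqueness count to a percentage of rows.
uniqueness_pct = distinct_emails / row_count * 100

# Simple (unweighted) average of the three percentages.
score = (mean_completeness + validity_pct + uniqueness_pct) / 3
print(score)  # 75.0
```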

In [None]:
# Write your code from here

In [2]:
# Calculate the overall data quality score as the simple average of completeness, validity, and uniqueness metrics
# Normalize uniqueness to a percentage scale for consistency
normalized_uniqueness = (uniqueness / len(df)) * 100

# Calculate the overall data quality score
data_quality_score = (completeness.mean() + validity + normalized_uniqueness) / 3

# Display the overall data quality score
print("Overall Data Quality Score (%):")
print(data_quality_score)

Overall Data Quality Score (%):
75.0


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an expectation suite for the dataset.
2. Define expectations for completeness.

In [None]:
# Write your code from here

In [3]:
# Import Great Expectations
import great_expectations as ge

# Convert the existing DataFrame to a Great Expectations DataFrame
# (ge.from_pandas belongs to the legacy dataset API of older Great Expectations releases)
ge_df = ge.from_pandas(df)

# Define expectations for completeness
ge_df.expect_column_values_to_not_be_null("Name")
ge_df.expect_column_values_to_not_be_null("Email")
ge_df.expect_column_values_to_not_be_null("Age")

# Define expectations for validity
ge_df.expect_column_values_to_match_regex("Email", r".+@.+\..+")

# Define expectations for uniqueness
ge_df.expect_column_values_to_be_unique("Email")

# Validate the expectations
validation_results = ge_df.validate()

# Display the validation results
print("Validation Results:")
print(validation_results)

ModuleNotFoundError: No module named 'great_expectations'
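
The cell above stopped at the import because `great_expectations` was not installed in that environment. As a rough stand-in, the following pandas-only sketch applies the same rules (not null, email regex, unique emails) to the sample data. The function `run_basic_checks` and the result keys are hypothetical names, and skipping missing emails in the regex and uniqueness checks is an assumption of this sketch rather than a statement about how Great Expectations behaves.

```python
import pandas as pd

# Same sample data as in Task 1.
df_check = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Email": ["alice@example.com", "bob@example.com", None, "charlie@example.com"],
    "Age": [25, 30, None, 35],
})


def run_basic_checks(frame):
    """Return {check_name: passed} using plain pandas (hypothetical helper)."""
    results = {
        f"{col}_not_null": bool(frame[col].notna().all())
        for col in ["Name", "Email", "Age"]
    }
    # Assumption: missing emails are skipped for the regex and uniqueness checks.
    emails = frame["Email"].dropna()
    results["Email_matches_regex"] = bool(emails.str.match(r".+@.+\..+").all())
    results["Email_unique"] = bool(emails.is_unique)
    return results


check_results = run_basic_checks(df_check)
for name, passed in check_results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

On the sample data the three not-null checks fail, since each column has one missing value, while the regex and uniqueness checks pass.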

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [None]:
# Write your code from here


In [4]:
# Generate an HTML report for the validation results using Great Expectations
from great_expectations.render.renderer import ValidationResultsPageRenderer
from great_expectations.render.view import DefaultJinjaPageView

# Render the validation results to HTML
html_content = DefaultJinjaPageView().render(
    ValidationResultsPageRenderer().render(validation_results)
)

# Save the HTML report to a file (explicit encoding keeps the output portable)
with open("validation_results_report.html", "w", encoding="utf-8") as report_file:
    report_file.write(html_content)

print("Validation report saved as 'validation_results_report.html'.")

ModuleNotFoundError: No module named 'great_expectations'
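
Because the report cell also failed on the missing module, a simple HTML report can be sketched with pandas alone using `DataFrame.to_html`. This is not the Great Expectations renderer. The helper `build_report_html` is a hypothetical name, and `sample_results` is written out by hand from what the Task 3 rules imply for the sample data.

```python
import pandas as pd


def build_report_html(check_results):
    """Render {check_name: passed} as a small standalone HTML page (hypothetical helper)."""
    table = pd.DataFrame(
        {"check": list(check_results), "passed": list(check_results.values())}
    ).to_html(index=False)
    return f"<html><body><h1>Validation Report</h1>{table}</body></html>"


# Hand-written outcomes for the sample data: each column has one missing value,
# while the non-missing emails are well formed and distinct.
sample_results = {
    "Name_not_null": False,
    "Email_not_null": False,
    "Age_not_null": False,
    "Email_matches_regex": True,
    "Email_unique": True,
}

html_report = build_report_html(sample_results)
print(html_report[:60])
```

Writing `html_report` to disk with `open(path, "w", encoding="utf-8")` would give a file that opens in any browser.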

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score calculation via a script that integrates with Great Expectations.

In [None]:
# Write your code from here


In [5]:
# Check if the data quality score meets a predefined threshold
threshold = 80.0

if data_quality_score < threshold:
    print(f"Data quality score ({data_quality_score}%) is below the threshold ({threshold}%).")
    print("Consider triggering automated data cleaning scripts.")
else:
    print(f"Data quality score ({data_quality_score}%) meets or exceeds the threshold ({threshold}%).")
    print("No immediate action required.")

Data quality score (75.0%) is below the threshold (80.0%).
Consider triggering automated data cleaning scripts.


In [7]:
# Automate the data quality score calculation and integrate with Great Expectations

# Define a function to calculate data quality score
def calculate_data_quality_score(df):
    """Return the overall data quality score (%) for a DataFrame with an 'Email' column."""
    # Completeness: Percentage of non-null values for each column
    completeness = df.notnull().mean() * 100

    # Validity: Percentage of email fields containing '@'
    validity = (df['Email'].str.contains('@', na=False).mean()) * 100

    # Uniqueness: Count of distinct entries in the Email column
    uniqueness = df['Email'].nunique()

    # Normalize uniqueness to a percentage scale for consistency
    normalized_uniqueness = (uniqueness / len(df)) * 100

    # Calculate the overall data quality score
    data_quality_score = (completeness.mean() + validity + normalized_uniqueness) / 3

    return data_quality_score

# Recalculate the data quality score for the existing DataFrame
data_quality_score = calculate_data_quality_score(df)

# Display the recalculated data quality score
print("Recalculated Data Quality Score (%):")
print(data_quality_score)

Recalculated Data Quality Score (%):
75.0
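
To move from a notebook cell toward a script, the function can be wrapped in a small command-line entry point. The sketch below is one way to do that under stated assumptions: the `--threshold` flag, the `main` function, and the exit-code convention (0 when the score meets the threshold, 1 otherwise) are all hypothetical choices, and the script does not call Great Expectations.

```python
import argparse

import pandas as pd


def calculate_data_quality_score(frame):
    """Average of mean completeness, email validity, and email uniqueness (all in %)."""
    completeness = frame.notna().mean().mean() * 100
    validity = frame["Email"].str.contains("@", regex=False, na=False).mean() * 100
    uniqueness = frame["Email"].nunique() / len(frame) * 100
    return (completeness + validity + uniqueness) / 3


def main(argv=None):
    """Parse arguments, score the CSV, and return a process exit code."""
    parser = argparse.ArgumentParser(description="Compute a data quality score for a CSV.")
    parser.add_argument("csv_path", help="CSV with at least an Email column")
    parser.add_argument("--threshold", type=float, default=80.0)
    args = parser.parse_args(argv)

    score = calculate_data_quality_score(pd.read_csv(args.csv_path))
    print(f"Data quality score: {score:.1f}%")
    # Assumed convention: a non-zero exit code signals a score below the threshold.
    return 0 if score >= args.threshold else 1
```

Saved as a script that ends with `sys.exit(main())`, this could run from a scheduler or CI job, which would read the exit code to decide whether to alert or trigger cleaning.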


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system that triggers automated data cleaning scripts when data quality metrics fall below a threshold.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action in the Great Expectations action list that triggers the cleaning only when the quality score is below a threshold.

In [None]:
# Write your code from here


In [6]:
# Check if any column's completeness is below a certain threshold
completeness_threshold = 70.0

columns_below_threshold = completeness[completeness < completeness_threshold]

if not columns_below_threshold.empty:
    print("The following columns have completeness below the threshold:")
    print(columns_below_threshold)
    print("Consider taking corrective actions for these columns.")
else:
    print("All columns meet the completeness threshold.")

All columns meet the completeness threshold.
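
The check above only reports on completeness. A threshold-triggered cleaning step could be sketched as follows. The function `clean_if_below_threshold` and its specific rules (drop rows without a usable email, drop duplicate emails, fill missing names with a placeholder) are assumptions chosen for illustration, and the sketch uses plain pandas rather than a Great Expectations action.

```python
import pandas as pd


def clean_if_below_threshold(frame, score, threshold=80.0):
    """Return a cleaned copy when score < threshold, else the frame unchanged."""
    if score >= threshold:
        return frame
    # Assumed cleaning rules, for illustration only.
    cleaned = frame.dropna(subset=["Email"])
    cleaned = cleaned[cleaned["Email"].str.contains("@", regex=False)]
    cleaned = cleaned.drop_duplicates(subset="Email")
    cleaned = cleaned.assign(Name=cleaned["Name"].fillna("Unknown"))
    return cleaned.reset_index(drop=True)


# Same sample data as in Task 1.
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Email": ["alice@example.com", "bob@example.com", None, "charlie@example.com"],
    "Age": [25, 30, None, 35],
})

cleaned_sample = clean_if_below_threshold(sample, score=75.0)
print(cleaned_sample)
```

With the score of 75.0 from Task 2 and the default threshold of 80.0, cleaning runs and drops the row that has no email; a score at or above the threshold returns the input untouched.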
