## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will automate data quality checks with the Great Expectations framework: define expectations, validate datasets against them, and generate validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup with expectations for column presence and type.

In [2]:
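# Step 1: Install required packages if not done yet.
# Assumption: the legacy great_expectations.dataset module used below ships only in
# releases before 1.0; on newer releases the import raises ModuleNotFoundError.
# pip install "great_expectations<1.0" pandas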

# Step 2: Create a sample dataset
import pandas as pd
from great_expectations.dataset import PandasDataset

# Create sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, None],
    "Email": ["alice@example.com", "bob@example.com", None, "david@example.com"]
}
df = pd.DataFrame(data)

# Wrap the DataFrame so the expect_* methods become available
ge_df = PandasDataset(df)

# Step 3: Define Expectations
ge_df.expect_table_columns_to_match_ordered_list(["Name", "Age", "Email"])
ge_df.expect_column_values_to_not_be_null("Name")
ge_df.expect_column_values_to_be_between("Age", min_value=0, max_value=120)
ge_df.expect_column_values_to_match_regex("Email", r".+@.+\..+")

# Step 4: Validate the Dataset
validation_result = ge_df.validate()
print("\n📋 Validation Summary:")
print(validation_result["statistics"])

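# Optional: Display failed expectations.
# With this sample data none should fail: the range and regex checks skip
# null values, and the Name column has no nulls.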
print("\n❌ Failed Expectations:")
for result in validation_result["results"]:
    if not result["success"]:
        print(f"- {result['expectation_config']['expectation_type']} failed")


ModuleNotFoundError: No module named 'great_expectations.dataset'
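
A detail worth knowing when reading the results of Task 1: in Great Expectations, value-level checks such as the range and regex expectations skip null values, and only the dedicated null checks count them. The following plain-Python sketch is not the library's implementation; `check_values` and the keys of its return value are hypothetical names, chosen to show the idea on the same Age and Email values.

```python
import re


def check_values(values, predicate):
    """Summarize one column check; None values are skipped, not failed."""
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if not predicate(v)]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "missing_count": len(values) - len(non_null),
        "unexpected_count": len(unexpected),
        "unexpected_list": unexpected,
    }


ages = [25, 30, 35, None]
emails = ["alice@example.com", "bob@example.com", None, "david@example.com"]

# Range check: every non-null age must lie in [0, 120]
age_check = check_values(ages, lambda v: 0 <= v <= 120)
# Pattern check: every non-null email must contain the regex somewhere
email_check = check_values(emails, lambda v: re.search(r".+@.+\..+", v) is not None)

print(age_check["success"], age_check["missing_count"])
print(email_check["success"], email_check["missing_count"])
```

Both checks succeed even though each column holds one missing value, which is why completeness needs its own expectation such as `expect_column_values_to_not_be_null`.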

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.


In [3]:
# Install the required packages (uncomment below if not installed).
# A pre-1.0 release is assumed, because the legacy Dataset API is used below.
# !pip install "great_expectations<1.0" pandas

import pandas as pd
from great_expectations.dataset import PandasDataset

# Step 1: Create Sample Dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, 35, None, 28],
    "Email": ["alice@example.com", "bob@example.com", None, "david@example.com", "eve@example.com"]
}
df = pd.DataFrame(data)

# Step 2: Wrap the pandas DataFrame in a GE Dataset
ge_df = PandasDataset(df)

# Step 3: Define Expectations
ge_df.expect_column_values_to_not_be_null("Name")  # Completeness
ge_df.expect_column_values_to_be_between("Age", min_value=0, max_value=100)  # Valid Age
ge_df.expect_column_values_to_match_regex("Email", r".+@.+\..+")  # Email pattern

# Step 4: Validate Dataset
results = ge_df.validate()

# Step 5: Print Results Summary
print("📋 Validation Summary:")
print(results["statistics"])

# Step 6: Print Failed Expectations
print("\n❌ Failed Expectations:")
for result in results["results"]:
    if not result["success"]:
        print(f"- {result['expectation_config']['expectation_type']} failed")

# Optional: Save Report to JSON
# The validation result is not a plain dict, so convert it before dumping
import json
with open("validation_report.json", "w") as f:
    json.dump(results.to_json_dict(), f, indent=4)

print("\n✅ Report saved to 'validation_report.json'")


ModuleNotFoundError: No module named 'great_expectations.dataset'
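
Once a report has been written to disk, it can be summarized without Great Expectations installed. The sketch below assumes the legacy report layout (top-level `statistics` and `results`, with each result carrying `success` and an `expectation_config` that holds `expectation_type` and `kwargs`). `summarize_report` and the hand-written `sample_report` are illustrative names and data, not output captured from a real run.

```python
import json


def summarize_report(report):
    """Build a short plain-text summary from a validation report dict."""
    stats = report["statistics"]
    lines = [
        f"Evaluated: {stats['evaluated_expectations']}",
        f"Passed: {stats['successful_expectations']}",
        f"Failed: {stats['unsuccessful_expectations']}",
    ]
    for item in report["results"]:
        if not item["success"]:
            config = item["expectation_config"]
            # Table-level expectations carry no "column" kwarg
            column = config["kwargs"].get("column", "<table>")
            lines.append(f"FAILED: {config['expectation_type']} ({column})")
    return "\n".join(lines)


# Hand-written example in the assumed layout, not real library output
sample_report = {
    "success": False,
    "statistics": {
        "evaluated_expectations": 3,
        "successful_expectations": 2,
        "unsuccessful_expectations": 1,
    },
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "Name"},
            },
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "Age", "min_value": 0, "max_value": 100},
            },
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_match_regex",
                "kwargs": {"column": "Email", "regex": r".+@.+\..+"},
            },
        },
    ],
}

# Round-trip through JSON to mimic loading validation_report.json from disk
report = json.loads(json.dumps(sample_report))
summary = summarize_report(report)
print(summary)
```

To use it on a real file, replace the round-trip with `json.load` on the opened report.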

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.

In [4]:
import pandas as pd
from great_expectations.dataset import PandasDataset

# Sample dataset
data = {
    "customer_id": [101, 102, 103, 104, 101],  # Duplicate ID to test uniqueness
    "email": ["a@x.com", "b@x.com", None, "d@x.com", "e@x.com"],
    "signup_date": ["2023-01-01", "2023-02-15", "invalid_date", "2023-03-10", "2023-04-01"]
}
df = pd.DataFrame(data)

# Wrap the DataFrame in a GE Dataset
ge_df = PandasDataset(df)

# Advanced Expectations
ge_df.expect_column_values_to_be_unique("customer_id")  # Ensure unique IDs
ge_df.expect_column_values_to_match_regex("email", r".+@.+\..+")  # Valid emails
ge_df.expect_column_values_to_match_strftime_format("signup_date", "%Y-%m-%d")  # Valid dates

# Validate
result = ge_df.validate()

# Save report (convert the result object to a JSON-serializable dict first)
import json
with open("daily_validation_report.json", "w") as f:
    json.dump(result.to_json_dict(), f, indent=4)

print("✅ Advanced expectations validated. Report saved.")


ModuleNotFoundError: No module named 'great_expectations.dataset'
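
The scheduling step of Task 3 is not covered by the cell above. As a minimal sketch using only the standard library, the code below computes the wait until the next daily run and then calls a validation job. `seconds_until`, `run_daily`, and `run_validation` are hypothetical names, and `run_validation` is a placeholder for the validation code of this task.

```python
import time
from datetime import datetime, timedelta


def seconds_until(hour, minute, now):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return (target - now).total_seconds()


def run_validation():
    """Placeholder job; the real version would validate and save a report."""
    return "validated"


def run_daily(job, hour, minute, runs, sleep=time.sleep, clock=datetime.now):
    """Run `job` at hour:minute for `runs` consecutive occurrences."""
    outcomes = []
    for _ in range(runs):
        sleep(seconds_until(hour, minute, clock()))
        outcomes.append(job())
    return outcomes


# From 23:30, the next 02:00 slot is 2.5 hours away
delay = seconds_until(2, 0, datetime(2023, 4, 1, 23, 30))
print(delay)

# Demo with a no-op sleep so the example returns immediately
outcomes = run_daily(run_validation, 2, 0, runs=1, sleep=lambda seconds: None)
print(outcomes)
```

In practice a system scheduler is often simpler than a long-running Python loop: a cron entry such as `0 2 * * * python validate.py` (where `validate.py` is a hypothetical script holding the validation code) would run the check every day at 02:00.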