## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will automate data quality checks with the Great Expectations framework: define expectations for a dataset, validate the data against them, generate validation reports, and schedule recurring checks.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup with expectations for column presence and type.
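The cells in this notebook read `sample_data.csv`, which is not included here. A minimal sketch for generating a stand-in file follows. The rows are made-up values for illustration, and the column set (`customer_id`, `name`, `email`, `age`, `gender`) is an assumption chosen so that every check in the three tasks has a column to work on; `write_sample_csv` is a hypothetical helper name.

```python
import csv
from pathlib import Path

# Hypothetical rows for illustration only; replace them with real data.
SAMPLE_ROWS = [
    {"customer_id": 1, "name": "Alice", "email": "alice@example.com", "age": 34, "gender": "F"},
    {"customer_id": 2, "name": "Bob", "email": "bob@example.com", "age": 27, "gender": "M"},
    {"customer_id": 3, "name": "Chandra", "email": "chandra@example.com", "age": 45, "gender": "F"},
    {"customer_id": 4, "name": "Diego", "email": "diego@example.com", "age": 61, "gender": "M"},
]


def write_sample_csv(path="sample_data.csv"):
    """Write SAMPLE_ROWS to a CSV file with a header row and return its path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(SAMPLE_ROWS[0]))
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


if __name__ == "__main__":
    print(f"Wrote {write_sample_csv()}")
```

Integer ages load as `int64` in pandas; a column with missing ages would load as `float64` instead, which matters for the type checks in the tasks.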

In [1]:
# Write your code from here
# Assumes Great Expectations 1.x (imported as gx). The legacy DataContext class,
# the great_expectations.dataset module, and batch_kwargs are not part of that API.
import great_expectations as gx
import pandas as pd
from great_expectations.checkpoint import UpdateDataDocsAction

# Step 1: Load Sample Data
# Replace this with your actual data path
df = pd.read_csv("sample_data.csv")
print("Sample Data:")
print(df.head())

# Step 2: Initialize the Data Context
# mode="file" creates a gx/ project folder in the working directory on first use
# and reopens it afterwards, so the suite defined here persists between sessions.
context = gx.get_context(mode="file")

# Step 3: Register a pandas Data Source, a DataFrame asset, and a Batch Definition
# These add calls raise if the names already exist in the project; on a re-run,
# fetch the stored objects instead (for example context.data_sources.get(...)).
data_source = context.data_sources.add_pandas("my_pandas_datasource")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")

# Step 4: Create the Expectation Suite
suite_name = "sample_suite"
suite = context.suites.add(gx.ExpectationSuite(name=suite_name))

# Step 5: Add Expectations

# Expecting the column "name" to exist
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="name"))

# Expecting the column "email" to exist
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="email"))

# Expecting the column "age" to be of type int64
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeOfType(column="age", type_="int64")
)

# Expecting the column "email" to not have any null values
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="email"))

# Expecting the email format to match a regex (basic email pattern)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[\w\.-]+@[\w\.-]+\.\w+$"
    )
)

# Step 6: Validate the Data (Generate a Validation Report)

# A Validation Definition ties the batch definition to the suite
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="sample_validation", data=batch_definition, suite=suite
    )
)

# The Checkpoint runs the validation and refreshes Data Docs with the result
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="my_checkpoint",
        validation_definitions=[validation_definition],
        actions=[UpdateDataDocsAction(name="update_data_docs")],
    )
)

# Run the checkpoint validation on the loaded DataFrame
result = checkpoint.run(batch_parameters={"dataframe": df})
print("Checkpoint validation result:")
print(f"Success: {result.success}")
print(result)

# Step 7: View the Validation Results
# Great Expectations generates Data Docs that can be opened in a browser
context.build_data_docs()
print("Validation results saved to data docs at:")
print(context.get_docs_sites_urls())

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.
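The completeness rule used in this task (no more than 10% missing values per column) can be sketched without any framework. The function name `completeness_report` and its `max_missing_fraction` parameter are hypothetical; Great Expectations expresses a comparable tolerance through the `mostly` argument of expectations such as `expect_column_values_to_not_be_null`. Treating empty strings as missing is an assumption made for this sketch.

```python
def completeness_report(rows, max_missing_fraction=0.10):
    """Return {column: (missing_fraction, passed)} for a list of dict rows.

    A value counts as missing when it is None or an empty string.
    """
    if not rows:
        return {}
    report = {}
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        fraction = missing / len(rows)
        report[column] = (fraction, fraction <= max_missing_fraction)
    return report


if __name__ == "__main__":
    # Ten hypothetical rows; one row has no email, three rows have no age.
    rows = [
        {"email": f"user{i}@example.com", "age": 30 + i} for i in range(10)
    ]
    rows[0]["email"] = None
    for i in (1, 2, 3):
        rows[i]["age"] = None
    for column, (fraction, passed) in completeness_report(rows).items():
        status = "ok" if passed else "too many missing values"
        print(f"{column}: {fraction:.0%} missing ({status})")
```

A column sitting exactly at the threshold passes, which mirrors the "more than 10%" wording of the rule.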


In [2]:
# Write your code from here
import unittest

import great_expectations as gx
import pandas as pd

# Setup Great Expectations Data Context
def setup_data_context():
    try:
        # gx.get_context() takes the place of the DataContext class, which the
        # installed release no longer exports. To open a specific project, pass
        # mode="file" and project_root_dir="/path/to/project".
        context = gx.get_context()
        return context
    except Exception as e:
        print(f"Error initializing Great Expectations Data Context: {e}")
        return None

# Load dataset with error handling
def load_data(file_path):
    try:
        df = pd.read_csv(file_path)
        return df
    except FileNotFoundError:
        print(f"Error: The file {file_path} was not found.")
        return None
    except pd.errors.EmptyDataError:
        print(f"Error: The file {file_path} is empty.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred while loading the file: {e}")
        return None

# Define basic expectations for the dataset
def create_expectations(df):
    # Expectation 1: Check if the columns exist
    if 'customer_id' not in df.columns:
        print("Error: 'customer_id' column is missing.")
    else:
        print("'customer_id' column is present.")

        # Expectation 2: Check if customer_id is unique
        # (only possible when the column exists; otherwise df['customer_id'] raises KeyError)
        if df['customer_id'].duplicated().any():
            print("Error: 'customer_id' column contains duplicates.")
        else:
            print("'customer_id' column is unique.")

    # Expectation 3: Check if the data types are as expected
    expected_dtypes = {
        'customer_id': 'int64',
        'age': 'float64',
        'gender': 'object'
    }

    for column, dtype in expected_dtypes.items():
        if column not in df.columns:
            # A missing column is reported as missing, not as correctly typed
            print(f"Error: '{column}' column is missing.")
        elif df[column].dtype != dtype:
            print(f"Error: '{column}' is not of type {dtype}.")
        else:
            print(f"'{column}' has the expected data type {dtype}.")

# Create a function to validate data completeness
def validate_completeness(df):
    # Expectation: Ensure that no more than 10% of any column's values are missing
    for column in df.columns:
        missing_percent = df[column].isnull().mean() * 100
        if missing_percent > 10:
            print(f"Warning: '{column}' has more than 10% missing data.")
        else:
            print(f"'{column}' completeness is within acceptable limits.")

# Create a function to validate data accuracy
def validate_accuracy(df):
    # Example: Validate that 'age' column values are non-negative
    if 'age' not in df.columns:
        print("Error: 'age' column is missing.")
    elif (df['age'] < 0).any():
        print("Error: 'age' column contains negative values.")
    else:
        print("'age' column contains valid values.")

# Unit Tests to validate the expectations
class TestDataQuality(unittest.TestCase):
    def setUp(self):
        self.df = load_data("sample_data.csv")
        # Fail with a clear message instead of an AttributeError on None
        self.assertIsNotNone(self.df, "sample_data.csv could not be loaded.")

    def test_completeness(self):
        # Ensure no column has more than 10% missing data
        for column in self.df.columns:
            missing_percent = self.df[column].isnull().mean() * 100
            self.assertLessEqual(missing_percent, 10, f"{column} has more than 10% missing data.")

    def test_column_presence(self):
        required_columns = ['customer_id', 'age', 'gender']
        for column in required_columns:
            self.assertIn(column, self.df.columns, f"{column} is missing from the dataset.")

    def test_data_types(self):
        expected_dtypes = {
            'customer_id': 'int64',
            'age': 'float64',
            'gender': 'object'
        }
        for column, dtype in expected_dtypes.items():
            self.assertIn(column, self.df.columns, f"{column} is missing from the dataset.")
            self.assertEqual(self.df[column].dtype, dtype, f"{column} does not have the expected data type.")

# Main process
if __name__ == "__main__":
    # Load the dataset
    df = load_data("sample_data.csv")
    if df is not None:
        # Setup Great Expectations
        context = setup_data_context()
        if context is not None:
            create_expectations(df)
            validate_completeness(df)
            validate_accuracy(df)

        # Run the unit tests (argv and exit=False keep the notebook kernel alive)
        unittest.main(argv=[''], exit=False)

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., an expectation that customer IDs must be unique and schedule a daily check.
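A conditional check applies a rule only to the rows that satisfy some condition. Great Expectations exposes this through the `row_condition` and `condition_parser` arguments of an expectation; the plain-Python sketch below only illustrates the idea. The helper name `check_conditionally`, the column names, and the rule itself (adult customers must have an email) are hypothetical.

```python
def check_conditionally(rows, condition, check):
    """Apply `check` only to rows where `condition` holds.

    Returns how many rows were evaluated, which of them failed,
    and whether the conditional check succeeded overall.
    """
    selected = [row for row in rows if condition(row)]
    failures = [row for row in selected if not check(row)]
    return {
        "evaluated": len(selected),
        "failures": failures,
        "success": not failures,
    }


if __name__ == "__main__":
    # Hypothetical rows: the minor without an email is outside the condition.
    rows = [
        {"customer_id": 1, "age": 34, "email": "alice@example.com"},
        {"customer_id": 2, "age": 16, "email": ""},
        {"customer_id": 3, "age": 45, "email": ""},
    ]
    result = check_conditionally(
        rows,
        condition=lambda row: row["age"] >= 18,
        check=lambda row: bool(row["email"]),
    )
    print(result["evaluated"], result["success"])
    print([row["customer_id"] for row in result["failures"]])
```

Rows that do not meet the condition are neither passed nor failed; they are simply not evaluated, which is the behavior to expect from a conditional expectation as well.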

In [3]:
# Write your code from here
# Assumes Great Expectations 1.x (imported as gx)
import great_expectations as gx
import pandas as pd
from great_expectations.exceptions import DataContextError

# Step 1: Load the Sample Dataset (same as before)
df = pd.read_csv("sample_data.csv")
print("Sample Data:")
print(df.head())

# Step 2: Initialize Great Expectations Context
# mode="file" reopens the gx/ project folder in the working directory
context = gx.get_context(mode="file")

# Step 3: Load the Existing Expectation Suite
suite_name = "sample_suite"
try:
    suite = context.suites.get(suite_name)
except DataContextError:
    # Stop the cell with a clear message instead of calling exit() in a notebook
    raise SystemExit(
        f"Expectation suite '{suite_name}' not found. Please create it first."
    )

# Step 4: Setup Batch for Validation
# The built-in default pandas data source reads the CSV into a batch
batch = context.data_sources.pandas_default.read_csv("sample_data.csv")

# Step 5: Run the Validation
validation_result = batch.validate(suite)

# Step 6: Review the Validation Results
print("Validation Results:")
print(f"Success: {validation_result.success}")
print(validation_result)

# Step 7: Generate Data Docs (HTML report)
# batch.validate() returns its result to the caller; to publish results to
# Data Docs, run the suite through a Checkpoint with an UpdateDataDocsAction.
context.build_data_docs()

# Step 8: Locate the Data Docs site
docs_path = context.get_docs_sites_urls()
print(f"Data docs available at: {docs_path}")

# Step 9: Optional - Open the report in the browser
# context.open_data_docs()

In [4]:
# Assumes Great Expectations 1.x (imported as gx) and APScheduler 3.x
import great_expectations as gx
import pandas as pd
from apscheduler.schedulers.blocking import BlockingScheduler
from great_expectations.exceptions import DataContextError

# Step 1: Load the Sample Dataset (same as before)
df = pd.read_csv("sample_data.csv")
print("Sample Data:")
print(df.head())

# Step 2: Initialize Great Expectations Context
context = gx.get_context(mode="file")

# Step 3: Load/Create an Expectation Suite with Advanced Expectations
suite_name = "advanced_suite"
try:
    suite = context.suites.get(suite_name)
except DataContextError:
    suite = context.suites.add(gx.ExpectationSuite(name=suite_name))

    # Expecting that the 'customer_id' column should have unique values
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeUnique(column="customer_id")
    )

    # Expecting 'age' values (not string lengths) to be between 18 and 100
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="age", min_value=18, max_value=100
        )
    )

# Step 4: Define the Validation Job

def validate_data():
    # Re-read the file on every run so each check sees the latest data
    batch = context.data_sources.pandas_default.read_csv("sample_data.csv")
    validation_result = batch.validate(suite)
    print("Validation Results:")
    print(validation_result)

    # Step 5: Build Data Docs (Report)
    context.build_data_docs()
    docs_path = context.get_docs_sites_urls()
    print(f"Data docs available at: {docs_path}")

# Step 6: Schedule the Validation with APScheduler (automated daily check)

scheduler = BlockingScheduler()

# A cron trigger fires at 8:00 AM every day
scheduler.add_job(validate_data, 'cron', hour=8, minute=0)

print("Scheduled job added to run daily for data validation.")

# Start the scheduler (this blocks the cell until it is interrupted)
scheduler.start()

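The scheduling cell runs the validation once a day at 08:00. The timing logic behind such a daily trigger can be sketched with the standard library alone; `next_daily_run` is a hypothetical helper written for illustration and is not part of APScheduler.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now, run_at=time(8, 0)):
    """Return the next datetime strictly after `now` that falls on `run_at`."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed (or is right now), so use tomorrow's
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    for now in (datetime(2025, 5, 13, 7, 30), datetime(2025, 5, 13, 9, 15)):
        print(f"now={now}  next run={next_daily_run(now)}")
```

A scheduler does the same calculation after each run and sleeps until the returned time, which is why a long-running validation does not shift the following day's start.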
In [5]:
# Inspect the installed release and the names exported by the data_context package
import great_expectations as gx
import great_expectations.data_context
print(gx.__version__)
print(dir(great_expectations.data_context))