## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip (`pip install great_expectations`).
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., set up a data context and add expectations for column presence and type.
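
The cell below reads `sample_data.csv`, which the activity does not supply. If you do not have a file at hand, a small one can be generated with the standard library. This is a minimal sketch: the rows, the `SAMPLE_ROWS` name, and the `write_sample_csv` helper are invented for illustration, chosen only so that the `id`, `name`, `email`, and `age` columns used in the expectations exist.

```python
import csv
from pathlib import Path

# Hypothetical rows, invented for illustration only
SAMPLE_ROWS = [
    {"id": 1, "name": "Ada", "email": "ada@example.com", "age": 36},
    {"id": 2, "name": "Grace", "email": "grace@example.com", "age": 45},
    {"id": 3, "name": "Alan", "email": "alan@example.com", "age": 41},
]


def write_sample_csv(path="sample_data.csv", rows=SAMPLE_ROWS):
    """Write the rows to a CSV file with a header row and return the path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


csv_path = write_sample_csv()
print(f"Wrote {len(SAMPLE_ROWS)} rows to {csv_path}")
```

Because every `age` value is a whole number and no `name` is missing, a file like this should satisfy the type and not-null checks; edit a row to see an expectation fail.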

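In [1]:
import great_expectations as gx
import pandas as pd

# Assumes Great Expectations 1.x, where the legacy
# great_expectations.dataset module and ge.from_pandas are not available

# Load your sample dataset
df = pd.read_csv('sample_data.csv')

# Initialize a file-backed data context (creates a ./gx project folder if none exists)
context = gx.get_context(mode="file")

# Register the DataFrame with GX and fetch it as a batch
data_source = context.data_sources.add_or_update_pandas(name="my_datasource")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Add expectations
suite = gx.ExpectationSuite(
    name="basic_suite",
    expectations=[
        gx.expectations.ExpectColumnToExist(column="id"),
        gx.expectations.ExpectColumnToExist(column="email"),
        gx.expectations.ExpectColumnValuesToBeOfType(column="age", type_="int64"),
        gx.expectations.ExpectColumnValuesToNotBeNull(column="name"),
    ],
)

# Save expectations (add_or_update lets the cell be re-run safely)
suite = context.suites.add_or_update(suite)

# Validate and show results
results = batch.validate(suite)
print(results)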

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.


In [2]:
import great_expectations as gx
import pandas as pd

# Assumes Great Expectations 1.x, where the DataContext class,
# context.get_batch, and validation operators are not available

# Step 1: Load the data context
context = gx.get_context(mode="file")

# Step 2: Set the batch (your dataset)
df = pd.read_csv("sample_data.csv")  # Replace with your actual file name
data_source = context.data_sources.add_or_update_pandas(
    name="my_datasource"  # Replace with your actual datasource name
)
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")

# Step 3: Run validation against the saved suite
suite = context.suites.get("basic_suite")  # Replace with your expectation suite name
validation_definition = context.validation_definitions.add_or_update(
    gx.ValidationDefinition(
        name="basic_validation",
        data=batch_definition,
        suite=suite,
    )
)
results = validation_definition.run(batch_parameters={"dataframe": df})

# Step 4: Print validation results
print(results)

# Step 5: Build data docs (generate the HTML report)
context.build_data_docs()

# Optional: open the local HTML report in your browser
context.open_data_docs()
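
A printed validation result can be long, so when reviewing results it often helps to list only the expectations that failed. The sketch below works on a plain dictionary shaped like a validation result. It assumes a top-level `results` list whose items carry `success` and `expectation_config`, and that the expectation name sits under `type` (or `expectation_type` in older releases). The `summarize_failures` helper and the `example` dictionary are hypothetical and hand-written, not real output. Validation result objects can usually be converted with `to_json_dict()`, but check this against your installed version.

```python
def summarize_failures(result_dict):
    """Return (expectation name, kwargs) pairs for each failed expectation."""
    failures = []
    for item in result_dict.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        # Assumption: the key is "type" in newer releases, "expectation_type" in older ones
        name = config.get("type") or config.get("expectation_type") or "unknown"
        failures.append((name, config.get("kwargs", {})))
    return failures


# Hand-written example shaped like a validation result; not real output
example = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_to_exist",
                "kwargs": {"column": "id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "name"},
            },
        },
    ],
}

for name, kwargs in summarize_failures(example):
    print(f"FAILED {name}: {kwargs}")
```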

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., add an expectation that customer IDs must be unique, and schedule a daily check.

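In [3]:
import great_expectations as gx
import pandas as pd

# Assumes Great Expectations 1.x, where the legacy
# great_expectations.dataset module and ge.from_pandas are not available

# Load data
df = pd.read_csv('customer_data.csv')  # Replace with your actual file

context = gx.get_context(mode="file")
data_source = context.data_sources.add_or_update_pandas(name="customer_source")
data_asset = data_source.add_dataframe_asset(name="customer_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Advanced expectations
suite = gx.ExpectationSuite(
    name="advanced_suite",
    expectations=[
        gx.expectations.ExpectColumnValuesToBeUnique(column="CustomerID"),
        # Age must be >= 18 when not null. Null values are skipped by
        # default, so no separate "IS NOT NULL" condition is needed.
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="Age",
            min_value=18,
            mostly=1.0,
        ),
    ],
)

# Save expectations (add_or_update lets the cell be re-run safely)
suite = context.suites.add_or_update(suite)
print("Advanced expectations created and saved.")

# Quick check against the loaded data
print("Validation passed:", batch.validate(suite).success)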
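
The scheduling step can be handled by any scheduler, such as cron or a workflow tool. As a dependency-free illustration, the sketch below runs a job once a day at a fixed time using only the standard library. The names `next_run_after`, `run_daily`, and `validate_customers` are hypothetical, and the 02:00 default is an arbitrary choice. The placeholder job only prints a message; in practice it would run the validation for `advanced_suite`.

```python
import time
from datetime import datetime, timedelta


def next_run_after(now, hour=2, minute=0):
    """Return the next datetime at hour:minute that is strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_daily(job, hour=2, minute=0, max_runs=None, sleep=time.sleep, clock=datetime.now):
    """Call `job` once per day at hour:minute; stop after `max_runs` if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        now = clock()
        wait_seconds = (next_run_after(now, hour, minute) - now).total_seconds()
        sleep(wait_seconds)
        job()
        runs += 1


def validate_customers():
    # Placeholder job: call your Great Expectations validation here
    print("Running daily validation...")


# Dry run with a no-op sleep so the example returns immediately
run_daily(validate_customers, max_runs=1, sleep=lambda seconds: None)
```

For a real deployment, drop the `sleep` override and `max_runs` so the loop waits for the scheduled time, or move `validate_customers` into a script and let cron invoke it once a day.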