### 1. Setup and Data Connection
Set up Great Expectations and connect it to our data source. This section covers:
* Importing required libraries
* Initializing the Great Expectations Data Context
* Connecting to PostgreSQL database
* Creating data assets and batch definitions
* Verifying our setup with data preview

*Note: This setup establishes the foundation for our data validation framework, ensuring we can access and process our yellow taxi trip data effectively.*

#### Import Required Libraries
Import the Great Expectations core package and its expectations module for data validation.

In [15]:
import great_expectations as gx
import great_expectations.expectations as gxe

#### Initialize Great Expectations Data Context
Create a Data Context in file mode to manage expectations, validations, and data sources locally.

The call below requests a Data Context in "file" mode, which stores all configuration and results locally; if a `./gx` folder already exists, its configuration is reused. The following are the available Data Context types:

- **File Data Context**: A persistent Data Context that stores metadata and configuration information as YAML files within a file system. It allows you to re-use previously configured Expectation Suites, Data Sources, and Checkpoints.

- **Ephemeral Data Context**: A temporary Data Context that stores metadata and configuration information in memory. It will not persist beyond the current Python session, making it useful for data exploration without needing to save results.

- **GX Cloud Data Context**: A Data Context that connects to a GX Cloud Account to retrieve and store metadata and configuration information. This allows sharing of Expectation Suites, Data Sources, and Checkpoints with your organization.

For more details, refer to the [Great Expectations documentation](https://docs.greatexpectations.io/docs/core/set_up_a_gx_environment/create_a_data_context).

In [16]:
# Reuses the GX config in ./gx if it exists; delete the ./gx folder first to start from a fresh configuration (recommended)
context = gx.get_context(mode="file")

In [None]:
# Print the current Data Context configuration to verify settings
context

#### Connect to PostgreSQL Data Source
Configure connection to PostgreSQL database and create a datasource for Great Expectations to use. This datasource will be used to access and validate the yellow taxi trip data.

Note: For demonstration purposes, we're using a simple connection string with hardcoded credentials. In production environments, you should use secure credential management as described in the [Great Expectations documentation](https://docs.greatexpectations.io/docs/core/connect_to_data/sql_data).

In [18]:
PG_CONNECTION_STRING = "postgresql+psycopg2://postgres:superpass@127.0.0.1:5450/postgres"
datasource_name = "pg_datasource"

data_source = context.data_sources.add_postgres(
    name=datasource_name, connection_string=PG_CONNECTION_STRING
)
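
As a minimal sketch of keeping the password out of the notebook, the connection string can be assembled from environment variables instead. The variable names (`PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, `PG_DATABASE`) and the helper `build_pg_connection_string` are assumptions made for this illustration, not part of Great Expectations; the defaults mirror the values used above.

```python
import os
from urllib.parse import quote


def build_pg_connection_string(env=None):
    """Assemble a psycopg2 connection string from environment-style settings.

    The PG_* names are hypothetical; adapt them to your own deployment.
    """
    env = os.environ if env is None else env
    user = env.get("PG_USER", "postgres")
    # No default for the password: a missing value raises KeyError immediately
    password = quote(env["PG_PASSWORD"], safe="")  # escape characters such as '@' or '/'
    host = env.get("PG_HOST", "127.0.0.1")
    port = env.get("PG_PORT", "5450")
    database = env.get("PG_DATABASE", "postgres")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment
example = build_pg_connection_string({"PG_PASSWORD": "p@ss/word"})
print(example)
```

The returned string could then be passed as `connection_string` to `add_postgres`. Percent-encoding the password matters because characters such as `@` or `/` would otherwise be read as part of the URL structure.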

#### Verify PostgreSQL Data Source Configuration
Display the data source configuration to confirm successful connection and review the connection details.

In [None]:
# Fetch the Data Source back from the Data Context by name
data_source_name = "pg_datasource"
data_source = context.data_sources.get(data_source_name)
data_source


#### Create Data Asset (Table Data Asset)
Create a data asset that represents our table in the data source. A data asset serves as a bridge between Great Expectations and your data, enabling validation and expectation management.

*Note: Data assets in Great Expectations can be created from tables or queries.*

In [20]:
# Define a name for our data asset - this is how we'll refer to it in Great Expectations
table_asset_name = "yellowtaxi_data"

# Specify the actual table name in our PostgreSQL database
table_name = "yellow_tripdata"

# Create a table asset by linking our data asset name to the physical database table
taxi_table_asset = data_source.add_table_asset(name=table_asset_name, table_name=table_name)

In [None]:
# Get all assets registered on the Data Source
assets = context.data_sources.get(datasource_name).assets

# Print asset details in a more structured format
print(f"Data Assets in '{datasource_name}' datasource:")
for asset in assets:
    print(f"  • Asset Name: {asset.name}")
    print(f"    Type: {type(asset).__name__}")

#### Create Batch Definition (Whole Table)
Define how we want to access and validate our data. A batch definition specifies the scope of data to be validated - in this case, we're creating a definition to validate the entire table at once.

*Note: For a SQL table asset, batch definitions can be configured in different ways:*
- Whole table - validates all data in the table
- Partitioned data - validates data split by a date column into yearly, monthly, or daily batches

*To validate the result of a custom SQL query instead, create a query asset rather than a table asset.*

*For our yellow taxi data validation, we're using the whole table approach as we want to validate all records.*

For more information on batch definitions, refer to the [Great Expectations documentation](https://docs.greatexpectations.io/docs/core/connect_to_data/sql_data/#procedure-batch-definition).

In [22]:
# Get the Data Asset from the Data Source using our previously defined asset name
asset_name = "yellowtaxi_data"
data_asset = data_source.get_asset(asset_name)

# Create a batch definition for the entire table
# This defines how we want to process the data - in this case, the whole table at once
fulltable_batch_name = "yellowtaxi_full_table_batch"
full_table_batch_definition = data_asset.add_batch_definition_whole_table(
    name=fulltable_batch_name
)

#### Review Batch Definition Configuration
Verify the batch definition settings to ensure it's correctly configured for our yellow taxi data table. This step helps confirm that Great Expectations understands how to access our data.

*Note: Reviewing the batch definition shows us important metadata like:*
* The batch identifier
* The data asset it's associated with
* The configuration for accessing the data

In [None]:
# Retrieve and display the batch definition configuration
batch_name = "yellowtaxi_full_table_batch"
full_table_batch_definition = data_asset.get_batch_definition(batch_name)
full_table_batch_definition  # Display the configuration details

#### Preview Batch Data
Retrieve a batch of data using our batch definition and display the first few rows. This helps verify that we can successfully access our yellow taxi data and examine its structure.

*Note: The `head()` method shows the first few rows of the batch, allowing us to:*
* Confirm data is being retrieved correctly
* Review the column names and data types
* Verify the data content matches our expectations

In [None]:
# Retrieve and preview the data batch
batch = full_table_batch_definition.get_batch()
batch.head()  # Display first few rows of the data

### 2. Define Expectations
Create and configure data quality rules for our yellow taxi data. This section covers:
* Creating an Expectation Suite as a container for our data quality rules
* Defining specific expectations for data validation:
  - Ensuring pickup dates are not null
  - Validating passenger counts are within valid range (0-6)
* Reviewing the configured expectations

*Note: Expectations are the core concept in Great Expectations, acting as assertions that define what we expect from our data. Each expectation represents a specific data quality rule, and together they form an Expectation Suite that can be used repeatedly to validate our data.*


#### Create Expectation Suite
Initialize a new suite named 'yellowtaxi_data_suite' to store our data quality rules for the taxi dataset.

*Note: This suite will contain our specific validation rules for pickup dates and passenger counts.*

In [25]:
# Create a new Expectation Suite with a descriptive name
suite_name = "yellowtaxi_data_suite"
suite = gx.ExpectationSuite(name=suite_name)

# Add the Expectation Suite to our Data Context
# This makes it available for future validations
suite = context.suites.add(suite)

#### Add Data Quality Expectations
Add two validation rules to our suite:
* Pickup date validation - must not be null
* Passenger count validation - must be between 0 and 6 passengers

*Note: These rules help identify invalid trip records and unrealistic passenger counts.*

In [None]:
# Add expectation: pickup_date should never be null
# This is crucial for trip tracking and analysis
suite.add_expectation(
    gxe.ExpectColumnValuesToNotBeNull(column="pickup_date")
)

In [None]:
# Add expectation: passenger_count should be between 0 and 6
# This reflects the typical capacity of a taxi
suite.add_expectation(
    gxe.ExpectColumnValuesToBeBetween(
        column="passenger_count",
        min_value=0,
        max_value=6
    )
)
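
To make the two rules concrete, here is a plain-Python approximation of what they check, run against a few made-up rows. This is only a sketch of the semantics, not how Great Expectations evaluates them against PostgreSQL. It assumes the range bounds are inclusive and that missing passenger counts are skipped by the range check rather than counted as failures, which is our understanding of the default behaviour. The helper names and sample rows are hypothetical.

```python
def not_null_failures(rows, column):
    """Rows whose value in `column` is missing."""
    return [row for row in rows if row[column] is None]


def between_failures(rows, column, min_value, max_value):
    """Rows whose non-null value falls outside the inclusive range."""
    return [
        row
        for row in rows
        if row[column] is not None and not (min_value <= row[column] <= max_value)
    ]


# Made-up sample rows for illustration only (not real taxi data)
sample_rows = [
    {"pickup_date": "2024-01-01", "passenger_count": 1},
    {"pickup_date": None, "passenger_count": 7},
    {"pickup_date": "2024-01-02", "passenger_count": None},
    {"pickup_date": "2024-01-03", "passenger_count": 6},
]

print(len(not_null_failures(sample_rows, "pickup_date")))
print(len(between_failures(sample_rows, "passenger_count", 0, 6)))
```

Only the second sample row breaks either rule: its pickup date is missing and its passenger count of 7 is above the maximum. The row with a missing passenger count is ignored by the range check, and a count of exactly 6 passes.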

#### Review Expectation Suite Configuration
Examine the configured Expectation Suite to verify that our data quality rules were properly added and configured.

*Note: Reviewing the suite details allows us to:*
* Confirm all expectations were added correctly
* View the complete set of data quality rules
* Verify the configuration of each expectation

In [None]:
# Retrieve and display the Expectation Suite configuration
suite_name = "yellowtaxi_data_suite"
suite_details = context.suites.get(suite_name)
print(suite_details)  # Display all expectations in the suite

### 3. Validations
Execute and verify our data quality checks against the yellow taxi dataset. This section covers:
* Creating a Validation Definition to connect our data with expectations
* Registering the validation definition in our context
* Running the actual validation process
* Reviewing validation results in summary format

*Note: Validation is where we actually test our data against the expectations we defined. This process helps us identify any data quality issues by showing which expectations passed or failed, providing immediate feedback on our data's quality.*

#### Create Validation Definition
Configure a Validation Definition named 'yellowtaxi_data_validation' that links:
* The Batch Definition for our taxi data
* The Expectation Suite to validate it against

In [29]:
# Create a Validation Definition that connects our data with our expectations
validation_definition_name = "yellowtaxi_data_validation"

validation_definition = gx.ValidationDefinition(
    data=full_table_batch_definition,  # The data batch to validate
    suite=suite,                       # The expectations to validate against
    name=validation_definition_name    # Name for this validation
)

#### Add Validation Definition to Context
Register our validation definition with the Data Context to make it available for execution.


In [30]:
# Add the validation definition to our Data Context
# This makes it available for running validations
validation_definition = context.validation_definitions.add(validation_definition)

#### Review Validation Definition
Verify that our validation definition was properly configured by retrieving and displaying its details from the context.


In [None]:
# Retrieve and display the validation definition configuration
validation_definition_name = "yellowtaxi_data_validation"
validation_definition = context.validation_definitions.get(validation_definition_name)
print(validation_definition)  # Display the configuration details

#### Execute Validation
Run the validation to check whether our taxi data meets the defined expectations. The `SUMMARY` result format gives a concise overview of the results.

In [None]:
# Execute the validation and get results in summary format
results = validation_definition.run(result_format="SUMMARY")

# Display the validation results
print(results)  # Shows which expectations passed or failed
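
Printing the full result can be verbose, so a small helper that reduces it to one line per expectation may be easier to scan. The sketch below works on a plain dictionary. The layout it expects (`success`, `results`, `expectation_config`, `result`, `unexpected_count`) is an assumption modelled on the JSON form of a validation result, so check it against your own output before relying on it. The sample dictionary is hand-written for illustration and is not real output from this notebook.

```python
def summarize_validation(result):
    """Reduce a validation-result dictionary to one line per expectation."""
    lines = [f"overall success: {result.get('success')}"]
    for item in result.get("results", []):
        config = item.get("expectation_config", {})
        kwargs = config.get("kwargs", {})
        status = "PASS" if item.get("success") else "FAIL"
        unexpected = item.get("result", {}).get("unexpected_count")
        lines.append(
            f"{status} {config.get('type')} on {kwargs.get('column')} "
            f"(unexpected rows: {unexpected})"
        )
    return lines


# Hand-written sample in the assumed layout (illustrative values only)
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "pickup_date"},
            },
            "result": {"unexpected_count": 0},
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_be_between",
                "kwargs": {"column": "passenger_count", "min_value": 0, "max_value": 6},
            },
            "result": {"unexpected_count": 12},
        },
    ],
}

for line in summarize_validation(sample_result):
    print(line)
```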

### 4. Data Docs
Configure and manage documentation sites for our validation results through:
* Site configuration and storage settings
* Management of Data Docs sites (listing, adding, removing)
* Custom site setup for validation results

*Note: Data Docs transform our technical validations into readable, shareable documentation that helps track and communicate data quality.*

#### Configure Data Docs Site
Define storage location and structure:
* Base directory: 'uncommitted/data_docs/local_site'
* Site builder configuration
* Storage backend settings

In [33]:
# Define the base directory for storing Data Docs files
base_directory = "uncommitted/data_docs/local_site/"

# Configure the Data Docs site settings
site_config = {
    "class_name": "SiteBuilder",                                  # Handles site generation
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},  # Creates site index
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",             # Manages file storage
        "base_directory": base_directory,                         # Where files are stored
    },
}

#### List Available Data Docs Sites
Display currently configured documentation sites in our context.


In [None]:
# Retrieve and display names of all configured Data Docs sites
context.get_site_names()

#### Remove Existing Data Docs Site
Clean up previous site configuration to prepare for new setup.


In [None]:
# Remove the site only if it already exists; deleting a missing site raises an error
site_name = "yellowtaxi_data_validation_result_site"
if site_name in context.get_site_names():
    context.delete_data_docs_site(site_name=site_name)

#### Add New Data Docs Site
Register site 'yellowtaxi_data_validation_result_site' using our custom configuration.

In [36]:
# Add the new Data Docs site to our context
site_name = "yellowtaxi_data_validation_result_site"
context.add_data_docs_site(
    site_name=site_name,     # Name for our validation results site
    site_config=site_config  # Using the configuration we defined earlier
)

### 5. Actions and Checkpoints
Automate the validation workflow through Checkpoints and Actions. This section demonstrates:

* Checkpoints: Reusable validation configurations that combine:
  - What to validate (validation definitions)
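  - When to validate (decided by whatever calls `checkpoint.run()`, such as a notebook cell, a script, or an external scheduler; in GX Core a Checkpoint has no built-in trigger)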
  - What to do with results (actions)

* Actions: Automated responses to validation results:
  - In this case, automatically updating Data Docs
  - Can be extended to include notifications, alerts, or other responses

*Note: Think of this as an automated quality control system where:*
* Checkpoints act as quality control stations
* Actions are the automated responses to inspection results
* Together they create a repeatable, automated validation process
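
The pattern described above can be sketched with a toy model in plain Python: run every validation, then hand the collected results to every action. `ToyCheckpoint` and `record_outcome` are hypothetical names invented for this illustration; they are not the Great Expectations classes and leave out everything the real ones do beyond this control flow.

```python
class ToyCheckpoint:
    """Toy stand-in that only mirrors the run-then-act control flow."""

    def __init__(self, validations, actions):
        self.validations = validations  # name -> zero-argument callable returning bool
        self.actions = actions          # callables that receive the results dict

    def run(self):
        # Step 1: run every validation and collect pass/fail per name
        results = {name: check() for name, check in self.validations.items()}
        # Step 2: give every action the complete set of results
        for action in self.actions:
            action(results)
        return results


report = []


def record_outcome(results):
    """Toy action: append one readable line per validation to the report."""
    for name, passed in results.items():
        report.append(f"{name}: {'passed' if passed else 'failed'}")


toy = ToyCheckpoint(
    validations={"passenger_count_in_range": lambda: all(0 <= n <= 6 for n in [1, 2, 4])},
    actions=[record_outcome],
)
toy_results = toy.run()
print(report)
```

Keeping actions separate from validations is what lets the same checks feed different responses, such as documentation updates or notifications, without touching the checks themselves.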


#### Create Data Docs Update Action
Configure an action to automatically update our documentation site after validation.


In [37]:
# Define action to update Data Docs after validation
actions = [
    gx.checkpoint.actions.UpdateDataDocsAction(
        name="update_data_docs",
        site_names=[site_name]    # Update our specific validation results site
    )
]

#### Create Checkpoint
Create a checkpoint that combines our validation definition with the update action.


In [38]:
# Define a checkpoint that links validations with actions
checkpoint_name = "yellowtaxi_data_validation_checkpoint"
checkpoint = gx.Checkpoint(
    name=checkpoint_name,                           # Name for our checkpoint
    validation_definitions=[validation_definition], # What to validate
    actions=actions                                # What to do after validation
)

#### Add Checkpoint to Context
Register our checkpoint with the Data Context to make it available for execution.


In [39]:
# Add the checkpoint to our Data Context
checkpoint = context.checkpoints.add(checkpoint)

#### Execute Checkpoint
Run the checkpoint to validate data and trigger the Data Docs update action.

In [None]:
# Run the checkpoint and get validation results
validation_results = checkpoint.run()

#### Access Validation Results
After checkpoint execution, view the validation results in the auto-generated Data Docs:

* Location: `./gx/uncommitted/data_docs/local_site/index.html`
* Report Contents:
  - Overview of all validations
  - Detailed results for each expectation
  - Pass/Fail status for data quality rules
  - Historical validation results

*Note: Open the index.html file in a web browser to view an interactive dashboard of your validation results.*
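
A small sketch for locating that file from Python, assuming the notebook runs from the project root, so that `./gx` sits in the current working directory. The helper name `data_docs_index_uri` is hypothetical.

```python
from pathlib import Path


def data_docs_index_uri(project_root="."):
    """Return a file:// URI for the Data Docs index page under the given project root."""
    index_path = Path(
        project_root, "gx", "uncommitted", "data_docs", "local_site", "index.html"
    )
    # resolve() makes the path absolute, which as_uri() requires
    return index_path.resolve().as_uri()


uri = data_docs_index_uri()
print(uri)
# To open it in the default browser:
#   import webbrowser; webbrowser.open(uri)
```

The Data Context also provides `context.open_data_docs()`, which should open the configured sites directly once they have been built.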

### 6. Summary and Workflow
This notebook demonstrates a complete data validation pipeline using Great Expectations:
1. Setup: Configured environment and connected to PostgreSQL
2. Expectations: Defined data quality rules for taxi data
3. Validations: Executed and verified our quality rules
4. Documentation: Generated human-readable reports
5. Automation: Set up checkpoints for repeated validation

*Note: For detailed documentation and advanced features, refer to [Great Expectations Core (GX Core) Documentation](https://docs.greatexpectations.io/docs/core/introduction/)*