# Prototype validation tools
In developing a new data validation framework, it's important to demonstrate that existing validations can be translated into the new framework. This notebook demonstrates a factory pattern for generating `asset_check`s, which run [Great Expectations](https://docs.greatexpectations.io/docs/core/introduction/) (GX) under the hood for easily configurable validations.

In [None]:
!pip install great_expectations

## Read assets
In production, dagster will handle loading assets, so we use a helper function here for testing purposes.

In [None]:
import pandas as pd

def _get_asset(asset_key: str) -> pd.DataFrame:
    return pd.read_parquet(f"https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{asset_key}.parquet")

## Define `asset_check` factory
Define a fake `asset_check` factory to generate GX-based tests. This won't create real `asset_check`s because they can't easily be run in a notebook, but it is meant to mirror what the API would look like.

In [None]:
import great_expectations as gx
from typing import Type, Any
from dataclasses import dataclass
from collections.abc import Callable

@dataclass
class ValidationResult:
    """This would be an AssetCheckResult in production."""
    passed: bool
    metadata: dict
    description: str | None

def validation_factory(
    asset_name: str,
    expectation: Type[gx.expectations.Expectation],
    expectation_config: dict,
    preprocessing_func: Callable | None = None,
    fast_etl_expectation_config: dict | None = None,
    description: str | None = None,
):
    """Return a function which will execute a great expectations expectation."""
    def _validation():
        df = _get_asset(asset_name)

        # Apply preprocessing
        if preprocessing_func is not None:
            df = preprocessing_func(df)

        # Connect to data
        context = gx.get_context()
        batch = context.data_sources.pandas_default.read_dataframe(df)

        # Create expectation (in actual asset_check factory this would check job name to supply correct config)
        configured_expectation = expectation(**expectation_config)

        # Run test
        validation_result = batch.validate(configured_expectation)

        return ValidationResult(
            passed=validation_result.success,
            description=description,
            metadata=validation_result.result,
        )
    return _validation

### Prototype FERC1 bounds check
Even with a fairly basic example test case, this API feels a bit clunky and confusing: there's a lot you have to understand to use it effectively.

In [None]:
validation = validation_factory(
    "out_ferc1__yearly_steam_plants_fuel_by_plant_sched402",
    gx.expectations.ExpectColumnQuantileValuesToBeBetween,
    expectation_config={
        "column": "gas_cost_per_mmbtu",
        "quantile_ranges": {
            "quantiles": [0.05, 0.50, 0.90],
            "value_ranges": [[1.5, 2.1], [2.0, 10.0], [8.0, 15.0]],
        },
    },
    preprocessing_func=lambda df: pd.DataFrame(
        {
            "gas_cost_per_mmbtu": (
                df["gas_fraction_cost"] * df["fuel_cost"]
            ) / (df["gas_fraction_mmbtu"] * df["fuel_mmbtu"])
        }
    )
)()

## Utility function API
An alternative to the factory-function-based API is to provide a selection of utility/helper functions and have developers create `asset_check`s directly. This gives developers a clear and straightforward place to manipulate assets before executing an expectation.

The first obvious utility function we would need is one that takes a dataframe and a preconfigured expectation, then handles the boilerplate GX setup and executes the expectation.

In [None]:
def validate_expectation(
    df: pd.DataFrame,
    expectation: gx.expectations.Expectation,
    description: str | None = None,
) -> ValidationResult:
    """Run a preconfigured GX expectation against a dataframe."""
    # Connect to data
    context = gx.get_context()
    batch = context.data_sources.pandas_default.read_dataframe(df)

    # Execute
    validation_result = batch.validate(expectation)

    return ValidationResult(
        passed=validation_result.success,
        description=description,
        metadata=validation_result.result,
    )

### Prototype FERC1 bounds check
We'll re-implement the FERC1 bounds check from above using this new API. This approach feels much more readable to me.

In [None]:
# Apply asset_check decorator here
def ferc1_fbp_bounds_check() -> ValidationResult:
    df = _get_asset("out_ferc1__yearly_steam_plants_fuel_by_plant_sched402")
    df = pd.DataFrame(
        {
            "gas_cost_per_mmbtu": (
                df["gas_fraction_cost"] * df["fuel_cost"]
            ) / (df["gas_fraction_mmbtu"] * df["fuel_mmbtu"])
        }
    )
    expectation = gx.expectations.ExpectColumnQuantileValuesToBeBetween(
        # Get configuration based on fast/full etl in production
        column="gas_cost_per_mmbtu",
        quantile_ranges={
            "quantiles": [0.05, 0.50, 0.90],
            "value_ranges": [[1.5, 2.1], [2.0, 10.0], [8.0, 15.0]],
        },
    )
    return validate_expectation(df, expectation)

ferc1_fbp_bounds_check()

### Prototype `vs_historical` checks
#### Problem 1: Access multiple tables
To demonstrate a more complex use case, we'll prototype the `test_agg_vs_historical` validation function for EIA 923 boiler fuel data (found in `test/validate/bf_eia923_test.py`). This method compares the `out_eia923__boiler_fuel` table to aggregated versions of this table. This means that the `asset_check` needs access to multiple assets, which is not how `asset_check`s normally work.

I see a couple of options for handling this multi-asset need:

1. Apply `asset_check` to the downstream aggregated asset and just read the upstream asset from parquet.

This approach should work fine for this use case, but it embeds an implicit dependency between required assets. If we were to write a validation that looks at two assets that don't have a direct dependency, then we could end up running the `asset_check` before one of the assets has been materialized.

2. Create stub assets that depend on all necessary upstream assets and apply `asset_check` to this downstream stub. This asset wouldn't actually do anything except guarantee that the upstream assets have been materialized before the `asset_check` runs.

Option 2 seems more robust and clearer to me. Either way, for the purpose of this notebook we'll simply load both assets from parquet.
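
The ordering guarantee behind option 2 can be illustrated with a toy dependency graph. This is only a sketch using the standard library's `graphlib`, not dagster itself, and the node names `bf_eia923_validation_stub` and `bf_eia923_vs_historical_check` are hypothetical. The two real assets are deliberately modeled as independent to mimic the case of two assets without a direct dependency.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes that must run before it.
graph = {
    "out_eia923__boiler_fuel": set(),
    "out_eia923__monthly_boiler_fuel": set(),
    # Hypothetical stub asset: does no work, only declares dependencies.
    "bf_eia923_validation_stub": {
        "out_eia923__boiler_fuel",
        "out_eia923__monthly_boiler_fuel",
    },
    # Hypothetical check attached to the stub rather than to either real asset.
    "bf_eia923_vs_historical_check": {"bf_eia923_validation_stub"},
}

run_order = list(TopologicalSorter(graph).static_order())
check_position = run_order.index("bf_eia923_vs_historical_check")

# Everything scheduled before the check, whatever order the assets ran in.
assets_before_check = set(run_order[:check_position])
```

Because the check hangs off the stub, both real assets always land in `assets_before_check`, which is the property option 1 cannot promise for unrelated assets.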

#### Problem 2: Computing weighted quantile
GX has tooling right out of the box for checking quantiles, but nothing for working with weighted quantiles. Perhaps there's a way to preprocess the data and then apply the basic quantile check tools, but this doesn't seem possible to me. Another option would be to develop a custom SQL-based expectation that checks weighted quantiles. I spent quite a bit of time trying to get this working, with no luck. I was able to develop a query that mostly mimics the behavior of the existing `vs_historical` function and works as expected using duckdb. However, I couldn't get this query working with GX, and the `UnexpectedRowsExpectation` returns very minimal feedback as to what went wrong.

Fortunately, by using `asset_check`s as the basis for the framework, we aren't explicitly tied to GX and can simply write plain Python validations. For the time being, I would recommend simply wrapping the existing method in an `asset_check`.
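
To make the idea of a plain Python validation concrete, here is a minimal sketch of a weighted quantile lower-bound check. It is not the existing `pudl.validate` implementation: both function names are hypothetical, the quantile is defined as the smallest value whose cumulative weight share reaches the requested level (no interpolation), and the numbers are made up.

```python
def weighted_quantile(values, weights, quantile):
    """Return the smallest value whose cumulative weight share reaches `quantile`."""
    if not 0 <= quantile <= 1:
        raise ValueError("quantile must be between 0 and 1")
    pairs = sorted(zip(values, weights))
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    cumulative_weight = 0.0
    for value, weight in pairs:
        cumulative_weight += weight
        if cumulative_weight / total_weight >= quantile:
            return value
    return pairs[-1][0]  # guard against floating point round-off


def check_weighted_quantile_lower_bound(values, weights, quantile, lower_bound):
    """Pass when the weighted quantile is at or above `lower_bound`."""
    observed = weighted_quantile(values, weights, quantile)
    return {"passed": observed >= lower_bound, "observed": observed}


# Made-up numbers purely for illustration.
ash_content_pct = [5.0, 8.0, 10.0, 20.0]
fuel_consumed_units = [1.0, 2.0, 5.0, 2.0]
result = check_weighted_quantile_lower_bound(
    ash_content_pct, fuel_consumed_units, quantile=0.2, lower_bound=7.0
)
```

In an `asset_check`, the returned dictionary would map onto the `passed` and `metadata` fields of the result object.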

Below is a demonstration of the SQL query if we ever want to try to get the `UnexpectedRowsExpectation` version working:

In [None]:
# Prep parameters

from pudl.validate import historical_distribution

bf_eia923 = _get_asset("out_eia923__boiler_fuel")
bf_eia923_agg = _get_asset("out_eia923__monthly_boiler_fuel")

bf_eia923 = bf_eia923[bf_eia923["fuel_type_code_pudl"] == "coal"]
bf_eia923_agg = bf_eia923_agg[bf_eia923_agg["fuel_type_code_pudl"] == "coal"]

data_col = "ash_content_pct"
weight_col = "fuel_consumed_units"
quantile = 0.2

lower_bound = min(historical_distribution(
    bf_eia923,
    data_col=data_col,
    weight_col=weight_col,
    quantile=quantile,
))

In [None]:
# Create query with parameters
query = (
    "WITH CumulativeWeights AS ( "
    "    SELECT "
    f"        {data_col}, "
    f"        {weight_col}, "
    f"        SUM({weight_col}) OVER (ORDER BY {data_col}) AS cumulative_weight, "
    f"        SUM({weight_col}) OVER () AS total_weight "
    "    FROM bf "
    "    WHERE fuel_type_code_pudl='coal' "
    "), "
    "QuantileData AS ( "
    "    SELECT "
    f"        {data_col}, "
    f"        {weight_col}, "
    "        cumulative_weight, "
    "        total_weight, "
    "        cumulative_weight / total_weight AS cumulative_probability "
    "    FROM CumulativeWeights "
    ")"
    f"SELECT {data_col} "
    "FROM QuantileData "
    f"WHERE cumulative_probability >= {quantile} AND {data_col} < {lower_bound} "
    f"ORDER BY {data_col} "
)

In [None]:
# Execute query using duckdb
import duckdb

asset_key = "out_eia923__monthly_boiler_fuel"
# The query refers to this relation by its Python variable name (`bf`).
bf = duckdb.read_parquet(f"https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{asset_key}.parquet")
duckdb.query(query).fetchall()

### Prototype high memory usage validations
GX has flexible options for connecting to data, which could provide support for high-memory use cases, but unfortunately none of these options quite fits our architecture. In these cases, we can once again fall back on the `asset_check`s underlying the framework and use duckdb for highly efficient validations. A simple and flexible solution would be to provide a factory function that takes the asset name and a SQL query that is expected to return no rows. This factory then generates an `asset_check`, which handles boilerplate setup, executes the query on the asset, and returns a failure with details if the query returns one or more rows.

In [None]:
def duckdb_asset_check_factory(
    asset_key: str,
    query: str,
    description: str | None = None,
    limit: int | None = 10,
):
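    """Return a check that passes when `query` returns no rows from the asset."""
    # Apply `asset_check` decorator here
    def asset_check_factory():
        # Append limit to query, unless no limit was requested
        modified_query = query if limit is None else f"{query} LIMIT {limit}"
        # The query refers to this relation by its Python variable name (`asset`).
        asset = duckdb.read_parquet(f"https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{asset_key}.parquet")
        returned_rows = duckdb.query(modified_query).fetchall()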

        return ValidationResult(
            passed=len(returned_rows) == 0,
            description=description,
            metadata={"extra_rows": returned_rows, "num_extra_rows": len(returned_rows)},
        )
    return asset_check_factory

#### Demonstrate with VCERARE asset

In [None]:
duckdb_asset_check_factory(
    asset_key="out_vcerare__hourly_available_capacity_factor",
    query="SELECT county_or_lake_name FROM asset WHERE county_or_lake_name IS NULL",
)()

And a failing case.

In [None]:
duckdb_asset_check_factory(
    asset_key="out_vcerare__hourly_available_capacity_factor",
    query="SELECT * FROM asset WHERE capacity_factor_solar_pv > 1.0",
)()