# Validation of FERC Form 1 Large Steam Plants
This notebook runs sanity checks on the FERC Form 1 large steam plants table (`plants_steam_ferc1`). These are the same checks that PyTest runs in the `plants_steam_ferc1` validation tests. The notebook and its visualizations are meant to be used as a diagnostic tool, to help understand what's wrong when the PyTest-based data validations fail.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
import pandas as pd
import sqlalchemy as sa
import pudl

In [None]:
import warnings
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

In [None]:
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 56

In [None]:
from pudl.workspace.setup import PudlPaths
ferc1_engine = sa.create_engine(PudlPaths().sqlite_db_uri('ferc1'))
pudl_engine = sa.create_engine(PudlPaths().pudl_db)

## Pull `plants_steam_ferc1` and calculate some useful values
First we pull the original (post-ETL) FERC 1 large plants data out of the PUDL database using an output object.  The FERC Form 1 data only exists at annual resolution, so there's no inter-frequency aggregation to think about.

In [None]:
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)
plants_steam_ferc1 = (
    pudl_out.plants_steam_ferc1()
    .assign(
        water_limited_ratio=lambda x: x.water_limited_capacity_mw / x.capacity_mw,
        not_water_limited_ratio=lambda x: x.not_water_limited_capacity_mw / x.capacity_mw,
        peak_demand_ratio=lambda x: x.peak_demand_mw / x.capacity_mw,
        capability_ratio=lambda x: x.plant_capability_mw / x.capacity_mw,
    )
)
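
To make the derived ratio columns concrete, here is a minimal sketch on made-up records. Only the column names `capacity_mw` and `peak_demand_mw` come from the cell above; the values are invented, and whether the real table contains zero capacities is an assumption, not something this notebook states. In pandas, dividing a nonzero float by zero gives `inf`, which the sketch masks as missing.

```python
import numpy as np
import pandas as pd

# Made-up records; only the column names are taken from the cell above.
toy = pd.DataFrame(
    {
        "capacity_mw": [100.0, 50.0, 0.0],
        "peak_demand_mw": [90.0, 55.0, 10.0],
    }
)

# Same assign-with-lambda pattern as the real ratio calculation.
toy = toy.assign(
    peak_demand_ratio=lambda x: x.peak_demand_mw / x.capacity_mw,
)

# A zero capacity produces inf; replace it with NaN (assumed handling,
# chosen here only so that infinite ratios do not distort a distribution).
toy["peak_demand_ratio"] = toy["peak_demand_ratio"].replace(
    [np.inf, -np.inf], np.nan
)
print(toy)
```

Whether masking is appropriate depends on how the downstream checks treat missing values, so treat this as one possible choice rather than a recommendation.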

# Validation Against Fixed Bounds
Some of the variables reported in this table have a fixed range of reasonable values, like the heat content per unit of a given fuel type.  These variables can be tested for validity against external standards directly.  In general we have two kinds of tests in this section:
* **Tails:** Are the extreme values too extreme? Typically this is checked at the 5% and 95% levels, but depending on the distribution, other thresholds are sometimes used.
* **Middle:** Is the central value of the distribution where it should be?
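
As a rough illustration of the two kinds of tests, the sketch below checks the tails and the middle of a synthetic distribution using plain pandas quantiles. The helper `check_quantile_bounds`, its parameters, the bounds, and the sample data are all hypothetical. This is not the implementation used by `pudl.validate`, which may differ (for example by weighting the quantiles).

```python
import numpy as np
import pandas as pd


def check_quantile_bounds(series, quantile, low=None, high=None):
    """Return True if the given quantile of `series` lies within [low, high].

    Hypothetical helper for illustration only; not part of pudl.validate.
    """
    value = series.dropna().quantile(quantile)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


# Synthetic data standing in for a real column of the table.
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.normal(loc=100.0, scale=10.0, size=10_000))

# Tails: the 5% quantile must not be too low, the 95% quantile not too high.
tails_ok = (
    check_quantile_bounds(sample, 0.05, low=50.0)
    and check_quantile_bounds(sample, 0.95, high=150.0)
)
# Middle: the median must fall inside an expected window.
middle_ok = check_quantile_bounds(sample, 0.50, low=95.0, high=105.0)
print(tails_ok, middle_ok)
```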

## Plant Capacities

In [None]:
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_capacity)

## CapEx & OpEx

In [None]:
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_expenses)

## Plant Capacity Ratios

In [None]:
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_capacity_ratios)

## Plant Connected Hours
Currently expected to fail: ~10% of all plants report more than 8760 connected hours, which is the number of hours in a non-leap year.

In [None]:
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_connected_hours)
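
To put a number on how many records exceed the physical limit, something like the following sketch could be used. It runs on synthetic records, and the column names `report_year` and `hours_connected` are assumptions for illustration rather than names confirmed by this notebook. It also accounts for leap years, which have 8784 hours instead of 8760.

```python
import calendar

import pandas as pd


def hours_in_year(year):
    """Number of hours in a calendar year (8760, or 8784 in a leap year)."""
    return (366 if calendar.isleap(year) else 365) * 24


# Synthetic records; the column names are assumptions for illustration.
records = pd.DataFrame(
    {
        "report_year": [2015, 2016, 2016, 2017],
        "hours_connected": [8000.0, 8784.0, 9000.0, 17520.0],
    }
)

# Compare each record against the number of hours in its own report year.
max_hours = records["report_year"].map(hours_in_year)
too_many = records["hours_connected"] > max_hours
print(f"{too_many.mean():.0%} of records exceed the hours in their report year")
```

Listing the offending rows with `records[too_many]` is a natural next step when trying to work out whether the excess comes from unit errors, multi-unit plants reporting summed hours, or something else.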