# Inspecting `raw_eia923__byproduct_disposition`

## Setup

In [None]:
import os

assert os.environ.get("DAGSTER_HOME"), (
    "The DAGSTER_HOME env var is not set, so dagster won't be able to find the assets. "
    "Set the DAGSTER_HOME env var in this notebook, or kill the jupyter server, set"
    " the DAGSTER_HOME env var in your terminal, and relaunch jupyter."
)

In [None]:
from dagster import AssetKey

from pudl.etl import defs

asset_key = "raw_eia923__byproduct_disposition"
df = defs.load_asset_value(AssetKey(asset_key))

df

## Summary of Cleaning

Cleaning steps gathered from inspecting the data in this notebook:
1. Apply `fix_eia_na`. Some values are periods (`.`) that should be nulls.
2. Remove duplicates based on `report_year`, `plant_id_eia`, and `byproduct_description`. There are only 9 duplicated key combinations.
3. Relatedly, those 9 duplicated keys account for 18 rows, none of which has a byproduct description or any values, so they can be filtered out of the data.
4. Drop the `early_release` column: it contains no values.
5. Drop the `data_maturity` column: every record has the same value ("final").
6. Add constraints to `byproduct_description` (10 unique values).
7. Add constraints to `byproducts_to_report` (2 unique values, null allowed).
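As a rough illustration of steps 1 through 5, the sketch below applies them to a small made-up table. The rows and values are invented for illustration, and `DataFrame.replace` is used as a simplified stand-in for `fix_eia_na` that only handles the `.` case; it is not the PUDL implementation.

```python
import pandas as pd

# Toy stand-in for the raw table; every value here is made up.
raw = pd.DataFrame(
    {
        "report_year": [2020, 2020, 2020, 2020],
        "plant_id_eia": [1, 1, 2, 2],
        "byproduct_description": ["Fly ash", "Bottom ash", None, None],
        "sold_1000_tons_or_mmbtu": ["1.5", ".", ".", "."],
        "early_release": [None, None, None, None],
        "data_maturity": ["final", "final", "final", "final"],
    }
)

key = ["report_year", "plant_id_eia", "byproduct_description"]

cleaned = (
    raw.replace(".", pd.NA)  # step 1: simplified stand-in for fix_eia_na
    .dropna(subset=["byproduct_description"])  # steps 2-3: drop undescribed rows
    .drop(columns=["early_release", "data_maturity"])  # steps 4-5
)

# Count rows that share a key with at least one other row.
n_duplicates_before = int(raw.duplicated(subset=key, keep=False).sum())
n_duplicates_after = int(cleaned.duplicated(subset=key, keep=False).sum())
print(n_duplicates_before, n_duplicates_after)
```

In this toy table the only duplicated keys are the rows without a description, so dropping those rows also clears the duplicates, which mirrors the pattern described in steps 2 and 3.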


Questions that arose from inspecting:
1. What can be interpreted from the data when `byproducts_to_report` is null?
2. Is there a standard unit conversion for these? Some columns are in 1000 tons, while others are in 1000 tons or MMBtu (e.g. `sold_1000_tons_or_mmbtu`).

## Fix NA

In [None]:
from pudl.helpers import fix_eia_na

In [None]:
df = fix_eia_na(df)

## Value Counts

In [None]:
df.count()

In [None]:
# Include nulls in the counts, since null is an allowed value for this column.
df["byproducts_to_report"].value_counts(dropna=False)

In [None]:
df["byproduct_description"].value_counts()
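
The value counts above are the basis for constraining these columns to a fixed set of values. One way to check a column against such a set is sketched below. The helper `unexpected_values` and the `ALLOWED` set are hypothetical names introduced for illustration, and the two descriptions in `ALLOWED` are placeholders, not the real list of 10 values.

```python
import pandas as pd

# Placeholder allowed values; the real column has 10 distinct descriptions.
ALLOWED = {"Fly ash", "Bottom ash"}


def unexpected_values(series: pd.Series, allowed: set, allow_null: bool) -> set:
    """Return the values in ``series`` that violate the constraint."""
    # Non-null values that are not in the allowed set.
    extra = set(series.dropna().unique()) - allowed
    # Report nulls as a violation (represented by None) only if they are disallowed.
    if not allow_null and series.isna().any():
        extra.add(None)
    return extra


sample = pd.Series(["Fly ash", "Gypsum", None])
violations_nullable = unexpected_values(sample, ALLOWED, allow_null=True)
violations_strict = unexpected_values(sample, ALLOWED, allow_null=False)
print(violations_nullable, violations_strict)
```

An empty set means the column satisfies the constraint. The `allow_null` flag covers the difference between a column like `byproducts_to_report`, where nulls are allowed, and one where they are not.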

In [None]:
df["data_maturity"].value_counts()

In [None]:
df["report_year"].value_counts().sort_index()

In [None]:
df.describe(include=['object', 'category'])

## Check for Duplicates

In [None]:
unique_by = ["report_year", "plant_id_eia", "byproduct_description"]
duplicates = df.duplicated(subset=unique_by, keep=False)

print(df[duplicates])
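
To confirm that the duplicated rows are exactly the ones without a description, a check along the following lines could be run. It is shown on a made-up table so that it stands alone; the same expressions should work on `df` with the `unique_by` list above, though that has not been verified here.

```python
import pandas as pd

# Toy table: two keys appear twice, both without a description.
toy = pd.DataFrame(
    {
        "report_year": [2020, 2020, 2020, 2021, 2021],
        "plant_id_eia": [1, 2, 2, 3, 3],
        "byproduct_description": ["Fly ash", None, None, None, None],
    }
)
unique_by = ["report_year", "plant_id_eia", "byproduct_description"]

# keep=False flags every row of a duplicated key, not just the repeats.
dupes = toy[toy.duplicated(subset=unique_by, keep=False)]

n_duplicate_rows = len(dupes)
n_duplicate_keys = len(dupes.drop_duplicates(subset=unique_by))
all_dupes_lack_description = bool(dupes["byproduct_description"].isna().all())
print(n_duplicate_rows, n_duplicate_keys, all_dupes_lack_description)
```

Counting keys and rows separately distinguishes "9 duplicates" from "18 rows": each duplicated key contributes two or more rows.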

## Check NULLs

In [None]:
df[df[["report_year", "plant_id_eia", "byproduct_description"]].isnull().any(axis=1)]

In [None]:
df.isnull().sum()

## Check Numeric

In [None]:
df.describe()

## Scratch Transform

In [None]:
import pandas as pd
import pudl
from pudl.metadata import PUDL_PACKAGE

df = defs.load_asset_value(AssetKey(asset_key))

def _core_eia923__byproduct_disposition(
    raw_eia923__byproduct_disposition: pd.DataFrame,
) -> pd.DataFrame:
    """Transform the raw table into core_eia923__byproduct_disposition.

    Transformations include:

    * Drop the early_release column, which has no data values.
    * Drop the data_maturity column, which has a single value ("final") for all records.
    * Replace . values with NA.
    * Drop rows with NA byproduct_description. This also removes all duplicates
      based on report_year, plant_id_eia, and byproduct_description.

    Args:
        raw_eia923__byproduct_disposition: The raw ``raw_eia923__byproduct_disposition`` dataframe.

    Returns:
        Cleaned ``core_eia923__byproduct_disposition`` dataframe ready for harvesting.
    """
    df = raw_eia923__byproduct_disposition

    # TODO: Confirm these fields are droppable
    # Need to stop dropping fields that contain harvestable entity attributes.
    # See https://github.com/catalyst-cooperative/pudl/issues/509
    cols_to_drop = [
        "early_release",
        "data_maturity",
    ]
    df = df.drop(cols_to_drop, axis=1)
    df = pudl.helpers.fix_eia_na(df)
    df = df.dropna(subset=["byproduct_description"])

    df = PUDL_PACKAGE.encode(df)

    return df

In [None]:
clean_df = _core_eia923__byproduct_disposition(df)

In [None]:
clean_df.columns

In [None]:
import matplotlib.pyplot as plt

In [None]:
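def plot_bar(df, x, y):
    """Plot the mean of column ``y`` for each value of column ``x``."""
    avg_df = df.groupby(x)[y].mean().reset_index()
    plt.bar(avg_df[x], avg_df[y])
    plt.xlabel(x)
    plt.ylabel(f"mean {y}")
    plt.xticks(rotation=45)
    plt.show()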

In [None]:
plot_bar(clean_df, "report_year", "disposal_offsite_1000_tons")

In [None]:
plot_bar(clean_df, "report_year", "sold_1000_tons_or_mmbtu")

In [None]:
plot_bar(clean_df, "byproduct_description", "used_onsite_1000_tons")

In [None]:
clean_df

In [None]:
# Count nulls in each numeric column, per byproduct description.
numeric_cols = clean_df.select_dtypes(include="number").columns
null_summary = (
    clean_df.groupby("byproduct_description")[numeric_cols]
    .apply(lambda x: x.isnull().sum())
    .reset_index()
)

In [None]:
null_summary