# Purpose
This notebook compares the unaggregated EIA denormalized tables with their monthly aggregated counterparts. Most of the unaggregated tables are already reported monthly, yet we still create monthly aggregated versions of them. Which aggregated tables actually differ from their unaggregated sources? Do we need to retain all of the monthly aggregated tables?

In [22]:
import os

assert os.environ.get("DAGSTER_HOME"), (
    "The DAGSTER_HOME env var is not set so dagster won't be able to find the assets."
    "Set the DAGSTER_HOME env var in this notebook or kill the jupyter server and set"
    " the DAGSTER_HOME env var in your terminal and relaunch jupyter."
)

## denorm_generation_eia923

In [64]:
from dagster import AssetKey

from pudl.etl import defs
import sqlalchemy as sa
import pandas as pd

asset_key = "denorm_generation_eia923"

assert False, "Replace the sqlite connection string with a full local pudl db!"

engine = sa.create_engine("sqlite:////Users/bendnorman/catalyst/dagster-pudl-work/pudl_output/full_eia_pudl.sqlite")
with engine.connect() as con:
    denorm_generation_eia923 = pd.read_sql_table(asset_key, con)

In [65]:
asset_key = "denorm_generation_monthly_eia923"

with engine.connect() as con:
    denorm_generation_monthly_eia923 = pd.read_sql_table(asset_key, con)

In [159]:
print(denorm_generation_eia923.shape)
print(denorm_generation_monthly_eia923.shape)

(645742, 10)
(645742, 10)


In [70]:
denorm_generation_eia923 = denorm_generation_eia923.drop(columns=["data_maturity"])

In [71]:
denorm_generation_eia923.head()

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id,unit_id_pudl,net_generation_mwh
0,2008-01-01,3,32,Barry,195,18,Alabama Power Co,1,,96021.0
1,2008-02-01,3,32,Barry,195,18,Alabama Power Co,1,,79256.0
2,2008-03-01,3,32,Barry,195,18,Alabama Power Co,1,,91687.0
3,2008-04-01,3,32,Barry,195,18,Alabama Power Co,1,,73693.0
4,2008-05-01,3,32,Barry,195,18,Alabama Power Co,1,,68161.0


In [72]:
denorm_generation_monthly_eia923.head()

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id,unit_id_pudl,net_generation_mwh
0,2008-01-01,3,32,Barry,195,18,Alabama Power Co,1,,96021.0
1,2008-02-01,3,32,Barry,195,18,Alabama Power Co,1,,79256.0
2,2008-03-01,3,32,Barry,195,18,Alabama Power Co,1,,91687.0
3,2008-04-01,3,32,Barry,195,18,Alabama Power Co,1,,73693.0
4,2008-05-01,3,32,Barry,195,18,Alabama Power Co,1,,68161.0


In [87]:
pk_fields = ["plant_id_eia", "generator_id", "report_date"]

sorted_denorm_generation_monthly_eia923 = denorm_generation_monthly_eia923.sort_values(by=pk_fields).reset_index(drop=True)
sorted_denorm_generation_eia923 = denorm_generation_eia923.sort_values(by=pk_fields).reset_index(drop=True)

In [88]:
compare_df = sorted_denorm_generation_monthly_eia923.compare(sorted_denorm_generation_eia923)
assert compare_df.empty

The comparison is empty, so `denorm_generation_monthly_eia923` and `denorm_generation_eia923` are identical, which is expected.
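The sort-then-compare pattern used here repeats for every table pair in this notebook, so it could be wrapped in a small helper. The sketch below is only an illustration on toy data: `compare_on_keys` is a hypothetical name, not part of PUDL, and it assumes both frames have the same number of rows and a unique key.

```python
import pandas as pd


def compare_on_keys(left: pd.DataFrame, right: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Sort both frames by `keys`, keep shared columns, and diff them.

    Hypothetical helper: assumes equal row counts and unique keys, because
    DataFrame.compare requires identically labeled frames.
    """
    shared = [col for col in left.columns if col in right.columns]
    left_sorted = left[shared].sort_values(by=keys).reset_index(drop=True)
    right_sorted = right[shared].sort_values(by=keys).reset_index(drop=True)
    return left_sorted.compare(right_sorted)


keys = ["plant_id_eia", "report_date"]

# Toy data standing in for a denormalized table.
raw = pd.DataFrame(
    {
        "plant_id_eia": [3, 3, 4],
        "report_date": pd.to_datetime(["2008-01-01", "2008-02-01", "2008-01-01"]),
        "net_generation_mwh": [96021.0, 79256.0, 500.0],
    }
)

# Same rows in a different order: the diff should be empty.
reordered = raw.iloc[::-1]
diff_same = compare_on_keys(raw, reordered, keys)

# Change one value: the diff should contain exactly that row and column.
changed = reordered.copy()
changed.loc[0, "net_generation_mwh"] = 96021.5
diff_changed = compare_on_keys(raw, changed, keys)

print(diff_same.empty)
print(diff_changed)
```

Restricting both frames to their shared columns also covers cases where one table carries extra columns such as `data_maturity`.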

## denorm_generation_fuel_combined_eia923

In [89]:
asset_key = "denorm_generation_fuel_combined_eia923"

with engine.connect() as con:
    denorm_generation_fuel_combined_eia923 = pd.read_sql_table(asset_key, con)

In [90]:
asset_key = "denorm_generation_fuel_combined_monthly_eia923"

with engine.connect() as con:
    denorm_generation_fuel_combined_monthly_eia923 = pd.read_sql_table(asset_key, con)

In [91]:
pk_fields = [
    "plant_id_eia",
    "report_date",
    "prime_mover_code",
    "energy_source_code",
]

In [92]:
print(denorm_generation_fuel_combined_eia923.shape)
print(denorm_generation_fuel_combined_monthly_eia923.shape)

(2859269, 18)
(2859269, 16)


In [100]:
missing_cols = set(denorm_generation_fuel_combined_eia923.columns) - set(denorm_generation_fuel_combined_monthly_eia923.columns)
missing_cols

{'data_maturity', 'fuel_type_code_aer'}

In [105]:
compare_cols = list(set(denorm_generation_fuel_combined_eia923.columns) - missing_cols)

sorted_denorm_generation_fuel_combined_eia923 = denorm_generation_fuel_combined_eia923[compare_cols].sort_values(by=pk_fields).reset_index(drop=True)
sorted_denorm_generation_fuel_combined_monthly_eia923 = denorm_generation_fuel_combined_monthly_eia923[compare_cols].sort_values(by=pk_fields).reset_index(drop=True)

In [106]:
compare_df = sorted_denorm_generation_fuel_combined_eia923.compare(sorted_denorm_generation_fuel_combined_monthly_eia923)
assert compare_df.empty

AssertionError: 

In [125]:
compare_df

Unnamed: 0_level_0,fuel_mmbtu_per_unit,fuel_mmbtu_per_unit
Unnamed: 0_level_1,self,other
0,5.825,5.826211
1,0.000,inf
2,5.825,5.823980
3,0.000,inf
4,5.825,5.824742
...,...,...
2859264,0.000,inf
2859265,0.000,inf
2859266,0.000,inf
2859267,0.000,inf


In [119]:
diff_gen_fuel = sorted_denorm_generation_fuel_combined_eia923.loc[compare_df.index]

In [121]:
diff_gen_fuel.energy_source_code.value_counts()

NG     674250
DFO    530256
WAT    373010
SUN    297877
WND    174413
BIT     94935
LFG     70394
RFO     61129
SUB     52729
WDS     37293
OG      25569
OBG     19689
MSB     19617
MSN     19335
KER     17052
GEO     15649
BLQ     14734
PC      13806
JF      11751
WO      10507
OTH      8982
TDF      7896
PG       6313
WH       6289
WC       5746
LIG      5193
RC       4634
SLW      4374
OBS      3968
AB       3624
MWH      3570
BFG      3112
OBL      3016
PUR      2941
SC       2385
WDL       988
SGC       468
SGP       206
ANT       182
Name: energy_source_code, dtype: int64

In [124]:
(diff_gen_fuel.energy_source_code == "NUC").any()

False

It seems like the recalculated `fuel_mmbtu_per_unit` in the aggregated table introduces rounding errors (and `inf` values). Are all of the differences rounding errors?

In [142]:
import numpy as np
non_inf_compare_df = compare_df[compare_df.fuel_mmbtu_per_unit.other != np.inf]
fuel_mmbtu_per_unit_diff = (non_inf_compare_df.fuel_mmbtu_per_unit.self - non_inf_compare_df.fuel_mmbtu_per_unit.other)
fuel_mmbtu_per_unit_diff.describe()

count    1.163572e+06
mean              inf
std               NaN
min     -3.098800e+04
25%     -2.462380e-05
50%     -2.018916e-08
75%      1.896324e-05
max               inf
dtype: float64

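The quartiles are on the order of 1e-5 or smaller, so most of the finite differences look like rounding errors. The rest are `inf` values produced by the aggregation step, plus some large outliers (the minimum difference is about -3.1e4). Can we just use the unaggregated tables for downstream assets?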
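One way to separate the three kinds of difference is to classify each row as `inf`, within tolerance, or unexplained. The sketch below uses toy numbers and assumes (this is an assumption, not confirmed by this notebook) that the aggregation recomputes the heat content as total MMBtu divided by total fuel units. The column names and the `rtol` value are illustrative.

```python
import numpy as np
import pandas as pd

# Toy values: reported heat content vs. one recomputed from totals.
reported = pd.Series([5.825, 0.0, 5.825])
fuel_consumed_mmbtu = pd.Series([58.262, 10.0, 58.24])
fuel_consumed_units = pd.Series([10.0, 0.0, 10.0])

# Dividing by zero units yields inf rather than raising an error.
recalculated = (fuel_consumed_mmbtu / fuel_consumed_units).to_numpy()

is_inf = np.isinf(recalculated)
# Treat differences within a 0.1% relative tolerance as rounding noise.
is_close = np.isclose(reported.to_numpy(), recalculated, rtol=1e-3)
# Anything that is neither inf nor close deserves a manual look.
unexplained = ~(is_inf | is_close)

print(is_inf, is_close, unexplained)
```

On the real tables, the rows flagged as `unexplained` would be the ones to inspect before deciding whether the aggregated table can be dropped.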

## denorm_boiler_fuel_eia923

In [143]:
asset_key = "denorm_boiler_fuel_eia923"

with engine.connect() as con:
    denorm_boiler_fuel_eia923 = pd.read_sql_table(asset_key, con)

In [144]:
asset_key = "denorm_boiler_fuel_monthly_eia923"

with engine.connect() as con:
    denorm_boiler_fuel_monthly_eia923 = pd.read_sql_table(asset_key, con)

In [145]:
print(denorm_boiler_fuel_eia923.shape)
print(denorm_boiler_fuel_monthly_eia923.shape)

(1521304, 17)
(1521304, 17)


In [150]:
pk_fields = [
    "plant_id_eia",
    "boiler_id",
    "energy_source_code",
    "prime_mover_code",
    "report_date",
]


sorted_denorm_boiler_fuel_eia923 = denorm_boiler_fuel_eia923.sort_values(by=pk_fields).reset_index(drop=True)
sorted_denorm_boiler_fuel_monthly_eia923 = denorm_boiler_fuel_monthly_eia923.sort_values(by=pk_fields).reset_index(drop=True)


compare_df = sorted_denorm_boiler_fuel_eia923.compare(sorted_denorm_boiler_fuel_monthly_eia923)

In [155]:
compare_df

Unnamed: 0_level_0,fuel_mmbtu_per_unit,fuel_mmbtu_per_unit,sulfur_content_pct,sulfur_content_pct,ash_content_pct,ash_content_pct
Unnamed: 0_level_1,self,other,self,other,self,other
0,,,0.49,0.49,,
5,,,0.46,0.46,,
6,,,0.46,0.46,,
9,,,0.64,0.64,,
13,,,0.46,0.46,,
...,...,...,...,...,...,...
1521287,0.0,,0.00,,0.0,
1521288,0.0,,0.00,,0.0,
1521289,0.0,,0.00,,0.0,
1521290,0.0,,0.00,,0.0,


In [154]:
(compare_df.sulfur_content_pct.self - compare_df.sulfur_content_pct.other).describe()

count    2.213200e+04
mean     1.812164e-19
std      2.300727e-16
min     -8.881784e-16
25%     -1.110223e-16
50%     -1.734723e-18
75%      1.110223e-16
max      8.881784e-16
dtype: float64

I think this is another case of floating-point noise (the `sulfur_content_pct` differences are all within about 1e-15) and NaN values introduced by dividing values during the aggregation.
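A minimal sketch of how both effects can arise, assuming (hypothetically) that the aggregation computes a weighted average by multiplying each value by the fuel units and dividing by the total units. The numbers are made up, and this is not necessarily how the PUDL aggregation is implemented.

```python
import numpy as np
import pandas as pd

# Toy boiler-fuel records: the first row consumed no fuel at all.
fuel_units = pd.Series([0.0, 100.0])
sulfur_content_pct = pd.Series([0.0, 0.46])

# Weighted-average style recomputation: value * weight / total weight.
recomputed = (sulfur_content_pct * fuel_units) / fuel_units

# 0.0 / 0.0 is NaN, so a reported 0.0 turns into a missing value.
became_nan = np.isnan(recomputed.iloc[0])

# The round trip through multiply and divide may shift the value by a few
# units in the last place, so compare with a tolerance, not with ==.
still_close = np.isclose(recomputed.iloc[1], sulfur_content_pct.iloc[1], rtol=1e-12)

print(became_nan, still_close)
```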

## denorm_fuel_receipts_costs_eia923

In [156]:
asset_key = "denorm_fuel_receipts_costs_eia923"

with engine.connect() as con:
    denorm_fuel_receipts_costs_eia923 = pd.read_sql_table(asset_key, con)

In [157]:
asset_key = "denorm_fuel_receipts_costs_monthly_eia923"

with engine.connect() as con:
    denorm_fuel_receipts_costs_monthly_eia923 = pd.read_sql_table(asset_key, con)

In [158]:
print(denorm_fuel_receipts_costs_eia923.shape)
print(denorm_fuel_receipts_costs_monthly_eia923.shape)

(631975, 36)
(258115, 20)


Ok! The unaggregated and monthly aggregated tables clearly differ in both row and column counts, which makes sense given:
> There can be multiple deliveries of the same type of fuel from the same supplier to the same plant in a single month, so the table has no natural primary key.
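The effect of a missing natural primary key can be illustrated on toy data: when several deliveries share the same grouping columns, a monthly aggregation collapses them into one row. The grouping columns and `fuel_received_units` values below are illustrative assumptions, not the exact keys used by the PUDL aggregation.

```python
import pandas as pd

# Toy deliveries: plant 3 received two separate BIT deliveries in January.
deliveries = pd.DataFrame(
    {
        "plant_id_eia": [3, 3, 3, 4],
        "report_date": pd.to_datetime(["2008-01-01"] * 4),
        "energy_source_code": ["BIT", "BIT", "NG", "BIT"],
        "fuel_received_units": [100.0, 50.0, 20.0, 70.0],
    }
)

keys = ["plant_id_eia", "report_date", "energy_source_code"]

# If any key combination repeats, the columns are not a primary key.
has_duplicate_keys = deliveries.duplicated(subset=keys).any()

# Aggregating on those columns merges the repeated deliveries.
monthly = deliveries.groupby(keys, as_index=False)["fuel_received_units"].sum()

print(has_duplicate_keys)
print(len(deliveries), len(monthly))
```

Running the same `duplicated` check on the other unaggregated tables would be a quick way to confirm that their monthly versions cannot collapse any rows.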