# Create a Mega EIA861 Table

The EIA861 data is currently only partially integrated into the ETL pipeline. In the meantime, we have created temporary output tables that make the transformed data easier to access. This notebook combines those tables into a single wide "mega" table and exports it as a spreadsheet.

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Standard libraries
import logging
import os
import pathlib
import sys
from functools import reduce

# 3rd party libraries
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import sqlalchemy as sa

# Local libraries
import pudl
import pudl.constants as pc

In [None]:
sns.set()
%matplotlib inline
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [None]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

## Load Output Tables

In [None]:
eia861_dfs = {
    "service_territory_eia861": pudl_out.service_territory_eia861(),
    "balancing_authority_eia861": pudl_out.balancing_authority_eia861(),
    "sales_eia861": pudl_out.sales_eia861(),
    "advanced_metering_infrastructure_eia861": pudl_out.advanced_metering_infrastructure_eia861(),
    "demand_response_eia861": pudl_out.demand_response_eia861(),
    "demand_response_water_heater_eia861": pudl_out.demand_response_water_heater_eia861(),
    "demand_side_management_sales_eia861": pudl_out.demand_side_management_sales_eia861(),
    "demand_side_management_ee_dr_eia861": pudl_out.demand_side_management_ee_dr_eia861(),
    "demand_side_management_misc_eia861": pudl_out.demand_side_management_misc_eia861(),
    "distributed_generation_tech_eia861": pudl_out.distributed_generation_tech_eia861(),
    "distributed_generation_fuel_eia861": pudl_out.distributed_generation_fuel_eia861(),
    "distributed_generation_misc_eia861": pudl_out.distributed_generation_misc_eia861(),
    "distribution_systems_eia861": pudl_out.distribution_systems_eia861(),
    "dynamic_pricing_eia861": pudl_out.dynamic_pricing_eia861(),
    "energy_efficiency_eia861": pudl_out.energy_efficiency_eia861(),
    "green_pricing_eia861": pudl_out.green_pricing_eia861(),
    "mergers_eia861": pudl_out.mergers_eia861(),
    "net_metering_customer_fuel_class_eia861": pudl_out.net_metering_customer_fuel_class_eia861(),
    "net_metering_misc_eia861": pudl_out.net_metering_misc_eia861(),
    "non_net_metering_customer_fuel_class_eia861": pudl_out.non_net_metering_customer_fuel_class_eia861(),
    "non_net_metering_misc_eia861": pudl_out.non_net_metering_misc_eia861(),
    "operational_data_revenue_eia861": pudl_out.operational_data_revenue_eia861(),
    "operational_data_misc_eia861": pudl_out.operational_data_misc_eia861(),
    "reliability_eia861": pudl_out.reliability_eia861(),
    "utility_data_nerc_eia861": pudl_out.utility_data_nerc_eia861(),
    "utility_data_rto_eia861": pudl_out.utility_data_rto_eia861(),
    "utility_data_misc_eia861": pudl_out.utility_data_misc_eia861(),
}

## Combine Output Tables


In [None]:
util_cols = [
    'utility_id_eia',
    'state',
    'report_date',
]
idx_ba = util_cols + ['balancing_authority_code_eia']
idx_nr = util_cols + ['nerc_region']

In [None]:
# Drop the stray 'unnamed_0' column that comes with the new (2019) data
for df_name, df in eia861_dfs.items():
    if 'unnamed_0' in df.columns:
        eia861_dfs[df_name] = df.drop('unnamed_0', axis=1)

# Rename the reliability table's 'standard' column to 'standards_class'
eia861_dfs['reliability_eia861'] = (
    eia861_dfs['reliability_eia861'].rename(columns={'standard': 'standards_class'})
)

In [None]:
# Run this cell to reset the dict
table_dict = {
    'advanced_metering_infrastructure_eia861': eia861_dfs['advanced_metering_infrastructure_eia861'].copy(),
     #'balancing_authority_assn_eia861',
     #'balancing_authority_eia861': eia_transformed_dfs['balancing_authority_eia861'].copy(),
     'demand_response_eia861': eia861_dfs['demand_response_eia861'].copy(),
     'demand_response_water_heater_eia861': eia861_dfs['demand_response_water_heater_eia861'].copy(),
     'demand_side_management_ee_dr_eia861': eia861_dfs['demand_side_management_ee_dr_eia861'].copy(),
     'demand_side_management_misc_eia861': eia861_dfs['demand_side_management_misc_eia861'].copy(),
     'demand_side_management_sales_eia861': eia861_dfs['demand_side_management_sales_eia861'].copy(),
     'distributed_generation_fuel_eia861': eia861_dfs['distributed_generation_fuel_eia861'].copy(),
     'distributed_generation_misc_eia861': eia861_dfs['distributed_generation_misc_eia861'].copy(),
     'distributed_generation_tech_eia861': eia861_dfs['distributed_generation_tech_eia861'].copy(),
     'distribution_systems_eia861': eia861_dfs['distribution_systems_eia861'].copy(),
     'dynamic_pricing_eia861': eia861_dfs['dynamic_pricing_eia861'].copy(),
     'energy_efficiency_eia861': eia861_dfs['energy_efficiency_eia861'].copy(),
     'green_pricing_eia861': eia861_dfs['green_pricing_eia861'].copy(),
     'mergers_eia861': eia861_dfs['mergers_eia861'].copy(),
     'net_metering_customer_fuel_class_eia861': eia861_dfs['net_metering_customer_fuel_class_eia861'].copy(),
     'net_metering_misc_eia861': eia861_dfs['net_metering_misc_eia861'].copy(),
     'non_net_metering_customer_fuel_class_eia861': eia861_dfs['non_net_metering_customer_fuel_class_eia861'].copy(),
     'non_net_metering_misc_eia861': eia861_dfs['non_net_metering_misc_eia861'].copy(),
     'operational_data_misc_eia861': eia861_dfs['operational_data_misc_eia861'].copy(),
     'operational_data_revenue_eia861': eia861_dfs['operational_data_revenue_eia861'].copy(),
     'reliability_eia861': eia861_dfs['reliability_eia861'].copy(),
     'sales_eia861': eia861_dfs['sales_eia861'].copy(),
     'service_territory_eia861': eia861_dfs['service_territory_eia861'].copy(),
     #'utility_assn_eia861',
     'utility_data_misc_eia861': eia861_dfs['utility_data_misc_eia861'].copy(),
     'utility_data_nerc_eia861': eia861_dfs['utility_data_nerc_eia861'].copy(),
     'utility_data_rto_eia861': eia861_dfs['utility_data_rto_eia861'].copy(),
}

In [None]:
unpeel_list = [
    'advanced_metering_infrastructure_eia861',
     #'balancing_authority_assn_eia861',
     #'balancing_authority_eia861',
     'demand_response_eia861',
     'demand_response_water_heater_eia861',
     'demand_side_management_ee_dr_eia861',
     'demand_side_management_misc_eia861',
     'demand_side_management_sales_eia861',
     'distributed_generation_fuel_eia861',
     'distributed_generation_misc_eia861',
     'distributed_generation_tech_eia861',
     'distribution_systems_eia861',
     'dynamic_pricing_eia861',
     'energy_efficiency_eia861',
     'green_pricing_eia861',
     'mergers_eia861',
     'net_metering_customer_fuel_class_eia861',
     'net_metering_customer_fuel_class_eia861', # because two classes
     'net_metering_misc_eia861',
     'non_net_metering_customer_fuel_class_eia861',
     'non_net_metering_customer_fuel_class_eia861', # because two classes 
     'non_net_metering_misc_eia861',
     'operational_data_misc_eia861',
     'operational_data_revenue_eia861',
     'reliability_eia861',
     'sales_eia861',
     'service_territory_eia861',
     #'utility_assn_eia861',
     'utility_data_misc_eia861',
     'utility_data_nerc_eia861',
     'utility_data_rto_eia861'
]

In [None]:
moniker_dict = {
    'advanced_metering_infrastructure_eia861': 'AMI',
     #'balancing_authority_assn_eia861',
     #'balancing_authority_eia861',
     'demand_response_eia861': 'DR',
     'demand_response_water_heater_eia861': 'DR',
     'demand_side_management_ee_dr_eia861': 'DSM',
     'demand_side_management_misc_eia861': 'DSM',
     'demand_side_management_sales_eia861': 'DSM',
     'distributed_generation_fuel_eia861': 'DG',
     'distributed_generation_misc_eia861': 'DG',
     'distributed_generation_tech_eia861': 'DG',
     'distribution_systems_eia861': 'DS',
     'dynamic_pricing_eia861': 'DP',
     'energy_efficiency_eia861': 'EE',
     'green_pricing_eia861': 'GP',
     'mergers_eia861': 'M',
     'net_metering_customer_fuel_class_eia861': 'NM',
     'net_metering_misc_eia861': 'NM',
     'non_net_metering_customer_fuel_class_eia861': 'NNM',
     'non_net_metering_misc_eia861': 'NNM',
     'operational_data_misc_eia861': 'OD',
     'operational_data_revenue_eia861': 'OD',
     'reliability_eia861': 'R',
     'sales_eia861': 'S',
     'service_territory_eia861': 'ST',
     #'utility_assn_eia861',
     'utility_data_misc_eia861': 'UD',
     'utility_data_nerc_eia861': 'UD',
     'utility_data_rto_eia861': 'UD',
}

In [None]:
def unpeel(df, df_name, class_name):
    """Pivot a single class column into column-name prefixes (tall-to-wide reformatting)."""
    logger.info(f'unpeeling {class_name} from {df_name} table')
    # Collect the qualitative (non-numeric) columns, plus utility_id_eia, to use as the index
    string_df = (
        df[['utility_id_eia']]
        .join(df.select_dtypes(exclude=['int64', 'float']))
    )

    qual_cols = list(string_df.columns)
    qual_cols.remove(class_name)

    wide_df = (
        df.set_index(qual_cols)
        .pivot(columns=class_name)
    )
    # Flatten the (value, class) column tuples into '<class>_<value>' names
    old_cols = list(wide_df.columns.values)
    wide_df.columns = ['_'.join(col[::-1]) for col in old_cols]
    wide_df = wide_df.reset_index()
    return wide_df
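
The tall-to-wide step in `unpeel` can be illustrated on a toy table. This is only a sketch: the column names and values below are made up for illustration and are not taken from the EIA861 data.

```python
import pandas as pd

# Hypothetical tall table: one row per utility/state/date/customer class.
tall = pd.DataFrame({
    "utility_id_eia": [1, 1, 2, 2],
    "state": ["CO", "CO", "WY", "WY"],
    "report_date": ["2019-01-01"] * 4,
    "customer_class": ["residential", "commercial"] * 2,
    "customers": [100.0, 10.0, 200.0, 20.0],
})

idx_cols = ["utility_id_eia", "state", "report_date"]
wide = tall.set_index(idx_cols).pivot(columns="customer_class")
# Columns are now (value, class) tuples; reverse and join them.
wide.columns = ["_".join(col[::-1]) for col in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```

Each class value becomes a prefix on the value column, so the four tall rows collapse into one row per utility/state/date.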

In [None]:
def check_and_unpeel(df_name):
    """Run unpeel function on tables that have a class column."""
    df = table_dict[df_name].copy()
    
    # Convert categorical columns to strings
    for col in df:
        if 'category' in df[col].dtype.name:
            df[col] = df[col].astype('string')

    # Only unpeel if there is a class column.
    class_names = [col for col in df if 'class' in col]
    if len(class_names) > 0:
        wide_df = unpeel(df, df_name, class_names[0])
    else:
        wide_df = df
    
    return wide_df

In [None]:
def groupby_utils(df, df_name, util_cols):
    """Group EIA861 tables at the utility level.

    Some of the qualitative columns present an aggregation challenge when
    grouping at the utility level (nerc_region and ba_code, specifically). To
    keep all of their values, we first look for duplicates (same
    utility/state/year, different NERC region or BA code). Where duplicates
    exist, we combine the values into a single row, e.g. SERC and MISC become
    'SERC, MISC'. We single out the rows that have duplicates rather than
    running this on the whole dataframe to save time.

    """
    # Set N/A state values to UNK to prevent issues in the .transform() func
    df['state'] = df['state'].fillna('UNK')
    df = df.set_index(util_cols)
    
    # Separate the df columns into dtypes
    num_df = df.select_dtypes(include=['int64', 'float']).reset_index()
    qual_df = df.select_dtypes(exclude=['int64', 'float']).reset_index()
    
    # See whether any of the columns are duplicated at the utility-state-date level
    qual_df['dup'] = qual_df.duplicated(subset=util_cols, keep=False)
    
    # Divide into duplicated and non-duplicated
    dup_df = qual_df[qual_df['dup']==True]
    non_dup_df = qual_df[qual_df['dup']==False]
    
    if dup_df.empty:
        logger.info(f'{df_name} has no duplicates')
        return df.reset_index()
    else:
        logger.info(f'{df_name} has duplicates; combining qualitative values')
        # Combine those that are duplicated into "VAL1, VAL2" units
        dup_transformed = dup_df.groupby(util_cols).transform(lambda x: ', '.join(x.unique()))
        dup_grouped = (
            dup_df[util_cols]
            .drop_duplicates()
            .join(dup_transformed)
            .groupby(util_cols)
            .first()
            .reset_index()
        )
        # Grab the first value for non-duplicated values
        non_dup_grouped = non_dup_df.groupby(util_cols).first().reset_index()

        # Combine newly grouped duplicates and non duplicates
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        qual_grouped = pd.concat([dup_grouped, non_dup_grouped], ignore_index=True)

        # Sum numeric columns
        num_grouped = num_df.groupby(util_cols).sum(min_count=1)

        # Merge numeric and qualitative dataframes back together
        merge_df = pd.merge(num_grouped, qual_grouped, on=util_cols).drop('dup', axis=1)
        
        return merge_df
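
The aggregation rule described in the `groupby_utils` docstring (join conflicting qualitative values, sum numeric ones) can be sketched with a single named aggregation. This is a simplified alternative for illustration, not the function above: it runs on every group rather than singling out duplicates, and the table and its `sales_mwh` column are hypothetical.

```python
import pandas as pd

util_cols = ["utility_id_eia", "state", "report_date"]

# Hypothetical table: utility 1 reports two NERC regions for the same state/date.
df = pd.DataFrame({
    "utility_id_eia": [1, 1, 2],
    "state": ["CO", "CO", "WY"],
    "report_date": ["2019-01-01"] * 3,
    "nerc_region": ["SERC", "MRO", "WECC"],
    "sales_mwh": [5.0, 7.0, 3.0],
})

grouped = df.groupby(util_cols, as_index=False).agg(
    # Join the distinct qualitative values into one "VAL1, VAL2" string
    nerc_region=("nerc_region", lambda x: ", ".join(x.dropna().unique())),
    # Sum numeric values, keeping NaN when a group has no values at all
    sales_mwh=("sales_mwh", lambda x: x.sum(min_count=1)),
)
print(grouped)
```

Applying the join to every group is slower on large tables, which is the reason the docstring gives for treating duplicated and non-duplicated rows separately.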

In [None]:
def mega_merge(table_dict):
    """Merge all the EIA 861 tables together."""
    # Outer-merge every table on the utility columns. Each table's columns get a
    # moniker suffix (e.g. '_S_' for sales) so that repeated names stay distinct.
    #table_list = list(table_dict.values())
    merge_df = pd.DataFrame(columns=util_cols)
    #num = 0
    for df_name, df in table_dict.items():
        logger.info(f'merging {df_name}')
        moniker = moniker_dict[df_name]
        df = df.set_index(util_cols)
        df.columns = df.columns.map(lambda x: str(x) + f'_{moniker}_')
        merge_df = pd.merge(merge_df, df, on=util_cols, how='outer')
        #num = num+1
    
    return merge_df
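
As a rough sketch of what `mega_merge` does, the following tags each table's columns with a moniker suffix and outer-merges on the utility columns. The two miniature tables are hypothetical, and `add_suffix` is used here in place of the `columns.map` call above.

```python
import pandas as pd

util_cols = ["utility_id_eia", "state", "report_date"]

# Hypothetical miniature tables, keyed by moniker.
tables = {
    "S": pd.DataFrame({
        "utility_id_eia": [1, 2],
        "state": ["CO", "WY"],
        "report_date": ["2019-01-01", "2019-01-01"],
        "customers": [100, 200],
    }),
    "GP": pd.DataFrame({
        "utility_id_eia": [2, 3],
        "state": ["WY", "UT"],
        "report_date": ["2019-01-01", "2019-01-01"],
        "customers": [5, 7],
    }),
}

merged = None
for moniker, df in tables.items():
    # Suffix only the data columns; the utility columns stay as merge keys
    tagged = df.set_index(util_cols).add_suffix(f"_{moniker}_").reset_index()
    merged = tagged if merged is None else merged.merge(tagged, on=util_cols, how="outer")

print(merged)
```

The outer merge keeps utilities that appear in only one table, leaving NaN in the columns that come from the other.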

In [None]:
def unpeel_group_merge():
    """Re-widen all tables, groupby utility, merge into one mega table."""
    # Go through list of tables and widen. Use unpeel_list because 
    # the non/net_metering tables have to be run twice.
    for df_name in unpeel_list:
        wide_df = check_and_unpeel(df_name)
        table_dict[df_name] = wide_df
    
    # Group each of the widened tables by utility/state/date
    for df_name, df in table_dict.items():
        wide_df = df.copy()
        util_df = groupby_utils(wide_df, df_name, util_cols)
        table_dict[df_name] = util_df
        
    # Merge wide, grouped tables together into one "mega" dataframe
    mega_df = mega_merge(table_dict)
    
    return mega_df

In [None]:
def compare_common_cols(df, col_name):
    """Turn repeat columns into one column with all values."""
    col_list = [col for col in df if col_name in col]
    col_df = df.set_index(util_cols)[col_list].copy()
    # Flag rows where every repeated column agrees (NA is treated as 'UNK')
    temp_df = col_df.fillna('UNK')
    col_df['bool'] = temp_df.eq(temp_df.iloc[:, 0], axis=0).all(axis=1)
    # NOTE: only rows with conflicting values get a combined value below; rows
    # where all columns agree end up as NaN in the new column after the merge.
    col_df_false = col_df[~col_df['bool']].copy()
    col_df_false = col_df_false.astype('object')
    col_df_false[col_name] = (
        col_df_false[col_df_false.columns[:-1]]
        .apply(lambda x: ', '.join(x.dropna().unique()), axis=1)
    )
    df = df.drop(col_list, axis=1)
    df = pd.merge(df, col_df_false[[col_name]], on=util_cols, how='outer')
    
    return df
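
The row-wise combination at the heart of `compare_common_cols` can be sketched on a toy frame. The column names and values are hypothetical, and unlike the function above this version combines every row, including rows where the columns already agree.

```python
import pandas as pd

# Hypothetical repeated columns from two different tables.
df = pd.DataFrame({
    "nerc_region_S_": ["SERC", "WECC", None],
    "nerc_region_R_": ["SERC", "MRO", "TRE"],
})

# Join the distinct non-null values in each row into one string
combined = df.apply(lambda row: ", ".join(row.dropna().unique()), axis=1)
print(combined.tolist())
```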

In [None]:
def loop_over_common_cols(mega_df):
    
    common_cols = [
        'balancing_authority_code_eia',
        'utility_name_eia',
        'nerc_region',
        'entity_type',
    ]
    # Drop the short form columns
    logger.info('removing short form columns')
    drop_list = [col for col in mega_df if 'short_form' in col]
    mega_df = mega_df.drop(drop_list, axis=1)
            
    # Compare duplicate columns in the mega table
    for col in common_cols:
        logger.info(f'comparing column values for {col}')
        mega_df = compare_common_cols(mega_df, col)
    
    return mega_df

In [None]:
mega_df = unpeel_group_merge()

In [None]:
final_df = loop_over_common_cols(mega_df)

In [None]:
final_df.to_excel('/Users/aesharpe/Desktop/mega_eia861.xlsx')

In [None]:
final_df.columns.tolist()

## Data Validation


In [None]:
d_val_dict = {
    '_AMI_': ['advanced_metering_infrastructure_eia861', []],
    '_DG_': ['distributed_generation_eia861', ['capacity_mw']],
    '_DP_': ['dynamic_pricing_eia861', []],
    '_DR_': ['demand_response_eia861', ['cost']],
    '_DS_': ['distribution_systems_eia861', []],
    '_DSM_': ['demand_side_management_eia861', ['cost', 'payment']],
    '_EE_': ['energy_efficiency_eia861', []],
    '_GP_': ['green_pricing_eia861', []],
    '_M_': ['mergers_eia861', []],
    '_NM_': ['net_metering_eia861', []],
    '_NNM_': ['non_net_metering_eia861', []],
    '_OD_': ['operational_data_eia861', []],
    '_R_': ['reliability_eia861', []],
    '_S_': ['sales_eia861', []],
    '_ST_': ['service_territory_eia861', []],
    '_UD_': ['utility_data_eia861', []],
}

In [None]:
val_df = final_df.copy()

In [None]:
val_df = val_df.set_index(util_cols)

In [None]:
# Split the mega table back into its original EIA table chunks in prep for comparison
monikers = sorted(d_val_dict.keys())

by_eia_table_dict = {}
for moniker in monikers:
    moniker_cols = [col for col in val_df if col.endswith(moniker)]
    # Slice off the moniker suffix; str.strip() removes characters, not a substring
    non_moniker_cols = [col[:-len(moniker)] for col in moniker_cols]
    moniker_df = val_df[moniker_cols]
    moniker_df.columns = non_moniker_cols
    by_eia_table_dict[moniker] = moniker_df.reset_index()
    
# Delete numeric columns that contain only null values. Note that this also
# removes any reported column that happens to be entirely null, not just the
# made-up columns created by the re-widening process.
for name, table in by_eia_table_dict.items():
    null_cols = []
    for col in table:
        if table[col].dtype == 'float' or table[col].dtype == 'int':
            if table[col].isnull().all():
                null_cols.append(col)
    by_eia_table_dict[name] = table.drop(null_cols, axis=1)

In [None]:
# Prep raw data for comparison
raw_dfs_dict = eia861_raw_dfs.copy()

for df_name, df in raw_dfs_dict.items():
    df = pudl.helpers.fix_eia_na(df)
    df = pudl.helpers.convert_to_date(df)
    raw_dfs_dict[df_name] = df
    
raw_dfs_dict = pudl.helpers.convert_dfs_dict_dtypes(raw_dfs_dict, 'eia')

In [None]:
test = raw_dfs_dict['operational_data_eia861'].copy()
test2 = test[test['utility_id_eia'].isna()]
test2.to_excel('OD_NA.xlsx')

In [None]:
# Reverse order of customer_class and tech_class in raw net/non_net metering tables

tc = pudl.constants.TECH_CLASSES
cc = pudl.constants.CUSTOMER_CLASSES

def swap_col_order(df_name):
    raw_order_cols = raw_dfs_dict[df_name].columns.tolist()
    #test = ['commercial_chp_cogen_customers', '']
    new_order_cols = []
    for col in raw_order_cols:
        for c in cc: 
            if c in col:
                for t in tc:
                    if t in col:
                        col = col.replace(f'{c}_{t}_', f'{t}_{c}_')
        new_order_cols.append(col)
        
    raw_dfs_dict[df_name].columns = new_order_cols

swap_col_order('net_metering_eia861')
swap_col_order('non_net_metering_eia861')
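
The renaming performed by `swap_col_order` comes down to one string replacement per customer/tech class pair. A minimal sketch, using hypothetical class lists rather than the actual contents of `pudl.constants`:

```python
# Hypothetical class lists; the real ones live in pudl.constants.
customer_classes = ["residential", "commercial"]
tech_classes = ["pv", "wind"]


def swap_class_order(col, customer_classes, tech_classes):
    """Rewrite '<customer>_<tech>_...' column names as '<tech>_<customer>_...'."""
    for c in customer_classes:
        for t in tech_classes:
            col = col.replace(f"{c}_{t}_", f"{t}_{c}_")
    return col


print(swap_class_order("residential_pv_capacity_mw", customer_classes, tech_classes))
print(swap_class_order("utility_id_eia", customer_classes, tech_classes))
```

Columns that do not contain a customer/tech pair pass through unchanged.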

In [None]:
# Adapt raw tables to account for data cleaning and manipulation

dr_df = raw_dfs_dict['demand_response_eia861'].copy()
raw_dfs_dict['demand_response_eia861'] = (
    dr_df.drop_duplicates(subset=util_cols+['balancing_authority_code_eia'])
)
dsm_df = raw_dfs_dict['demand_side_management_eia861'].copy()
raw_dfs_dict['demand_side_management_eia861'] = (
    dsm_df.loc[dsm_df['utility_id_eia'] != 88888].copy()
)
nm_df = raw_dfs_dict['net_metering_eia861'].copy()
raw_dfs_dict['net_metering_eia861'] = (
    nm_df.loc[nm_df['utility_id_eia'] != 99999].copy()
)
nnm_df = raw_dfs_dict['non_net_metering_eia861'].copy()
raw_dfs_dict['non_net_metering_eia861'] = (
    nnm_df.loc[nnm_df['utility_id_eia'] != 99999].copy()
)
od_df = raw_dfs_dict['operational_data_eia861'].copy()
raw_dfs_dict['operational_data_eia861'] = (
    od_df.loc[od_df['utility_id_eia'] != 88888].copy()
) #NULLS!
r_df = raw_dfs_dict['reliability_eia861'].copy()
raw_dfs_dict['reliability_eia861'] = (
    r_df.drop_duplicates(subset=util_cols)
)
s_df = raw_dfs_dict['sales_eia861'].copy()
s_df = s_df.drop_duplicates(subset=util_cols + ['balancing_authority_code_eia'])
# Remove the 88888 and 99999 utilities
raw_dfs_dict['sales_eia861'] = (
    s_df.loc[
        (s_df['utility_id_eia'] != 88888)
        & (s_df['utility_id_eia'] != 99999)
    ].copy()
)

In [None]:
# Why is this blank after running the above?
test = raw_dfs_dict['operational_data_eia861'].copy()
test['utility_id_eia'] = test.utility_id_eia.astype('float')
#test.loc[test['utility_id_eia'].isna()]
#test[['utility_id_eia']].sort_values('utility_id_eia')
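
One possible reason rows with a missing `utility_id_eia` disappear after the `!= 88888` filters is the dtype of the ID column. The sketch below assumes the IDs were converted to pandas' nullable `Int64` dtype (not confirmed here); in that case comparing a missing ID yields `<NA>`, and boolean indexing treats `<NA>` as `False`. With a plain float column, `NaN != 88888` is `True` and the row is kept. The IDs used are made up.

```python
import pandas as pd

# Nullable integer IDs: the comparison with a missing ID yields <NA>
ids_nullable = pd.DataFrame(
    {"utility_id_eia": pd.array([101, 88888, None], dtype="Int64")}
)
mask_nullable = ids_nullable["utility_id_eia"] != 88888
kept_nullable = ids_nullable.loc[mask_nullable]

# Float IDs: NaN != 88888 evaluates to True, so the missing ID survives
ids_float = pd.DataFrame({"utility_id_eia": [101.0, 88888.0, float("nan")]})
mask_float = ids_float["utility_id_eia"] != 88888
kept_float = ids_float.loc[mask_float]

print(len(kept_nullable), len(kept_float))
```

If that assumption holds, missing IDs can be kept by adding `| ids["utility_id_eia"].isna()` to the mask.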

In [None]:
def check_against_raw_numeric(df, raw_df, df_name_and_exceptions):
    """Compare numeric columns against their raw counterpart data."""
    logger.info('')
    logger.info(f'checking columns for {df_name_and_exceptions[0]} table')
    
    num_df = df.select_dtypes(include=['int64', 'float']).set_index('utility_id_eia')
    raw_num_df = raw_df.select_dtypes(include=['int64', 'float']).set_index('utility_id_eia')
    
    not_in_raw = [col for col in num_df if col not in raw_num_df]
    not_in_transformed = [col for col in raw_num_df if col not in num_df]
    not_in_transformed = [col for col in not_in_transformed if 'total' not in col] #exclude total cols
    in_both = [col for col in num_df if col in raw_num_df]
    for exception in df_name_and_exceptions[1]:
        in_both = [col for col in in_both if exception not in col]
    
    logger.info(f'     columns not in the raw_df: {not_in_raw}')
    logger.info(f'     columns not in the transformed_df: {not_in_transformed}')
    
    # Check whether the raw column total is the same as the transformed column total
    for col in in_both:
        new_sum = round(num_df[col].sum(skipna=True), 0)
        raw_sum = round(raw_num_df[col].sum(skipna=True), 0)
        if new_sum != raw_sum:
            if raw_sum != round((new_sum/1000), 0):
                logger.info(f'     sum mismatch for col: {col}')
                logger.info(f'     new_sum: {new_sum}, raw_sum: {raw_sum}')
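
The comparison rule used in `check_against_raw_numeric` treats a column as matching when the rounded sums are equal, or when the raw sum equals the transformed sum divided by 1000 (for columns converted from thousands to ones). A minimal sketch of that rule, with a hypothetical helper name and made-up values:

```python
import pandas as pd


def sums_match(new_col, raw_col):
    """Return True if the column sums agree, allowing for a thousands-to-ones conversion."""
    new_sum = round(new_col.sum(skipna=True), 0)
    raw_sum = round(raw_col.sum(skipna=True), 0)
    return bool(new_sum == raw_sum or raw_sum == round(new_sum / 1000, 0))


raw_cost = pd.Series([1.5, 2.5, None])   # hypothetical cost column, in thousands
new_cost = raw_cost * 1000               # same column converted to ones

raw_sales = pd.Series([10.0, 20.0])
new_sales = pd.Series([10.0, 25.0])      # does not agree with the raw values

print(sums_match(new_cost, raw_cost))
print(sums_match(new_sales, raw_sales))
```

Because only column totals are compared, offsetting errors within a column would go unnoticed by this check.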

In [None]:
for moniker, df_name_and_exceptions in d_val_dict.items():
    check_against_raw_numeric(
        by_eia_table_dict[moniker],
        raw_dfs_dict[df_name_and_exceptions[0]],
        df_name_and_exceptions
    )

### Explanation of data transformations:

**DG**: percentage columns changed into MW, so sums will differ.

**DR**: cost columns converted from thousands to ones; dropped duplicates.

**DSM**: cost and payment columns converted from thousands to ones; removed 88888 utilities.

**NM**: removed 99999 utilities. The reconstruction adds extra columns that are all NaN; they can be deleted but don't impact the sums.

**NNM**: removed 99999 utilities; *had to fix issue with capacity_mw merge deleting y vs. x*.

**OD**: removed 88888 utilities; **removed utilities with NA for eia_id**.

**R**: dropped duplicates.

**S**: removed 99999 and 88888 utilities; dropped duplicates; revenue columns converted from thousands to ones.

In [None]:
test = raw_dfs_dict['operational_data_eia861']
test = test.reset_index()
test[test['utility_id_eia'].isnull()]

## Other Data Wrangling

Once all of the data is loaded and looks like it's in good shape, do any initial wrangling that's specific to this particular analysis. This should mostly make use of the higher level functions which were defined above. If this step takes a while, don't be shy about producing `logging` outputs.

In [None]:
# eia_transformed_dfs is not defined in this notebook; use the loaded output tables
test = eia861_dfs['demand_side_management_misc_eia861'].copy()

In [None]:
test['dup'] = test.duplicated(subset=['utility_id_eia', 'state', 'report_date'])
test.sort_values('dup', ascending=False)

print(len(test.groupby(['utility_id_eia', 'state', 'report_date'])))
print(len(test.groupby(['utility_id_eia', 'state', 'report_date', 'nerc_region'])))

### Data Validation Test with Pandera

In [None]:
# Z-score is not a good measure because utilities are not all uniform in size.

# eia_transformed_dfs is not defined in this notebook; use the loaded output tables
df = eia861_dfs['advanced_metering_infrastructure_eia861'].copy()
df['advanced_metering_infrastructure'] = df['advanced_metering_infrastructure'].fillna(0)
df['automated_meter_reading'] = df['automated_meter_reading'].fillna(0)
df['non_amr_ami'] = df['non_amr_ami'].fillna(0)
df['total_meters'] = df['total_meters'].fillna(0)

df = df.assign(
    summ=lambda x: (
        x.advanced_metering_infrastructure 
        + x.automated_meter_reading 
        + x.non_amr_ami),
    same=lambda x: x.summ == x.total_meters
)

df[(~df['same']) & (df['total_meters'] != 0)]