# Integrating a New Year of FERC Form 1
* Every September / October we integrate a new year of FERC Form 1 data.
* This notebook contains some tools to help with that somewhat manual process.

## Before you start!
You will need:
* An up-to-date FERC Form 1 database with all years of available data in it (**including the new year**).
* An up-to-date PUDL database with all years of available EIA data in it (**including the new year**).
* An up-to-date PUDL database with all years of available FERC Form 1 data in it (**NOT** including the new year).

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
import re
import pandas as pd
import sqlalchemy as sa
import pudl
import dbfread
import pathlib
import pudl.constants as pc

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
sns.set()
%matplotlib inline

In [None]:
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 100
pd.options.display.max_rows = 200

In [None]:
pudl_settings = pudl.workspace.setup.get_defaults()
ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
pudl_settings

## Generate new Row Maps
* The FERC 1 Row Maps function similarly to the xlsx_maps that we use to track which columns contain what data across years in the EIA spreadsheets.
* In many FERC 1 tables, a particular piece of reported data is associated not only with a named column in the database, but also with the "row" on which it was reported.
* For instance, in the Plant in Service table, a column might contain "additions" to plant in service, while each numbered row corresponds to an individual FERC Account to which value was added.
* However, from year to year which row corresponds to which value (e.g. which FERC account) changes, as new rows are added, or obsolete rows are removed.
* To keep all this straight, we look at the "row literals" -- the labels that are associated with each row number -- year by year.
* Any time the row literals change between years, we compare the tables for those two adjacent years to see if the row numbers associated with a given piece of data have actually changed.
* However, many tables are not organized this way, and in those that are, the rows stay the same in most years.
* The row maps are stored in CSVs, under `src/pudl/package_data/meta/ferc1_row_maps`
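
The year-by-year comparison described above can be sketched without a database. This is a minimal, dependency-free illustration, not the notebook's pandas implementation: the `ROW_LITERALS` data and the helper names `compare_literals` and `years_to_check` are hypothetical, and the labels are invented rather than taken from a real filing.

```python
# Toy row literals keyed by (row_number, row_seq). The labels are made up
# for illustration and are not taken from any real FERC Form 1 filing.
ROW_LITERALS = {
    2019: {(1, 1): "Organization", (2, 1): "Franchises", (3, 1): "Total Intangible Plant"},
    2020: {(1, 1): "Organization", (2, 1): "Franchises", (3, 1): "Total Intangible Plant"},
    2021: {
        (1, 1): "Organization",
        (2, 1): "Franchises",
        (3, 1): "Software",
        (4, 1): "Total Intangible Plant",
    },
}


def compare_literals(old, new):
    """Outer-join two years of row literals and flag rows whose labels differ."""
    keys = sorted(set(old) | set(new))
    return [
        {"key": k, "old": old.get(k), "new": new.get(k), "match": old.get(k) == new.get(k)}
        for k in keys
    ]


def years_to_check(literals_by_year):
    """Return (old_year, new_year) pairs whose row literals are not identical."""
    years = sorted(literals_by_year)
    return [
        (old, new)
        for old, new in zip(years, years[1:])
        if not all(
            row["match"]
            for row in compare_literals(literals_by_year[old], literals_by_year[new])
        )
    ]


changed = years_to_check(ROW_LITERALS)
for old, new in changed:
    print(f"CHECK: {old} vs. {new}")
    for row in compare_literals(ROW_LITERALS[old], ROW_LITERALS[new]):
        if not row["match"]:
            print("   ", row)
```

In this toy data a new row is inserted in 2021, so every row after it shifts down by one. That is the situation that requires updating a row map.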

In [None]:
def get_row_literals(table_name, report_year, ferc1_engine):
    row_literals = (
        pd.read_sql("f1_row_lit_tbl", ferc1_engine)
        .query(f"sched_table_name=='{table_name}'")
        .query(f"report_year=={report_year}")
        .sort_values("row_number")
    )
    return row_literals

def compare_row_literals(table_name, old_year, new_year, ferc1_engine):
    idx_cols = ["row_number", "row_seq"]
    old_df = get_row_literals(table_name, old_year, ferc1_engine).drop(columns=["row_status", "sched_table_name"])
    new_df = get_row_literals(table_name, new_year, ferc1_engine).drop(columns=["row_status", "sched_table_name"])
    merged_df = (
        pd.merge(old_df, new_df, on=idx_cols, suffixes=["_old", "_new"], how="outer")
        .set_index(idx_cols)
    )
    merged_df = (
        merged_df.loc[:, merged_df.columns.sort_values()]
        .assign(match=lambda x: x.row_literal_new == x.row_literal_old)
    )
    return merged_df 

def check_all_row_years(table_name, ferc1_engine):
    years = list(range(min(pc.WORKING_PARTITIONS["ferc1"]["years"]), max(pc.WORKING_PARTITIONS["ferc1"]["years"])))
    years.sort()
    for old_year in years:
        compared = compare_row_literals(table_name, old_year, old_year+1, ferc1_engine)
        if not compared.match.all():
            logger.error(f"  * CHECK: {old_year} vs. {old_year+1}")

In [None]:
recent_year_comparison = compare_row_literals("f1_plant_in_srvce", max(pc.WORKING_PARTITIONS["ferc1"]["years"]) - 1, max(pc.WORKING_PARTITIONS["ferc1"]["years"]), ferc1_engine)
      
unmatched_recent_rows = recent_year_comparison[~recent_year_comparison.match]
if len(unmatched_recent_rows) > 0:
    print("HEY!... check most recent row mappings!")
    display(unmatched_recent_rows)
else:
    print("Recent row mappings look consistent. No need to change anything.")

In [None]:
row_mapped_tables = [
    "f1_dacs_epda",         # Depreciation data.
    # "f1_edcfu_epda",      # Additional depreciation data. Not yet row mapped
    # "f1_acb_epda",        # Additional depreciation data. Not yet row mapped
    "f1_elc_op_mnt_expn",   # Electrical operating & maintenance expenses.
    "f1_elctrc_oper_rev",   # Electrical operating revenues.
    # "f1_elc_oper_rev_nb", # Additional electric operating revenues. One-line table. Not yet row mapped.
    "f1_income_stmnt",      # Utility income statements.
    # "f1_incm_stmnt_2",    # Additional income statement info. Not yet row mapped.
    "f1_plant_in_srvce",    # Utility plant in service, by FERC account number.
    "f1_sales_by_sched",    # Electricity sales by rate schedule -- it's a mess.
]

row_mapped_dfs = {t: pd.read_sql(t, ferc1_engine) for t in row_mapped_tables}
for tbl in row_mapped_tables:
    print(f"{tbl}:")
    check_all_row_years(tbl, ferc1_engine)
    print("\n", end="")

In [None]:
compare_row_literals("f1_plant_in_srvce", max(pc.DATA_YEARS["ferc1"]) - 1, max(pc.DATA_YEARS["ferc1"]), ferc1_engine)

## Identify Missing Respondents
* Some FERC 1 respondents appear in the data tables, but not in the `f1_respondent_id` table.
* During the database cloning process we create dummy entries for these respondents to ensure database integrity.
* Some of these missing respondents can be identified based on the data they report.
* For instance, `respondent_id==519` reports two plants in the `f1_steam` table, named "Kuester" & "Mihm".
* Searching for those plant names in the EIA 860 data (and Google) reveals those plants are owned by Upper Michigan Energy Resources Company (`utility_id_eia==61029`).
* These "PUDL Determined" respondent names are stored in the `pudl.extract.ferc1.PUDL_RIDS` dictionary, and used to populate the `f1_respondent_id` table when they're available.
* However, since many plants are owned by multiple utilities, we need to identify a utility that matches *all* of the reported plant names, hopefully uniquely.
* The following functions help us identify that kind of utility.
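
The matching idea can be shown on toy data before looking at the real function. This is a sketch under stated assumptions: `PLANT_UTILS` and `utils_matching_all` are hypothetical names, and the plant names and utility IDs are invented rather than drawn from the EIA 860.

```python
import re

# Hypothetical plant-to-utility associations (owners and operators combined).
# Plant names and IDs are invented for illustration, not taken from EIA 860.
PLANT_UTILS = [
    ("Alpha Station", 101),
    ("Alpha Station", 102),
    ("Bravo Creek", 101),
    ("Bravo Creek", 103),
    ("Charlie Ridge", 101),
    ("Charlie Ridge", 102),
]


def utils_matching_all(patterns, plant_utils):
    """Return utility IDs associated with at least one plant per pattern."""
    per_pattern = []
    for pat in patterns:
        regex = re.compile(pat, flags=re.IGNORECASE)
        # re.match anchors at the start of the name, as pandas' str.match does.
        per_pattern.append({uid for name, uid in plant_utils if regex.match(name)})
    # Intersect so that only utilities tied to every pattern survive.
    return set.intersection(*per_pattern) if per_pattern else set()


unique_hit = utils_matching_all([".*alpha.*", ".*bravo.*"], PLANT_UTILS)
ambiguous_hit = utils_matching_all([".*alpha.*", ".*charlie.*"], PLANT_UTILS)
print(unique_hit)
print(ambiguous_hit)
```

The first call narrows the candidates to a single utility. The second leaves two candidates, which means another plant name pattern is needed.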

In [None]:
def get_util_from_plants(pudl_out, patterns, display=False):
    """
    Find any utilities associated with a list of patterns for matching plant names.
    
    Args:
        pudl_out (pudl.output.pudltabl.PudlTabl): A PUDL Output Object.
        patterns (iterable of str): Collection of patterns with which to match
            the names of power plants in the EIA 860. E.g. ".*Craig.*".
        display (bool): Whether or not to display matching records for
            debugging and refinement purposes.

    Returns:
        pandas.DataFrame: All records from the utilities_eia860 table
        pertaining to the utilities identified as being associated with
        plants that matched all of the patterns.

    """
    own_eia860 = pudl_out.own_eia860()
    plants_eia860 = pudl_out.plants_eia860()
    
    util_ids = []
    for pat in patterns:
        owners_df = own_eia860[own_eia860.plant_name_eia.fillna("").str.match(pat, case=False)]
        plants_df = plants_eia860[plants_eia860.plant_name_eia.fillna("").str.match(pat, case=False)]
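        if display:
            # The `display` argument shadows IPython's display(), so import it under another name.
            from IPython.display import display as ipy_display
            print(f"Pattern: \"{pat}\"")
            ipy_display(owners_df)
            ipy_display(plants_df)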
        util_ids.append(set.union(set(owners_df.owner_utility_id_eia), set(plants_df.utility_id_eia)))
        
    util_ids = set.intersection(*util_ids)
    utils_eia860 = pudl_out.utils_eia860()

    return utils_eia860[utils_eia860.utility_id_eia.isin(util_ids)]


### Missing Respondents
* This will show all of the respondents that have not yet been identified.
* You can then use these respondent IDs to search through other tables for identifying information
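
One way to search other tables for a respondent is sketched below against an in-memory SQLite stand-in. The helper `find_plant_names`, the respondent ID `900`, the `plant_name` column, and all of the rows are assumptions made for illustration. The real database's schema may differ.

```python
import sqlite3

# In-memory stand-in for the FERC Form 1 database. The table contents are
# invented; only the idea of scanning tables by respondent_id is the point.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE f1_steam (respondent_id INTEGER, report_year INTEGER, plant_name TEXT);
    CREATE TABLE f1_fuel (respondent_id INTEGER, report_year INTEGER, plant_name TEXT);
    INSERT INTO f1_steam VALUES (900, 2020, 'Example Plant A'), (12, 2020, 'Other Plant');
    INSERT INTO f1_fuel VALUES (900, 2020, 'Example Plant A'), (900, 2020, 'Example Plant B');
    """
)


def find_plant_names(conn, tables, respondent_id):
    """Collect distinct plant names reported by one respondent, per table."""
    found = {}
    for table in tables:
        # Table names come from a trusted, hard-coded list; the ID is bound as a parameter.
        rows = conn.execute(
            f"SELECT DISTINCT plant_name FROM {table} "
            "WHERE respondent_id = ? ORDER BY plant_name",
            (respondent_id,),
        ).fetchall()
        found[table] = [name for (name,) in rows]
    return found


names = find_plant_names(conn, ["f1_steam", "f1_fuel"], 900)
print(names)
```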

In [None]:
f1_respondent_id = pd.read_sql("f1_respondent_id", ferc1_engine)
missing_respondent_ids = f1_respondent_id[f1_respondent_id.respondent_name.str.contains("Missing Respondent")].respondent_id.unique()
missing_respondent_ids

### Utility identification example using Plants
* Let's use `respondent_id==529`, which was identified as Tri-State Generation & Transmission in 2019.
* Searching for that `respondent_id` in all of the plant-related tables we find the following plants:

In [None]:
(
    pudl.glue.ferc1_eia.get_db_plants_ferc1(pudl_settings, years=pc.DATA_YEARS["ferc1"])
    .query("utility_id_ferc1==529")
)

### Create a list of patterns based on plant names
* Pretend this respondent hadn't already been identified
* Generate a list of plant name patterns based on what we see here
* Use the above function `get_util_from_plants` to identify candidate utilities involved with those plants in the EIA data.
* Note that the list of patterns doesn't need to be exhaustive -- just enough to narrow down to a single utility.

In [None]:
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine)

In [None]:
get_util_from_plants(
    pudl_out,
    patterns=[
    ".*laramie.*",
    ".*craig.*",
    ".*escalante.*",
])

### Another example with `respondent_id==519`

In [None]:
(
    pudl.glue.ferc1_eia.get_db_plants_ferc1(pudl_settings, years=pc.DATA_YEARS["ferc1"])
    .query("utility_id_ferc1==519")
)

In [None]:
get_util_from_plants(
    pudl_out,
    patterns=[
    ".*kuester.*",
    ".*mihm.*",
])

### And again with `respondent_id==531`

In [None]:
(
    pudl.glue.ferc1_eia.get_db_plants_ferc1(pudl_settings, years=pc.DATA_YEARS["ferc1"])
    .query("utility_id_ferc1==531")
)

In [None]:
get_util_from_plants(
    pudl_out,
    patterns = [
    ".*leland.*",
    ".*antelope.*",
    ".*dry fork.*",
    ".*laramie.*",
])

### What about missing respondents in the Plant in Service table?
* There are a couple of years' worth of plant in service data associated with unidentified respondents.
* Unfortunately the plant in service table doesn't have a lot of identifying information.
* The same is true of the `f1_dacs_epda` depreciation table

In [None]:
f1_plant_in_srvce = pd.read_sql_table("f1_plant_in_srvce", ferc1_engine)
f1_plant_in_srvce[f1_plant_in_srvce.respondent_id.isin(missing_respondent_ids)]

## Identify new strings for cleaning
* Several FERC 1 fields contain freeform strings that should have a controlled vocabulary imposed on them.
* The cell below helps identify new, unrecognized strings in those fields each year.
* Use regular expressions to identify collections of new, related strings, and add them to the appropriate string cleaning dictionary entry in `pudl.transform.ferc1`.
* Then re-run the cell with new search terms, until everything left is impossible to confidently categorize.
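
The loop of finding leftovers, grouping them with a regular expression, extending the dictionary, and re-running can be sketched on toy data. Everything here is hypothetical: `FUEL_STRINGS_TOY` only mimics the general shape of a cleaning dictionary, and `find_unmapped` is an illustrative helper, not the implementation of `pudl.helpers.find_new_ferc1_strings`.

```python
import re

# Hypothetical cleaning dictionary: canonical category -> known raw spellings.
# It only mimics the general shape of such a mapping; it is not a real PUDL dictionary.
FUEL_STRINGS_TOY = {
    "coal": ["coal", "coal-subbit", "bituminous"],
    "gas": ["gas", "natural gas", "nat gas"],
}

# Raw values as they might appear in a free-form field.
raw_values = ["Coal", " nat gas ", "lignite coal", "ng", "coal (lig)", "gas", "wood chips"]


def find_unmapped(values, strdict):
    """Return normalized strings that no category in strdict recognizes."""
    known = {s for strings in strdict.values() for s in strings}
    normalized = {v.strip().lower() for v in values}
    return normalized - known


unmapped = find_unmapped(raw_values, FUEL_STRINGS_TOY)

# Group related leftovers with a regular expression before extending the dictionary.
coal_like = sorted(s for s in unmapped if re.search(r"coal|lig", s))
FUEL_STRINGS_TOY["coal"].extend(coal_like)

# Re-run: whatever remains still needs a human decision.
remaining = find_unmapped(raw_values, FUEL_STRINGS_TOY)
print(sorted(unmapped))
print(sorted(remaining))
```

Strings such as "ng" are deliberately left unmapped in this sketch, since a two-letter abbreviation is hard to categorize with confidence.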

In [None]:
clean_me = [
    {"table": "f1_fuel",  "field": "fuel",       "strdict": pudl.transform.ferc1.FUEL_STRINGS},
    {"table": "f1_fuel",  "field": "fuel_unit",  "strdict": pudl.transform.ferc1.FUEL_UNIT_STRINGS},
    {"table": "f1_steam", "field": "plant_kind", "strdict": pudl.transform.ferc1.PLANT_KIND_STRINGS},
    {"table": "f1_steam", "field": "type_const", "strdict": pudl.transform.ferc1.CONSTRUCTION_TYPE_STRINGS},
]

for kwargs in clean_me:
    unmapped_strings = pudl.helpers.find_new_ferc1_strings(ferc1_engine=ferc1_engine, **kwargs)
    print(f"{len(unmapped_strings)} unmapped {kwargs['field']} strings found.")
    if unmapped_strings:
        display(unmapped_strings)