# Integrating New FERC Form 1 and EIA Data Releases
This notebook generates lists of new plants and utilities that need to be assigned PUDL IDs. It helps with the process of integrating new data each fall when the agencies make their new annual release for the previous year. In addition, it has some functions that assist in the manual categorization of new freeform strings within the FERC Form 1 database, describing fuel types, fuel units, plant types, etc.

## Prerequisites:
* All available EIA 860/923 years must be loaded into your PUDL DB.
  * This includes the **new** year of data to be integrated.
  * This means the spreadsheet tab maps need to be updated.
  * Some minor EIA data wrangling may also be required.
* All years of FERC Form 1 data must be loaded into your FERC 1 DB.
  * This includes the **new** year of data to be integrated.

## Outputs:
* `unmapped_utilities_ferc1.csv`: Respondent IDs and respondent names of utilities which appear in the FERC Form 1 DB, but which do **not** appear in the PUDL ID mapping spreadsheet.
* `unmapped_plants_ferc1.csv`: Plant names, respondent names, and respondent IDs associated with plants that appear in the FERC Form 1 DB, but which do **not** appear in the PUDL ID Mapping spreadsheet.
* `unmapped_utilities_eia.csv`: EIA Utility IDs and names of utilities which appear in the PUDL DB, but which do **not** appear in the PUDL ID mapping spreadsheet.
* `unmapped_plants_eia.csv`: EIA Plant IDs and Plant Names of plants which appear in the PUDL DB, but which do **not** appear in the PUDL ID mapping spreadsheet. The Utility ID and Name for the primary plant operator, as well as the aggregate plant capacity and the state the plant is located in, are also provided to aid in PUDL ID mapping.
* `lost_utilities_eia.csv`: The Utility IDs and Names of utilities which appear in the PUDL ID mapping spreadsheet but which do **not** appear in the PUDL DB. This is likely because EIA revised previous years of data and removed those utilities after we had mapped them.
* `lost_plants_eia.csv`: The Plant IDs and Names of plants which appear in the PUDL ID mapping spreadsheet but which do **not** appear in the PUDL DB. This is likely because EIA revised previous years of data and removed those plants after we had mapped them.
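
Conceptually, the "unmapped" and "lost" outputs are two set differences between the IDs found in a database and the IDs found in the mapping spreadsheet. The sketch below illustrates that idea with made-up IDs. The helper name `split_unmapped_and_lost` is hypothetical, and this is not how `pudl.glue.ferc1_eia` is actually implemented.

```python
def split_unmapped_and_lost(db_ids, mapped_ids):
    """Compare IDs found in the database with IDs in the mapping spreadsheet.

    Hypothetical helper, for illustration only.
    """
    db_ids, mapped_ids = set(db_ids), set(mapped_ids)
    unmapped = sorted(db_ids - mapped_ids)  # in the DB, but not yet mapped
    lost = sorted(mapped_ids - db_ids)      # mapped, but no longer in the DB
    return unmapped, lost


# Made-up plant IDs, not real EIA data.
unmapped, lost = split_unmapped_and_lost(db_ids=[1, 2, 3, 5], mapped_ids=[2, 3, 4])
print(unmapped)  # [1, 5]
print(lost)      # [4]
```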

In [1]:
%load_ext autoreload
%autoreload 2
import sqlalchemy as sa
import pandas as pd
import pudl
import re
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_settings

{'pudl_in': '/home/zane/code/catalyst/pudl-work',
 'data_dir': '/home/zane/code/catalyst/pudl-work/data',
 'settings_dir': '/home/zane/code/catalyst/pudl-work/settings',
 'pudl_out': '/home/zane/code/catalyst/pudl-work',
 'sqlite_dir': '/home/zane/code/catalyst/pudl-work/sqlite',
 'parquet_dir': '/home/zane/code/catalyst/pudl-work/parquet',
 'datapackage_dir': '/home/zane/code/catalyst/pudl-work/datapackage',
 'notebook_dir': '/home/zane/code/catalyst/pudl-work/notebook',
 'ferc1_db': 'sqlite:////home/zane/code/catalyst/pudl-work/sqlite/ferc1.sqlite',
 'pudl_db': 'sqlite:////home/zane/code/catalyst/pudl-work/sqlite/pudl.sqlite'}

## Setup:
* Create FERC1/PUDL database connections
* Set the scope of the FERC Form 1 search (which years to check)

In [2]:
ferc1_engine = sa.create_engine(pudl_settings["ferc1_db"])
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
ferc1_years = pudl.constants.data_years["ferc1"]
print("Searching for new FERC 1 plants, utilities and strings in the following years:")
print(ferc1_years)

Searching for new FERC 1 plants, utilities and strings in the following years:
(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018)


## Unmapped FERC Form 1 Plants

In [3]:
unmapped_plants_ferc1 = pudl.glue.ferc1_eia.get_unmapped_plants_ferc1(pudl_settings, years=ferc1_years)
n_ferc1_unmapped_plants = len(unmapped_plants_ferc1)
print(f"{n_ferc1_unmapped_plants} unmapped FERC 1 plants found in {min(ferc1_years)}-{max(ferc1_years)}.")
if n_ferc1_unmapped_plants > 0:
    unmapped_plants_ferc1_outfile = "unmapped_plants_ferc1.csv"
    print(f"Writing {n_ferc1_unmapped_plants} out to {unmapped_plants_ferc1_outfile}")
    unmapped_plants_ferc1.to_csv(unmapped_plants_ferc1_outfile, index=False)
unmapped_plants_ferc1

3180 unmapped FERC 1 plants found in 1994-2018.
Writing 3180 out to unmapped_plants_ferc1.csv


Unnamed: 0,utility_id_ferc1,plant_name_ferc1,utility_name_ferc1,capacity_mw,plant_table
0,1,rockport,aep generating company,1300.0,f1_steam
1,1,rockport total plt,aep generating company,2600.0,f1_steam
2,1,rockport u1 aeg,aep generating company,650.0,f1_steam
3,1,rockport u2 aeg,aep generating company,650.0,f1_steam
4,1,rockport unit 1,aep generating company,650.0,f1_steam
...,...,...,...,...,...
3175,432,pueblo diesels,"black hills/colorado electric utility company, lp",10.0,f1_steam
3176,432,rocky ford diesels,"black hills/colorado electric utility company, lp",10.0,f1_steam
3177,454,little gypsy 2 & 3,"entergy louisiana, llc",1002.9,f1_steam
3178,454,ninemile point 4 & 5,"entergy louisiana, llc",1790.2,f1_steam


## Unmapped FERC Form 1 Utilities / Respondents
* **Note:** Frequently there are zero of these.

In [4]:
unmapped_utils_ferc1 = pudl.glue.ferc1_eia.get_unmapped_utils_ferc1(pudl_settings, years=ferc1_years)
n_ferc1_unmapped_utils = len(unmapped_utils_ferc1)
print(f"{n_ferc1_unmapped_utils} unmapped FERC 1 utilities found in {min(ferc1_years)}-{max(ferc1_years)}.")
if n_ferc1_unmapped_utils > 0:
    unmapped_utils_ferc1_outfile = "unmapped_utilities_ferc1.csv"
    print(f"Writing {n_ferc1_unmapped_utils} out to {unmapped_utils_ferc1_outfile}")
    unmapped_utils_ferc1.to_csv(unmapped_utils_ferc1_outfile, index=False)
unmapped_utils_ferc1

0 unmapped FERC 1 utilities found in 1994-2018.


Unnamed: 0,utility_id_ferc1,utility_name_ferc1


## Unmapped EIA Plants
* **Note:** Some unmapped EIA plants do not have utilities associated with them.
* Many of these plants are too small to warrant mapping, so capacity is included as a potential filter.
* The first and last few plants in the output dataframe have many NA values, which can be confusing.
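
As a rough illustration of using capacity as a filter, the sketch below splits a small made-up table, shaped like the unmapped EIA plants output, into plants above a cutoff and plants with no reported capacity. The cutoff `MIN_CAPACITY_MW` is a hypothetical value chosen for the example, not a project rule.

```python
import pandas as pd

# Made-up rows shaped like the unmapped EIA plants table.
plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 2, 3, 4],
        "plant_name_eia": ["big wind", "tiny solar", "no capacity reported", "mid solar"],
        "capacity_mw": [351.8, 0.7, None, 32.0],
    }
)

MIN_CAPACITY_MW = 5.0  # hypothetical cutoff; choose your own

# Comparisons against NaN evaluate to False, so plants with no
# reported capacity are excluded here and collected separately.
worth_mapping = plants[plants["capacity_mw"] >= MIN_CAPACITY_MW]
needs_review = plants[plants["capacity_mw"].isna()]

print(worth_mapping["plant_id_eia"].tolist())
print(needs_review["plant_id_eia"].tolist())
```

Keeping the NA-capacity plants in a separate table avoids silently discarding records that may simply be missing a utility or capacity value.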

In [5]:
unmapped_plants_eia = pudl.glue.ferc1_eia.get_unmapped_plants_eia(pudl_engine)
print(f"Found {len(unmapped_plants_eia)} unmapped EIA plants.")
unmapped_plants_eia.to_csv("unmapped_plants_eia.csv")
unmapped_plants_eia.sample(10)

Found 1531 unmapped EIA plants.


Unnamed: 0,plant_id_eia,plant_name_eia,utility_id_eia,utility_name,state,capacity_mw
262,61686,johnson 1 community solar,61287.0,johnson i csg llc,MN,2.0
1408,62874,acorn i energy storage llc,62747.0,acorn i energy storage llc,CA,2.0
569,61992,pisces community solar garden,61585.0,pisces community solar garden llc,MN,0.7
1,334,riverside canal power company,,,CA,
45,61462,sr platte solar farm,61081.0,sr platte,CO,32.0
121,61541,town of rocky hill,60947.0,tesla inc.,CT,6.0
1417,62883,telfair thompson,62715.0,westbound solar llc,GA,1.9
1152,62600,cookstown,62098.0,"cookstown solar farm, llc",NC,5.0
1376,62837,frontier windpower ii,62720.0,"frontier windpower ii, llc",OK,351.8
926,62360,uss dvl solar csg,61883.0,uss dvl solar csg,MN,1.0


## Lost EIA Plants
* There shouldn't be very many of these. If there are more than a few hundred (out of the ~10,000 EIA plants), something may be wrong.
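
One way to make this rule of thumb explicit is a small check on the share of lost records. This is only a sketch: the function name `lost_fraction_ok` and the 5% default threshold are assumptions made for illustration, loosely standing in for "a few hundred out of ~10,000".

```python
def lost_fraction_ok(n_lost, n_total, max_fraction=0.05):
    """Return True if lost records are a plausibly small share of all records.

    Hypothetical helper; max_fraction is an assumed threshold.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_lost / n_total <= max_fraction


print(lost_fraction_ok(167, 10_000))    # True
print(lost_fraction_ok(2_500, 10_000))  # False
```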

In [6]:
lost_plants_eia = pudl.glue.ferc1_eia.get_lost_plants_eia(pudl_engine)
print(f"Found {len(lost_plants_eia)} lost EIA plants.")
lost_plants_eia.to_csv("lost_plants_eia.csv")
lost_plants_eia.sample(10)

Found 167 lost EIA plants.


plant_id_eia,plant_name_eia
54359,ppg place
54061,minnesota wood products
54664,pekin paperboard
54971,lone star steel
7412,john harmon gen
54828,3200 wildwood plaza
56175,phoenix wind power llc
54208,hidalgo smelter
55038,westroads shopping center
10097,crozer chester medical center


## Unmapped EIA Utilities
* Especially with the advent of many small distributed generators, there are often just as many new utilities as there are new plants.

In [7]:
unmapped_utils_eia = pudl.glue.ferc1_eia.get_unmapped_utils_eia(pudl_engine)
print(f"Found {len(unmapped_utils_eia)} unmapped EIA utilities.")
unmapped_utils_eia.to_csv("unmapped_utilities_eia.csv")
unmapped_utils_eia.sample(10)

Found 1460 unmapped EIA utilities.


utility_id_eia,utility_name
61977,acb energy partners llc
61856,turkey hill solar i
61648,"wed green hill, llc"
61814,sandifer solar
61697,"hillcrest solar i, llc"
61094,nyu langone health
61445,"sp solar 6, llc"
61872,gavilan district college solar project
60541,ameresco glendale road solar pv llc
57275,metropolitan transportation authority


## Lost EIA Utilities
* Again, there shouldn't be **too** many of these. If it's thousands, not hundreds, dig deeper.

In [8]:
lost_utils_eia = pudl.glue.ferc1_eia.get_lost_utils_eia(pudl_engine)
print(f"Found {len(lost_utils_eia)} lost EIA utilities.")
lost_utils_eia.to_csv("lost_utilities_eia.csv")
lost_utils_eia.sample(10)

Found 287 lost EIA utilities.


utility_id_eia,utility_name_eia
20281,west pharmaceutical services
50137,"u.s. dept of the interior, bureau of rec"
5463,duraco products inc
54702,navasota wharton energy partners lp
1311,bassett furniture industries inc
49875,b&k energy systems llc
6762,city of fredonia
55883,k&d energy llc
431,alyeska seafoods inc
1687,bio-energy partners


## FERC Form 1 String Cleaning Dictionaries
* We categorize several important but messy free-form FERC Form 1 fields to enable analysis.
* Every year any new entries (abbreviations, misspellings, etc.) in these fields need to be categorized.
* This cell helps with that interactive process, generating the list of newly discovered strings.
* Use regular expressions to identify collections of new, related strings, and add them to the appropriate string cleaning dictionary entry in `pudl.constants`. Then re-run the cell with new search terms until everything left is impossible to categorize confidently.
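
The loop described above can be sketched in isolation as follows. The example strings and the helper `select_strings` are made up, and `strdict` is a hypothetical stand-in for one of the dictionaries in `pudl.constants`, assumed here to map each category to a list of raw strings. The sketch uses `re.search`, which matches anywhere in the string, much like wrapping a pattern in `.*` with `re.match`.

```python
import re

# Made-up examples of uncategorized strings; not taken from the FERC 1 DB.
unmapped_strings = [
    "semi-outdoor", "semioutdoor", "semi outdr", "outdoor blr", "conventional",
]


def select_strings(strings, include, exclude=()):
    """Return strings matching `include` and none of the `exclude` patterns."""
    return [
        s for s in strings
        if re.search(include, s)
        and not any(re.search(pat, s) for pat in exclude)
    ]


candidates = select_strings(unmapped_strings, include=r"semi", exclude=[r"^conv"])
print(candidates)

# Hypothetical stand-in for a string cleaning dictionary: category -> raw strings.
strdict = {"semioutdoor": ["semi-outdoor"]}
strdict["semioutdoor"] = sorted(set(strdict["semioutdoor"]) | set(candidates))

# After updating the dictionary, whatever is left still needs categorizing.
remaining = [s for s in unmapped_strings if s not in strdict["semioutdoor"]]
print(remaining)
```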

In [9]:
def new_ferc1_strings(table, field, start_year, end_year, ferc1_engine, strdict):
    """Return cleaned values of `field` in `table` that no category in `strdict` covers."""
    # Only the one column of interest is needed from the FERC 1 DB.
    all_strings = pd.read_sql(
        f"SELECT {field} FROM {table} "
        f"WHERE report_year>={start_year} AND report_year<={end_year};",
        ferc1_engine,
    )
    all_strings = (
        all_strings.
        pipe(pudl.helpers.strip_lower, columns=[field])[field].
        unique()
    )
    # Flatten every category's list of known strings into one set for fast lookup.
    old_strings = {s for strings in strdict.values() for s in strings}
    return [s for s in all_strings if s not in old_strings]

clean_me = {
    "fuel": {
        "table": "f1_fuel",
        "field": "fuel",
        "strdict": pudl.constants.ferc1_fuel_strings,
    },
    "fuel_unit": {
        "table": "f1_fuel",
        "field": "fuel_unit",
        "strdict": pudl.constants.ferc1_fuel_unit_strings,
    },
    "plant_kind": {
        "table": "f1_steam",
        "field": "plant_kind",
        "strdict": pudl.constants.ferc1_plant_kind_strings,
    },
    "type_const": {
        "table": "f1_steam",
        "field": "type_const",
        "strdict": pudl.constants.ferc1_const_type_strings,
    },
}

# This must be one of the keys in the above dictionary....
type_of_string = "fuel"

unmapped_strings = new_ferc1_strings(
    start_year=min(ferc1_years),
    end_year=max(ferc1_years),
    ferc1_engine=ferc1_engine,
    **clean_me[type_of_string])

n_unmapped_strings = len(unmapped_strings)
print(f"{n_unmapped_strings} unmapped {type_of_string} strings found.")
# These numbers represent how many unmapped strings remained as of 2019-10-03
if type_of_string == "fuel":
    assert n_unmapped_strings == 0
elif type_of_string == "fuel_unit":
    assert n_unmapped_strings <= 80
elif type_of_string == "plant_kind":
    assert n_unmapped_strings <= 64
elif type_of_string == "type_const":
    assert n_unmapped_strings <= 94
else:
    assert False

# Choose your own collection of search strings here to match the
# category you're attempting to identify. For all except the "fuel"
# string type, there will be leftovers that can't be categorized
# because they're a mess. In the "fuel" type we use "other" so that
# we don't lose the fuel records altogether.
[
    s for s in unmapped_strings
    if re.match('.*', s)
    #if re.match('.*semi.*', s)
    #and not re.match('.*out.*', s)
    #and not re.match('.*in.*', s)
    #and not re.match('.*less.*', s)
    #and not re.match('.*under.*', s)
    #and not re.match('.*o.*[db].*', s)
]

0 unmapped fuel strings found.


[]

## Cleaning other FERC Form 1 Plant Tables
* There are several additional FERC Form 1 tables which contain plant data.
* These include small plants, hydro, and pumped storage.
* Thus far we have not done much concerted work cleaning up / categorizing these plants, though they do get PUDL IDs.
* The following cell pulls the small plants (`f1_gnrt_plant`) table with some fields that would be useful for categorization.
* This is just a prototype/outline/suggestion...
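
As one possible starting point for categorizing these small plants, the sketch below guesses a fuel category from keywords in the plant name and fuel field. Everything in it is an assumption made for illustration: the `FUEL_PATTERNS` keywords, the category labels, and the `guess_fuel` helper are hypothetical and would need to be refined against the real data (for example, a bare `wind` pattern would also match names like "windsor").

```python
import re

# Hypothetical keyword patterns; extend after inspecting the real data.
FUEL_PATTERNS = {
    "hydro": r"hydro",
    "solar": r"solar|\bpv\b",
    "wind": r"wind",
    "oil": r"diesel|internal combustion",
}


def guess_fuel(plant_name, kind_of_fuel=""):
    """Return the first category whose pattern matches, or None if none do."""
    text = f"{plant_name or ''} {kind_of_fuel or ''}".lower()
    for category, pattern in FUEL_PATTERNS.items():
        if re.search(pattern, text):
            return category
    return None


for name in ["gold creek hydro", "enterprise diesel",
             "busch ranch wind energy farm", "not applicable"]:
    print(name, "->", guess_fuel(name))
```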

In [10]:
small_plants_ferc1 = (
    pd.read_sql(
        f"""SELECT f1_gnrt_plant.report_year,\
                   f1_gnrt_plant.respondent_id,\
                   f1_gnrt_plant.row_number,\
                   f1_gnrt_plant.spplmnt_num,\
                   f1_gnrt_plant.plant_name,\
                   f1_gnrt_plant.capacity_rating,\
                   f1_gnrt_plant.kind_of_fuel, \
                   f1_respondent_id.respondent_name\
            FROM f1_gnrt_plant, f1_respondent_id \
            WHERE report_year>={min(ferc1_years)}
            AND report_year<={max(ferc1_years)}
            AND f1_respondent_id.respondent_id=f1_gnrt_plant.respondent_id;""", ferc1_engine).
    assign(record_number=lambda x: x["row_number"] + 46*x["spplmnt_num"]).
    drop(["row_number", "spplmnt_num"], axis="columns").
    pipe(pudl.helpers.strip_lower, columns=["plant_name", "kind_of_fuel", "respondent_name"]).
    rename(columns={"capacity_rating": "capacity_mw"}).
    loc[:,["report_year", "respondent_id", "respondent_name", "record_number", "plant_name", "capacity_mw", "kind_of_fuel"]].
    sort_values(["report_year", "respondent_id", "record_number"])
)
n_small_plants_ferc1 = len(small_plants_ferc1)
small_plants_ferc1_outfile = "f1_gnrt_plant.csv"
print(f"Writing {n_small_plants_ferc1} small plant records out to {small_plants_ferc1_outfile}")
small_plants_ferc1.to_csv(small_plants_ferc1_outfile, index=False)
small_plants_ferc1

Writing 18690 small plant records out to f1_gnrt_plant.csv


Unnamed: 0,report_year,respondent_id,respondent_name,record_number,plant_name,capacity_mw,kind_of_fuel
616,1994,3,alaska electric light and power company,1,gold creek hydro,1.60,
617,1994,3,alaska electric light and power company,3,gold creek internal combustion:,0.00,
618,1994,3,alaska electric light and power company,4,enterprise diesel,1.25,
619,1994,3,alaska electric light and power company,5,enterprise diesel,1.25,
620,1994,3,alaska electric light and power company,6,enterprise diesel,3.50,diesel
...,...,...,...,...,...,...,...
18601,2018,312,"wabash valley energy marketing, inc.",4,ste. geneiveve solar,0.54,
18595,2018,403,"cheyenne light, fuel and power company",1,not applicable,0.00,
18319,2018,428,"ugi utilities, inc.",1,,0.00,
18596,2018,432,"black hills/colorado electric utility company, lp",1,busch ranch wind energy farm,29.04,wind
