# Integrating New FERC Form 1 and EIA Data Releases
This notebook generates lists of new plants and utilities that need to be assigned PUDL IDs. It helps with the process of integrating new data each fall when the agencies make their new annual release for the previous year.

## Prerequisites:
* All available EIA 860/923 years must be loaded into your PUDL DB.
  * This includes the **new** year of data to be integrated.
  * This means the spreadsheet tab maps need to be updated.
  * Some minor EIA data wrangling may also be required.
* All years of FERC Form 1 data must be loaded into your FERC 1 DB.
  * This includes the **new** year of data to be integrated.
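
Before running the rest of the notebook, it can help to confirm that the new year actually made it into both databases. The following is a minimal sketch, not part of PUDL: `find_missing_years` is a hypothetical helper, and the years shown are placeholders for whatever you expect to have loaded.

```python
def find_missing_years(expected_years, loaded_years):
    """Return the expected years that are absent from the loaded years, sorted."""
    return sorted(set(expected_years) - set(loaded_years))


# Hypothetical example: we expect 2017-2019, but only 2017 and 2018 were loaded.
expected = range(2017, 2020)
loaded = [2017, 2018]
missing = find_missing_years(expected, loaded)
print(f"Missing years: {missing}")
```

In practice the `loaded` list would come from the databases themselves, for example from the distinct report years queried in the Setup section.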

## Outputs:
* `unmapped_utilities_ferc1.csv`: Respondent IDs and respondent names of utilities which appear in the FERC Form 1 DB, but which do **not** appear in the PUDL ID mapping spreadsheet.
* `unmapped_plants_ferc1.csv`: Plant names, respondent names, and respondent IDs associated with plants that appear in the FERC Form 1 DB, but which do **not** appear in the PUDL ID mapping spreadsheet.
* `all_unmapped_utilities_eia.csv`: EIA Utility IDs and names of utilities which appear in the PUDL DB, but which do **not** appear in the PUDL ID mapping spreadsheet.
* `unmapped_plants_eia.csv`: EIA Plant IDs and Plant Names of plants which appear in the PUDL DB, but which do **not** appear in the PUDL ID mapping spreadsheet. The Utility ID and Name for the primary plant operator, as well as the aggregate plant capacity and the state the plant is located in, are also provided to aid in PUDL ID mapping.
* `lost_utilities_eia.csv`: The Utility IDs and Names of utilities which appear in the PUDL ID mapping spreadsheet but which do **not** appear in the PUDL DB. This is likely because EIA revised previous years of data and removed those utilities after we had mapped them.
* `lost_plants_eia.csv`: The Plant IDs and Names of plants which appear in the PUDL ID mapping spreadsheet but which do **not** appear in the PUDL DB. This is likely because EIA revised previous years of data and removed those plants after we had mapped them.
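
The distinction between "unmapped" and "lost" records amounts to two set differences on the ID column. The sketch below illustrates the idea with made-up data; `find_unmapped_and_lost` is a hypothetical helper and is not necessarily how the `pudl.glue.ferc1_eia` functions are implemented.

```python
import pandas as pd


def find_unmapped_and_lost(db_df, mapped_df, id_col):
    """Split records into unmapped (DB only) and lost (mapping sheet only)."""
    unmapped = db_df[~db_df[id_col].isin(mapped_df[id_col])]
    lost = mapped_df[~mapped_df[id_col].isin(db_df[id_col])]
    return unmapped, lost


# Made-up IDs and names, purely for illustration.
db_plants = pd.DataFrame(
    {"plant_id_eia": [1, 2, 3], "plant_name_eia": ["alpha", "beta", "gamma"]}
)
mapped_plants = pd.DataFrame(
    {"plant_id_eia": [2, 3, 4], "plant_name_eia": ["beta", "gamma", "delta"]}
)

unmapped, lost = find_unmapped_and_lost(db_plants, mapped_plants, "plant_id_eia")
print(unmapped)  # plant 1: in the DB but not in the mapping sheet
print(lost)      # plant 4: in the mapping sheet but not in the DB
```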

In [1]:
%load_ext autoreload
%autoreload 2
import sqlalchemy as sa
import pandas as pd
import pudl
import re
from pathlib import Path
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_settings

{'pudl_in': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR',
 'data_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/data',
 'settings_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/settings',
 'pudl_out': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR',
 'sqlite_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite',
 'parquet_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/parquet',
 'datapkg_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/datapkg',
 'ferc1_db': 'sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/ferc1.sqlite',
 'pudl_db': 'sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/pudl.sqlite',
 'censusdp1tract_db': 'sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/censusdp1tract.sqlite'}

## Setup:
* Create FERC1/PUDL database connections
* Set the scope of the FERC Form 1 search (which years to check)

In [9]:
ferc1_engine = sa.create_engine(pudl_settings["ferc1_db"])
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
ferc1_years = pudl.constants.DATA_YEARS["ferc1"]
print("Searching for new FERC 1 plants, utilities and strings in the following years:")
print(ferc1_years)

Searching for new FERC 1 plants, utilities and strings in the following years:
(1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)


In [10]:
with pudl_engine.connect() as conn:
    eia_years = pd.read_sql("select distinct(report_date) from plants_eia860", conn)

print(f"EIA Years in db: {eia_years}")

EIA Years in db:    report_date
0   2019-01-01
1   2018-01-01
2   2017-01-01
3   2016-01-01
4   2015-01-01
5   2014-01-01
6   2013-01-01
7   2012-01-01
8   2011-01-01
9   2010-01-01
10  2009-01-01
11  2008-01-01
12  2007-01-01
13  2006-01-01
14  2005-01-01
15  2004-01-01
16  2003-01-01
17  2002-01-01
18  2001-01-01


## Unmapped FERC Form 1 Plants

In [11]:
unmapped_plants_ferc1 = pudl.glue.ferc1_eia.get_unmapped_plants_ferc1(pudl_settings, years=ferc1_years)
n_ferc1_unmapped_plants = len(unmapped_plants_ferc1)
print(f"{n_ferc1_unmapped_plants} unmapped FERC 1 plants found in {min(ferc1_years)}-{max(ferc1_years)}.")
outfile = Path("unmapped_plants_ferc1.csv")
print(f"Writing {n_ferc1_unmapped_plants} out to {outfile}")
unmapped_plants_ferc1.to_csv(outfile, index=False)
unmapped_plants_ferc1

0 unmapped FERC 1 plants found in 1994-2019.
Writing 0 out to unmapped_plants_ferc1.csv


,utility_id_ferc1,plant_name_ferc1,utility_name_ferc1,capacity_mw,plant_table


## Unmapped FERC Form 1 Utilities / Respondents
* **Note:** Frequently there are zero of these.

In [7]:
unmapped_utils_ferc1 = pudl.glue.ferc1_eia.get_unmapped_utils_ferc1(ferc1_engine)
n_ferc1_unmapped_utils = len(unmapped_utils_ferc1)
print(f"{n_ferc1_unmapped_utils} unmapped FERC 1 utilities found in {min(ferc1_years)}-{max(ferc1_years)}.")
outfile = Path("unmapped_utilities_ferc1.csv")
print(f"Writing {n_ferc1_unmapped_utils} out to {outfile}")
unmapped_utils_ferc1.to_csv(outfile, index=False)
unmapped_utils_ferc1

0 unmapped FERC 1 utilities found in 1994-2019.
Writing 0 out to unmapped_utilities_ferc1.csv


,utility_id_ferc1,utility_name_ferc1


## Unmapped EIA Plants
* **Note:** Some unmapped EIA plants do not have Utilities associated with them.
* Many of these plants are too small to warrant mapping, and so capacity is included as a potential filter.
* Also note that the first and last few plants in the output dataframe have many NA values, which can be confusing.

In [13]:
unmapped_plants_eia = pudl.glue.ferc1_eia.get_unmapped_plants_eia(pudl_engine)
print(f"Found {len(unmapped_plants_eia)} unmapped EIA plants.")
outfile = Path("unmapped_plants_eia.csv")
unmapped_plants_eia.to_csv(outfile)
unmapped_plants_eia

Found 253 unmapped EIA plants.


,plant_id_eia,plant_name_eia,utility_id_eia,utility_name_eia,state,capacity_mw
0,230,,,,,
1,234,,,,,
2,278,,,,,
3,343,,,,,
4,346,,,,,
...,...,...,...,...,...,...
248,55948,liberty generating,14046,orion power operating services,PA,1583.0
249,55971,washington county,15381,progress genco ventures llc,GA,797.6
250,55978,montfort wind farm,6354,fpl energy upton wind lp,WI,30.0
251,55979,fenner wind,38002,chi energy inc,NY,
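
Since capacity is included as a potential filter, one way to narrow the list is to keep only plants above some size and to set aside the rows with no capacity for manual review. This is a minimal sketch on a toy dataframe with made-up values; the `MIN_CAPACITY_MW` threshold is an assumption, not a PUDL convention.

```python
import pandas as pd

# Toy stand-in for the unmapped_plants_eia dataframe; values are made up.
plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 2, 3, 4],
        "plant_name_eia": ["big plant", "wind farm", "tiny plant", None],
        "capacity_mw": [1583.0, 30.0, 2.5, None],
    }
)

MIN_CAPACITY_MW = 25.0  # assumed threshold; choose whatever suits your mapping effort

# Comparisons against NaN evaluate to False, so plants with missing capacity
# are excluded from large_plants and collected separately instead.
large_plants = plants[plants["capacity_mw"] >= MIN_CAPACITY_MW]
no_capacity = plants[plants["capacity_mw"].isna()]

print(large_plants)
print(no_capacity)
```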


## Lost EIA Plants
* There shouldn't be very many of these. If there are more than a few hundred (out of the ~10,000 EIA plants), something may be wrong.

In [14]:
lost_plants_eia = pudl.glue.ferc1_eia.get_lost_plants_eia(pudl_engine)
print(f"Found {len(lost_plants_eia)} lost EIA plants.")
outfile = Path("lost_plants_eia.csv")
# to_csv() overwrites any existing file, so no explicit cleanup is needed.
lost_plants_eia.to_csv(outfile)
lost_plants_eia.sample(min(10, len(lost_plants_eia)))

Found 3 lost EIA plants.


plant_id_eia,plant_name_eia
99999,state-fuel level increment
60812,gruber solar center
60767,seashore solar
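
The rule of thumb above can be turned into an explicit check so that an unusually large number of lost records does not go unnoticed. This is only a sketch: `lost_count_is_suspicious` is a hypothetical helper, and the default ceiling of 300 is an assumed stand-in for "a few hundred".

```python
def lost_count_is_suspicious(n_lost, max_expected=300):
    """Return True if more records were lost than we would normally expect.

    max_expected=300 is an assumed stand-in for "a few hundred"; adjust to taste.
    """
    return n_lost > max_expected


# The run above found 3 lost plants, which is well under the assumed ceiling.
for n_lost in (3, 1200):
    status = "investigate" if lost_count_is_suspicious(n_lost) else "ok"
    print(f"{n_lost} lost plants: {status}")
```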


## Unmapped EIA Utilities
* Especially with the advent of many small distributed generators, there are often just as many new utilities as there are new plants.

In [80]:
unmapped_utils_eia = pudl.glue.ferc1_eia.get_unmapped_utils_eia(pudl_engine)
print(f"Found {len(unmapped_utils_eia)} unmapped EIA utilities.")
outfile = Path("all_unmapped_utilities_eia.csv")
unmapped_utils_eia.to_csv(outfile)

miss_utils = pudl.glue.ferc1_eia.get_unmapped_utils_with_plants_eia(pudl_engine)
print(f"Found {len(miss_utils)} unmapped utilities with plants/ownership.")
outfile = Path("planted_unmapped_utilities_eia.csv")
miss_utils.to_csv(outfile)

unmapped_utils_eia.head(10)

Found 6491 unmapped EIA utilities.
Found 2594 unmapped utilities with plants/ownership.


utility_id_eia,utility_name_eia,most_recent_total_capacity_mw
8901,reliant energy hl&p,14266.8
50023,"texas genco ii, lp",14158.6
6083,exelon gen co llc,10636.0
829,american electric power co inc,5765.2
5886,energy developement group,5028.5
5507,duke energy north america llc,4810.6
14716,pennsylvania power co,4587.9
26751,"national grid generation, llc",4389.4
13998,ohio edison co,3970.2
9162,interstate power and light,3961.2


In [81]:
miss_utils.head(10)

utility_id_eia,utility_name_eia,most_recent_total_capacity_mw
8901,Reliant Energy HL&P,14266.8
50023,"Texas Genco II, LP",14158.6
6083,Exelon Gen Co LLC,10636.0
829,American Electric Power Co Inc,5765.2
5886,Energy Developement Group,5028.5
5507,Duke Energy North America LLC,4810.6
14716,Pennsylvania Power Co,4587.9
26751,"National Grid Generation, LLC",4389.4
13998,Ohio Edison Co,3970.2
9162,Interstate Power and Light,3961.2


## Another Kind of Unmapped EIA Utilities
* This cell looks *only* for the EIA utilities that show up somewhere in the EIA 923 data, but still don't have a `utility_id_pudl` value assigned to them.

In [16]:
pudl_raw = pudl.output.pudltabl.PudlTabl(pudl_engine, freq=None)
frc_eia923 = pudl_raw.frc_eia923()
gf_eia923 = pudl_raw.gf_eia923()
gen_eia923 = pudl_raw.gen_eia923()
bf_eia923 = pudl_raw.bf_eia923()

missing_frc = frc_eia923[frc_eia923.utility_id_pudl.isna()][["utility_id_eia", "utility_name_eia"]]
missing_gf = gf_eia923[gf_eia923.utility_id_pudl.isna()][["utility_id_eia", "utility_name_eia"]]
missing_bf = bf_eia923[bf_eia923.utility_id_pudl.isna()][["utility_id_eia", "utility_name_eia"]]
missing_gens = gen_eia923[gen_eia923.utility_id_pudl.isna()][["utility_id_eia", "utility_name_eia"]]

missing_utils = (
    pd.concat([missing_frc, missing_bf, missing_gf, missing_gens])
    .drop_duplicates(subset="utility_id_eia")
    .set_index("utility_id_eia")
)

print(f"Found {len(missing_utils)} utilities with EIA 923 data but no PUDL Utility ID.")
outfile = Path("dataful_unmapped_utilities_eia.csv")
missing_utils.to_csv(outfile)
missing_utils.sample(min(len(missing_utils), 10))

KeyboardInterrupt: 

In [None]:
missing_utils.iloc[50:100]

## Lost EIA Utilities
* Again, there shouldn't be **too** many of these. If it's thousands, not hundreds, dig deeper.

In [None]:
lost_utils_eia = pudl.glue.ferc1_eia.get_lost_utils_eia(pudl_engine)
print(f"Found {len(lost_utils_eia)} lost EIA utilities.")
outfile = Path("lost_utilities_eia.csv")
lost_utils_eia.to_csv(outfile)

## Cleaning other FERC Form 1 Plant Tables
* There are several additional FERC Form 1 tables which contain plant data.
* These include small plants, hydro, and pumped storage.
* Thus far we have not done much concerted work cleaning up / categorizing these plants, though they do get PUDL IDs.
* The following cell pulls the small plants (`f1_gnrt_plant`) table with some fields that would be useful for categorization.
* This is just a prototype/outline/suggestion...

In [None]:
small_plants_ferc1 = (
    pd.read_sql(
        f"""SELECT f1_gnrt_plant.report_year,
                   f1_gnrt_plant.respondent_id,
                   f1_gnrt_plant.row_number,
                   f1_gnrt_plant.spplmnt_num,
                   f1_gnrt_plant.plant_name,
                   f1_gnrt_plant.capacity_rating,
                   f1_gnrt_plant.kind_of_fuel,
                   f1_respondent_id.respondent_name
            FROM f1_gnrt_plant, f1_respondent_id
            WHERE report_year>={min(ferc1_years)}
            AND report_year<={max(ferc1_years)}
            AND f1_respondent_id.respondent_id=f1_gnrt_plant.respondent_id;""",
        ferc1_engine,
    )
    # Build a single record number from the row and supplement numbers.
    .assign(record_number=lambda x: x["row_number"] + 46 * x["spplmnt_num"])
    .drop(["row_number", "spplmnt_num"], axis="columns")
    .pipe(pudl.helpers.simplify_strings, columns=["plant_name", "kind_of_fuel", "respondent_name"])
    .rename(columns={"capacity_rating": "capacity_mw"})
    .loc[:, ["report_year", "respondent_id", "respondent_name", "record_number", "plant_name", "capacity_mw", "kind_of_fuel"]]
    .sort_values(["report_year", "respondent_id", "record_number"])
)
n_small_plants_ferc1 = len(small_plants_ferc1)
outfile = Path("f1_gnrt_plant.csv")
print(f"Writing {n_small_plants_ferc1} small plant records out to {outfile}")
small_plants_ferc1.to_csv(outfile, index=False)
small_plants_ferc1
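
As a possible next step toward categorizing these plants, the free-form `kind_of_fuel` strings could be bucketed with a few regular expressions. The sketch below is illustrative only: `categorize_fuel`, the category names, and the patterns are all assumptions, and the real strings will almost certainly need more cases.

```python
import re

# Illustrative patterns only; real FERC Form 1 fuel strings will need more cases.
FUEL_PATTERNS = {
    "hydro": re.compile(r"hydro|water"),
    "coal": re.compile(r"coal"),
    "gas": re.compile(r"gas"),
    "oil": re.compile(r"oil|diesel"),
    "solar": re.compile(r"solar|photovoltaic"),
    "wind": re.compile(r"wind"),
}


def categorize_fuel(kind_of_fuel):
    """Map a free-form fuel string to a coarse category, or "unknown"."""
    if not isinstance(kind_of_fuel, str):
        return "unknown"
    text = kind_of_fuel.lower()
    # Patterns are checked in insertion order; the first match wins.
    for category, pattern in FUEL_PATTERNS.items():
        if pattern.search(text):
            return category
    return "unknown"


for raw in ["Natural Gas", "diesel", "hydro", None]:
    print(f"{raw!r} -> {categorize_fuel(raw)}")
```

Applied to the table above, this could look like `small_plants_ferc1["kind_of_fuel"].map(categorize_fuel)`, with the "unknown" rows reviewed by hand.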