Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename FERC Form 1 core and output assets #2992

Closed
5 tasks done
Tracked by #2812
bendnorman opened this issue Oct 31, 2023 · 20 comments · Fixed by #2995
Closed
5 tasks done
Tracked by #2812

Rename FERC Form 1 core and output assets #2992

bendnorman opened this issue Oct 31, 2023 · 20 comments · Fixed by #2995
Assignees

Comments

@bendnorman
Copy link
Member

bendnorman commented Oct 31, 2023

For our first pass at renaming the FERC Form 1 core and output assets, we just applied the new naming convention to the existing asset name. We later realized some of the original asset_names were not consistent and did not accurately describe the data. Our Form 1 masters @zaneselvans and @cgosnell created more descriptive and consistent names in the Naming Conventions spreadsheet.

We should make these name changes before we merge in #2818 so we don't have two rounds of widespread name changes.

I started to pluralize some of the FERC 1 core tables in #2914, but it would probably be easier to cut a new branch off of #2818 and replace the asset names in the "Asset name in rename-core-asset branch" column of the spreadsheet with the names with y'alls suggestions.

Don't forget to update the asset names in the transformation metadata csvs! I forgot to do this in the first round of renaming and ran into some unexpected errors.

Tasks

@cmgosnell
Copy link
Member

cmgosnell commented Oct 31, 2023

Okay I have two FERC1 naming questions that I'd love some feedback on:

  • should we keep electric in the tables names?
  • should we keep plant in the table names?

This whole comment will make more sense if you go looks at the three iterations of suggestions in this tab.

its electric

There are 9 tables that either have electric in the name or we could add it in.

my proposal

Form 1 is a form for electric utilities. It seems very reasonable to assume that most of the tables are about electric utilities. We've already been systematically adding utility_type into these FERC1 tables (lots of the tables have it already, more in #2894). If we were actually consistent about the names here we would need to add electric into at least another 7 table names. My suggestion is to remove all electric from the table names but make sure we have utility_type == "electric" in those tables.

  • depreciation_amortization_summary_ferc1: has utility_type
  • electric_energy_dispositions_ferc1: does not have utility_type
  • electric_energy_sources_ferc1: does not have utility_type
  • electric_operating_expenses_ferc1: has utility_type
  • electric_operating_revenues_ferc1: has utility_type
  • electric_plant_depreciation_changes_ferc1: has utility_type
  • electric_plant_depreciation_functional_ferc1: has utility_type
  • electricity_sales_by_rate_schedule_ferc1: does not have utility_type
  • plant_in_service_ferc1: we're adding utility_type

its a plant

(i'm not talking about the generation plant tables here) there are three depreciation tables + one non-deprish table that have plant in them right now.

my proposal

remove plant! in the accounting sense of the word plant, I think all depreciation tables by their very definition contain plants. i don't think this is a particularly helpful addition all of these tables have some plant id-ing columns. I feel less strongly about this one overall than about electric in the table names.

  • depreciation_amortization_summary_ferc1 : has plant_function column
  • electric_plant_depreciation_changes_ferc1 : has plant_status column
  • electric_plant_depreciation_functional_ferc1: has plant_function and plant_status
  • plant_in_service_ferc1: we're adding plant_status

@zaneselvans
Copy link
Member

zaneselvans commented Oct 31, 2023

Removing plant and electric is attractive in a kind of "database normalization" sense, since it would reduce duplicative information, but I don't know that that's the most important thing about how we name the tables.

To me having the explicit utility_type or plant_* columns inside the tables feels independent of the table names. Looking at the list of tables, nobody is going to know what columns are inside them.

The old DBF table names were hopelessly illegible with their inconsistent and highly abbreviated names. The XBRL table names are almost ridiculously long and descriptive, but I think that does help identify what the heck is inside them -- especially if someone is coming in with a familiarity with the "paper" FERC Form 1, table names that somewhat correspond to the schedule titles in the PDF will help them find what they're looking for, even if they aren't the 150 character long titles that they use in XBRL.

There are also some limited cases in which there's non-electric utility information being reported in the FERC 1 tables.

If we were to remove both plant and electric then instead of electric_plant_in_service_ferc1 would we just have in_service_ferc1? That seems pretty terse and hard to interpret.

Another option that could help folks navigate the PDF to database chasm is using the schedule number in the table names, as the XBRL table names do. That makes it very easy to find the corresponding information in the XBRL Taxonomy viewer or to find the pages in the PDF that correspond to the information in the table. They also seem to be pretty stable over the years.

I think having e.g. electric_plant_ in the name of a table is more about differentiating that table from others that don't pertain to plant at all. E.g. the income_statement electric_energy_sources and electric_energy_dispositions tables aren't about accounting for the capital stock, they're about the flow of income (and expenses?) or energy, and I think that's a useful kind of information to have in the list of tables. When I am exploring a new database the first thing I do looks something like...

.tables -- get a list of all the tables in the DB
SHOW ALL TABLES; -- gives more information about the tables, but often gets truncated.
DESCRIBE table_name; -- list the columns in the table and their types, nullability, etc.
SUMMARIZE table_name; -- calculate some summary statistics for the table. Min/max median, null-ness, etc.

It seems like the tension here is between the inconvenience and potentially duplicative nature of longer names, and the desire for the table names to be legible and connected to information users are likely to already be familiar with.

@zaneselvans
Copy link
Member

There's also our colloquial usage of "plants" in the case of generation plant vs. FERC's use of the capital accounting "plant" and if we're going to remove "plant" then should we also remove "plants?" Or should we specify that the tables containing "plants" now are specifically generation_plant. And if we do that, then for consistency should we be specifying the other kinds of non-generation plant where the tables refer to a particular type? (are there any of those?)

@cmgosnell
Copy link
Member

I mostly dislike the electricin the table names because it is inconsistent, it seems at least semi-arbitrary across the tables, its duplicative with the content of the tables themselves and if we make it consistent across the tables than only 7 of our 24 FERC1 tables would not have electric in the name.

@cmgosnell cmgosnell linked a pull request Oct 31, 2023 that will close this issue
8 tasks
@bendnorman
Copy link
Member Author

bendnorman commented Oct 31, 2023

Electric

Are there utilities in these tables where utility_type != "electric"? If there is a mix of utility types in these tables, I don’t think it makes sense to include “electric” in the table name because there might be data that doesn't describe an electric utility. Even if utility_type == "electric" for all records it seems duplicative to include "electric" in the table name.

However, if there are tables that don't describe electric utilities, we should probably include "electric" in the table names of tables that do describe electric utilities.

Plant

I agree with @zaneselvans that users should get a sense of what is in the table without having to look at columns or values. If the entity the tables describe is a "plant" and not all Form 1 tables describe plants, then I think we should include "plant" in the table name so they can be differentiated.

@cmgosnell
Copy link
Member

cmgosnell commented Oct 31, 2023

Electric

Are there utilities in these tables where utility_type != "electric"? If there is a mix of utility types in these tables, I don’t think it makes sense to include “electric” in the table name because there might be data that doesn't describe an electric utility. Even if utility_type == "electric" for all records it seems duplicative to include "electric" in the table name.

I believe all of the tables with the suggestion for electric in the name are fully utility_type == "electric"

However, if there are tables that don't describe electric utilities, we should probably include "electric" in the table names of tables that do describe electric utilities.

Every respondent to FERC1 is an electric utility, but some of the tables like the income statement table include income from non-electric portions of their business. Out of our current FERC tables, 7 of them include non-electric portions of the FERC1 respondents.

Plant

I agree with @zaneselvans that users should get a sense of what is in the table without having to look at columns or values. If the entity the tables describe is a "plant" and not all Form 1 tables describe plants, then I think we should include "plant" in the table name so they can be differentiated.

I agree with this for the plant in service table. But for the three depreciation tables, it seems duplicative because depreciation is necessarily about plant assets.

@cmgosnell
Copy link
Member

cmgosnell commented Oct 31, 2023

Desires (some of which are conflicting)

  • Clarity of content/Legibility
  • Reduce duplication
  • Consistency
  • Link-able back to the original table
  • Not the longest names ever plz

@jdangerx
Copy link
Member

We had a synchronous call:

Decisions:

  • should we add electric? No - while this better reflects "this table is only electric data" vs. "this table has electric + some other stuff," and also more faithfully reproduces the paper schedule name, the extra verbosity across many of our FERC tables is not worth it. We'll link the table to the schedule name via the schedule number.
  • should we add plant? Yes - plant (accounting asset) differentiates from plantS (physical entities)
    • not in depreciation_*, though - the only thing that can depreciate is plant
    • should we explicitly mark the _plantS tables as _generation_plants? No - the "hydro"/"pumped storage" etc. prefixes imply
  • should we add schedule number? Yes - this links the table to the paper forms, which is useful for FERC1-aware users. use schedXXX to communicate that it is indeed a schedule number.

@cmgosnell
Copy link
Member

thanks y'all for the chat yesterday to come to a good decision on these names!

Here are the new name suggestions:

Asset name in dev NEW NAMES
balance_sheet_assets_ferc1 core_ferc1__yearly_balance_sheet_assets_sched110
balance_sheet_liabilities_ferc1 core_ferc1__yearly_balance_sheet_liabilities_sched110
cash_flow_ferc1 core_ferc1__yearly_cash_flows_sched120
depreciation_amortization_summary_ferc1 core_ferc1__yearly_depreciation_summary_sched336
electric_energy_dispositions_ferc1 core_ferc1__yearly_energy_dispositions_sched401
electric_energy_sources_ferc1 core_ferc1__yearly_energy_sources_sched401
electric_operating_expenses_ferc1 core_ferc1__yearly_operating_expenses_sched320
electric_operating_revenues_ferc1 core_ferc1__yearly_operating_revenues_sched300
electric_plant_depreciation_changes_ferc1 core_ferc1__yearly_depreciation_changes_sched219
electric_plant_depreciation_functional_ferc1 core_ferc1__yearly_depreciation_by_function_sched219
electricity_sales_by_rate_schedule_ferc1 core_ferc1__yearly_sales_by_rate_schedules_sched304
fuel_ferc1 core_ferc1__yearly_steam_plants_fuel_sched402
income_statement_ferc1 core_ferc1__yearly_income_statements_sched114
other_regulatory_liabilities_ferc1 core_ferc1__yearly_other_regulatory_liabilities_sched278
plant_in_service_ferc1 core_ferc1__yearly_plant_in_service_sched204
plants_hydro_ferc1 core_ferc1__yearly_hydroelectric_plants_sched406
plants_pumped_storage_ferc1 core_ferc1__yearly_pumped_storage_plants_sched408
plants_small_ferc1 core_ferc1__yearly_small_plants_sched410
plants_steam_ferc1 core_ferc1__yearly_steam_plants_sched402
purchased_power_ferc1 core_ferc1__yearly_purchased_power_and_exchanges_sched326
retained_earnings_ferc1 core_ferc1__yearly_retained_earnings_sched118
transmission_statistics_ferc1 core_ferc1__yearly_transmission_lines_sched422
utility_plant_summary_ferc1 core_ferc1__yearly_utility_plant_summary_sched200

@cmgosnell
Copy link
Member

okay two output table name questions regarding the schedule # suffixes:

  • denorm_fuel_by_plant_ferc1: should this bb be out_ferc1__yearly_steam_plants_fuel_by_plant OR out_ferc1__yearly_steam_plants_fuel_by_plant_sched402?? this output table comes from the og table but we've aggregated the records by plant (the original records are by plant and fuel type). basically should we keep the schedule # when we've changed the structure?
  • denorm_plants_all_ferc1: this should probably be out_ferc1__yearly_all_plants. basically i'd just like to note that bc this is a concatenation of a bunch of schedules i think we should drop the _sched#s

@bendnorman
Copy link
Member Author

These table names look great thank you!

  • I'm worried if we don't retain the schedule numbers in the reformatted output table names, FERC savvy users might work with an asset with a schedule number further up the dag and redo some of the aggregation and denormalizing of a reformated asset.
  • I think it makes sense to drop _sched#s when we've combined tables from multiple schedules.

I say we keep the schedule numbers around until the tables are combined with other tables from different datasets or schedules.

@cmgosnell cmgosnell self-assigned this Nov 1, 2023
@cmgosnell
Copy link
Member

cmgosnell commented Nov 1, 2023

That makes total sense to me @bendnorman and is my inclination as well.

These are the only FERC names that i think had any weirdness in regards to the schedule name and where i think this lands:

Current Asset Name Proposed Asset Name
denorm_fuel_by_plant_ferc1 out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
denorm_plants_all_ferc1 out_ferc1__yearly_all_plants
denorm_plants_utilities_ferc1 _out_ferc1__yearly_plants_utilities

@cmgosnell
Copy link
Member

cmgosnell commented Nov 1, 2023

oop i realized there is still one downstream ferc1 naming question:

Current Asset Name Proposed Asset Name
plant_parts_eia out_eia__yearly_plant_parts
denorm_plants_all_ferc1 out_ferc1__yearly_all_plants
out__yearly_plants_all_ferc1_plant_parts_eia ????

Options:

  • out__yearly_plants_all_plant_parts
  • out_ferc1_eia__yearly_plants_all_plant_parts
  • out__yearly_ferc1_plants_all_eia_plant_parts
  • out_pudl__yearly_ferc1_plants_all_eia_plant_parts

basically how do we deal w/ double datasets?

@bendnorman
Copy link
Member Author

By double datasets, do you mean assets that come from multiple sources? We typically make source optional once there is data from multiple sources. What does the data in out__yearly_plants_all_ferc1_plant_parts_eia describe?

@cmgosnell
Copy link
Member

cmgosnell commented Nov 2, 2023

Yes double datasets as in two or more input sources.

out__yearly_plants_all_ferc1_plant_parts_eia is the record linkage between out_ferc1__yearly_all_plants and out_eia__yearly_plant_parts. the good old pudl_out.ferc1_eia.

The two (?) analogous tables I can think of are core_epa__assn_epacamd_eia core_epa__assn_epacamd_eia_subplant_ids. But this table is an epa table that epa linked to eia and we augment that link.

How about using PUDL as the source name?

  • out_pudl__yearly_assn_ferc1_eia_plant_parts
  • out_pudl__yearly_assn_ferc1_eia_plants

How about using FERC as the source name?

This table effectively is the out_ferc1__yearly_all_plants with the out_eia__yearly_plant_parts graphed onto it, so it is at whatever heterogeneous granulatiry of plant groupings that FERC1 reports.

because of core_pudl__assn_plants_ferc1 and core_pudl__assn_plants_eia it seems like we could:

  • out_ferc1__yearly_assn_eia_plant_parts
  • out_ferc1__yearly_assn_plant_parts_eia

if we end up making an EIA generators table with FERC data scaled down (re #2946) we could name that out_eia__yearly_assn_plants_all_ferc1 or ya know. I think the main downside here is that it's not really an assntable in that it's not a 1:m association. Each FERC1 plant(ish) record is linked to one EIA plant parts record.

@bendnorman
Copy link
Member Author

bendnorman commented Nov 2, 2023

We made some core_pudl__... assets because there are association tables that we manually compiled. I suppose we created this ferc to eia connection, just not manually so I guess using core_pudl is appropriate? Should this apply to future downstream assets that combine multiple datasets, though? Maybe our policy should be if the raw data was manually compiled by PUDL source should be pudl. If it's an asset that combines data from multiple data sources, source should be left blank.

I think it's important to include the names of the entities and sources that an assn table links so I'm in favor of out_pudl__yearly_assn_ferc1_eia_plant_parts and out_pudl__yearly_assn_ferc1_eia_plants.

Is each entity in these tables a plant or a plant part?

@zaneselvans
Copy link
Member

I think in many of our assets there's more than one input source and at that point it's okay for the notion of the "source" to get diluted and left out of the name. But in the case of association / glue tables whose whole point is to link two tables from different data sources together it's probably good to refer to both of those data sources somehow and include assn in the name. So...

I think that whatever this table is named it should definitely include:

  • out__
  • yearly
  • ferc1
  • eia
  • assn
  • plant*

So maybe that's out__yearly_assn_ferc1_eia_plant_parts

If we are going to use pudl as a data source ID, I think we should be clear about what that means, since it's obviously attributing the data to ourselves somehow. I think that's clearly correct when we've manually compiled something, but less so in this case where we're algorithmically deriving something.

@zaneselvans
Copy link
Member

Also for the love of god I wish we could standardize on the ORDERING of multiple data sources. Like if there's more than one always do them in alphabetical order.

@bendnorman
Copy link
Member Author

@cmgosnell and I just chatted about how to name this asset. We settled on out_pudl__yearly_assn_ferc1_eia_plant_parts. We decided to use pudl as a source when we create the connection between two datasets (core_pudl__assn_plants_eia,core_pudl__assn_plants_ferc1, and this asset...) and have no source when we combine datasets using a shared key we did not create.

@cmgosnell can probably better describe why we chose plant_parts over plants.

@cmgosnell
Copy link
Member

cmgosnell commented Nov 3, 2023

okay we're going with: out_pudl__yearly_assn_eia_ferc1_plant_parts!

why:

  • out_ : bc derived
  • pudl: bc this is a derived connection that we have created. rn its lots of manual training data. but even if we move to an unsupervised model, it's still a new connection that we have birthed into the world. (it also currently relies on the manual plant_id_pudl)
  • yearly: bc annual.
  • ferc1: duh
  • eia: duh
  • assn: bc this is a one:one association table.
  • plant_parts: bc ferc doesn't report whole plants. they report heterogeneous clumps of plants (ex: Plant A, Plant B gas bits, Plant B coal bits, Plant C CT's, Plant C ST's, etc etc), which could otherwise be called plant parts 😎. which we link up with corresponding EIA plant parts. This table is at whatever granularity FERC1 reports so it is not a plants table like an EIA plants table where each record is one whole plant.

okay i love the alphabetization. that's a good call. we had ferc1_eia... in lots and lots of places! I preserved the pudl_out.ferc1_eia

eia/epa

this isn't FERC1 related but there are lots of epa_eia as well for the crosswalk:

  • core_epa__assn_epacamd_eia
  • core_epa__assn_epacamd_eia_subplant_ids
  • _core_epa__assn_epacamd_eia_unique

for these, I think that epa being the source makes sense. they do publish the association, we just augment it. but for the assn_epacamd_eia part, should we reorder it to be assn_eia_epacamd*??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants