-
-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate EIA 176, 191 and 757A into PUDL #2603
Comments
I'm in a place to pick this up with consistent attention. For starters, the "bulk" data isn't a more usable form for ingesting EIA-176 data and related data. That leaves what I've come to think of as the "bundled" data as the source for this. (At some point I'd like more insight on how developers have discovered these data endpoints.) For next steps, I plan to put together a basic outline of extraction. That will inform the data model, potentially any additional requirements, and lead into the other tasks here. It looks like I can follow preexisting patterns of form-specific extraction logic in Elaboration on bulk dataThe "bulk" data doesn't contain the company-level information from form EIA-176. Searching for a handful of attributes corresponding to the bundled EIA-176 company data ( On the whole the bulk data is pretty disorganized. It comprises many different types of series and no single column exhibits low-cardinality values that would easily separate them. Ideally I'd expect one or more column(s) clearly indicating withdrawals vs receipts, etc. To get those clean attributes/dimensions, one could potentially parse them from the "description" field or the "series_id" field once there's a clear mapping of the components of "series_id". (The "series_id" components clearly correspond to some semantic codes.) There's also the "name" field but that appears to largely be a noisier version of "description" also including the unit of time for the series, e.g., monthly. A couple thousand entries do not have a "description" value, which might actually be a data structuring issue. Here are a some examples of the "series_id" and "description" values:
I took a quick and dirty pass at "description" keywords one might be able to use to decompose the 17,000+ series into clearer groupings. However, these conditions aren't mutually exclusive, so to get clean sets we'd need different groupings or extra logic.
|
Sweet, thanks for digging into the bulk data @davidmudrauskas - hope finding all these quirks was fun in some way :) As for next steps, following in the footsteps of EIA 860 seems like a good start! Some more confusing bits that you might have already figured out:
|
Thanks! Yeah, I'd tracked down the DOI 😁 Looks like you all don't have generic CSV extracting yet so I've drafted a basic class for that, and I think I've found the entry points for the other major operations. Should have something to look at soon. |
I have a decent idea going in a branch I'll push soon once I get pre-commit hooks resolved, then maybe I can get some feedback. Let me know and I can adjust the pace too. |
If we do get a generic CSV extractor set up, all of the FERC-714 data from 2020 and earlier is stored as CSVs, and it could be applied there too. |
Responding to this question from #3264 here since I think it's more general to our integration of the EIA's gas data:
I wouldn't interpret the "EIA-176" label narrowly. There's nothing particularly special about that form, and as noted in #2603 initially we thought that all of the data in this bulk download zipfile was EIA-176, but it turns out there were these other associated forms related to gas production, storage, etc. IIRC in my initial digging around, it seemed like some of it looked usable, and some of it was both very messy and only had small number of years of data included. Almost like they accidentally dumped the data in this CSV once and then never looked at it again (which means the data is somewhere else, but... where?). Rather than focusing on the forms in particular, I'd try and identify the subset of data that we've got archived and ready to process which are actually worth cleaning up and turning into tables -- like there's a significant amount of data and it's a tractable problem without a huge investment of manual effort. Or at least prioritize them in terms of "person-hours per unit of data integrated"
My guess is that like the EIA-860 and EIA-923 spreadsheets, the data in this zipfile is (part of) what gets fed into the bulk data and more polished monthly/annual gas reports they publish, but that there's other data coming from other forms too, and the bulk API data and glossy reports probably do not reflect the full detail of the data that's in the original submitted form responses (whatever form they might take). I think we're probably looking for the long historical record of inputs that go into making their natural gas data products. Some of which looks like it goes back (from somewhere) as far as the 1970s. The EIA Natural Gas Annual Report refers to forms EIA-176, EIA-895, and EIA-910 in Appendix A: "summary of data collection operations and report methodology" though weirdly EIA-895 doesn't show up on their big list of forms. I assume that the the EIA-757 (Natural Gas Processing Plant Survey) will probably relate to some of the same pipeline infrastructure that's reported in PHMSA. Compressor stations, NGL precipitators, H2S removal facilities. I assume linking the facilities and their owners/operators between the EIA and PHMSA data will be another entity matching circus that we probably don't want to get into now, but since we're also working on the PHMSA pipelines extractions now, being able to care that data and EIA-757 at the same time might help us understand how they relate and map out a plan for future integrations. Similarly I'd guess that the EIA-191 (monthly underground natural gas storage report) and EIA-191L (monthly LNG storage report) will have some relationship to the gas storage facility data that comes out of PHMSA, so those would be good to look at in tandem too, but I don't think we've gotten to the gas storage facility data in PHMSA yet, so maybe we don't prioritize EIA-191 yet if we have to choose. |
One alternative method to access 176/191/757 data that may or may not work: Excel: It appears from inspecting this site for 176 downloads that if you CSV: Similarly but more simply, Observations about this GET endpoint behavior:
In short, I think there are hidden endpoints here that don't involve hand mapping LINE columns and would produce hopefully more usable CSVs for raw data integration, and I'd like to explore these a bit more before we commit to the bundled data. The CSV extractor built to handle the bundled data should still be usable here. @davidmudrauskas I know you had mentioned maybe starting to map LINE columns as a next step, so I'd suggest just pausing on that if you've started already since there may be a possibility this isn't necessary. |
EIA-895 does show up on the ancient form page: https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s895 |
Notes on reviewing the freshly extracted data from #3402: EIA-176
n_co = eia176.groupby("id").company.transform("nunique")
eia176.loc[n_co == 2, ["company", "id"]].drop_duplicates().sort_values("id")
EIA-191
EIA-757A
EntitiesSeveral kinds of entities are being referenced in these tables, and could potentially be pulled out into their own separate tables and linked to the data via FKs:
|
This is a short code for the line numbers on the form, with some lines referring to a combination of two rows (e.g.
We added In terms of the lines and definitions, we'll want to refer to the NGQS guide - see Appendix A for definitions of all the |
Some observations on EIA 176 IDs:
|
This is out of date given current data availability, so I'm moving it out of the PR description and archiving it below: Proposed table designcore_eia176__yearly_companiesPriority: High This table will include data on the company filling out the form, including their address, company characteristics and distribution territory. This will include Parts 1, 3, and 7 from EIA Form 176. The primary key should be EIA ID number, year, and state. Some of these characteristics could get harvested across states or years (e.g., company address) and characteristics, but this seems somewhat low-value add for now. core_eia176__yearly_sources_dispositionsPriority: High This table will include data on a company's natural gas sources and dispositions, with each row representing one state's data. This will include Part 4 and Part 6 of Form EIA 176. This will give us information on international and cross-state gas transfers, numbers of residential, commercial, industrial etc end-use consumers receiving natural gas, and volumes and revenues associated with these dispositions. The primary key will be EIA ID number, year and state. Ideally if a footnote in Part 7B is included, it should be attached in a column to the relevant data it refers to. core_eia__yearly_liquefied_natural_gas_storagePriority: medium This table will include EIA 191L data (see above) and Part 5 of the EIA 176 data. The primary key should be EIA ID number, year, and facility. The data included from EIA 191L should be from the end of December, and the volume and capacity should be directly comparable in the table so they can be validated against one another. 191L data does not seem to be included in all_data_191.csv so it would need to be obtained through a separate source. As a first step, we could just include EIA 176 data for now. Ideally if a footnote in 176 Part 7B is included, it should be attached in a column to the relevant data it refers to. core_eia191__monthly_underground_storagePriority: medium This table will include EIA 191 data. The primary key should be EIA ID number, year-month, and facility (field_name), with each row representing one facility's report in a given month. The EIA 191 form is released monthly. all_data_191.csv looks like it has actual column names that correspond to the form, so we won't have to deal with the frustrating LINE renames here. core_eia191__monthly_liquefied_natural_gas_storagePriority: medium This table will include EIA 191L data, which isn't part of the bulk electricity zipfile as far as I can tell. The primary key should be EIA ID number, year-month, and facility, with each row representing one facility's report in a given month. core_eia757__{freq}_processing_plantsPriority: medium This table will include data from EIA Form 757 Schedule A. Schedule A, the Baseline Report is filled out no more often than every 3 years, and includes data on capacity, status, operations, and connecting infrastructure of core_eia757__{freq}_processing_disruptionsPriority: medium This table will include data from EIA Form 757 Schedule B. Schedule B monitors post-emergency natural gas processing plans operational statuses. all_data_757.csv does not contain this data so we'd need to track down a downloadable form of the data if we wanted to include it, which would probably involve using the API data. Primary key should be EIA ID, date of filing, and some combination of plant identifiers as required, with each row representing one plant's report. |
What tables are we sure we'll need? Data Tables
Entity TablesFor most of our other datasets, we've called respondents "utilities" but it's not clear that applies here. What entity name do we want to use? If there isn't much information associated with a respondent, and it only appears in a single table, it's probably not worth breaking an entity table out. E.g. in the case of the
|
I've been looking at the EIA-191 data and the additions to it mentioned here and feel I can take on that work. Looks like the PK is defined already, and the columns that may require an ENUM or FK only have a few unique values there. One open question here for me is which, if any, fields need to be normalized and how (saw some mention of this in a previous comment). |
I haven't looked at it in a bit, but it seemed like there was probably one one real data table to be made in the EIA-191, and that as of yet there wasn't much benefit to stripping out entity (respondent) fields to make a separate more normalized table. So I'd probably try making a single core table for now. |
Description
Integrate the EIA-176 natural gas sources & dispositions data into PUDL.
What we've been calling the EIA-176 is actually 3 different related forms that are bundled together for bulk distribution:
Each of these forms uses the same IDs to refer to the reporting companies, and that shared company ID information is provided as a separate standalone table with the company ID, name, and activity status. Note that there is no date information associated with the company information, so the activity status probably just pertains to when the reporting was done, and there are no historical archives, so this field is pretty useless. Presumably we'll be able to guess which companies are active based on whether they're reporting data in the other tables?
There's also a lot more natural gas data available from EIA that we might be able to download in bulk from their API or other hidden endpoints.
Motivation
This gas source, disposition, storage, and processing plant data should help us target existing natural gas utilities and the capital locked up in existing infrastructure for early retirement, and may help advocates prevent new investments in natural gas facilities that would need to be decommissioned well before end of life to maintain a stable climate.
In Scope
We'll know we're done when:
Known data issues
The EIA-176 and EIA-176 "other" tables have some cryptic codes and "LINE" numbers in them. Based on Document codes and data structures associated with the EIA 176 #562 it seems that LINE is similar to FERC'sThis is out of date, see below for discussion ofrow_id
in that the number corresponds to a particular variable reported on the form, and the mapping of number to variable evolves over time, so we'll need to do some kind of pivot of the data, and an alignment of those LINE number meanings across years. The complete list of LINE numbers can be found on p. 32 here, but has also certainly changed over time.line
anditemsort
in newly extracted data.Initial data selection and archiving
Extraction
CsvExtractor
to handle new 176/191/757A data format #3339EIA-176 Data Transformation
item
consistent over time, even when line numbers change? If not, we need to map changesline
oritemsort
over time. #3501EIA-191 Data Transformation
Cleaning and integration - EIA 757A
Integrate data from across EIA datasets
The text was updated successfully, but these errors were encountered: