
Extract raw PHMSA distribution and start of transmission data (Table A-D, H, I) #2932

Merged

57 commits from phmsa-extractor merged into main on Jan 19, 2024

Conversation

@e-belfer (Member) commented Oct 11, 2023

This PR adapts the generic Excel extractor to work on PHMSA datasets. The goal is to extract raw PHMSA distribution and selected transmission (tables A-D, H, I) data and create corresponding raw assets in PUDL.

This includes:

  • specifying column maps, footers, skiprows, and file names for all years of PHMSA distribution and selected transmission data (tables A-D, H, I); a sketch of how these per-year CSVs get applied follows this list
  • adapting the generic Excel extractor to work with multi-year zip files with a list of years partitions
  • adding PHMSA settings to the PUDL datapackage
  • fixing the missing keywords issue in pudl_datastore (in concert with PR#167 in the pudl-archiver repo)
  • dropping columns that are mapped to other tables during extraction, to prevent split-up years of data from being appended to different datasets multiple times (using process_final_page)
  • creating a notebook to help diagnose column mapping errors before extraction
  • creating documentation for PHMSA gas using a new jinja template
  • adding all .txt and .pdf files that contain PHMSA form instructions to docs.data_sources.phmsagas
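
Here is a minimal sketch of how a per-year column map might be applied during extraction. This is an illustration, not the actual PUDL implementation: the helper name and the exact CSV layout (standardized column names as the index, one column per year of data) are assumptions.

import pandas as pd

def rename_columns_for_year(raw_df: pd.DataFrame, page: str, year: int) -> pd.DataFrame:
    # Assumed layout: one CSV per page, indexed by standardized column name,
    # with one column per year holding that year's raw spreadsheet header.
    column_map = pd.read_csv(f"column_maps/{page}.csv", index_col=0)
    # Invert the mapping: raw spreadsheet header -> standardized column name.
    raw_to_clean = {raw: clean for clean, raw in column_map[str(year)].dropna().items()}
    return raw_df.rename(columns=raw_to_clean)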

Still to do:

Out of scope:

  • Sometimes data spills over into unnamed columns. This happens in the 2004-2009 distribution data, when the comments section overflows into the last four columns and beyond. The data lost is the preparer's email, fax, and name, along with portions of the comment section. Since this affects only preparer metadata, I think we can ignore it for now. These columns are dropped during extraction by the changes made to process_final_page, and the drop is logged as a warning.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer linked an issue Oct 11, 2023 that may be closed by this pull request
@e-belfer e-belfer self-assigned this Oct 16, 2023
@e-belfer e-belfer changed the base branch from dev to cems-extraction January 2, 2024 18:02
@e-belfer e-belfer marked this pull request as draft January 17, 2024 18:36

Comment on lines 32 to 34
data. This means one page may get split into multiple tables by column. To
prevent each table from containing all columns from these older years, filter by
the list of columns specified for the page, with a warning.
Member:

This last sentence is still a bit confusing, but I think I'm starting to understand better. Is the idea that the content from one page (tab) gets split into multiple tables because the content in later years is spread across several pages (tabs). And we don't want to retain duplicate columns across tables, one of which will contain data and one of which will be empty?

@e-belfer (Member Author):

I tried to clarify again.

@aesharpe (Member) left a comment:

Looks good! Just a few minor, non-blocking comments. Still feel like it would be better as just phmsa instead of phmsagas but it might be more work than it's worth right now.

@e-belfer e-belfer marked this pull request as ready for review January 18, 2024 15:35
@jdangerx (Member) left a comment:

All looks pretty reasonable! Based on my potentially flawed understanding, I think there are some refactoring opportunities, but nothing critical...

Here's my understanding of things, let me know what I got wrong :)

When we want to get the data for a specific (Page, year), we need to (a rough sketch in code follows this list):

  • download the resource based on the partition(s) - we're using the raw_df_factory, which assumes yearly partitions; since we need other partitions to ID the resource to download, we use page_part_map.csv
  • unzip the resource and open the file named in file_map.csv
  • within that file, read the sheet given in page_map.csv
  • within that sheet, rename columns based on column_maps/page_name.csv and the year.
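
Roughly, in code: the CSV names (page_part_map.csv, file_map.csv, page_map.csv, column_maps/) come from this PR, but the layouts and the datastore call here are simplified assumptions, not the real PUDL API.

import zipfile

import pandas as pd

def extract_page_year(datastore, page: str, year: int) -> pd.DataFrame:
    # 1. Identify the zip resource: the output year plus any extra partition
    #    bits (e.g. form) for this page from page_part_map.csv.
    extra = pd.read_csv("page_part_map.csv", index_col="page").loc[page].to_dict()
    zip_path = datastore.get_zipfile_path("phmsagas", year=year, **extra)  # hypothetical API

    # 2. Within the archive, open the file named in file_map.csv for this year...
    filename = pd.read_csv("file_map.csv", index_col="page").loc[page, str(year)]
    # 3. ...and read the sheet named in page_map.csv.
    sheet = pd.read_csv("page_map.csv", index_col="page").loc[page, str(year)]
    with zipfile.ZipFile(zip_path) as z, z.open(filename) as f:
        df = pd.read_excel(f, sheet_name=sheet)

    # 4. Rename raw headers to this page's standardized column names.
    column_map = pd.read_csv(f"column_maps/{page}.csv", index_col=0)
    raw_to_clean = {raw: clean for clean, raw in column_map[str(year)].dropna().items()}
    return df.rename(columns=raw_to_clean)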

And there are actually two sets of partitions (illustrated below):

  • ZIP resource partitions - we need to know this when we're downloading the zipfile. This includes year + form.
  • asset partitions - no single table comes from both distribution AND transmission, so the only meaningful partition here is by year.
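
Concretely, with illustrative values:

# Zip resource partition: everything needed to identify the archive to download.
resource_partition = {"year": 2015, "form": "distribution"}

# Asset partition: the slice of output data; only the year is meaningful here.
asset_partition = {"year": 2015}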

Then the Excel extractor basically (sketched below):

  • takes a dataset name
  • looks for all the years in that dataset's settings
  • makes an extractor filtering for each of those years
  • combines those into one dataframe at the end
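
A minimal sketch of that loop, where extract_year is a hypothetical stand-in for the per-year extraction, not the actual raw_df_factory code:

from collections.abc import Callable

import pandas as pd

def extract_all_years(
    extract_year: Callable[[int], dict[str, pd.DataFrame]], years: list[int]
) -> dict[str, pd.DataFrame]:
    # Run the single-year extraction once per working year...
    dfs_by_page: dict[str, list[pd.DataFrame]] = {}
    for year in years:
        for page, df in extract_year(year).items():
            dfs_by_page.setdefault(page, []).append(df)
    # ...then combine each page's yearly frames into one dataframe at the end.
    return {page: pd.concat(dfs) for page, dfs in dfs_by_page.items()}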

In the extractor, we assume that the resource partitions are always the same as the asset partitions plus the extra bits defined in page_part_map.csv, which is why we need to add those bits in the self.zipfile_resource_partitions() method.
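
In other words, the default presumably amounts to something like this sketch (page_part_map() here is a hypothetical accessor for the CSV contents):

def zipfile_resource_partitions(self, page, **partition) -> dict:
    # Input (zipfile) partition = output partition plus any extra keys
    # listed for this page in page_part_map.csv.
    return {**partition, **self.page_part_map(page)}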

If all that is correct, it might be nice to explicitly call out the "input partitions/zipfile resource partitions" and "output partitions/just straight-up years" as two separate important concepts in the GenericExtractor docs. But if I'm the only one who finds that framing enlightening I'm happy to keep it filed away in my own brain 😅

src/pudl/workspace/datastore.py (review thread resolved)
"""Final processing stage applied to a page DataFrame."""
return df

def zipfile_resource_partitions(self, page, **partition) -> dict:
Member:

This returns the "input data" partition that tells us where to find the source data for a certain output partition, right?

And if we want to do weird custom logic here in a dataset-specific subclass of GenericExtractor we could just override this method?

If so, it might be clearer if we rename the method + update the docstring to establish: "if you are looking to do some fancy mapping from input partitions to output partitions, for example if you output quarterly data but read in yearly files, override this method. the default is to take the output partitions + add any additional partitions defined in the page_part_map.csv."
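
A hypothetical subclass along the lines of that suggested docstring (quarterly outputs read from yearly files; not code from this PR):

class QuarterlyFromYearlyExtractor(GenericExtractor):
    """Hypothetical example: quarterly output assets read from yearly zipfiles."""

    def zipfile_resource_partitions(self, page, **partition) -> dict:
        # Map the output partition (e.g. quarter="2020q3") onto the input
        # partition: the yearly zipfile that contains that quarter.
        year = int(partition["quarter"][:4])
        return {"year": year}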

Comment on lines +28 to +49
def process_final_page(self, df, page):
    """Drop columns that get mapped to other assets.

    Older years of PHMSA data have one Excel tab in the raw data, while newer data
    has multiple tabs. To extract data into tables that follow the newer data format
    without duplicating the older data, we need to split older pages into multiple
    tables by column. To prevent each table from containing all columns from these
    older years, filter by the list of columns specified for the page, with a
    warning.
    """
    to_drop = [
        c
        for c in df.columns
        if c not in self._metadata.get_all_columns(page)
        and c not in self.cols_added
    ]
    if to_drop:
        logger.warning(
            f"Dropping columns {to_drop} that are not mapped to this asset."
        )
    df = df.drop(columns=to_drop, errors="ignore")
    return df
Member:

This doesn't seem like a bad idea overall, but it will result in the extractor needing to read the old Excel files once for each of the new parts it extracts from the same chunk of data, right?

We could just read the old data as one big squished-together table and sort it out during the transform step. Although I do like the idea of not needing to re-org quite as much during the transform step.

@e-belfer (Member Author):

The parts of the form still exist in the older years; they're just all included in one giant Excel sheet instead of being split across multiple sheets as in the later years. My preference is to form the raw table that we want, with the columns that belong to it, with that design informed by the most recent years of data, and to use the transform step for cleaning. I'm subclassing this method since I think this is a stricter check than we want for the other data sources, but I think it's relatively consistent with how we've handled column movements across tabs in other datasets over time.

@e-belfer e-belfer merged commit 97e49ad into main Jan 19, 2024
15 checks passed
@e-belfer e-belfer deleted the phmsa-extractor branch January 19, 2024 17:18
Labels

  • new-data: Requests for integration of new data.
  • phmsa: Data from the Pipeline and Hazardous Materials Safety Administration
Development

Successfully merging this pull request may close these issues.

Integrate PHMSA transmission and distribution data
5 participants