
Extract raw PHMSA distribution and start of transmission data (Table A-D, H, I) #2932

Merged

57 commits from phmsa-extractor merged into main on Jan 19, 2024

Conversation

@e-belfer (Member) commented Oct 11, 2023

This PR adapts the generic Excel extractor to work on PHMSA datasets. The goal is to extract raw PHMSA distribution and selected transmission (tables A-D, H, I) data and create corresponding raw assets in PUDL.

This includes:

  • specifying column maps, footers, skiprows, and file names for all years of PHMSA distribution and selected transmission data (tables A-D, H, I); a sketch of how these per-year CSVs get applied follows this list
  • adapting the generic Excel extractor to work with multi-year zip files with a list of years partitions
  • adding PHMSA settings to the PUDL datapackage
  • fixing the missing keywords issue in pudl_datastore (in concert with PR#167 in the pudl-archiver repo)
  • dropping columns that are mapped to other tables during extraction, to prevent split-up years of data from being appended to different datasets multiple times (using process_final_page)
  • creating a notebook to help diagnose column mapping errors before extraction
  • creating documentation for PHMSA gas using a new jinja template
  • adding all .txt and .pdf files that contain PHMSA form instructions to docs.data_sources.phmsagas
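
Here is a minimal sketch of how a per-year column map might be applied during extraction. This is an illustration, not the actual PUDL implementation: the helper name and the exact CSV layout (standardized column names as the index, one column per year of data) are assumptions.

import pandas as pd

def rename_columns_for_year(raw_df: pd.DataFrame, page: str, year: int) -> pd.DataFrame:
    # Assumed layout: one CSV per page, indexed by standardized column name,
    # with one column per year holding that year's raw spreadsheet header.
    column_map = pd.read_csv(f"column_maps/{page}.csv", index_col=0)
    # Invert the mapping: raw spreadsheet header -> standardized column name.
    raw_to_clean = {raw: clean for clean, raw in column_map[str(year)].dropna().items()}
    return raw_df.rename(columns=raw_to_clean)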

Still to do:

Out of scope:

  • Sometimes data spills over into unnamed columns. This happens in the 2004-2009 distribution data, when the comments section overflows into the last four columns and beyond. The data lost is the preparer's email, fax, and name, along with portions of the comment section. Since this affects only preparer metadata, I think we can ignore it for now. These columns are dropped during extraction by the changes made to process_final_page, and the drop is logged as a warning.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer linked an issue Oct 11, 2023 that may be closed by this pull request
@e-belfer e-belfer self-assigned this Oct 16, 2023
@e-belfer e-belfer changed the base branch from dev to cems-extraction January 2, 2024 18:02
@e-belfer e-belfer marked this pull request as draft January 17, 2024 18:36

Comment on lines 32 to 34
data. This means one page may get split into multiple tables by column. To
prevent each table from containing all columns from these older years, filter by
the list of columns specified for the page, with a warning.
Member:

This last sentence is still a bit confusing, but I think I'm starting to understand better. Is the idea that the content from one page (tab) gets split into multiple tables because the content in later years is spread across several pages (tabs). And we don't want to retain duplicate columns across tables, one of which will contain data and one of which will be empty?

@e-belfer (Member Author):

I tried to clarify again.

@aesharpe (Member) left a comment:

Looks good! Just a few minor, non-blocking comments. Still feel like it would be better as just phmsa instead of phmsagas but it might be more work than it's worth right now.

@e-belfer e-belfer marked this pull request as ready for review January 18, 2024 15:35
@jdangerx (Member) left a comment:

All looks pretty reasonable! Based on my potentially flawed understanding, I think there are some refactoring opportunities, but nothing critical...

Here's my understanding of things, let me know what I got wrong :)

When we want to get the data for a specific (Page, year), we need to (a rough sketch in code follows this list):

  • download the resource based on the partition(s) - we're using the raw_df_factory, which assumes yearly partitions; since we need other partitions to ID the resource to download, we use page_part_map.csv
  • unzip the resource and open the file named in file_map.csv
  • within that file, read the sheet given in page_map.csv
  • within that sheet, rename columns based on column_maps/page_name.csv and the year.
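
Roughly, in code: the CSV names (page_part_map.csv, file_map.csv, page_map.csv, column_maps/) come from this PR, but the layouts and the datastore call here are simplified assumptions, not the real PUDL API.

import zipfile

import pandas as pd

def extract_page_year(datastore, page: str, year: int) -> pd.DataFrame:
    # 1. Identify the zip resource: the output year plus any extra partition
    #    bits (e.g. form) for this page from page_part_map.csv.
    extra = pd.read_csv("page_part_map.csv", index_col="page").loc[page].to_dict()
    zip_path = datastore.get_zipfile_path("phmsagas", year=year, **extra)  # hypothetical API

    # 2. Within the archive, open the file named in file_map.csv for this year...
    filename = pd.read_csv("file_map.csv", index_col="page").loc[page, str(year)]
    # 3. ...and read the sheet named in page_map.csv.
    sheet = pd.read_csv("page_map.csv", index_col="page").loc[page, str(year)]
    with zipfile.ZipFile(zip_path) as z, z.open(filename) as f:
        df = pd.read_excel(f, sheet_name=sheet)

    # 4. Rename raw headers to this page's standardized column names.
    column_map = pd.read_csv(f"column_maps/{page}.csv", index_col=0)
    raw_to_clean = {raw: clean for clean, raw in column_map[str(year)].dropna().items()}
    return df.rename(columns=raw_to_clean)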

And there are actually two sets of partitions (illustrated below):

  • ZIP resource partitions - we need to know this when we're downloading the zipfile. This includes year + form.
  • asset partitions - no single table comes from both distribution AND transmission, so the only meaningful partition here is by year.
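
Concretely, with illustrative values:

# Zip resource partition: everything needed to identify the archive to download.
resource_partition = {"year": 2015, "form": "distribution"}

# Asset partition: the slice of output data; only the year is meaningful here.
asset_partition = {"year": 2015}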

Then the Excel extractor basically (sketched below):

  • takes a dataset name
  • looks for all the years in that dataset's settings
  • makes an extractor filtering for each of those years
  • combines those into one dataframe at the end
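
A minimal sketch of that loop, where extract_year is a hypothetical stand-in for the per-year extraction, not the actual raw_df_factory code:

from collections.abc import Callable

import pandas as pd

def extract_all_years(
    extract_year: Callable[[int], dict[str, pd.DataFrame]], years: list[int]
) -> dict[str, pd.DataFrame]:
    # Run the single-year extraction once per working year...
    dfs_by_page: dict[str, list[pd.DataFrame]] = {}
    for year in years:
        for page, df in extract_year(year).items():
            dfs_by_page.setdefault(page, []).append(df)
    # ...then combine each page's yearly frames into one dataframe at the end.
    return {page: pd.concat(dfs) for page, dfs in dfs_by_page.items()}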

In the extractor, we assume that the resource partitions are always the same as the asset partitions plus the extra bits defined in page_part_map.csv, which is why we need to add those bits in the self.zipfile_resource_partitions() method.
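
In other words, the default presumably amounts to something like this sketch (page_part_map() here is a hypothetical accessor for the CSV contents):

def zipfile_resource_partitions(self, page, **partition) -> dict:
    # Input (zipfile) partition = output partition plus any extra keys
    # listed for this page in page_part_map.csv.
    return {**partition, **self.page_part_map(page)}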

If all that is correct, it might be nice to explicitly call out the "input partitions/zipfile resource partitions" and "output partitions/just straight-up years" as two separate important concepts in the GenericExtractor docs. But if I'm the only one who finds that framing enlightening I'm happy to keep it filed away in my own brain 😅

src/pudl/workspace/datastore.py (review thread resolved)
"""Final processing stage applied to a page DataFrame."""
return df

def zipfile_resource_partitions(self, page, **partition) -> dict:
Member:

This returns the "input data" partition that tells us where to find the source data for a certain output partition, right?

And if we want to do weird custom logic here in a dataset-specific subclass of GenericExtractor we could just override this method?

If so, it might be clearer if we rename the method + update the docstring to establish: "if you are looking to do some fancy mapping from input partitions to output partitions, for example if you output quarterly data but read in yearly files, override this method. the default is to take the output partitions + add any additional partitions defined in the page_part_map.csv."
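
A hypothetical subclass along the lines of that suggested docstring (quarterly outputs read from yearly files; not code from this PR):

class QuarterlyFromYearlyExtractor(GenericExtractor):
    """Hypothetical example: quarterly output assets read from yearly zipfiles."""

    def zipfile_resource_partitions(self, page, **partition) -> dict:
        # Map the output partition (e.g. quarter="2020q3") onto the input
        # partition: the yearly zipfile that contains that quarter.
        year = int(partition["quarter"][:4])
        return {"year": year}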

Comment on lines +28 to +49
def process_final_page(self, df, page):
    """Drop columns that get mapped to other assets.

    Older years of PHMSA data have one Excel tab in the raw data, while newer data
    has multiple tabs. To extract data into tables that follow the newer data format
    without duplicating the older data, we need to split older pages into multiple
    tables by column. To prevent each table from containing all columns from these
    older years, filter by the list of columns specified for the page, with a
    warning.
    """
    to_drop = [
        c
        for c in df.columns
        if c not in self._metadata.get_all_columns(page)
        and c not in self.cols_added
    ]
    if to_drop:
        logger.warning(
            f"Dropping columns {to_drop} that are not mapped to this asset."
        )
    df = df.drop(columns=to_drop, errors="ignore")
    return df
Member:

This doesn't seem like a bad idea overall, but it will result in the extractor needing to read the old Excel files once for each of the new parts it extracts from the same chunk of data, right?

We could just read the old data as one big squished-together table and sort it out during the transform step. Although I do like the idea of not needing to re-org quite as much during the transform step.

@e-belfer (Member Author):

The parts of the form still exist in the older years; they're just all included in one giant Excel sheet instead of being split across multiple sheets as in the later years. My preference is to form the raw table that we want, with the columns that belong to it, with that design informed by the most recent years of data, and to use the transform step for cleaning. I'm subclassing this method since I think this is a stricter check than we want for the other data sources, but I think it's relatively consistent with how we've handled column movements across tabs in other datasets over time.

@e-belfer e-belfer merged commit 97e49ad into main Jan 19, 2024
15 checks passed
@e-belfer e-belfer deleted the phmsa-extractor branch January 19, 2024 17:18
Labels

  • new-data: Requests for integration of new data.
  • phmsa: Data from the Pipeline and Hazardous Materials Safety Administration
Development

Successfully merging this pull request may close these issues.

Integrate PHMSA transmission and distribution data
5 participants