Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate PHMSA transmission and distribution data #2848

Open
17 of 33 tasks
Tracked by #2729
e-belfer opened this issue Sep 11, 2023 · 1 comment · Fixed by #2932 or #3235
Open
17 of 33 tasks
Tracked by #2729

Integrate PHMSA transmission and distribution data #2848

e-belfer opened this issue Sep 11, 2023 · 1 comment · Fixed by #2932 or #3235
Labels
epic Any issue whose primary purpose is to organize other issues into a group. excel Issues involving data in Microsoft Excel spreadsheets new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration

Comments

@e-belfer
Copy link
Member

e-belfer commented Sep 11, 2023

Scope of PR:

Produce PHMSA assets in dagster from the transmission and distribution tables, mirroring the format of the most recent year's form.

Design Notes

Transmission Data (1970-present)

Table design

All form parts are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables.

Parts A-D: core_phmsa__yearly_transmission_summary_by_commodity

  • Priority: high
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER
  • The summary table: one row per commodity, lists summarized miles of pipe, volume, and onshore/offshore miles of pipe by category.
  • Fields B and D are actually summaries from other fields, and could be compared directly to them as an additional step.
  • Part C is instructed to only be completed one time in all forms. This field could be filled in by operator ID and year to all relevant reports it covers.

Parts F-G: core_phmsa__yearly_inspections_and_assessments

  • Priority: high
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should have either information on interstate inspections or information on inspections in a particular state (as determined by the INTER_INTRA column
  • These fields don't exist prior to 2010.

Part H: core_phmsa__yearly_miles_of_transmission_pipe_by_nps

  • Priority: medium
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.
  • Could maybe be combined with Part I.

Part I: core_phmsa__yearly_miles_of_gathering_pipe_by_nps

  • Priority: medium
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.
  • Could maybe be combined with Part H.
  • Additional columns (e.g., PARTIONCADDITIONAL) are strings that will require parsing and reformatting to be of actual use.

Part J: core_phmsa__yearly_miles_of_pipe_by_decade_installed

  • Priority: high
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.
  • Could maybe get scavenged into a harvested table across years, since this information should largely be identical year to year.

Part K: core_phmsa__yearly_miles_of_transmission_pipe_by_specified_minimum_yield_strength

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.

Part L: core_phmsa__yearly_miles_of_pipe_by_class_location

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.
  • Feeds into Part B.

Part M: core_phmsa__yearly_failures_leaks_repairs

  • Priority: High
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.

Part P: core_phmsa__yearly_miles_of_pipe_by_material

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.
  • Feeds into Part D.

Part Q: core_phmsa__yearly_gas_transmission_miles_by_maop_determination_method

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.

Part R: core_phmsa__yearly_gas_transmission_miles_by_pt_range_and_internal_inspection

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.

Part S: core_phmsa__yearly_transmission_materials_verification

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.

Part T: core_phmsa__yearly_transmission_hca_miles_by_determination_method_and_risk_model

  • Priority: ??
  • Primary key: OPERATOR_ID, REPORT_YEAR, REPORT_NUMBER, INTER_INTRA, STATE_NAME
  • Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
  • This table has a bunch of different categoricals that will want to get reorganized.

Misc.

  • Parts O + N are the preparer's signature, email, telephone etc. (N), and the senior executive officers who signed off on the form (O). Not sure if we want to include these at all in our DB, alternately they could get appended to Parts A-E as part of the summary information on the report.
  • Some fields are not in the form but are added to the dataset, they're generally noted in the bottom of the PDF or in a separate file for each folder.

Distribution Data (1970-present)

Table design

  • All form parts are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables. The original files (CSV and Excel) are not split by form part.
  • Each report corresponds to 1 state and one commodity group.

Part A: Operator Information

  • Essentially just the operator ID, name, address, and info on the state and commodity pertinent to the report.

Part B: System Description

  1. Summary stats on miles and services by material.
  2. Miles of main by material and size of pipe.
  3. Miles of services by material and size of pipe.
  4. Miles of main and services by decade of installation.
  • Each part is a table that has a bunch of different categoricals that will want to get reorganized, probably changing the PK of the table. Organization should probably mirror what we do for transmission data since the breakouts are similar in nature.

Part C: core_phmsa__yearly_leaks_and_repairs

  • Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"

Part D-I

  • Each field occurs once per report, these should probably be kept in one table with Part A.
  • Part D, Excavation Damage: 6 fields of summary stats, should probably be collapsed into another table.
  • Part E, EFV and Service Valve Data: 4 fields of summary stats, should probably be collapsed into another table.
  • Part F: Total number of leaks on federal lands scheduled for repair or repaired.
  • Part G: %age of unaccounted gas.
  • Part H: Additional Info: a Notes field, discusses corrections, changes in calculations, and other ambiguities.
  • Part I: Preparer's info: Email, initial or supplemental report, contact info for preparer.

Adapt infrastructure to handle PHMSA partitions

  1. datastore zenodo
    e-belfer
  2. phmsagas
    e-belfer
  3. metadata phmsagas
    e-belfer
  4. metadata phmsagas
    e-belfer

Extraction into raw assets

  1. new-data phmsa
    cmgosnell e-belfer
  2. new-data phmsa
    e-belfer
  3. new-data phmsa
    jdangerx
  4. new-data phmsa
    cmgosnell
  5. new-data phmsa
    e-belfer
  6. new-data phmsa
    cmgosnell
  7. new-data phmsa
    e-belfer
  8. new-data phmsa
    jdangerx
  9. new-data phmsa
    cmgosnell
  10. new-data phmsa
    e-belfer
  11. new-data phmsa
    e-belfer
  12. new-data phmsa
    e-belfer
  13. new-data phmsa
    e-belfer
  14. new-data phmsa

After raw assets are extracted, we will have to define a core set of transformations for the PHMSA data. For each table, this could include: defining all columns with a datatype, transforming columns into categoricals to reduce the width of the table, defining primary keys, standardizing NAs.

Known cleaning steps:

  • Convert 2-digit report_year into 4-digit years (pre 2000).
  • Standardize report_state to use either shorthand or full state name
  • Ideally, adapt existing wide_to_tidy infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).
  • Standardize the multiple treatments of time noted in Clean up and standardize column names #3277 (filing date, data date, revision date)
  • Handle different aggregations of reporting over time for each form section (e.g., 1 form per state, one form per system)
  • Where granularity increases over time (e.g. onshore becomes onshore types A, B, C), aggregate these increasingly disaggregated columns back to have comparable totals over time.
  • Deal with extremely varying telephone formats
  • Standardize use of office vs. HQ addresses over time, and do general address cleaning

First round of cleaning into core assets

@e-belfer e-belfer changed the title Extract PHMSA data Extract PHMSA transmission and distribution data Sep 11, 2023
@e-belfer e-belfer added admin Catalyst operational tasks not related to coding. new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration excel Issues involving data in Microsoft Excel spreadsheets and removed admin Catalyst operational tasks not related to coding. labels Sep 11, 2023
@e-belfer e-belfer self-assigned this Sep 11, 2023
@e-belfer
Copy link
Member Author

e-belfer commented Sep 18, 2023

2017-2021 Excel and CSV comparison:

  • Column names are identical.
  • Excel files have day month year and time of report date, filing date and last data update columns, CSVs only have day month and year.
  • Minor rounding differences between values (e.g. one row out of 1300).
  • Otherwise no major differences found between data sources, good!

@e-belfer e-belfer linked a pull request Jan 12, 2024 that will close this issue
@e-belfer e-belfer changed the title Extract PHMSA transmission and distribution data Integrate PHMSA transmission and distribution data Jan 12, 2024
@e-belfer e-belfer reopened this Jan 19, 2024
@e-belfer e-belfer added the epic Any issue whose primary purpose is to organize other issues into a group. label Jan 19, 2024
@e-belfer e-belfer removed their assignment Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Any issue whose primary purpose is to organize other issues into a group. excel Issues involving data in Microsoft Excel spreadsheets new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration
Projects
Status: Icebox
1 participant