
Integrate FERC Form 2 dbf formats into ferc_to_sqlite #2564

Merged: 36 commits merged into dev from rousik-ferc2 on Jun 8, 2023

Conversation


@rousik rousik commented May 4, 2023

This adds the capability to extract FERC Form 2 years 1996-2020 from DBF archives.

This is very much not working yet, but it lays the groundwork. The first issue encountered is that multiple files/resources can be retrieved for a given year. Refactoring of DbfReader may be necessary to support this more general problem.
1. move PK/FK logic into new ferc.py shared file
2. embed partition information in PudlResourceKey object
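The second point can be sketched roughly as follows. This is a hypothetical illustration of embedding partition information in a resource key, using the `PudlResourceKey` name and the `{"year": ..., "part": ...}` partition dicts discussed in this PR; the real class lives in `src/pudl/workspace/datastore.py` and may differ in its fields and methods.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PudlResourceKey:
    """Identifies a single resource (file) within a dataset archive."""

    dataset: str  # e.g. "ferc2"
    doi: str      # Zenodo DOI of the archive version
    name: str     # filename within the archive
    # Partition annotations such as {"year": 1996, "part": 1}, stored as a
    # sorted tuple of items so the key remains hashable.
    partitions: tuple = field(default=())

    def matches(self, **filters) -> bool:
        """True if every filter key/value appears in this key's partitions."""
        parts = dict(self.partitions)
        return all(parts.get(k) == v for k, v in filters.items())


key = PudlResourceKey(
    dataset="ferc2",
    doi="10.5281/zenodo.8006881",
    name="ferc2-1996-part1.zip",
    partitions=(("year", 1996), ("part", 1)),
)
```

With partitions carried on the key itself, the datastore can filter resources by any combination of annotations (e.g. `key.matches(year=1996, part=1)`) instead of relying on year alone.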
@zaneselvans zaneselvans added ferc2 Issues related to the FERC Form 2 dataset dbf Data coming from FERC's old Visual FoxPro DBF database file format. labels May 5, 2023
@zaneselvans zaneselvans added this to the 2023 Spring milestone May 5, 2023
@rousik rousik changed the base branch from main to dev May 9, 2023 22:04
rousik added 2 commits May 9, 2023 16:06
This now handles the processing by looping over valid partitions rather than iterating over the years. It turns out this may not be necessary, because the split-archive years have very strange formats anyway that will need a totally different approach.
1. added is_disabled property to dataset settings, which can be set
specifically on ferc1 and ferc2 dbf settings (not globally applied
and/or enforced, but it would be nice)
2. construct ferc extractors in a loop
3. pull the right config out of the global FercToSqliteSettings to
facilitate (2)
@@ -350,13 +347,19 @@ def dbf2sqlite(context) -> None:
"""Clone the FERC Form 1 Visual FoxPro databases into SQLite."""
Collaborator Author


Perhaps it would be better to use one @op per dataset here rather than clumping all of this into a single dbf2sqlite?

Member


Yeah, I totally agree. I think this is an artifact of only having one dataset to work with early on. Especially as the amount of XBRL data grows over time, being able to separate them out will be very helpful in development, since the XBRL processing is quite slow compared to DBF (though also currently much more parallelizable).

Member


Maybe @bendnorman has some thoughts on how / whether we should break the different FERC forms out into individual jobs / assets.

Collaborator Author


I think that for the purpose of expediency, it might be a good idea to do this separately from this PR. I'm a big fan of gradually improving the code rather than necessarily solving all the problems at once (which may drag on).

@rousik rousik requested a review from zaneselvans May 23, 2023 20:10
@rousik rousik marked this pull request as ready for review May 23, 2023 20:10
@zaneselvans
Member

@rousik is the comment on this PR still accurate? It looks more like it allows data from 1996-2021 to be converted from DBF to SQLite. Does it do the older years too?

@rousik
Collaborator Author

rousik commented May 23, 2023

@rousik is the comment on this PR still accurate? It looks more like it allows data from 1996-2021 to be converted from DBF to SQLite. Does it do the older years too?

Updated the comment. Prior to 1996, the split respondent archives are just collections of DAT files; they don't seem to be DBF at all.

@zaneselvans
Member

@rousik it looks like some of the integration tests need to be updated to work with the new general purpose DBF extractor.

FAILED test/integration/etl_test.py::test_ferc1_schema - AttributeError: 'FercDbfReader' object has no attribute 'get_db_schema'
ERROR test/integration/datasette_metadata_test.py::test_datasette_metadata_to_yml - KeyError: "Multiple resources found for ferc1: {'year': 2020}"

There are a bunch of errors due to the "Multiple resources found" which I think are probably because there's both XBRL and DBF data for Q3 and Q4 of 2020. This will be more and more of an issue over time as revisions to the old DBF data are published in XBRL and we end up with both DBF and small XBRL resources in the Zenodo archives for more and more years. So we'll just need to be good about always specifying the format as well as the year. I don't think this was an issue before because we hard-coded the block of years that was served by each dataset (which won't work going forward given the older data revisions).
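The "Multiple resources found" failure mode and the fix of always filtering on data format as well as year can be illustrated with a small sketch. The `get_unique_resource` function and the resource dicts below are simplified stand-ins for the real `Datastore` API, not its actual signatures.

```python
# Two resources exist for ferc1 in 2020: the DBF archive and the XBRL archive.
resources = [
    {"name": "ferc1-2020.zip", "year": 2020, "data_format": "dbf"},
    {"name": "ferc1-xbrl-2020.zip", "year": 2020, "data_format": "xbrl"},
]


def get_unique_resource(resources, **filters):
    """Return the single resource matching all filters, or raise KeyError."""
    matches = [
        r for r in resources
        if all(r.get(k) == v for k, v in filters.items())
    ]
    if len(matches) > 1:
        raise KeyError(f"Multiple resources found for: {filters}")
    if not matches:
        raise KeyError(f"No resources found for: {filters}")
    return matches[0]


# Filtering on year alone is ambiguous once both formats exist for a year:
try:
    get_unique_resource(resources, year=2020)
except KeyError as err:
    ambiguous = str(err)

# Adding the data format disambiguates the lookup:
dbf = get_unique_resource(resources, year=2020, data_format="dbf")
```

As the comment above notes, this ambiguity will only grow as XBRL revisions of old DBF years are published, so format must become part of every lookup rather than relying on hard-coded year ranges.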

1. split off ferc_dbf tests into separate suite
2. adapt for the recent code changes
3. split off the fixtures so that dbf/xbrl could be tested independently
   w/o having to run the other data extraction
@codecov

codecov bot commented May 25, 2023

Codecov Report

Patch coverage: 88.4% and no project coverage change.

Comparison is base (a1a7ef7) 86.9% compared to head (30c9aec) 86.9%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2564   +/-   ##
=====================================
  Coverage   86.9%   86.9%           
=====================================
  Files         84      86    +2     
  Lines       9720    9809   +89     
=====================================
+ Hits        8447    8525   +78     
- Misses      1273    1284   +11     
Impacted Files Coverage Δ
src/pudl/metadata/sources.py 100.0% <ø> (ø)
src/pudl/extract/xbrl.py 94.2% <33.3%> (-2.8%) ⬇️
src/pudl/ferc_to_sqlite/cli.py 70.7% <71.4%> (-0.7%) ⬇️
src/pudl/workspace/datastore.py 73.0% <71.4%> (-0.1%) ⬇️
src/pudl/extract/dbf.py 88.8% <87.8%> (-0.2%) ⬇️
src/pudl/extract/ferc.py 91.6% <91.6%> (ø)
src/pudl/extract/ferc1.py 99.1% <100.0%> (+<0.1%) ⬆️
src/pudl/extract/ferc2.py 100.0% <100.0%> (ø)
src/pudl/settings.py 98.7% <100.0%> (+<0.1%) ⬆️

☔ View full report in Codecov by Sentry.


@zaneselvans zaneselvans left a comment


I love how concise the per-form extractors ultimately are! Just a little metadata compilation and some optional fixes.

I left a bunch of comments, but it seems like maybe one higher-level design question we both have is whether it would be better to split out the individual FERC forms as separately addressable jobs / ops / processes rather than lumping all the DBF and all the XBRL together.

Review threads (resolved) on:
src/pudl/extract/dbf.py
test/integration/ferc_dbf_extract_test.py
src/pudl/ferc_to_sqlite/cli.py

Returns:
The job definition to be executed.
"""
if not (enable_xbrl or enable_dbf):
Member


The way we're splitting up or specifying which data to translate into SQLite feels messy to me in general. Not sure if there's a better way to do it though.

Right now in the ETL settings we ask the user for datasets and years, and then figure out internally whether that means translating from XBRL or DBF, which seems correct: ideally the user wouldn't need to know which source the data is coming from. But with 5 FERC forms (1, 2, 6, 60, 714) involved and 2 data formats for each of them (though the older 714 data isn't distributed as DBF for some reason), it feels like a clear enumeration of the chunks of work to be done internally would be more readable than the current arrangement: a high-level job switch that says do DBF or do XBRL, another lower layer visible only in the settings that says which datasets to translate, and yet another layer beneath that indicating which partitions (years) to translate.

I guess what I'm imagining is that instead of just pulling the DBF vs. XBRL processing up to this high level, it might also make sense to pull both the dataset and data format up to this level and treat each dataset + data format as a separate job, since they'll each result in a separate SQLite output, and in a development context we'll often just want to re-run one of them in isolation. What do you think?

Collaborator Author


Yeah. I agree that having one job per dataset/format combination would be the most natural structure. Let me refactor this accordingly.

Collaborator Author


After some early tinkering with this, I feel like I will need to dig deeper into dagster and how it works and how to properly structure this. I think that it should be possible to auto generate some of these dagster ops/graphs programmatically which should reduce the boilerplate, but it might take a little extra effort and care to do so.

My suggestion would be to proceed with this PR as-is and follow up with cleanup/refactoring independently of this. What do you think about that? I can file an issue to track this.
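The "one job per dataset + data format" structure under discussion can be sketched without committing to a dagster design yet. In PUDL these would be dagster ops/graphs generated programmatically; here plain callables stand in for them, and all names (the registry, the job naming scheme, the extraction stub) are illustrative.

```python
from functools import partial


def extract_to_sqlite(dataset: str, data_format: str) -> str:
    """Stand-in for the real DBF/XBRL -> SQLite extraction step."""
    return f"{dataset}_{data_format}.sqlite"


DATASETS = ["ferc1", "ferc2", "ferc6", "ferc60", "ferc714"]
FORMATS = ["dbf", "xbrl"]

# Build one named job per (dataset, format) pair in a loop; the older
# FERC 714 data is not distributed as DBF, so that pair is skipped.
jobs = {
    f"{ds}_{fmt}_to_sqlite": partial(extract_to_sqlite, ds, fmt)
    for ds in DATASETS
    for fmt in FORMATS
    if not (ds == "ferc714" and fmt == "dbf")
}

# Re-running a single extraction in isolation is then just one lookup:
output = jobs["ferc2_dbf_to_sqlite"]()
```

Generating the registry in a loop keeps boilerplate low while still giving each dataset/format combination its own independently runnable unit, which is the development workflow benefit discussed above.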

Member


Sure, I think that's beyond the scope of this PR for right now, but it'd be good to make it more dynamic / granular at some point. And we don't extract the FERC DBs very frequently, so this won't come up too often.

Review thread (resolved) on src/pudl/extract/ferc2.py
@rousik

rousik commented Jun 5, 2023

When attempting to diff the output, I found that Form 2 requires the use of --clobber, because the current production dataset does not contain the part=N partition annotation, which causes non-uniqueness violations when opening the archive. So perhaps we should bump the production versions before merging this.

@zaneselvans

Ah, you mean the production raw data archive on Zenodo? Let me try and update it.

@zaneselvans

New FERC 2 archive up here: https://zenodo.org/record/8006881

@rousik

rousik commented Jun 7, 2023

I have run against non-sandboxed archives with no issues and no diffs on the existing datasets. This should be good to go.

Note that I have added a disabled configuration field to a bunch of the ferc_to_sqlite configurations so that we can individually disable/enable these during local development. I don't expect this to be used in production scenarios, but for local point-wise testing it's quite helpful.


@zaneselvans zaneselvans left a comment


Exciting!

@zaneselvans

Oh, probably want to add the new FERC 2 SQLite DB to the list of DBs which are published to Datasette in devtools/datasette/publish.sh

@zaneselvans zaneselvans self-requested a review June 7, 2023 22:26

@zaneselvans zaneselvans left a comment


I ran the ETL locally including the new FERC 2 DBF to SQLite conversion and I inspected 10 tables and found that none of them contained data from before 2004. Looking at the input file sizes, it seems like the archives become sizeable (~100MB) starting in 2002 and are only a few MB each from 1996-2001, but that doesn't seem to line up with the lack of data in 2002-2003 so I'm wondering if something is wrong with the extraction of earlier tables? Is there a reason why we might expect to see this behavior?

import pandas as pd
import sqlalchemy as sa

ferc2_db_path = "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc2.sqlite"
ferc2_engine = sa.create_engine(ferc2_db_path)
md = sa.MetaData()
md.reflect(bind=ferc2_engine)
ferc2_dfs = {}
for t in md.tables:
    ferc2_dfs[t] = pd.read_sql(t, ferc2_engine)
{t: ferc2_dfs[t].report_yr.min() for t in ferc2_dfs if "report_yr" in ferc2_dfs[t].columns}
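A per-year row count can complement the minimum-year check above by showing where exactly the data drops off (e.g. whether 2002-2003 are empty or just thin). This stdlib-only sketch assumes a hypothetical table name and a report_yr column; the real FERC 2 table names may differ.

```python
import sqlite3


def rows_per_year(db_path: str, table: str) -> dict[int, int]:
    """Count rows per report year in one table of a SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            f"SELECT report_yr, COUNT(*) FROM {table} "
            "GROUP BY report_yr ORDER BY report_yr"
        )
        return dict(cur.fetchall())
    finally:
        conn.close()
```

Running this over a few tables (e.g. `rows_per_year("ferc2.sqlite", "f2_s0_filing_log")`, table name hypothetical) would distinguish "no rows before 2004" from "few rows before 2004", which is the question raised above.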

@zaneselvans zaneselvans self-requested a review June 8, 2023 14:37
@rousik rousik merged commit 8df08c4 into dev Jun 8, 2023
10 checks passed
@zaneselvans zaneselvans deleted the rousik-ferc2 branch September 12, 2023 06:29
Successfully merging this pull request may close these issues.

Generalize Visual FoxPro / DBF extraction code and separate it from FERC Form 1