Integrate FERC Form 2 dbf formats into ferc_to_sqlite #2564
Conversation
This is very much not working yet, but it lays the groundwork. The first issue encountered: multiple files/resources can be retrieved for a given year. Refactoring of DbfReader might be necessary to support this more generalized problem.
1. move PK/FK logic into new ferc.py shared file
2. embed partition information in PudlResourceKey object
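As a rough sketch of the second point (the real `PudlResourceKey` in PUDL may differ; field names and the DOI here are assumptions), embedding partition metadata in the key could look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PudlResourceKey:
    """Identifies a single downloadable resource within a Zenodo archive.

    The partitions (e.g. year=1996, data_format="dbf") let callers select
    resources by more than just the year, which matters once a single year
    can have both DBF and XBRL resources.
    """

    dataset: str
    doi: str
    name: str
    partitions: tuple = ()  # sorted (key, value) pairs, so the key stays hashable

    @classmethod
    def with_partitions(cls, dataset: str, doi: str, name: str, **parts):
        return cls(dataset, doi, name, tuple(sorted(parts.items())))

    def matches(self, **filters) -> bool:
        """True if every filter matches this resource's partitions."""
        parts = dict(self.partitions)
        return all(parts.get(k) == v for k, v in filters.items())
```

With this shape, downstream code can loop over valid partitions and filter resources instead of assuming one resource per year.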
This now handles the processing by looping over valid partitions rather than iterating over the years. It turns out this may not be necessary, because the split-archive years have very strange formats anyways that will need totally different approach.
1. added is_disabled property to dataset settings, which can be set specifically on ferc1 and ferc2 dbf settings (not globally applied and/or enforced, but it would be nice)
2. construct ferc extractors in a loop
3. pull the right config out of the global FercToSqliteSettings to facilitate (2)
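A minimal sketch of how those three pieces could fit together, assuming simplified settings classes (the real PUDL settings are pydantic models and differ in detail):

```python
from dataclasses import dataclass, field


@dataclass
class FercDbfSettings:
    years: list[int] = field(default_factory=list)
    disabled: bool = False

    @property
    def is_disabled(self) -> bool:
        # A dataset with no years to process is effectively disabled too.
        return self.disabled or not self.years


@dataclass
class FercToSqliteSettings:
    ferc1_dbf: FercDbfSettings = field(default_factory=FercDbfSettings)
    ferc2_dbf: FercDbfSettings = field(default_factory=FercDbfSettings)

    def active_dbf_datasets(self):
        """Yield (name, settings) for every enabled DBF dataset, so the
        per-form extractors can be constructed in a single loop."""
        for name in ("ferc1_dbf", "ferc2_dbf"):
            settings = getattr(self, name)
            if not settings.is_disabled:
                yield name, settings
```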
For more information, see https://pre-commit.ci
@@ -350,13 +347,19 @@ def dbf2sqlite(context) -> None:
    """Clone the FERC Form 1 Visual FoxPro databases into SQLite."""
Perhaps it would be better to use one @op per dataset here rather than clumping all of this into a single dbf2sqlite?
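The suggestion above could look roughly like this in plain Python (a sketch only; the real refactor would use dagster's @op decorator and config machinery, and all names here are assumptions):

```python
def make_dbf2sqlite_op(dataset: str):
    """Build a per-dataset conversion function, so each FERC form gets its
    own op instead of one monolithic dbf2sqlite. In real code this would be
    wrapped with dagster's @op and pull settings from the op context."""

    def dbf2sqlite_op(extract_fn):
        # extract_fn stands in for the dataset-specific DBF extractor.
        return {"dataset": dataset, "result": extract_fn(dataset)}

    dbf2sqlite_op.__name__ = f"{dataset}_dbf2sqlite"
    return dbf2sqlite_op


# One op per FERC form rather than a single clump:
ops = {ds: make_dbf2sqlite_op(ds) for ds in ("ferc1", "ferc2")}
```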
Yeah, I totally agree. I think this is an artifact of only having one dataset to work with early on. Especially as the amount of XBRL data grows over time, being able to separate them out will be very helpful in development, since the XBRL processing is quite slow compared to DBF (though also much more parallelizable currently).
Maybe @bendnorman has some thoughts on how / whether we should break the different FERC forms out into individual jobs / assets.
I think that for the purpose of expediency, it might be a good idea to do this separately from this PR. I'm a big fan of gradually improving the code rather than solving all the problems at once (which can drag on).
@rousik is the comment on this PR still accurate? It looks more like it allows data from 1996-2021 to be converted from DBF to SQLite. Does it do the older years too?
Updated the comment. Prior to 1996, the split respondent archives are just a collection of DAT files; they don't seem to be DBF at all.
@rousik it looks like some of the integration tests need to be updated to work with the new general purpose DBF extractor.
There are a bunch of errors due to "Multiple resources found", which I think are probably because there's both XBRL and DBF data for Q3 and Q4 of 2020. This will be more and more of an issue over time, as revisions to the old DBF data are published in XBRL and we end up with both DBF and small XBRL resources in the Zenodo archives for more and more years. So we'll just need to be good about always specifying the format as well as the year. I don't think this was an issue before because we hard-coded the block of years served by each dataset (which won't work going forward, given the older data revisions).
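The fix amounts to filtering candidate resources on both partitions and failing loudly when the match is still ambiguous. A hedged sketch (function and field names are assumptions, not PUDL's actual API):

```python
def match_resource(resources: list[dict], year: int, data_format: str) -> dict:
    """Select exactly one archive resource for (year, data_format).

    Filtering on the year alone is ambiguous once a year has both a DBF
    original and an XBRL revision in the Zenodo archive.
    """
    hits = [
        r for r in resources
        if r["year"] == year and r["data_format"] == data_format
    ]
    if not hits:
        raise KeyError(f"No resource for year={year}, format={data_format}")
    if len(hits) > 1:
        raise ValueError(
            f"Multiple resources found for year={year}, format={data_format}"
        )
    return hits[0]
```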
1. split off ferc_dbf tests into a separate suite
2. adapt for the recent code changes
3. split off the fixtures so that dbf/xbrl can be tested independently without having to run the other data extraction
Codecov Report
Patch coverage:
Additional details and impacted files:

```
@@           Coverage Diff           @@
##             dev   #2564     +/-   ##
=======================================
  Coverage   86.9%   86.9%
=======================================
  Files         84      86      +2
  Lines       9720    9809     +89
=======================================
+ Hits        8447    8525     +78
- Misses      1273    1284     +11
```

☔ View full report in Codecov by Sentry.
I love how concise the per-form extractors ultimately are! Just a little metadata compilation and some optional fixes.
I left a bunch of comments, but it seems like maybe one higher-level design question we both have is whether it would be better to split out the individual FERC forms as separately addressable jobs / ops / processes rather than lumping all the DBF and all the XBRL together.
Returns:
    The job definition to be executed.
"""
if not (enable_xbrl or enable_dbf):
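For context, a guard like the one above would typically sit in a job-factory function. A hedged sketch in plain Python (the real ferc_to_sqlite code builds dagster job definitions; names here are assumptions):

```python
def ferc_to_sqlite_job_factory(
    enable_xbrl: bool = True, enable_dbf: bool = True
) -> list[str]:
    """Return the list of conversion steps the job should run.

    Refusing to build a job that would do nothing surfaces configuration
    mistakes early, instead of silently producing an empty run.
    """
    if not (enable_xbrl or enable_dbf):
        raise ValueError("either dbf or xbrl (or both) must be enabled")
    steps = []
    if enable_dbf:
        steps.append("dbf2sqlite")
    if enable_xbrl:
        steps.append("xbrl2sqlite")
    return steps
```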
The way we're splitting up or specifying which data to translate into SQLite feels messy to me in general. Not sure if there's a better way to do it though.
Right now in the ETL settings we ask the user for datasets and years, and then internally figure out whether that means translating from XBRL or DBF, which seems correct: ideally the user shouldn't need to know which source the data is coming from. But with 5 FERC forms (1, 2, 6, 60, 714) involved and 2 data formats for each of them (though the older 714 data isn't distributed as DBF for some reason), it feels like having some clear enumeration of the chunks of work to be done internally would be more readable than having this high-level job switch that says "do DBF" or "do XBRL", plus another layer only visible in the settings that says which datasets to translate, plus yet another layer beneath that indicating which partitions (years) to translate.
I guess what I'm imagining is that instead of just pulling the DBF vs. XBRL processing up to this high level, it might also make sense to pull both the dataset and data format up to this level and treat each dataset + data format as a separate job, since they'll each result in a separate SQLite output, and in a development context we'll often just want to re-run one of them in isolation. What do you think?
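One way to make that enumeration explicit (a sketch under assumed names, not PUDL's actual structure): treat every dataset + data format pair as its own named job, each producing its own SQLite output:

```python
from itertools import product

DATASETS = ("ferc1", "ferc2", "ferc6", "ferc60", "ferc714")
FORMATS = ("dbf", "xbrl")
# The older FERC 714 data isn't distributed as DBF, so that pair is skipped.
SKIP = {("ferc714", "dbf")}


def enumerate_jobs() -> list[str]:
    """Name one job per dataset + data format, e.g. 'ferc2_dbf_to_sqlite',
    so a single job can be re-run in isolation during development."""
    return [
        f"{dataset}_{fmt}_to_sqlite"
        for dataset, fmt in product(DATASETS, FORMATS)
        if (dataset, fmt) not in SKIP
    ]
```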
Yeah. I agree that having one job per dataset/format combination would be the most natural structure. Let me refactor this accordingly.
After some early tinkering with this, I feel like I'll need to dig deeper into dagster, how it works, and how to properly structure this. I think it should be possible to auto-generate some of these dagster ops/graphs programmatically, which should reduce the boilerplate, but it might take a little extra effort and care to do so.
My suggestion would be to proceed with this PR as-is and follow up with cleanup/refactoring independently. What do you think about that? I can file an issue to track this.
Sure, I think that's beyond the scope of this PR for right now, but it'd be good to make it more dynamic / granular at some point. And we don't extract the FERC DBs very frequently, so this won't come up too often.
For more information, see https://pre-commit.ci
When attempting to diff the output, I found out that form 2 requires use of …
Ah, you mean the production raw data archive on Zenodo? Let me try and update it.
New FERC 2 archive up here: https://zenodo.org/record/8006881
I have run this against the non-sandboxed archives with no issues and no diffs on the existing datasets. This should be good to go. Note that I have added …
Exciting!
Oh, we probably want to add the new FERC 2 SQLite DB to the list of DBs which are published to Datasette in …
I ran the ETL locally, including the new FERC 2 DBF to SQLite conversion, and I inspected 10 tables and found that none of them contained data from before 2004. Looking at the input file sizes, the archives become sizeable (~100MB) starting in 2002 and are only a few MB each from 1996-2001, but that doesn't line up with the lack of data in 2002-2003, so I'm wondering if something is wrong with the extraction of earlier tables? Is there a reason why we might expect to see this behavior?
```python
import pandas as pd
import sqlalchemy as sa

ferc2_db_path = "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc2.sqlite"
ferc2_engine = sa.create_engine(ferc2_db_path)

# Reflect all tables from the freshly converted FERC 2 DB:
md = sa.MetaData()
md.reflect(bind=ferc2_engine)

ferc2_dfs = {}
for t in md.tables:
    ferc2_dfs[t] = pd.read_sql(t, ferc2_engine)

# Earliest report year appearing in each table:
{
    t: ferc2_dfs[t].report_yr.min()
    for t in ferc2_dfs
    if "report_yr" in ferc2_dfs[t].columns
}
```
This adds the capability to extract FERC Form 2 years 1996-2020 from dbf archives.