Integrate FERC Form 2 dbf formats into ferc_to_sqlite #2564
Conversation
This is very much not working yet, but it lays the groundwork. The first issue encountered: multiple files/resources can be retrieved for a given year. Refactoring of DbfReader might be necessary to support this more generalized problem.
1. move PK/FK logic into new ferc.py shared file
2. embed partition information in PudlResourceKey object
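As a rough sketch of the second point (the real `PudlResourceKey` in PUDL may differ; field names and the DOI here are assumptions), embedding partition metadata in the key could look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PudlResourceKey:
    """Identifies a single downloadable resource within a Zenodo archive.

    The partitions (e.g. year=1996, data_format="dbf") let callers select
    resources by more than just the year, which matters once a single year
    can have both DBF and XBRL resources.
    """

    dataset: str
    doi: str
    name: str
    partitions: tuple = ()  # sorted (key, value) pairs, so the key stays hashable

    @classmethod
    def with_partitions(cls, dataset: str, doi: str, name: str, **parts):
        return cls(dataset, doi, name, tuple(sorted(parts.items())))

    def matches(self, **filters) -> bool:
        """True if every filter matches this resource's partitions."""
        parts = dict(self.partitions)
        return all(parts.get(k) == v for k, v in filters.items())
```

With this shape, downstream code can loop over valid partitions and filter resources instead of assuming one resource per year.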
This now handles the processing by looping over valid partitions rather than iterating over the years. It turns out this may not be necessary, because the split-archive years have very strange formats anyways that will need totally different approach.
1. added is_disabled property to dataset settings, which can be set specifically on ferc1 and ferc2 dbf settings (not globally applied and/or enforced, but it would be nice)
2. construct ferc extractors in a loop
3. pull the right config out of the global FercToSqliteSettings to facilitate (2)
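A minimal sketch of how those three pieces could fit together, assuming simplified settings classes (the real PUDL settings are pydantic models and differ in detail):

```python
from dataclasses import dataclass, field


@dataclass
class FercDbfSettings:
    years: list[int] = field(default_factory=list)
    disabled: bool = False

    @property
    def is_disabled(self) -> bool:
        # A dataset with no years to process is effectively disabled too.
        return self.disabled or not self.years


@dataclass
class FercToSqliteSettings:
    ferc1_dbf: FercDbfSettings = field(default_factory=FercDbfSettings)
    ferc2_dbf: FercDbfSettings = field(default_factory=FercDbfSettings)

    def active_dbf_datasets(self):
        """Yield (name, settings) for every enabled DBF dataset, so the
        per-form extractors can be constructed in a single loop."""
        for name in ("ferc1_dbf", "ferc2_dbf"):
            settings = getattr(self, name)
            if not settings.is_disabled:
                yield name, settings
```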
For more information, see https://pre-commit.ci
@@ -350,13 +347,19 @@ def dbf2sqlite(context) -> None:
    """Clone the FERC Form 1 Visual FoxPro databases into SQLite."""
Perhaps it would be better to use one @op per dataset here rather than clumping all of this into a single dbf2sqlite?
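The suggestion above could look roughly like this in plain Python (a sketch only; the real refactor would use dagster's @op decorator and config machinery, and all names here are assumptions):

```python
def make_dbf2sqlite_op(dataset: str):
    """Build a per-dataset conversion function, so each FERC form gets its
    own op instead of one monolithic dbf2sqlite. In real code this would be
    wrapped with dagster's @op and pull settings from the op context."""

    def dbf2sqlite_op(extract_fn):
        # extract_fn stands in for the dataset-specific DBF extractor.
        return {"dataset": dataset, "result": extract_fn(dataset)}

    dbf2sqlite_op.__name__ = f"{dataset}_dbf2sqlite"
    return dbf2sqlite_op


# One op per FERC form rather than a single clump:
ops = {ds: make_dbf2sqlite_op(ds) for ds in ("ferc1", "ferc2")}
```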
Yeah, I totally agree. I think this is an artifact of only having one dataset to work with early on. Especially as the amount of XBRL data grows over time, being able to separate them out will be very helpful in development, since the XBRL processing is quite slow compared to DBF (though also much more parallelizable currently).
Maybe @bendnorman has some thoughts on how / whether we should break the different FERC forms out into individual jobs / assets.
I think that for the purpose of expediency, it might be a good idea to do this separately from this PR. I'm a big fan of gradually improving the code rather than solving all the problems at once (which can drag on).
@rousik is the comment on this PR still accurate? It looks more like it allows data from 1996-2021 to be converted from DBF to SQLite. Does it do the older years too?
Updated the comment. Prior to 1996, the split respondent archives are just a collection of DAT files; they don't seem to be DBF at all.
@rousik it looks like some of the integration tests need to be updated to work with the new general purpose DBF extractor.
There are a bunch of errors due to "Multiple resources found", which I think are probably because there's both XBRL and DBF data for Q3 and Q4 of 2020. This will be more and more of an issue over time, as revisions to the old DBF data are published in XBRL and we end up with both DBF and small XBRL resources in the Zenodo archives for more and more years. So we'll just need to be good about always specifying the format as well as the year. I don't think this was an issue before because we hard-coded the block of years served by each dataset (which won't work going forward, given the older data revisions).
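The fix amounts to filtering candidate resources on both partitions and failing loudly when the match is still ambiguous. A hedged sketch (function and field names are assumptions, not PUDL's actual API):

```python
def match_resource(resources: list[dict], year: int, data_format: str) -> dict:
    """Select exactly one archive resource for (year, data_format).

    Filtering on the year alone is ambiguous once a year has both a DBF
    original and an XBRL revision in the Zenodo archive.
    """
    hits = [
        r for r in resources
        if r["year"] == year and r["data_format"] == data_format
    ]
    if not hits:
        raise KeyError(f"No resource for year={year}, format={data_format}")
    if len(hits) > 1:
        raise ValueError(
            f"Multiple resources found for year={year}, format={data_format}"
        )
    return hits[0]
```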
1. split off ferc_dbf tests into a separate suite
2. adapt for the recent code changes
3. split off the fixtures so that dbf/xbrl can be tested independently without having to run the other data extraction
Codecov Report
Patch coverage:
Additional details and impacted files:

```
@@           Coverage Diff           @@
##             dev   #2564     +/-   ##
=======================================
  Coverage   86.9%   86.9%
=======================================
  Files         84      86      +2
  Lines       9720    9809     +89
=======================================
+ Hits        8447    8525     +78
- Misses      1273    1284     +11
```

☔ View full report in Codecov by Sentry.
I love how concise the per-form extractors ultimately are! Just a little metadata compilation and some optional fixes.
I left a bunch of comments, but it seems like maybe one higher-level design question we both have is whether it would be better to split out the individual FERC forms as separately addressable jobs / ops / processes rather than lumping all the DBF and all the XBRL together.
Returns:
    The job definition to be executed.
"""
if not (enable_xbrl or enable_dbf):
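For context, a guard like the one above would typically sit in a job-factory function. A hedged sketch in plain Python (the real ferc_to_sqlite code builds dagster job definitions; names here are assumptions):

```python
def ferc_to_sqlite_job_factory(
    enable_xbrl: bool = True, enable_dbf: bool = True
) -> list[str]:
    """Return the list of conversion steps the job should run.

    Refusing to build a job that would do nothing surfaces configuration
    mistakes early, instead of silently producing an empty run.
    """
    if not (enable_xbrl or enable_dbf):
        raise ValueError("either dbf or xbrl (or both) must be enabled")
    steps = []
    if enable_dbf:
        steps.append("dbf2sqlite")
    if enable_xbrl:
        steps.append("xbrl2sqlite")
    return steps
```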
The way we're splitting up or specifying which data to translate into SQLite feels messy to me in general. Not sure if there's a better way to do it though.
Right now in the ETL settings we ask the user for datasets and years, and then internally figure out whether that means translating from XBRL or DBF, which seems correct: ideally the user shouldn't need to know which source the data is coming from. But with 5 FERC forms (1, 2, 6, 60, 714) involved and 2 data formats for each of them (though the older 714 data isn't distributed as DBF for some reason), it feels like having some clear enumeration of the chunks of work to be done internally would be more readable than having this high-level job switch that says "do DBF" or "do XBRL", plus another layer only visible in the settings that says which datasets to translate, plus yet another layer beneath that indicating which partitions (years) to translate.
I guess what I'm imagining is that instead of just pulling the DBF vs. XBRL processing up to this high level, it might also make sense to pull both the dataset and data format up to this level and treat each dataset + data format as a separate job, since they'll each result in a separate SQLite output, and in a development context we'll often just want to re-run one of them in isolation. What do you think?
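One way to make that enumeration explicit (a sketch under assumed names, not PUDL's actual structure): treat every dataset + data format pair as its own named job, each producing its own SQLite output:

```python
from itertools import product

DATASETS = ("ferc1", "ferc2", "ferc6", "ferc60", "ferc714")
FORMATS = ("dbf", "xbrl")
# The older FERC 714 data isn't distributed as DBF, so that pair is skipped.
SKIP = {("ferc714", "dbf")}


def enumerate_jobs() -> list[str]:
    """Name one job per dataset + data format, e.g. 'ferc2_dbf_to_sqlite',
    so a single job can be re-run in isolation during development."""
    return [
        f"{dataset}_{fmt}_to_sqlite"
        for dataset, fmt in product(DATASETS, FORMATS)
        if (dataset, fmt) not in SKIP
    ]
```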
Yeah. I agree that having one job per dataset/format combination would be the most natural structure. Let me refactor this accordingly.
After some early tinkering with this, I feel like I'll need to dig deeper into dagster, how it works, and how to properly structure this. I think it should be possible to auto-generate some of these dagster ops/graphs programmatically, which should reduce the boilerplate, but it might take a little extra effort and care to do so.
My suggestion would be to proceed with this PR as-is and follow up with cleanup/refactoring independently. What do you think about that? I can file an issue to track this.
Sure, I think that's beyond the scope of this PR for right now, but it'd be good to make it more dynamic / granular at some point. And we don't extract the FERC DBs very frequently, so this won't come up too often.
For more information, see https://pre-commit.ci
When attempting to diff the output, I found out that form 2 requires use of …
Ah, you mean the production raw data archive on Zenodo? Let me try and update it.
New FERC 2 archive up here: https://zenodo.org/record/8006881
I have run this against the non-sandboxed archives with no issues and no diffs on the existing datasets. This should be good to go. Note that I have added …
Exciting!
Oh, we probably want to add the new FERC 2 SQLite DB to the list of DBs which are published to Datasette in …
I ran the ETL locally, including the new FERC 2 DBF to SQLite conversion, and I inspected 10 tables and found that none of them contained data from before 2004. Looking at the input file sizes, the archives become sizeable (~100MB) starting in 2002 and are only a few MB each from 1996-2001, but that doesn't line up with the lack of data in 2002-2003, so I'm wondering if something is wrong with the extraction of earlier tables? Is there a reason why we might expect to see this behavior?
```python
import pandas as pd
import sqlalchemy as sa

ferc2_db_path = "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc2.sqlite"
ferc2_engine = sa.create_engine(ferc2_db_path)

# Reflect all tables from the freshly converted FERC 2 DB:
md = sa.MetaData()
md.reflect(bind=ferc2_engine)

ferc2_dfs = {}
for t in md.tables:
    ferc2_dfs[t] = pd.read_sql(t, ferc2_engine)

# Earliest report year appearing in each table:
{
    t: ferc2_dfs[t].report_yr.min()
    for t in ferc2_dfs
    if "report_yr" in ferc2_dfs[t].columns
}
```
This adds the capability to extract FERC Form 2 years 1996-2020 from dbf archives.