Rename dbf-derived FERC SQLite DBs #3094
Conversation
Thank you for doing this, it's been bothering me for months!
Looks good, a few non-blocking questions for my own edification.
```diff
 if __name__ == "__main__":
-    main()
+    sys.exit(deploy_datasette())
```
non-blocking: any reason you prefer `sys.exit()` with 1 and 0 vs. throwing an error in the "somehow got bad deployment destination" case?
No, I think it's just an old Unix habit. I tested this and actually what happens is that click
catches the invalid input and gives a nice helpful error message.
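For context, here is a minimal sketch (with hypothetical argument names and destinations) of how a `click` command validates a constrained argument, so a manual `sys.exit(1)` branch for bad input never fires:

```python
import click


@click.command()
@click.argument("deploy", type=click.Choice(["local", "production"]))
def deploy_datasette(deploy: str) -> int:
    """Deploy Datasette to the chosen destination (hypothetical sketch)."""
    click.echo(f"Deploying to {deploy}...")
    return 0


# click validates the argument itself: an invalid choice prints a usage
# message and exits with code 2 before the function body ever runs, so no
# manual "bad destination" branch is needed.
rv = deploy_datasette.main(["production"], standalone_mode=False)
```

With `standalone_mode=False` the command's return value is passed back to the caller, which is what makes the `sys.exit(deploy_datasette())` pattern in the script work.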
```python
    "eia923",
    "ferc1",
```
non-blocking: What would it take to change these data_source_ids to `fercX_dbf` also?
This would be a significant change. Right now the DBF and XBRL data are both subsumed within the higher level conceptual data source known as `ferc1` (or whatever), and each conceptual data source has a bunch of metadata associated with it. Breaking those objects down into smaller pieces that understand the internal complexities of where the data for different years come from would be substantial work, since the current idea of a "data source" shows up in a lot of different places.

That said, looking at all the complexity involved in bringing together the information required to document the databases in Datasette, there's clearly room for improvement in our overall metadata modeling.
Whoops. Didn't submit my own stupid review first.
```python
from pathlib import Path
from subprocess import check_call, check_output

import click
```
We already depend on click indirectly, and I'm working on migrating all of our CLIs to using it as a side project...
```rst
.. note::

    This script pulls *all* of the FERC Form 1 DBF data into a *single* database, but
    FERC distributes a *separate* database for each year. Virtually all the database
    tables contain a ``report_year`` column that indicates which year they came from,
    preventing collisions between records in the merged multi-year database. One notable
    exception is the ``f1_respondent_id`` table, which maps ``respondent_id`` to the
    names of the respondents. For that table, we have allowed the most recently reported
    record to take precedence, overwriting previous mappings if they exist.

.. note::

    There are a handful of ``respondent_id`` values that appear in the FERC Form 1
    database tables but do not show up in ``f1_respondent_id``. This renders the foreign
    key relationships between those tables invalid. During the database cloning process
    we add these ``respondent_id`` values to the ``f1_respondent_id`` table with a
    ``respondent_name`` indicating that the ID was filled in by PUDL.
```
This was all couched as specific to FERC 1, and felt quite a bit in the weeds, and so I removed it.
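The "most recently reported record takes precedence" behavior described in the removed note can be sketched with SQLite's upsert syntax. The table and column names come from the note itself; the sample rows and the loading loop are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE f1_respondent_id ("
    "respondent_id INTEGER PRIMARY KEY, respondent_name TEXT)"
)

# Rows as they might appear in successive annual DBF databases.
yearly_rows = [
    (1, "Old Utility Name"),  # e.g. from an earlier year's database
    (1, "New Utility Name"),  # e.g. from the most recent year's database
]

# Later years are loaded last, so the upsert lets the most recently
# reported name overwrite any earlier mapping for the same respondent_id.
for respondent_id, respondent_name in yearly_rows:
    conn.execute(
        "INSERT INTO f1_respondent_id (respondent_id, respondent_name) "
        "VALUES (?, ?) "
        "ON CONFLICT (respondent_id) DO UPDATE SET "
        "respondent_name = excluded.respondent_name",
        (respondent_id, respondent_name),
    )

name = conn.execute(
    "SELECT respondent_name FROM f1_respondent_id WHERE respondent_id = 1"
).fetchone()[0]
```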
```rst
* `FERC Form 1 (DBF) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc1.sqlite>`__
* `FERC Form 1 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc1_xbrl.sqlite>`__
* `FERC Form 2 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc2_xbrl.sqlite>`__
* `FERC Form 6 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc6_xbrl.sqlite>`__
* `FERC Form 60 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc60_xbrl.sqlite>`__
* `FERC Form 714 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc714_xbrl.sqlite>`__
```
Missed these links when I updated the downloads to point at the compressed outputs earlier this week...
```
This is a thin wrapper around the GDAL ogr2ogr command line tool. We use it
to convert the Census DP1 data which is distributed as an ESRI GeoDB into an
SQLite DB. The module provides ogr2ogr with the Census DP1 data from the
PUDL datastore, and directs it to be output into the user's SQLite directory
alongside our other SQLite databases (ferc1.sqlite and pudl.sqlite).

Note that the ogr2ogr command line utility must be available on the user's
system for this to work. This tool is part of the ``pudl-dev`` conda
environment, but if you are using PUDL outside of the conda environment, you
will need to install ogr2ogr separately. On Debian Linux based systems such
as Ubuntu it can be installed with ``sudo apt-get install gdal-bin`` (which
is what we do in our CI setup and Docker images).
```
Pared this down since we're not intending folks to really install it as a package anymore.
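As a sketch of what such a thin wrapper boils down to (the file paths and helper function are hypothetical; `-f SQLite` is ogr2ogr's output-driver flag), the module essentially builds and runs a command like this:

```python
from pathlib import Path
from subprocess import check_call


def ogr2ogr_cmd(geodb: Path, sqlite_out: Path) -> list[str]:
    """Build an ogr2ogr invocation converting an ESRI GeoDB to SQLite."""
    return [
        "ogr2ogr",
        "-f", "SQLite",   # output driver
        str(sqlite_out),  # destination SQLite DB
        str(geodb),       # source ESRI GeoDB
    ]


cmd = ogr2ogr_cmd(Path("census_dp1.gdb"), Path("censusdp1tract.sqlite"))
# check_call(cmd)  # would require gdal-bin to be installed on the system
```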
```
An XBRL context includes an entity ID, the time period the data applies to, and
other dimensions such as utility type. Each context has its own ID, but they are
frequently redefined with the same contents but different IDs - so we identify
them by their actual content.
```
All the changes below are just re-wrapping text.
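Identifying contexts by content rather than by their filing-assigned IDs can be sketched with a hashable frozen dataclass (the class and field names here are hypothetical, not the actual PUDL implementation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XbrlContext:
    """Content-based identity for an XBRL context."""

    entity_id: str
    start_date: str
    end_date: str
    utility_type: str


# Two contexts defined under different IDs in a filing but with identical
# contents compare (and hash) as equal, so they deduplicate naturally.
c1 = XbrlContext("C000123", "2021-01-01", "2021-12-31", "electric")
c2 = XbrlContext("C000123", "2021-01-01", "2021-12-31", "electric")
unique_contexts = {c1, c2}
```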
```
@@ -167,6 +169,45 @@ databases:
    {%- endif %}
  {%- endfor %}
{%- endfor %}

ferc2_dbf:
```
Added DBF database sections for the 2, 6, and 60 so that they have at least some basic preamble on their individual pages.
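Such a per-database section presumably looks something like this sketch of a Datasette `metadata.yml` entry (the `description` key is part of Datasette's metadata format; the wording here is hypothetical):

```yaml
databases:
  ferc2_dbf:
    description: >-
      Raw FERC Form 2 data, extracted from the original DBF filings
      distributed by FERC.
```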
```
...
├── pudl.sqlite
└── hourly_emissions_cems
    └── hourly_emissions_cems.parquet
```
Changed to reflect the new monolithic-only Parquet output.
```python
datasets = (
    ["pudl.sqlite"]
    + sorted(str(p.name) for p in pudl_out.glob("ferc*.sqlite"))
    + ["censusdp1tract.sqlite"]
)
```
It turns out that the ordering of this list is what controls the order in which the databases appear on the main Datasette page, so I manually ordered it to put PUDL first.
Co-authored-by: Dazhong Xia <dazhong.xia@catalyst.coop>
PR Overview

* The renamed databases are now suffixed with `_dbf` and `_xbrl` to indicate the source data format, so neither of them seems more generic.
* A `metadata.yml` file for testing / debugging purposes.

Tasks

* Run `make nuke` and see if anything breaks.
* Check that Datasette handles the `ferc*_dbf.sqlite` metadata correctly.

PR Checklist

* Merge the most recent version of the branch you are merging into (probably `dev`).