
Rename dbf-derived FERC SQLite DBs #3094

Merged: 16 commits into dev on Dec 1, 2023

Conversation

zaneselvans
Member

@zaneselvans zaneselvans commented Nov 29, 2023

PR Overview

  • Use more uniform naming of the FERC SQLite databases, with _dbf and _xbrl to indicate the source data format, so neither of them seems more generic.
  • Update the docs and download links to reflect the new database names.
  • Also removed some cruft in the docs related to usage modes we no longer support, and FERC Form 1-specific language that now applies to all the FERC forms we're extracting.
  • Refactor the Datasette deployment script to allow running it locally or just generating the metadata.yml file for testing / debugging purposes.
  • Updated the Jinja templates we use for Datasette Metadata to actually give the FERC 2/6/60 DBF databases a little bit of context.
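The refactored deployment flow described above could look roughly like this minimal sketch. The option names, choices, and placeholder metadata here are illustrative assumptions, not the script's actual interface:

```python
import sys

import click


@click.command()
@click.option(
    "--deploy",
    type=click.Choice(["production", "local", "metadata"]),
    default="local",
    help="Publish to production, run Datasette locally, or only emit metadata.yml.",
)
def deploy_datasette(deploy: str) -> int:
    """Generate Datasette metadata and optionally deploy it somewhere."""
    metadata_yml = "databases: {}\n"  # stand-in for the rendered Jinja template
    if deploy == "metadata":
        # Just dump metadata.yml for testing / debugging purposes.
        click.echo(metadata_yml)
        return 0
    click.echo(f"Deploying with target: {deploy}")
    return 0
```

In this pattern the script's entry point is `sys.exit(deploy_datasette())`, with click handling argument validation before the function body runs.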

Tasks

  1. Run make nuke and see if anything breaks.
  2. Update documentation references to the FERC DBs to reflect the rename.
  3. Update nightly build download links.
  4. Update references in docstrings and comments in the codebase.
  5. Update devtools and other maintained notebooks in the repo.
  6. Modify the Datasette publication script to allow metadata-only output and local deployment.
  7. Modify the Datasette metadata class to produce ferc*_dbf.sqlite metadata correctly.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans self-assigned this Nov 29, 2023
@zaneselvans zaneselvans linked an issue Nov 29, 2023 that may be closed by this pull request
@zaneselvans zaneselvans added labels Nov 29, 2023: ferc1 (Anything having to do with FERC Form 1), sqlite (Issues related to interacting with sqlite databases), ferc2 (Issues related to the FERC Form 2 dataset), dbf (Data coming from FERC's old Visual FoxPro DBF database file format), ferc6, ferc60, output (Exporting data from PUDL into other platforms or interchange formats).
@e-belfer
Member

Thank you for doing this, it's been bothering me for months!

@zaneselvans zaneselvans marked this pull request as ready for review November 30, 2023 21:27
Member

@jdangerx jdangerx left a comment


Looks good, a few non-blocking questions for my own edification. :shipit:



if __name__ == "__main__":
main()
sys.exit(deploy_datasette())
Member


non-blocking: any reason you prefer sys.exit() with 1 and 0 vs. throwing an error in the "somehow got bad deployment destination" case?

Member Author


No, I think it's just an old Unix habit. I tested this and actually what happens is that click catches the invalid input and gives a nice helpful error message.
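For example, with an illustrative toy command (not the actual deployment script), click's Choice type intercepts a bad value and exits with a usage error before the function body ever runs:

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--deploy", type=click.Choice(["local", "metadata", "production"]))
def cli(deploy: str):
    click.echo(f"deploying: {deploy}")


# Invoke with an invalid destination; click rejects it with a helpful message
# listing the valid choices, so the body never sees the bad value.
result = CliRunner().invoke(cli, ["--deploy", "bogus"])
print(result.exit_code)  # 2: click's standard usage-error exit code
print(result.output)
```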

docs/data_access.rst (review thread resolved)
docs/dev/clone_ferc1.rst (review thread resolved)
"eia923",
"ferc1",
Member


non-blocking: What would it take to change these data_source_ids to fercX_dbf also?

Member Author


This would be a significant change. Right now the DBF and XBRL data are both subsumed within the higher-level conceptual data source known as ferc1 (or whatever), and each conceptual data source has a bunch of metadata associated with it. Breaking those objects down into smaller pieces that understand the internal complexities of where the data for different years come from would be substantial work, since the current idea of a "data source" shows up in a lot of different places.

That said, looking at all the complexity that's involved in bringing together the information required to document the databases in datasette, there's clearly room for improvement in our overall metadata modeling.

Member Author

@zaneselvans zaneselvans left a comment


Whoops. Didn't submit my own stupid review first.

from pathlib import Path
from subprocess import check_call, check_output

import click
Member Author


We already depend on click indirectly, and I'm working on migrating all of our CLIs to using it as a side project...

Comment on lines -61 to -78

.. note::

This script pulls *all* of the FERC Form 1 DBF data into a *single* database, but
FERC distributes a *separate* database for each year. Virtually all the database
tables contain a ``report_year`` column that indicates which year they came from,
preventing collisions between records in the merged multi-year database. One notable
exception is the ``f1_respondent_id`` table, which maps ``respondent_id`` to the
names of the respondents. For that table, we have allowed the most recently reported
record to take precedence, overwriting previous mappings if they exist.

.. note::

There are a handful of ``respondent_id`` values that appear in the FERC Form 1
database tables but do not show up in ``f1_respondent_id``. This renders the foreign
key relationships between those tables invalid. During the database cloning process
we add these ``respondent_id`` values to the ``f1_respondent_id`` table with a
``respondent_name`` indicating that the ID was filled in by PUDL.
Member Author


This was all couched as specific to FERC 1, and felt quite a bit in the weeds, and so I removed it.

Comment on lines -31 to -36
* `FERC Form 1 (DBF) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc1.sqlite>`__
* `FERC Form 1 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc1_xbrl.sqlite>`__
* `FERC Form 2 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc2_xbrl.sqlite>`__
* `FERC Form 6 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc6_xbrl.sqlite>`__
* `FERC Form 60 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc60_xbrl.sqlite>`__
* `FERC Form 714 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc714_xbrl.sqlite>`__
Member Author


Missed these links when I updated the downloads to point at the compressed outputs earlier this week...

Comment on lines -3 to -14
This is a thin wrapper around the GDAL ogr2ogr command line tool. We use it
to convert the Census DP1 data which is distributed as an ESRI GeoDB into an
SQLite DB. The module provides ogr2ogr with the Census DP 1 data from the
PUDL datastore, and directs it to be output into the user's SQLite directory
alongside our other SQLite Databases (ferc1.sqlite and pudl.sqlite)

Note that the ogr2ogr command line utility must be available on the user's
system for this to work. This tool is part of the ``pudl-dev`` conda
environment, but if you are using PUDL outside of the conda environment, you
will need to install ogr2ogr separately. On Debian Linux based systems such
as Ubuntu it can be installed with ``sudo apt-get install gdal-bin`` (which
is what we do in our CI setup and Docker images.)
Member Author


Pared this down since we're not intending folks to really install it as a package anymore.
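The conversion the removed docstring described boils down to shelling out to ogr2ogr. A minimal sketch, assuming ogr2ogr is on the PATH (e.g. installed via gdal-bin); the function names and paths here are hypothetical:

```python
from pathlib import Path
from subprocess import check_call


def ogr2ogr_argv(geodb_path: Path, out_path: Path) -> list[str]:
    """Build the ogr2ogr command converting an ESRI GeoDB to SQLite.

    ogr2ogr takes the destination before the source: ogr2ogr -f FORMAT dst src
    """
    return ["ogr2ogr", "-f", "SQLite", str(out_path), str(geodb_path)]


def censusdp1tract_to_sqlite(geodb_path: Path, out_dir: Path) -> Path:
    """Convert the Census DP1 GeoDB into out_dir/censusdp1tract.sqlite."""
    out_path = out_dir / "censusdp1tract.sqlite"
    check_call(ogr2ogr_argv(geodb_path, out_path))
    return out_path
```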

Comment on lines -692 to +695
An XBRL context includes an entity ID, the time period the data applies
to, and other dimensions such as utility type. Each context has its own
ID, but they are frequently redefined with the same contents but
different IDs - so we identify them by their actual content.
An XBRL context includes an entity ID, the time period the data applies to, and
other dimensions such as utility type. Each context has its own ID, but they are
frequently redefined with the same contents but different IDs - so we identify
them by their actual content.
Member Author


All the changes below are just re-wrapping text.

@@ -167,6 +169,45 @@ databases:
{%- endif %}
{%- endfor %}
{%- endfor %}

ferc2_dbf:
Member Author


Added DBF database sections for the 2, 6, and 60 so that they have at least some basic preamble on their individual pages.
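The kind of information those new template sections carry can be sketched as the equivalent Python structure. The field values below are illustrative stand-ins, not the template's actual text; the keys (title, description_html, source, source_url) are per-database fields in Datasette's metadata schema:

```python
import json

# Illustrative shape of one new per-database entry in Datasette's metadata.yml,
# expressed as the equivalent Python dict.
ferc2_dbf_metadata = {
    "ferc2_dbf": {
        "title": "FERC Form 2 (DBF)",
        "description_html": (
            "<p>Raw FERC Form 2 data, extracted from FERC's old "
            "Visual FoxPro (DBF) database files.</p>"
        ),
        "source": "Federal Energy Regulatory Commission",
        "source_url": "https://www.ferc.gov/",
    }
}

print(json.dumps(ferc2_dbf_metadata, indent=2))
```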

...
├── pudl.sqlite
└── hourly_emissions_cems
└── hourly_emissions_cems.parquet
Member Author


Changed to reflect the new monolithic-only Parquet output.

Comment on lines +118 to 122
datasets = (
["pudl.sqlite"]
+ sorted(str(p.name) for p in pudl_out.glob("ferc*.sqlite"))
+ ["censusdp1tract.sqlite"]
)
Member Author


It turns out that the ordering of this list is what controls the order in which the databases appear on the main Datasette page, so I manually ordered it to put PUDL first.
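A tiny self-contained illustration of that ordering, with a hard-coded stand-in for the `pudl_out.glob(...)` results:

```python
from pathlib import Path

# Stand-in for pudl_out.glob("ferc*.sqlite"); in the real script this is a
# directory listing. Datasette lists databases in the order given, so
# pudl.sqlite is pinned to the front and the FERC DBs sort alphabetically.
ferc_dbs = [Path("ferc1_xbrl.sqlite"), Path("ferc1_dbf.sqlite")]

datasets = (
    ["pudl.sqlite"]
    + sorted(str(p.name) for p in ferc_dbs)
    + ["censusdp1tract.sqlite"]
)
print(datasets)
# → ['pudl.sqlite', 'ferc1_dbf.sqlite', 'ferc1_xbrl.sqlite', 'censusdp1tract.sqlite']
```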




zaneselvans and others added 2 commits November 30, 2023 17:32
@zaneselvans zaneselvans merged commit 85363b9 into dev Dec 1, 2023
5 of 6 checks passed
Development

Successfully merging this pull request may close these issues.

Make nightly build outputs easier to download and access remotely
3 participants