Rename dbf-derived FERC SQLite DBs #3094
Conversation
Thank you for doing this, it's been bothering me for months!
Looks good, a few non-blocking questions for my own edification.
```diff
 if __name__ == "__main__":
-    main()
+    sys.exit(deploy_datasette())
```
non-blocking: any reason you prefer `sys.exit()` with 1 and 0 vs. throwing an error in the "somehow got bad deployment destination" case?
No, I think it's just an old Unix habit. I tested this and actually what happens is that click
catches the invalid input and gives a nice helpful error message.
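For context, here is a minimal sketch (with hypothetical argument names and destinations) of how a `click` command validates a constrained argument, so a manual `sys.exit(1)` branch for bad input never fires:

```python
import click


@click.command()
@click.argument("deploy", type=click.Choice(["local", "production"]))
def deploy_datasette(deploy: str) -> int:
    """Deploy Datasette to the chosen destination (hypothetical sketch)."""
    click.echo(f"Deploying to {deploy}...")
    return 0


# click validates the argument itself: an invalid choice prints a usage
# message and exits with code 2 before the function body ever runs, so no
# manual "bad destination" branch is needed.
rv = deploy_datasette.main(["production"], standalone_mode=False)
```

With `standalone_mode=False` the command's return value is passed back to the caller, which is what makes the `sys.exit(deploy_datasette())` pattern in the script work.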
```python
    "eia923",
    "ferc1",
```
non-blocking: What would it take to change these data_source_ids to `fercX_dbf` also?
This would be a significant change. Right now the DBF and XBRL data are both subsumed within the higher level conceptual data source known as `ferc1` (or whatever), and each conceptual data source has a bunch of metadata associated with it. Breaking those objects down into smaller pieces that understand the internal complexities of where the data for different years come from would be substantial work, since the current idea of a "data source" shows up in a lot of different places.

That said, looking at all the complexity involved in bringing together the information required to document the databases in Datasette, there's clearly room for improvement in our overall metadata modeling.
Whoops. Didn't submit my own stupid review first.
```python
from pathlib import Path
from subprocess import check_call, check_output

import click
```
We already depend on click indirectly, and I'm working on migrating all of our CLIs to using it as a side project...
```rst
.. note::

    This script pulls *all* of the FERC Form 1 DBF data into a *single* database, but
    FERC distributes a *separate* database for each year. Virtually all the database
    tables contain a ``report_year`` column that indicates which year they came from,
    preventing collisions between records in the merged multi-year database. One notable
    exception is the ``f1_respondent_id`` table, which maps ``respondent_id`` to the
    names of the respondents. For that table, we have allowed the most recently reported
    record to take precedence, overwriting previous mappings if they exist.

.. note::

    There are a handful of ``respondent_id`` values that appear in the FERC Form 1
    database tables but do not show up in ``f1_respondent_id``. This renders the foreign
    key relationships between those tables invalid. During the database cloning process
    we add these ``respondent_id`` values to the ``f1_respondent_id`` table with a
    ``respondent_name`` indicating that the ID was filled in by PUDL.
```
This was all couched as specific to FERC 1, and felt quite a bit in the weeds, and so I removed it.
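The "most recently reported record takes precedence" behavior described in the removed note can be sketched with SQLite's upsert syntax. The table and column names come from the note itself; the sample rows and the loading loop are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE f1_respondent_id ("
    "respondent_id INTEGER PRIMARY KEY, respondent_name TEXT)"
)

# Rows as they might appear in successive annual DBF databases.
yearly_rows = [
    (1, "Old Utility Name"),  # e.g. from an earlier year's database
    (1, "New Utility Name"),  # e.g. from the most recent year's database
]

# Later years are loaded last, so the upsert lets the most recently
# reported name overwrite any earlier mapping for the same respondent_id.
for respondent_id, respondent_name in yearly_rows:
    conn.execute(
        "INSERT INTO f1_respondent_id (respondent_id, respondent_name) "
        "VALUES (?, ?) "
        "ON CONFLICT (respondent_id) DO UPDATE SET "
        "respondent_name = excluded.respondent_name",
        (respondent_id, respondent_name),
    )

name = conn.execute(
    "SELECT respondent_name FROM f1_respondent_id WHERE respondent_id = 1"
).fetchone()[0]
```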
```rst
* `FERC Form 1 (DBF) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc1.sqlite>`__
* `FERC Form 1 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc1_xbrl.sqlite>`__
* `FERC Form 2 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc2_xbrl.sqlite>`__
* `FERC Form 6 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc6_xbrl.sqlite>`__
* `FERC Form 60 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc60_xbrl.sqlite>`__
* `FERC Form 714 (XBRL) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/dev/ferc714_xbrl.sqlite>`__
```
Missed these links when I updated the downloads to point at the compressed outputs earlier this week...
```
This is a thin wrapper around the GDAL ogr2ogr command line tool. We use it
to convert the Census DP1 data which is distributed as an ESRI GeoDB into an
SQLite DB. The module provides ogr2ogr with the Census DP1 data from the
PUDL datastore, and directs it to be output into the user's SQLite directory
alongside our other SQLite databases (ferc1.sqlite and pudl.sqlite).

Note that the ogr2ogr command line utility must be available on the user's
system for this to work. This tool is part of the ``pudl-dev`` conda
environment, but if you are using PUDL outside of the conda environment, you
will need to install ogr2ogr separately. On Debian Linux based systems such
as Ubuntu it can be installed with ``sudo apt-get install gdal-bin`` (which
is what we do in our CI setup and Docker images).
```
Pared this down since we're not intending folks to really install it as a package anymore.
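As a sketch of what such a thin wrapper boils down to (the file paths and helper function are hypothetical; `-f SQLite` is ogr2ogr's output-driver flag), the module essentially builds and runs a command like this:

```python
from pathlib import Path
from subprocess import check_call


def ogr2ogr_cmd(geodb: Path, sqlite_out: Path) -> list[str]:
    """Build an ogr2ogr invocation converting an ESRI GeoDB to SQLite."""
    return [
        "ogr2ogr",
        "-f", "SQLite",   # output driver
        str(sqlite_out),  # destination SQLite DB
        str(geodb),       # source ESRI GeoDB
    ]


cmd = ogr2ogr_cmd(Path("census_dp1.gdb"), Path("censusdp1tract.sqlite"))
# check_call(cmd)  # would require gdal-bin to be installed on the system
```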
```
An XBRL context includes an entity ID, the time period the data applies to, and
other dimensions such as utility type. Each context has its own ID, but they are
frequently redefined with the same contents but different IDs - so we identify
them by their actual content.
```
All the changes below are just re-wrapping text.
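Identifying contexts by content rather than by their filing-assigned IDs can be sketched with a hashable frozen dataclass (the class and field names here are hypothetical, not the actual PUDL implementation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XbrlContext:
    """Content-based identity for an XBRL context."""

    entity_id: str
    start_date: str
    end_date: str
    utility_type: str


# Two contexts defined under different IDs in a filing but with identical
# contents compare (and hash) as equal, so they deduplicate naturally.
c1 = XbrlContext("C000123", "2021-01-01", "2021-12-31", "electric")
c2 = XbrlContext("C000123", "2021-01-01", "2021-12-31", "electric")
unique_contexts = {c1, c2}
```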
```
@@ -167,6 +169,45 @@ databases:
    {%- endif %}
  {%- endfor %}
{%- endfor %}

ferc2_dbf:
```
Added DBF database sections for the 2, 6, and 60 so that they have at least some basic preamble on their individual pages.
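Such a per-database section presumably looks something like this sketch of a Datasette `metadata.yml` entry (the `description` key is part of Datasette's metadata format; the wording here is hypothetical):

```yaml
databases:
  ferc2_dbf:
    description: >-
      Raw FERC Form 2 data, extracted from the original DBF filings
      distributed by FERC.
```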
```
...
├── pudl.sqlite
└── hourly_emissions_cems
    └── hourly_emissions_cems.parquet
```
Changed to reflect the new monolithic-only Parquet output.
```python
datasets = (
    ["pudl.sqlite"]
    + sorted(str(p.name) for p in pudl_out.glob("ferc*.sqlite"))
    + ["censusdp1tract.sqlite"]
)
```
It turns out that the ordering of this list is what controls the order in which the databases appear on the main Datasette page, so I manually ordered it to put PUDL first.
Co-authored-by: Dazhong Xia <dazhong.xia@catalyst.coop>
PR Overview

* The renamed databases are now suffixed with `_dbf` and `_xbrl` to indicate the source data format, so neither of them seems more generic.
* A `metadata.yml` file for testing / debugging purposes.

Tasks

* Run `make nuke` and see if anything breaks.
* Check that Datasette handles the `ferc*_dbf.sqlite` metadata correctly.

PR Checklist

* Merge the most recent version of the branch you are merging into (probably `dev`).