You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently the multi-annual FERC Form 1 DB that we build is based on a schema that's an awkward hybrid of the mappings between unreadable DBF filenames and actual table names that we found online, which pertains to the 2015 data, and a our parsing of the strings in the DBC file for whatever reference year we're trying to build a database for -- typically the most recent data release year.
This means that there are some kinds of data that we could be missing -- if tables and/or columns existed in the past but don't exist now, we'll probably miss them when reading in earlier years of data. This doesn't appear to be common, but it could trip someone up.
The "right way" to fix this seems like it's probably to create a super-schema that contains all of the tables that have ever existed and all of the columns that have ever existed, by parsing the DBC files for all of the data years... and crossing our fingers that the data types didn't change over the years, or there's one data type that can be imposed on each column which all the data can be forced into. We would also need to figure out the historical mappings between database table names and F1_whatevs.DBF filenames.
The text was updated successfully, but these errors were encountered:
I have verified that the 2015 & 2017 FERC Form 1 database schemas are identical, and that they include all the tables and fields which exist in all previous years of data going back as far as 1994, so the super-schema would super-fluous. An ETL test has been integrated (test/etl_test.py::test_ferc1_lost_data()) which verifies that this situation contiues to be true (and checks whether any new tables have been added in future years, relative to our 2015 mapping of DBF files to database table names, so we can update that if need be.
Currently the multi-annual FERC Form 1 DB that we build is based on a schema that's an awkward hybrid of the mappings between unreadable DBF filenames and actual table names that we found online, which pertains to the 2015 data, and a our parsing of the strings in the DBC file for whatever reference year we're trying to build a database for -- typically the most recent data release year.
This means that there are some kinds of data that we could be missing -- if tables and/or columns existed in the past but don't exist now, we'll probably miss them when reading in earlier years of data. This doesn't appear to be common, but it could trip someone up.
The "right way" to fix this seems like it's probably to create a super-schema that contains all of the tables that have ever existed and all of the columns that have ever existed, by parsing the DBC files for all of the data years... and crossing our fingers that the data types didn't change over the years, or there's one data type that can be imposed on each column which all the data can be forced into. We would also need to figure out the historical mappings between database table names and F1_whatevs.DBF filenames.
The text was updated successfully, but these errors were encountered: