datapkg_to_sqlite fails to load all of EPA CEMS #450

zaneselvans · 2019-10-15T01:53:25Z

After doing a full ETL of all years and states in CEMS, the datapkg_to_sqlite script doesn't seem to load all of that data into the SQLite database. Rather, it only loads the last year of data. The process also terminates quickly, so it's probably not even attempting to load all the data. I suspect it's an issue with the iteration and/or partitioning...
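One quick way to confirm which years actually landed in the database is to list the distinct years in the table. This is only a minimal sketch: the table name `hourly_emissions_epacems` comes from this thread, but the timestamp column name `operating_datetime_utc` and the helper `years_loaded` are assumptions for illustration, and the demo uses a small in-memory database rather than real PUDL output.

```python
import sqlite3


def years_loaded(conn, table="hourly_emissions_epacems",
                 datetime_col="operating_datetime_utc"):
    """Return the sorted list of distinct years found in a timestamp column.

    The default column name is an assumption; substitute the real one.
    """
    query = (
        f"SELECT DISTINCT strftime('%Y', \"{datetime_col}\") AS yr "
        f"FROM \"{table}\" ORDER BY yr"
    )
    return [int(row[0]) for row in conn.execute(query) if row[0] is not None]


# Demo against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hourly_emissions_epacems "
    "(plant_id_eia INTEGER, operating_datetime_utc TEXT)"
)
conn.executemany(
    "INSERT INTO hourly_emissions_epacems VALUES (?, ?)",
    [
        (1, "2005-01-01 00:00:00"),
        (1, "2005-01-01 01:00:00"),
        (1, "2006-01-01 00:00:00"),
    ],
)
loaded = years_loaded(conn)
print(loaded)
```

If only the final year shows up here after a multi-year ETL, the earlier partitions never reached the database.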


cmgosnell · 2019-10-22T21:51:53Z

Hmmm, I tried to recreate this issue with a grouped table and am getting a similar result. For me, only the last resource described in the datapackage.json file is showing up in the SQLite database. @roll, is it possible that for a grouped table the SQLite export is not appending the parts of the group, but instead clobbering each of the earlier parts?

I generated data packages for CEMS with these parameters:

epacems_years: [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
epacems_states: [ID,MT,CO,TX,IL]

It is being partitioned by state and year, so each tabular resource for the hourly_emissions_epacems table is part of the same group. Here is a description of one of the resources:

       {
            "profile": "tabular-data-resource",
            "name": "hourly_emissions_epacems_2005_id",
            "path": "data/hourly_emissions_epacems_2005_id.csv.gz",
            "title": "hourly_emissions_epacems",
            "encoding": "utf-8",
            "mediatype": "text/csv",
            "format": "csv",
            "dialect": {
                "delimiter": ",",
                "header": true,
                "quoteChar": "\"",
                "doubleQuote": true,
                "lineTerminator": "\r\n",
                "skipInitialSpace": true,
                "caseSensitiveHeader": false
            },
            "schema": {
                "fields": [
                    {
                        "name": "plant_id_eia",
                        "type": "integer",
                        "description": "The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.",
                        "format": "default"
                    },
## a bunch of deleted field descriptions
                            ]
                        }
                    }
                }
            ],
            "bytes": 229820,
            "hash": "sha256:12d2998a68c0689c19061e001910519c7a19dfe6be3fbc8444f914d9a9ff4c9f",
            "created": "2019-10-22T19:58:40Z",
            "start_date": "2005-01-01",
            "end_date": "2014-21-31",
            "group": "hourly_emissions_epacems"
        },
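To check how many resources a descriptor actually lists for each group, something like the following could be run against the loaded datapackage.json. It is a sketch only: `resources_by_group` is a hypothetical helper, and the two-resource descriptor is a trimmed stand-in that reuses the `name`, `title`, and `group` keys shown above.

```python
from collections import defaultdict


def resources_by_group(descriptor):
    """Map each group name to the names of the resources assigned to it.

    Resources without a "group" key are filed under their own name.
    """
    groups = defaultdict(list)
    for res in descriptor.get("resources", []):
        groups[res.get("group", res["name"])].append(res["name"])
    return dict(groups)


# Trimmed stand-in for a real descriptor (normally json.load() of datapackage.json).
descriptor = {
    "resources": [
        {
            "name": "hourly_emissions_epacems_2005_id",
            "title": "hourly_emissions_epacems",
            "group": "hourly_emissions_epacems",
        },
        {
            "name": "hourly_emissions_epacems_2006_id",
            "title": "hourly_emissions_epacems",
            "group": "hourly_emissions_epacems",
        },
    ]
}

groups = resources_by_group(descriptor)
for group, names in groups.items():
    print(group, len(names))
```

With ten years and five states, the group would be expected to hold fifty resources; a count of one means the metadata itself is missing the other partitions.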

roll · 2019-10-25T14:55:51Z

@cmgosnell
Could you please share something I can run to debug this issue?

PS.
Please make sure you're on the latest versions of tableschema and datapackage

roll · 2019-10-30T14:31:15Z

@cmgosnell
Has updating the libs versions helped?

zaneselvans · 2019-11-19T17:59:07Z

No, still getting the error with the most recent versions of the datapackage libraries. A very simple version with a couple of resources in a group seems to work as expected, but the simplest PUDL output that tests the behavior doesn't work. I'm trying to simplify that resource group output one step at a time until I get to a minimal example to share with you.

Closes Issue #450. In flattening the data packages, I was using the wrong key (`title`) to de-duplicate the list of resources. This resulted in only one CEMS table ending up in the metadata, and thus in the SQLite db.
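The effect of choosing the wrong de-duplication key can be illustrated with a small sketch. This is not the actual PUDL code: `dedupe_resources` is a hypothetical helper, and it assumes a last-one-wins strategy, which matches the symptom of only the final resource surviving. Every partition of a grouped table shares the same `title` but has a unique `name`.

```python
def dedupe_resources(resources, key):
    """De-duplicate resource descriptors on the given key (last one wins)."""
    unique = {}
    for res in resources:
        unique[res[key]] = res
    return list(unique.values())


# Three partitions of one grouped table: same title, distinct names.
resources = [
    {"name": "hourly_emissions_epacems_2005_id", "title": "hourly_emissions_epacems"},
    {"name": "hourly_emissions_epacems_2006_id", "title": "hourly_emissions_epacems"},
    {"name": "hourly_emissions_epacems_2007_id", "title": "hourly_emissions_epacems"},
]

by_title = dedupe_resources(resources, "title")  # collapses to a single resource
by_name = dedupe_resources(resources, "name")    # keeps every partition

print(len(by_title), len(by_name))
```

Keying on `title` leaves only the last partition in the flattened metadata, so only that partition is ever loaded into SQLite; keying on `name` preserves all of them.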

cmgosnell · 2019-11-20T00:28:56Z

Hey @roll! This was totally just our mistake, so don't worry about it. We are making a bunch of data-source-specific data packages and then squishing them together into one package. I set up a process for generating the new data package without duplicating elements of the metadata, but I messed up and it was only grabbing one of the CEMS resources. It was a very simple fix once we figured out what was happening.

zaneselvans added the epacems (Integration and analysis of the EPA CEMS dataset.) and sqlite (Issues related to interacting with sqlite databases) labels Oct 15, 2019

zaneselvans added the ready and bug (Things that are just plain broken.) labels Oct 22, 2019

zaneselvans added this to the PUDL Sprint 5 milestone Oct 23, 2019

roll mentioned this issue Oct 29, 2019

Investigate the grouping problem with SQLite frictionlessdata/pilot-catalyst#19

Closed

zaneselvans self-assigned this Nov 18, 2019

cmgosnell closed this as completed Nov 20, 2019
