datapkg_to_sqlite fails to load all of EPA CEMS #450

Closed
zaneselvans opened this issue Oct 15, 2019 · 5 comments
Labels: bug (Things that are just plain broken.), epacems (Integration and analysis of the EPA CEMS dataset.), ready, sqlite (Issues related to interacting with sqlite databases)


@zaneselvans

After doing a full ETL of all years and states in CEMS, the datapkg_to_sqlite script doesn't seem to load all of that data into the SQLite database. Rather, it only loads the last year of data into the database. However, the process terminates quickly, so it's probably not even attempting to load all the data. Suspect it's an issue with the iteration and/or partitioning...
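One quick way to confirm which years actually landed in the database is to count rows per year directly in SQLite. The following is only a minimal sketch: the helper name `years_loaded` is hypothetical, and the table and datetime column names are passed in by the caller because the actual column names are not shown in this thread.

```python
import sqlite3


def years_loaded(conn, table, datetime_col):
    """Return {year: row_count} for an ISO-formatted datetime column.

    `table` and `datetime_col` are supplied by the caller; they are
    assumptions about the schema, not names confirmed in this issue.
    """
    query = f"""
        SELECT strftime('%Y', "{datetime_col}") AS yr, COUNT(*) AS n
        FROM "{table}"
        GROUP BY yr
        ORDER BY yr
    """
    return {year: count for year, count in conn.execute(query)}


if __name__ == "__main__":
    # Tiny in-memory demonstration with made-up rows.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE hourly_emissions_epacems (ts TEXT)")
    demo.executemany(
        "INSERT INTO hourly_emissions_epacems VALUES (?)",
        [("2005-01-01 00:00:00",), ("2014-12-31 23:00:00",)],
    )
    print(years_loaded(demo, "hourly_emissions_epacems", "ts"))
```

If only a single year shows up in the result after a multi-year ETL, that matches the symptom described above.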

@zaneselvans zaneselvans added epacems Integration and analysis of the EPA CEMS dataset. sqlite Issues related to interacting with sqlite databases labels Oct 15, 2019
@cmgosnell

Hmmm, I tried to recreate this issue with a grouped table and am getting a similar result. For me, only the last resource described in the datapackage.json file is showing up in the SQLite database. @roll, is it possible that for a grouped table the SQLite export is not appending the parts of the group, but just clobbering each of the earlier parts?

I generated data packages for CEMS with these parameters:

epacems_years: [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
epacems_states: [ID,MT,CO,TX,IL]

It is being partitioned by state and year, so each tabular resource for the hourly_emissions_epacems table is part of the same group. Here is a description of one of the resources:

       {
            "profile": "tabular-data-resource",
            "name": "hourly_emissions_epacems_2005_id",
            "path": "data/hourly_emissions_epacems_2005_id.csv.gz",
            "title": "hourly_emissions_epacems",
            "encoding": "utf-8",
            "mediatype": "text/csv",
            "format": "csv",
            "dialect": {
                "delimiter": ",",
                "header": true,
                "quoteChar": "\"",
                "doubleQuote": true,
                "lineTerminator": "\r\n",
                "skipInitialSpace": true,
                "caseSensitiveHeader": false
            },
            "schema": {
                "fields": [
                    {
                        "name": "plant_id_eia",
                        "type": "integer",
                        "description": "The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration.",
                        "format": "default"
                    },
## a bunch of deleted field descriptions
                            ]
                        }
                    }
                }
            ],
            "bytes": 229820,
            "hash": "sha256:12d2998a68c0689c19061e001910519c7a19dfe6be3fbc8444f914d9a9ff4c9f",
            "created": "2019-10-22T19:58:40Z",
            "start_date": "2005-01-01",
            "end_date": "2014-21-31",
            "group": "hourly_emissions_epacems"
        },
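To see whether every partition is actually listed in the descriptor before it goes to SQLite, the resources can be tallied by their `group` property. This is a rough sketch using only the standard library; the helper name `resources_by_group` is hypothetical, and it assumes the descriptor has the usual top-level `resources` list with `name` and optional `group` keys, as in the resource shown above.

```python
import json
from collections import defaultdict


def resources_by_group(descriptor):
    """Map each group name to the list of resource names belonging to it.

    Resources without a "group" key are filed under their own name.
    """
    groups = defaultdict(list)
    for resource in descriptor.get("resources", []):
        key = resource.get("group", resource["name"])
        groups[key].append(resource["name"])
    return dict(groups)


if __name__ == "__main__":
    # The path is an assumption; point it at the generated data package.
    with open("datapackage.json") as f:
        for group, names in resources_by_group(json.load(f)).items():
            print(group, len(names))
```

With the parameters above (10 years and 5 states), one would expect 50 resources in the hourly_emissions_epacems group; a count of 1 would point at the metadata rather than at the SQLite export.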

@zaneselvans zaneselvans added ready bug Things that are just plain broken. labels Oct 22, 2019
@zaneselvans zaneselvans added this to the PUDL Sprint 5 milestone Oct 23, 2019
@roll

roll commented Oct 25, 2019

@cmgosnell
Could you please share something I can run to debug this issue?

PS.
Please make sure you're on the latest versions of tableschema and datapackage

@roll

roll commented Oct 30, 2019

@cmgosnell
Has updating the library versions helped?

@zaneselvans zaneselvans self-assigned this Nov 18, 2019
@zaneselvans

No, still getting the error with the most recent versions of the datapackage libraries. A very simple version with a couple of resources in a group seems to work as expected, but the simplest PUDL output that tests the behavior doesn't work. I'm trying to simplify that resource group output one step at a time until I get to a minimal example to share with you.

cmgosnell added a commit that referenced this issue Nov 20, 2019
Closes Issue #450. In flattening the data packages, I was using the
wrong key (`title`) to de-duplicate the list of resources. This was
resulting in only one CEMS table ending up in the metadata and thus in
the SQLite db.
@cmgosnell

Hey @roll! This was totally just our mistake, so don't worry about it. We are making a bunch of data-source-specific data packages and then squishing them together into one package. I set up a process for generating the new data package without duplicating elements of the metadata, but I messed up and it was only grabbing one of the CEMS resources. It was a very simple fix once we figured out what was happening.
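For anyone hitting something similar, here is an illustrative sketch of how de-duplicating on `title` collapses a partitioned table, while de-duplicating on `name` keeps every part. It is not the actual PUDL code: the function `dedupe_resources` and the sample resource names are hypothetical, and the keep-the-last behavior is just one way such a de-duplication might be written.

```python
def dedupe_resources(resources, key):
    """De-duplicate resource descriptors on `key`, keeping the last match.

    Building a dict keyed on resource[key] means later resources
    overwrite earlier ones that share the same key value.
    """
    return list({res[key]: res for res in resources}.values())


# Every partition of the table shares one title but has a unique name.
resources = [
    {"name": "hourly_emissions_epacems_2005_id", "title": "hourly_emissions_epacems"},
    {"name": "hourly_emissions_epacems_2006_id", "title": "hourly_emissions_epacems"},
    {"name": "hourly_emissions_epacems_2014_tx", "title": "hourly_emissions_epacems"},
]

by_title = dedupe_resources(resources, "title")  # collapses to a single part
by_name = dedupe_resources(resources, "name")    # keeps all three parts
print(len(by_title), len(by_name))
```

Since `name` is the property that distinguishes the partitions in the resource description earlier in this thread, it is the safer key for this kind of merge.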
