Eia epa crosswalk #822

Merged
merged 22 commits into sprint27 from eia-epa-crosswalk on Dec 4, 2020
Conversation

@aesharpe (Member) commented Nov 12, 2020:

So far I've created a replica of Greg's transformation notebook that takes the EIA crosswalk file and fills in some of the gaps. I'm looking for feedback on next steps regarding how to modularize this. Should it live as a function in the transform directory in the eia.py file? Should it be called in the final transform step?

Waiting on the (hopefully) imminent release of the "official" EPA version of the crosswalk.

@aesharpe aesharpe added labels eia860 (Anything having to do with EIA Form 860) and epacems (Integration and analysis of the EPA CEMS dataset) Nov 12, 2020
@aesharpe aesharpe changed the base branch from master to sprint27 November 12, 2020 18:41
@cmgosnell (Member) left a comment:

Okay! I'm going to try to briefly outline what I think is a good next step for the structure of where these things should live and what else needs to be done. I see two main options here. @zaneselvans feel free to comment.

Add all crosswalk processing inside pudl.transform.eia

As we discussed on a design call, I think it would be best to have this table generated any time EIA is being generated, which leads me to want this to live inside the pudl.transform.eia module.

  • a crosswalk function would be called inside of the pudl.transform.eia.transform() method; it would use the dictionary of EIA dataframes, grab the stored CSV, do the cleanup transformations, and add the transformed table to the dictionary of EIA dataframes (the key being the name of the table, the value being the dataframe). I would do this at the end of the current function (see the sketch after this list).
  • Once the crosswalk table is added to the dictionary of dfs, the next step in its processing life is to be loaded (inside of pudl.etl._etl_eia()). Since you've added the metadata into the json file, the load step should just know how to deal with this new table without any adjustments.
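
For example, a minimal sketch of that crosswalk function (the function and helper names here are hypothetical, not anything that exists in PUDL yet):

    def add_crosswalk(eia_transformed_dfs):
        """Clean the stored crosswalk CSV and add it to the dict of EIA dfs."""
        # Hypothetical loader for the CSV currently stored in package_data/glue:
        crosswalk = _load_crosswalk_csv()
        # ... column renaming and other cleanup transformations go here ...
        eia_transformed_dfs["eia_epacems_crosswalk"] = crosswalk
        return eia_transformed_dfs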

Add separate E/T steps and execute inside pudl.etl._etl_eia()

Another option would be to create a new extract and transform module for the crosswalk and insert the processing for this inside the pudl.etl._etl_eia() function.

  • Extract step: add a module as pudl.extract.crosswalk_eia_epa.py. Right now I believe this would just be the read_csv step and column renaming of the current module, but if/when EPA publishes this table, we can edit this step to use the datastore.
  • Transform step: add a module as pudl.transform.crosswalk_eia_epa.py. This would need both the result of the extract step for the crosswalk and the dictionary of transformed EIA dfs. Do all the mucking and again add the crosswalk table to the dictionary of EIA dataframes.
  • Tie it all together: put the extract and transform functions inside of pudl.etl._etl_eia() after the transformed_dfs = {"Entities": entities_dfs, "EIA": eia_transformed_dfs} line (right before the tables get loaded); see the sketch below.
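
A sketch of that wiring, assuming the new modules each expose a single function (the extract()/transform() names are assumptions):

    transformed_dfs = {"Entities": entities_dfs, "EIA": eia_transformed_dfs}
    # Extract the raw crosswalk, then transform it and fold it into the EIA dfs:
    raw_crosswalk = pudl.extract.crosswalk_eia_epa.extract()
    transformed_dfs["EIA"] = pudl.transform.crosswalk_eia_epa.transform(
        raw_crosswalk, transformed_dfs["EIA"])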

{
"profile": "tabular-data-resource",
"name": "eia_epacems_crosswalk",
"path": "data/plants_eia.csv", # NOT SURE
Member:

this needs to be data/<the name of the table>.csv

Member:

We should come up with a naming convention for the association tables so it's clear that they're all doing the same kind of thing, and which datasets and/or entities they're connecting. We've used _assn as a suffix in some places (bga, balancing authorities) so maybe something like eia_epacems_assn.csv or eia_ferc1_assn.xlsx for the spreadsheet that assigns the PUDL IDs.

Note that comments aren't allowed in JSON files. That's why they're highlighted in red here.

Member Author:

I'm not sure where exactly this data directory is. If I look in my datastore's data directory, for instance, I don't see any csvs (just more directories). For now, the original EPA-EIA crosswalk csv is stored in package_data/glue.

Member:

Inside each of the datapackages we publish, there's a datapackage.json and there's a directory called data which contains the actual CSV files. All file paths within the datapackage specification are relative to the directory in which the datapackage.json file is found. The use of this data directory structure isn't required by the datapackage standard, but it's a common idiom, because people will sometimes include other non-data auxiliary information inside the datapackage, like a directory called scripts, or the raw input files that went into the generation of the datapackage, or a README file, etc.

Member Author:

Does that mean that I should move the raw EPA csv file?

"encoding": "utf-8", # NOT SURE
"mediatype": "text/csv",
"format": "csv",
"dialect": { # NOT SURE ABOUT ANY OF THESE
Member:

I believe all of these look correct

Member:

You can find more details on what the CSV dialects are on Frictionless Data. Unless there's something wonky about the formatting of the file that we get from EPA this should be right. We can also use the frictionless command line tool or python library to try and infer what these values should be.
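
For instance, a quick way to infer them with the frictionless Python library the comment mentions (the filename here is hypothetical):

    from frictionless import describe

    # Inspect the inferred CSV dialect and schema for the raw crosswalk file:
    resource = describe("epa_eia_crosswalk.csv")
    print(resource.dialect)
    print(resource.schema)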

"format": "default"
},
{
"name": "plant_name_eia",
Member:

perhaps we should remove this column for standardization.

Member:

Agree, for normalization.

Member Author:

which column?

Member:

the plant_name_eia column -- since its home table is plants_entity_eia (I think).

Member Author:

We could do that, but this column was also used by Greg to fill in some of the mapping gaps.

Member:

Is it needed for the purposes of the association table once that ETL / gap-filling has been done? Or can the appropriate plant names be looked up in the plants_entity_eia table based on the included plant_id_eia?

Member Author:

That's a good point

"primaryKey": [
"plant_id_eia"
],
"foreignKeys": [{ # DONT KNOW WHERE THIS STANDS
Member:

I believe we should also add "plant_id_eia" and "generator_id" which can be added to this list like

    {
        "fields": ["plant_id_eia", "generator_id"],
        "reference": {
            "resource": "generators_entity_eia",
            "fields": ["plant_id_eia", "generator_id"]
        }
    }

"title": "Creative Commons Attribution 4.0",
"path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
Member:

we should decide what to list here as a source. we should at the very minimum list epa as a source like this:

{
  "title": "epa",
  "path": "https://www.epa.gov/"
}

Member:

Under the datapackage standard I think title is meant to be a freeform human-readable field, but we've been restricting it to the short names of datasets we've integrated (epacems or eia923 etc.), and I'm not 100% sure if we are relying on that anywhere in the code. We might want to choose a more specific title for this to be safe.

Sources can include an email address as well as a path (see here), and that seems like it would be appropriate here since we're getting it from a person at EPA, not their website (yet).

Member Author:

Hmm I agree. Wondering, though, if we should leave the sources as is until we get the updated version of the crosswalk.

"# Transform column headers\n",
"new_cols = eia_epacems_crosswalk.columns.str.lower()\n",
"new_cols = [col.replace(' ', '_') for col in new_cols]\n",
"eia_epacems_crosswalk.columns = new_cols\n",
Member:

you can use pudl.helpers.simplify_strings() for this if you want!

Member:

There's a function which specifically does this for column names, called pudl.helpers.simplify_columns()
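
That would collapse the cell above to a one-liner (assuming the helper takes and returns a dataframe):

    eia_epacems_crosswalk = pudl.helpers.simplify_columns(eia_epacems_crosswalk)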

@aesharpe (Member Author):

Next steps would then be to load the crosswalk table from the database and add it to the appropriate output tables in the output modules?

@zaneselvans (Member) commented Nov 13, 2020:

I feel like conceptually the right place to deal with this is the _etl_glue() step, but right now there's no way to pass data between the different ETL steps (i.e. between datasets), which is also why we have an awful hack where epacems cheats and looks for EIA CSV files in the outputs to figure out what timezone the plants are in. I imagine there are only going to be more of these kinds of inter-dataset dependencies over time.

I think that the Prefect + Dask based refactoring of the ETL module that @rousik is working on will eventually make it easy to pass the ETL outputs between the different steps in the ETL as required, since it can cache the outputs between steps, and tracks dependencies downstream. Is that right?

Anyway for now it seems like @cmgosnell's suggestion to trigger this process from the EIA module is simplest, but maybe we can actually put the code in a pudl.glue.eia_epacems_assn.py module so that it's clear this is an inter-dataset-joining process, and then when the Prefect / Dask refactoring of the ETL is done we can set up the dependencies so that it can really be in the glue step, and get access to all the data it needs.

@zaneselvans (Member) left a comment:

I'll submit a separate review on the notebook with the other notebook review tool thingy.

"encoding": "utf-8", # NOT SURE
"mediatype": "text/csv",
"format": "csv",
"dialect": { # NOT SURE ABOUT ANY OF THESE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can find more details on what the CSV dialects are on Frictionless Data. Unless there's something wonky about the formatting of the file that we get from EPA this should be right. We can also use the frictionless command line tool or python library to try and infer what these values should be.

{
"profile": "tabular-data-resource",
"name": "eia_epacems_crosswalk",
"path": "data/plants_eia.csv", # NOT SURE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should come up with a naming convention for the association tables so it's clear that they're all doing the same kind of thing, and which datasets and/or entities they're connecting. We've used _assn as a suffix in some places (bga, balancing authorities) so maybe something like eia_epacems_assn.csv or eia_ferc1_assn.xlsx for the spreadsheet that assigns the PUDL IDs.

Note that comments aren't allowed in JSON files. That's why they're highlighted in red here.

"type": "string",
"format": "default",
"description": "Prime mover code as seen in EIA."
],
Member:

Missing a curly brace } here.

If this is referring to the same set of prime mover codes that we've got in the EIA data, then it should have the same column name here (I think it's prime_mover_code).

"format": "default"
},
{
"name": "plant_name_eia",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, for normalization.

"title": "Creative Commons Attribution 4.0",
"path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the datapackge standard I think title is meant to be a freeform human readable field, but we've been restricting it to the short names of datasets we've integrated (epacems or eia923 etc.) and I'm not 100% sure if we are relying on that anywhere in the code. We might want to choose a more specific title for this to be safe.

Sources can include an email address as well as a path (see here), and that seems like it would be appropriate here since we're getting it from a person at EPA, not their website (yet).

"path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
},
Member:

No trailing commas allowed at the end of a list of entities in JSON.

Comment on lines 5503 to 5507
{
"name": "plant_id_epa",
"type": "integer",
"format": "default"
},
Member:

Where are the plant_id_epa values coming from? Is there an authoritative resource we can refer to here, so that it's easier for people to connect this with other datasets which use the same EPA plant IDs?

Member Author:

In the original file, this is called the ORIS code and the plant_id_eia is called the EIA ORIS. I haven't seen the EPA ORIS code show up elsewhere, but it's possible that it would be useful for connecting with other EPA datasets.

"# Transform column headers\n",
"new_cols = eia_epacems_crosswalk.columns.str.lower()\n",
"new_cols = [col.replace(' ', '_') for col in new_cols]\n",
"eia_epacems_crosswalk.columns = new_cols\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a function which specifically does this for column names, called pudl.helpers.simplify_columns()

{
"profile": "tabular-data-resource",
"name": "eia_epacems_crosswalk",
"path": "data/plants_eia.csv", # NOT SURE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside each of the datapackages we publish, there's a datapackage.json and there's a directory called data which contains the actual CSV files. All file paths within the datapackage specification are relative to the directory in which the datapakcage.json file is found. The use of this data directory structure isn't required by the datapackage standard, but it's a common idiom, because people will sometimes include other non-data auxiliary information inside the datapackage, like a directory called scripts or the raw input files that went into the generation of the datapackage, or a README file, etc.

"format": "default"
},
{
"name": "plant_name_eia",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the plant_name_eia column -- since its home table is plants_entity_eia (I think).

@@ -0,0 +1,299 @@
{
Member:

There's a special way to access data which is stored inside the package_data directory that you'll need to use, since this filesystem hierarchy will not be available when the pudl package is installed on a user's system. The package is importlib.resources and you'll want to use open_text(). You can find some examples of how this is being used right now in the pudl.constants module.
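
A minimal sketch of that pattern, given that the CSV currently lives in package_data/glue (the filename is hypothetical):

    import importlib.resources

    import pandas as pd

    # Open the packaged CSV without assuming any filesystem layout:
    with importlib.resources.open_text(
            "pudl.package_data.glue", "epa_eia_crosswalk.csv") as f:
        crosswalk = pd.read_csv(f)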



@@ -0,0 +1,299 @@
{
Member:

The string NaN suggests there's some datatype shenanigans going on here. However the file is getting read in, we should make sure it has real NA values, not stringified NA values, so they can be dealt with uniformly.
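
One way to guarantee real NA values at read time, sketched with pandas (the filename is hypothetical):

    # Treat the literal string "NaN" as a true NA value when reading:
    df = pd.read_csv("epa_eia_crosswalk.csv", na_values=["NaN"])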



Member Author:

fixed! There wasn't actually a datatype problem, I had just borrowed that from Greg's code.

@@ -0,0 +1,299 @@
{
Member:

Merging on plant_name_eia makes me nervous, because the names are both poorly standardized, and not guaranteed to be unique.



Member Author:

I totally agree with you on this, but I'm inclined to keep it in because it doesn't do any harm. What it does is add more EIA ids to the crosswalk if there happens to be an exact name match between the plant_name_eia and the plant_name_epa. In this case there are actually 2883 exact matches so it's pretty useful for filling in the gaps!
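
Roughly what that gap-fill looks like (the frame and column names here are assumed from context):

    # Build a name -> ID lookup, dropping ambiguous duplicate names first:
    name_map = (
        plants_entity_eia.drop_duplicates("plant_name_eia")
        .set_index("plant_name_eia")["plant_id_eia"]
    )
    # Fill in missing EIA plant IDs wherever the name matches exactly:
    missing = crosswalk["plant_id_eia"].isna()
    crosswalk.loc[missing, "plant_id_eia"] = (
        crosswalk.loc[missing, "plant_name_eia"].map(name_map)
    )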

@@ -0,0 +1,299 @@
{
Member:

Isn't the set of plants where the ORISPL ID doesn't match small and pretty well defined? I think we explored this in some depth a while ago. Could the matching process be simplified by harmonizing those codes first?



Member Author:

I looked through the issue and my understanding is that you can access EPA information by using the plant_id_eia (as shown by Karl's function pulling information from the FRS site), but there still isn't a clear mapping of plant_id_eia to this "epa unit". The links at the top of the issue seemed promising, but the first link was dead and the second (all the eGRID files) seems like it would be out of date given that it stops in 2016. The issue also culminates with Greg talking about the file he made (that I'm mimicking here), so it seems like he would have incorporated that data if it pertained to direct ID linkage. I can do some more digging though if you think there's more concrete information out there.

@aesharpe (Member Author) commented Nov 14, 2020:

On that note, however, it might be really useful to integrate the FRS Regional Ids into this same crosswalk table (as discussed in the issue). I'm working on a function to do so, but I can't figure out how to avoid using the notorious 'apply' function at the moment. Read: it's taking forever.

@zaneselvans (Member):

The notebook comments will probably only really make sense here: https://app.reviewnb.com/catalyst-cooperative/pudl/pull/822

{
"profile": "tabular-data-resource",
"name": "assn_eia_epacems",
"path": "data/plants_eia.csv",
Member:

You'll need to update the path here to refer to the correct output CSV file which it looks like will be data/assn_eia_epacems.csv

@@ -23,6 +23,7 @@
import pudl.extract.excel
import pudl.extract.ferc1
import pudl.extract.ferc714
import pudl.glue.eia_epacems
Member:

@aesharpe Any time you add a new module, if you want it to be available when someone does import pudl it needs to end up in here. Otherwise, it has to be specifically imported like import pudl.glue.eia_epacems.

Member Author:

ahhh thank you, this clears up some confusion I was having!

Comment on lines 5535 to 5544
"primaryKey": [
"plant_id_eia"
],
"foreignKeys": [{
"fields": ["plant_id_eia", "generator_id"],
"reference": {
"resource": "generators_entity_eia",
"fields": ["plant_id_eia", "generator_id"]
}
}],
Member:

The primaryKey has to uniquely identify a row in the table -- so unless there's only one record per plant_id_eia, I think we're going to need some additional columns in here. Pure association tables frequently include all of their columns as part of the primary key, since each row is meant to uniquely specify a relationship between other tables that are being referenced. In a similar vein, you'll need at least 2 different foreignKey values in here, referring to the multiple different tables which are being related to each other through this association table.
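
For instance, the keys might end up looking something like this (the exact composite primary key is an assumption; it should be whatever set of columns uniquely identifies a row):

    "primaryKey": ["plant_id_eia", "epa_point_source_unit", "generator_id"],
    "foreignKeys": [{
        "fields": ["plant_id_eia", "generator_id"],
        "reference": {
            "resource": "generators_entity_eia",
            "fields": ["plant_id_eia", "generator_id"]
        }
    }, {
        "fields": ["plant_id_eia"],
        "reference": {
            "resource": "plants_entity_eia",
            "fields": ["plant_id_eia"]
        }
    }],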

Comment on lines 30 to 43
.rename(columns={
'oris_code': 'plant_id_epa',
'eia_oris': 'plant_id_eia',
'unit_id': 'epa_point_source_unit',
'facility_name': 'plant_name_eia',
'unit_type': 'prime_mover_code'})
.drop([
'fuel_type_primary',
'edat_mw_cap',
'way_gen_id_matched',
'unit_op_status_date',
'notes',
'annual_heat_input',
'op_status'], axis=1)
Member:

The pattern of dropping the columns you don't want, rather than explicitly keeping only the columns you do want, is vulnerable to changes in the column names or the inclusion of new columns, which would mean the dataframe has unexpected columns in it at the end of the function. We definitely did this a lot early on in our transform function writing, but have tried to move toward having the dataframes be more reliably of a particular form at the end.

Member Author:

Good bit of advice - I switched it to filter vs drop
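
The keep-only-what-you-want version, sketched with the renamed columns from the hunk above (the exact keep-list is an assumption):

    crosswalk = crosswalk.filter([
        "plant_id_epa",
        "plant_id_eia",
        "epa_point_source_unit",
        "plant_name_eia",
        "prime_mover_code",
        "generator_id",
        "boiler_id",
    ])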

Comment on lines 5513 to 5528
{
"name": "epa_point_source_unit",
"type": "integer",
"format": "default",
"description": "'Smokestack' unit monitored by EPA CEMS."
},
{
"name": "generator_id",
"type": "integer",
"format": "default"
},
{
"name": "boiler_id",
"type": "integer",
"format": "default"
},
Member:

It looks like a lot of these IDs are actually strings, rather than integers, so the datapackage validation is currently failing. The type (and potentially format) need to be compatible with whatever the contents of the column are in the CSV.

Member Author:

what exactly does the "format" mean?

Member:

A CSV contains characters which make up strings, but the various columns of the CSV are meant to be interpreted as things that aren't necessarily strings. The format attribute indicates how the string stored in the CSV should be interpreted as... whatever type it's supposed to ultimately turn into (e.g. there's lots of different ways to encode a date or a time as a string). Lots of them are simple and only have the default case, but some of them are more complicated.

All of these attributes are defined by the TableSchema spec, as are the various ways of defining primary keys, foreign key constraints, values that should be interpreted as True or False or NA, etc.
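
For example, a date column with a non-default format might look like this in the schema (the field name is borrowed from the dropped columns above; the strptime-style pattern is illustrative):

    {
        "name": "unit_op_status_date",
        "type": "date",
        "format": "%m/%d/%Y"
    }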

Comment on lines 5498 to 5502
"name": "plant_id_eia",
"type": "integer",
"format": "default",
"description": "The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration."
},
Member:

It looks like the plant_id_eia values in the CSV are being stored as floats (at least some of the time) rather than integers, so this is causing the datapackage validation to fail. You might need to run the dtype fixing function from helpers, and/or add a new section to the datatype dictionary in pudl.constants to cover the columns in this new table.
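
A pandas-only sketch of the fix; the nullable Int64 dtype tolerates NA values in an integer ID column:

    crosswalk["plant_id_eia"] = crosswalk["plant_id_eia"].astype("Int64")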

Member Author:

fixed!

@codecov bot commented Dec 2, 2020:

Codecov Report

Merging #822 (b3f2890) into sprint27 (470e159) will increase coverage by 0.15%.
The diff coverage is 100.00%.


@@             Coverage Diff              @@
##           sprint27     #822      +/-   ##
============================================
+ Coverage     70.31%   70.46%   +0.15%     
============================================
  Files            39       40       +1     
  Lines          4871     4895      +24     
============================================
+ Hits           3425     3449      +24     
  Misses         1446     1446              
Impacted Files                   Coverage Δ
src/pudl/constants.py            100.00% <ø> (ø)
src/pudl/transform/eia.py        97.13% <ø> (ø)
src/pudl/transform/eia860.py     96.43% <ø> (ø)
src/pudl/etl.py                  77.92% <100.00%> (+0.35%) ⬆️
src/pudl/glue/eia_epacems.py     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 470e159...b3f2890.

@cmgosnell (Member) left a comment:

I think it would be cleaner to get rid of the processing we don't plan on using. perhaps we can bookmark the current commit so we can come back to it if we need to use this in the future.

I'd also like a little more documentation of the functions in glue.eia_epacems!

Besides that this looks good! It came out pretty simple and clean which is great.



def simple_clean():
"""No attempt to fill in the gaps, simply returns available data."""
Member:

can you add a quick description of the returns here?
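
Something like this, perhaps (the described return shape is an assumption based on how the result gets used in etl.py below):

    def simple_clean():
        """No attempt to fill in the gaps, simply returns available data.

        Returns:
            dict: crosswalk dataframes keyed by table name, ready to be
            merged into the dict of transformed EIA dataframes.
        """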

"""Add the EIA-EPA crosswalk to the transformed dfs dict."""
assn_dfs = pudl.glue.eia_epacems.simple_clean()
eia_transformed_dfs.update(assn_dfs)

Member:

i like how straightforward this is!

{
"profile": "tabular-data-resource",
"name": "plant_unit_epa",
"path": "data/assn_eia_epacems.csv",
Member:

Is it okay that the path and the name are different here? I had thought that they needed to be the same.

@aesharpe (Member Author) commented Dec 3, 2020:

Hmmm, I'm not sure about the "legality" of having different path/table names, but in this case they are different because we are taking one CSV and splitting it into three separate normalized tables.

@zaneselvans do you think this will be an issue?

@cmgosnell (Member) commented Dec 3, 2020:

oooh i think i understand why this isn't a problem rn. Because we are always overwriting the path when we generate the "real" datapackage metadata in pudl.load.metadata.get_tabular_data_resource().

so truly the paths in this file don't matter - I suppose they will be functionally gone when we convert the metadata into the new system. But for now I'd either remove these or rename them to match the table names, just for consistency. This path will be the path within the datapackage that an ETL run generates - and the names and the file names are always the same right now.

Member Author:

is the path supposed to be the path to the raw datafile or the output datafile?

Member:

the output file! which is why it doesn't have an effect in this file because this is the thing that we use to cobble together the metadata for the datapackage once it is generated.

@@ -814,6 +814,17 @@ def _restrict_years(df,
return df


# def _add_eia_epacems_crosswalk(eia_transformed_dfs):
Member:

can you get rid of this lingering code here now that it is integrated into etl.py??

Member Author:

yes! thanks for spotting that.

@aesharpe aesharpe marked this pull request as ready for review December 4, 2020 23:27
@aesharpe aesharpe merged commit 147cca3 into sprint27 Dec 4, 2020
@zaneselvans zaneselvans deleted the eia-epa-crosswalk branch June 1, 2021 19:35