EIA EPA crosswalk #822

Conversation
…preliminary metadata
Okay! I'm going to briefly outline what I think is a good next step for the structure of where these things should live and what else needs to be done. I see two main options here. @zaneselvans feel free to comment.
Add all crosswalk processing inside pudl.transform.eia
As we discussed on a design call, I think it would be best to have this table generated any time EIA is being generated, which leads me to want this to live inside the pudl.transform.eia module.
- A crosswalk function would be called inside of the pudl.transform.eia.transform() method. It would use the dictionary of EIA dataframes, grab the stored CSV, do the cleanup transformations, and add the transformed table to the dictionary of EIA dataframes (the key being the name of the table, the value being the dataframe). I would do this at the end of the current function.
- Once the crosswalk table is added to the dictionary of dfs, the next step in its processing life will be to be loaded (inside of pudl.etl._etl_eia()). Since you've added the metadata into the JSON file, the load step should just know how to deal with this new table without any adjustments.
Add separate E/T steps and execute inside pudl.etl._etl_eia()
Another option would be to create a new extract and transform module for the crosswalk and insert the processing for this inside the pudl.etl._etl_eia() function.
- Extract step: add a module as pudl.extract.crosswalk_eia_epa.py. Right now I believe this would just be the read_csv step and column renaming of the current module, but if/when EPA publishes this table, we can edit this step to use the datastore.
- Transform step: add a module as pudl.transform.crosswalk_eia_epa.py. This would need both the result of the extract step for the crosswalk and the dictionary of transformed EIA dfs. Do all the mucking and again add the crosswalk table to the dictionary of EIA dataframes.
- Tie it all together: put the extract and transform functions inside of pudl.etl._etl_eia() after the transformed_dfs = {"Entities": entities_dfs, "EIA": eia_transformed_dfs} line (right before the tables get loaded).
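The first option (everything triggered from pudl.transform.eia) could be sketched roughly like this. The function names and the placeholder table contents are illustrative, not the actual pudl API; only the table-name-to-dataframe dict pattern comes from the discussion above:

```python
import pandas as pd


def _load_crosswalk():
    """Stand-in for reading the stored crosswalk CSV and cleaning it up."""
    return pd.DataFrame({"plant_id_eia": [3], "plant_id_epa": [3]})


def add_crosswalk(eia_transformed_dfs):
    """Append the cleaned crosswalk table to the dict of EIA dataframes.

    Mirrors the pattern described above: the key is the table name, the
    value is the dataframe, and this would be called at the end of
    pudl.transform.eia.transform().
    """
    eia_transformed_dfs["assn_eia_epacems"] = _load_crosswalk()
    return eia_transformed_dfs


dfs = add_crosswalk({})
```

With the table in the dict, the existing load step picks it up like any other EIA table.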
{
    "profile": "tabular-data-resource",
    "name": "eia_epacems_crosswalk",
    "path": "data/plants_eia.csv", # NOT SURE
this needs to be data/<the name of the table>.csv
We should come up with a naming convention for the association tables so it's clear that they're all doing the same kind of thing, and which datasets and/or entities they're connecting. We've used _assn
as a suffix in some places (bga, balancing authorities) so maybe something like eia_epacems_assn.csv
or eia_ferc1_assn.xlsx
for the spreadsheet that assigns the PUDL IDs.
Note that comments aren't allowed in JSON files. That's why they're highlighted in red here.
I'm not sure where exactly this data
directory is. If I look in my datastore's data
directory, for instance, I don't see any csvs (just more directories). For now, the original EPA-EIA crosswalk csv is stored in package_data/glue
.
Inside each of the datapackages we publish, there's a datapackage.json
and there's a directory called data
which contains the actual CSV files. All file paths within the datapackage specification are relative to the directory in which the datapackage.json
file is found. The use of this data
directory structure isn't required by the datapackage standard, but it's a common idiom, because people will sometimes include other non-data auxiliary information inside the datapackage, like a directory called scripts
or the raw input files that went into the generation of the datapackage, or a README file, etc.
Does that mean that I should move the raw EPA csv file?
    "encoding": "utf-8", # NOT SURE
    "mediatype": "text/csv",
    "format": "csv",
    "dialect": { # NOT SURE ABOUT ANY OF THESE
I believe all of these look correct
You can find more details on what the CSV dialects are on Frictionless Data. Unless there's something wonky about the formatting of the file that we get from EPA this should be right. We can also use the frictionless
command line tool or python library to try and infer what these values should be.
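If the frictionless tooling isn't handy, the standard library's csv.Sniffer can also take a rough guess at dialect parameters. A minimal sketch on a made-up sample of the file:

```python
import csv

# Hypothetical first lines of the crosswalk CSV, just for illustration.
sample = 'ORIS Code,EIA ORIS,Facility Name\n3,"3","Barry"\n'

# Sniffer inspects the sample and guesses delimiter, quote character, etc.
dialect = csv.Sniffer().sniff(sample)
delimiter = dialect.delimiter
quotechar = dialect.quotechar
```

The inferred values can then be compared against what's written in the datapackage metadata.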
    "format": "default"
},
{
    "name": "plant_name_eia",
perhaps we should remove this column for standardization.
Agree, for normalization.
which column?
the plant_name_eia
column -- since its home table is plants_entity_eia
(I think).
We could do that, but this column was also used by Greg to fill in some of the mapping gaps.
Is it needed for the purposes of the association table once that ETL / gap-filling has been done? Or can the appropriate plant names be looked up in the plants_entity_eia
table based on the included plant_id_eia
?
That's a good point
    "primaryKey": [
        "plant_id_eia"
    ],
    "foreignKeys": [{ # DONT KNOW WHERE THIS STANDS
I believe we should also add "plant_id_eia" and "generator_id" which can be added to this list like
{
    "fields": ["plant_id_eia", "generator_id"],
    "reference": {
        "resource": "generators_entity_eia",
        "fields": ["plant_id_eia", "generator_id"]
    }
}
    "title": "Creative Commons Attribution 4.0",
    "path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
We should decide what to list here as a source. At the very minimum we should list EPA as a source, like this:
{
"title": "epa",
"path": "https://www.epa.gov/"
}
Under the datapackage standard I think title
is meant to be a freeform human readable field, but we've been restricting it to the short names of datasets we've integrated (epacems
or eia923
etc.) and I'm not 100% sure if we are relying on that anywhere in the code. We might want to choose a more specific title for this to be safe.
Sources can include an email address as well as a path (see here), and that seems like it would be appropriate here since we're getting it from a person at EPA, not their website (yet).
Hmm I agree. Wondering, though, if we should leave the sources as is until we get the updated version of the crosswalk.
# Transform column headers
new_cols = eia_epacems_crosswalk.columns.str.lower()
new_cols = [col.replace(' ', '_') for col in new_cols]
eia_epacems_crosswalk.columns = new_cols
you can use pudl.helpers.simplify_strings()
for this if you want!
There's a function which specifically does this for column names, called pudl.helpers.simplify_columns().
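For reference, the notebook's manual version is equivalent to a single vectorized pandas chain like the one below (simplify_columns presumably does this plus additional punctuation handling; check its docstring for the exact behavior):

```python
import pandas as pd

# Tiny frame with headers shaped like the raw crosswalk's (illustrative names).
df = pd.DataFrame(columns=["ORIS Code", "Facility Name"])

# Lowercase and snake_case the column headers in one pass.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
cols = list(df.columns)
```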
Next steps would then be to load the crosswalk table from the database and add it to the appropriate output tables in the output modules?
I feel like conceptually the right place to deal with this is the

I think that the Prefect + Dask based refactoring of the ETL module that @rousik is working on will eventually make it easy to pass the ETL outputs between the different steps in the ETL as required, since it can cache the outputs between steps and tracks dependencies downstream. Is that right?

Anyway, for now it seems like @cmgosnell's suggestion to trigger this process from the EIA module is simplest, but maybe we can actually put the code in a
I'll submit a separate review on the notebook with the other notebook review tool thingy.
    "type": "string",
    "format": "default",
    "description": "Prime mover code as seen in EIA."
],
Missing a curly brace } here.
If this is referring to the same set of prime mover codes that we've got in the EIA data, then it should have the same column name here (I think it's prime_mover_code
).
    "path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
},
No trailing commas allowed at the end of a list of entities in JSON.
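Both JSON gotchas flagged in this review (comments and trailing commas) are easy to check for with Python's strict json parser:

```python
import json


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


plain = is_valid_json('{"sources": []}')
trailing_comma = is_valid_json('{"sources": [],}')            # trailing comma
with_comment = is_valid_json('{"path": "x.csv"  # NOT SURE\n}')  # inline comment
```

Running the metadata file through json.loads (or a linter) before the ETL touches it catches both mistakes early.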
{
    "name": "plant_id_epa",
    "type": "integer",
    "format": "default"
},
Where are the plant_id_epa
values coming from? Is there an authoritative resource we can refer to here, so that it's easier for people to connect this with other datasets which use the same EPA plant IDs?
In the original file, this is called the ORIS code
and the plant_id_eia
is called the EIA ORIS
. I haven't seen the EPA ORIS code show up elsewhere, but it's possible that it would be useful for connection with other EPA datasets.
There's a special way to access data which is stored inside the package_data
directory that you'll need to use, since this filesystem hierarchy will not be available when the pudl
package is installed on a user's system. The package is importlib.resources
and you'll want to use open_text(). You can find some examples of how this is being used right now in the pudl.constants.py module.
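A sketch of the recommended pattern; the package path and filename here are assumptions based on where the raw crosswalk currently lives (package_data/glue), so adjust them to the real layout:

```python
import importlib.resources


def extract_crosswalk():
    """Read the raw crosswalk CSV from inside the installed pudl package.

    importlib.resources.open_text works regardless of how the package is
    installed, unlike building a filesystem path from __file__.
    The package and file names below are hypothetical.
    """
    with importlib.resources.open_text(
        "pudl.package_data.glue", "epa_eia_crosswalk.csv"
    ) as f:
        return f.read()
```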
The string NaN suggests there's some datatype shenanigans going on here. However the file is getting read in, we should make sure it has real NA values, not stringified NA values, so they can be dealt with uniformly.
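For what it's worth, pandas' read_csv already converts the literal string "NaN" to a real NA by default, so stringified NAs usually appear only when they're introduced after reading; a quick check on a made-up sample:

```python
import io

import pandas as pd

# Illustrative CSV snippet with a literal "NaN" in an id column.
csv_text = "plant_id_eia,plant_id_epa\n3,3\nNaN,55\n"

# read_csv treats "NaN" (among other sentinels) as a real missing value.
df = pd.read_csv(io.StringIO(csv_text))
n_missing = int(df["plant_id_eia"].isna().sum())
```

If stringified NAs do show up in a dataframe, df.replace with pd.NA restores real missing values.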
fixed! There wasn't actually a datatype problem, I had just borrowed that from Greg's code.
Merging on plant_name_eia
makes me nervous, because the names are both poorly standardized, and not guaranteed to be unique.
I totally agree with you on this, but I'm inclined to keep it in because it doesn't do any harm. What it does is add more EIA ids to the crosswalk if there happens to be an exact name match between the plant_name_eia
and the plant_name_epa
. In this case there are actually 2883 exact matches so it's pretty useful for filling in the gaps!
Isn't the set of plants where the ORISPL ID doesn't match small and pretty well defined? I think we explored this in some depth a while ago. Could the matching process be simplified by harmonizing those codes first?
I looked through the issue and my understanding is that you can access EPA information by using the plant_id_eia
(as shown by Karl's function pulling information from the FRS site) but there still isn't a clear mapping of plant_id_eia
to this "epa unit". The links at the top of the issue seemed promising, but the first link was dead and the second (all the eGRID files) seems like it would be out of date given that it stops in 2016. The issue also culminates with Greg talking about the file he made (that I'm mimicking here), so it seems like he would have incorporated that data if it pertained to direct id linkage. I can do some more digging though if you think there's more concrete information out there.
On that note, however, it might be really useful to integrate the FRS Regional Ids into this same crosswalk table (as discussed in the issue). I'm working on a function to do so, but I can't figure out how to avoid using the notorious 'apply' function at the moment. Read: it's taking forever.
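Depending on what the function does, a row-wise apply can often be replaced by a vectorized Series.map against a lookup dictionary; a generic sketch with hypothetical FRS ids:

```python
import pandas as pd

crosswalk = pd.DataFrame({"plant_id_eia": [3, 55, 3]})

# Hypothetical plant-id-to-FRS-Regional-Id lookup.
frs_ids = {3: "FRS-A", 55: "FRS-B"}

# Vectorized lookup instead of crosswalk.apply(lambda row: ..., axis=1),
# which runs the Python-level function once per row.
crosswalk["frs_id"] = crosswalk["plant_id_eia"].map(frs_ids)
result = crosswalk["frs_id"].tolist()
```

When the ids come from another dataframe rather than a dict, a merge accomplishes the same thing.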
The notebook comments will probably only really make sense here: https://app.reviewnb.com/catalyst-cooperative/pudl/pull/822
…crosswalk notebook
{
    "profile": "tabular-data-resource",
    "name": "assn_eia_epacems",
    "path": "data/plants_eia.csv",
You'll need to update the path here to refer to the correct output CSV file which it looks like will be data/assn_eia_epacems.csv
@@ -23,6 +23,7 @@
import pudl.extract.excel
import pudl.extract.ferc1
import pudl.extract.ferc714
import pudl.glue.eia_epacems
@aesharpe Any time you add a new module, if you want it to be available when someone does import pudl
it needs to end up in here. Otherwise, it has to be specifically imported like import pudl.glue.eia_epacems
.
ahhh thank you, this clears up some confusion I was having!
"primaryKey": [
    "plant_id_eia"
],
"foreignKeys": [{
    "fields": ["plant_id_eia", "generator_id"],
    "reference": {
        "resource": "generators_entity_eia",
        "fields": ["plant_id_eia", "generator_id"]
    }
}],
The primaryKey
has to uniquely identify a row in the table -- so unless there's only one record per plant_id_eia
, I think we're going to need some additional columns in here. Pure association tables frequently include all of their columns as part of the primary key, since each row is meant to uniquely specify a relationship between other tables that are being referenced. In a similar vein, you'll need at least 2 different foreignKey
values in here, referring to the multiple different tables which are being related to each other through this association table.
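Putting that together, the schema section might end up shaped something like this; the exact column and resource names (epa_point_source_unit, plant_unit_epa) are pieced together from elsewhere in this review and may need adjusting:

```json
"primaryKey": [
    "plant_id_eia",
    "generator_id",
    "epa_point_source_unit"
],
"foreignKeys": [
    {
        "fields": ["plant_id_eia", "generator_id"],
        "reference": {
            "resource": "generators_entity_eia",
            "fields": ["plant_id_eia", "generator_id"]
        }
    },
    {
        "fields": ["plant_id_epa", "epa_point_source_unit"],
        "reference": {
            "resource": "plant_unit_epa",
            "fields": ["plant_id_epa", "epa_point_source_unit"]
        }
    }
]
```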
src/pudl/glue/eia_epacems.py
Outdated
.rename(columns={
    'oris_code': 'plant_id_epa',
    'eia_oris': 'plant_id_eia',
    'unit_id': 'epa_point_source_unit',
    'facility_name': 'plant_name_eia',
    'unit_type': 'prime_mover_code'})
.drop([
    'fuel_type_primary',
    'edat_mw_cap',
    'way_gen_id_matched',
    'unit_op_status_date',
    'notes',
    'annual_heat_input',
    'op_status'], axis=1)
The pattern of dropping the columns you don't want, rather than explicitly keeping only the columns you do want, is vulnerable to changes in the column names or the inclusion of new columns, which would mean the dataframe has unexpected columns in it at the end of the function. We definitely did this a lot early on in our transform function writing, but have tried to move toward having the dataframes be more reliably of a particular form at the end.
Good bit of advice - I switched it to filter vs drop.
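The filter-based pattern keeps only an explicit list of columns, so any upstream rename or new column can never leak through; a minimal sketch with illustrative data:

```python
import pandas as pd

# Raw frame with both wanted and unwanted columns (contents made up).
raw = pd.DataFrame({
    "plant_id_epa": [3],
    "plant_id_eia": [3],
    "notes": ["free text"],
    "op_status": ["OP"],
})

keep = ["plant_id_epa", "plant_id_eia"]

# filter() keeps only the listed columns, in the listed order; anything
# unexpected in raw is silently excluded rather than passed along.
clean = raw.filter(items=keep)
```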
{
    "name": "epa_point_source_unit",
    "type": "integer",
    "format": "default",
    "description": "'Smokestack' unit monitored by EPA CEMS."
},
{
    "name": "generator_id",
    "type": "integer",
    "format": "default"
},
{
    "name": "boiler_id",
    "type": "integer",
    "format": "default"
},
It looks like a lot of these IDs are actually strings, rather than integers, so the datapackage validation is currently failing. The type (and potentially format) need to be compatible with whatever the contents of the column are in the CSV.
what exactly does the "format" mean?
A CSV contains characters which make up strings, but the various columns of the CSV are meant to be interpreted as things that aren't necessarily strings. The format
attribute indicates how the string stored in the CSV should be interpreted as... whatever type it's supposed to ultimately turn into (e.g. there's lots of different ways to encode a date or a time as a string). Lots of them are simple and only have the default case, but some of them are more complicated.
All of these attributes are defined by the TableSchema spec, as are the various ways of defining primary keys, foreign key constraints, values that should be interpreted as True or False or NA, etc.
"name": "plant_id_eia",
"type": "integer",
"format": "default",
"description": "The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration."
},
It looks like the plant_id_eia
values in the CSV are being stored as floats (at least some of the time) rather than integers, so this is causing the datapackage validation to fail. You might need to run the dtype fixing function from helpers, and/or add a new section to the datatype dictionary in pudl.constants
to cover the columns in this new table.
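One common fix for integer id columns that contain NAs (and therefore get read as floats) is pandas' nullable Int64 extension dtype; a sketch:

```python
import pandas as pd

# An id column read in as float64 because of the missing value.
ids = pd.Series([3.0, None, 55.0])

# astype(int) would raise on the NA; the nullable extension dtype
# keeps the missing value while storing real integers.
fixed = ids.astype("Int64")
dtype_name = str(fixed.dtype)
first = int(fixed.iloc[0])
```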
fixed!
…se additional steps due to foreign key errors; stand-in until EPA version 2 ready
Codecov Report
@@ Coverage Diff @@
## sprint27 #822 +/- ##
============================================
+ Coverage 70.31% 70.46% +0.15%
============================================
Files 39 40 +1
Lines 4871 4895 +24
============================================
+ Hits 3425 3449 +24
Misses 1446 1446
Continue to review full report at Codecov.
I think it would be cleaner to get rid of the processing we don't plan on using. Perhaps we can bookmark the current commit so we can come back to it if we need to use this in the future.
I'd also like a little more documentation of the functions in glue.eia_epacems!
Besides that this looks good! It came out pretty simple and clean, which is great.
src/pudl/glue/eia_epacems.py
Outdated
def simple_clean():
    """No attempt to fill in the gaps, simply returns available data."""
can you add a quick description of the returns here?
    """Add the EIA-EPA crosswalk to the transformed dfs dict."""
    assn_dfs = pudl.glue.eia_epacems.simple_clean()
    eia_transformed_dfs.update(assn_dfs)
i like how straightforward this is!
{
    "profile": "tabular-data-resource",
    "name": "plant_unit_epa",
    "path": "data/assn_eia_epacems.csv",
Is it okay that the path and the name are different here? I had thought that they needed to be the same.
Hmmm, I'm not sure about the "legality" of having different path/table names, but in this case they are different because we are taking one CSV and splitting it into three separate normalized tables.
@zaneselvans do you think this will be an issue?
oooh i think i understand why this isn't a problem rn. Because we are always overwriting the path when we generate the "real" datapackage metadata in pudl.load.metadata.get_tabular_data_resource().
so truly the paths in this file don't matter - I suppose they will be functionally gone when we convert the metadata into the new system. But for now I'd either remove or rename these to be consistent with the name, just for consistency. This path will be the path from within the datapackage that an etl run generates - and the names and the file names are always the same right now.
is the path supposed to be the path to the raw datafile or the output datafile?
the output file! which is why it doesn't have an effect in this file because this is the thing that we use to cobble together the metadata for the datapackage once it is generated.
src/pudl/transform/eia.py
Outdated
@@ -814,6 +814,17 @@ def _restrict_years(df,
    return df


# def _add_eia_epacems_crosswalk(eia_transformed_dfs):
can you get rid of this lingering code here now that it is integrated into etl.py?
yes! thanks for spotting that.
…f code to fill in the gaps
So far I've created a replica of Greg's transformation notebook that takes the EIA crosswalk file and fills in some of the gaps. I'm looking for feedback on next steps regarding how to modularize this. Should it live as a function in the transform directory in the eia.py file? Should it be called in the final transform step?

Waiting on the (hopefully) imminent release of the "official" EPA version of the crosswalk.