Eia epa crosswalk #822

Merged
merged 22 commits into sprint27 from eia-epa-crosswalk on Dec 4, 2020
Conversation

@aesharpe (Member) commented Nov 12, 2020:

So far I've created a replica of Greg's transformation notebook that takes the EIA crosswalk file and fills in some of the gaps. I'm looking for feedback on next steps regarding how to modularize this. Should it live as a function in the transform directory in the eia.py file? Should it be called in the final transform step?

Waiting on the (hopefully) imminent release of the "official" EPA version of the crosswalk.

@aesharpe aesharpe added labels eia860 (Anything having to do with EIA Form 860) and epacems (Integration and analysis of the EPA CEMS dataset) Nov 12, 2020
@aesharpe aesharpe changed the base branch from master to sprint27 November 12, 2020 18:41
@cmgosnell (Member) left a comment:

Okay! I'm going to try to briefly outline what I think is a good next step for the structure of where these things should live and what else needs to be done. I see two main options here. @zaneselvans feel free to comment.

Add all crosswalk processing inside pudl.transform.eia

As we discussed on a design call, I think it would be best to have this table generated any time EIA is being generated, which leads me to want this to live inside the pudl.transform.eia module.

  • a crosswalk function would be called inside of the pudl.transform.eia.transform() method; it would use the dictionary of EIA dataframes, grab the stored CSV, do the cleanup transformations, and add the transformed table to the dictionary of EIA dataframes (the key being the name of the table, the value being the dataframe). I would do this at the end of the current function (see the sketch after this list).
  • Once the crosswalk table is added to the dictionary of dfs, the next step in its processing life is to be loaded (inside of pudl.etl._etl_eia()). Since you've added the metadata into the json file, the load step should just know how to deal with this new table without any adjustments.
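
For example, a minimal sketch of that crosswalk function (the function and helper names here are hypothetical, not anything that exists in PUDL yet):

    def add_crosswalk(eia_transformed_dfs):
        """Clean the stored crosswalk CSV and add it to the dict of EIA dfs."""
        # Hypothetical loader for the CSV currently stored in package_data/glue:
        crosswalk = _load_crosswalk_csv()
        # ... column renaming and other cleanup transformations go here ...
        eia_transformed_dfs["eia_epacems_crosswalk"] = crosswalk
        return eia_transformed_dfs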

Add separate E/T steps and execute inside pudl.etl._etl_eia()

Another option would be to create a new extract and transform module for the crosswalk and insert the processing for this inside the pudl.etl._etl_eia() function.

  • Extract step: add a module as pudl.extract.crosswalk_eia_epa.py. Right now I believe this would just be the read_csv step and column renaming of the current module, but if/when EPA publishes this table, we can edit this step to use the datastore.
  • Transform step: add a module as pudl.transform.crosswalk_eia_epa.py. This would need both the result of the extract step for the crosswalk and the dictionary of transformed EIA dfs. Do all the mucking and again add the crosswalk table to the dictionary of EIA dataframes.
  • Tie it all together: put the extract and transform functions inside of pudl.etl._etl_eia() after the transformed_dfs = {"Entities": entities_dfs, "EIA": eia_transformed_dfs} line (right before the tables get loaded); see the sketch below.
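
A sketch of that wiring, assuming the new modules each expose a single function (the extract()/transform() names are assumptions):

    transformed_dfs = {"Entities": entities_dfs, "EIA": eia_transformed_dfs}
    # Extract the raw crosswalk, then transform it and fold it into the EIA dfs:
    raw_crosswalk = pudl.extract.crosswalk_eia_epa.extract()
    transformed_dfs["EIA"] = pudl.transform.crosswalk_eia_epa.transform(
        raw_crosswalk, transformed_dfs["EIA"])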

{
"profile": "tabular-data-resource",
"name": "eia_epacems_crosswalk",
"path": "data/plants_eia.csv", # NOT SURE
Member:

this needs to be data/<the name of the table>.csv

Member:

We should come up with a naming convention for the association tables so it's clear that they're all doing the same kind of thing, and which datasets and/or entities they're connecting. We've used _assn as a suffix in some places (bga, balancing authorities) so maybe something like eia_epacems_assn.csv or eia_ferc1_assn.xlsx for the spreadsheet that assigns the PUDL IDs.

Note that comments aren't allowed in JSON files. That's why they're highlighted in red here.

Member Author:

I'm not sure where exactly this data directory is. If I look in my datastore's data directory, for instance, I don't see any csvs (just more directories). For now, the original EPA-EIA crosswalk csv is stored in package_data/glue.

Member:

Inside each of the datapackages we publish, there's a datapackage.json and there's a directory called data which contains the actual CSV files. All file paths within the datapackage specification are relative to the directory in which the datapackage.json file is found. The use of this data directory structure isn't required by the datapackage standard, but it's a common idiom, because people will sometimes include other non-data auxiliary information inside the datapackage, like a directory called scripts, or the raw input files that went into the generation of the datapackage, or a README file, etc.

Member Author:

Does that mean that I should move the raw EPA csv file?

"encoding": "utf-8", # NOT SURE
"mediatype": "text/csv",
"format": "csv",
"dialect": { # NOT SURE ABOUT ANY OF THESE
Member:

I believe all of these look correct

Member:

You can find more details on what the CSV dialects are on Frictionless Data. Unless there's something wonky about the formatting of the file that we get from EPA this should be right. We can also use the frictionless command line tool or python library to try and infer what these values should be.
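
For instance, a quick way to infer them with the frictionless Python library the comment mentions (the filename here is hypothetical):

    from frictionless import describe

    # Inspect the inferred CSV dialect and schema for the raw crosswalk file:
    resource = describe("epa_eia_crosswalk.csv")
    print(resource.dialect)
    print(resource.schema)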

"format": "default"
},
{
"name": "plant_name_eia",
Member:

perhaps we should remove this column for standardization.

Member:

Agree, for normalization.

Member Author:

which column?

Member:

the plant_name_eia column -- since its home table is plants_entity_eia (I think).

Member Author:

We could do that, but this column was also used by Greg to fill in some of the mapping gaps.

Member:

Is it needed for the purposes of the association table once that ETL / gap-filling has been done? Or can the appropriate plant names be looked up in the plants_entity_eia table based on the included plant_id_eia?

Member Author:

That's a good point

"primaryKey": [
"plant_id_eia"
],
"foreignKeys": [{ # DONT KNOW WHERE THIS STANDS
Member:

I believe we should also add "plant_id_eia" and "generator_id" which can be added to this list like

    {
        "fields": ["plant_id_eia", "generator_id"],
        "reference": {
            "resource": "generators_entity_eia",
            "fields": ["plant_id_eia", "generator_id"]
        }
    }

"title": "Creative Commons Attribution 4.0",
"path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
Member:

we should decide what to list here as a source. we should at the very minimum list epa as a source like this:

{
  "title": "epa",
  "path": "https://www.epa.gov/"
}

Member:

Under the datapackage standard I think title is meant to be a freeform human-readable field, but we've been restricting it to the short names of datasets we've integrated (epacems or eia923 etc.), and I'm not 100% sure if we are relying on that anywhere in the code. We might want to choose a more specific title for this to be safe.

Sources can include an email address as well as a path (see here), and that seems like it would be appropriate here since we're getting it from a person at EPA, not their website (yet).

Member Author:

Hmm I agree. Wondering, though, if we should leave the sources as is until we get the updated version of the crosswalk.

"# Transform column headers\n",
"new_cols = eia_epacems_crosswalk.columns.str.lower()\n",
"new_cols = [col.replace(' ', '_') for col in new_cols]\n",
"eia_epacems_crosswalk.columns = new_cols\n",
Member:

you can use pudl.helpers.simplify_strings() for this if you want!

Member:

There's a function which specifically does this for column names, called pudl.helpers.simplify_columns()
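
That would collapse the cell above to a one-liner (assuming the helper takes and returns a dataframe):

    eia_epacems_crosswalk = pudl.helpers.simplify_columns(eia_epacems_crosswalk)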

@aesharpe (Member Author):

Next steps would then be to load the crosswalk table from the database and add it to the appropriate output tables in the output modules?

@zaneselvans (Member) commented Nov 13, 2020:

I feel like conceptually the right place to deal with this is the _etl_glue() step, but right now there's no way to pass data between the different ETL steps (i.e. between datasets), which is also why we have an awful hack where epacems cheats and looks for EIA CSV files in the outputs to figure out what timezone the plants are in. I imagine there are only going to be more of these kinds of inter-dataset dependencies over time.

I think that the Prefect + Dask based refactoring of the ETL module that @rousik is working on will eventually make it easy to pass the ETL outputs between the different steps in the ETL as required, since it can cache the outputs between steps, and tracks dependencies downstream. Is that right?

Anyway for now it seems like @cmgosnell's suggestion to trigger this process from the EIA module is simplest, but maybe we can actually put the code in a pudl.glue.eia_epacems_assn.py module so that it's clear this is an inter-dataset-joining process, and then when the Prefect / Dask refactoring of the ETL is done we can set up the dependencies so that it can really be in the glue step, and get access to all the data it needs.

@zaneselvans (Member) left a comment:

I'll submit a separate review on the notebook with the other notebook review tool thingy.

"encoding": "utf-8", # NOT SURE
"mediatype": "text/csv",
"format": "csv",
"dialect": { # NOT SURE ABOUT ANY OF THESE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can find more details on what the CSV dialects are on Frictionless Data. Unless there's something wonky about the formatting of the file that we get from EPA this should be right. We can also use the frictionless command line tool or python library to try and infer what these values should be.

{
"profile": "tabular-data-resource",
"name": "eia_epacems_crosswalk",
"path": "data/plants_eia.csv", # NOT SURE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should come up with a naming convention for the association tables so it's clear that they're all doing the same kind of thing, and which datasets and/or entities they're connecting. We've used _assn as a suffix in some places (bga, balancing authorities) so maybe something like eia_epacems_assn.csv or eia_ferc1_assn.xlsx for the spreadsheet that assigns the PUDL IDs.

Note that comments aren't allowed in JSON files. That's why they're highlighted in red here.

"type": "string",
"format": "default",
"description": "Prime mover code as seen in EIA."
],
Member:

Missing a curly brace } here.

If this is referring to the same set of prime mover codes that we've got in the EIA data, then it should have the same column name here (I think it's prime_mover_code).

"format": "default"
},
{
"name": "plant_name_eia",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, for normalization.

"title": "Creative Commons Attribution 4.0",
"path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the datapackge standard I think title is meant to be a freeform human readable field, but we've been restricting it to the short names of datasets we've integrated (epacems or eia923 etc.) and I'm not 100% sure if we are relying on that anywhere in the code. We might want to choose a more specific title for this to be safe.

Sources can include an email address as well as a path (see here), and that seems like it would be appropriate here since we're getting it from a person at EPA, not their website (yet).

"path": "https://creativecommons.org/licenses/by/4.0/"
}],
"sources": []
},
Member:

No trailing commas allowed at the end of a list of entities in JSON.

Comment on lines 5503 to 5507
{
"name": "plant_id_epa",
"type": "integer",
"format": "default"
},
Member:

Where are the plant_id_epa values coming from? Is there an authoritative resource we can refer to here, so that it's easier for people to connect this with other datasets which use the same EPA plant IDs?

Member Author:

In the original file, this is called the ORIS code and the plant_id_eia is called the EIA ORIS. I haven't seen the EPA ORIS code show up elsewhere, but it's possible that it would be useful for connecting with other EPA datasets.

"# Transform column headers\n",
"new_cols = eia_epacems_crosswalk.columns.str.lower()\n",
"new_cols = [col.replace(' ', '_') for col in new_cols]\n",
"eia_epacems_crosswalk.columns = new_cols\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a function which specifically does this for column names, called pudl.helpers.simplify_columns()

{
"profile": "tabular-data-resource",
"name": "eia_epacems_crosswalk",
"path": "data/plants_eia.csv", # NOT SURE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside each of the datapackages we publish, there's a datapackage.json and there's a directory called data which contains the actual CSV files. All file paths within the datapackage specification are relative to the directory in which the datapakcage.json file is found. The use of this data directory structure isn't required by the datapackage standard, but it's a common idiom, because people will sometimes include other non-data auxiliary information inside the datapackage, like a directory called scripts or the raw input files that went into the generation of the datapackage, or a README file, etc.

"format": "default"
},
{
"name": "plant_name_eia",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the plant_name_eia column -- since its home table is plants_entity_eia (I think).

@@ -0,0 +1,299 @@
{
Member:

There's a special way to access data which is stored inside the package_data directory that you'll need to use, since this filesystem hierarchy will not be available when the pudl package is installed on a user's system. The package is importlib.resources and you'll want to use open_text(). You can find some examples of how this is being used right now in the pudl.constants module.
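
A minimal sketch of that pattern, given that the CSV currently lives in package_data/glue (the filename is hypothetical):

    import importlib.resources

    import pandas as pd

    # Open the packaged CSV without assuming any filesystem layout:
    with importlib.resources.open_text(
            "pudl.package_data.glue", "epa_eia_crosswalk.csv") as f:
        crosswalk = pd.read_csv(f)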



@@ -0,0 +1,299 @@
{
Member:

The string NaN suggests there's some datatype shenanigans going on here. However the file is getting read in, we should make sure it has real NA values, not stringified NA values, so they can be dealt with uniformly.
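
One way to guarantee real NA values at read time, sketched with pandas (the filename is hypothetical):

    # Treat the literal string "NaN" as a true NA value when reading:
    df = pd.read_csv("epa_eia_crosswalk.csv", na_values=["NaN"])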



Member Author:

fixed! There wasn't actually a datatype problem, I had just borrowed that from Greg's code.

@@ -0,0 +1,299 @@
{
Member:

Merging on plant_name_eia makes me nervous, because the names are both poorly standardized, and not guaranteed to be unique.



Member Author:

I totally agree with you on this, but I'm inclined to keep it in because it doesn't do any harm. What it does is add more EIA ids to the crosswalk if there happens to be an exact name match between the plant_name_eia and the plant_name_epa. In this case there are actually 2883 exact matches so it's pretty useful for filling in the gaps!
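
Roughly what that gap-fill looks like (the frame and column names here are assumed from context):

    # Build a name -> ID lookup, dropping ambiguous duplicate names first:
    name_map = (
        plants_entity_eia.drop_duplicates("plant_name_eia")
        .set_index("plant_name_eia")["plant_id_eia"]
    )
    # Fill in missing EIA plant IDs wherever the name matches exactly:
    missing = crosswalk["plant_id_eia"].isna()
    crosswalk.loc[missing, "plant_id_eia"] = (
        crosswalk.loc[missing, "plant_name_eia"].map(name_map)
    )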

@@ -0,0 +1,299 @@
{
Member:

Isn't the set of plants where the ORISPL ID doesn't match small and pretty well defined? I think we explored this in some depth a while ago. Could the matching process be simplified by harmonizing those codes first?



Member Author:

I looked through the issue and my understanding is that you can access EPA information by using the plant_id_eia (as shown by Karl's function pulling information from the FRS site), but there still isn't a clear mapping of plant_id_eia to this "epa unit". The links at the top of the issue seemed promising, but the first link was dead and the second (all the eGRID files) seems like it would be out of date given that it stops in 2016. The issue also culminates with Greg talking about the file he made (that I'm mimicking here), so it seems like he would have incorporated that data if it pertained to direct ID linkage. I can do some more digging though if you think there's more concrete information out there.

@aesharpe (Member Author) commented Nov 14, 2020:

On that note, however, it might be really useful to integrate the FRS Regional Ids into this same crosswalk table (as discussed in the issue). I'm working on a function to do so, but I can't figure out how to avoid using the notorious 'apply' function at the moment. Read: it's taking forever.

@zaneselvans (Member):

The notebook comments will probably only really make sense here: https://app.reviewnb.com/catalyst-cooperative/pudl/pull/822

{
"profile": "tabular-data-resource",
"name": "assn_eia_epacems",
"path": "data/plants_eia.csv",
Member:

You'll need to update the path here to refer to the correct output CSV file which it looks like will be data/assn_eia_epacems.csv

@@ -23,6 +23,7 @@
import pudl.extract.excel
import pudl.extract.ferc1
import pudl.extract.ferc714
import pudl.glue.eia_epacems
Member:

@aesharpe Any time you add a new module, if you want it to be available when someone does import pudl it needs to end up in here. Otherwise, it has to be specifically imported like import pudl.glue.eia_epacems.

Member Author:

ahhh thank you, this clears up some confusion I was having!

Comment on lines 5535 to 5544
"primaryKey": [
"plant_id_eia"
],
"foreignKeys": [{
"fields": ["plant_id_eia", "generator_id"],
"reference": {
"resource": "generators_entity_eia",
"fields": ["plant_id_eia", "generator_id"]
}
}],
Member:

The primaryKey has to uniquely identify a row in the table -- so unless there's only one record per plant_id_eia, I think we're going to need some additional columns in here. Pure association tables frequently include all of their columns as part of the primary key, since each row is meant to uniquely specify a relationship between other tables that are being referenced. In a similar vein, you'll need at least 2 different foreignKey values in here, referring to the multiple different tables which are being related to each other through this association table.
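
For instance, the keys might end up looking something like this (the exact composite primary key is an assumption; it should be whatever set of columns uniquely identifies a row):

    "primaryKey": ["plant_id_eia", "epa_point_source_unit", "generator_id"],
    "foreignKeys": [{
        "fields": ["plant_id_eia", "generator_id"],
        "reference": {
            "resource": "generators_entity_eia",
            "fields": ["plant_id_eia", "generator_id"]
        }
    }, {
        "fields": ["plant_id_eia"],
        "reference": {
            "resource": "plants_entity_eia",
            "fields": ["plant_id_eia"]
        }
    }],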

Comment on lines 30 to 43
.rename(columns={
'oris_code': 'plant_id_epa',
'eia_oris': 'plant_id_eia',
'unit_id': 'epa_point_source_unit',
'facility_name': 'plant_name_eia',
'unit_type': 'prime_mover_code'})
.drop([
'fuel_type_primary',
'edat_mw_cap',
'way_gen_id_matched',
'unit_op_status_date',
'notes',
'annual_heat_input',
'op_status'], axis=1)
Member:

The pattern of dropping the columns you don't want, rather than explicitly keeping only the columns you do want, is vulnerable to changes in the column names or the inclusion of new columns, which would mean the dataframe has unexpected columns in it at the end of the function. We definitely did this a lot early on in our transform function writing, but have tried to move toward having the dataframes be more reliably of a particular form at the end.

Member Author:

Good bit of advice - I switched it to filter vs drop
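
The keep-only-what-you-want version, sketched with the renamed columns from the hunk above (the exact keep-list is an assumption):

    crosswalk = crosswalk.filter([
        "plant_id_epa",
        "plant_id_eia",
        "epa_point_source_unit",
        "plant_name_eia",
        "prime_mover_code",
        "generator_id",
        "boiler_id",
    ])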

Comment on lines 5513 to 5528
{
"name": "epa_point_source_unit",
"type": "integer",
"format": "default",
"description": "'Smokestack' unit monitored by EPA CEMS."
},
{
"name": "generator_id",
"type": "integer",
"format": "default"
},
{
"name": "boiler_id",
"type": "integer",
"format": "default"
},
Member:

It looks like a lot of these IDs are actually strings, rather than integers, so the datapackage validation is currently failing. The type (and potentially format) need to be compatible with whatever the contents of the column are in the CSV.

Member Author:

what exactly does the "format" mean?

Member:

A CSV contains characters which make up strings, but the various columns of the CSV are meant to be interpreted as things that aren't necessarily strings. The format attribute indicates how the string stored in the CSV should be interpreted as... whatever type it's supposed to ultimately turn into (e.g. there's lots of different ways to encode a date or a time as a string). Lots of them are simple and only have the default case, but some of them are more complicated.

All of these attributes are defined by the TableSchema spec, as are the various ways of defining primary keys, foreign key constraints, values that should be interpreted as True or False or NA, etc.
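
For example, a date column with a non-default format might look like this in the schema (the field name is borrowed from the dropped columns above; the strptime-style pattern is illustrative):

    {
        "name": "unit_op_status_date",
        "type": "date",
        "format": "%m/%d/%Y"
    }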

Comment on lines 5498 to 5502
"name": "plant_id_eia",
"type": "integer",
"format": "default",
"description": "The unique six-digit facility identification number, also called an ORISPL, assigned by the Energy Information Administration."
},
Member:

It looks like the plant_id_eia values in the CSV are being stored as floats (at least some of the time) rather than integers, so this is causing the datapackage validation to fail. You might need to run the dtype fixing function from helpers, and/or add a new section to the datatype dictionary in pudl.constants to cover the columns in this new table.
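
A pandas-only sketch of the fix; the nullable Int64 dtype tolerates NA values in an integer ID column:

    crosswalk["plant_id_eia"] = crosswalk["plant_id_eia"].astype("Int64")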

Member Author:

fixed!

@codecov bot commented Dec 2, 2020:

Codecov Report

Merging #822 (b3f2890) into sprint27 (470e159) will increase coverage by 0.15%.
The diff coverage is 100.00%.


@@             Coverage Diff              @@
##           sprint27     #822      +/-   ##
============================================
+ Coverage     70.31%   70.46%   +0.15%     
============================================
  Files            39       40       +1     
  Lines          4871     4895      +24     
============================================
+ Hits           3425     3449      +24     
  Misses         1446     1446              
Impacted Files                   Coverage Δ
src/pudl/constants.py            100.00% <ø> (ø)
src/pudl/transform/eia.py        97.13% <ø> (ø)
src/pudl/transform/eia860.py     96.43% <ø> (ø)
src/pudl/etl.py                  77.92% <100.00%> (+0.35%) ⬆️
src/pudl/glue/eia_epacems.py     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 470e159...b3f2890.

@cmgosnell (Member) left a comment:

I think it would be cleaner to get rid of the processing we don't plan on using. perhaps we can bookmark the current commit so we can come back to it if we need to use this in the future.

I'd also like a little more documentation of the functions in glue.eia_epacems!

Besides that this looks good! It came out pretty simple and clean which is great.



def simple_clean():
"""No attempt to fill in the gaps, simply returns available data."""
Member:

can you add a quick description of the returns here?
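
Something like this, perhaps (the described return shape is an assumption based on how the result gets used in etl.py below):

    def simple_clean():
        """No attempt to fill in the gaps, simply returns available data.

        Returns:
            dict: crosswalk dataframes keyed by table name, ready to be
            merged into the dict of transformed EIA dataframes.
        """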

"""Add the EIA-EPA crosswalk to the transformed dfs dict."""
assn_dfs = pudl.glue.eia_epacems.simple_clean()
eia_transformed_dfs.update(assn_dfs)

Member:

i like how straightforward this is!

{
"profile": "tabular-data-resource",
"name": "plant_unit_epa",
"path": "data/assn_eia_epacems.csv",
Member:

Is it okay that the path and the name are different here? I had thought that they needed to be the same.

@aesharpe (Member Author) commented Dec 3, 2020:

Hmmm, I'm not sure about the "legality" of having different path/table names, but in this case they are different because we are taking one CSV and splitting it into three separate normalized tables.

@zaneselvans do you think this will be an issue?

@cmgosnell (Member) commented Dec 3, 2020:

oooh i think i understand why this isn't a problem rn. Because we are always overwriting the path when we generate the "real" datapackage metadata in pudl.load.metadata.get_tabular_data_resource().

so truly the paths in this file don't matter - I suppose they will be functionally gone when we convert the metadata into the new system. But for now I'd either remove these or rename them to match the table names, just for consistency. This path will be the path within the datapackage that an ETL run generates - and the names and the file names are always the same right now.

Member Author:

is the path supposed to be the path to the raw datafile or the output datafile?

Member:

the output file! which is why it doesn't have an effect in this file because this is the thing that we use to cobble together the metadata for the datapackage once it is generated.

@@ -814,6 +814,17 @@ def _restrict_years(df,
return df


# def _add_eia_epacems_crosswalk(eia_transformed_dfs):
Member:

can you get rid of this lingering code here now that it is integrated into etl.py??

Member Author:

yes! thanks for spotting that.

@aesharpe aesharpe marked this pull request as ready for review December 4, 2020 23:27
@aesharpe aesharpe merged commit 147cca3 into sprint27 Dec 4, 2020
@zaneselvans zaneselvans deleted the eia-epa-crosswalk branch June 1, 2021 19:35