Validate and save csv of all 1:m FERC-EIA matches #2516
Conversation
While validating, made one minor change to Xcel data.
Codecov report (patch coverage):

    @@ Coverage Diff @@
    ##             dev   #2516   +/- ##
    =======================================
    - Coverage   86.9%   86.7%   -0.2%
    =======================================
      Files         86      86
      Lines       9661    9688     +27
    =======================================
    + Hits        8400    8406      +6
    - Misses      1261    1282     +21

    ... and 1 file with indirect coverage changes
This makes me feel like we should have a test for the training data. We test the validation functions that input new training data, but we don't check that the training data itself hasn't been accidentally altered. But this is outside the scope of this issue.
Looks good! My one main comment is about the one_to_many=False param and whether or not it should be a param.
src/pudl/analysis/ferc1_eia_train.py
        pd.DataFrame: A dataframe of 1:m matches formatted to fit into the
            existing validation framework.
    """
    multi_match_cols = [f"record_id_eia_override_{i}" for i in range(2, 4)]
There's a small but non-zero possibility that this could be more than 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to handle all possible numbers of override columns.
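One way to avoid the hard-coded range(2, 4) is to detect the override columns that are actually present in the dataframe. This is just an illustrative sketch, not the actual PUDL implementation; get_multi_match_cols is a hypothetical helper name:

```python
import pandas as pd


def get_multi_match_cols(df: pd.DataFrame) -> list[str]:
    """Collect every record_id_eia_override_N column (N >= 2) actually present."""
    suffix = lambda c: c.rsplit("_", 1)[-1]
    return sorted(
        (
            c
            for c in df.columns
            if c.startswith("record_id_eia_override_")
            and suffix(c).isdigit()
            and int(suffix(c)) >= 2
        ),
        key=lambda c: int(suffix(c)),
    )


df = pd.DataFrame(
    columns=[
        "record_id_eia_override_1",
        "record_id_eia_override_2",
        "record_id_eia_override_3",
        "record_id_eia_override_5",
    ]
)
print(get_multi_match_cols(df))
# ['record_id_eia_override_2', 'record_id_eia_override_3', 'record_id_eia_override_5']
```

This keeps working no matter how many override columns a spreadsheet happens to have.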
src/pudl/analysis/ferc1_eia_train.py
def validate_and_add_to_training(
    utils_eia860,
    ppe,
    ferc1_eia,
    input_dir_path,
    expect_override_overrides=False,
    allow_mismatched_utilities=True,
    one_to_many=False,
I almost feel like this should not be a param. Either that, or it should be set to True by default. Otherwise, if you forget to switch it to True, it will map all one-to-many values to whatever is in record_id_eia_override_1, which we don't want. I can't think of an occasion where we wouldn't want to do this if there were in fact one-to-many matches in a batch. Maybe we could add a conditional that checks the data for one-to-many values with something like if record_id_eia_override_2.notnull().any(), and then executes the one-to-many validation code, instead of having it as a parameter?
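The conditional check suggested above could look something like the sketch below. has_one_to_many is a hypothetical helper, not part of the PUDL API:

```python
import pandas as pd


def has_one_to_many(df: pd.DataFrame) -> bool:
    """Return True if any secondary override column holds a value,
    i.e. the batch contains at least one 1:m match."""
    extra_cols = [
        c
        for c in df.columns
        if c.startswith("record_id_eia_override_") and not c.endswith("_1")
    ]
    if not extra_cols:
        return False
    return bool(df[extra_cols].notnull().any().any())


batch = pd.DataFrame(
    {
        "record_id_eia_override_1": ["a", "b"],
        "record_id_eia_override_2": [None, "c"],
    }
)
print(has_one_to_many(batch))  # True
```

With a check like this, the one-to-many validation path could be triggered automatically instead of relying on a caller-supplied flag.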
I'm going to leave this as a parameter with the default set to True. I agree that it should be the default behavior, but I'd like to keep this functionality available to toggle, given that it's an additional layer of abstraction added onto the existing plant part infrastructure.
Ok, that makes sense. I wonder if we should null out the EIA matches for FERC records when one_to_many=False, so that they aren't mapped at all rather than being mapped to just one of their EIA subcomponents.
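Nulling out those matches could look roughly like this. null_one_to_many_matches is a hypothetical helper sketched for illustration, assuming a single record_id_eia_override_2 column flags the 1:m rows:

```python
import numpy as np
import pandas as pd


def null_one_to_many_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Blank out record_id_eia_override_1 for any FERC record that also has a
    value in record_id_eia_override_2, leaving the record unmapped rather than
    mapped to only one of its EIA subcomponents."""
    out = df.copy()
    is_multi = out["record_id_eia_override_2"].notnull()
    out.loc[is_multi, "record_id_eia_override_1"] = np.nan
    return out


df = pd.DataFrame(
    {
        "record_id_ferc1": ["f1", "f2"],
        "record_id_eia_override_1": ["e1", "e2"],
        "record_id_eia_override_2": [None, "e3"],
    }
)
result = null_one_to_many_matches(df)
```

Here f2 would end up with no EIA match at all instead of being mapped to e2 alone.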
@@ -1,213 +1,6 @@
record_id_eia,record_id_ferc1,signature_1,signature_2,notes
Just checking, but all the deleted records here are one-to-many records that were matched to whatever was in record_id_eia_override_1, and all the added records here are the same record_id_ferc1 but mapped to the new PPE one_to_many record you created?
Hm, no. The 1:m process does not overwrite this CSV, but merely updates the records internally. Any changes here are a result of running the validation process with the complete set of newly integrated data. If this shouldn't be included in this PR, I can remove it (though I'm not sure why the result would be different from what's currently in dev).
That's what I'm curious about. It might be a good idea to compare this file to the one in dev to see what's going on. I don't see why there would be any changes to this file if you're not updating it with 1:m matches. One thought is that the new training data (created when you run the validate-and-add-to-training function) might count as overrides even if it's already in the old training data, i.e., it replaces current data with the same values instead of only replacing values that are different.
Some look exactly the same, some have different signatures, and some only appear in the new data. Which suggests to me that the source data from your PR and this one aren't actually the same.
I'm not sure I'm fully understanding this comment. So far we've only validated the first column of matches. This PR just extends the same checks to the other columns, which is how this issue was initially flagged and I knew to address it. What is the additional check you are proposing?
I wouldn't worry about it because it's definitely outside the scope of this issue. I was just thinking that if you found bad matches already in the training data, that's because we don't run any of our validation functions on the training data itself, just the proposed additions to it. So if something were to accidentally slip through unvalidated or get altered in the training data csv, it would go undetected. I may have read this wrong too - I don't know if those matches you found were already in the training data or came from the new data.
Ok, I went through the values. In the "Files Changed" diff it says there are thousands of changes -- this is because the override capability replaces matches even if they are the same as what is already in the training data. All of these spreadsheets have already been validated and run through the training data, so it makes sense that there are tons of overrides. Some of them (below) are actual overrides with different values, but most of them are the same thing as what's already in the training data. We should probably create a flag in the code that only "overrides" values that actually differ from the existing training data. Otherwise, I think we can just keep these changes as part of this PR.
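Filtering out no-op overrides could be done with an anti-join against the existing training data. This is a sketch under simplifying assumptions (only the two ID columns matter; filter_real_overrides is a hypothetical name, not PUDL code):

```python
import pandas as pd


def filter_real_overrides(new: pd.DataFrame, training: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose (FERC, EIA) pair is not already in the training
    data, so unchanged matches are not re-recorded as overrides."""
    merged = new.merge(
        training[["record_id_ferc1", "record_id_eia"]],
        on=["record_id_ferc1", "record_id_eia"],
        how="left",
        indicator=True,
    )
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


training = pd.DataFrame(
    {"record_id_ferc1": ["f1", "f2"], "record_id_eia": ["e1", "e2"]}
)
new = pd.DataFrame(
    {"record_id_ferc1": ["f1", "f2"], "record_id_eia": ["e1", "e9"]}  # f2 changed
)
real = filter_real_overrides(new, training)
print(real["record_id_ferc1"].tolist())  # ['f2']
```

With a filter like this, only f2 (whose EIA match actually changed) would show up in the diff.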
Background
The Plant Part List (PPL) shows EIA generators at a variety of useful aggregations.
Part of the role of the PPL is to help connect EIA records with FERC records that might be reported at different aggregations. Most of the time, pre-defined aggregations such as plant, unit, prime mover, etc. are good enough, but occasionally a FERC record comprises a unique aggregation of EIA records not currently represented in the PPL.
This becomes apparent when you're going through the programmatically generated overrides sheet for a given utility subset and you notice that a FERC record does not match any one of the existing PPL records, but rather a combination of two or more. When this is the case, we need a way to fabricate a new PPL record that corresponds to that FERC record so we can more accurately map the two datasets.
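Conceptually, fabricating such a record means rolling several existing PPL rows up into one synthetic row. The sketch below is purely illustrative, with made-up column names and a hypothetical helper (the real PPL carries many more columns and aggregation rules):

```python
import pandas as pd


def fabricate_one_to_many_record(
    ppl: pd.DataFrame, eia_ids: list[str], new_id: str
) -> pd.Series:
    """Combine several existing PPL records into one synthetic record whose
    capacity is the sum of its parts (illustrative columns only)."""
    parts = ppl[ppl["record_id_eia"].isin(eia_ids)]
    return pd.Series(
        {
            "record_id_eia": new_id,
            "capacity_mw": parts["capacity_mw"].sum(),
            "plant_part": "one_to_many",
        }
    )


ppl = pd.DataFrame(
    {
        "record_id_eia": ["gen_a", "gen_b", "gen_c"],
        "capacity_mw": [100.0, 50.0, 25.0],
    }
)
rec = fabricate_one_to_many_record(ppl, ["gen_a", "gen_b"], "gen_a_gen_b")
print(rec["capacity_mw"])  # 150.0
```

The fabricated record can then stand in for the combination of EIA generators that the FERC record actually represents.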
The code reuses the existing validation infrastructure with different inputs, so additional new tests are not needed.
Scope of PR
This PR addresses the first part of issue #1555.