accumulation of any changes in the xbrl validation tests #2071
Conversation
Current validation failures:

- missing small plants errors
- number of rows errors
- column renames/removals -> output failures
- data errors
- duplicates (is this test still appropriate for the XBRL data?)
Codecov Report — Base: 85.0% // Head: 85.1% // Increases project coverage.

@@            Coverage Diff             @@
##    xbrl_integration   #2071   +/-   ##
================================================
  Coverage       85.0%   85.1%
================================================
  Files             72      72
  Lines           8162    8183   +21
================================================
+ Hits            6945    6968   +23
+ Misses          1217    1215    -2

View full report at Codecov.
The duplicates in the hydro and purchased power tables are coming from the DBF data, not the XBRL data. We've punted on figuring out why they are duplicates, but I think they actually should be addressed (and there are only a tiny handful of them).
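Duplicates like these can be surfaced with a quick pandas check before deciding how to resolve them. This is a hypothetical sketch; the column names and sample data below are assumptions for illustration, not PUDL's actual primary keys:

```python
import pandas as pd

# Assumed key columns and toy data, standing in for the real hydro table.
df = pd.DataFrame({
    "utility_id_ferc1": [1, 1, 2],
    "plant_name_ferc1": ["dam a", "dam a", "dam b"],
    "report_year": [2019, 2019, 2019],
    "capex_total": [100.0, None, 50.0],
})
key_cols = ["utility_id_ferc1", "plant_name_ferc1", "report_year"]
# keep=False flags every row in a duplicated group, so both copies show up.
dupes = df[df.duplicated(subset=key_cols, keep=False)]
```

With `keep=False` both members of each duplicated pair are retained, which makes it easy to eyeball which copy carries the nulls.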
@cmgosnell some findings...

and then there were 9:
I think the unexpectedly large low-end tail in natural gas prices may be due to some unfixed unit errors. There's a cluster of fuel prices reported between 0 and 0.5 USD/Mcf that looks suspiciously similar to the main distribution between 0 and 5 USD/Mcf. So maybe it was reported in USD per hundred cubic feet, rather than USD per thousand cubic feet?
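If that hypothesis is right, the misreported values would sit at exactly 1/10th of their true price, so one heuristic correction is to rescale the suspicious low band. This is a hedged sketch with made-up values, not the actual PUDL correction:

```python
import pandas as pd

# Assumed band: prices between 0 and 0.5 USD/Mcf that mirror the shape of
# the main 0-5 USD/Mcf distribution, suggesting USD/Ccf was reported.
prices_mcf = pd.Series([3.2, 0.35, 4.1, 0.28, 2.9])
suspect = prices_mcf.between(0.0, 0.5)
# Multiply the suspect values by 10 to convert USD/Ccf back to USD/Mcf.
corrected = prices_mcf.mask(suspect, prices_mcf * 10)
```

In practice you'd want to confirm the rescaled cluster lines up with the main distribution before applying a blanket fix like this.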
and then there were 4:
I think the oil price distribution actually just includes more lower-priced records now, probably as a result of all the fuel type / units cleaning. There's a little spike at ~0 that's probably bad data, but removing it doesn't make much of a difference in where the 10% cutoff is. So I would say drop the 10% cutoff from 4.0 USD/MMBtu to 3.0 USD/MMBtu on this one.
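The cutoff being discussed is a quantile bound of this general shape. The function name, signature, and sample prices below are illustrative assumptions, not PUDL's actual validation API:

```python
import numpy as np

def check_low_tail(prices, quantile=0.10, low_bound=3.0):
    """Raise if the low quantile of fuel prices drops below low_bound."""
    q = np.quantile(prices, quantile)
    if q < low_bound:
        raise ValueError(f"{quantile:.0%} quantile {q:.2f} is below {low_bound}")
    return q

# Made-up oil prices in USD/MMBtu with a heavier low tail.
oil_prices = np.array([3.1, 3.5, 4.2, 5.0, 6.3, 7.1, 8.0, 9.5, 10.2, 11.0])
q10 = check_low_tail(oil_prices, low_bound=3.0)  # passes with the relaxed bound
```

With the old 4.0 USD/MMBtu bound this sample would fail, which is the behavior the comment above is proposing to relax.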
On the cost correlations, I think you can just remove
src/pudl/transform/ferc1.py (Outdated)

    each other, except one has nulls in the capex columns. Surgically remove the
    record with the nulls.
    """
    if 2019 in df.report_year.unique():
I made this run only if 2019 is being processed, but I could change the assertion below to only yell if len(possible_dupes) != 2 and 2019 in df.report_year.unique().
I think only having the assertion in the conditional is better -- we want these kinds of spot fixes to run and work regardless of what data is being processed, and having the whole thing enclosed in a conditional suggests that it will not work if 2019 isn't in the data.
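The suggested shape of that refactor can be sketched as follows. The column names (record_id, capex_total) and duplicate-detection logic are illustrative assumptions, not the actual PUDL implementation:

```python
import pandas as pd

def drop_null_capex_dupe(df: pd.DataFrame) -> pd.DataFrame:
    """Always run the surgical dedupe; only assert when 2019 is present."""
    possible_dupes = df[df.duplicated(subset=["record_id"], keep=False)]
    if 2019 in df.report_year.unique():
        # The known duplicate pair only occurs in the 2019 data.
        assert len(possible_dupes) == 2, (
            f"expected 2 dupes, got {len(possible_dupes)}"
        )
    # Drop the copy with null capex values, whatever years are being processed.
    null_capex = possible_dupes[possible_dupes.capex_total.isna()]
    return df.drop(index=null_capex.index)
```

Enclosing only the assertion in the conditional means the fix is a no-op on years without the duplicates, rather than appearing 2019-specific.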
Only the tiniest of changes requested!
    if float_cols:
        df.loc[:, float_cols] = df.loc[:, float_cols].replace(0, np.nan)
    return df
Not any worse than it was before but... it would be better if we could distinguish the original nan/0.0 values from each other and end up with the same values. We should make an issue to ensure that we get the same values out of the whole steam plant classifier that we put into it, even if the NULL values need to be filled for the classification process to run.
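One way to preserve the original nan/0.0 distinction would be to record a mask before filling and restore the zeros afterwards. This is a hedged sketch of that idea; the classifier step is elided and the float_cols handling is illustrative, not PUDL's actual code:

```python
import numpy as np
import pandas as pd

def classify_with_restore(df, float_cols):
    # NaN compares False under ==, so this flags only true zeros.
    zero_mask = df[float_cols] == 0
    df.loc[:, float_cols] = df.loc[:, float_cols].replace(0, np.nan)
    # ... run the steam plant classifier here ...
    # Put the original zeros back so output values match the input.
    df.loc[:, float_cols] = df[float_cols].where(~zero_mask, 0.0)
    return df

df = pd.DataFrame({"capex": [0.0, np.nan, 5.0]})
out = classify_with_restore(df, ["capex"])
```

Original NaNs stay NaN and original zeros come back as zeros, so round-tripping through the classifier no longer conflates the two.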
PR Checklist

Before requesting a review of your pull request, please make sure you've done the following:

- Merge the most recent version of dev (or the appropriate upstream branch) into your branch and resolve any merge conflicts. You may need to do this several times over the course of a PR as dev changes frequently.
- See Running Tests with Tox for details on how to run the full test suite locally if you need to debug a particular failure.
- Make sure docstrings and comments are descriptive enough for developers and users to understand your code.
- Make sure the data validation tests pass locally on a fresh DB.
- Add or update unit tests that will catch unexpected data issues.
- Update the release notes to reflect your changes. Make sure to reference the PR and any related issues.
- Note any questions you'd like reviewers to answer, known issues, solutions you're unsatisfied with, or other things that deserve special attention from the reviewer.