Release v0.4.0 #1087
Conversation
* Flesh out the release notes with changes since the last release and known issues.
* Update data validation criteria to accommodate the new record counts in the many tables that now contain more data.
* Update and comment the databeta.sh script for making quick-and-dirty data releases.
* Add the censusdp1tract SQLite DB to the list of DBs that are accessed directly, rather than regenerated, when you run pytest --live-dbs.
* Change the integration tests to skip epacems_to_parquet when running with --live-dbs.
* Reduce the margin for checking expected row counts to zero, so any change in the number of output rows will now cause validation to fail. Seeing that the results have changed is informative even when the change is smaller than an additional year's worth of data; e.g., fixing the leading zeros on generator IDs changed the number of rows by a few here and there. Once we're running the validations automatically, it'll be good to see these changes appear as a result of code changes, so we know what we're affecting.
- Removed the replaces that were dealing with null values for string types, and added a check to ensure all string types in pc.column_dtypes are the nullable string types.
- Inserted .locs in the zip code zero padding.
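Roughly, the two changes above amount to the following sketch. The `column_dtypes` dict here is a hypothetical stand-in for the real mapping in `pc.column_dtypes`, which is much larger:

```python
import pandas as pd

# Stand-in for pc.column_dtypes; the real mapping covers every PUDL column.
column_dtypes = {
    "plant_id_eia": "Int64",
    "plant_name_eia": "string",
    "zip_code": "string",
}

# Every string-ish column should use pandas' nullable StringDtype ("string"),
# not the legacy "object" dtype, so nulls stay pd.NA and no .replace() calls
# are needed to clean them up afterwards:
assert not any(dtype == "object" for dtype in column_dtypes.values())

# Zero-pad ZIP codes using .loc so the assignment hits the original frame:
df = pd.DataFrame({"zip_code": ["2138", "60606"]}).astype({"zip_code": "string"})
df.loc[:, "zip_code"] = df["zip_code"].str.zfill(5)
```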
Codecov Report

```diff
@@           Coverage Diff            @@
##             dev    #1087     +/-   ##
==========================================
- Coverage   81.43%   81.41%   -0.02%
==========================================
  Files          97       49      -48
  Lines       12109     6089    -6020
==========================================
- Hits         9860     4957    -4903
+ Misses       2249     1132    -1117
```
Added a parametrized test that checks the date frequency of output dataframes, relative to the gens_eia860 dataframe, to ensure that we have annual / monthly dataframes as expected. This test is currently expected to fail because of the bug referenced in issue #1088. Also consolidated some of the "fast output tests" using parametrization, and improved coverage to include all of the potential ferc1, eia860, eia923, and mcoe output tables.
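A rough sketch of what such a parametrized frequency check could look like. The table names match real PUDL outputs, but the data, the `TABLES` dict, and the `report_freq` helper are stand-ins invented for illustration, not the actual test:

```python
import pandas as pd
import pytest

# Hypothetical stand-in data; the real test pulls these from a PudlTabl object.
TABLES = {
    "gens_eia860": pd.DataFrame(
        {"report_date": pd.to_datetime(["2018-01-01", "2019-01-01", "2020-01-01"])}
    ),
    "gen_eia923": pd.DataFrame(
        {"report_date": pd.date_range("2018-01-01", periods=12, freq="MS")}
    ),
}


def report_freq(df):
    """Classify a table as annual ("AS") or monthly ("MS") from its report dates."""
    dates = pd.DatetimeIndex(df["report_date"].unique()).sort_values()
    step = dates.to_series().diff().median()
    return "AS" if step >= pd.Timedelta(days=360) else "MS"


@pytest.mark.parametrize(
    "df_name,expected_freq",
    [("gens_eia860", "AS"), ("gen_eia923", "MS")],
)
def test_report_date_freq(df_name, expected_freq):
    assert report_freq(TABLES[df_name]) == expected_freq
```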
I've got a few questions, but nothing major.
```diff
  .pipe(pv.check_max_rows, expected_rows=expected_rows,
-       margin=0.05, df_name=df_name)
+       margin=0.0, df_name=df_name)
```
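For context, a minimal sketch of what a row-count check with a margin looks like. The real `pudl.validate.check_max_rows` may differ internally; only the `expected_rows` / `margin` / `df_name` parameters are taken from the diff itself:

```python
def check_max_rows(df, expected_rows, margin=0.0, df_name=""):
    """Fail if df has more rows than expected_rows, plus an optional margin.

    With margin=0.0 any growth in row count fails; margin=0.05 tolerates
    up to 5% more rows than expected.
    """
    max_rows = expected_rows * (1.0 + margin)
    if len(df) > max_rows:
        raise ValueError(
            f"{df_name}: found {len(df)} rows, more than the allowed "
            f"{max_rows:.0f} ({expected_rows} expected, margin={margin})"
        )
    return df  # return the dataframe so the check can sit in a .pipe() chain
```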
why turn the margin to 0? that seems extreme given that we know there will be some fluctuation
I think we may often want to know about these consequences though. Like fixing the leading-zeroes in generator IDs does change the number of records in some cases since the aggregations lead to fewer groupings. In this case I also wanted to do it so that I could update the expected number of rows to reflect the actual expected number of rows. It's more a warning that "Hey, something changed! Did you expect it to change? If not, maybe you should look into why it changed."
yeah that makes sense to me. especially in the world in which we have the validation tests running every night.
In the before-times, prior to the nightly validation tests (i.e., now), I think this behavior is going to be hard to interact with. We will tweak things over time, not realize when, and have to play forensic investigator for issues that are probably mostly non-issues.
Yeah this is definitely with an 👁️ toward the aftertimes. I guess I was also thinking that since we run the validation tests so rarely now it's not going to be a common annoyance. And we can look at the numbers and be like "Meh, close enough." if we want to.
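The leading-zero example from this thread is easy to demonstrate. This is a toy illustration, not PUDL code: once `"1"` and `"01"` are normalized to the same generator ID, the aggregation produces fewer groups, so the output row count changes even though no data was added:

```python
import pandas as pd

df = pd.DataFrame({
    "generator_id": ["1", "01", "2"],
    "net_generation_mwh": [10.0, 5.0, 7.0],
})

before = df.groupby("generator_id").sum()  # 3 groups: "01", "1", "2"

# Fix the leading zeros, as described in the discussion above:
df["generator_id"] = df["generator_id"].str.zfill(2)
after = df.groupby("generator_id").sum()   # 2 groups: "01", "02"

assert (len(before), len(after)) == (3, 2)
```

With a zero margin, this kind of deliberate 3-to-2 shift fails the row-count validation and prompts the "did you expect this?" conversation.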
```
This addresses a bunch of unsatisfied foreign key constraints in the original
databases published by FERC.
* We're doing much more software testing and data validation, and so hopefully
  we're catching more issues early on.
```
should we add the (experimental!) net generation allocation?
Please feel free to add a blurb explaining it!
Okay, I added a section. I mostly stole content from the main module docs; feel free to edit or condense.
… into release-v0.4.0
Documentation and data validation tweaks in preparation for releasing 0.4.0.
Please suggest content for the Release Notes!