
Release v0.4.0 #1087

Merged: 21 commits into dev, Jul 26, 2021
Conversation

zaneselvans (Member)

Documentation and data validation tweaks in preparation for releasing 0.4.0.

Please suggest content for the Release Notes!

zaneselvans and others added 12 commits July 14, 2021 20:14
* Flesh out release notes with changes from last release and known issues.
* Update data validation criteria to accommodate new numbers of records in many tables
  which now have more data in them.
* Update and comment the databeta.sh script for making quick-and-dirty data releases.
* Added the censusdp1tract SQLite DB to the list of DBs which are accessed directly
  instead of being regenerated when you run pytest --live-dbs
* Changed the integration tests to skip epacems_to_parquet when running with --live-dbs
* Reduced the margin for checking expected row counts to zero -- any change in the
  number of output rows will now cause validation to fail. It's useful to see
  that the results have changed even when the change is smaller than an additional
  year's worth of data. E.g. fixing the leading zeroes on generator IDs changed the
  number of rows by a few here and there. When we're running the validations
  automatically it will be good to see these changes appear as a result of code
  changes, so we know what we're affecting.
- Removed the replaces that were dealing with null values for string types;
  added a check to ensure all string types in pc.column_dtypes are the
  nullable string type.
- Used .loc assignments in the zip code zero padding.
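The zero-padding and nullable-string changes above can be sketched roughly as follows. This is a minimal illustration assuming pandas; the column name `zip_code` and the exact checks are hypothetical, not PUDL's actual schema:

```python
import pandas as pd

# Use the nullable string dtype so missing zip codes stay pd.NA instead of
# needing sentinel-string replacements:
df = pd.DataFrame({"zip_code": ["2139", "60201", None]}, dtype="string")

# Zero-pad only the non-null values, using .loc to avoid chained assignment:
mask = df["zip_code"].notna()
df.loc[mask, "zip_code"] = df.loc[mask, "zip_code"].str.zfill(5)

# Sanity check in the spirit of the pc.column_dtypes change: the column must
# still be the nullable string dtype after the transformation.
assert df["zip_code"].dtype == "string"
assert pd.isna(df["zip_code"]).sum() == 1
```

Padding through `.loc` keeps the operation on the original dataframe, so nulls survive untouched rather than being stringified.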
codecov bot commented Jul 16, 2021

Codecov Report

Merging #1087 (cc6ab8b) into dev (178f5a3) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head cc6ab8b differs from pull request most recent head 91e4894. Consider uploading reports for the commit 91e4894 to get more accurate results

@@            Coverage Diff             @@
##              dev    #1087      +/-   ##
==========================================
- Coverage   81.43%   81.41%   -0.02%     
==========================================
  Files          97       49      -48     
  Lines       12109     6089    -6020     
==========================================
- Hits         9860     4957    -4903     
+ Misses       2249     1132    -1117     
Impacted Files Coverage Δ
src/pudl/validate.py 35.19% <ø> (ø)
src/pudl/helpers.py 92.26% <100.00%> (ø)
..._pudl/lib/python3.9/site-packages/pudl/validate.py
...b/python3.8/site-packages/pudl/glue/eia_epacems.py
.../python3.9/site-packages/pudl/transform/ferc714.py
.env_pudl/lib/python3.8/site-packages/pudl/cli.py
.../lib/python3.8/site-packages/pudl/output/eia923.py
...lib/python3.8/site-packages/pudl/extract/eia923.py
...b/python3.9/site-packages/pudl/glue/eia_epacems.py
...lib/python3.9/site-packages/pudl/extract/eia861.py
... and 137 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 178f5a3...91e4894.

Added a parametrized test that checks the date frequency of output
dataframes, relative to the gens_eia860 dataframe, to ensure that we
have annual / monthly dataframes as expected. This test is currently
expected to fail because of the bug referenced in issue #1088.

Also consolidated some of the "fast output tests" using parametrization, and
improved coverage to include all of the ferc1, eia860, eia923, and mcoe output
tables.
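The kind of date-frequency check described above can be sketched as a plain function. This is illustrative only: the `report_date` column name follows the convention mentioned here, but the helper, the table cases, and the classification logic are assumptions, not PUDL's actual parametrized test:

```python
import pandas as pd

def report_date_freq(df: pd.DataFrame) -> str:
    """Classify a dataframe's report_date column as annual or monthly."""
    dates = pd.DatetimeIndex(df["report_date"].unique())
    # Annual data only ever reports one calendar month (January 1 timestamps);
    # monthly data spans multiple months.
    return "monthly" if dates.month.nunique() > 1 else "annual"

# In a real test suite each (table, expected frequency) pair would be one
# pytest.mark.parametrize case; here we just loop over illustrative cases:
cases = {
    "annual": pd.DataFrame({"report_date": pd.to_datetime(["2019-01-01", "2020-01-01"])}),
    "monthly": pd.DataFrame({"report_date": pd.to_datetime(["2020-01-01", "2020-02-01"])}),
}
for expected, df in cases.items():
    assert report_date_freq(df) == expected
```

A check like this would fail for the bug in issue #1088, where outputs expected to be monthly come back annual.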
@zaneselvans zaneselvans linked an issue Jul 22, 2021 that may be closed by this pull request
cmgosnell (Member) left a comment:

i've got a few questions but nothing major

devtools/databeta.sh (resolved)
docs/release_notes.rst (outdated, resolved)
 .pipe(pv.check_max_rows, expected_rows=expected_rows,
-      margin=0.05, df_name=df_name)
+      margin=0.0, df_name=df_name)
Member:
why turn the margin to 0? that seems extreme given that we know there will be some fluctuation

Member Author:
I think we may often want to know about these consequences though. Like fixing the leading-zeroes in generator IDs does change the number of records in some cases since the aggregations lead to fewer groupings. In this case I also wanted to do it so that I could update the expected number of rows to reflect the actual expected number of rows. It's more a warning that "Hey, something changed! Did you expect it to change? If not, maybe you should look into why it changed."
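The zero-margin behavior being discussed can be sketched as follows. This is an illustrative reimplementation in the spirit of a check_max_rows-style validation, not PUDL's actual `pv.check_max_rows` code; the function name and signature here are assumptions:

```python
import pandas as pd

def check_rows(df: pd.DataFrame, expected_rows: int, margin: float, df_name: str) -> pd.DataFrame:
    """Fail if len(df) falls outside expected_rows * (1 ± margin)."""
    lo = expected_rows * (1 - margin)
    hi = expected_rows * (1 + margin)
    if not lo <= len(df) <= hi:
        raise ValueError(
            f"{df_name}: got {len(df)} rows, expected {expected_rows} (margin={margin})"
        )
    return df  # returned unchanged so it can sit in a .pipe() chain

df = pd.DataFrame({"x": range(100)})
df.pipe(check_rows, expected_rows=100, margin=0.0, df_name="demo")  # passes

# With margin=0.0, even a one-row change fails the check:
try:
    df.head(99).pipe(check_rows, expected_rows=100, margin=0.0, df_name="demo")
except ValueError:
    pass
```

With `margin=0.05` the 99-row case above would pass silently; setting the margin to zero is what turns any row-count drift into an explicit failure.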

Member:
yeah that makes sense to me, especially in the world in which we have the validation tests running every night.

but in the before times, without the nightly validation tests (i.e. now!), I think this behavior is going to be hard to interact with. We will tweak things over time, not realize when, and have to play forensic investigator for issues that are probably mostly non-issues.

Member Author:
Yeah this is definitely with an 👁️ toward the aftertimes. I guess I was also thinking that since we run the validation tests so rarely now it's not going to be a common annoyance. And we can look at the numbers and be like "Meh, close enough." if we want to.

This addresses a bunch of unsatisfied foreign key constraints in the original
databases published by FERC.
* We're doing much more software testing and data validation, and so hopefully
we're catching more issues early on.
Member:
should we add the (experimental!) net generation allocation?

Member Author:
Please feel free to add a blurb explaining it!

Member:
okay I added a section. I mostly stole content from the main module docs. Feel free to edit or condense

@cmgosnell cmgosnell merged commit 8ccb3dc into dev Jul 26, 2021
Successfully merging this pull request may close: Some monthly MCOE outputs become annual