
Solve 2023-03-29 nightly build failure issues #2469

Merged
merged 8 commits into from
Mar 30, 2023

Conversation

bendnorman
Member

@bendnorman bendnorman commented Mar 29, 2023

PR Overview

We got a few failures in the most recent nightly build. Two new PRs made it into this build. The failures:

  1. A log file for the build is not being written to the output bucket or sent to the #pudl-deployments slack channel. I pulled the test errors below from Cloud Logging.

  2. test/validate/eia_test.py::test_no_null_cols_eia failed on the frc_eia923 table:

E           ValueError: Null columns found in frc_eia923: ['bulk_agg_fuel_cost_per_mmbtu']

@zaneselvans @e-belfer did #2447 touch this table?

  3. test/integrations/etl_test.py::test_pudl_engine found a foreign key constraint failure:
E           pudl.io_managers.ForeignKeyErrors: Foreign key error for table: purchased_power_ferc1 -- power_purchase_types_ferc1 {self.foreign_key} -- on rows {self.rowids}

I'm a dingdong and forgot to make the second half of the error message an f-string. purchased_power_ferc1 is the child table and power_purchase_types_ferc1 is the parent table. I don't think either of the two PRs directly changed anything about FERC tables. I'm going to download pudl.sqlite and run the fk checks locally to debug this.
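For reference, a minimal sketch of the missing-f-prefix bug and its fix. The class and attribute names here are inferred from the error message above, not copied from pudl.io_managers:

```python
# Sketch of the f-string fix, assuming a ForeignKeyErrors exception class
# with foreign_key and rowids attributes (names taken from the error above).
class ForeignKeyErrors(Exception):
    def __init__(self, child_table, parent_table, foreign_key, rowids):
        self.foreign_key = foreign_key
        self.rowids = rowids
        super().__init__(
            f"Foreign key error for table: {child_table} -- {parent_table} "
            # The original bug: this second half was a plain string, so the
            # placeholders printed literally instead of being interpolated.
            f"{self.foreign_key} -- on rows {self.rowids}"
        )

err = ForeignKeyErrors(
    "purchased_power_ferc1", "power_purchase_types_ferc1", "(code)", [2, 3, 4]
)
print(err)
```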

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests.
  • Include unit tests for new functions and classes.
  • Add defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans
Member

zaneselvans commented Mar 29, 2023

This seems like a vote for increasing the priority of setting up the check_foreign_keys() follow-on job!

#2447 should not have touched any of this stuff. It doesn't even create a new raw asset b/c of the weird archive problem with the 2018 data, so all it adds is some additional spreadsheet mapping metadata. And I don't think anything has changed with the Bulk EIA fuel aggregations either.

As far as I know nothing has changed with the ferc1 tables either, so... that seems very odd!

@bendnorman
Member Author

bendnorman commented Mar 29, 2023

Full FK failure error message:

pudl.io_managers.ForeignKeyErrors: Foreign key error for table: purchased_power_ferc1 -- power_purchase_types_ferc1 (code) -- on rows [     2      3      4 ... 197521 197522 197523]

Looks like the power_purchase_types_ferc1 table wasn't populated at all. Maybe there was a failure in the DAG I didn't catch?

@bendnorman
Member Author

Ah, ok, we got the FK error because a MetadataDiffError got thrown when trying to load the power_purchase_types_ferc1 table. I guess this could happen if:

  1. Another asset creates pudl.sqlite but hasn't created the metadata yet.
  2. The power_purchase_types_ferc1 asset's instance of SQLiteIOManager reflects the empty database and throws an error because the schema is different; the asset fails to be written to the db, resulting in an FK error.
  3. The other asset's instance of SQLiteIOManager finishes creating the db schema, successfully loads its asset, and continues on with the DAG.
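A tiny runnable sketch of step 2's failure mode, using plain sqlite3. MetadataDiffError and the table names are taken from this thread; the real check in SQLiteIOManager presumably compares full metadata, not just table names:

```python
import sqlite3

class MetadataDiffError(Exception):
    """Stand-in for the error named in this thread."""

def existing_tables(conn: sqlite3.Connection) -> set:
    """Reflect the table names actually present in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    return {name for (name,) in rows}

# Schema the IO manager expects to find (names from the FK error above).
expected = {"power_purchase_types_ferc1", "purchased_power_ferc1"}

# Another process has created the database file but not the schema yet,
# so reflection sees an empty database and the comparison fails.
conn = sqlite3.connect(":memory:")
found = existing_tables(conn)
if found != expected:
    print(f"MetadataDiffError: expected {sorted(expected)}, found {sorted(found)}")
```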

My two initial thoughts are:

  1. Explore if it's possible to implement the desired concurrent behavior with sqlite.
  2. If not, remove the metadata diff logic and scope using a db setup process like db migrations.

@jdangerx
Member

I think we lost the logs because we don't have pudl-etl.log anymore (screenshot omitted), which happened when we stopped using configure_root_logger with a log_file set. I can push some changes to this branch to test out whether pudl-etl.log gets written.

@bendnorman
Member Author

bendnorman commented Mar 29, 2023

I don't think the pudl-etl.log file was previously being created by a Python logger. We were just piping stdout and stderr into a new file called pudl-etl.log. I don't think it's advisable to have multiple processes writing to a logfile via a Python logger.

I think the path for $LOGFILE might be incorrect.

This is the line that copies over the outputs and log file to the GCS bucket.

gsutil -m cp -r $PUDL_OUTPUT "gs://nightly-build-outputs.catalyst.coop/$ACTION_SHA-$GITHUB_REF"

In .env, LOGFILE=/home/catalyst/pudl_out/pudl-etl.log but PUDL_OUTPUT=/home/catalyst/pudl_work/output, so the logfile isn't included in the gsutil cp command. Same with the slack file upload:

upload_file_to_slack "${PUDL_OUTPUT}/pudl-etl.log" "Logs for $ACTION_SHA-$GITHUB_REF:"

I think all we have to do is update the $LOGFILE path.
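As a sketch of that one-line fix (variable names and paths are the ones quoted above), .env would point LOGFILE inside PUDL_OUTPUT so the existing gsutil cp and Slack upload steps pick the log up automatically:

```shell
# .env sketch: keep the log file inside the directory that gets copied to GCS.
PUDL_OUTPUT=/home/catalyst/pudl_work/output
# Before: LOGFILE=/home/catalyst/pudl_out/pudl-etl.log (outside PUDL_OUTPUT)
LOGFILE=${PUDL_OUTPUT}/pudl-etl.log
```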

* pytest overwrites the logs at --log-file, so split into separate files
* pytest/our cli arguments differ by having/not having a hyphen

Excerpt of the build script lines under review:

--live-dbs test
}

function shutdown_vm() {
# Copy the outputs to the GCS bucket
gsutil -m cp -r $PUDL_OUTPUT "gs://nightly-build-outputs.catalyst.coop/$ACTION_SHA-$GITHUB_REF"

upload_file_to_slack "${PUDL_OUTPUT}/pudl-etl.log" "Logs for $ACTION_SHA-$GITHUB_REF:"
upload_file_to_slack "${PUDL_OUTPUT}/ferc-to-sqlite.log" "ferc_to_sqlite logs for $ACTION_SHA-$GITHUB_REF:"
Member

Do we prefer looking at raw logs vs. in Cloud Logging? Seems like the extra filtering tools from Cloud Logging might be handy, and the fact that they stream in vs. being uploaded at the end is nice too.

Member Author

It might be best to use Cloud Logging in the end. However, having the logs associated with a particular commit and set of outputs is nice. There's probably a way to port logs from Cloud Logging to the outputs.catalyst.coop/{commit_sha} bucket instead of capturing the logs by piping them to a file at the end of the script.

@jdangerx
Member

I don't think the pudl-etl.log file was previously being created by a Python logger. We were just piping stdout and stderr into a new file called pudl-etl.log. I don't think it's advisable to have multiple processes writing to a logfile via a Python logger.
...
I think all we have to do is update the $LOGFILE path.

Ahh, yeah makes sense. I'll do that then, and leave the "should we just be using Cloud Logging" question as a non-blocking discussion :)

@bendnorman
Member Author

bendnorman commented Mar 29, 2023

The frc_eia923 validation failed for the same reason the FK constraint test failed: the fuel_receipts_costs_aggs_eia table was not loaded to the db because of a MetadataDiffError.

I think the easiest path forward to make dev pass again is to remove the changes introduced in #2331. This means we'll continue to receive database errors when new metadata is added. Another interim solution would be to catch any sqlite3.OperationalError in SQLiteIOManager.handle_output() and add an additional message like "If you recently updated resource definitions in pudl.metadata.resources, delete the pudl.sqlite database and rerun the ETL".
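That interim catch might look roughly like this sketch — SQLiteIOManager's internals here are a guess for illustration, not the actual PUDL implementation:

```python
import sqlite3

class SQLiteIOManager:
    """Illustrative stand-in for pudl.io_managers.SQLiteIOManager."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def handle_output(self, table_name: str, rows: list) -> None:
        try:
            self.conn.executemany(
                f"INSERT INTO {table_name} VALUES (?, ?)", rows
            )
        except sqlite3.OperationalError as err:
            # Re-raise with the actionable hint proposed above.
            raise sqlite3.OperationalError(
                f"{err} -- If you recently updated resource definitions in "
                "pudl.metadata.resources, delete the pudl.sqlite database "
                "and rerun the ETL."
            ) from err

conn = sqlite3.connect(":memory:")
mgr = SQLiteIOManager(conn)
try:
    # No schema has been created yet, so the write fails with the hint attached.
    mgr.handle_output("fuel_receipts_costs_aggs_eia", [(1, "a")])
except sqlite3.OperationalError as err:
    msg = str(err)
    print(msg)
```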

I still think the long term solution for handling database schema changes is to use db migrations but would love to hear other people's thoughts in #2378.

@jdangerx
Member

jdangerx commented Mar 29, 2023

I agree that we should revert #2331 and re-evaluate our approach - then at least we won't be blocking all the other work that wants to get into dev.

(Also, I think the discussion is probably best suited for #2377, the actual parent issue.)

@bendnorman
Member Author

Sounds good to me. Let's revert it, and I'll add some thoughts to #2377.

…a-compare"

This reverts commit 3d12720, reversing
changes made to b14a8c1.
@codecov

codecov bot commented Mar 29, 2023

Codecov Report

Patch coverage: 100.0% and project coverage change: -0.1% ⚠️

Comparison is base (5576f10) 86.7% compared to head (e097647) 86.7%.

❗ Current head e097647 differs from pull request most recent head 6384ec7. Consider uploading reports for the commit 6384ec7 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2469     +/-   ##
=======================================
- Coverage   86.7%   86.7%   -0.1%     
=======================================
  Files         81      81             
  Lines       9453    9438     -15     
=======================================
- Hits        8203    8183     -20     
- Misses      1250    1255      +5     
Impacted Files Coverage Δ
src/pudl/io_managers.py 88.5% <100.0%> (-0.6%) ⬇️

... and 3 files with indirect coverage changes


@zaneselvans zaneselvans merged commit 442224e into dev Mar 30, 2023
@zaneselvans
Member

I attempted to merge this into dev before the nightly build kicked off but missed it by 1 minute.

@zaneselvans
Member

I cancelled the scheduled deployment since it seemed doomed, and kicked one off manually on the post merge dev.

@jdangerx jdangerx deleted the patch-2023-03-29-nightly-build-failure branch April 8, 2023 16:21