Update R pre-commit package version and fix broken tests #64

Merged
jeancochrane merged 15 commits into 2024-data-update from jeancochrane/fix-pre-commit
Jan 14, 2026

jeancochrane (Member) commented Jan 5, 2026

Background

This PR implements two fixes for problems with our automated checks that are preventing us from merging PRs:

  1. The R pre-commit package version that we use is out of date, which causes it to try to install R dependencies that are no longer compatible with R 4.5.x (most notably digest). This PR updates our R pre-commit package version to get it working with R 4.5.x.
  2. Two snapshot tests for the lookup_agency() function are failing for reasons that I can't quite figure out. This PR switches to a more modern version of the snapshot test that resolves the error and will provide better output if it ever fails again in the future. (See the section below for more details.)

Failing snapshot tests

We use two snapshot tests in order to test that the lookup_agency() function returns exactly the same output for two medium-sized queries. Those tests are currently implemented using the testthat assertion function expect_known_hash():

expect_known_hash(
  lookup_agency(2014:2019, "12064"),
  "cf6dcb93bf"
)
expect_known_hash(
  lookup_agency(sum_df$year, sum_df$tax_code),
  "30ede4ede0"
)

For some reason, these two lookup_agency() calls return a different hash on CI than they do locally (see here for example CI logs). The hash matches the expected value in the test when I run it on my local machine, but the hash is different when the test runs on the test-coverage GitHub workflow.

I spent a few hours trying to figure out the source of the discrepancy but I couldn't quite get it. During my investigation, I wrote a script to confirm that the local and CI dataframes have exactly the same contents but different object hashes. In order to run this script, you'll need to manually download the following CI artifacts and save them to the corresponding filename in your ptaxsim/ directory (I didn't bother scripting this download because it requires a GitHub auth token):

Script code:
library(ptaxsim)

Sys.setenv(PTAXSIM_DB_PATH = "ptaxsim.db")
ptaxsim_db_conn <- DBI::dbConnect(
  RSQLite::SQLite(),
  Sys.getenv("PTAXSIM_DB_PATH")
)
assign("ptaxsim_db_conn", ptaxsim_db_conn, envir = .GlobalEnv)

# Download these .zip files from CI and save them to the current working directory
agency_2014_to_2019_ci_zip_url <- "https://github.com/ccao-data/ptaxsim/actions/runs/20830893132/artifacts/5068218666"
agency_summary_ci_zip_url <- "https://github.com/ccao-data/ptaxsim/actions/runs/20830893132/artifacts/5068218740"

# Function to extract the RDS file from the CI .zip files whose paths are listed above
extract_rds_from_zip <- function(zip_path, extract_dir) {
  unzip(zip_path, exdir = extract_dir)
  rds_files <- list.files(extract_dir, pattern = "\\.rds$", full.names = TRUE, recursive = TRUE)
  if (length(rds_files) == 0) stop("No RDS files found in: ", zip_path)
  rds_files[1]
}

# Extract and read the CI RDS files
dir.create("agency-2014-2019-ci", showWarnings = FALSE)
agency_2014_to_2019_ci_rds_path <- extract_rds_from_zip(
  file.path("agency-2014-2019-ci.zip"),
  file.path("agency-2014-2019-ci")
)
agency_2014_to_2019_ci_df <- readRDS(agency_2014_to_2019_ci_rds_path)

dir.create("agency-summary-ci", showWarnings = FALSE)
agency_summary_ci_rds_path <- extract_rds_from_zip(
  "agency-summary-ci.zip",
  "agency-summary-ci"
)
agency_summary_ci_df <- readRDS(agency_summary_ci_rds_path)

# Load the local data frames
agency_2014_to_2019_local_df <- lookup_agency(2014:2019, "12064")
agency_summary_local_df <- lookup_agency(
  sample_tax_bills_summary$year,
  sample_tax_bills_summary$tax_code
)

# Compare column names
if (!identical(names(agency_2014_to_2019_ci_df), names(agency_2014_to_2019_local_df))) {
  cat("CI columns:    ", paste(names(agency_2014_to_2019_ci_df), collapse = ", "), "\n")
  cat("Local columns: ", paste(names(agency_2014_to_2019_local_df), collapse = ", "), "\n")
  stop("agency_2014_to_2019: Column names do not match (see above for info)")
}

# Compare column types
ci_types <- sapply(agency_2014_to_2019_ci_df, class)
local_types <- sapply(agency_2014_to_2019_local_df, class)
if (!identical(ci_types, local_types)) {
  cat("agency_2014_to_2019: Column types do not match\n")
  cat("CI types:    ", paste(ci_types, collapse = ", "), "\n")
  cat("Local types: ", paste(local_types, collapse = ", "), "\n")
  stop("agency_2014_to_2019: Column types do not match")
}

# Compare values
if (!isTRUE(all.equal(agency_2014_to_2019_ci_df, agency_2014_to_2019_local_df))) {
  cat("agency_2014_to_2019: Column values are not identical\n")
  diff_rows <- which(as.matrix(agency_2014_to_2019_ci_df) != as.matrix(agency_2014_to_2019_local_df), arr.ind = TRUE)
  cat("First few differences (row, col):\n")
  print(head(diff_rows))
  stop("agency_2014_to_2019: Column values are not identical")
}

# Repeat checks for agency_summary
if (!identical(names(agency_summary_ci_df), names(agency_summary_local_df))) {
  cat("agency_summary: Column names do not match\n")
  cat("CI columns:    ", paste(names(agency_summary_ci_df), collapse = ", "), "\n")
  cat("Local columns: ", paste(names(agency_summary_local_df), collapse = ", "), "\n")
  stop("agency_summary: Column names do not match")
}

ci_types <- sapply(agency_summary_ci_df, class)
local_types <- sapply(agency_summary_local_df, class)
if (!identical(ci_types, local_types)) {
  cat("agency_summary: Column types do not match\n")
  cat("CI types:    ", paste(ci_types, collapse = ", "), "\n")
  cat("Local types: ", paste(local_types, collapse = ", "), "\n")
  stop("agency_summary: Column types do not match")
}

if (!isTRUE(all.equal(agency_summary_ci_df, agency_summary_local_df))) {
  cat("agency_summary: Column values are not identical\n")
  diff_indices <- which(as.matrix(agency_summary_ci_df) != as.matrix(agency_summary_local_df), arr.ind = TRUE)
  cat("First few differences (row, col):\n")
  print(head(diff_indices))
  stop("agency_summary: Column values are not identical")
}

# Print hashes, as a final check to demonstrate that the objects are different
# even though their contents are identical
cat("agency_2014_to_2019 local hash: ", digest::digest(agency_2014_to_2019_local_df), "\n")
cat("agency_2014_to_2019 CI hash:    ", digest::digest(agency_2014_to_2019_ci_df), "\n")
cat("agency_summary local hash:      ", digest::digest(agency_summary_local_df), "\n")
cat("agency_summary CI hash:         ", digest::digest(agency_summary_ci_df), "\n")
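
One general caveat when reading the result of the comparison above: all.equal() treats numeric values as equal when they differ by less than a tolerance (about 1.5e-8 by default), whereas digest::digest() hashes the serialized object, so any difference in the underlying bytes changes the hash. The sketch below uses made-up values (the rate column is hypothetical) and base R's serialize() as a stand-in for the hashing step. It only illustrates how "equal contents, different hash" can arise in general; it is not a diagnosis of the CI discrepancy.

```r
# Hypothetical data: the two values differ only by floating-point rounding
a <- data.frame(rate = 0.1 + 0.2)
b <- data.frame(rate = 0.3)

# Tolerance-based comparison, as used in the script above
equal_within_tolerance <- isTRUE(all.equal(a, b))

# Exact comparison of the two R objects
exactly_identical <- identical(a, b)

# Byte-level comparison of the serialized objects, which is what an
# object hash ultimately depends on
same_serialized_bytes <- identical(serialize(a, NULL), serialize(b, NULL))

cat("all.equal:       ", equal_within_tolerance, "\n")
cat("identical:       ", exactly_identical, "\n")
cat("same serialized: ", same_serialized_bytes, "\n")
```

If identical() returned FALSE for the CI and local dataframes, a stricter check such as all.equal() with tolerance = 0 might help narrow down where they differ.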

We shouldn't really be using expect_known_hash() for these tests anymore, because it is deprecated in the latest version of testthat. Instead, testthat now recommends expect_snapshot_output() and expect_snapshot_value() for snapshot tests. Beyond being the recommended approach, these functions produce verbose error output that shows exactly which rows mismatch when a snapshotted dataframe changes. That is a real improvement for our lookup_agency() tests: it is very difficult to debug a test failure based on a changed object hash (as my script above demonstrates), but because the expect_snapshot_*() functions compare against archived output rather than object hashes, they can show us exactly why the output differs from the snapshot if the test fails in the future.
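
As a rough sketch of what the switch looks like, the test below snapshots a small dataframe with expect_snapshot_value(). The fake_lookup_agency() helper and its columns are hypothetical stand-ins so that the example runs without the PTAXSIM database, and style = "deparse" is an assumption chosen for illustration, not necessarily the style used in this PR.

```r
library(testthat)

# Hypothetical stand-in for lookup_agency(), so the sketch runs without
# the PTAXSIM database; the columns are made up for illustration
fake_lookup_agency <- function(year, tax_code) {
  data.frame(
    year = year,
    tax_code = tax_code,
    agency_total_ext = seq_along(year) * 1000
  )
}

test_that("lookup values/data are correct", {
  local_edition(3) # Snapshot expectations require the 3rd edition

  # When run as part of a test file, the first run records the value in a
  # snapshot file under _snaps/, and later runs compare against that file
  expect_snapshot_value(
    fake_lookup_agency(2014:2016, "12064"),
    style = "deparse"
  )
})
```

Snapshots are only recorded and compared when the test runs from a test file (for example via devtools::test() or testthat::test_file()); when the block is run on its own, testthat does not save or compare anything.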

This snapshotting stuff may be unfamiliar so I'm happy to talk it through in person if it would be helpful!

jeancochrane changed the title from "Update R pre-commit package version" to "Update R pre-commit package version and fix broken tests" on Jan 8, 2026
Comment thread data-raw/agency/agency.R
Comment on lines -46 to -47


jeancochrane (Member, Author) commented:

I couldn't find any documentation of this change, but it seems that the latest version of styler bundled with pre-commit is now enforcing a max of two newlines between code blocks. (See here for failing pre-commit logs.) It's possible we could choose to update our styler config to override this setting, but I personally agree that two newlines should be the maximum amount of space between code blocks, so I decided to just implement it across the files that are currently using a max of four newlines.

@@ -0,0 +1,70 @@
# lookup values/data are correct
jeancochrane (Member, Author) commented:

This is an example of a snapshot file -- expect_snapshot_value() generates it automatically the first time it runs, and then on subsequent runs it compares the output of the lookup_agency() function to this file.

lookup_agency(2014:2019, "12064"),
"cf6dcb93bf"

local_edition(3) # Enable snapshot testing
jeancochrane (Member, Author) commented:

This is required since the new snapshot tests are part of the 3rd edition of testthat, which is opt-in. I'm choosing to opt in for this one test only, since I suspect we'll need to migrate other tests to meet the new standard and I don't want to bother with that right now.
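
To illustrate the scoping, here is a minimal sketch showing that local_edition(3) called inside a test_that() block applies only to that block. It uses testthat's edition_get() to read the active edition; the variable names are made up for the example.

```r
library(testthat)

# Edition that is active before the test runs
edition_before <- edition_get()

# Placeholder that the test body fills in
edition_inside <- NA

test_that("the 3rd edition is active only inside this test", {
  local_edition(3) # Opt in for this test only
  edition_inside <<- edition_get()
  expect_true(edition_inside == 3)
})

# Once the test body exits, the edition reverts to the previous value
edition_after <- edition_get()

cat("before the test:", edition_before, "\n")
cat("inside the test:", edition_inside, "\n")
cat("after the test: ", edition_after, "\n")
```

Opting in for the whole package instead would normally be done through the Config/testthat/edition field in DESCRIPTION, which is the broader migration this PR defers.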

Comment thread .lintr
Comment on lines +6 to +8
return_linter = NULL,
commented_code_linter = NULL,
pipe_consistency_linter = pipe_consistency_linter(c("auto"))
jeancochrane (Member, Author) commented:

These defaults have all changed in the latest version of lintr. It might be worth conforming to the new lintr defaults at some point, but I don't want to deal with it right now, so I'm just reverting to the previous defaults.

Comment thread .pre-commit-config.yaml
  hooks:
    - id: check-added-large-files
-     args: ['--maxkb=200']
+     args: ['--maxkb=500']
jeancochrane (Member, Author) commented:

One of the snapshot test files is slightly larger than 200 KB. We probably shouldn't make a habit of committing large files to the repo, but I think one medium-sized snapshot file is fine, so I'm bumping this limit to allow my PR to pass this check.

jeancochrane marked this pull request as ready for review on January 8, 2026 23:48
kyrasturgill (Member) left a comment:

Everything makes sense to me! Thanks for the thorough explanation of these updates.

jeancochrane changed the base branch from master to 2024-data-update on January 14, 2026 16:08
jeancochrane merged commit afae33d into 2024-data-update on Jan 14, 2026
9 checks passed
jeancochrane deleted the jeancochrane/fix-pre-commit branch on January 14, 2026 16:17