Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate and publish a quarto doc with performance results on each model run #62

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Nov 24, 2023

This PR updates the finalize step of the model pipeline to render a stub Quarto doc intended to record performance results for the model. After rendering the doc, the model pipeline then uploads it to S3 and adds its console link to the body of the model SNS notification.

In order to add this step, we also need some way to manage the dependencies for rendering the Quarto doc. This PR proposes we do so by adding a new reporting renv profile and a supplemental lockfile in reports/renv.lock to go along with it. Detailed instructions for maintaining this approach are provided in edits to the model README.

Successful model run here. While the link to the performance report will not be sent via SNS notification until #63 is fixed in #64, you can download the performance report at for inspection at s3://ccao-model-results-us-east-1/report/year=2023/report_type=performance/2023-12-01-silly-tayun.html.

Connects #24.

@jeancochrane jeancochrane force-pushed the 24-infra-updates-generate-and-publish-a-quarto-doc-with-performance-results-on-each-model-run branch from 8af5c2b to 9a600ab Compare November 24, 2023 21:11
@jeancochrane jeancochrane force-pushed the 24-infra-updates-generate-and-publish-a-quarto-doc-with-performance-results-on-each-model-run branch from 9a600ab to fd6538b Compare November 27, 2023 20:28
Dockerfile Show resolved Hide resolved
DESCRIPTION Show resolved Hide resolved
R/helpers.R Show resolved Hide resolved
README.Rmd Outdated Show resolved Hide resolved
README.Rmd Outdated Show resolved Hide resolved
README.md Outdated
| Sale After COVID-19 | Time | logical | | Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020) |
model as of 2023-11-24.

| Feature Name | Category | Type | Possible Values | Notes |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the diff in this table represents the new feature docs that we have added to dbt since we last updated the README.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): This is way too much info for this table. Can we edit the README script to only take the description data up to and excluding the first period of each description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done in 25c8d91 with the README being reknit in 18cfbce!

library(here)
library(lubridate)
library(paws.application.integration)
library(purrr)
library(quarto)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my note above, one way we could make this smoother in the case where the user doesn't choose to install reporting dependencies would be to check if this library is installed or not and skip the generation/upload steps below if so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead we should just assume that they will never need to run the finalize stage and instead guide them toward just running the report directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my comment above, finalize should now be runnable by outside users, but we've also added a tryCatch block around report generation that should fail more gracefully if report dependencies are not installed.

Comment on lines 420 to 424
# Upload performance report
aws.s3::put_object(
paths$output$report$local,
paths$output$report$s3
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sample report here. Accessing the report itself is a bit annoying; since all objects in ccao-model-results-us-east-1 must be private, we can't simply click the Object URL link to download the report, and instead we have to run aws s3 cp to get a copy. If we're willing to accept the slight increase in security risk, it might make sense to add an access policy specifying that any *.html object under the report/ prefix should be public.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Could we generate a pre-signed S3 URL to include with the output with a time expiration? If that's easy to do then it seems like the perfect fit for this use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea! Unfortunately I couldn't get it to work in ~2 hours of work so I abandoned the idea. It works fine via the aws s3 presign CLI, but something about paws.storage::s3()$generate_presigned_url() doesn't work properly in either RStudio or the ECS task container environment and was throwing me various opaque access denied errors in spite of everything meeting spec. I think we should punt on this unless/until the current workflow becomes bothersome; hopefully the extra step to navigate to the console page and click the Download button won't be too bad.

# Publish to SNS
pipeline_sns$publish(
Subject = paste("Model Run Complete:", run_id),
Message = paste0(
"Model run: ", run_id, " complete\n",
"Finished in: ", pipeline_sns_total_time, "\n\n",
"Report link: ", report_url, "\n\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the process of making this change, I realized that I haven't been getting any of these SNS emails recently. Have you? If not, I'll file an issue to investigate why we're missing them; otherwise it's probably just an issue with my subscription or my Outlook config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I haven't! Good catch, and very weird. There could be something in the params file preventing notification. Or this step could just be misconfigured. Either way, let's make an issue for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, issue here: #63

@jeancochrane jeancochrane marked this pull request as ready for review November 28, 2023 17:37
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job @jeancochrane! I think this is a good trade off between complexity and being able to isolate the model dependencies (which I'm going to do more of shortly).

If it makes sense to you, let's add a README section on how to run the reporting independent of the finalize stage.

README.Rmd Outdated Show resolved Hide resolved
README.Rmd Outdated

The process for updating **model report dependencies** is more complex, since it requires the use of a separate `reporting` profile:

1. Run `Sys.setenv(RENV_PROFILE = "reporting")` to set the renv profile to `reporting`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Shouldn't renv::activate() be used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky -- according to the docs, calling renv::activate() with a profile argument will automatically set that profile to be the default:

This creates a profile called "dev", and sets it as the default for the project, so that newly-launched R sessions will operate using the "dev" profile. [...] Alternatively, if you want to activate a particular profile for an R session without setting it as the default for new R sessions, you can use [Sys.setenv(RENV_PROFILE = "dev")].

I think we don't want the reporting profile to be the default, since I expect we want it to be easier to update model dependencies than reporting dependencies, but maybe I'm wrong? In any case, if you agree that this is the right approach, I'll add a quick note to the docs clarifying why we recommend Sys.setenv() instead of renv::activate().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do actually want to use renv::activate(), since just setting the env var won't change the profile without a restart if I understand correctly. We'd then just call it again to switch back to the default profile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I switched up the docs to recommend renv::activate() in 5da06da!

README.Rmd Outdated Show resolved Hide resolved
library(here)
library(lubridate)
library(paws.application.integration)
library(purrr)
library(quarto)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead we should just assume that they will never need to run the finalize stage and instead guide them toward just running the report directly.

Comment on lines 212 to 218
here("reports/performance/performance.qmd") %>%
quarto_render(
execute_params = list(
run_id = run_id,
year = params$assessment$year
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: One big issue I foresee with this is that any failure in the Quarto doc is also going to kill the entire pipeline. I foresee a model run that's been running for 50 hours being killed by a misplaced comma in the doc :(

suggestion (blocking): I recommend that we add some kind of error handling here to prevent the doc from killing the pipeline. The result of the error handling should be included in the SNS notification.

Comment on lines 420 to 424
# Upload performance report
aws.s3::put_object(
paths$output$report$local,
paths$output$report$s3
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Could we generate a pre-signed S3 URL to include with the output with a time expiration? If that's easy to do then it seems like the perfect fit for this use case.

# Publish to SNS
pipeline_sns$publish(
Subject = paste("Model Run Complete:", run_id),
Message = paste0(
"Model run: ", run_id, " complete\n",
"Finished in: ", pipeline_sns_total_time, "\n\n",
"Report link: ", report_url, "\n\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I haven't! Good catch, and very weird. There could be something in the params file preventing notification. Or this step could just be misconfigured. Either way, let's make an issue for it.

@jeancochrane jeancochrane force-pushed the 24-infra-updates-generate-and-publish-a-quarto-doc-with-performance-results-on-each-model-run branch 2 times, most recently from b663934 to ffb017d Compare November 29, 2023 22:36
output,shap,4,interpret,ccao-model-results-us-east-1,output/shap/model_shap.parquet,shap/,shap,card,"year, run_id, township_code, meta_pin, meta_card_num",No,Yes,SHAP values for each feature for each card in the assessment data,NOTE: Each run adds new partitions to S3 which must be added via a Glue crawler
output,feature_importance,4,interpret,ccao-model-results-us-east-1,output/feature_importance/model_feature_importance.parquet,feature_importance/year={year}/{run_id}.parquet,feature_importance,predictor,"year, run_id, model_predictor_all_name",No,Yes,"Feature importance values (gain, cover, and frequency) for the run",
output,report,5,finalize,ccao-model-results-us-east-1,reports/performance.html,report/year={year}/report_type=performance/{run_id}.html,,model run,,No,Yes,Rendered Quarto doc with model performance statistics,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spite of the horrendous diff for this file, the addition of this line should be the only change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a line ending change? Can we make sure it's LF like the rest of the files in this repo (or at least they should be).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure what the source of the diff is, cat -e misc/file_dict.csv shows each line ending with a Unix line ending $. Enabling Hide whitespace in the GitHub UI seems to filter the diff as expected. Let me know if there's any more debugging you'd like me to do, I'm not particularly well-versed in line endings!

Comment on lines +1 to +3
#- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# 1. Setup ---------------------------------------------------------------------
#- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in this file was extracted from 05-finalize.R with very few changes. Since they aren't visible in the diff, I'll call them out in comments below.

Comment on lines +9 to +22
suppressPackageStartupMessages({
library(arrow)
library(aws.s3)
library(aws.ec2metadata)
library(dplyr)
library(glue)
library(here)
library(knitr)
library(lubridate)
library(paws.analytics)
library(paws.application.integration)
library(tidyr)
library(yaml)
})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dependencies have been cleaned up here so that we only import what we need for this pipeline stage.

Comment on lines +31 to +36
# Load various overridden parameters as defined in the `finalize` step
metadata <- read_parquet(paths$output$metadata$local)
cv_enable <- metadata$cv_enable
shap_enable <- metadata$shap_enable
run_id <- metadata$run_id
run_type <- metadata$run_type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these attributes are generated in 05-finalize.R, we have to load them from the output of that stage now.

Comment on lines +238 to +243
# Upload performance report
aws.s3::put_object(
paths$output$report$local,
paths$output$report$s3
)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance report upload is new as of this PR.

Comment on lines +295 to +319
# Get a link to the uploaded Quarto report
report_path_parts <- strsplit(paths$output$report$s3[1], "/")[[1]]
report_bucket <- report_path_parts[3]
report_path <- report_path_parts[4:length(report_path_parts)] %>%
paste(collapse = "/")
# Use direct link to the console instead of to the object so that we don't
# have to bother with signed URLs
report_url <- paste0(
"https://s3.console.aws.amazon.com/s3/object/",
"{report_bucket}/{report_path}?region=us-east-1&tab=overview"
) %>%
glue::glue()

# Publish to SNS
pipeline_sns$publish(
Subject = paste("Model Run Complete:", run_id),
Message = paste0(
"Model run: ", run_id, " complete\n",
"Finished in: ", pipeline_sns_total_time, "\n\n",
"Report link: ", report_url, "\n\n",
pipeline_sns_results
),
TopicArn = Sys.getenv("AWS_SNS_ARN_MODEL_STATUS")
)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a report link to the SNS notification is new as of this PR. Note that this hasn't been tested yet because SNS notifications won't be sent until #63 is fixed by #64.

@jeancochrane
Copy link
Contributor Author

@dfsnow This is ready for another look!

README.Rmd Outdated

The process for updating **model report dependencies** is more complex, since it requires the use of a separate `reporting` profile:

1. Run `Sys.setenv(RENV_PROFILE = "reporting")` to set the renv profile to `reporting`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do actually want to use renv::activate(), since just setting the env var won't change the profile without a restart if I understand correctly. We'd then just call it again to switch back to the default profile.

README.Rmd Outdated
Comment on lines 760 to 773
## Updating R dependencies

There are two lockfiles that we use with renv to manage R dependencies:

1. **`renv.lock`** is the canonical list of dependencies that are used by the **core model pipeline**. Any dependencies that are required to run the model itself should be defined in this lockfile.
2. **`renv/profiles/reporting/renv.lock`** is the canonical list of dependencies that are used to **generate a model performance report** in the `finalize` step of the pipeline. Any dependencies that are required to generate that report or others like it should be defined in this lockfile.

Our goal in maintaining multiple lockfiles is to keep the list of dependencies that are required to run the model as short as possibile. This choice adds overhead to the process of updating R dependencies, but incurs the benefit of a more maintainable model over the long term.

The process for **updating core model pipeline dependencies** is straightforward: Running `renv::install("<dependency_name>")` and `renv::snapshot()` will ensure that the dependency gets added or updated in `renv.lock`, as long is it is imported somewhere in the model pipeline via a `library(<dependency_name>)` call.

The process for updating **model report dependencies** is more complex, since it requires the use of a separate `reporting` profile:

1. Run `Sys.setenv(RENV_PROFILE = "reporting")` to set the renv profile to `reporting`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (blocking): Let's generalize these instructions a bit since I'm going to add one more lock file for the ingest and export dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, done in 5da06da.

README.md Outdated
| Sale After COVID-19 | Time | logical | | Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020) |
model as of 2023-11-24.

| Feature Name | Category | Type | Possible Values | Notes |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): This is way too much info for this table. Can we edit the README script to only take the description data up to and excluding the first period of each description?

README.md Outdated
@@ -708,7 +717,7 @@ the following major changes to the residential modeling codebase:
process was moved to [pipeline/00-ingest.R](pipeline/00-ingest.R),
while the process to [finalize model
values](https://gitlab.com/ccao-data-science---modeling/processes/finalize_model_values)
was moved to [pipeline/06-export.R](pipeline/06-export.R).
was moved to [pipeline/06-export.R](pipeline/07-export.R).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
was moved to [pipeline/06-export.R](pipeline/07-export.R).
was moved to [pipeline/07-export.R](pipeline/07-export.R).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7458bab.

README.md Outdated
workflow to delete test run artifacts in S3 using GitHub Actions.
- Updated [pipeline/05-finalize](pipeline/05-finalize.R) step to
render a performance report using Quarto and factored S3/SNS
operations out into \[pipeline/06-upload.R\].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
operations out into \[pipeline/06-upload.R\].
operations out into [pipeline/06-upload.R](pipeline/06-upload.R).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7458bab.

output,shap,4,interpret,ccao-model-results-us-east-1,output/shap/model_shap.parquet,shap/,shap,card,"year, run_id, township_code, meta_pin, meta_card_num",No,Yes,SHAP values for each feature for each card in the assessment data,NOTE: Each run adds new partitions to S3 which must be added via a Glue crawler
output,feature_importance,4,interpret,ccao-model-results-us-east-1,output/feature_importance/model_feature_importance.parquet,feature_importance/year={year}/{run_id}.parquet,feature_importance,predictor,"year, run_id, model_predictor_all_name",No,Yes,"Feature importance values (gain, cover, and frequency) for the run",
output,report,5,finalize,ccao-model-results-us-east-1,reports/performance.html,report/year={year}/report_type=performance/{run_id}.html,,model run,,No,Yes,Rendered Quarto doc with model performance statistics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a line ending change? Can we make sure it's LF like the rest of the files in this repo (or at least they should be).

Comment on lines 205 to 206
# defined separately, so this script can't be sure that it is error-free, and
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# defined separately, so this script can't be sure that it is error-free, and
#
# defined separately, so this script can't be sure that it is error-free

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 31dc99d.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): Let's add the same tictoc timing code from the earlier stages to this one, since I'm anticipating that the report will take awhile to generate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done in 31dc99d!

@jeancochrane jeancochrane force-pushed the 24-infra-updates-generate-and-publish-a-quarto-doc-with-performance-results-on-each-model-run branch from 8ae6375 to 18cfbce Compare December 1, 2023 19:43
@jeancochrane
Copy link
Contributor Author

Ready for another look @dfsnow!

Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @jeancochrane! Let's merge it.

…arto-doc-with-performance-results-on-each-model-run
@jeancochrane jeancochrane merged commit 8bed163 into master Dec 1, 2023
3 of 4 checks passed
@jeancochrane jeancochrane deleted the 24-infra-updates-generate-and-publish-a-quarto-doc-with-performance-results-on-each-model-run branch December 1, 2023 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Infra updates] Generate and publish a Quarto doc with performance results on each model run
2 participants