Update Terraform configuration to deploy staging vw_pin_sale view #55

Open · jeancochrane opened this issue Oct 20, 2023 · 3 comments

@jeancochrane (Collaborator)

In working on #53, @wagnerlmichael discovered that the testing version of the script won't work as written, because it pulls from the production vw_pin_sale view, which itself pulls from the production sale.flag table instead of the staging version of that table. In order to support removing the joins on sale.flag from the script, we'll need to update our Terraform config to deploy a staging version of vw_pin_sale that pulls data from the staging sale.flag table.

@jeancochrane (Collaborator, Author)

A hiccup with this effort that I should have thought of sooner: Unlike sale.flag, which is generated by code in this repo, the vw_pin_sale view is actually created by code that lives in data-architecture. This means that if we wanted to deploy a staging version of it from the context of model-sales-val, we would need to either duplicate the view query in model-sales-val or somehow reference the version of it stored in the data-architecture repo, neither of which is a particularly appealing choice to me.

Two options off the top of my head:

  1. Revert this section of #54 and maintain a duplicative join between vw_pin_sale and sale.flag in the context of the Glue script
    • Pro: Dead simple and easy to maintain
    • Con: Less efficient and potentially more confusing query due to the duplication
  2. Update the deployment workflow and Terraform config so that on staging deployments we download the vw_pin_sale query from the main branch of data-architecture and use it to populate an aws_glue_catalog_table resource (sketched after this list)
    • Pro: Eliminates duplication in the Glue job query
    • Con: Fragile for a number of reasons:
      • Must be updated if the file structure of data-architecture ever changes
      • Makes local testing of the Terraform setup more complex
      • Still requires some duplication with data-architecture, since I confirmed experimentally that this Terraform resource requires that each column in the view be specified
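For concreteness, here's a rough sketch of what the option 2 resource might look like. The database name, file path, and column list below are all placeholders; in practice the view query would be downloaded from data-architecture in CI:

```hcl
# Sketch only: deploy a staging copy of vw_pin_sale as an Athena view.
resource "aws_glue_catalog_table" "vw_pin_sale_staging" {
  name          = "vw_pin_sale"
  database_name = "default_staging" # hypothetical staging database
  table_type    = "VIRTUAL_VIEW"

  # Athena stores view definitions as a base64-encoded JSON payload wrapped
  # in a comment, so the raw SQL can't be dropped in directly. The file
  # referenced here is a placeholder for the payload built from the
  # data-architecture query.
  view_original_text = "/* Presto View: ${base64encode(file("${path.module}/vw_pin_sale.json"))} */"
  view_expanded_text = "/* Presto View */"

  parameters = {
    presto_view = "true"
  }

  storage_descriptor {
    # This is the duplication called out above: every column of the view
    # must be enumerated here, mirroring the schema in data-architecture.
    columns {
      name = "pin"
      type = "string"
    }
    columns {
      name = "sale_price"
      type = "bigint"
    }
    # ...and so on for each remaining column of vw_pin_sale
  }
}
```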

Option 1 is my preference but I'm curious if @wagnerlmichael or @dfsnow have other ideas!

@wagnerlmichael (Member)

Between these two options I also like option 1. Reverting that section of #54 should give us the exact same behavior. If we add it back, testing is basically good to go, and I think the docs are already set up for that situation (copying data from buckets to mimic prod).

@jeancochrane (Collaborator, Author)

One thought that @dfsnow posted in Teams that's worth preserving:

I think a simpler solution might be:

  • All runs write directly to the prod table, but we add an additional column like type that is either "test" or "prod"
  • vw_pin_sale takes only "prod" flags, but we can query directly to see the result of test runs
  • Terraform still instantiates the Glue job, but only targets the prod assets with the changes outlined. All CI runs target type = "test" by default. We also remove the scheduling component of the job, if there is one
  • When testing is finished, we manually trigger the terraform workflow to instantiate and run the glue job, with an input variable that changes the type to "prod"
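For reference, a minimal Terraform sketch of that idea. The variable, job name, role, and script path are all hypothetical:

```hcl
# Sketch only: thread a test/prod switch through to the Glue job so that
# CI runs write "test" flags and a manually-triggered run writes "prod" flags.
variable "run_type" {
  type        = string
  default     = "test" # CI default; the manual prod workflow overrides this
  description = "Whether flags written by this run are test or prod flags."

  validation {
    condition     = contains(["test", "prod"], var.run_type)
    error_message = "run_type must be either \"test\" or \"prod\"."
  }
}

resource "aws_glue_job" "sales_val_flagging" {
  name     = "sales-val-flagging"
  role_arn = aws_iam_role.glue_job.arn # assumed to be defined elsewhere

  command {
    script_location = "s3://example-bucket/glue/flagging.py" # hypothetical path
  }

  # The script reads --run_type and stamps it onto every flag it writes,
  # so vw_pin_sale can filter to type = 'prod'. Note there is no
  # aws_glue_trigger here: per the plan above, the job is unscheduled.
  default_arguments = {
    "--run_type" = var.run_type
  }
}
```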

We discussed this out of band and decided that in the immediate term we're just going to add back in the join that was removed in #54 to make the existing deployment setup work again. Then in the medium term I'm going to think more about Glue job deployment architecture as a whole while I work on setting up a model deployment pipeline in ccao-data/model-res-avm#23, and I'll plan to keep an eye out for opportunities to align these two approaches to deploying a model. Then we'll revisit the design and hopefully have a plan to improve it when we tackle permit automation in ccao-data/data-architecture#158, which will require a similar type of scheduled data processing job.
