Move Openverse API and catalog to openverse.org subdomains #2037

Closed
5 tasks done
sarayourfriend opened this issue May 5, 2023 · 27 comments
Assignees
Labels
🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧭 project: thread An issue used to track a project and its progress 🧱 stack: infra Related to the Terraform config and other infrastructure
Projects

Comments

@sarayourfriend
Contributor

sarayourfriend commented May 5, 2023

Summary

Move the infrastructure which currently exists on openverse.engineering to openverse.org.

Description

Pseudo-project proposal (collapsed to prioritise the issue and document list)

We currently pay for two Cloudflare accounts, one each for openverse.engineering and openverse.org. If we move the API and catalog to live on subdomains of openverse.org, we should be able to change our openverse.engineering account to a free one, saving 200 USD a month that can go towards e.g., Plausible.

In the initial discussion we had about this we only talked about moving the API. However, because Airflow is behind Cloudflare Access, we may need to move it as well. I've tried to understand whether that is the case based on the Cloudflare pricing and it seems like Cloudflare Access might be free under our current usage but it isn't clear to me whether that's free for specific paid account types or free for any account type.

The end result of this project should be that openverse.engineering and its usage are entirely covered by a free Cloudflare account. The planning for this project must consider the various features we need that could complicate this:

  1. Do we need to route redirects through Cloudflare, or could we do it AWS side with a small redirect service? If we do it via Cloudflare, we might need individual page rules for each domain we are redirecting. For the API there are at least two we need to redirect: api.openverse.engineering and api-production.openverse.engineering. Do we need to redirect staging (api-staging)? What about the legacy staging subdomain, api-dev? We also continue to redirect search.openverse.engineering, our legacy frontend domain(s). Cloudflare free only supports three page rules. We currently have 3 redirects already (for the frontend). We would need at minimum one more for the API (api.openverse.engineering) but to match our existing redirect philosophy with the frontend, we should redirect api, api-production and api-staging. That would lead to 6 minimum page rules if we used page rules for the redirects. The other existing page rules are caching related and would be moved into openverse.org, so they do not need to count towards our total page rule utilisation.
  2. Can the catalog stay on openverse.engineering if it is a free account? Namely, as discussed above, do free Cloudflare accounts support Cloudflare Access? If we manage Cloudflare Access on two different Cloudflare zones, the access terraform module will need to be updated or potentially completely re-written to accommodate multiple Cloudflare zones.

Additional considerations:

  • The jumphost is on openverse.engineering. Should it stay there?
  • If we need to configure more redirects than page rules allow on a free account, can a small ECS Nginx or caddy instance be configured to properly handle every redirect we need? Is it cheaper to host an ECS service for this or a small EC2 instance or can it be managed entirely through AWS ALB listener rules instead?
  • Special attention must be paid to timing the transitions and announcements on Make. Should a notice be sent to registered API users?
  • Updates to API domains will be necessary in many parts of our stack, including documentation and the frontend configuration. Frontend test tapes will need to be updated for the new domain. Are there other instances of such fixtures that we would need to update?

A task list was written by @zackkrida and @dhruvkb before. It is listed below in a collapsed element as I think it should be taken with a grain of salt. It demonstrates the overall picture well, especially for communications, but hides what I think are probably the most complex parts of this (namely the individual infrastructure steps we need to take for the first task).

The task list created during the original, internal discussion and proposal of this project.
  • Infrastructure PR to point the new domains (can be done in advance) and allow traffic from openverse.org to the load balancer API service
  • Add Cloudflare page cache rules for the API in openverse.org
  • Follow-up infrastructure PR to switch the old domains to 301 redirects
  • Write a make post
  • Get a list of all unique registered email addresses (pretty easy with the jumphost and a DB query) and draft an email
  • Publish the make post and send the email
  • Update references to https://api(-staging)?.openverse.engineering in the documentation
  • On launch day: merge the infra PR to turn the old domains into redirects
  • On launch day: Downgrade the Cloudflare account for openverse.engineering
  • On launch day: Announce completion of the redirect on the make post and in make slack

Documents

Because this project is relatively clear in its motivations and requirements, we will skip the project proposal. This project thread's description will serve as a general project proposal.

Issues

Issues are mostly in the infrastructure repository, organised into the following four milestones.

Preliminary work blocks the API and Airflow specific work. API and Airflow work can happen in parallel. Finalisation work is blocked by everything else.

Additionally, these four issues are in the monorepo related to this:

@sarayourfriend sarayourfriend added the 🧭 project: thread An issue used to track a project and its progress label May 5, 2023
@openverse-bot openverse-bot added this to Backlog in Openverse May 5, 2023
@AetherUnbound AetherUnbound changed the title Move Openverse API and catalog(?) to openverse.org subdomains Move Openverse API and catalog to openverse.org subdomains Dec 18, 2023
@AetherUnbound AetherUnbound added 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: infra Related to the Terraform config and other infrastructure labels Dec 19, 2023
@sarayourfriend sarayourfriend self-assigned this Jan 4, 2024
@sarayourfriend
Contributor Author

Here are my answers to the questions I raised in the issue description:

Do we need to route redirects through Cloudflare, or could we do it AWS side with a small redirect service?

If there are not sufficient Cloudflare rules in the free tier to redirect everything that needs it, we can use AWS load balancer rules. I don't have a strong preference either way, maybe with a slight lean towards Cloudflare because the rules are more flexible, and technically they're at the edge, and would perform slightly better than load balancer rules. But those benefits are super negligible, so if there is any hassle at all making it work in Cloudflare, AWS LB rules are a perfectly good fallback that will 100% work and will not cost us anything.
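For reference, a load balancer redirect of the kind described here could look roughly like the following Terraform sketch. Only the hostnames come from this thread; the resource name, the `https_listener_arn` variable, and the rule priority are hypothetical, and the actual configuration in the infrastructure repository may differ.

```hcl
# Hypothetical input: ARN of an existing HTTPS listener on the load balancer.
variable "https_listener_arn" {
  type = string
}

# Redirect the legacy API hostnames to the new domain with a 301,
# preserving the request path and query string.
resource "aws_lb_listener_rule" "legacy_api_redirect" {
  listener_arn = var.https_listener_arn
  priority     = 10 # arbitrary value, for illustration only

  condition {
    host_header {
      values = [
        "api.openverse.engineering",
        "api-production.openverse.engineering",
      ]
    }
  }

  action {
    type = "redirect"

    redirect {
      host        = "api.openverse.org"
      protocol    = "HTTPS"
      port        = "443"
      path        = "/#{path}"
      query       = "#{query}"
      status_code = "HTTP_301"
    }
  }
}
```

Because the redirect is a rule on an existing listener, it does not require a new service.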

Can the catalog stay on openverse.engineering if it is a free account?

No reason to do this. Let's move everything over and call it a day.

The jumphost is on openverse.engineering. Should it stay there?

This is no longer true. The new SSH bastion is on openverse.org

can a small ECS Nginx or caddy instance be configured to properly handle every redirect we need?

NO. We will use AWS load balancer rules. We must avoid any solutions that introduce new services!

Should a notice be sent to registered API users?

Yes.

Updates to API domains will be necessary in many parts of our stack, including documentation and the frontend configuration. Frontend test tapes will need to be updated for the new domain. Are there other instances of such fixtures that we would need to update?

The implementation plan will cover this.

@sarayourfriend
Contributor Author

sarayourfriend commented Feb 2, 2024

Implementation plan is approved and merged (thanks @dhruvkb and @AetherUnbound for the careful review).

I've created all the issues for this project. They are primarily in the private infrastructure repository, almost all the work will happen there. The milestones are as follows:

Preliminary work blocks the API and Airflow specific work. API and Airflow work can happen in parallel. Finalisation work is blocked by everything else.

Additionally, these three issues are in the monorepo related to this:

I've updated the issue description to collapse the pseudo-project proposal to prioritise the issue list.

@AetherUnbound
Contributor

Thanks Sara for all this careful planning! I've gone ahead and moved all of the non-blocked tickets in the first preliminary work milestone into our TODOs.

@sarayourfriend
Contributor Author

Thanks for mentioning the blocked issues in the preliminary work milestone. Those are actually finalisation work, I'd just forgotten to move them! They're in the correct milestone now 👍

@sarayourfriend
Contributor Author

The first major changes for this project are underway. I successfully deployed the changes in https://github.com/WordPress/openverse-infrastructure/pull/802 to move our Cloudflare record and page rule handling for live domains out of the next root modules, and into the new cloudflare root module. This went very smoothly, with only a slight hitch: we needed to sort out an issue with the production API RDS engine version. RDS auto-updates the minor version during our maintenance windows, and the Terraform provider supports setting a prefix instead of a specific version in this case, so that you can pin to a major version instead of pinning to a version that would roll back an automated upgrade. In our case, we'd set the engine version to 13.10, which was the version running when we imported the RDS resource, but it had since been updated to 13.13 through the automated process. Therefore, our Terraform configuration represented a downgrade in the engine version. The fix for this was simple: we removed the minor version and relied on the major version prefix of 13.

The documentation on the Terraform provider was helpful here, as was the documentation from the AWS CLI's man page for aws rds modify-db-instance, on the --engine-version input. Specifically, we wanted to make sure that setting the engine version to 13 would neither cause a downgrade to the major version's .0 release (which seemed unlikely) nor cause an immediate upgrade. Neither of these turned out to be a real concern, even if there were a newer minor version supported by RDS, which there isn't (they aren't deploying 13.14 yet). Thank you, @AetherUnbound, for looking at this with me and double-checking the docs to make sure we were all good.
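As an illustration of the engine version fix, a minimal Terraform sketch might look like the following. It assumes a PostgreSQL engine; the resource name, identifier, instance size, storage, and variables are all hypothetical and are not taken from the Openverse configuration. Only the `engine_version = "13"` prefix reflects what is described in this comment.

```hcl
# Hypothetical credentials, declared only to keep the sketch self-contained.
variable "db_username" {
  type = string
}

variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "api" {
  identifier        = "example-api-db" # hypothetical
  engine            = "postgres"       # assumption: PostgreSQL
  instance_class    = "db.t3.micro"    # hypothetical
  allocated_storage = 20               # hypothetical
  username          = var.db_username
  password          = var.db_password

  # Pin only the major version. With automatic minor version upgrades
  # enabled, a full version such as "13.10" would read as a downgrade
  # once RDS has moved the instance to a newer minor release.
  engine_version             = "13"
  auto_minor_version_upgrade = true
}
```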

I updated the PR with that change to the RDS module, finished applying all the changes, and merged it.

Today I opened https://github.com/WordPress/openverse-infrastructure/pull/804, which finishes the extraction of our "ingress" layer, by deduplicating load balancer listeners out of the generic service modules and other generic modules, into a single, much simpler generic module. This also involved a lot of clean up, due to the removal of unused variables from the generic service modules. I wrote a lengthy PR description to cover everything the PR changes, and help reviewers move through it as quickly and as confidently as possible.

While working on that, I noticed some issues with our API environment variables that will need to be adjusted for this project as well, missed during the implementation planning process. I created a new issue for that #3821, and will start working on it today, as it isn't blocked by anything else, but will block the very first task of the API migration issues once the preliminary work is finished.

@sarayourfriend
Contributor Author

Preliminary work for this is finished now. I will start working on preparing the Airflow migration now.

@zackkrida
Member

zackkrida commented Mar 12, 2024

Edit: This rule has been proactively added to Cloudflare manually. When we address WordPress/openverse-infrastructure#325 this change will be codified in our infra repo along with the other firewall rules. No action needs to be taken here.

I wanted to make a note here concerning Cloudflare and some of the currently-manual configuration for dealing with bots. On the frontend we now have "super bot fight mode" enabled, which automatically blocks all traffic from known and likely bad bots, while allowing "verified" web crawlers like Internet Archive, Google, Bing, etc. to access the frontend.

After moving to the openverse.org domain, we probably want to create Web Application Firewall (WAF) rules to skip the Super Bot Fight Mode rules for the API. I would guess that our API users programmatically accessing the API would be marked as bots by Cloudflare and blocked by these rules.

Specifically, our WAF rules need to skip the "http_request_sbfm" (sbfm = super bot fight mode) request phase. I think the whole rule would (roughly) look like this:

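# https://registry.terraform.io/providers/cloudflare/cloudflare/latest/docs/resources/ruleset
resource "cloudflare_ruleset" "skip_sbfm_for_api" {
  zone_id = var.cloudflare_zone_id
  name    = "Skip Super Bot Fight Mode for the API"
  kind    = "zone"
  # Skip rules are defined in the custom firewall phase and name the phase to skip.
  phase   = "http_request_firewall_custom"

  rules {
    action      = "skip"
    # Match only the API hostnames. Backslashes are doubled because this is an HCL string.
    expression  = "(http.host matches \"(api\\.|api-staging\\.)openverse\\.org\")"
    description = "Skip Super Bot Fight Mode for requests to the API"

    action_parameters {
      # sbfm = super bot fight mode
      phases = ["http_request_sbfm"]
    }
  }
}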

@sarayourfriend
Contributor Author

sarayourfriend commented Mar 18, 2024

@zackkrida Can you please add this note to the issue for moving all existing Cloudflare rules from the .engineering zone to the .org zone, so that whoever implements that issue will definitely see this information?

https://github.com/WordPress/openverse-infrastructure/issues/777

Is the main issue needing to make sure that the existing rule for the frontend in the .org zone does not accidentally cause an issue for the API?

BTW: we're not using paths for this project, it'll all be on subdomains, so the expression should check the hostname for the API subdomain, rather than any part of the path. To clarify also, do we need to bypass it for Airflow too? Kibana works fine, so I think Airflow should be okay as well: both are behind access and there is no automated traffic to either. Is that your understanding as well? If so, please clarify this in whatever update you leave in the issue 🙏

@zackkrida
Member

@sarayourfriend I've updated my comment to match the hostname correctly. I've also manually added this rule to Cloudflare now, so no action needs to be taken in https://github.com/WordPress/openverse-infrastructure/issues/777.

Is the main issue needing to make sure that the existing rule for the frontend in the .org zone does not accidentally cause an issue for the API?

Yes, the goal is so that Super Bot Fight Mode doesn't block programmatic API traffic once it's moved over to openverse.org. We do not need to bypass this for Kibana or Airflow.

@sarayourfriend
Contributor Author

I've also manually added this rule to Cloudflare now

As in, added to the .org zone? Will you open a PR to add it to the cloudflare root module of the infrastructure repository? Ideally we move away from the practice of manually defining rules in Cloudflare without reviews or the chance to document in comments on them, particularly for things we expect to exist for a long time or indefinitely.

@zackkrida
Member

@sarayourfriend agreed on no longer manually defining these in the CF UI. I mentioned in my edit that we should move this firewall rule with the others.

@sarayourfriend
Contributor Author

Okay, thanks.

@sarayourfriend
Contributor Author

Yesterday I got up the PR to deploy Airflow with Ansible on a stable EC2 instance: https://github.com/WordPress/openverse-infrastructure/pull/829

Staci's early review comment made me realise I hadn't mentioned in the project thread that the security group refactor described in the implementation plan is not workable. Details of that are described in this comment on the issue that was meant to implement the abstracted security group module.

To summarise: the proposed abstraction in the implementation plan largely misses the point of security groups and how to best organise them. Rather than enforcing a uniform basic standard of ingress/egress rules by applying those rules to each security group individually, we should add instances to relevant security groups configured with the rules relevant to them. We can have several thousands of security groups, and with our service volume we will not run into the limit, even if we had an individual security group for each and every rule (which we don't need). Rather than configuring each EC2 instance's security group with SSH ingress rules, we should just add the instance to a shared security group with those rules.

That will be a long term refactor that I need to sit down and plan out into discrete tasks. We cannot do it as part of this project without causing significant delays to it as well as disruptions to virtually all other ongoing infrastructure work. I don't think this needs an implementation plan, it just needs someone (very likely me) to sit down and look at all our existing security groups to identify the places that need changes. Another issue that complicates this is the need to migrate away from both inline security group rules in Terraform and away from the old security group rule resources, all towards the new rule resources which have the significant benefit of leveraging AWS's relatively new security group rule ids, which neither of the previous approaches (which are the ones we use) implement.
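To make the shared security group idea concrete, here is a minimal Terraform sketch. It is not the planned refactor: every name and variable is hypothetical, and it assumes SSH access should come from a bastion's security group. It uses the newer standalone `aws_vpc_security_group_ingress_rule` resource rather than inline rules.

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "vpc_id" {
  type = string
}

variable "bastion_security_group_id" {
  type = string
}

variable "ami_id" {
  type = string
}

variable "subnet_id" {
  type = string
}

# One shared group that carries the SSH ingress rule.
resource "aws_security_group" "ssh_from_bastion" {
  name        = "ssh-from-bastion"
  description = "Allow SSH from the bastion"
  vpc_id      = var.vpc_id
}

# Standalone rule resource, tracked by its own security group rule ID.
resource "aws_vpc_security_group_ingress_rule" "ssh_from_bastion" {
  security_group_id            = aws_security_group.ssh_from_bastion.id
  referenced_security_group_id = var.bastion_security_group_id
  ip_protocol                  = "tcp"
  from_port                    = 22
  to_port                      = 22
}

# Instances join the shared group instead of redefining the SSH rule.
resource "aws_instance" "example" {
  ami           = var.ami_id
  instance_type = "t3.micro" # hypothetical
  subnet_id     = var.subnet_id

  vpc_security_group_ids = [
    aws_security_group.ssh_from_bastion.id,
  ]
}
```

Any instance that needs SSH access is then added to the shared group, and the rule itself is defined once.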

So, we will still eventually change how we manage security groups to reduce duplication, we just won't do it the way the implementation plan suggests, and we will not do it as part of this project.

@sarayourfriend
Contributor Author

While working on https://github.com/WordPress/openverse-infrastructure/issues/777, I learned that Cloudflare is happy to interpret upstream cache-control header instructions as edge TTL instructions. I've opened #4005 to incorporate our cache control information into the API itself, which will allow us to eliminate a handful of individually defined Cloudflare rules for a single "use the upstream cache-control header as edge ttl" rule for the API.
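A single cache rule of that kind might be sketched as follows. This assumes the block syntax of version 4 of the Cloudflare Terraform provider; the resource name and the hostname expression are illustrative and not copied from the infrastructure repository.

```hcl
# Hypothetical input: the ID of the openverse.org zone.
variable "cloudflare_zone_id" {
  type = string
}

resource "cloudflare_ruleset" "api_cache" {
  zone_id = var.cloudflare_zone_id
  name    = "API cache settings"
  kind    = "zone"
  phase   = "http_request_cache_settings"

  rules {
    action      = "set_cache_settings"
    expression  = "(http.host eq \"api.openverse.org\")"
    description = "Use the upstream Cache-Control header as the edge TTL"
    enabled     = true

    action_parameters {
      cache = true

      # Defer to the origin's Cache-Control header instead of
      # overriding the TTL at the edge.
      edge_ttl {
        mode = "respect_origin"
      }
    }
  }
}
```

With this approach the cache durations live in the API's responses, and the Cloudflare side only needs the one rule.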

@sarayourfriend
Contributor Author

sarayourfriend commented Apr 8, 2024

Airflow is live at airflow.openverse.org 🎉

A final infrastructure PR is up to finalise the Ansible, compose, monitoring, and IAM configuration. Please review this as soon as possible, @WordPress/openverse-catalog @WordPress/openverse-infrastructure. The resources are live in production and ideally these changes are merged to main soon.

@openverse-bot
Collaborator

Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@sarayourfriend
Contributor Author

I am waiting on two reviews for this PR, which blocks further work on the API side of things.

Last week I merged and applied the PR to migrate our openverse.engineering Cloudflare zone's rules into the openverse.org zone. That went well save for some small configuration issues with obscure errors from the Cloudflare API which were easy to resolve.

@AetherUnbound and @stacimc successfully deployed changes to Airflow for a new Airflow and Python version last week using the playbooks. They ran into an issue with the community.docker module being out of date on their local machines, so I worked on using PDM to pin our Ansible dependencies in the infrastructure repository: https://github.com/WordPress/openverse-infrastructure/issues/855. Spurred on by a discussion resulting from the Ansible work, and with the upcoming ingestion worker deployment in mind, I spent two of my work days last week on a PR to use Packer to build AMIs to deploy with ASGs. This will eventually result in changes to all services deployed in EC2 ASGs, which now includes Airflow, but will not be part of this project. Just noting it here because it's taken some time away from pushing forward the API migration to openverse.org, though as I said, work there is somewhat blocked.

I plan to spend at least a few hours this week working on the copy for #3742 and #3743. Everything else really does depend on the openverse.org domains for the API being live, but the copy for the email and Make post (and the management command itself) can be implemented in full, with a placeholder left for the date in the meantime. I don't have an estimated date yet, but given the very slow pace of reviews for the API side of this project, I am targeting the end of June as the earliest likely shipped date for most of this project (with potentially some lingering PRs to update references to the API in Jetpack and Gutenberg, which are not in scope, but referenced by this project).

@sarayourfriend
Contributor Author

The API is now available at api.openverse.org 🎉

@sarayourfriend
Contributor Author

sarayourfriend commented Apr 30, 2024

I've drafted text for the Make post and email https://docs.google.com/document/d/1ESmzbH6vkp8rxJBsy3P0_BLZ01sgKVFAQa3LnPSvBhQ/edit?usp=sharing

@WordPress/openverse-maintainers, please review the text. I'll start working on the baseline requirements for the management command in #3742.

I'll shortly have a PR up for #3741. (Update: #4228)

@sarayourfriend
Contributor Author

I've got the make post up and scheduled, with a switch-over date of 3 June 2024, which @zackkrida and I just decided on. That gives a month of lead time, with the post scheduled to publish 6 May at 00:00 UTC.

This PR introduces the management command for sending the email to registered and verified API users: #4229

Finally, https://github.com/WordPress/openverse-infrastructure/pull/876 updates the canonical URL and introduces a staging-only redirect for testing.

@sarayourfriend
Contributor Author

sarayourfriend commented May 7, 2024

The Make post went out earlier this week: https://make.wordpress.org/openverse/2024/05/06/the-openverse-api-is-moving-to-api-openverse-org/

I got all set up to run the management command to send notifications to registered API users, but found an issue with the query in the original code, and it only pulled 2 email addresses in production to send to. That's not the expected outcome. I dug deeper, and realised we made a mistake in how we wrote the query. We'd written the query off of the OAuth2Verification model, thinking that it would be an easy way to find the email addresses of verified registrations. However, we delete the verification row once the email is verified, so that's not a sound approach.

Instead, we need to query off the registration table where name is the name of a verified application. This is safe because name is unique indexed on the registration table.

I've confirmed that with this re-written query, we get a more expected number of emails to send in production. I'll shortly have a PR up to fix the query.

@sarayourfriend
Contributor Author

sarayourfriend commented May 8, 2024

The emails (750 in the end) announcing the API move are sent as of 2024-05-08T00:59:58.366Z.

I'll open a PR to stage the redirect cut over on 3 June, but until then, all other work on this is blocked, except for https://github.com/WordPress/openverse-infrastructure/issues/786.

@sarayourfriend
Contributor Author

sarayourfriend commented Jun 3, 2024

The redirects are LIVE! Everything appears to be working. I tried out Gutenberg and Jetpack integrations and both look to be fine as far as I can tell.

I've removed https://github.com/WordPress/openverse-infrastructure/issues/781 and https://github.com/WordPress/openverse-infrastructure/issues/782 from the milestones for this project because we agreed they should not block the "shipped" status of this project, and I wanted to make that clear based on the milestone.

With that, the API tickets are done, and I've closed the milestone!

I'm going to start working on https://github.com/WordPress/openverse-infrastructure/issues/785 now, which will free up https://github.com/WordPress/openverse-infrastructure/issues/787 very soon 🎉. I think at that point the project can actually go into success rather than shipped, because the success criterion was to be able to downgrade to the free tier. @zackkrida do you agree with that assessment? If not, on what cue would we move from shipped to success for this project? We could set an arbitrary amount of time to just monitor things generally before declaring this finished?

Also, just for fun, here's a graph showing all the status codes we've returned from openverse.engineering since I applied the redirect 🙂 This in effect proves that openverse.engineering is no longer used for anything but those redirects, meaning it is safe to remove all the rules and such from it.

[Image: graph of the status codes returned from openverse.engineering since the redirect was applied]

@zackkrida
Member

@sarayourfriend sadly, the redirect broke the Inserter > Media > Openverse flow in Gutenberg and (surprisingly to me) the Gutenberg E2E tests, which rely on availability of the Openverse API. There was some chat about this in the Make WP Slack (public, but requires an account).

I made an infra PR to revert the change, and a Gutenberg PR was already merged to replace the URLs to use openverse.org.

To keep old versions of WordPress working, now, we would have to keep openverse.engineering operational indefinitely. That is clearly infeasible, so instead, I think we should re-implement the redirect right after WordPress 6.6 launches.

I also added a status update to our https://make.wordpress.org/openverse/2024/05/06/the-openverse-api-is-moving-to-api-openverse-org/ make post for the change, and pinned that post as we figure out next steps.

@sarayourfriend
Contributor Author

We've got a solution to redirect everything except the media inserter. The solution doesn't prevent the goals of this project from succeeding as it only uses free Cloudflare features and can theoretically exist indefinitely. We can discuss when specifically we would remove it, whether that's with the 6.6 launch or later, in the follow up to this work.

@zackkrida
Member

Once https://github.com/WordPress/openverse-infrastructure/pull/920 is merged this project can be moved from shipped to success 🥳

@sarayourfriend
Contributor Author

I've just applied and merged https://github.com/WordPress/openverse-infrastructure/pull/920!

This project is complete 🎉

Projects
Archived in project
Status: ✅ Success
Openverse
  

4 participants