Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshotting a CDN that has an HTTPS delivery service w/ no cert causes TR crconfig reload failure #5893

Closed
mitchell852 opened this issue May 26, 2021 · 11 comments · Fixed by #6024
Assignees
Labels
bug something isn't working as intended medium impact impacts a significant portion of a CDN, or has the potential to do so Traffic Ops related to Traffic Ops Traffic Router related to Traffic Router
Milestone

Comments

@mitchell852
Copy link
Member

mitchell852 commented May 26, 2021

I'm submitting a ...

  • bug report

Traffic Control components affected ...

  • Traffic Ops
  • Traffic Router

Current behavior:

TR will not load a snapshot that has an HTTPs delivery service with a missing cert, therefore, TR will be stuck with an old snapshot until the problem is resolved and a cert is created for the HTTPS ds or the HTTPS ds is switched to HTTP.

Expected behavior:

Prevent the creation of an invalid snapshot or TR should handle invalid snapshots more gracefully.

Minimal reproduction of the problem with instructions:

  • Create a delivery service with protocol=HTTPS. Do NOT add a cert to the delivery service.
  • snapshot the cdn for the delivery service
  • TR will not reload the new snapshot due to the missing cert

Anything else:

Possible solutions:

  1. when snapshotting a cdn via the api, return 400 if any of the cdn's https delivery services are missing a cert.
  2. auto-generate a self-signed cert when creating an https ds or updating an http ds to an https ds.
  3. TR could somehow filter out the bad ds's and consume the rest of the snapshot?
@mitchell852 mitchell852 added bug something isn't working as intended Traffic Ops related to Traffic Ops Traffic Router related to Traffic Router medium impact impacts a significant portion of a CDN, or has the potential to do so labels May 26, 2021
@jhg03a
Copy link
Contributor

jhg03a commented May 26, 2021

There's a chicken/egg problem that you can't create the SSL cert until the DS is created I believe.

@jhg03a
Copy link
Contributor

jhg03a commented May 26, 2021

The 3rd option is also kindof problematic because the CRConfig and SSL certificate retrieval aren't tied to one another. Certs are just autodeployed without a safety once they show up even if it's broken.

@mitchell852
Copy link
Member Author

There's a chicken/egg problem that you can't create the SSL cert until the DS is created I believe.

true. but the api handler could do it just fine. on create, if protocol=https, call generate handler.

@jhg03a
Copy link
Contributor

jhg03a commented May 26, 2021

There's a chicken/egg problem that you can't create the SSL cert until the DS is created I believe.

true. but the api handler could do it just fine. on create, if protocol=https, call generate handler.

That's basically advocating for option 2.

@rob05c
Copy link
Member

rob05c commented May 26, 2021

I vote for both [1] or [2] (I don't have a strong opinion which), and [3]. Robust systems require multiple failsafes. I know we have limited resources, but config loading (on both TR and Caches) is a particularly dangerous part of our system.

We should have validation in TO and/or TP, and TR and ATS should also detect and refuse to apply changes to a single DS that they recognize is invalid, but still apply changes for every other DS.

This is especially necessary for Self-Service, to catch tenants who mash keys until something broken slips thru one validation. To make the CDN keep working and applying changes for every other tenant.

On the UI Usability side, we should also fix the chicken-and-egg problem, so it's possible to fully create the DS at once, and not necessary to create temporary wrong config in order to get to the final config.

@rawlinp
Copy link
Contributor

rawlinp commented May 27, 2021

I'm -1 on auto-generating a self-signed (unusable) certificate just to bypass this TR safety. I think we should just prevent enabling HTTPS on a delivery service unless it has a certificate. That means if we want HTTPS we still have to:

  1. create the DS
  2. generate or provide a cert
  3. enable https
    which is exactly what we should already be doing today, except with API validation in place.

I don't think we should make TR skip over changes to DSes that it recognizes are invalid but still apply changes for every other DS. As a general principle, if TR gets an invalid CRConfig, I think it should continue to operate with the last known good CRConfig. We generally catch these issues very quickly anyways, and they are pretty much always harmless. If we allow TR to run with partially-valid CRConfigs, we will not be able to look at the CRConfig TR has on-disk and know exactly what a given DS's running config is. The CRConfig should be all or nothing, and that is generally how TR is designed under the hood. Changing that design would be costly and provide very little value IMO.

I understand your point about multiple failsafes, but TR refusing to load an invalid snapshot is a failsafe. We just need the 2nd failsafe in TO to prevent the invalid snapshots in the first place.

@jhg03a
Copy link
Contributor

jhg03a commented May 27, 2021

As an operator, I disagree. Fewer steps and ways to mess things up that don't involve being forced into longer explicit workflows are better. We're aiming for the same endgoal, but which is better:

1. Make the DS
2. Update the cert to be valid, if you have one and not necessary if you're using LE/ACME

versus

1. Try to make the DS and be told it's wrong or get confused as to why I can't enable HTTPS in the first place
2. Make the DS partially
3. Add/update the cert to be valid, which might be self-signed anyway if you don't have a legitimate cert yet
4. Update the DS again to be complete.

There's not a downside in defaulting to a self-signed value and is how most beginner webserver instructions start with. The worst case scenario is that you forget to add the valid cert later which is really just a tradeoff in https connectivity being refused all together versus just being able to allow the insecure content.

@jhg03a
Copy link
Contributor

jhg03a commented May 27, 2021

I understand your point about multiple failsafes, but TR refusing to load an invalid snapshot is a failsafe. We just need the 2nd failsafe in TO to prevent the invalid snapshots in the first place.

I think the point is that you can have a more robust and scalable system by moving away from the all-or-nothing safety that's so pervasive in TR. If a goal someday ever is to get away from having a global CDN mutex (chicken), lots of people without communication/coordination with eachother will be needing to do their own micro-snapshots.

@rawlinp
Copy link
Contributor

rawlinp commented May 27, 2021

TR's all or nothing safety has nothing to do with the ability for people to do their own "micro/DS snapshots" whenever that functionality gets implemented in the future. TO needs to have the validation in place to prevent invalid configs from even being possible in the first place. I don't really see reducing the number of steps involved as a necessary part of that solution, especially when it is hiding a necessary step in getting an actually secure certificate. We wouldn't want users to create a DS with HTTPS enabled or update their DS to enable HTTPS and think their DS is actually secure with nothing left to do. Providing a real cert is a necessary step in the process, so I don't think it matters that we continue to have a 4-step process with this solution.

@ocket8888
Copy link
Contributor

The 4-step process is symptomatic of TP's tendency to say "1 button = 1 API request", anyway. When creating a DS it could easily ask for certificates (or if you want them generated) at the same time, and do the 4 steps for you behind the scenes so that user encounters only one actual step.

@jhg03a
Copy link
Contributor

jhg03a commented Aug 24, 2021

Relates to #2429

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something isn't working as intended medium impact impacts a significant portion of a CDN, or has the potential to do so Traffic Ops related to Traffic Ops Traffic Router related to Traffic Router
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants