Snapshotting a CDN that has an HTTPS delivery service w/ no cert causes TR crconfig reload failure #5893

mitchell852 · 2021-05-26T22:56:28Z

I'm submitting a ...

bug report

Traffic Control components affected ...

Traffic Ops
Traffic Router

Current behavior:

TR will not load a snapshot that has an HTTPs delivery service with a missing cert, therefore, TR will be stuck with an old snapshot until the problem is resolved and a cert is created for the HTTPS ds or the HTTPS ds is switched to HTTP.

Expected behavior:

Prevent the creation of an invalid snapshot or TR should handle invalid snapshots more gracefully.

Minimal reproduction of the problem with instructions:

Create a delivery service with protocol=HTTPS. Do NOT add a cert to the delivery service.
snapshot the cdn for the delivery service
TR will not reload the new snapshot due to the missing cert

Anything else:

Possible solutions:

when snapshotting a cdn via the api, return 400 if any of the cdn's https delivery services are missing a cert.
auto-generate a self-signed cert when creating an https ds or updating an http ds to an https ds.
TR could somehow filter out the bad ds's and consume the rest of the snapshot?

jhg03a · 2021-05-26T22:59:29Z

There's a chicken/egg problem that you can't create the SSL cert until the DS is created I believe.

jhg03a · 2021-05-26T23:00:49Z

The 3rd option is also kindof problematic because the CRConfig and SSL certificate retrieval aren't tied to one another. Certs are just autodeployed without a safety once they show up even if it's broken.

mitchell852 · 2021-05-26T23:01:03Z

There's a chicken/egg problem that you can't create the SSL cert until the DS is created I believe.

true. but the api handler could do it just fine. on create, if protocol=https, call generate handler.

jhg03a · 2021-05-26T23:01:58Z

There's a chicken/egg problem that you can't create the SSL cert until the DS is created I believe.

true. but the api handler could do it just fine. on create, if protocol=https, call generate handler.

That's basically advocating for option 2.

rob05c · 2021-05-26T23:04:06Z

I vote for both [1] or [2] (I don't have a strong opinion which), and [3]. Robust systems require multiple failsafes. I know we have limited resources, but config loading (on both TR and Caches) is a particularly dangerous part of our system.

We should have validation in TO and/or TP, and TR and ATS should also detect and refuse to apply changes to a single DS that they recognize is invalid, but still apply changes for every other DS.

This is especially necessary for Self-Service, to catch tenants who mash keys until something broken slips thru one validation. To make the CDN keep working and applying changes for every other tenant.

On the UI Usability side, we should also fix the chicken-and-egg problem, so it's possible to fully create the DS at once, and not necessary to create temporary wrong config in order to get to the final config.

rawlinp · 2021-05-27T00:41:40Z

I'm -1 on auto-generating a self-signed (unusable) certificate just to bypass this TR safety. I think we should just prevent enabling HTTPS on a delivery service unless it has a certificate. That means if we want HTTPS we still have to:

create the DS
generate or provide a cert
enable https
which is exactly what we should already be doing today, except with API validation in place.

I don't think we should make TR skip over changes to DSes that it recognizes are invalid but still apply changes for every other DS. As a general principle, if TR gets an invalid CRConfig, I think it should continue to operate with the last known good CRConfig. We generally catch these issues very quickly anyways, and they are pretty much always harmless. If we allow TR to run with partially-valid CRConfigs, we will not be able to look at the CRConfig TR has on-disk and know exactly what a given DS's running config is. The CRConfig should be all or nothing, and that is generally how TR is designed under the hood. Changing that design would be costly and provide very little value IMO.

I understand your point about multiple failsafes, but TR refusing to load an invalid snapshot is a failsafe. We just need the 2nd failsafe in TO to prevent the invalid snapshots in the first place.

jhg03a · 2021-05-27T17:53:31Z

As an operator, I disagree. Fewer steps and ways to mess things up that don't involve being forced into longer explicit workflows are better. We're aiming for the same endgoal, but which is better:

1. Make the DS
2. Update the cert to be valid, if you have one and not necessary if you're using LE/ACME

versus

1. Try to make the DS and be told it's wrong or get confused as to why I can't enable HTTPS in the first place
2. Make the DS partially
3. Add/update the cert to be valid, which might be self-signed anyway if you don't have a legitimate cert yet
4. Update the DS again to be complete.

There's not a downside in defaulting to a self-signed value and is how most beginner webserver instructions start with. The worst case scenario is that you forget to add the valid cert later which is really just a tradeoff in https connectivity being refused all together versus just being able to allow the insecure content.

jhg03a · 2021-05-27T18:03:29Z

I understand your point about multiple failsafes, but TR refusing to load an invalid snapshot is a failsafe. We just need the 2nd failsafe in TO to prevent the invalid snapshots in the first place.

I think the point is that you can have a more robust and scalable system by moving away from the all-or-nothing safety that's so pervasive in TR. If a goal someday ever is to get away from having a global CDN mutex (chicken), lots of people without communication/coordination with eachother will be needing to do their own micro-snapshots.

rawlinp · 2021-05-27T19:09:42Z

TR's all or nothing safety has nothing to do with the ability for people to do their own "micro/DS snapshots" whenever that functionality gets implemented in the future. TO needs to have the validation in place to prevent invalid configs from even being possible in the first place. I don't really see reducing the number of steps involved as a necessary part of that solution, especially when it is hiding a necessary step in getting an actually secure certificate. We wouldn't want users to create a DS with HTTPS enabled or update their DS to enable HTTPS and think their DS is actually secure with nothing left to do. Providing a real cert is a necessary step in the process, so I don't think it matters that we continue to have a 4-step process with this solution.

ocket8888 · 2021-05-27T19:21:29Z

The 4-step process is symptomatic of TP's tendency to say "1 button = 1 API request", anyway. When creating a DS it could easily ask for certificates (or if you want them generated) at the same time, and do the 4 steps for you behind the scenes so that user encounters only one actual step.

jhg03a · 2021-08-24T15:40:47Z

Relates to #2429

mitchell852 added bug something isn't working as intended Traffic Ops related to Traffic Ops Traffic Router related to Traffic Router medium impact impacts a significant portion of a CDN, or has the potential to do so labels May 26, 2021

mattjackson220 mentioned this issue Jul 14, 2021

Generate a self signed cert on ds creation/update #6024

Merged

6 tasks

mitchell852 assigned mattjackson220 Sep 15, 2021

mitchell852 added this to the 6.0.0 milestone Sep 15, 2021

rawlinp closed this as completed in #6024 Sep 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snapshotting a CDN that has an HTTPS delivery service w/ no cert causes TR crconfig reload failure #5893

Snapshotting a CDN that has an HTTPS delivery service w/ no cert causes TR crconfig reload failure #5893

mitchell852 commented May 26, 2021 •

edited

Loading

jhg03a commented May 26, 2021

jhg03a commented May 26, 2021

mitchell852 commented May 26, 2021

jhg03a commented May 26, 2021

rob05c commented May 26, 2021 •

edited

Loading

rawlinp commented May 27, 2021

jhg03a commented May 27, 2021 •

edited

Loading

jhg03a commented May 27, 2021

rawlinp commented May 27, 2021

ocket8888 commented May 27, 2021

jhg03a commented Aug 24, 2021

Snapshotting a CDN that has an HTTPS delivery service w/ no cert causes TR crconfig reload failure #5893

Snapshotting a CDN that has an HTTPS delivery service w/ no cert causes TR crconfig reload failure #5893

Comments

mitchell852 commented May 26, 2021 • edited Loading

I'm submitting a ...

Traffic Control components affected ...

Current behavior:

Expected behavior:

Minimal reproduction of the problem with instructions:

Anything else:

jhg03a commented May 26, 2021

jhg03a commented May 26, 2021

mitchell852 commented May 26, 2021

jhg03a commented May 26, 2021

rob05c commented May 26, 2021 • edited Loading

rawlinp commented May 27, 2021

jhg03a commented May 27, 2021 • edited Loading

jhg03a commented May 27, 2021

rawlinp commented May 27, 2021

ocket8888 commented May 27, 2021

jhg03a commented Aug 24, 2021

mitchell852 commented May 26, 2021 •

edited

Loading

rob05c commented May 26, 2021 •

edited

Loading

jhg03a commented May 27, 2021 •

edited

Loading