-
Notifications
You must be signed in to change notification settings - Fork 34
What would break if we moved to Deployments? #207
Comments
cloudfoundry/eirini-release#207 Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
#207 Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
One of the things we did not think about was metrics. When switching to deployments, the instance ID in the metric ends up as a string with letters. This is not a problem for metrics themselves, however, CC assumes that this is a number. Failing to use the value as a number raises an exception (or whatever it is called in Ruby) which is rescued in a very generic manner Thus the CF cli user gets the misleading error that the stats server is not available. |
This is a very naive implementation of this maaping: we get all the pods, sort their names alphabetically and map the sorted names to ordered indexes. This approach gives us stable mapping but does not guarantee that during scaling up instance X would be older than instance X-1 [cloudfoundry/eirini-release#207] Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
[#207] Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
When using deployments in Eirini for workloads, the metrics proxy would send the pod name suffix (which is not a number) as instance id. In order to convert it to a number, we naively unique and sort those instance ids and map them to their order. [cloudfoundry/eirini-release#207] Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
With the spike branches[1] we managed to:
The only issue we faced was to provide a mapping between pod names and instance In cloud controller instance indexes are expected to be numbers or strings with
The approach above reliably works when restarting an instance, i.e. if you A stable and reliable way to map guids to numbers would be treating guids as [1] Spike branches: |
@danail-branekov @mnitchev do those indexes have to be unique? Or can we get away with |
If we use zeroes, then certain things in the CF cli UI would not work:
|
I see, it looks like an objectively inferior solution. Good catch on the mapping! As weird as it might look it should do the job pretty well! |
We could even inject that number in the |
We should definitely try this. Note: we probably want the range to start from 1 so that no instance ever gets index |
[cloudfoundry/eirini-release#207] Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
The uid is alphanumeric, which means it can be interpreted as a base-36 number and converted to an integer [cloudfoundry/eirini-release#207] Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
[#207] Co-authored-by: Mario Nitchev <marionitchev@gmail.com>
to adopt always getting popd guid as last 5 characters [#207] Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Today we wanted to give the idea of treating pod 5-character guids as base36 numbers and convert them to decimals whereever needed.
All the above has been pushed to related spike branches. Here is how
There are few CATs that are failing because they expect that instance IDs are 0, 1, 2, 3....:
|
In conclusion, we believe that there are two ways to to address instance ID being a non-numeric character
|
I think option 1 would be a great first step. The necessary work would be:
We'll check with the Cloud Controller and metrics-proxy teams and then create issues for this. |
We might want to reopen this to:
|
Other interesting idea: return (sparse) integer indexes for backwards compatibility but adding a new field for alphanumeric IDs so that components can gradually migrate to it. |
Another place to check is log-cache, according to this comment. |
Co-authored-by: Anonymous Eirininaut <eirini@cloudfoundry.org> Issue: cloudfoundry/eirini-release#182 Issue: cloudfoundry/eirini-release#184 Issue: cloudfoundry/eirini-release#185 Issue: cloudfoundry/eirini-release#187 Issue: cloudfoundry/eirini-release#186 Issue: cloudfoundry/eirini-release#183 Issue: cloudfoundry/eirini-release#192 Issue: cloudfoundry/eirini-release#193 Issue: cloudfoundry/eirini-release#194 Issue: cloudfoundry/eirini-release#191 Issue: cloudfoundry/eirini-release#196 Issue: cloudfoundry/eirini-release#199 Issue: cloudfoundry/eirini-release#200 Issue: cloudfoundry/eirini-release#201 Issue: cloudfoundry/eirini-release#202 Issue: cloudfoundry/eirini-release#204 Issue: cloudfoundry/eirini-release#205 Issue: cloudfoundry/eirini-release#195 Issue: cloudfoundry/eirini-release#198 Issue: cloudfoundry/eirini-release#211 Issue: cloudfoundry/eirini-release#212 Issue: cloudfoundry/eirini-release#207 Issue: cloudfoundry/eirini-release#190 Issue: cloudfoundry/eirini-release#209
In order to asses the consequences of supporting alphanumeric instance ids we decided to take an experimental approach where we make the desired change in the cheapest possible way and run CATS to see what breaks. We have decided to limit the scope of the experiment to the cf-for-k8s world, but still making sure that the cf for VMs still works in the old way (numeric instance ids)
In order to deploy the prototype you need to chek out the complete list of spike branches and run the following command from the root of the
As part of the experiment we did not have to change log cache and did not observe any cats failures. Maybe we should ask the comment author for more context.
A part of the experiment we just removed them and did not have significant issues. Sometimes an instance would take longer to disappear from the In order to validate that diego still works with the changes in CC we have deployed a cf fr VMs and are running CATS against it. [1]
|
Good news, I managed to successfully run CATS on a diego deployment that contains cloud controller changes discussed above. The setup (currently deployed as
Thus we proved that:
Having proved that backwards compatibility is ensured, we believe that
Things that would be broken
|
Co-authored-by: Mario Nitchev <marionitchev@gmail.com> Issue: cloudfoundry/eirini-release#207
@danail-branekov can we prepare a list of CF components that would need to change and which changes would be needed? |
@gcapizzi I think we can derive the answer from the spike branches listed above We need to change Eirini, Cloud Controller, Metric Proxy and the CF CLI. The spike branches outline the changes needed, of course they should be implemented in a proper way when doing that for realz. Did I answer your question? |
@danail-branekov I see we only use a separate field ( What about something like this:
|
Ok, I've done a little bit of investigation and found out that Kubernetes uses this code to calculate the random UIDs: // String generates a random alphanumeric string, without vowels, which is n
// characters long. This will panic if n is less than zero.
// How the random string is created:
// - we generate random int63's
// - from each int63, we are extracting multiple random letters by bit-shifting and masking
// - if some index is out of range of alphanums we neglect it (unlikely to happen multiple times in a row)
func String(n int) string {
b := make([]byte, n)
rng.Lock()
defer rng.Unlock()
randomInt63 := rng.rand.Int63()
remaining := maxAlphanumsPerInt
for i := 0; i < n; {
if remaining == 0 {
randomInt63, remaining = rng.rand.Int63(), maxAlphanumsPerInt
}
if idx := int(randomInt63 & alphanumsIdxMask); idx < len(alphanums) {
b[i] = alphanums[idx]
i++
}
randomInt63 >>= alphanumsIdxBits
remaining--
}
return string(b)
} The I've built this little test program: func main() {
nStrings := 0
nInts := 0
for {
str := String(5)
nStrings += 1
_, err := strconv.ParseUint(str, 10, 64)
if err == nil {
nInts += 1
}
if nStrings%1000000 == 0 {
fmt.Printf("strings: %d, ints: %d, ratio: %f%%\n", nStrings, nInts, float64(nInts)/float64(nStrings)*100)
}
}
} and the output looks a bit like this:
So yes, there's a 0.12% chance that a UID is a valid integer 😕 Edit: this could have been calculated on paper! |
There is lot of interest in having Eirini use
Deployment
s for its workloads instead ofStatefulSet
s (see cloudfoundry/eirini#94 and cloudfoundry/cf-for-k8s#600). We suspect a bunch of things would break if we did it, but we don't really know which. Can we figure this out? One way could be to do the change as cheaply as possible, and then run CATs.We expect anything that relies on instance indexes to break, like:
INSTANCE_INDEX
environment variablebut we don't know much about wether the rest of CF would keep working.
The goal of this is to understand if this could be an optional feature users can enable if they're willing to give up on a subset of CF features, hoping that subset is as small as possible.
The text was updated successfully, but these errors were encountered: