Resiliency POC

Let's use the following definitions

  • Instance: Single HTTP server.
  • Service: Set of all instances for a given application.

We want to prove that there is a single Hystrix circuit breaker per service, as opposed to one circuit breaker per instance. This means that a single failing instance can open the circuit for the whole service, even when other healthy instances are available.

This matters when migrating services from one platform to another: you may add unstable instances to a service, and you don't want to open the circuit for the whole service just because there is an unhealthy instance on the new platform while plenty of healthy instances remain on the old one.

Scenario

There are two services, ping (1 instance) and pong (3 instances), that register themselves in Eureka. The ping service outputs the result of making an HTTP request to the pong service, using round robin so that a different pong instance is used on every incoming request.
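For context, ping presumably talks to pong through a Hystrix-wrapped Feign client. Below is a minimal sketch of what that could look like; the class names, annotations and messages are assumptions, not the repository's actual code. The important detail is that the Hystrix circuit breaker is keyed by the client method (the command key PongClient#hello() used later in this walkthrough), not by the instance that happens to serve the request, so all three pong instances share one circuit.

// Hypothetical sketch of ping's client for the pong service (Spring Cloud Netflix era).
// On newer Spring Cloud versions the FeignClient annotation lives in
// org.springframework.cloud.openfeign instead.
import org.springframework.cloud.netflix.feign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@FeignClient(name = "pong", fallback = PongFallback.class)
public interface PongClient {

    // Hystrix names its command "PongClient#hello()" after this method,
    // so one circuit guards every call to the pong service.
    @RequestMapping(method = RequestMethod.GET, value = "/")
    String hello();
}

@Component
class PongFallback implements PongClient {

    @Override
    public String hello() {
        // Returned whenever the call to pong fails or the circuit is open.
        return "pong is not available";
    }
}

@RestController
class PingController {

    private final PongClient pongClient;

    PingController(PongClient pongClient) {
        this.pongClient = pongClient;
    }

    @GetMapping("/")
    public String ping() {
        // Ribbon round-robins across the registered pong instances;
        // Hystrix wraps the call and falls back on failure.
        return pongClient.hello();
    }
}

For this to work the ping application would also need @EnableFeignClients (and, depending on the Spring Cloud version, feign.hystrix.enabled=true).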

Let's make pong fail until the circuit is opened.

Running the example

Executing the following commands will start minikube, deploy Eureka, build the Docker images for ping and pong, and deploy them to minikube:

$ minikube start
$ eval $(minikube docker-env)
$ minikube addons enable ingress # Use ingress to expose the ping and pong services for testing and debugging
$ echo "$(minikube ip) eureka ping pong" | sudo tee -a /etc/hosts # Update /etc/hosts to add minikube IP pointing to our services 
$ helm upgrade --install "eureka" eureka/charts/eureka # We need to start Eureka, since both ping and pong will use it
$ helm upgrade --install "prometheus" stable/prometheus -f prometheus-values.yaml
$ helm upgrade --install "grafana" stable/grafana -f grafana-values.yaml
$ ./gradlew clean build buildDockerImage deploy
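If you want to verify that both applications registered before testing, Eureka's REST API lists the registered instances (assuming the ingress exposes Eureka under the eureka hostname added to /etc/hosts above):

$ curl -i "http://eureka/eureka/apps" # XML listing of registered applications; PING and PONG should both appear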

Sending requests to the ping service will return successful responses coming from the different pong instances. You can see a different pod name in every response:

$ while true; do curl -i "http://ping/"; echo; sleep 0.5; done

Let's make one of the pong instances start replying with errors:

$ curl -i -XPOST "http://pong/"
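This works because pong presumably keeps a failure flag that the POST flips. A minimal sketch of such a controller, using hypothetical names and behaviour rather than the repository's actual code:

import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PongController {

    private final AtomicBoolean failing = new AtomicBoolean(false);

    @GetMapping("/")
    public ResponseEntity<String> hello() {
        if (failing.get()) {
            // Simulate an unhealthy instance: every request now fails.
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body("pong is failing");
        }
        // HOSTNAME is the pod name on Kubernetes, which is why each healthy
        // response shows a different pod.
        return ResponseEntity.ok("pong from " + System.getenv("HOSTNAME"));
    }

    @PostMapping("/")
    public String startFailing() {
        // This is what the curl -XPOST above triggers (in this sketch, for the
        // single instance that happens to receive the request).
        failing.set(true);
        return "this pong instance will now reply with errors";
    }
}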

Now you should see that one out of every three requests returns the fallback message, meaning that one pong instance is failing: 1/3 = 33% of the requests to the pong service fail. But the circuit is not open: the fallback is returned only for the individual requests that hit the failing instance, while the other requests still succeed.
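The reason a 33% error rate does not trip the circuit is that, unless overridden, Hystrix only opens it at 50% errors, and only once enough requests have been seen in the rolling window. Expressed with this command's key, the library defaults amount to roughly:

# Open the circuit once the error percentage reaches 50%...
hystrix.command."PongClient#hello()".circuitBreaker.errorThresholdPercentage=50
# ...but only if at least 20 requests were seen in the rolling window
hystrix.command."PongClient#hello()".circuitBreaker.requestVolumeThreshold=20
# Rolling window used to compute the error percentage (10 seconds)
hystrix.command."PongClient#hello()".metrics.rollingStats.timeInMilliseconds=10000
# How long the circuit stays open before a single test request is allowed through
hystrix.command."PongClient#hello()".circuitBreaker.sleepWindowInMilliseconds=5000

Note that with the 0.5 second delay in the curl loop above, roughly 20 requests fall into each 10-second window, so the default volume threshold is only just met.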

Now let's decrease the error threshold percentage for requests to the pong service from the configured 50% down to 20%:

$ curl -i -XPOST "http://ping/env" -d hystrix.command."PongClient#hello()".circuitBreaker.errorThresholdPercentage=20

This opens the circuit, since our 33% error rate is higher than the 20% threshold. Now all responses go straight to the fallback, even though we still have healthy pong instances:

$ curl -i "http://ping/"

Let's configure retries so that when a request from ping to pong fails, ping retries it against a different pong instance. That way, even though one instance is unhealthy, we never get an error and the circuit never opens.

$ curl -i -XPOST "http://ping/env" -d ribbon.retryableStatusCodes=500

This tells the HTTP client to retry requests that received a 500 status code (the property accepts a comma-separated list of codes). If we keep sending requests, we no longer get any error responses.
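retryableStatusCodes on its own only decides which responses count as retryable; how many times and against how many other servers Ribbon retries is governed by its companion properties. A sketch of a fuller retry configuration (standard Spring Cloud Netflix/Ribbon property names with illustrative values; depending on the Spring Cloud version, Spring Retry also needs to be on the classpath for load-balanced retries to kick in):

# Which HTTP status codes are considered retryable
ribbon.retryableStatusCodes=500
# Do not retry on the same server...
ribbon.MaxAutoRetries=0
# ...but try up to two different servers, enough to skip the one failing pong instance
ribbon.MaxAutoRetriesNextServer=2
# Only retry safe (GET) requests unless you know all operations are idempotent
ribbon.OkToRetryOnAllOperations=false

These can also be scoped to a single service by prefixing them with its name, e.g. pong.ribbon.MaxAutoRetriesNextServer=2.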

Conclusion

  • Hystrix wraps the HTTP client. The client may or may not be a load-balanced client; either way, Hystrix is not aware of the different instances behind a given service.
  • When requests start to fail, if the error percentage is higher than the defined error threshold percentage for a given service, Hystrix will open the circuit.
  • A potential workaround to minimize this is to add the percentage of unstable instances to the current error threshold percentage. Be careful: this will make your service tolerate more errors.
  • A better approach is to use retries, so requests are retried until they reach a healthy instance.
  • A good health check is needed for every service. If an instance is failing, its health check should reflect that, so the instance gets de-registered from Eureka.
  • When using Kubernetes, wire the health check into the readiness probe so the pod stops receiving traffic while the application is unhealthy (see the probe sketch below).
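For instance, assuming the applications expose Spring Boot Actuator's health endpoint (/health on Boot 1.x) on port 8080, the probe could be declared like this in the deployment manifest (a sketch, not the repository's actual chart):

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3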
