Resiliency POC

Let's use the following definitions

  • Instance: Single HTTP server.
  • Service: Set of all instances for a given application.

We want to prove that there is a single Hystrix circuit breaker per service, as opposed to one circuit breaker per instance. This means that a single failing instance can open the circuit for the whole service, even when other healthy instances are available.

This matters when migrating services from one platform to another: you may add unstable instances to a service, and you don't want to open the circuit for the whole service just because there is an unhealthy instance on the new platform while plenty of healthy instances remain on the old one.

Scenario

There are two services, ping (1 instance) and pong (3 instances), that register themselves in Eureka. The ping service outputs the result of making an HTTP request to the pong service, using round robin so that a different pong instance is used on every incoming request.
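For context, ping presumably talks to pong through a Hystrix-wrapped Feign client. Below is a minimal sketch of what that could look like; the class names, annotations and messages are assumptions, not the repository's actual code. The important detail is that the Hystrix circuit breaker is keyed by the client method (the command key PongClient#hello() used later in this walkthrough), not by the instance that happens to serve the request, so all three pong instances share one circuit.

// Hypothetical sketch of ping's client for the pong service (Spring Cloud Netflix era).
// On newer Spring Cloud versions the FeignClient annotation lives in
// org.springframework.cloud.openfeign instead.
import org.springframework.cloud.netflix.feign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@FeignClient(name = "pong", fallback = PongFallback.class)
public interface PongClient {

    // Hystrix names its command "PongClient#hello()" after this method,
    // so one circuit guards every call to the pong service.
    @RequestMapping(method = RequestMethod.GET, value = "/")
    String hello();
}

@Component
class PongFallback implements PongClient {

    @Override
    public String hello() {
        // Returned whenever the call to pong fails or the circuit is open.
        return "pong is not available";
    }
}

@RestController
class PingController {

    private final PongClient pongClient;

    PingController(PongClient pongClient) {
        this.pongClient = pongClient;
    }

    @GetMapping("/")
    public String ping() {
        // Ribbon round-robins across the registered pong instances;
        // Hystrix wraps the call and falls back on failure.
        return pongClient.hello();
    }
}

For this to work the ping application would also need @EnableFeignClients (and, depending on the Spring Cloud version, feign.hystrix.enabled=true).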

Let's make pong fail until the circuit is opened.

Running the example

Executing the following commands will start minikube, deploy Eureka, build the Docker images for ping and pong, and deploy them to minikube:

$ minikube start
$ eval $(minikube docker-env)
$ minikube addons enable ingress # Use ingress to expose the ping and pong services for testing and debugging
$ echo "$(minikube ip) eureka ping pong" | sudo tee -a /etc/hosts # Update /etc/hosts to add minikube IP pointing to our services 
$ helm upgrade --install "eureka" eureka/charts/eureka # We need to start Eureka, since both ping and pong will use it
$ helm upgrade --install "prometheus" stable/prometheus -f prometheus-values.yaml
$ helm upgrade --install "grafana" stable/grafana -f grafana-values.yaml
$ ./gradlew clean build buildDockerImage deploy
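If you want to verify that both applications registered before testing, Eureka's REST API lists the registered instances (assuming the ingress exposes Eureka under the eureka hostname added to /etc/hosts above):

$ curl -i "http://eureka/eureka/apps" # XML listing of registered applications; PING and PONG should both appear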

Sending requests to the ping service will return successful responses coming from the different pong instances. You can see a different pod name in every response:

$ while true; do curl -i "http://ping/"; echo; sleep 0.5; done

Let's make one of the pong instances start replying with errors:

$ curl -i -XPOST "http://pong/"
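This works because pong presumably keeps a failure flag that the POST flips. A minimal sketch of such a controller, using hypothetical names and behaviour rather than the repository's actual code:

import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PongController {

    private final AtomicBoolean failing = new AtomicBoolean(false);

    @GetMapping("/")
    public ResponseEntity<String> hello() {
        if (failing.get()) {
            // Simulate an unhealthy instance: every request now fails.
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body("pong is failing");
        }
        // HOSTNAME is the pod name on Kubernetes, which is why each healthy
        // response shows a different pod.
        return ResponseEntity.ok("pong from " + System.getenv("HOSTNAME"));
    }

    @PostMapping("/")
    public String startFailing() {
        // This is what the curl -XPOST above triggers (in this sketch, for the
        // single instance that happens to receive the request).
        failing.set(true);
        return "this pong instance will now reply with errors";
    }
}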

Now you should see that one out of every three requests returns the fallback message, meaning that one pong instance is failing: 1/3 = 33% of the requests to the pong service fail. But the circuit is not open: the fallback is returned only for the individual requests that hit the failing instance, while the other requests still succeed.
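The reason a 33% error rate does not trip the circuit is that, unless overridden, Hystrix only opens it at 50% errors, and only once enough requests have been seen in the rolling window. Expressed with this command's key, the library defaults amount to roughly:

# Open the circuit once the error percentage reaches 50%...
hystrix.command."PongClient#hello()".circuitBreaker.errorThresholdPercentage=50
# ...but only if at least 20 requests were seen in the rolling window
hystrix.command."PongClient#hello()".circuitBreaker.requestVolumeThreshold=20
# Rolling window used to compute the error percentage (10 seconds)
hystrix.command."PongClient#hello()".metrics.rollingStats.timeInMilliseconds=10000
# How long the circuit stays open before a single test request is allowed through
hystrix.command."PongClient#hello()".circuitBreaker.sleepWindowInMilliseconds=5000

Note that with the 0.5 second delay in the curl loop above, roughly 20 requests fall into each 10-second window, so the default volume threshold is only just met.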

Now let's decrease the error threshold percentage for requests to the pong service from the configured 50% down to 20%:

$ curl -i -XPOST "http://ping/env" -d hystrix.command."PongClient#hello()".circuitBreaker.errorThresholdPercentage=20

This opens the circuit, since our 33% error rate is higher than the 20% threshold. Now all responses go straight to the fallback, even though we still have healthy pong instances:

$ curl -i "http://ping/"

Let's configure retries so that when a request from ping to pong fails, ping retries it against a different pong instance. That way, even though one instance is unhealthy, we never get an error and the circuit never opens.

$ curl -i -XPOST "http://ping/env" -d ribbon.retryableStatusCodes=500

This tells the HTTP client to retry requests that received a 500 status code (the property accepts a comma-separated list of codes). If we keep sending requests, we no longer get any error responses.
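retryableStatusCodes on its own only decides which responses count as retryable; how many times and against how many other servers Ribbon retries is governed by its companion properties. A sketch of a fuller retry configuration (standard Spring Cloud Netflix/Ribbon property names with illustrative values; depending on the Spring Cloud version, Spring Retry also needs to be on the classpath for load-balanced retries to kick in):

# Which HTTP status codes are considered retryable
ribbon.retryableStatusCodes=500
# Do not retry on the same server...
ribbon.MaxAutoRetries=0
# ...but try up to two different servers, enough to skip the one failing pong instance
ribbon.MaxAutoRetriesNextServer=2
# Only retry safe (GET) requests unless you know all operations are idempotent
ribbon.OkToRetryOnAllOperations=false

These can also be scoped to a single service by prefixing them with its name, e.g. pong.ribbon.MaxAutoRetriesNextServer=2.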

Conclusion

  • Hystrix wraps the HTTP client. The client may or may not be a load-balanced client; either way, Hystrix is not aware of the different instances behind a given service.
  • When requests start to fail, if the error percentage is higher than the defined error threshold percentage for a given service, Hystrix will open the circuit.
  • A potential workaround to minimize this is to add the percentage of unstable instances to the current error threshold percentage. Be careful: this will make your service tolerate more errors.
  • A better approach is to use retries, so requests are retried until they reach a healthy instance.
  • A good health check is needed for every service. If an instance is failing, its health check should reflect that, so the instance gets de-registered from Eureka.
  • When using Kubernetes, wire the health check into the readiness probe so the pod stops receiving traffic while the application is unhealthy (see the probe sketch below).
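For instance, assuming the applications expose Spring Boot Actuator's health endpoint (/health on Boot 1.x) on port 8080, the probe could be declared like this in the deployment manifest (a sketch, not the repository's actual chart):

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3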
