NetflixOSS FAQ

Chris Fregly edited this page Jan 21, 2014 · 26 revisions

Archaius

How are properties scoped in Archaius? There is a concept in Archaius called DeploymentContext. You can implement your own context which defines custom scopes - chaining them together by using @next at the end of each properties file. By default, however, the following scopes are supported:

  • archaius.deployment.environment
  • archaius.deployment.region
  • archaius.deployment.datacenter
  • archaius.deployment.applicationId
  • archaius.deployment.serverId
  • archaius.deployment.stack

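As a sketch of how the chaining works (the file names, property names, and values here are hypothetical; `@next` and `archaius.deployment.environment` come from the answer above), a base file points at the next file in the cascade, and the more specific scope overrides the base value:

```properties
# myapp.properties (base scope)
db.pool.size=10
@next=myapp-${archaius.deployment.environment}.properties

# myapp-prod.properties (loaded when archaius.deployment.environment=prod)
db.pool.size=50
```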
How do I prevent conflicting property names across all of my dependent jars, etc? This is a difficult problem. The best way to avoid it is to namespace your property names by logical dependency, e.g. usermgmt.propertyA, streaming.propertyA, personalization.propertyA, etc.
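A minimal sketch of reading such a namespaced property through Archaius (the property name usermgmt.propertyA comes from the answer above; the class name and default value are illustrative):

```java
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

// Hedged sketch: reading a namespaced Archaius property.
// "usermgmt.propertyA" is from the example above; the default is illustrative.
public class UserMgmtConfig {
    private static final DynamicStringProperty PROPERTY_A =
        DynamicPropertyFactory.getInstance()
            .getStringProperty("usermgmt.propertyA", "defaultValue");

    public static String propertyA() {
        // get() picks up runtime property changes without a restart
        return PROPERTY_A.get();
    }
}
```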

Servo Metrics

What is the best way to collect JMX, JVM, and application level metrics? Use a CompositeMetricPoller with JmxMetricPoller, JvmMetricPoller, and MonitorRegistryMetricPoller (application-level). Use JvmMetricPoller for well-known JVM stats such as memory, GC, and thread counts. Use JmxMetricPoller for other JMX-published metrics such as Tomcat stats, etc. The MonitorRegistryMetricPoller is for custom application-level stats related to your business metrics. Note: You'll want to apply filters to all Pollers so as to prevent metric overload.
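A minimal sketch of wiring two of these pollers to a file-based observer (the output directory, 60-second interval, and class name are illustrative choices, not Netflix recommendations):

```java
import java.io.File;
import java.util.concurrent.TimeUnit;
import com.netflix.servo.publish.BasicMetricFilter;
import com.netflix.servo.publish.FileMetricObserver;
import com.netflix.servo.publish.JvmMetricPoller;
import com.netflix.servo.publish.MetricObserver;
import com.netflix.servo.publish.MonitorRegistryMetricPoller;
import com.netflix.servo.publish.PollRunnable;
import com.netflix.servo.publish.PollScheduler;

// Hedged sketch: schedules two Servo pollers against one observer.
public class MetricsBootstrap {
    public static void start(File metricsDir) {
        PollScheduler scheduler = PollScheduler.getInstance();
        scheduler.start();

        MetricObserver observer = new FileMetricObserver("app-metrics", metricsDir);

        // BasicMetricFilter.MATCH_ALL accepts every metric; swap in a narrower
        // MetricFilter in production to avoid the metric overload mentioned above.
        scheduler.addPoller(
            new PollRunnable(new MonitorRegistryMetricPoller(),
                             BasicMetricFilter.MATCH_ALL, observer),
            60, TimeUnit.SECONDS);
        scheduler.addPoller(
            new PollRunnable(new JvmMetricPoller(),
                             BasicMetricFilter.MATCH_ALL, observer),
            60, TimeUnit.SECONDS);
    }
}
```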

Should I use CloudWatch, a self-hosted tool such as Graphite, or a hosted service such as HostedGraphite? CloudWatch is a slick tool - and great for setting up autoscale triggers. That being said, my most-expensive AWS bill to date has been CloudWatch. The Netflix OSS components generate a LOT of metrics by default. As mentioned in the previous bullet point, you definitely want to apply filters and start with the bare-minimum of metrics initially. Another best practice - and one that is used by Netflix itself - is to use a self-hosted tool for everything - and push only the metrics you need for autoscaling to CloudWatch. Another option to consider is HostedGraphite.

How do I implement a timer metric? StatsTimer is the class from the Netflix Servo package. Here is an example from Flux Capacitor. Note: If you use Hystrix, you will automatically get timers for all services that you wrap with the Hystrix framework. See the Hystrix section below for more details on how to export Hystrix metrics through Servo. You definitely want to get this wired up.
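A minimal sketch of a StatsTimer wrapping a unit of work (the metric name, percentile choices, and wrapper class are illustrative; StatsTimer, MonitorConfig, StatsConfig, and Stopwatch are the real Servo classes):

```java
import java.util.concurrent.TimeUnit;
import com.netflix.servo.monitor.MonitorConfig;
import com.netflix.servo.monitor.StatsTimer;
import com.netflix.servo.monitor.Stopwatch;
import com.netflix.servo.stats.StatsConfig;

// Hedged sketch: times a block of work and records it into a StatsTimer.
public class RequestTimer {
    private static final StatsTimer TIMER = new StatsTimer(
        MonitorConfig.builder("svc.requestLatency").build(),  // metric name is illustrative
        new StatsConfig.Builder()
            .withPublishMean(true)
            .withPublishMax(true)
            .withPercentiles(new double[] {95.0, 99.0})
            .build());

    public static long timedCall(Runnable work) {
        Stopwatch sw = TIMER.start();   // starts the clock
        try {
            work.run();
        } finally {
            sw.stop();                  // records elapsed time into the timer
        }
        return sw.getDuration(TimeUnit.MILLISECONDS);
    }
}
```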

Hystrix Circuit Breaker

How do I get my real-time Hystrix metrics into my historical metrics store for later analysis? You can do this by linking Hystrix and Servo using the Hystrix Servo Metrics Publisher. Here is an example from Flux Capacitor. You'll need to import the hystrix-servo-metrics-publisher artifact from the Hystrix module.
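The wiring itself is a one-time registration at application startup, before any Hystrix command executes. A minimal sketch (the bootstrap class name is illustrative; the publisher class comes from the hystrix-servo-metrics-publisher artifact mentioned above):

```java
import com.netflix.hystrix.contrib.servopublisher.HystrixServoMetricsPublisher;
import com.netflix.hystrix.strategy.HystrixPlugins;

// Hedged sketch: call init() once at startup, before the first HystrixCommand runs.
public class HystrixMetricsBootstrap {
    public static void init() {
        HystrixPlugins.getInstance()
            .registerMetricsPublisher(HystrixServoMetricsPublisher.getInstance());
    }
}
```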

What is the best way to tune Hystrix with Ribbon? The advice from Ben Christensen (Hystrix project owner) is to make sure the aggregate of all Ribbon timeouts and retries is less than the timeout of Hystrix. You can safely do this by configuring only the Hystrix timeout and deriving the Ribbon timeouts at runtime.
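The rule can be made concrete with a little arithmetic. The sketch below is my own illustration, assuming the worst case where every attempt (original plus same-server retries, multiplied across next-server retries) burns a full connect-plus-read timeout; the parameter names mirror the standard Ribbon client properties and the Hystrix thread timeout property:

```java
// Hedged sketch: worst-case Ribbon time budget vs. the Hystrix timeout.
// Ribbon properties: <client>.ribbon.ConnectTimeout, ReadTimeout,
//                    MaxAutoRetries, MaxAutoRetriesNextServer
// Hystrix property:  hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds
public class TimeoutBudget {

    // Worst case: each attempt uses a full connect + read timeout.
    public static int ribbonWorstCaseMs(int connectTimeoutMs, int readTimeoutMs,
                                        int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (connectTimeoutMs + readTimeoutMs)
                * (maxAutoRetries + 1)
                * (maxAutoRetriesNextServer + 1);
    }

    // The Hystrix timeout should exceed the aggregate Ribbon budget.
    public static boolean hystrixTimeoutCoversRibbon(int hystrixTimeoutMs,
                                                     int connectTimeoutMs, int readTimeoutMs,
                                                     int maxAutoRetries, int maxAutoRetriesNextServer) {
        return hystrixTimeoutMs > ribbonWorstCaseMs(connectTimeoutMs, readTimeoutMs,
                                                    maxAutoRetries, maxAutoRetriesNextServer);
    }
}
```

For example, a 100ms connect timeout, 300ms read timeout, and one retry each on the same and next server gives (100 + 300) * 2 * 2 = 1600ms, so the default 1000ms Hystrix timeout would fire before Ribbon exhausts its retries.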

What are some best practices for debugging timeouts and fallbacks with Hystrix? There is a special debug property that will dump a lot of info captured by Hystrix about the underlying causes of a timeout or fallback. I'm waiting for this property from the folks at Netflix.

Why does the first execution of a Hystrix Command usually fail? The first execution of a Hystrix Command usually fails with a Timeout because the initial call takes longer than the default Hystrix timeout of 1 second. This is likely due to the initial underlying client-side connection pool setup (using Ribbon, for example) - as well as any Jersey resource setup that is needed on the server-side. This discussion is relevant.

How do I throw an exception (ie. BadInputValidationException) without counting towards a Hystrix failure? You can throw HystrixBadRequestException for things like bad arguments. This won't count as a Hystrix error and therefore will not trigger a fallback or trip a circuit.
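A minimal sketch (the command name, group key, and validation rule are illustrative; HystrixBadRequestException is the real Hystrix class, and it propagates to the caller rather than triggering the fallback):

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.exception.HystrixBadRequestException;

// Hedged sketch: a command that rejects bad input without tripping the circuit.
public class GetUserCommand extends HystrixCommand<String> {
    private final String userId;

    public GetUserCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserGroup"));  // group key is illustrative
        this.userId = userId;
    }

    @Override
    protected String run() {
        if (userId == null || userId.isEmpty()) {
            // Counts as neither failure nor success: no fallback, no circuit trip.
            throw new HystrixBadRequestException("userId must be non-empty");
        }
        return "user:" + userId;  // real lookup elided
    }

    @Override
    protected String getFallback() {
        return "anonymous";  // used for real failures/timeouts, not bad requests
    }
}
```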

Ribbon HttpClient

Can I use Ribbon with Thrift, SOAP, etc? It is possible to use SOAP with Ribbon's RestClient, which internally uses Jersey Client with Apache HttpClient. You have to create the HttpClientRequest with a plain String entity which is the SOAP message, and add the appropriate content-type header. Then you can call RestClient.execute(request) to get a response, where you probably need to manually deserialize the response entity.

Ribbon will likely not work with Thrift, which is a binary protocol.

The roadmap for Ribbon includes multi-protocol support including http, WebSockets, SSE, custom tcp, thrift, soap, etc. In addition, multiple I/O models including async nio via Netty/HttpClient 4.x.

What has to be implemented in Ribbon if you by-pass Eureka and use a different Service Discovery mechanism such as Curator's ZK-based implementation? A little bit of glue code, but not much.

Eureka Service Discovery Load Balancer

Do I still need Eureka if I'm using VPC and a middletier ELB load balancer? Possibly. Eureka is used by a few Netflix OSS components - specifically Priam (Cassandra sidecar), Asgard (deployment and resource management), and Turbine (real-time Hystrix metrics collection) - but there are workarounds for all of them. I recently submitted a pull request that decouples Turbine from Eureka. If you aren't using Priam or Asgard, you may give the VPC-based middletier ELB a try versus introducing Eureka. Just point Ribbon (or any HttpClient of your choice) at the ELB. See the Astyanax section below for a Eureka-less configuration for Astyanax/Cassandra on AWS.

When I rev my middletier services, how do I update Eureka to remove the old services and start using the new services? The assumption is that your services are stateless - perhaps using Redis or equivalent for caching session state. If this is the case, Asgard is traditionally the mechanism to perform a red-black push. The red cluster (new build) is brought up alongside the black cluster (old build). Once you've verified that the canary is behaving correctly (see Deployment section), Asgard will activate the red cluster (new build) in Eureka and deactivate the black cluster (old build) in Eureka. If you're using a VPC middletier ELB instead of Eureka, you can use DNS (ie. AWS's Route53 and/or Netflix's Denominator) to flip this switch.

Why is Eureka so slow at startup - particularly in the single-eureka-instance scenario? Eureka gets into a race condition at startup where it tries to register itself before it's ready to accept registrations. To work around this, set eureka.registration.enabled=false as well as eureka.waitTimeInMsWhenSyncEmpty=0 in the eureka-server.properties file within the eureka-server project. You'll need to rebuild and redeploy the eureka-server webapp.
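For copy-paste convenience, the single-instance workaround from the answer above looks like this in eureka-server.properties (only these two keys are involved; everything else stays as shipped):

```properties
# eureka-server.properties: single-instance startup workaround
eureka.registration.enabled=false
eureka.waitTimeInMsWhenSyncEmpty=0
```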

Why do instances not de-register from Eureka server when they terminate - particularly in a single-eureka-instance scenario? This is a feature of Eureka called self-preservation mode which causes Eureka to halt all de-registrations if heartbeat renewals drop below an 85% threshold within 15 minutes (by default). More information is here. You can disable this feature by setting the following property: eureka.enableSelfPreservation=false

How long will it take for instances to drop out of service when they terminate? Service instances ping eureka server (through eureka client, but that's an implementation detail) every 30 seconds with a heartbeat. Eureka server marks the instance DOWN after 3 missed heartbeats. This means that instances will drop out of service after 90-120 seconds. Any client code that uses eureka client to discover instances registered with eureka server should definitely retry in the case that a dead service is returned from eureka. Note: This is the default behavior of Ribbon when configured to use eureka for load balancing.

Why are my middletier services not reachable by my edge services when using Eureka in an AWS environment? This is most-likely a security group issue. Keep in mind that Eureka serves up hostnames directly to the middletier instance and does not go through an ELB - unless explicitly configured to do so. Therefore, the security group of your middletier services must allow the appropriate incoming traffic.

How do I change the default Eureka port? This is not recommended as this is not an easy process. Here is a link on how to do it.

Blitz4j Asynchronous Logging

Can I use LogBack instead of Blitz4j? Even the Blitz4j project owner encourages new projects to use LogBack. Warning: A few friends have tried to go this route and log4j keeps creeping back in due to transitive dependencies scattered throughout the various Netflix OSS components. My recommendation is to stick with Blitz4j for now unless you have the time to debug the creeping dependencies.

Priam Cassandra Sidecar

How do you handle restarting processes (tomcat) if they go down? daemontools or Upstart (built into Ubuntu). This is helpful in the case of JVM-based sidecars such as Priam.

Does Priam use Eureka? Priam doesn't use Eureka in the public version. Still trying to understand how this is possible as Eureka seems to be a key component for Priam's usefulness. More to come.

Astyanax

How does Astyanax discover Cassandra nodes in an ephemeral environment such as AWS? Using Eureka (and Priam), of course. TODO: Figure this out. The astyanax NetflixDiscoveryHostSupplier or FilterHostSupplier appears to be the documented mechanism, but I can't seem to find these in the Astyanax codebase.

Does Astyanax use Eureka? No, the latest Astyanax does not use Eureka. At one point it did. And the Astyanax documentation even mentions Eureka as a node discovery mechanism. Again, not sure what's going on here. More to come as I talk to more people.

How can I use Astyanax without Eureka and Priam? After much experimentation, my friends over at Black Pearl Systems (many former Netflix'ers) have settled on the following configuration for Astyanax->Cassandra in an AWS environment without Eureka and Priam. Here's roughly what they've done: Within a VPC, set up middletier ELBs in front of each ASG (1 ASG per AZ, aka. "rack", multi-region setup) with the following hostname convention: [ringname].discovery.[environment].[region].[az/rack].yourdomain.com.

The ELBs listen and proxy thrift traffic (port 9162) back to the EC2/Cassandra instances in the ASG. Astyanax clients use this well-known ELB hostname as their "seed host" which ultimately does a ring describe and passes the dynamic EC2 nodes back to the client. Since the well-known ELB hostname is not actually in the ring, it will not be returned to the client. Therefore, clients will never use the well-known ELB hostname except for initial seed purposes.

Each Cassandra node requires a custom, templated cassandra.yaml and cassandra-rackdc.properties to swap in the ring name, rack, DC, interface, etc at startup time before the Cassandra/Java process starts.

Each ASG runs in a single AZ. This helps control the number of instances versus relying on AWS's automatic ASG balancing across zones. There are 3 ASG's - each running in their own AZ - with a replication factor of 3.

By default, Astyanax will contact the seed host for an updated ring config across the entire cluster every 30s. This can be overridden by specifying discoveryIntervalInSeconds when building the AstyanaxContext.

More info on the AstyanaxContext builder here.
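A minimal sketch of the Eureka-less setup described above (the ELB hostname follows the naming convention from this section; the cluster, keyspace, and pool names are illustrative, and getClient() assumes a recent 2013-era Astyanax - older versions call this getEntity()):

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

// Hedged sketch: Astyanax bootstrapped from a well-known ELB seed, no Eureka/Priam.
public class CassandraClient {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("myring")
            .forKeyspace("myks")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                // RING_DESCRIBE: ask the seed for the live node list, as described above
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("myPool")
                .setPort(9162)
                // The well-known ELB hostname serves only as the seed; it is not
                // in the ring, so it is never returned to clients after discovery.
                .setSeeds("myring.discovery.prod.us-east-1.1a.yourdomain.com:9162"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();
    }
}
```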

Another option is to use Route53 health checks to monitor the EC2 instances, but not much experimentation has been done along these lines.

Bam! No Eureka. No Priam. No Problem!

Simian Army/Chaos Monkey

Can you use the monkeys with anything except Amazon? e.g. Vagrant, etc. At this time, no. A custom plugin would need to be developed similar to the vSphere plugin.

Amazon Web Services

Should we use AWS VPC? Absolutely yes. The sooner you start using VPC, the smoother things will be. Retrofitting VPC later is difficult in terms of migrating Security Groups, etc.

Deployment

What is a canary? A single instance - or set of instances - that are deployed into production alongside the existing live cluster. The idea is to analyze the canary instance(s) against the live cluster to determine if there are any significant deviations in either system or application-level metrics.

Why does Flux use Elastic Beanstalk? I love Elastic Beanstalk. Out of the box, it provides an Apache 2, Tomcat 7 environment running in a VPC. This lets me focus exclusively on my webapp. You can also customize the default AMI and provide custom CloudFormation scripts to be used, although I haven't yet needed to do this. Beanstalk even provides an eclipse plugin which allows you to specify a remote Beanstalk Server to deploy to - similar to a local tomcat server. Both show up in the Servers tab and allow incremental deployments, etc.

Does Elastic Beanstalk support rolling pushes? As of November 2013, yes it does. More info here.

Does Elastic Beanstalk support Red-Black pushes? Indeed it does. Just setup a Red and a Black environment for a given application. When you're ready to switch environments, you simply use the Swap Environment Beanstalk functionality. This flips the CNAME to point to the new environment's ELB. If you need to rollback, simply Swap Environment back to the old environment.

Does Elastic Beanstalk support a canary-in-a-coal-mine test push for a new build to run alongside an existing build? Need to think this one through a bit. I'll find a way!

Does Elastic Beanstalk allow custom AMIs for Cassandra, ElasticSearch (not the AWS CloudSearch), etc? Yes. See this document for more details.

Asgard

Does Asgard support multiple regions? Yes. Asgard supports the following: us-east-1, us-west-1, us-west-2, eu-west-1, ap-northeast-1, ap-southeast-1, ap-southeast-2, sa-east-1. More to come as new regions come online.

JVM

Should we run multiple VMs on a single host? The folklore is that any VM with over 12GB of heap will likely experience longer-than-usual GC pauses. I've seen larger heaps run fine once the object allocation patterns are well-established and the GC params are tuned properly.

How do we tune the GC params? There's a great tool called gcviz that is heavily used at Netflix for GC visualization and tuning. I highly recommend it. This blog post discusses some suitable defaults. However, there is no substitute for testing a canary under load.