OSv Case Study: Memcached
For the later Seastar-based memcached, see the separate Seastar memcached page.
Memcached is a popular in-memory key-value store. It is used by many high-profile Web-sites to cache results of database queries and prepared page sections, to significantly boost these sites' performance.
An unmodified memcached on OSv was able to handle about 20% more requests per second than the same memcached version on Linux. A modified memcached, designed to use OSv-specific network APIs, had nearly four times the throughput.
Introduction to memcached
Memcached is a good case study for OSv for several reasons:
- Memcached is a real and popular application, on both physical servers and the cloud. It is not a toy application or a benchmark.
- Memcached makes high demands on the operating system. It needs to handle a huge number of TCP or UDP requests, doing very little computation on each. It needs to manage a lot of memory filled with small objects, and it needs as much free memory as possible for the actual data. We believe that OSv is a better operating system for applications on the cloud, so we expect that a memcached VM ("virtual appliance") built on OSv can significantly outperform the traditional Linux-based VM.
- The peak throughput (requests-per-second) of a memcached server is normally limited only by the efficiency of the software (OS and memcached). It is not limited by unrelated factors like disk speed.
As we show below, the performance of an OSv-based memcached VM is significantly better than what is achievable on a Linux-based VM. The standard memcached performed better on OSv than on Linux (answering 20% more requests per second). But OSv can do even better, by not being bound to the 30-year-old Unix networking APIs available on Linux: by writing a memcached server that uses OSv's lower-overhead APIs, we achieved almost 4 times the performance of the Linux+memcached baseline.
Memcached's protocol supports both UDP and TCP. Each has different advantages and disadvantages; TCP is slower but more reliable, UDP is faster but only works for small (packet-sized) requests and responses. When the cached values are small, and request/response loss is acceptable (as is true when we remember that memcached is just a cache), UDP has the potential to provide better performance, and indeed companies like Facebook report using it.
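To make the UDP side concrete, here is a short illustrative sketch (in Python; not part of memcached or OSv) of how a memcached UDP request datagram is built: an 8-byte frame header, followed by the same ASCII command text used over TCP.

```python
import struct

def make_udp_get(key: str, request_id: int = 0) -> bytes:
    """Build a single-datagram memcached UDP "get" request."""
    # memcached's 8-byte UDP frame header: request id, sequence number,
    # total datagram count, reserved -- four big-endian 16-bit fields.
    header = struct.pack("!HHHH", request_id, 0, 1, 0)
    # The payload after the header is the ordinary ASCII text protocol.
    return header + f"get {key}\r\n".encode()
```

The response comes back with the same 8-byte header; because there is no retransmission, a lost datagram simply means a lost request, which is acceptable for a cache.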
This is why we decided to initially focus on performance of memcached on UDP. We also plan to look at TCP (which OSv also supports, of course), but later.
For a benchmark, we chose memaslap from libmemcached. This fairly well-known benchmark (available, for example, as a Fedora package) repeatedly sends a configurable number of memcached requests, 10% of which are "SET" and 90% "GET", and measures the achieved throughput (requests per second). We run memaslap on a different physical host from the one running the memcached VM, with the two hosts connected by a direct 40 Gbps link. We give memcached enough memory for the memaslap benchmark to have zero cache misses during the test (e.g., we found a 5 GB cache to be enough for 30-second measurements).
All our benchmarks below use a 1-CPU VM. Ideally, memcached's performance should scale up with the number of CPUs, but actually achieving this requires a multi-queue network card, which is not often available to virtual machines. So we initially focus our effort on getting the best memcached performance on a single-CPU VM.
The host was a single-socket quad-core Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz. It ran the KVM hypervisor and vhost-net, KVM's paravirtual network driver. Though the guest received a single CPU, the host actually used additional CPUs on its behalf, for processing the paravirtual I/O requests and for processing physical network interrupts. For each setup separately, we chose how to pin those various threads and interrupts to physical CPUs in the way that achieved the best performance.
For a Linux-based memcached VM we took an up-to-date Linux distribution, Fedora 20, with its included Linux kernel and memcached server. With this setup (firewall disabled), we achieved 104,394 requests per second.
When we ran the same version of memcached on OSv, we measured 20% better performance: 127,275 requests per second.
A Better Memcached for OSv
A 20% performance improvement is nice, but we set out to demonstrate that it is possible to build a memcached virtual appliance using OSv with performance at least double that of the baseline, still using only a single CPU.
Some performance can be gained simply by rewriting memcached more efficiently. We wrote a memcached clone that supports the subset of the memcached protocol needed for the benchmark (namely UDP, and the "GET" and "SET" operations). This change alone already gained about 55% over the Linux baseline on the KVM hypervisor.
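To illustrate how small the protocol subset is, here is a minimal Python sketch of the request handling such a clone needs (the real clone is not written in Python, and the function and variable names here are ours, for illustration only):

```python
def handle_request(data: bytes, store: dict) -> bytes:
    """Handle one text-protocol request; supports only SET and GET."""
    head, _, rest = data.partition(b"\r\n")
    parts = head.split()
    if not parts:
        return b"ERROR\r\n"
    if parts[0] == b"set":          # set <key> <flags> <exptime> <bytes>
        key, flags, nbytes = parts[1], parts[2], int(parts[4])
        store[key] = (flags, rest[:nbytes])
        return b"STORED\r\n"
    if parts[0] == b"get":          # get <key>
        hit = store.get(parts[1])
        if hit is None:
            return b"END\r\n"       # cache miss
        flags, value = hit
        return (b"VALUE " + parts[1] + b" " + flags + b" "
                + str(len(value)).encode() + b"\r\n"
                + value + b"\r\nEND\r\n")
    return b"ERROR\r\n"
```

Because each request is handled independently and the store is touched from one thread, there is no inherent need for the per-socket locking that the general-purpose socket API must assume.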
That's an impressive improvement over the baseline, but not the end of the story. The socket APIs supported by Linux are great, and have served us well for 30 years, but they carry numerous overheads that prevent memcached from achieving truly mind-blowing performance. For example, a UDP socket can be used concurrently by multiple threads, so the implementation needs a lock to protect it, even if we know that our application will never access the socket from more than one thread. Also, the various layers of the TCP/IP and socket stack are important for a full-featured operating system running a variety of servers, but they only slow down a VM whose single purpose is to provide one single UDP service: memcached.
So we wrote a new OSv-specific memcached clone which bypasses the socket APIs and most of the TCP/IP stack. This VM achieved a whopping 406,750 requests per second, 3.9 times the baseline performance.
Our new osv-memcached is still a limited prototype: It currently only supports UDP, only supports "GET" and "SET" commands as required by the benchmark, and is limited to MTU-sized requests and responses (no IP fragmentation). However, we expect that these missing features can be added without sacrificing any of the performance that we've achieved.
Automatic memory management
The unmodified memcached server uses a static limit on its maximum memory consumption: the -m parameter. With a static limit, one cannot optimally utilize the memory available in the guest in every situation: memcached will either use less than is available (when the guest has a lot of free memory), or it may make the guest sluggish when the guest runs out of memory because memcached consumes too much of it.
osv-memcached takes another approach: it utilizes all the memory available in the guest, and shrinks its cache if the guest runs low on memory (for any reason).
The shrinking is triggered by OSv's memory::shrinker framework: when the amount of free memory in the system falls below a threshold, the framework asks each registered agent (osv-memcached is one of them) to free some amount of memory, with the needed amount specified as part of the framework's request to the agent. This way osv-memcached uses the maximum amount of memory available in the guest at any given moment, yet can release the required amount back to OSv when OSv needs it (e.g., for some other cache, such as ZFS's ARC).
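The shrinker interaction can be sketched as follows. This is a hedged Python analog, not OSv's actual C++ API: the class and method names below are illustrative, and the real memory::shrinker framework differs in its details.

```python
class ShrinkableCache:
    """Toy analog of a cache registered with a shrinker framework:
    it grows freely, and releases entries when memory is requested back."""

    def __init__(self):
        self.items = {}   # dict insertion order doubles as a crude LRU

    def put(self, key, value):
        self.items[key] = value

    def request_memory(self, n_items: int) -> int:
        # Called by the (hypothetical) framework when free memory falls
        # below its threshold; evict the oldest entries and report how
        # many were actually released.
        evicted = 0
        while evicted < n_items and self.items:
            oldest = next(iter(self.items))
            self.items.pop(oldest)
            evicted += 1
        return evicted
```

The key design point is that the cache reports back how much it freed, so the framework can move on to other registered agents if one cache alone cannot satisfy the request.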
Test hardware
- CPU: single socket, 4 cores, Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
- RAM: 32 GB
- NIC: Mellanox Technologies MT27500 Family (ConnectX-3)
Benchmark command lines
- memaslap:
memaslap -s <server IP> -T 3 --concurrency 120 -t 30s --udp
- Unmodified memcached:
memcached -u root -t 1 -m 5048
- Modified memcached:
Google Cloud Engine (GCE)
We wanted to see how osv-memcached performs in other virtualization environments. We started with GCE.
We used the following instances for testing:
- For running memaslap: n1-standard-8: 30 GB RAM, 8 vCPUs. We installed Debian 7.4 Linux on it.
- For running memcached servers: n1-standard-1: 3.8GB RAM, 1vCPU
Results (requests per second):
- Fedora 20 with unmodified memcached: 37,696
- OSv with unmodified memcached: 46,891 (+20%)
- OSv with osv-memcached: 78,278 (+105%)
- Naturally, we couldn't control the affinity of the hypervisor threads in GCE, so we couldn't reach the same level of performance as with KVM, where we completely controlled those affinities and prevented (as much as possible) the physical CPU from becoming a bottleneck in this CPU-hungry benchmark.
- Since network latencies in GCE are (naturally) higher than in our back-to-back setup, we used a greater value for memaslap's --concurrency parameter. We found that 360 gave the best numbers for 3 threads, so the final command line was:
memaslap -s <server IP> -T 3 --concurrency 360 -t 30s --udp
- To get the best networking performance we used internal IPs for testing.