The power of technology
19 December 2019
We are using incredibly powerful technology these days: you can stream high definition video on a cheap smartphone thanks to state-of-the-art wireless technology and state-of-the-art video compression. The first public radio broadcast was barely 100 years ago. When you are shopping for new clothes online, the connection with the webserver uses state-of-the-art encryption that is practically impossible to crack. About 75 years ago, during World War II, the British were able to obtain crucial intelligence by cracking the Enigma, a feat that would take milliseconds on a modern computer. A little over 40 years ago, the Cray-1 was a supercomputer that had about 8 megabytes of memory, could do 160 million floating-point operations per second (FLOPS), weighed 5.5 tons, consumed over 100 kilowatts of power, and cost 7.9 million dollars in 1979. Just 15 years later, in 1994, anyone with a couple of thousand dollars could buy an Intel Pentium desktop with similar specifications. Today, in 2019, you can buy a Raspberry Pi Zero that does more operations per second, has 64 times more memory, uses about 1 watt of energy, and weighs less than 10 grams, for only 5 dollars. That's a mind-boggling amount of improvement in just 40 years. According to the Wikipedia page about FLOPS, the cost per FLOPS - between 1961 and 2017, corrected for inflation - decreased by a factor of about 5.2 trillion. Because computing is so cheap, more people have mobile phones than toilets. And that article was already over six years old when I wrote this. I could go on and on about how much we've improved computer performance, but I think you get the point by now.
So technology is widely available and ridiculously cheap. As a result, we can do the most amazing things with computers. But we reached the point quite some time ago where our hardware doesn't actually get much faster anymore. Clock speeds have been relatively stable for the past decade or so, so CPUs execute roughly the same number of instruction cycles per second. As our chips shrink more and more, the number of cores in every CPU increases, which allows our computers to do more things in parallel. But while our computers may get better at multi-tasking, not all applications magically benefit from having more cores. If the problem is embarrassingly parallel it might be easy to make something faster using multiple cores, but developers need to be wary of a class of non-trivial bugs called race conditions. At the same time, our processors get faster by using various complicated tricks to let the CPU evaluate more instructions per cycle. This has led to a whole new class of security issues such as Meltdown and Spectre.
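To make that concrete, here is a minimal sketch in Python (my own example, not from any particular framework): several threads increment a shared counter, and it is the lock that makes the result deterministic. Each increment is a read-modify-write; without the lock, two threads can read the same value, both write back value plus one, and an increment is silently lost.

```python
import threading

def worker(counter, lock, n):
    # each increment is a read-modify-write; the lock makes it atomic
    for _ in range(n):
        with lock:
            counter["value"] += 1

lock = threading.Lock()
counter = {"value": 0}
threads = [threading.Thread(target=worker, args=(counter, lock, 100_000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# with the lock this is always 400000; drop the `with lock:` line and
# increments can interleave and get lost (a race condition)
print(counter["value"])
```

The nasty part is that the unlocked version often produces the right answer anyway, which is exactly what makes race conditions so hard to find.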
One class of incredible technology that pretty much every developer uses is the database. Relational databases that you can query using SQL are extremely sophisticated and high-performing pieces of technology. Typing in some SQL and getting back a response in a fraction of a second might seem ordinary, but it is not trivial at all. A lot of work from very smart people went into the development of SQL databases. PostgreSQL might be the most advanced open-source database; proprietary databases such as Microsoft's SQL Server and Oracle are quite a bit more advanced (and expensive). Traditionally all these databases are designed around slow persistent storage such as spinning hard drives. Every time you make a change to your data and the database confirms it, the data is actually written to disk, to guarantee that you don't lose any data. Simply storing data is not the only responsibility of a database: it also has to be transactional. If you transfer money from one bank account to another, it guarantees that it either stores the changes to both accounts or none at all. There is no possibility that only one account has money added or deducted. Decades of research and development were spent on relational databases to make them highly performant and reliable. They are a great example of technology that makes very efficient use of the available hardware.
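You can see that transactional guarantee in action with Python's built-in sqlite3 module (a toy sketch of my own, not how a bank actually does it): the transfer either commits both updates, or an error rolls both back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    # both UPDATEs run inside one transaction: using the connection as
    # a context manager commits on success and rolls back on an exception
    try:
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # transaction rolled back, neither account changed

transfer(conn, "alice", "bob", 60)   # succeeds
transfer(conn, "alice", "bob", 100)  # would overdraw alice: rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 60}
```

The second transfer fails, and afterwards the balances are exactly as they were after the first one: no money appeared or vanished halfway through.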
Another way to improve performance is to use multiple computers connected by a network, also known as distributed computing. If coordinating processes on a single CPU with multiple cores is difficult, distributed computing is a lot harder, as now you have a network to deal with. If a single computer is running a computation, and that computer suddenly fails, you need to restart the whole thing from scratch. Running a computation on a cluster of a thousand computers (each called a 'node') is a different story. If one of the nodes fails, you don't want to start over from scratch, you want the work it was doing to be taken over by a different node. One of the earlier frameworks to handle failing nodes when calculating 'embarrassingly parallel' problems was MapReduce published by Google in 2004. This model was made popular in the rest of the world after Hadoop was released as an open-source MapReduce framework in 2006. Ever since, very smart people at big tech companies dealing with huge amounts of data have created boatloads of great open-source tools to easily take advantage of distributed computing, such as Apache Spark and Kubernetes. Organizations all over the world are now taking advantage of these tools to handle all of their Big Data.
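The MapReduce model itself is simple enough to sketch on a single machine. Below is a toy word count in Python; the function names (map_phase, shuffle, reduce_phase) are my own, and in a real framework like Hadoop the map and reduce phases would run on many nodes, with the shuffle happening over the network.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (word, 1) pair for every word in one chunk of input
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by key, as the framework would
    # between the map and reduce phases (here, all on one machine)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine all the counts for one word
    return key, sum(values)

documents = ["big data big compute", "big deal"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 1, 'compute': 1, 'deal': 1}
```

The clever part of MapReduce is not this logic but everything around it: if the node mapping one document dies, the framework just reruns that map task elsewhere.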
But wait, what exactly is Big Data? There is no official definition, but I like the idea that Big Data is either 1) too much data to fit on a single computer or 2) too much data to be processed fast enough by a single computer. The first definition is easy to quantify. At the moment of writing, you can get relatively cheap hard disk drives with 12 TB of capacity. Let's assume with some redundancy you could easily put 72 TB in a single server. So Big Data by the first definition would be at least a hundred terabytes of data. The second definition is a bit harder to quantify, as "fast enough" is very subjective. Let's assume "fast enough" is being able to process in 1 hour all the data generated in 1 day. Let's take eight SSDs and assume they can read steadily at 500 MB/s each. Let's also assume that the computation is limited only by how fast we can read the data. At that rate, you can read roughly 14 TB of data in one hour. So if you produce more than 14 TB a day, you probably have Big Data.
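The arithmetic behind that second definition, spelled out with the assumptions above (eight SSDs at a sustained 500 MB/s each, decimal units, read-bound computation):

```python
# back-of-the-envelope for "fast enough": how much data can one
# machine read through in an hour if reading is the bottleneck?
ssds = 8
read_speed_mb_s = 500            # MB per second, per drive
seconds_per_hour = 3600

mb_per_hour = ssds * read_speed_mb_s * seconds_per_hour
tb_per_hour = mb_per_hour / 1_000_000  # 1 TB = 1,000,000 MB (decimal)
print(tb_per_hour)  # 14.4
```

Swap in your own drive count and read speeds and the threshold moves accordingly; the point is that it sits in the double-digit terabytes per day, not the gigabytes.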
Why does all of this talk about Big Data matter? Because this is an article about using the power of technology. If you have hundreds of terabytes of data or more, or you need to analyze your several terabytes of data at terabytes per second, something like Spark can be quite useful for coordinating all that processing power to solve your computations. If you have anything less than tens of terabytes, you might want to check your options first. When I was working with not-quite-so-Big Data in 2015, it occurred to me that Spark wasn't really that necessary for the types of computation we were doing. Or even much faster. Starting a bunch of virtual machines just to perform some simple calculations and aggregations seemed like a waste of resources, especially when we also had BigQuery available to do the same kind of calculations with much less overhead. More people on the internet seemed to have similar feelings. One post in particular caught my eye, about COST, or the Configuration that Outperforms a Single Thread, and its follow-up post Bigger data; same laptop. In these posts, Frank McSherry criticizes papers that benchmark distributed graph processing systems using a small amount of data. The graphs being processed easily fit on a laptop, and McSherry shows that the laptop, using a single core, in fact outperforms the distributed systems with 128 cores with ease. He thus challenges the authors of such papers to show the configurations at which a distributed system will outperform a single thread. Luckily Frank is not just a smart-ass being smug about being so much faster with so few cores. He wrote Timely Dataflow, and Differential Dataflow on top of that, to help us mortals create very fast computations on a laptop, too. Another great example of properly using the power of the hardware we have available today.
One final example that I want to bring up about getting a lot of performance out of hardware comes from Netflix. As you might know, Netflix runs mostly on Amazon Web Services (AWS), and as you might also know, AWS is quite expensive, especially if you want to send data from your server to the user. This is also known as egress, and cloud providers charge anywhere between $0.05 and $0.15 per transferred gigabyte. Because Netflix sends a lot of data from their servers to the user, they need to be very smart about where they serve the data from, to prevent paying AWS more in bandwidth charges than they receive in subscription fees. To do this, Netflix places Open Connect devices at internet service providers (ISPs) all over the world. These machines cache the most popular content, so they improve the quality of streaming by being closer to the customers while Netflix pays much less to AWS. To save as much money as possible, they need to get as much performance out of these servers as possible. In streaming land, that means you try to saturate all of the available bandwidth that a single machine is connected to. For Open Connect, that means serving 100 Gbps from every device, which is a staggering amount of throughput. Actually, the article states that they achieved about 90 Gbps of throughput with encryption (TLS), but they must have figured out how to squeeze the final 10 Gbps out of them by now.
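To get a feel for why this matters so much to Netflix, here is a rough sketch (my own numbers, using only the $0.05-$0.15 per gigabyte range mentioned above) of what a single fully saturated 100 Gbps server would rack up in egress fees per month:

```python
# rough estimate of monthly egress fees for one server that serves
# 100 Gbps around the clock, at $0.05-$0.15 per transferred gigabyte
gbps = 100
gb_per_second = gbps / 8             # 100 gigabits = 12.5 gigabytes
seconds_per_month = 30 * 24 * 3600

gb_per_month = gb_per_second * seconds_per_month
low, high = 0.05 * gb_per_month, 0.15 * gb_per_month
print(f"{gb_per_month / 1e6:.1f} PB/month, ${low:,.0f} to ${high:,.0f}")
# 32.4 PB/month, $1,620,000 to $4,860,000
```

Millions of dollars per month, per server, and Netflix needs thousands of them: it's easy to see why serving that traffic from their own boxes inside ISPs is worth the engineering effort.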
For every example of using technology in a great way, there must be at least a thousand examples of using technology in a horrible way. Most websites these days are bloated with unnecessary content and agonizingly slow. It seems that no matter how fast our hardware gets, people will find a way to make their software slow and unresponsive. According to some research, the limit at which most people feel like something is reactive is about 70-100 milliseconds. In other words: when you press a button, it should react in less than 100 ms to feel instantaneous. Yet a lot of websites these days are obese: they load several megabytes in assets and take tens of seconds to load. Fine, you might say, but we all have gigabit internet connections, multi-core processors, and heaps of memory. But consider that most people don't actually have loads of bandwidth on their internet connection, and that most people can't afford the latest iPhone or Macbook Pro, and browsing the web starts looking like something for the rich. What's the point of making technology faster and cheaper if we are making our software slower and bulkier at the same rate?
Technology is awesome and I love that building software is my job. It's amazing to see how far we've come in just a couple of decades, and I'm curious to see where we will go next. Getting the most out of technology is difficult: it requires thinking hard, being pragmatic, having a good understanding of what you are actually doing, and being critical about hypes and marketing. Luckily there are a lot of great people out there that do amazing things with technology.