<h1 class="tocheading">Lambda Architecture, Part 1</h1>
<div id="toc"></div>

<i>N.B. blockquotes (and nearly all images) in this notebook are excerpts from [Big Data](https://www.manning.com/books/big-data) by Nathan Marz.</i>

A new paradigm for Big Data
===============================

**Standard: Evaluate when to apply Lambda Architecture**

By the end of this lesson, you should be able to:

- Explain the components of the Lambda Architecture
- Name eight desired properties of a Big Data system
- Discuss the advantages and disadvantages of the Lambda Architecture versus traditional databases

Scaling with a traditional database
-------------------------------------------------

Scenario: you are the data engineer at a brand new web startup. You have designed your OLAP RDBMS according to all the best practices. You've anticipated questions like: "How many daily active users do we have?" and "How many pages do users visit?" which can be answered by tables like:

| Column name 	| Type
|:------------:	|:------------:
| id          	| integer
| user_id      	| integer
| url          	| varchar(255)
| pageviews   	| bigint

The site is a hit and before long you start seeing:
<br /><br />
<font size=6 color=red><b><center>Timeout error on inserting to the database.</center></b></font>
<br /><br />

You remember that random writes can be time-consuming, so you insert a queue to buffer updates and apply them in batches:

### Scaling with a queue

![Batching updates with queue and worker](https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig02.jpg)
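
The queue-and-worker pattern in the figure can be sketched roughly as follows. This is a minimal illustration, assuming an in-memory dictionary stands in for the `pageviews` table; `drain_batch`, `apply_batch`, and `BATCH_SIZE` are hypothetical names, not part of any real system:

```python
from collections import Counter
from queue import Queue, Empty

BATCH_SIZE = 100  # hypothetical: max events folded into one database write


def drain_batch(events: Queue, max_items: int = BATCH_SIZE) -> Counter:
    """Pull up to max_items (user_id, url) events off the queue and
    collapse them into one increment per key."""
    batch = Counter()
    for _ in range(max_items):
        try:
            batch[events.get_nowait()] += 1
        except Empty:
            break
    return batch


def apply_batch(table: dict, batch: Counter) -> None:
    """Stand-in for a single multi-row UPDATE against the database."""
    for key, increment in batch.items():
        table[key] = table.get(key, 0) + increment


# The web tier only enqueues events; it never touches the database.
events = Queue()
for event in [(1, "/home"), (1, "/home"), (2, "/about")]:
    events.put(event)

# The worker turns three queued events into two row updates.
pageviews = {}
apply_batch(pageviews, drain_batch(events))
```

Batching reduces the number of writes, but each row is still mutated in place, which matters later in the lesson.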

This helps, but only temporarily. Eventually, even with batched updates, the database can only handle so many writes. Then you recall an article on [Instagram's engineering blog about sharding](http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram). It's a beast to set up, but in the end you get it working:

### Scaling by sharding the database
![](http://i.imgur.com/1LoFj7B.png)
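
As a rough sketch of what sharding means for the application code, assume rows are partitioned by URL using a stable hash. `NUM_SHARDS`, `shard_for`, and the list of dictionaries standing in for database servers are all hypothetical:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard with a stable hash. Python's built-in hash()
    is randomized per process for strings, so it is unsuitable here."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dict stands in for one database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def increment_pageviews(url: str) -> None:
    """Route the write to whichever shard owns this URL."""
    shard = shards[shard_for(url)]
    shard[url] = shard.get(url, 0) + 1


for url in ["/home", "/about", "/home"]:
    increment_pageviews(url)
```

Note that with this modulo scheme, changing `NUM_SHARDS` remaps most keys to a different shard, which is one reason resharding a live system is painful.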

### Fault-tolerance issues begin

Your website continues to grow exponentially and, before you know it, your database is distributed over hundreds of nodes, which are now failing at a rate of one every few weeks. (Think about it: if the average machine fails once in 10 years, how long before one of 200 machines fails?) Fortunately, [Postgres 9 has built-in replication](http://www.postgresql.org/docs/current/static/high-availability.html), but this too is non-trivial to set up.
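
The parenthetical question above can be worked through with back-of-the-envelope arithmetic. The sketch assumes machines fail independently at a constant average rate, which is a simplification:

```python
MTBF_YEARS_PER_MACHINE = 10  # each machine fails about once per decade
MACHINES = 200
DAYS_PER_YEAR = 365

# Expected failures across the whole cluster in one year.
failures_per_year = MACHINES / MTBF_YEARS_PER_MACHINE

# Average gap between failures somewhere in the cluster.
days_between_failures = DAYS_PER_YEAR / failures_per_year

print(failures_per_year, days_between_failures)
```

Under these assumptions the cluster sees about 20 failures a year, or one roughly every 18 days, which matches "one every few weeks."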

### Corruption issues

It's now been a year since your website went live and you haven't had a good night's sleep in all that time, so it's no surprise when <blockquote><p class="noind">you accidentally deploy a bug to production that increments the number of pageviews
         by two, instead of by one, for every URL. You don’t notice until 24 hours later, but by then the damage is done. Your weekly
         backups don’t help because there’s no way of knowing which data got corrupted. After all this work trying to make your system
         scalable and tolerant of machine failures, your system has no resilience to a human making a mistake. And if there’s one guarantee
         in software, it’s that bugs inevitably make it to production, no matter how hard you try to prevent it.
      </p></blockquote>

### What went wrong?

### How will Big Data techniques help?

First principles
-------------------------------------------------
We can think of any data system (big or small) as something that ideally permits the following:
<blockquote>
$$\text{query} = function(\text{all data})$$
</blockquote>
That is, it should allow you to ask any question of all the data, that question essentially being a [<i>lambda</i> function](https://en.wikipedia.org/wiki/Anonymous_function) on the data.
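
To make the equation concrete, here is a small sketch in which each query is literally a function applied to the complete dataset. The `Pageview` record and the two query builders are hypothetical, chosen to match the startup scenario above:

```python
from typing import Callable, Iterable, NamedTuple


class Pageview(NamedTuple):
    user_id: int
    url: str
    day: str  # ISO date string


# Hypothetical "all data": every raw pageview ever recorded.
ALL_DATA = [
    Pageview(1, "/home", "2016-01-30"),
    Pageview(2, "/home", "2016-01-31"),
    Pageview(1, "/about", "2016-01-31"),
]


def pageviews_for(url: str) -> Callable[[Iterable[Pageview]], int]:
    """Build a query that scans all data and counts views of one URL."""
    return lambda all_data: sum(1 for pv in all_data if pv.url == url)


def daily_active_users(day: str) -> Callable[[Iterable[Pageview]], int]:
    """Build a query that counts distinct users seen on one day."""
    return lambda all_data: len({pv.user_id for pv in all_data if pv.day == day})


# query = function(all data)
home_views = pageviews_for("/home")(ALL_DATA)
active_on_31st = daily_active_users("2016-01-31")(ALL_DATA)
```

Because the raw records are kept, a question nobody anticipated is just one more function over the same data.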

Desired properties of a Big Data system
-------------------------------------------------
Given what you've learned above and your experience as a data scientist, what are some of the properties you would like to see in a Big Data system? Try to name eight:

<details><summary><i>Write your answer above before opening this.</i></summary>

<h3>Robustness and fault tolerance</h3>
<blockquote><p class="noind">Building systems that “do the right thing” is difficult in the face of the challenges of distributed systems. Systems need
         to behave correctly despite machines going down randomly, the complex semantics of consistency in distributed databases, duplicated
         data, concurrency, and more. These challenges make it difficult even to reason about what a system is doing. Part of making a Big Data system robust is avoiding these complexities so that you can easily reason
         about the system.
      </p></blockquote>

<h3>Low latency reads and updates</h3>
<blockquote><p class="noind">The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds
         to a few hundred milliseconds. On the other hand, the update latency requirements vary a great deal between applications.
         Some applications require updates to propagate immediately, but in other applications a latency of a few hours is fine. Regardless,
         you need to be able to achieve low latency updates <i class="calibre6">when you need them</i> in your Big Data systems. More importantly, you need to be able to achieve low latency reads and updates without compromising
         the robustness of the system.</p></blockquote>
<h3>Scalability</h3>
<blockquote><p class="noind">Scalability is the ability to maintain performance in the face of increasing data or load by adding resources to the system.
         The Lambda Architecture is horizontally scalable across all layers of the system stack: scaling is accomplished by adding
         more machines.
      </p></blockquote>
<h3>Generalization</h3>
<blockquote><p class="noind">A general system can support a wide range of applications. Indeed, this book wouldn’t be very useful if it didn’t generalize
         to a wide range of applications! Because the Lambda Architecture is based on functions of all data, it generalizes to all
         applications, whether financial management systems, social media analytics, scientific applications, social networking, or
         anything else.
      </p></blockquote>
<h3>Extensibility</h3>
<blockquote><p class="noind">You don’t want to have to reinvent the wheel each time you add a related feature or make a change to how your system works.
         Extensible systems allow functionality to be added with a minimal development cost.
      </p></blockquote>
<h3>Ad hoc queries</h3>
<blockquote><p class="noind">Being able to do ad hoc queries on your data is extremely important. Nearly every large dataset has unanticipated value within
         it. Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications. Ultimately, you can’t discover interesting things to do
         with your data unless you can ask arbitrary questions of it.</p></blockquote>
<h3>Minimal maintenance</h3>
<blockquote><p class="noind">Maintenance is a tax on developers. Maintenance is the work required to keep a system running smoothly. This includes anticipating
         when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production.
      </p>
      
      <p class="noind">An important part of minimizing maintenance is choosing components that have as little <i class="calibre6">implementation complexity</i> as possible. You want to rely on components that have simple mechanisms underlying them. In particular, distributed databases
         tend to have very complicated internals. The more complex a system, the more likely something will go wrong, and the more
         you need to understand about the system to debug and tune it.
      </p></blockquote>
<h3>Debuggability</h3>
<blockquote><p class="noind">A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to
         trace, for each value in the system, exactly what caused it to have that value.
      </p>
      
      <p class="noind">“Debuggability” is accomplished in the Lambda Architecture through the functional nature of the batch layer and by preferring to use recomputation
         algorithms when possible.
      </p></blockquote>
</details>

Before continuing: which of these were not accounted for in your answer above? Rewrite those in your own words.

The problems with fully incremental architectures
-------------------------------------------------
Returning to our pageview-count example: suppose our application continuously updates the database in place, as described here:

<blockquote><p class="noind">At the highest level, traditional architectures look like <a href="#ch01fig03" class="calibre4">figure 1.3</a>. What characterizes these architectures is the use of read/write databases and maintaining the state in those databases incrementally
         as new data is seen. For example, an incremental approach to counting pageviews would be to process a new pageview by adding
         one to the counter for its URL. This characterization of architectures is a lot more fundamental than just relational versus non-relational—in fact, the vast majority of both relational and non-relational
         database deployments are done as fully incremental architectures. This has been true for many decades.
      </p>


      
      
      
      <p><b><i>Figure 1.3. <a id="ch01fig03" class="calibre4"></a>Fully incremental architecture
      </i></b></p>
      
      
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig03.jpg" alt="" class="calibre2"/></p>
      
       <p class="noind">It’s worth emphasizing that fully incremental architectures are so widespread that many people don’t realize it’s possible
         to avoid their problems with a different architecture. These are great examples of <i class="calibre6">familiar complexity</i>—complexity that’s so ingrained, you don’t even think to find a way to avoid it.
      </p>
      
      <p class="noind">The problems with fully incremental architectures are significant. We’ll begin our exploration of this topic by looking at
         the general complexities brought on by any fully incremental architecture. Then we’ll look at two contrasting solutions for
         the same problem: one using the best possible fully incremental solution, and one using a Lambda Architecture. You’ll see
         that the fully incremental version is significantly worse in every respect.
      </p></blockquote>

### Extreme complexity of achieving eventual consistency

<blockquote><p class="noind">In order for a highly available system to return to consistency once a network partition ends (known as <i class="calibre6">eventual consistency</i>), a lot of help is required from your application. Take, for example, the basic use case of maintaining a count in a database.
         The obvious way to go about this is to store a number in the database and increment that number whenever an event is received
         that requires the count to go up. You may be surprised that if you were to take this approach, you’d suffer massive data loss
         during network partitions.
      </p>
      
      <p class="noind">The reason for this is due to the way distributed databases achieve high availability by keeping multiple replicas of all
         information stored. When you keep many copies of the same information, that information is still available even if a machine
         goes down or the network gets partitioned, as shown in <a href="#ch01fig04" class="calibre4">figure 1.4</a>. During a network partition, a system that chooses to be highly available has clients update whatever replicas are reachable
         to them. This causes replicas to diverge and receive different sets of updates. Only when the partition goes away can the
         replicas be merged together into a common value.
      </p>
      
      
      
      <p><b><i>Figure 1.4. <a id="ch01fig04" class="calibre4"></a>Using replication to increase availability
      </i></b></p>
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig04_alt.jpg" alt="" class="calibre2"/></p>
      
      
      <p class="noind">Suppose you have two replicas with a count of 10 when a network partition begins. Suppose the first replica gets two increments
         and the second gets one increment. When it comes time to merge these replicas together, with values of 12 and 11, what should
         the merged value be? Although the correct answer is 13, there’s no way to know just by looking at the numbers 12 and 11. They
         could have diverged at 11 (in which case the answer would be 12), or they could have diverged at 0 (in which case the answer
         would be 23).
      </p>
      
      <p class="noind">To do highly available counting correctly, it’s not enough to just store a count. You need a data structure that’s amenable
         to merging when values diverge, and you need to implement the code that will repair values once partitions end. This is an
         amazing amount of complexity you have to deal with just to maintain a simple count.
      </p></blockquote>
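
One well-known structure that is "amenable to merging" is a grow-only counter (G-Counter), which keeps a separate tally per replica and merges by taking the element-wise maximum. The sketch below is an illustration of that idea, not code from the book; it replays the 10 → 12 and 11 scenario from the excerpt, with the starting split of 6 and 4 chosen arbitrarily:

```python
from typing import Optional


class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str, counts: Optional[dict] = None):
        self.replica_id = replica_id
        self.counts = dict(counts or {})

    def increment(self) -> None:
        # A replica only ever touches its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Each slot only grows, so the larger number is the newer one.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)


# Both replicas agree on a total of 10 when the partition begins.
shared = {"a": 6, "b": 4}
replica_a = GCounter("a", shared)
replica_b = GCounter("b", shared)

replica_a.increment()
replica_a.increment()  # replica_a now reports 12
replica_b.increment()  # replica_b now reports 11

# The partition heals and the replicas exchange state.
replica_a.merge(replica_b)
replica_b.merge(replica_a)
```

After merging, both replicas report 13. Even this small example needs per-replica bookkeeping and a merge routine, which is the complexity the excerpt is warning about.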


### Lack of human-fault tolerance

<blockquote><p class="noind">An incremental system is constantly modifying the state it keeps in the database, which means a mistake can also modify the
         state in the database. Because mistakes are inevitable, the database in a fully incremental architecture is guaranteed to
         be corrupted.
      </p>
      
      <p class="noind">It’s important to note that this is one of the few complexities of fully incremental architectures that can be resolved without
         a complete rethinking of the architecture. Consider the two architectures shown in <a href="#ch01fig05" class="calibre4">figure 1.5</a>: a synchronous architecture, where the application makes updates directly to the database, and an asynchronous architecture,
         where events go to a queue before updating the database in the background. In both cases, every event is permanently logged
         to an events datastore. By keeping every event, if a human mistake causes database corruption, you can go back to the events store and reconstruct the proper state for the database. Because the events store is immutable and constantly
         growing, redundant checks, like permissions, can be put in to make it highly unlikely for a mistake to trample over the events
         store. This technique is also core to the Lambda Architecture.</p>
      
      
      
      <p><b><i>Figure 1.5. <a id="ch01fig05" class="calibre4"></a>Adding logging to fully incremental architectures
      </i></b></p>
      
      
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig05.jpg" alt="" class="calibre2"/></p></blockquote>
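
The recovery path described above can be sketched as replaying the event log through corrected code. Everything here is hypothetical: `EVENT_LOG` stands in for the events datastore, and the two increment functions stand in for the buggy and fixed deployments from the earlier corruption story:

```python
from collections import Counter

# Append-only event store: one record per pageview, never updated in place.
EVENT_LOG = ["/home", "/about", "/home"]


def buggy_increment(counts: Counter, url: str) -> None:
    counts[url] += 2  # the production bug: increments by two


def fixed_increment(counts: Counter, url: str) -> None:
    counts[url] += 1


def rebuild(events, apply_event) -> Counter:
    """Reconstruct the database state by replaying every logged event."""
    counts = Counter()
    for url in events:
        apply_event(counts, url)
    return counts


corrupted = rebuild(EVENT_LOG, buggy_increment)  # what the bug produced
repaired = rebuild(EVENT_LOG, fixed_increment)   # state after the fix
```

Without the log, the corrupted counts alone would not reveal which values were inflated. With it, recovery is a replay.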

Lambda Architecture
-------------------------------------------------
<blockquote><p class="noind">The main idea of the Lambda Architecture is to build Big Data systems as a series of layers, as shown in <a href="#ch01fig06" class="calibre4">figure 1.6</a>. Each layer satisfies a subset of the properties and builds upon the functionality provided by the layers beneath it. You’ll
         spend the whole book learning how to design, implement, and deploy each layer, but the high-level ideas of how the whole system
         fits together are fairly easy to understand.
      </p>
      
      
      
      <p><b><i>Figure 1.6. <a id="ch01fig06" class="calibre4"></a>Lambda Architecture
      </i></b></p>
      
      
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig06.jpg" alt="" class="calibre2"/></p>
      
      
      
      <p class="noind">Everything starts from the $\text{query} = function(\text{all data})$ equation. Ideally, you could run the functions on the fly to get the results. Unfortunately, even if this were possible,
         it would take a huge amount of resources to do and would be unreasonably expensive. Imagine having to read a petabyte dataset
         every time you wanted to answer the query of someone’s current location.
      </p>
      
      <p class="noind">The most obvious alternative approach is to precompute the query function. Let’s call the precomputed query function the <i class="calibre6">batch view</i>. Instead of computing the query on the fly, you read the results from the precomputed view. The precomputed view is indexed
         so that it can be accessed with random reads. This system looks like this:
      </p>
      
      $$
\begin{align*}
\text{batch view} &= function(\text{all data})\\
\text{query} &= function(\text{batch view})
\end{align*}
$$
      
      <p class="noind">In this system, you run a function on all the data to get the batch view. Then, when you want to know the value for a query,
         you run a function on that batch view. The batch view makes it possible to get the values you need from it very quickly, without
         having to scan everything in it.
      </p>
      
      <p class="noind">Because this discussion is somewhat abstract, let’s ground it with an example. Suppose you’re building a web analytics application
         (again), and you want to query the number of pageviews for a URL on any range of days. If you were computing the query as
         a function of all the data, you’d scan the dataset for pageviews for that URL within that time range, and return the count
         of those results.
      </p>
      
      <p class="noind">The batch view approach instead runs a function on all the pageviews to precompute an index from a key of [url, day] to the
         count of the number of pageviews for that URL for that day. Then, to resolve the query, you retrieve all values from that
         view for all days within that time range, and sum up the counts to get the result. This approach is shown in <a href="#ch01fig07" class="calibre4">figure 1.7</a>.
      </p>
      
      
      
      <p><b><i>Figure 1.7. <a id="ch01fig07" class="calibre4"></a>Architecture of the batch layer
      </i></b></p>
      
      
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig07.jpg" alt="" class="calibre2"/></p>
      
      
      
      <p class="noind">It should be clear that there’s something missing from this approach, as described so far. Creating the batch view is clearly
         going to be a high-latency operation, because it’s running a function on all the data you have. By the time it finishes, a
         lot of new data will have collected that’s not represented in the batch views, and the queries will be out of date by many
         hours. But let’s ignore this issue for the moment, because we’ll be able to fix it. Let’s pretend that it’s okay for queries to be out of date by a few hours and continue exploring this idea
         of precomputing a batch view by running a function on the complete dataset.
      </p></blockquote>
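
The pageview example from the excerpt can be sketched in a few lines. The dataset, function names, and dates are made up for illustration; the structure follows the two equations above, with a batch view indexed by `(url, day)`:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical master dataset: one (url, day) record per pageview.
ALL_DATA = [
    ("/home", date(2016, 1, 1)),
    ("/home", date(2016, 1, 1)),
    ("/home", date(2016, 1, 2)),
    ("/about", date(2016, 1, 2)),
    ("/home", date(2016, 1, 5)),
]


def compute_batch_view(all_data) -> Counter:
    """batch view = function(all data): index (url, day) -> pageview count."""
    return Counter(all_data)


def query_pageviews(batch_view, url: str, start: date, end: date) -> int:
    """query = function(batch view): sum the daily counts in [start, end]."""
    days = (end - start).days + 1
    return sum(
        batch_view.get((url, start + timedelta(days=i)), 0)
        for i in range(days)
    )


batch_view = compute_batch_view(ALL_DATA)
total = query_pageviews(batch_view, "/home", date(2016, 1, 1), date(2016, 1, 3))
```

The query touches one indexed entry per day in the range rather than scanning every record, which is the point of precomputing the view.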

### Batch layer

<blockquote><p class="noind">The portion of the Lambda Architecture that implements the $\text{batch view} = function(\text{all data})$ equation is called the <i class="calibre6">batch layer</i>. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset (see <a href="#ch01fig08" class="calibre4">figure 1.8</a>). The master dataset can be thought of as a very large list of records.
      </p>
      
      
      
      <p><b><i>Figure 1.8. <a id="ch01fig08" class="calibre4"></a>Batch layer
      </i></b></p>
      
      
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig08.jpg" alt="" class="calibre2"/></p>
      
      
      
      <p class="noind">The batch layer needs to be able to do two things: store an immutable, constantly growing master dataset, and compute arbitrary
         functions on that dataset. This type of processing is best done using batch-processing systems. Hadoop is the canonical example
         of a batch-processing system.</p></blockquote>

> The simplest form of the batch layer can be represented in pseudo-code like this:
```python
def run_batch_layer():
    # Loop forever: each pass rebuilds every batch view from the master dataset.
    while True:
        recompute_batch_views()
```
> The batch layer runs in an infinite loop and continuously recomputes the batch views from scratch. In reality, the batch layer is a little more involved, but we’ll come to that later.

### Serving layer
<blockquote><p class="noind">The batch layer emits batch views as the result of its functions. The next step is to load the views somewhere so that they
         can be queried. This is where the serving layer comes in. The serving layer is a specialized distributed database that loads
         in a batch view and makes it possible to do random reads on it (see <a href="#ch01fig09" class="calibre4">figure 1.9</a>). When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are
         available.
      </p>
      
      
      
      <p><b><i>Figure 1.9. <a id="ch01fig09" class="calibre4"></a>Serving layer
      </i></b></p>
      
      
      
      <p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig09.jpg" alt="" class="calibre2"/></p></blockquote>
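
A toy, single-process sketch of the serving layer's contract follows: random reads by key, with whole views swapped in when a new batch run completes. The `ServingLayer` class is hypothetical, and a real serving layer would be a distributed database rather than a dictionary:

```python
class ServingLayer:
    """Read-only store: views are replaced wholesale, never updated in place."""

    def __init__(self):
        self._view = {}

    def load(self, new_batch_view: dict) -> None:
        # Copy first, then swap the reference in a single assignment so
        # readers see either the old view or the new one, never a mix.
        self._view = dict(new_batch_view)

    def get(self, key, default=0):
        """Random read by key."""
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({("/home", "2016-01-01"): 2})
before = serving.get(("/home", "2016-01-01"))

# A later batch run produces a fresher view, which replaces the old one.
serving.load({("/home", "2016-01-01"): 2, ("/home", "2016-01-02"): 5})
after = serving.get(("/home", "2016-01-02"))
```

There is deliberately no method for updating a single key. Leaving out random writes is what keeps this layer simple.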

### Batch and serving layers satisfy almost all properties
<blockquote><ul class="calibre17">
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Robustness and fault tolerance</i>— </b>Hadoop handles failover when machines go down. The serving layer uses replication under the hood to ensure availability when
            servers go down. The batch and serving layers are also human-fault tolerant, because when a mistake is made, you can fix your
            algorithm or remove the bad data and recompute the views from scratch.
            
         </li>
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Scalability</i>— </b>Both the batch and serving layers are easily scalable. They’re both fully distributed systems, and scaling them is as easy
            as adding new machines.
            
         </li>
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Generalization</i>— </b>The architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset.
            
         </li>
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Extensibility</i>— </b>Adding a new view is as easy as adding a new function of the master dataset. Because the master dataset can contain arbitrary
            data, new types of data can be easily added. If you want to tweak a view, you don’t have to worry about supporting multiple versions of the view in the application. You can simply recompute the entire view from scratch.
            
         </li>
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Ad hoc queries</i>— </b>The batch layer supports ad hoc queries innately. All the data is conveniently available in one location.
            
         </li>
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Minimal maintenance</i>— </b>The main component to maintain in this system is Hadoop. Hadoop requires some administration knowledge, but it’s fairly straightforward
            to operate. As explained before, the serving layer databases are simple because they don’t do random writes. Because a serving
            layer database has so few moving parts, there’s lots less that can go wrong. As a consequence, it’s much less likely that
            anything <i class="calibre6">will</i> go wrong with a serving layer database, so they’re easier to maintain.
            
         </li>
         
         <li class="calibre18"><b class="calibre22"><i class="calibre6">Debuggability</i>— </b>You’ll always have the inputs and outputs of computations run on the batch layer. In a traditional database, an output can
            replace the original input—such as when incrementing a value. In the batch and serving layers, the input is the master dataset
            and the output is the views. Likewise, you have the inputs and outputs for all the intermediate steps. Having the inputs and
            outputs gives you all the information you need to debug when something goes wrong.
            
         </li>
         
      </ul></blockquote>

#### Pop Quiz

<details><summary>Which one of the properties listed above is not satisfied by the batch and serving layers? If you don't remember, take a look at the seven covered in this section before scrolling back up, and try to identify what's missing.</summary><b>Low latency reads and updates</b></details>

### Speed layer

<blockquote><p class="noind">The serving layer updates whenever the batch layer finishes precomputing a batch view. This means that the only data not represented
         in the batch view is the data that came in while the precomputation was running. All that’s left to do to have a fully realtime
         data system—that is, to have arbitrary functions computed on arbitrary data in real time—is to compensate for those last few
         hours of data. This is the purpose of the speed layer. As its name suggests, its goal is to ensure new data is represented
         in query functions as quickly as needed for the application requirements (see <a href="#ch01fig10" class="calibre4">figure 1.10</a>).
      </p><p><b><i>Figure 1.10. <a id="ch01fig10" class="calibre4"></a>Speed layer
      </i></b></p><p class="center1"><img src="https://s3-us-west-2.amazonaws.com/dsci6007/assets/01fig10.jpg" alt="" class="calibre2"/></p><p class="noind">You can think of the speed layer as being similar to the batch layer in that it produces views based on data it receives.
         One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once.
         Another big difference is that in order to achieve the smallest latencies possible, the speed layer doesn’t look at all the
         new data at once. Instead, it updates the realtime views as it receives new data instead of recomputing the views from scratch
         like the batch layer does. The speed layer does incremental computation instead of the recomputation done in the batch layer.
</p><p class="noind">We can formalize the data flow on the speed layer with the following equation:</p>

$$\text{realtime view} = function(\text{realtime view}, \text{new data})$$
      
<p class="noind">A realtime view is updated based on new data and the existing realtime view.</p>
      
<p class="noind">The Lambda Architecture in full is summarized by these three equations:</p>

$$
\begin{align*}
\text{batch view} &= function(\text{all data})\\
\text{realtime view} &= function(\text{realtime view}, \text{new data})\\
\text{query} &= function(\text{batch view}, \text{realtime view})
\end{align*}
$$
</blockquote>
Here is a pictorial representation:
![Lambda Architecture diagram](http://fr.talend.com/sites/default/files/hadoop_summit_2015_takeaway_the_lambda_architecture-picture_1.png)
<blockquote><p class="noind">The beauty of the Lambda Architecture is that once data makes it through the batch layer into the serving layer, the corresponding
         results in the realtime views <i class="calibre6">are no longer needed</i>. This means you can discard pieces of the realtime view as they’re no longer needed. This is a wonderful result, because
         the speed layer is far more complex than the batch and serving layers. This property of the Lambda Architecture is called
         <i class="calibre6">complexity isolation,</i> meaning that complexity is pushed into a layer whose results are only temporary. If anything ever goes wrong, you can discard
         the state for the entire speed layer, and everything will be back to normal within a few hours.
      </p></blockquote>
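
The three equations, along with the idea of discarding realtime state once the batch layer catches up, can be sketched as follows. This is a simplified illustration with hypothetical names in which each record is just a URL and every view is a per-URL count:

```python
from collections import Counter


def compute_batch_view(all_data) -> Counter:
    """batch view = function(all data)"""
    return Counter(all_data)


def update_realtime_view(realtime_view: Counter, new_data) -> Counter:
    """realtime view = function(realtime view, new data)"""
    updated = realtime_view.copy()
    updated.update(new_data)  # incremental: only the new events are touched
    return updated


def query(batch_view: Counter, realtime_view: Counter, url: str) -> int:
    """query = function(batch view, realtime view)"""
    return batch_view[url] + realtime_view[url]


master = ["/home", "/home", "/about"]  # data the last batch run saw
batch_view = compute_batch_view(master)

# New events arrive while the next batch run is still in progress.
realtime_view = Counter()
realtime_view = update_realtime_view(realtime_view, ["/home"])
realtime_view = update_realtime_view(realtime_view, ["/home", "/about"])
live_count = query(batch_view, realtime_view, "/home")

# Once the next batch run has absorbed those events, the realtime
# view covering them can simply be thrown away.
master += ["/home", "/home", "/about"]
batch_view = compute_batch_view(master)
realtime_view = Counter()
count_after_batch = query(batch_view, realtime_view, "/home")
```

The answer is the same before and after the realtime view is discarded, which is what makes the speed layer's state disposable.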

#### Question
Is a speed layer always necessary? Under what circumstances would you want it?

#### Question

Give at least two examples of when you would want to use Lambda Architecture, 
and two examples of when you wouldn't.

Recent trends in technology
-------------------------------------------------
* CPUs aren’t getting faster
* Elastic clouds
* Vibrant open source ecosystem for Big Data
    * Batch computation systems 
    * Serialization frameworks
    * Random-access NoSQL databases
    * Messaging/queuing systems
    * Realtime computation systems

Data storage on the batch layer
===============================

By the end of this lesson, you should be able to:

- Determine the requirements for storing the master dataset
- See why distributed filesystems are a natural fit for storing a master dataset
- See how the batch layer storage for the SuperWebAnalytics.com project maps to distributed filesystems

What are some requirements you might have for your master dataset? Try to think of (at least) one requirement you would make for writes and one for reads that would help you determine which technology to pick. 

<details>
  <summary><i>Write your answer above before opening this.</i></summary>
<blockquote><table>
<tr><th>Operation</th>
    <th>Requisite</th>
    <th>Discussion</th></tr>
<tr><td rowspan=2>Write</td>
    <td>Efficient appends of new data</td>
    <td>The only write operation is to add new pieces of data, so it must be easy and efficient to append a new set of data objects to the master dataset.</td></tr>
<tr><td>Scalable storage</td>
    <td>The batch layer stores the complete dataset—potentially terabytes or petabytes of data. It must therefore be easy to scale the storage as your dataset grows.</td></tr>
<tr><td>Read</td>
    <td>Support for parallel processing</td>
    <td>Constructing the batch views requires computing functions on the entire master dataset. The batch storage must consequently support parallel processing to handle large amounts of data in a scalable manner.</td></tr>
<tr><td rowspan=2>Both</td>
    <td>Tunable storage and processing costs</td>
    <td>Storage costs money. You may choose to compress your data to help minimize your expenses, but decompressing your data during computations can affect performance. The batch layer should give you the flexibility to decide how to store and compress your data to suit your specific needs.</td></tr>
<tr><td>Enforceable immutability</td>
    <td>It’s critical that you’re able to enforce the immutability property on your master dataset. Of course, computers by their very nature are mutable, so there will always be a way to mutate the data you’re storing. The best you can do is put checks in place to disallow mutable operations. These checks should prevent bugs or other random errors from trampling over existing data.</td></tr>
</table></blockquote></details>

Before continuing: which of these were not accounted for in your answer above? Rewrite those in your own words.

## Choosing a storage solution for the batch layer
<blockquote><p class="noind">With the requirements checklist in hand, you can now consider options for batch layer storage. With such loose requirements—not
         even needing random access to the data—it seems like you could use pretty much any distributed database for the master dataset.
         So let’s first consider the viability of using a key/value store, the most common type of distributed database, for the master
         dataset.
      </p></blockquote>

### Using a key/value store for the master dataset
<blockquote><p class="noind">If you’re storing a master dataset on a key/value store, the first thing you have
         to figure out is what the keys should be and what the values should be.
      </p>
      
      <p class="noind">What a value should be is obvious—it’s a piece of data you want to store—but what should a key be? There’s no natural key
         in the data model, nor is one necessary because <a id="iddle1047" class="calibre4"></a><a id="iddle1229" class="calibre4"></a><a id="iddle1265" class="calibre4"></a>the data is meant to be consumed in bulk. So you immediately hit an impedance mismatch between the data model and how key/value
         stores work. The only really viable idea is to generate a UUID to use as a key.
      </p>
      
      <p class="noind">But this is only the start of the problems with using key/value stores for a master dataset. Because key/value stores need
         fine-grained access to key/value pairs to do random reads and writes, you can’t compress multiple key/value pairs together.
         So you’re severely limited in tuning the trade-off between storage costs and processing costs.
      </p>
      
      <p class="noind">Key/value stores are meant to be used as mutable stores, which is a problem if enforcing immutability is so crucial for the
         master dataset. Unless you modify the code of the key/value store you’re using, you typically can’t disable the ability to
         modify existing key/value pairs.
      </p>
      
      <p class="noind">The biggest problem, though, is that a key/value store has a lot of things you don’t need: random reads, random writes, and
         all the machinery behind making those work. In fact, most of the implementation of a key/value store is dedicated to these
         features you don’t need at all. This means the tool is enormously more complex than it needs to be to meet your requirements,
         making it much more likely you’ll have a problem with it. Additionally, the key/value store indexes your data and provides
         unneeded services, which will increase your storage costs and lower your performance when reading and writing data.
      </p></blockquote>
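
The mismatch can be made concrete with a small sketch. It is only an illustration under stated assumptions: a plain Python `dict` stands in for a key/value store, the `put` helper is hypothetical, and `zlib` stands in for whatever compression a real store might apply per value.

```python
import json
import uuid
import zlib

# Assumption: a plain dict stands in for a distributed key/value store.
store = {}


def put(value: bytes) -> str:
    """Store one record under a generated UUID, since the data has no natural key."""
    key = uuid.uuid4().hex
    # Each value must be compressed on its own so it can be read back individually.
    store[key] = zlib.compress(value)
    return key


records = [
    json.dumps({"user_id": i, "url": "/home", "pageviews": 1}).encode("utf-8")
    for i in range(1000)
]
keys = [put(record) for record in records]

# Storage cost when every pair is compressed separately ...
per_pair_bytes = sum(len(value) for value in store.values())
# ... versus compressing the same records together as one block.
bulk_bytes = len(zlib.compress(b"\n".join(records)))
print(per_pair_bytes, "bytes per-pair vs", bulk_bytes, "bytes in bulk")

# Nothing in the store's interface prevents overwriting an existing record.
store[keys[0]] = zlib.compress(b"corrupted")
```

The per-pair total comes out larger than the bulk figure because similar records cannot share a compression context, and the final line shows how easily an existing record can be replaced.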

### Distributed filesystems
<blockquote>Unlike a key/value store, a filesystem
         gives you exactly what you need and no more, while also not limiting your ability to tune storage cost versus processing cost.
         On top of that, filesystems implement fine-grained permissions systems, which are perfect for enforcing immutability.</blockquote><blockquote><p class="noind">Distributed filesystems vary in the kinds of operations they permit. Some distributed filesystems let you modify existing
         files, and others don’t. Some allow you to append to existing files, and some don’t have that feature. In this section we’ll
         look at how you can store a master dataset on a distributed filesystem with only the most bare-boned of features, where a
         file can’t be modified at all after being created.
      </p>
      
      <p class="noind">Clearly, with unmodifiable files you can’t store the entire master dataset in a single file. What you can do instead is spread
         the master dataset among many files, and store all those files in the same folder. Each file would contain many serialized
         data objects.</p>
      
      
      
      <p><b><i>Spreading the master dataset throughout many files
      </i></b></p>
      
      <p class="center1"><img src="images/04fig04_alt.jpg" alt="" class="calibre2"/></p></blockquote><blockquote><p class="noind">To append to the master dataset, you simply add a new file containing the new data objects to the master dataset folder.</p><p><b><i>Appending to the master dataset by uploading a new file with new data records
      </i></b></p>
      
      <p class="center1"><img src="images/04fig05_alt.jpg" alt="" class="calibre2"/></p></blockquote><blockquote><p class="noind">Let’s now go over the requirements for master dataset storage and verify that a distributed filesystem matches those requirements.</p>
      
      <table>
<caption>How distributed filesystems meet the storage requirement checklist</caption>
<tr><th>Operation</th>
    <th>Requisite</th>
    <th>Discussion</th></tr>
<tr><td rowspan=2>Write</td>
    <td>Efficient appends of new data</td>
    <td>Appending new data is as simple as adding a new file to the folder containing the master dataset.</td></tr>
<tr><td>Scalable storage</td>
    <td>Distributed filesystems evenly distribute the storage across a cluster of machines. You increase storage space and I/O throughput by adding more machines.</td></tr>
<tr><td>Read</td>
    <td>Support for parallel processing</td>
    <td>Distributed filesystems spread all data across many machines, making it possible to parallelize the processing across many machines. Distributed filesystems typically integrate with computation frameworks like MapReduce to make that processing easy to do (discussed in upcoming lessons).</td></tr>
<tr><td rowspan=2>Both</td>
    <td>Tunable storage and processing costs</td>
    <td>Just like regular filesystems, you have full control over how you store your data units within the files. You choose the file format for your data as well as the level of compression. You’re free to do individual record compression, block-level compression, or neither.</td></tr>
<tr><td>Enforceable immutability</td>
    <td>Distributed filesystems typically have the same permissions systems you’re used to using in regular filesystems. To enforce immutability, you can disable the ability to modify or delete files in the master dataset folder for the user with which your application runs. This redundant check will protect your previously existing data against bugs or other human mistakes.</td></tr>
</table></blockquote>
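
The append-by-adding-a-file pattern can be sketched on an ordinary local filesystem. This is a minimal stand-in rather than a distributed filesystem client: the helper names `append_to_master` and `read_master`, the `part-*.jsonl.gz` naming, and the JSON-lines format are all assumptions made for illustration.

```python
import gzip
import json
import os
import stat
import tempfile
import uuid
from pathlib import Path


def append_to_master(folder: Path, records: list) -> Path:
    """Append records by writing one brand-new file; existing files are never touched."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"part-{uuid.uuid4().hex}.jsonl.gz"
    # Block-level compression: all records in the file are compressed together.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    # Drop write permission as a redundant guard against accidental modification.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path


def read_master(folder: Path) -> list:
    """Read every record from every file in the master dataset folder."""
    records = []
    for path in sorted(folder.glob("part-*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f)
    return records


master = Path(tempfile.mkdtemp()) / "master"
append_to_master(master, [{"user": 1, "url": "/a"}, {"user": 2, "url": "/b"}])
append_to_master(master, [{"user": 1, "url": "/c"}])
all_records = read_master(master)
print(len(list(master.iterdir())), "files,", len(all_records), "records")
```

Each call adds one file and leaves earlier files alone, so the write path never needs to modify existing data. The choice of gzip here is just one point on the storage-versus-processing trade-off.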

Vertical partitioning
------------------------------------------------------
<blockquote><p class="noind">Although the batch layer is built to run functions on the entire dataset, many computations don’t require looking at all the
         data. For example, you may have a computation that only requires information collected during the past two weeks. The batch
         storage should allow you to partition your data so that a function only accesses data relevant to its computation.</p></blockquote><blockquote><p><b><i>Partitioning scheme for login data. By sorting information for each date in separate folders, a function can select
         only the folders containing data relevant to its computation.
      </i></b></p><p class="center1"><img src="images/04fig06_alt.jpg" alt="" class="calibre2"/></p></blockquote>
Similarly, you may want to store different dimensions of your data separately.
<blockquote><p class="noind">This process
         is called <i class="calibre6">vertical partitioning</i>, and it can greatly contribute to making the batch layer more efficient. While it’s not strictly necessary for the batch
         layer, as the batch layer is capable of looking at all the data at once and filtering out what it doesn’t need, vertical partitioning
         enables large performance gains, so it’s important to know how to use the technique.
      </p>
      
      <p class="noind">Vertically partitioning data on a distributed filesystem can be done by sorting your data into separate folders.</p></blockquote><blockquote><p><b><i>If the target dataset is vertically partitioned, appending data to it is not as simple as just adding files to the dataset
         folder.
      </i></b></p>
      
      <p class="center1"><img src="images/04fig08_alt.jpg" alt="" class="calibre2"/></p></blockquote><blockquote><p class="noind">Now if you only want to look at a particular subset of your dataset, you can just look at the files in those particular folders and ignore the other files.</p></blockquote>

Separating data this way can be done with relative ease using topics in Kafka, for example by publishing each kind of record to its own topic.
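
On a filesystem, the folder-based scheme described above might look like the following sketch. It uses a local temporary directory in place of a distributed filesystem, and the `logins/<date>/` layout, the `batch_id` file naming, and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def append_partitioned(root: Path, records: list, batch_id: str) -> None:
    """Route each record to the folder for its date, then add a new file there."""
    by_date = {}
    for record in records:
        by_date.setdefault(record["date"], []).append(record)
    for date, group in by_date.items():
        folder = root / "logins" / date
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / f"{batch_id}.jsonl", "w", encoding="utf-8") as f:
            for record in group:
                f.write(json.dumps(record) + "\n")


def read_dates(root: Path, dates: list) -> list:
    """Read only the folders for the requested dates; other folders are never opened."""
    records = []
    for date in dates:
        folder = root / "logins" / date
        if not folder.is_dir():
            continue
        for path in sorted(folder.glob("*.jsonl")):
            with open(path, encoding="utf-8") as f:
                records.extend(json.loads(line) for line in f)
    return records


root = Path(tempfile.mkdtemp())
append_partitioned(root, [
    {"date": "2012-10-25", "user": "alex"},
    {"date": "2012-10-25", "user": "bob"},
    {"date": "2012-10-26", "user": "charlie"},
], batch_id="batch-0001")
append_partitioned(root, [{"date": "2012-10-26", "user": "dave"}], batch_id="batch-0002")

recent = read_dates(root, ["2012-10-26"])
print([r["user"] for r in recent])
```

Note that appending now involves a routing step: new records have to be split by partition before they are written, which is the extra work the figure caption above alludes to.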

Batch Layer
===============================

By the end of this lesson, you should be able to:

- Describe and employ the four communication patterns of distributed processing
- Develop and implement a DAG for batch processing

<blockquote><p class="noind">The goal of a data system is to answer arbitrary questions about your data. Any question you could ask of your dataset can
         be implemented as a function that takes all of your data as input. Ideally, you could run these functions on the fly whenever
         you query your dataset. Unfortunately, a function that uses your entire dataset as input will take a very long time to run.
         You need a different strategy if you want your queries answered quickly.
      </p>
      
      <p class="noind">In the Lambda Architecture, the batch layer precomputes the master dataset into batch views so that queries can be resolved
         with low latency. This requires striking a balance between what will be precomputed and what will be computed at execution
         time to complete the query. By doing a little bit of computation on the fly to complete queries, you save yourself from needing
         to precompute absurdly large <a id="iddle1087" class="calibre4"></a><a id="iddle1486" class="calibre4"></a>batch views. The key is to precompute just enough information so that the query can be completed quickly.
      </p></blockquote>

Computing on the batch layer
-------------------------------------------------
<blockquote><p class="noind">The batch layer runs functions over the master dataset to precompute intermediate data called <i class="calibre6">batch views</i>. The batch views are loaded by the serving layer, which indexes them to allow rapid access to that data. The speed layer
         compensates for the high latency of the batch layer by providing low-latency updates using data that has yet to be precomputed
         into a batch view. Queries are then satisfied by processing data from the serving layer views and the speed layer views, and
         merging the results.
      </p><p class="noind">A linchpin of the architecture is that for <i class="calibre6">any</i> query, it’s possible to precompute the data in the batch layer to expedite its processing by the serving layer. These precomputations
         over the master dataset take time, but you should view the high latency of the batch layer as an opportunity to do deep analyses
         of the data and connect diverse pieces of data together. Remember, low-latency query serving is achieved through other parts
         of the Lambda Architecture.
      </p><p class="noind">A naive strategy for computing on the batch layer would be to precompute all possible queries and cache the results in the
         serving layer.</p>
         <p><b><i>Precomputing a query by running a function on the master dataset directly
      </i></b></p><p class="center1"><img src="images/06fig02.jpg" alt="" class="calibre2"/></p></blockquote><blockquote><p class="noind">Unfortunately you can’t always precompute <i class="calibre6">everything</i>. Consider the pageviews-over-time query as an example. If you wanted to precompute every potential query, you’d need to determine
         the answer for every possible range of hours for every URL. But the number of ranges of hours within a given time frame can
         be huge. In a one-year period, there are approximately 38 million distinct hour ranges. To precompute the query, you’d need
         to precompute and index 38 million values <i class="calibre6">for every URL</i>. This is obviously infeasible and an unworkable solution.
      </p><p class="noind">Instead, you can precompute intermediate results and then use these results to complete queries on the fly.</p>
      <p><b><i>Splitting a query into precomputation and on-the-fly components
      </i></b></p>
      <p class="center1"><img src="images/06fig03_alt.jpg" alt="" class="calibre2"/></p>
      </blockquote><blockquote><p class="noind">For the pageviews-over-time query, you can precompute the number of pageviews for every hour for each URL.</p>
      <p><b><i>Computing the number of pageviews by querying an indexed batch view
      </i></b></p>
      <p class="center1"><img src="images/06fig04_alt.jpg" alt="" class="calibre2"/></p>
      <p class="noind">To complete a query, you retrieve from the index the number of pageviews for every hour in the range, and sum the results.
         For a single year, you only need to precompute and index 8,760 values per URL (365 days, 24 hours per day). This is certainly
         a more manageable number.</p></blockquote>
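
The split between precomputation and on-the-fly work can be sketched in a few lines. This is an illustrative toy, assuming pageviews arrive as `(url, hour_bucket)` pairs with integer hour buckets. The function names are made up for this example, and an in-memory `Counter` stands in for an indexed batch view.

```python
from collections import Counter

HOURS_PER_YEAR = 365 * 24  # hourly buckets to precompute per URL

# Contiguous hour ranges (start <= end) in one year: n * (n + 1) / 2.
RANGES_PER_YEAR = HOURS_PER_YEAR * (HOURS_PER_YEAR + 1) // 2


def build_hourly_view(pageviews):
    """Batch precomputation: count pageviews per (url, hour bucket)."""
    view = Counter()
    for url, hour in pageviews:
        view[(url, hour)] += 1
    return view


def pageviews_over_time(view, url, start_hour, end_hour):
    """On-the-fly part: sum the precomputed hourly counts in [start_hour, end_hour]."""
    return sum(view[(url, hour)] for hour in range(start_hour, end_hour + 1))


pageviews = [("/a", 0), ("/a", 0), ("/a", 1), ("/a", 5), ("/b", 1)]
view = build_hourly_view(pageviews)

print(HOURS_PER_YEAR, "hourly values vs", RANGES_PER_YEAR, "possible ranges per URL")
print(pageviews_over_time(view, "/a", 0, 1))
print(pageviews_over_time(view, "/a", 0, 5))
```

The number of buckets grows linearly with the time span while the number of ranges grows quadratically, which is why indexing per-hour counts and summing at query time is the more workable split.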

Recomputation algorithms vs. incremental algorithms
-------------------------------------------------
<blockquote><p class="noind">Because your master dataset is continually growing, you must have a strategy for updating your batch views when new data becomes
         available. You could choose a <i class="calibre6">recomputation</i> algorithm, throwing away the old batch views and recomputing functions over the entire master dataset. Alternatively, an
         <i class="calibre6">incremental</i> algorithm will update the views directly when new data arrives.
      </p>
      
      <p class="noind">As a basic example, consider a batch view containing the total number of records in your master dataset. A recomputation algorithm
         would update the count by first appending the new data to the master dataset and then counting all the records from scratch.</p>
      
      
      
      <p><b><i>A recomputing algorithm to update the number of records in the master dataset. New data is appended to the master dataset,
         and then all records are counted.
      </i></b></p>
      
      <p class="center1"><img src="images/06fig05_alt.jpg" alt="" class="calibre2"/></p>
      
      
      <p class="noind"><a id="iddle1100" class="calibre4"></a><a id="iddle1374" class="calibre4"></a><a id="iddle1508" class="calibre4"></a><a id="iddle1572" class="calibre4"></a>An incremental algorithm, on the other hand, would count the number of new data records and add it to the existing count.</p>
      
      
      
      <p><b><i>An incremental algorithm to update the number of records in the master dataset. Only the new dataset is counted, with the
         total used to update the batch view directly.
      </i></b></p>
      
      <p class="center1"><img src="images/06fig06_alt.jpg" alt="" class="calibre2"/></p>
      
      
      <p class="noind">You might be wondering why you would ever use a recomputation algorithm when you can use a vastly more efficient incremental
         algorithm instead. But efficiency is not the only factor to be considered. The key trade-offs between the two approaches are
         performance, human-fault tolerance, and the generality of the algorithm. We’ll discuss both types of algorithms in regard
         to each of these issues. You’ll discover that although incremental approaches can provide additional efficiency, you <i class="calibre6">must</i> also have recomputation versions of your algorithms.
      </p></blockquote>
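
For the record-count example, the two styles can be sketched side by side. The function names and the in-memory list standing in for the master dataset are assumptions for illustration only.

```python
def recompute_count(master_dataset):
    """Recomputation: discard the old view and count every record from scratch."""
    return sum(1 for _ in master_dataset)


def incremental_count(current_count, new_records):
    """Incremental: count only the new records and add them to the existing view."""
    return current_count + sum(1 for _ in new_records)


master_dataset = ["r1", "r2", "r3"]
batch_view = recompute_count(master_dataset)

new_records = ["r4", "r5"]
master_dataset.extend(new_records)  # new data is appended to the master dataset

recomputed = recompute_count(master_dataset)          # touches all 5 records
incremented = incremental_count(batch_view, new_records)  # touches only 2 records
print(recomputed, incremented)
```

Both paths arrive at the same count. The difference lies in how much data each one has to read, and in whether the result depends on the previous view being correct.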

### Performance
<blockquote><p class="noind">There are two aspects to the performance of a batch-layer algorithm: the amount of resources required to update a batch view
         with new data, and the size of the batch views produced.
      </p>
      <p class="noind">An incremental algorithm almost always uses significantly less resources to update a view because it uses new data and the
         current state of the batch view to perform an update. For a task such as computing pageviews over time, the view will be significantly
         smaller than the master dataset because of the aggregation. A recomputation algorithm looks at the entire master dataset,
         so the amount of resources needed for an update can be multiple orders of magnitude higher than an incremental algorithm.
         But the size of the batch view for an incremental algorithm can be significantly larger than the corresponding batch view
         for a recomputation algorithm. This is because the view needs to be formulated in such a way that it can be incrementally
         updated.
      </p></blockquote>
      
> Consider a query that computes the number of unique visitors for each URL.

<blockquote><p><b><i>A comparison between a recomputation view and an incremental view for determining the number of unique visitors per URL
      </i></b></p>
      <p class="center1"><img src="images/06fig07_alt.jpg" alt="" class="calibre2"/></p>
      <p class="noind">A recomputation view only requires a map from the URL to the unique count. In contrast, an incremental algorithm only examines
         the new pageviews, so its view must contain the full set of visitors for each URL so it can determine which records in the
         new data correspond to return visits. As such, the incremental view could potentially be as large as the master dataset!
      </p>
      
      <p class="noind">The batch view generated by an incremental algorithm isn’t always this large, but it can be far larger than the corresponding
         recomputation-based view.
      </p></blockquote>
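
The difference in view size can be seen in a small sketch of the unique-visitors query. Pageviews are assumed to be `(url, visitor_id)` pairs, and both function names are hypothetical.

```python
def recompute_uniques(pageviews):
    """Recomputation view: only url -> unique visitor count is kept."""
    visitors = {}
    for url, visitor in pageviews:
        visitors.setdefault(url, set()).add(visitor)
    return {url: len(ids) for url, ids in visitors.items()}


def incremental_uniques(view, new_pageviews):
    """Incremental view: url -> full set of visitor ids, updated in place.

    The set has to be kept so that return visits in new data are not counted twice.
    """
    for url, visitor in new_pageviews:
        view.setdefault(url, set()).add(visitor)
    return view


first_batch = [("/a", "u1"), ("/a", "u2"), ("/b", "u1")]
second_batch = [("/a", "u1"), ("/a", "u3")]  # u1 is a return visitor to /a

incremental_view = incremental_uniques({}, first_batch)
incremental_view = incremental_uniques(incremental_view, second_batch)

recomputed_view = recompute_uniques(first_batch + second_batch)

print(recomputed_view)
print({url: len(ids) for url, ids in incremental_view.items()})
```

The recomputation view stores one integer per URL, while the incremental view stores one entry per distinct visitor per URL, so it grows with the data rather than with the number of URLs.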

### Human-fault tolerance
<blockquote><p class="noind">The lifetime of a data system is extremely long, and bugs can and will be deployed to production during that time period.
         You therefore must consider how your batch update algorithm will tolerate such mistakes. In this regard, recomputation algorithms
         are inherently human-fault tolerant, whereas with an incremental algorithm, human mistakes can cause serious problems.
      </p></blockquote>
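
One way to picture the difference is to imagine a bug that double-counts pageviews being deployed and later fixed. The sketch below is a contrived scenario with made-up function names, not a description of any particular system.

```python
def count_pageviews(master_dataset):
    """Recomputation: rebuild the whole view from the master dataset."""
    counts = {}
    for url in master_dataset:
        counts[url] = counts.get(url, 0) + 1
    return counts


def buggy_incremental_update(view, new_records):
    """A deployed bug: every new pageview is counted twice."""
    for url in new_records:
        view[url] = view.get(url, 0) + 2


def fixed_incremental_update(view, new_records):
    """The corrected update, deployed after the bug is noticed."""
    for url in new_records:
        view[url] = view.get(url, 0) + 1


master_dataset = []
incremental_view = {}

batch_1 = ["/a", "/b"]
master_dataset.extend(batch_1)
buggy_incremental_update(incremental_view, batch_1)  # the view is now wrong

batch_2 = ["/a"]
master_dataset.extend(batch_2)
fixed_incremental_update(incremental_view, batch_2)  # fix deployed, old error remains

recomputed_view = count_pageviews(master_dataset)
print("incremental:", incremental_view)
print("recomputed: ", recomputed_view)
```

After the fix, the incremental view still carries the earlier error forward, whereas recomputing from the untouched master dataset yields correct counts without any special repair step.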

### Generality of the algorithms

> Although incremental algorithms can be faster to run, they must often be tailored to address the problem at hand.
>
> . . .
>
> Because a recomputation algorithm continually rebuilds the entire batch view, the structure of the batch view and the complexity of the on-the-fly component are both simpler, leading to a more general algorithm.

### Choosing a style of algorithm

> The key takeaway is that you must *always* have recomputation versions of your algorithms. This is the only way to ensure human-fault tolerance for your system, and human-fault tolerance is a non-negotiable requirement for a robust system. Additionally, you have the option to add incremental versions of your algorithms to make them more resource-efficient.

> ***Comparing recomputation and incremental algorithms***

>|                       | Recomputation algorithms | Incremental algorithms |
|:--------------------- |:------------------------ |:---------------------- |
| Performance           | Requires computational effort to process the entire master dataset | Requires less computational resources but may generate much larger batch views |
| Human-fault tolerance | Extremely tolerant of human errors because the batch views are continually rebuilt | Doesn’t facilitate repairing errors in the batch views; repairs are ad hoc and may require estimates |
| Generality            | Complexity of the algorithm is addressed during precomputation, resulting in simple batch views and low-latency, on-the-fly processing | Requires special tailoring; may shift complexity to on-the-fly query processing |
| Conclusion            | Essential to supporting a robust data-processing system | Can increase the efficiency of your system, but only as a supplement to recomputation algorithms |