<h1 class="tocheading">Speed Layer</h1>
> The speed layer compensates for the high latency of the batch layer to enable up-to-date results for queries.  

<div id="toc"></div>

Objectives
===============================================================

By the end of today's lesson, you will be able to:  

* Enqueue pageviews in Kafka
* Dedupe and normalize using Spark Streaming
* Store pageviews over time in HBase
* Expire data in HBase as appropriate

Realtime views
===============================================================

> This section covers:
>  
* The theoretical model of the speed layer
* How the batch layer eases the responsibilities of the speed layer
* Using random-write databases for realtime views
* The CAP theorem and its implications
* The challenges of incremental computation
* Expiring data from the speed layer

![Figure 12.1](images/12fig01_alt.jpg)

> **The speed layer allows the Lambda Architecture to serve low-latency queries over up-to-date data.**

Computing realtime views
---------------------------------------------------------------------

### Strategy: realtime view = function(recent data)

![Figure 12.2](images/12fig02_alt.jpg)

### Incremental strategy: realtime view = function(new data, previous realtime view)

![Figure 12.3](images/12fig03_alt.jpg)
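
As a rough sketch of the two strategies, assume pageviews are plain `(url, hour)` tuples and the realtime view is an in-memory `Counter`. The function names `recompute_view` and `update_view` are hypothetical, chosen only for illustration.

```python
from collections import Counter

def recompute_view(recent_pageviews):
    """Non-incremental: rebuild the whole view from all recent data."""
    return Counter(recent_pageviews)

def update_view(previous_view, new_pageviews):
    """Incremental: fold only the new data into the previous view."""
    view = Counter(previous_view)   # copy so the previous view is untouched
    view.update(new_pageviews)
    return view

recent = [("foo.com/about", 13), ("foo.com/about", 13), ("foo.com/blog", 14)]
new = [("foo.com/about", 13)]

full = recompute_view(recent + new)
incremental = update_view(recompute_view(recent), new)
```

Both strategies arrive at the same view; the incremental one only touches the new data, at the price of needing a store that supports in-place updates.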


Storing realtime views
---------------------------------------------------------------------
> 
* *Random reads*: A realtime view should support fast random reads to answer queries quickly. This means the data it contains must be indexed.
* *Random writes*: To support incremental algorithms, it must also be possible to modify a realtime view with low latency.
* *Scalability*: As with the serving layer views, the realtime views should scale with the amount of data they store and the read/write rates required by the application. Typically this implies that realtime views can be distributed across many machines.
* *Fault tolerance*: If a disk or a machine crashes, a realtime view should continue to function normally. Fault tolerance is accomplished by replicating data across machines so there are backups should a single machine fail.

### Eventual accuracy

> Because all data is eventually represented in the batch and serving layer views, any approximations you make in the speed layer are continually corrected.

### Amount of state stored in the speed layer
> Random-write databases are more complex than the read-only views of the serving layer, so the less state the speed layer has to store, the better. Two sources of that complexity:
>  
* *Online compaction*: As a read/write database receives updates, parts of the disk index become unused, wasted space. Periodically the database must perform compaction to reclaim space. Compaction is a resource-intensive process and could potentially starve the machine of resources needed to rapidly serve queries. Improper management of compaction can cause a cascading failure of the entire cluster.
* *Concurrency*: A read/write database can potentially receive many reads or writes for the same value at the same time. It therefore needs to coordinate these reads and writes to prevent returning stale or inconsistent values. Sharing mutable state across threads is a notoriously complex problem, and control strategies such as locking are notoriously bug-prone.

Challenges of incremental computation
---------------------------------------------------------------------

### Validity of the CAP theorem

![Figure 12.4](images/12fig04_alt.jpg)

**Replicas can diverge if updates are allowed under partitions.**

### The complex interaction between the CAP theorem and incremental algorithms

**conflict-free replicated data types (*CRDT*s)**

#### A G-Counter is a grow-only counter (or [`accumulator`](http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka)) where a replica only increments its assigned counter. 

![Figure 12.5](images/12fig05_alt.jpg)

**The overall value of the accumulator is the sum of the replica counts.**

#### Merging G-Counters

![Figure 12.6](images/12fig06_alt.jpg)
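
A minimal sketch of a G-Counter, assuming each replica is identified by a string and that merging takes the per-replica maximum (the usual rule for this CRDT). The class name `GCounter` and its methods are illustrative, not part of any particular library.

```python
class GCounter:
    """Grow-only counter: one slot per replica, each slot only ever increases."""

    def __init__(self, replica_id, counts=None):
        self.replica_id = replica_id
        self.counts = dict(counts or {})   # replica id -> count seen so far

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        # A replica only ever touches its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        # The overall value is the sum of the replica counts.
        return sum(self.counts.values())

    def merge(self, other):
        # Per-replica max makes merging commutative, associative and idempotent.
        merged = dict(self.counts)
        for rid, count in other.counts.items():
            merged[rid] = max(merged.get(rid, 0), count)
        return GCounter(self.replica_id, merged)

a, b = GCounter("A"), GCounter("B")
a.increment(3)          # updates accepted on replica A during a partition
b.increment(2)          # updates accepted on replica B during the same partition
merged = a.merge(b)     # after the partition heals
merged_again = merged.merge(b)
```

Because merging is idempotent, applying the same replica state twice (as in `merged_again`) does not inflate the total.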

Asynchronous versus synchronous updates
---------------------------------------------------------------------

### A simple speed layer architecture using synchronous updates

![Figure 12.7](images/12fig07.jpg)

### Asynchronous updates provide higher throughput and readily handle variable loads.

![Figure 12.8](images/12fig08_alt.jpg)
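
The asynchronous pattern can be sketched with an in-process queue standing in for a real queue server and a `Counter` standing in for the realtime view database. Both stand-ins, the `submit` and `worker` names, and the use of `None` as a shutdown sentinel are assumptions made for this example.

```python
import queue
import threading
from collections import Counter

view = Counter()          # stand-in for the realtime view database
events = queue.Queue()    # buffer between clients and the stream processor

def submit(url):
    """Client side: enqueue and return immediately, without waiting on the view."""
    events.put(url)

def worker(batch_size=100):
    """Processor side: drain events in batches and apply one update per batch."""
    done = False
    while not done:
        batch = [events.get()]               # block until at least one event arrives
        while len(batch) < batch_size:
            try:
                batch.append(events.get_nowait())
            except queue.Empty:
                break
        if None in batch:                    # None is the shutdown sentinel
            done = True
        view.update(url for url in batch if url is not None)

t = threading.Thread(target=worker)
t.start()
for _ in range(250):
    submit("foo.com/about")
submit(None)
t.join()
```

The queue absorbs bursts of traffic, and batching the writes lets the worker trade a little latency for fewer database operations.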

Expiring realtime views
---------------------------------------------------------------------

#### The state of the serving and speed layer views at the end of the first batch computation run:

![Figure 12.9](images/12fig09.jpg)

#### A portion of the realtime views can be expired after the second run completes:

![Figure 12.10](images/12fig10.jpg)

#### The serving and speed layer views immediately before the completion of the third batch computation run:

![Figure 12.11](images/12fig11_alt.jpg)

#### Alternating clearing between two different sets of realtime views guarantees one set always contains the appropriate data for the speed layer:

![Figure 12.12](images/12fig12_alt.jpg)
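
One way to sketch the alternating scheme: keep two copies of the realtime view, write every update to both, and on each batch run completion switch reads to the copy that was cleared one run ago while clearing the other. The class `AlternatingRealtimeViews` and its method names are hypothetical, and the sketch assumes batch runs execute back to back.

```python
from collections import Counter

class AlternatingRealtimeViews:
    def __init__(self):
        self.views = [Counter(), Counter()]
        self.active = 0                      # index of the view that serves reads

    def update(self, key, amount=1):
        # Every update goes to both views.
        for view in self.views:
            view[key] += amount

    def on_batch_run_complete(self):
        # The view read from so far now holds data the serving layer has absorbed.
        stale = self.active
        self.active = 1 - self.active        # switch reads to the view with less data
        self.views[stale].clear()            # it starts accumulating from scratch

    def read(self, key):
        return self.views[self.active][key]

v = AlternatingRealtimeViews()
v.update("foo.com", 2)          # arrives while batch run 1 is in progress
v.on_batch_run_complete()       # run 1 ends; it did not include these 2 pageviews
after_run_1 = v.read("foo.com")
v.update("foo.com", 3)          # arrives while batch run 2 is in progress
v.on_batch_run_complete()       # run 2 ends; it has absorbed the first 2 pageviews
after_run_2 = v.read("foo.com")
```

The cost of this approach is storing the realtime views twice, which is usually acceptable because the speed layer holds only a small window of data.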

Queueing and stream processing
===============================================================

**To implement asynchronous processing without queues, a client submits an event without monitoring whether its processing is successful.**

![Figure 14.1](images/14fig01.jpg)

Queueing
---------------------------------------------------------------------
---------------------------------------------------------------------

### Single-consumer queue servers

![Figure 14.2](images/14fig02.jpg)

**Multiple applications sharing a single queue consumer**

---------------------------------------------------------------------

### Multi-consumer queues

![Figure 14.3](images/14fig03_alt.jpg)

**With a multi-consumer queue, applications request specific items from the queue and are responsible for tracking the successful processing of each event.**
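
A small sketch of the multi-consumer idea, modeled as an append-only log where each consumer keeps its own offset (this is also roughly how Kafka consumers behave). The classes `MultiConsumerQueue` and `Consumer` are illustrative and not a real client API.

```python
class MultiConsumerQueue:
    """Append-only log: events are never removed when they are read."""

    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)
        return len(self._log) - 1            # offset of the new event

    def read(self, offset, max_items=10):
        return self._log[offset:offset + max_items]

class Consumer:
    """Each application tracks which events it has successfully processed."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.processed = []

    def poll(self, max_items=10):
        items = self.log.read(self.offset, max_items)
        self.processed.extend(items)         # "process" the events
        self.offset += len(items)            # advance only after processing
        return items

    def rewind(self, offset):
        self.offset = offset                 # replay after a failure

log = MultiConsumerQueue()
for url in ["foo.com/a", "foo.com/b", "foo.com/c"]:
    log.append(url)

analytics, audit = Consumer(log), Consumer(log)
analytics.poll()                 # reads all three events
audit.poll(max_items=1)          # reads only the first; unaffected by analytics
audit.rewind(0)
replayed = audit.poll()          # the same events are still available
```

Since the queue itself keeps no per-application state, adding another consumer does not require changing the queue or the other consumers.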

Stream processing
---------------------------------------------------------------------

![Figure 14.4](images/14fig04_alt.jpg)

---------------------------------------------------------------------

**Comparison of stream-processing paradigms**

![Figure 14.5](images/14fig05.jpg)

### Queues and workers

---------------------------------------------------------------------

#### A representative system using a queues-and-workers architecture. 

![Figure 14.6](images/14fig06_alt.jpg)

**The queues in the diagram could potentially be distributed queues as well.**

---------------------------------------------------------------------

#### Computing pageviews over time with a queues-and-workers architecture

![Figure 14.7](images/14fig07_alt.jpg)

**For our purposes, we can use HBase in place of Cassandra.**

Micro-batch stream processing
===============================================================
> This section covers:
>  
* Exactly-once processing semantics
* Micro-batch processing and its trade-offs
* Extending pipe diagrams for micro-batch stream processing

Achieving exactly-once semantics
---------------------------------------------------------------------

### Strongly ordered processing
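> Suppose you store the ID of the most recently processed tuple alongside the count. When a tuple arrives, there are two cases:
>  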
* The stored ID is the same as the current tuple ID. In this case, you know that the count already reflects the current tuple, so you do nothing.
* The stored ID is different from the current tuple ID. In this case, you know that the count doesn’t reflect the current tuple. So you increment the counter and update the stored ID. This works because tuples are processed in order, and the count and ID are updated atomically.
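
The two cases can be sketched as follows, assuming tuples arrive strictly in order and that a real store would write the count and the ID in a single atomic operation. The class name `ExactlyOnceCounter` is hypothetical.

```python
class ExactlyOnceCounter:
    def __init__(self):
        self.count = 0
        self.last_id = None      # ID of the most recently counted tuple

    def process(self, tuple_id):
        if tuple_id == self.last_id:
            return               # retry of a tuple that was already counted
        # In a real database this pair would be written atomically.
        self.count, self.last_id = self.count + 1, tuple_id

counter = ExactlyOnceCounter()
for tuple_id in [1, 2, 2, 3]:    # tuple 2 is retried after a simulated failure
    counter.process(tuple_id)
```

The scheme relies on ordering: comparing against only the latest ID is enough because a retried tuple can never arrive after a later one has been processed.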

### Micro-batch stream processing

**Tuple stream divided into batches**

![Figure 16.1](images/16fig01.jpg)

### Micro-batch processing topologies

#### Each batch includes tuples from all partitions of the incoming stream.

![Figure 16.4](images/16fig04.jpg)

**Word-count topology:**

![Figure 16.5](images/16fig05.jpg)

**Storing word counts with batch IDs:**

![Figure 16.6](images/16fig06.jpg)
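
A sketch of the same idea at batch granularity, assuming batches are processed in order and that state lives in a plain dictionary mapping each word to a `(batch_id, count)` pair. The function name `apply_batch` is an assumption for this example.

```python
from collections import Counter

state = {}   # word -> (batch ID of the last update, running count)

def apply_batch(batch_id, words):
    batch_counts = Counter(words)            # computed within the batch only
    for word, n in batch_counts.items():
        stored_batch, stored_count = state.get(word, (None, 0))
        if stored_batch == batch_id:
            continue                         # this batch was already applied to this word
        state[word] = (batch_id, stored_count + n)

apply_batch(1, ["a", "b", "a"])
apply_batch(1, ["a", "b", "a"])   # retry of batch 1 after a simulated failure
apply_batch(2, ["a"])
```

Storing the batch ID per word means a partially applied batch can be retried safely: words that were already updated are skipped and the rest are filled in.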


Core concepts of micro-batch stream processing
---------------------------------------------------------------------
>  
* *Batch-local computation*: There’s computation that occurs solely within the batch, not dependent on any state being kept. This includes things like repartitioning the word stream by the word field and computing the count of all the tuples in a batch.
* *Stateful computation*: Then there’s computation that keeps state across all batches, such as updating a global count, updating word counts, or storing a top-three list of most frequently used words. This is where you have to be really careful about how you do state updates so that processing is idempotent under failures and retries. The trick of storing the batch ID with the state is particularly useful here to add idempotence to non-idempotent operations.


<!--
Illustration
===============================================================
-->

Realtime views
---------------------------------------------------------------------

> Three separate queries you’re implementing for SuperWebAnalytics.com:
>  
* Number of pageviews over a range of hours
* Unique number of visitors over a range of hours
* Bounce rate for a domain  
    
---------------------------------------------------------------------

> Consider the following sequence of events:
>  
1. IP address `11.11.11.111` visits `foo.com/about` at 1:30 pm.
2. User `sally` visits `foo.com/about` at 1:40 pm.
3. An equiv edge between `11.11.11.111` and `sally` is discovered at 2:00 pm.
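
To illustrate why this sequence is awkward for a realtime uniques count, the sketch below counts distinct visitors with and without the equiv edge. It assumes an equiv can be modeled as a mapping from one identifier to a canonical one; the `uniques` function and the hour value `13` (for the 1 pm bucket) are illustrative.

```python
visits = [
    ("11.11.11.111", "foo.com/about", 13),   # 1:30 pm
    ("sally", "foo.com/about", 13),          # 1:40 pm
]

def uniques(visits, equivs):
    """Count distinct visitors after mapping each ID to its canonical ID."""
    def canonical(user_id):
        return equivs.get(user_id, user_id)
    return len({canonical(user_id) for user_id, _url, _hour in visits})

before_equiv = uniques(visits, {})                           # as seen at 1:40 pm
after_equiv = uniques(visits, {"11.11.11.111": "sally"})     # as seen after 2:00 pm
```

The count for the 1 pm bucket changes once the equiv is known, so a view that was incremented twice would have to be corrected retroactively, or left approximate until the batch layer catches up.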

### Topology structure

1. Consume a stream of pageview events that contains a user identifier, a URL, and a timestamp.
2. Normalize URLs.
3. Update a database containing a nested map from URL to hour to a HyperLogLog set (the structure behind Spark's `approxCountDistinct`).
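
The three steps can be sketched in plain Python, with an exact `set` standing in for the HyperLogLog structure and a nested dictionary standing in for the database. The normalization rules (lowercase host, drop scheme, query string, and trailing slash) and all function names are assumptions for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

SECONDS_PER_HOUR = 3600

# url -> hour bucket -> set of user IDs (a set stands in for HyperLogLog here)
view = defaultdict(lambda: defaultdict(set))

def normalize_url(url):
    # Prefix "//" so urlsplit treats a bare "foo.com/about" as host + path.
    parts = urlsplit(url if "//" in url else "//" + url)
    return parts.netloc.lower() + parts.path.rstrip("/")

def process(user_id, url, timestamp_secs):
    hour = timestamp_secs // SECONDS_PER_HOUR
    view[normalize_url(url)][hour].add(user_id)

def uniques_over_range(url, first_hour, last_hour):
    visitors = set()
    for hour in range(first_hour, last_hour + 1):
        visitors |= view[url].get(hour, set())
    return len(visitors)

process("sally", "http://Foo.com/about/?ref=x", 13 * SECONDS_PER_HOUR + 60)
process("bob", "foo.com/about", 13 * SECONDS_PER_HOUR + 120)
process("sally", "foo.com/about", 14 * SECONDS_PER_HOUR)
total = uniques_over_range("foo.com/about", 13, 14)
```

A real HyperLogLog trades exactness for a small, fixed memory footprint per bucket, and its sets can be merged across hours in the same way the exact sets are unioned here.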

<!--
SuperWebAnalytics.com speed layer
---------------------------------------------------------------------

### HBase data model

#### The HBase data model consists of column families, keys, and columns.

![Figure 13.1](images/13fig01_alt.jpg)

#### Pageviews over time represented in HBase

![Figure 13.2](images/13fig02.jpg)
-->