# Lecture 8 - Data Architectures and Big Data Systems

---

Let's now see where the algorithms meet real needs:

- Lambda Architecture
- Kappa Architecture

---

## The Need for New Architectures


Suppose an application must track the number of pageviews for any URL a customer wishes to track:

- The customer's browser pings the application's web server at a specific URL every time a pageview occurs;
- The application should report the top 100 URLs by number of pageviews;
- Now, as always, consider a huge number of page hits and a log schema like this:



| id (integer) | user_id (integer) | url (varchar(255)) | pageviews (bigint) |
|--|--|--|--|
|  |  |  |  |



#### First way to calculate: Direct Access to database

<img src="images/direct_acess.png" style="width:30%"/>

Direct access from the web server to the backend database cannot handle a large volume of frequent **write** requests. You start receiving many errors like:

``` 
Timeout error inserting into Database
```

----

#### Async inserts with a Queue

> Batch many increments into a single request, handled by an asynchronous worker;

<img src="images/async_acess.png" style="width:25%"/>
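The batching idea can be sketched as follows. This is a minimal illustration, not a production queue: the `BatchingWorker` class, its `batch_size` parameter, and the in-memory `db` stand-in are all hypothetical names introduced here.

```python
from collections import Counter


class BatchingWorker:
    """Hypothetical worker that buffers pageview events and writes them in batches."""

    def __init__(self, flush_fn, batch_size=3):
        self.flush_fn = flush_fn      # called with {url: increment} once per batch
        self.batch_size = batch_size  # number of events buffered before a flush
        self.buffer = Counter()
        self.pending = 0

    def on_pageview(self, url):
        self.buffer[url] += 1
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(dict(self.buffer))  # one database request for many events
            self.buffer.clear()
            self.pending = 0


# Stand-in for the database: a counter per URL plus a log of write requests.
db = Counter()
requests = []


def write_batch(increments):
    requests.append(increments)
    db.update(increments)  # Counter.update adds the increments


worker = BatchingWorker(write_batch, batch_size=3)
for url in ["/a", "/b", "/a", "/a", "/c"]:
    worker.on_pageview(url)
worker.flush()  # write whatever is still buffered
```

Five pageviews reach the database as two write requests instead of five, which is the whole point of the queue.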

That solves it. But what if the amount of data increases even more? Your worker also starts receiving

``` 
Timeout error inserting into Database
```
and cannot keep up with the writes!

> Add more workers?

Nope, the Database will be overloaded! 

> New Solution! Sharding of the database: Horizontal partitioning of tables

Looks good:

- Use multiple database servers and spread the table across all of them
- Choose the shard for each key by taking the hash of the key modulo the number of shards
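A minimal sketch of this shard selection, assuming string keys (URLs) and using `zlib.crc32` as the hash. The function name `shard_for` and the example URLs are illustrative only.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    # zlib.crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_shards


urls = [f"http://example.com/page/{i}" for i in range(1000)]
placement = {url: shard_for(url, 4) for url in urls}
```

Every writer and reader that applies the same function to the same key reaches the same shard, so no central lookup table is needed.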

> Starting to work in a distributed way, right?

##### New problems:

- **SIZE ISSUES:** What if your current number of shards cannot handle your data?
    - Your mapping script must cope with the new set of shards
    - The application and its data must be reorganized
    


- **FAULT-TOLERANCE:**  What if one of the database machines is down? 
    - A portion of the data is unavailable
    

- **CORRUPTION ISSUES:** What if a bug in your worker code accidentally stores the wrong number for some portion of the data?
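To see why the size issue hurts with hash-modulo sharding, here is a small sketch. It assumes the keys have already been hashed to integers (the values 0 to 999 are stand-ins) and counts how many of them change shard when a fifth shard is added to four. The name `shard_index` is hypothetical.

```python
def shard_index(hashed_key: int, num_shards: int) -> int:
    """Shard chosen by the hash of the key modulo the number of shards."""
    return hashed_key % num_shards


hashed_keys = range(1000)  # stand-ins for already-hashed keys

# A key moves if its shard under 4 shards differs from its shard under 5 shards.
moved = sum(1 for k in hashed_keys if shard_index(k, 4) != shard_index(k, 5))
moved_fraction = moved / len(hashed_keys)
```

With these values, 800 of the 1000 keys (80%) land on a different shard, so adding a single server forces most of the data to be moved.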

----

#### Typical Approaches 

##### OLAP 
<img src="images/olap.png" style="width:40%"/>

- Not real time
- (Very) expensive
- Not easy to scale up

----


#### Let's now consider a Big Data approach:

- Sharding and replication are fundamental components in the design of Big Data systems
    - they solve the fault-tolerance and size issues

- Although users (or workers) can change data all the time, **the raw pageview information is not modified**
    - this solves the corruption issues
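The value of keeping raw pageviews unmodified can be sketched like this. The event layout `(user_id, url)` and both counting functions are assumptions made for illustration; the buggy function stands in for any faulty worker code.

```python
from collections import Counter

# Append-only log of raw pageview events: (user_id, url). Records are never updated.
raw_pageviews = [
    (1, "/home"), (2, "/home"), (1, "/about"), (3, "/home"),
]


def count_pageviews(events):
    """Derive per-URL counts from the raw events."""
    return Counter(url for _user_id, url in events)


def buggy_count_pageviews(events):
    """A worker with a bug: it counts every pageview twice."""
    counts = Counter()
    for _user_id, url in events:
        counts[url] += 2
    return counts


corrupted = buggy_count_pageviews(raw_pageviews)  # wrong derived data
repaired = count_pageviews(raw_pageviews)         # recomputed from the untouched raw events
```

Because only the derived counts were wrong, deploying the fixed code and recomputing is enough to recover. Had the worker incremented counters in place, the correct values would be lost.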
    
    

----




## Lambda Architecture

### Why Lambda Architecture?

**To perform large-scale analytics over voluminous data**

We need a high-level architecture that provides:
    
- Robustness
- Fault tolerance, against both hardware failures and human mistakes
- Support for a wide range of workloads and use cases 
- Low-latency reads and updates
- Batch analytics jobs
- Scalability
- Scale-out capabilities with minimal maintenance


<img src="images/lambda_arch.png" style="width:50%"/>



STARTING WITH A QUERY:

---

```
query = function(all data)

```
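Read literally, the equation says that any query can be answered by a function that scans the entire dataset. A minimal sketch for the top-URLs query, assuming events are `(user_id, url)` pairs (the function name `top_urls` is hypothetical):

```python
from collections import Counter


def top_urls(all_pageviews, n=100):
    """query = function(all data): scan every raw event and rank URLs by count."""
    counts = Counter(url for _user_id, url in all_pageviews)
    return counts.most_common(n)


all_pageviews = [(1, "/a"), (2, "/b"), (3, "/a"), (1, "/c"), (2, "/a"), (3, "/b")]
top2 = top_urls(all_pageviews, n=2)  # [("/a", 3), ("/b", 2)]
```

This is correct but far too slow to run on every request over a huge dataset, which is what motivates precomputing views.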


### Batch layer
**High throughput, high latency**

- Precomputes results using a distributed processing system
    - It is the component that performs the batch view processing, like
    
    ```
    batch view = function(all data)
    ```
    

- Stores an immutable, constantly growing master dataset

- Computes arbitrary functions on that dataset 
    - Batch-processing systems: e.g. Hadoop, Spark, TensorFlow
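As a toy, single-machine illustration of the batch layer, the sketch below precomputes a view keyed by `(url, day)` and then answers a query from the view alone. The event layout and both function names are assumptions. A real batch layer would run the same kind of function on a distributed engine.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """batch view = function(all data): count pageviews per (url, day)."""
    return Counter((url, day) for url, day in master_dataset)


def pageviews(batch_view, url, days):
    """query = function(batch view): sum precomputed counts over a range of days."""
    return sum(batch_view[(url, day)] for day in days)


master_dataset = [("/a", 1), ("/a", 1), ("/a", 2), ("/b", 2), ("/a", 3)]
batch_view = build_batch_view(master_dataset)
total = pageviews(batch_view, "/a", range(1, 3))  # days 1 and 2 -> 3
```

The expensive scan happens once in `build_batch_view`; each query afterwards only touches a few keys.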


<img src="images/batch_layer.png" style="width:60%"/>

---

### Speed layer
**High throughput, low latency, stream-processing systems**
- Is there any data not represented in the batch view?
    - Yes: data that arrives while the precomputation (batch layer computation) is running
- The speed layer covers this gap, giving a fully real-time data system
- The speed layer looks only at recent data
- Whereas the batch layer looks at all the data at once
    ```
    realtime view = function(realtime view, new data)
    ```
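The incremental form of the speed layer can be sketched as a function that takes the current view and only the new events. The `(url, day)` event layout and the function name are assumptions.

```python
def update_realtime_view(realtime_view, new_data):
    """realtime view = function(realtime view, new data)."""
    updated = dict(realtime_view)  # copy, so the previous view is left untouched
    for url, day in new_data:
        updated[(url, day)] = updated.get((url, day), 0) + 1
    return updated


view = {}
view = update_realtime_view(view, [("/a", 4)])
view = update_realtime_view(view, [("/a", 4), ("/b", 4)])
```

Each update costs time proportional to the new events only, never to the full history, which is what keeps latency low.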

How long should the realtime view be maintained?

- Once the data arrives at the serving layer (as part of a batch view), the corresponding results in the realtime views are no longer needed
- You can then discard those pieces of the realtime views
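A minimal sketch of that clean-up, assuming views keyed by `(url, day)` and a hypothetical marker `batch_covered_through_day` that records the last day included in the current batch view:

```python
def expire_realtime_view(realtime_view, batch_covered_through_day):
    """Drop realtime entries for days that the batch view already covers."""
    return {
        (url, day): count
        for (url, day), count in realtime_view.items()
        if day > batch_covered_through_day
    }


realtime_view = {("/a", 3): 5, ("/a", 4): 2, ("/b", 4): 1}
realtime_view = expire_realtime_view(realtime_view, batch_covered_through_day=3)
```

After the call only the day-4 entries remain, so the realtime view stays small no matter how large the master dataset grows.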

<img src="images/lambda_arch_timeline.png" style="width:60%"/>



### Serving Layer


- As seen, the batch layer emits batch views as the result of its functions 
    - These views should be loaded somewhere and queried

- It is a specialized distributed database that loads a batch view and makes it possible to do random reads on it

- Batch update and random reads should be supported
    - e.g. BigQuery, ElephantDB, Dynamo, MongoDB, Cassandra
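The two required operations can be illustrated with an in-memory stand-in. The `ServingStore` class below is hypothetical and does not model the API of any of the databases listed. It only shows that views are replaced wholesale and read by key.

```python
class ServingStore:
    """Hypothetical serving store: views are swapped in whole, never edited in place."""

    def __init__(self):
        self._view = {}

    def load_batch_view(self, new_view):
        # Batch update: replace the old view with a complete, freshly computed one.
        self._view = dict(new_view)

    def get(self, key, default=0):
        # Random read: look up a single key.
        return self._view.get(key, default)


store = ServingStore()
store.load_batch_view({("/a", 1): 2, ("/a", 2): 1})
store.load_batch_view({("/a", 1): 2, ("/a", 2): 1, ("/a", 3): 4})  # next batch run
```

Since the store never needs random writes, it can be much simpler than a general-purpose database.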


### Finally

<img src="images/lambda_arch_2.png" style="width:70%"/>



### Back to our example:

Web analytics application tracking the number of pageviews over a range of days

- The speed layer keeps its own separate view of [url, day] and updates its views by incrementing the count in the view whenever it receives new data

- The batch layer recomputes its views by counting the pageviews

- To resolve the query, you query both the batch and realtime views
    - Each over the range of days it covers
    - Then sum up the results
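The merge step can be sketched as follows, assuming both views are dictionaries keyed by `(url, day)` and that a hypothetical `batch_through_day` value marks the last day covered by the batch view:

```python
def pageviews_over_days(batch_view, realtime_view, url, days, batch_through_day):
    """Sum batch counts for covered days and realtime counts for the remaining days."""
    total = 0
    for day in days:
        view = batch_view if day <= batch_through_day else realtime_view
        total += view.get((url, day), 0)
    return total


batch_view = {("/a", 1): 10, ("/a", 2): 7}
realtime_view = {("/a", 3): 4, ("/a", 2): 99}  # the day-2 entry is stale and must be ignored
total = pageviews_over_days(
    batch_view, realtime_view, "/a", range(1, 4), batch_through_day=2
)  # 10 + 7 + 4 = 21
```

Splitting the range by day avoids double counting events that are present in both views.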

### Now, our approximate (statistical) algorithms come in:

- The batch and speed layers split your data and use:
    
    - Exact algorithms on the batch layer
    - Approximate, error-bounded algorithms on the speed layer (like Bloom filters, etc.)



- The batch layer repeatedly overrides the speed layer 
    - The approximation gets corrected

    
- Eventual accuracy
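As one example of an approximate structure suited to the speed layer, here is a small Bloom filter sketch. The sizes (`num_bits`, `num_hashes`) and the use of salted SHA-256 digests as hash functions are arbitrary choices for illustration, not tuned values.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for url in ["/a", "/b", "/c"]:
    seen.add(url)
```

`seen.might_contain("/a")` is always `True` once `"/a"` has been added. For an item that was never added, the answer is usually `False` but can occasionally be `True`, and that bounded error is what the exact batch computation later corrects.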

### Latest Technologies

<img src="images/Lambda_arch_tech.png" style="width:70%"/>


---

## Kappa Architecture

----


The Kappa architecture is a simplified version of the Lambda architecture

<img src="images/kappa_arch.png" style="width:70%"/>



It is not a replacement for the Lambda Architecture, except where it fits your use case. In this architecture, incoming data is streamed through a real-time layer, and the results are placed in the serving layer for queries.
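A toy sketch of the single-path idea follows. It assumes the incoming events are kept in a replayable log, so that a changed computation is handled by streaming the same log through the new code rather than by a separate batch layer. All names here are hypothetical.

```python
def process_stream(event_log, update, start_offset=0):
    """Fold events from the log into a view, starting at the given offset."""
    state = {}
    for event in event_log[start_offset:]:
        state = update(state, event)
    return state


def count_by_url(state, event):
    url, _day = event
    return {**state, url: state.get(url, 0) + 1}


def count_by_url_and_day(state, event):
    return {**state, event: state.get(event, 0) + 1}


event_log = [("/a", 1), ("/a", 2), ("/b", 2)]  # retained, replayable log

view_v1 = process_stream(event_log, count_by_url)
# New logic is deployed by replaying the same log from offset 0.
view_v2 = process_stream(event_log, count_by_url_and_day)
```

There is only one code path to maintain, at the cost of having to keep the log long enough to replay it.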






