# Introduction to Data Engineering 
<br><br>

**ONS / NISR** <br>
2021

## What does a data engineer do?

Data engineers are responsible for designing, developing, and maintaining the data platform, which includes the data infrastructure, data applications, data warehouse and data pipelines.

In a big company, data engineers are usually divided into different groups that each work with a specific part of the data platform. <br><br>

<center><img src="./imgs/engineer_stack.png"></center>

## Data warehouses

A **data warehouse** is a data storage system filled with data from various sources and is used for data analysis. It is different from a traditional **database**. A traditional **database** is great for updating and retrieving values, but is inefficient for analytical and data science purposes. 

<center><img src="./imgs/datawarehouse.png" width=300px height=300px style=""><br><caption><em>A visual representation of a data warehouse</em></caption></center>

### What is a data warehouse? 

A data warehouse is a data storage system filled with data from **various sources** and is used for data analysis. 

Most data in the real world is stored in different transactional systems (or even worse, as text files!). Transactional data isn't very efficient for use in analysis and data science. 

The main reason for building a data warehouse is to store **all** types of data in **optimized formats** in a centralized place so that data scientists can analyse all the data together. 

<center>A data warehouse makes <b>many</b> different datasets available to data scientists in a format that is <b>optimised</b> for analytical work.</center> 
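
As a minimal sketch of this idea, the example below loads rows from two separate sources into one central table and runs an analytical query across both. SQLite is only a stand-in here, not a real data warehouse, and the source names, table name and figures are all hypothetical.

```python
import sqlite3

# Hypothetical records from two separate source systems.
sales_system = [("2021-03-01", "Kigali", 120.0), ("2021-03-01", "Huye", 45.5)]
survey_files = [("2021-03-02", "Kigali", 80.0)]


def build_warehouse(*sources):
    """Load rows from every source into one central table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (day TEXT, region TEXT, amount REAL)")
    for rows in sources:
        conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = build_warehouse(sales_system, survey_files)

# One query now covers data that started life in two different systems.
totals = dict(
    warehouse.execute("SELECT region, SUM(amount) FROM facts GROUP BY region")
)
print(totals)
```

The point of the sketch is the shape of the work: once everything sits in one place, a single query can answer a question that would otherwise need data from several systems.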

#### What technologies do data warehouses use? 

There are many databases that serve well as a data warehouse, such as Apache Hive, BigQuery, and Redshift.   

<table style="text-align:center">
<tr><td>
<img src="./imgs/hivelogo.png" width=150></td><td>
<img src="./imgs/bigquerylogo.png" width=300></td><td>
<img src="./imgs/redshiftlogo.png" width=250></td></tr></table> 

### Data Lakes

In big data contexts we are dealing with massive volumes of rapidly changing information streams, which data warehouses usually aren't able to accommodate; for example, live tweets from every person in Rwanda.  

A data lake is a solution to store all sorts of raw and semi-structured data in a big...lake. 

It is a vast pool for saving data in its native, unprocessed form. A data lake stands out for its high agility as it isn’t limited to a warehouse’s fixed configuration.
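
A rough sketch of "native, unprocessed form" is shown below, using a local temporary folder as a stand-in for cloud object storage. The folder layout (`source/day/events.jsonl`) and the event fields are assumptions made for illustration; the key point is that events with different shapes are stored as they arrive, with no fixed schema.

```python
import json
import tempfile
from pathlib import Path


def write_raw_event(lake_root, source, day, event):
    """Append one event, untouched, under <root>/<source>/<day>/events.jsonl."""
    folder = Path(lake_root) / source / day
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "events.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return path


# A temporary directory plays the role of the lake in this sketch.
lake_root = tempfile.mkdtemp()

# The two events have different fields; the lake accepts both as they are.
write_raw_event(lake_root, "tweets", "2021-03-01", {"user": "a", "text": "Muraho"})
path = write_raw_event(lake_root, "tweets", "2021-03-01", {"user": "b", "lang": "rw"})

stored = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(stored)
```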

Data lakes are relatively new and, as such, their security models are not quite as mature as those of data warehouses, but they can be incredibly valuable for data scientists. 

#### What technologies do data lakes use? 

Most big cloud providers now offer data lake technology, such as Amazon's S3 service and Microsoft's Azure Blob Storage. 

<table style="text-align:center">
    <tr><td><img src="./imgs/hadoop.png" width=250></td><td>
<img src="./imgs/s3logo.png" width=250></td><td>
<img src="./imgs/blobstoragelogo.png" width=300></td></tr></table> 

## Compute Resources

As the data warehouse doesn't do anything by itself, we need a **compute** layer to apply algorithms to the data in our warehouse. 

The difficulty is that algorithms come in all different shapes and sizes, meaning our compute layer needs to be flexible to the needs of our data scientists. 

<center><img src="./imgs/compute.png" width=300px height=300px style="margin: -15px"><br><caption><em>A visual representation of the compute layer</em></caption></center>

### A history of two data science compute layers

It's important to remember that, just like data science, building data infrastructure is an iterative process, and the infrastructure **will** need to expand as the needs of the business grow. <br><br>

<table style="font-size:20px">
    <tr><th>ONS Data Science Campus</th><th>OfS Data, Foresight & Analysis</th></tr>
    <tr><td>No infrastructure</td><td>No infrastructure</td></tr>
    <tr><td>Working with local data on average laptops</td><td>Working with local data on average laptops</td></tr>
    <tr><td>Working with local data on better laptops</td><td>Working with local data on better laptops</td></tr>
    <tr><td>On-premises DAP</td><td>On-premises SAS servers</td></tr>
    <tr><td>Cloud Software / Data</td><td>Cloud data and compute with Azure Databricks</td></tr>
    
</table>

## Job Scheduler / Pipelines

New data is being generated all the time, and data science projects will want to take advantage of the newest data whenever possible. We want to refresh our data sources and data science results at a regular cadence. 

The scheduler layer keeps all these processes ticking along automatically. 

<center><img src="./imgs/jobschedule.png" width=300px height=300px style="margin: -20px"><br><caption><em>A visual representation of the scheduler layer</em></caption></center>

### What is a data pipeline?

A data pipeline is a series of data processes that extract, process and load data between different systems. 

There are two types of data pipeline:

<table style="font-size:20px"><tr><td>Batch-driven</td><td>Real-time</td></tr></table>

#### Batch-driven pipelines

Batch data pipelines only process data at a set frequency, e.g. once a day. They typically process a large **batch** of historical data all at once, which often takes a long time to finish. 

For example, a batch-driven pipeline could download the previous day's data from an API at midnight every day, transform the data and then load it into the data warehouse. 
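
The extract, transform and load steps of such a pipeline could be sketched as follows. This is an illustration under stated assumptions, not a production design: `extract` returns hard-coded records instead of calling a real API, SQLite stands in for the warehouse, and the table and field names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def extract(day):
    """Stand-in for an API call: returns hypothetical raw records for one day."""
    return [
        {"day": day.isoformat(), "region": " kigali ", "count": "12"},
        {"day": day.isoformat(), "region": "Huye", "count": "7"},
    ]


def transform(records):
    """Clean region names and convert counts from text to integers."""
    return [
        (r["day"], r["region"].strip().title(), int(r["count"]))
        for r in records
    ]


def load(conn, rows):
    """Write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_counts "
        "(day TEXT, region TEXT, count INTEGER)"
    )
    conn.executemany("INSERT INTO daily_counts VALUES (?, ?, ?)", rows)
    conn.commit()


def run_batch(conn, today):
    """Process the whole of the previous day's data in one go."""
    yesterday = today - timedelta(days=1)
    rows = transform(extract(yesterday))
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = run_batch(conn, date(2021, 3, 2))
print(loaded, "rows loaded")
```

A scheduler would then call `run_batch` once a day. With `cron`, for instance, a schedule of `0 0 * * *` runs a command at midnight every day.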

<table style="text-align:center">
<tr><td>
<img src="./imgs/AirflowLogo.png" width=200></td><td>
<img src="./imgs/luigi.png" width=150></td><td>
<img src="./imgs/crontablogo.png" width=250></td></tr></table>

#### Real-time pipelines

Real-time data pipelines process new data as soon as it is available. The architecture for real-time data processing is very different from that of batch pipelines because data is treated as a **stream of events** instead of chunks of records. 

Real-time pipelines are useful for applications that need to respond to new information close to instantly. 
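
To illustrate the "stream of events" idea, the sketch below updates a running result after every single event rather than waiting for a full batch. A Python generator stands in for a live source such as a Kafka topic; the event fields are hypothetical and no real streaming library is used.

```python
from collections import Counter


def event_stream():
    """Stand-in for a live source; yields hypothetical events one at a time."""
    yield {"user": "a", "topic": "weather"}
    yield {"user": "b", "topic": "football"}
    yield {"user": "c", "topic": "weather"}


def process_stream(events):
    """Update running counts after every event and yield a snapshot."""
    counts = Counter()
    for event in events:
        counts[event["topic"]] += 1
        # A result is available immediately, without waiting for more data.
        yield dict(counts)


snapshots = list(process_stream(event_stream()))
for snapshot in snapshots:
    print(snapshot)
```

Each snapshot is an up-to-date answer at the moment the event arrived, which is what lets downstream applications react almost instantly.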

<center><img src="./imgs/kafka.png" width=200></center>

## Architecture / Orchestration

All the previous layers are atomic. As such, we can join them together in different ways for different purposes. This can massively increase performance, because pipelines can share intermediate results instead of each one reprocessing the same raw data.

<center><img src="./imgs/architecture.png" width=300px height=300px style="margin: -20px"><br><caption><em>A visual representation of the architecture layer</em></caption></center>

### What technologies assist with orchestration?

Large-scale orchestration is just pipelining pipelines. However, the larger and more complex the system, the more important logging and error reporting become. That means simpler solutions such as `cron` will become more painful to maintain.
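
A toy version of what an orchestrator adds on top of a plain schedule is sketched below: tasks run in dependency order, every outcome is logged, and anything downstream of a failure is skipped. It uses only the Python standard library (`graphlib` needs Python 3.9 or later) and is not how Airflow or any other tool is implemented; the task names and the deliberate failure are hypothetical.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orchestrator")


def run_all(tasks, dependencies):
    """Run tasks in dependency order; skip anything downstream of a failure."""
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        blocked = [d for d in dependencies.get(name, ()) if status.get(d) != "ok"]
        if blocked:
            status[name] = "skipped"
            log.warning("%s skipped, upstream not ok: %s", name, blocked)
            continue
        try:
            tasks[name]()
            status[name] = "ok"
            log.info("%s finished", name)
        except Exception:
            status[name] = "failed"
            log.exception("%s failed", name)
    return status


def broken_report():
    raise ValueError("hypothetical failure")


# Hypothetical pipelines; each value is a callable that does the work.
tasks = {
    "ingest": lambda: None,
    "clean": lambda: None,
    "report": broken_report,
    "publish": lambda: None,
}
# Each key lists the tasks that must succeed before it can run.
dependencies = {"clean": {"ingest"}, "report": {"clean"}, "publish": {"report"}}

status = run_all(tasks, dependencies)
print(status)
```

With `cron` alone, `publish` would still run at its scheduled time even though `report` had failed, which is the kind of problem that grows with the size of the system.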

<table style="text-align:center">
<tr><td>
<img src="./imgs/AirflowLogo.png" width=200></td><td>
    <img src="./imgs/snowflake.png" width=200></td></tr></table>

## Versioning & CI/CD

The most important thing to remember is that, just like data science, infrastructure needs to be constantly measured for performance, reviewed and iterated on to be effective. 

<center><img src="./imgs/versioning.png" width=400px height=400px style="margin: 0 0 -50px"><br><caption><em>A visual representation of the versioning layer</em></caption></center>

### How to version control a pipeline 

Most of the tools we've discussed allow for versioning of a pipeline. The pipeline itself is just code and can be versioned, tested and deployed just like any other project. <br><br>

<center><img src="./imgs/gitlogo.png" width=300></center>

### How to CI/CD a pipeline

Continuous integration / continuous delivery (CI/CD) allows changes to our code to be pushed to production incrementally. This can be made even safer with the use of a **development** environment that mirrors the setup of the production environment. 

As pipelines and orchestration are code, it is also possible to automate the testing of each deployment using unit and integration tests. 

**Unit tests** - Testing a small section of your code in isolation.<br>
**Integration tests** - Testing that the entire code works within the production environment.
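
The difference between the two kinds of test could look like the sketch below, written with Python's built-in `unittest` module. The pipeline functions are hypothetical, and an in-memory SQLite database stands in for a development environment that mirrors production.

```python
import sqlite3
import unittest


def clean_region(name):
    """Small transformation step: trim whitespace and normalise capitalisation."""
    return name.strip().title()


def run_pipeline(conn, raw_names):
    """Whole pipeline: clean each name and load it into the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS regions (name TEXT)")
    conn.executemany(
        "INSERT INTO regions VALUES (?)",
        [(clean_region(n),) for n in raw_names],
    )
    conn.commit()


class UnitTests(unittest.TestCase):
    def test_clean_region(self):
        # Checks one small function on its own, with no database involved.
        self.assertEqual(clean_region("  kigali "), "Kigali")


class IntegrationTests(unittest.TestCase):
    def test_pipeline_end_to_end(self):
        # Checks that the steps work together against a (stand-in) database.
        conn = sqlite3.connect(":memory:")
        run_pipeline(conn, [" huye", "KIGALI "])
        rows = conn.execute("SELECT name FROM regions ORDER BY name").fetchall()
        self.assertEqual(rows, [("Huye",), ("Kigali",)])


def run_tests():
    """Run both test classes and return the result object."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(UnitTests),
        loader.loadTestsFromTestCase(IntegrationTests),
    ])
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

In a CI/CD setup, a run like this would typically be triggered automatically on every change, and the deployment would only proceed if all tests pass.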

## Data Engineering key frameworks

Every data engineering problem is different. If there is no need to process data in real-time then there is no need for a Kafka system. However, below are some common tools and frameworks that are worth having a passing understanding of.

**Programming**: Python, SQL, Java/Scala<br>
**Distributed Systems**: Hadoop<br>
**Databases**: MySQL, MongoDB<br>
**Data processing**: Spark<br>
**Real-time data ecosystem**: Kafka<br>
**Data orchestration**: Airflow<br>
**Data science and ML**: pandas


## Further reading 

Some resources for further reading are given below.
<div style="float:left"><a href="https://livebook.manning.com/book/effective-data-science-infrastructure/chapter-1/"><img src="./imgs/manning_book.jpeg" width=200px></a><br><center><em>Manning's Effective Data <br>Science Infrastructure.</em></center></div>
<table style="padding-left:50px">
    <tr><th style="text-align:left">Resources</th></tr>
<tr><td><a href="https://www.kdnuggets.com/2020/12/introduction-data-engineering.html">An introduction to data engineering</a></td></tr>
<tr><td><a href="https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/">Data Engineer's comprehensive list of resources to get started.</a></td>
</tr>
</table>