# Data Engineering for Everyone

This notebook contains a conceptual course where no coding is involved.

**Objectives:**
- Be able to communicate with data engineers
- Build a solid foundation for learning more

## 1. What is data engineering?

In this section, we'll learn what data engineering is and why demand for data engineers is increasing. We'll then see where data engineering sits in the data science lifecycle, how data engineers differ from data scientists, and get an introduction to a first complete data pipeline.

## Data workflow

The diagram below depicts the data workflow in an organization.


<img style="float: left; " src="images/workflow.png" alt="workflow" width="800"/>

## Data engineers

Data engineers deliver:

- the correct data
- in the right form
- to the right people
- as efficiently as possible

## A data engineer's responsibilities

- Ingest data from different sources
- Optimize databases for analysis
- Remove corrupted data
- Develop, construct, test and maintain data architectures

## Data engineers and big data

- Big data is becoming the norm, so data engineers are increasingly in demand
- Big data:
    - Forces you to think about how to deal with its size
    - Is so large that traditional methods no longer work

## Big data growth

- Sensors and devices
- Social media
- Enterprise data
- VoIP (voice communication, multimedia sessions)

<img style="float: left; " src="images/big-data-growth.png" alt="big-data-growth" width="600"/>


## The five Vs

- Volume (how much?)
- Variety (what kind?)
- Velocity (how frequent?)
- Veracity (how accurate?)
- Value (how useful?)

## Data engineers vs. data scientists

Data engineers enable data scientists


| Data engineer           | Data scientist           |
|-------------------------|--------------------------|
| Ingest and store data   | Exploit data             |
| Set up databases        | Access databases         |
| Build data pipelines    | Use pipeline outputs     |
| Strong software skills  | Strong analytical skills |

## Data pipelines ensure an efficient flow of the data


    
| Automate                               | Reduce                              |
|----------------------------------------|-------------------------------------|
| Extracting                             | Human intervention                  |
| Transforming                           | Errors                              |       
| Combining                              | Time it takes data to flow          |    
| Validating                             |                                     |
| Loading                                |                                     |


## ETL and data pipelines

    
| ETL                                             | Data pipelines                               |
|-------------------------------------------------|----------------------------------------------|
| Popular framework for designing data pipelines  | Move data from one system to another         |
| 1) Extract data                                 | May follow ETL                               |       
| 2) Transform extracted data                     | Data may not be transformed                  |    
| 3) Load transformed data to another database    | Data may be directly loaded in applications  |  
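
Although this course is conceptual, a small sketch may make the three ETL steps more concrete. Everything here is an illustrative assumption, not part of the course: the sample CSV, the `songs` table, and the function names `extract`, `transform` and `load` are hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """title,artist,duration_s
Song A,Artist 1,215
Song B,Artist 2,
Song C,Artist 1,187
"""


def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows without a duration and convert seconds to minutes."""
    cleaned = []
    for row in rows:
        if not row["duration_s"]:
            continue  # validation: skip incomplete records
        minutes = round(int(row["duration_s"]) / 60, 2)
        cleaned.append((row["title"], row["artist"], minutes))
    return cleaned


def load(records, connection):
    """Load: write the transformed records into the target database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS songs (title TEXT, artist TEXT, duration_min REAL)"
    )
    connection.executemany("INSERT INTO songs VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")  # stand-in for the target database
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT title, duration_min FROM songs ORDER BY title"
).fetchall()
print(loaded)
```

A pipeline that skips `transform` and passes the extracted rows straight to an application would still be a data pipeline, just not an ETL one.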



## 2. Storing data

It's time to talk about data storage, one of the main responsibilities of a data engineer. In this section, we'll learn how data engineers manage different data structures, work in SQL (the language of choice for querying and storing data), and implement appropriate data storage solutions with data lakes and data warehouses.

## Structured data

- Easy to search and organize
- Consistent model, rows and columns
- Defined types
- Can be grouped to form relations
- Stored in relational databases
- About 20% of the data is structured
- Created and queried using SQL

## Semi-structured data

- Relatively easy to search and organize
- Consistent model, less-rigid implementation: different observations have different sizes
- Different types
- Can be grouped, but needs more work
- Stored in NoSQL databases, often as JSON, XML or YAML
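
As a rough illustration of "different observations have different sizes", the sketch below parses two JSON records that share a general model but not the same fields. The records and field names (`user`, `favorite_artists`, `country`) are made up for this example.

```python
import json

# Two hypothetical observations: same general model, different sizes.
raw = """
[
  {"user": "ana", "favorite_artists": ["Artist 1", "Artist 2"]},
  {"user": "ben", "favorite_artists": ["Artist 3"], "country": "FR"}
]
"""
records = json.loads(raw)

# Grouping needs more work than with rows and columns: flatten first,
# filling in None where a record has no value for a field.
rows = [
    (record["user"], artist, record.get("country"))
    for record in records
    for artist in record["favorite_artists"]
]
print(rows)
```

Once flattened, the rows have a fixed shape and could be loaded into a relational table.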

## Unstructured data

- Does not follow a model, can't be contained in rows and columns
- Difficult to search and organize
- Usually text, sound, pictures or videos
- Usually stored in data lakes, can appear in data warehouses or databases
- Most of the data is unstructured
- Can be extremely valuable

## Adding some structure

- Use AI to search and organize unstructured data
- Add information to make it semi-structured

## SQL databases

- Structured Query Language
- Industry standard for Relational Database Management Systems (RDBMS)
- Allows you to access many records at once, and group, filter or aggregate them
- Close to written English, easy to write and understand
- Data engineers use SQL to create and maintain databases
- Data scientists use SQL to query (request information from) databases
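
To give a feel for filtering, grouping and aggregating many records at once, here is a minimal sketch using Python's built-in `sqlite3` module. The `listens` table and its rows are hypothetical sample data, not taken from the course.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# A data engineer would typically create and populate the table...
connection.execute(
    "CREATE TABLE listens (user_name TEXT, genre TEXT, minutes INTEGER)"
)
connection.executemany(
    "INSERT INTO listens VALUES (?, ?, ?)",
    [("ana", "rock", 30), ("ana", "jazz", 10), ("ben", "rock", 45), ("cara", "pop", 5)],
)

# ...and a data scientist would typically query it.
query = """
    SELECT genre, SUM(minutes) AS total_minutes  -- aggregate
    FROM listens
    WHERE minutes >= 10                           -- filter
    GROUP BY genre                                -- group
    ORDER BY total_minutes DESC
"""
totals = connection.execute(query).fetchall()
print(totals)
```

The query reads close to an English sentence: select the total minutes per genre from the listens of at least ten minutes.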

## Database schema

- Databases are made of tables
- The database schema governs how tables are related

## Several implementations

- SQLite
- MySQL
- PostgreSQL
- Oracle SQL
- SQL Server

## Data warehouses and data lakes


|        Data lake                             |         Data warehouse                           |
|:----------------------------------------------|--------------------------------------------------:|
| Stores all the raw data                      | Specific data for specific use                   |
| Can be petabytes (1 million GBs)             | Relatively small                                 |
| Stores all data structures                   | Stores mainly structured data                    |
| Cost-effective                               | More costly to update                            |
| Difficult to analyze                         | Optimized for data analysis                      |
| Requires an up-to-date data catalog          | Also used by data analysts and business analysts |
| Used by data scientists                      | Ad-hoc, read-only queries                        |
| Big data, real-time analytics                |                                                  |


## Data catalog for data lakes

- What is the source of this data?
- Where is this data used?
- Who is the owner of the data?
- How often is this data updated?
- Good practice in terms of data governance
- Ensures reproducibility
- No catalog --> data swamp
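
One possible way to picture a catalog entry is as a small record that answers the questions above. This is only a sketch: the `CatalogEntry` class, its field names, the `register` and `undocumented` helpers, and the sample dataset are all hypothetical, and real data catalogs are usually dedicated tools rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CatalogEntry:
    """One catalog record; field names are illustrative assumptions."""
    dataset: str
    source: str            # What is the source of this data?
    used_by: List[str]     # Where is this data used?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?


catalog = {}


def register(entry):
    """Add or replace the catalog entry for a dataset."""
    catalog[entry.dataset] = entry


def undocumented(datasets):
    """Return the datasets in the lake that have no catalog entry."""
    return [name for name in datasets if name not in catalog]


register(CatalogEntry(
    dataset="raw_listens",
    source="mobile app event logs",
    used_by=["recommendation model", "revenue dashboard"],
    owner="data-engineering",
    update_frequency="hourly",
))

# Datasets without an entry are the first step toward a data swamp.
missing = undocumented(["raw_listens", "raw_uploads"])
print(missing)
```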

**Good practice for any data storage solution**
- Reliability
- Autonomy
- Scalability
- Speed

## Database vs. data warehouse

- Database:
    - General term
    - Loosely defined as organized data stored and accessed on a computer
- A data warehouse is a type of database

## 3. Moving and processing data

Data engineers make life easier for data scientists by preparing raw data for analysis using different processing techniques at different steps. These steps need to be combined to create pipelines, which is where automation comes into play. Finally, data engineers use parallel and cloud computing to keep pipelines flowing smoothly.

## A general definition

- Data processing: converting raw data into meaningful information

## Data processing value

### Conceptually
- Remove unwanted data
- Optimize memory, process and network costs
- Convert data from one type to another
- Organize data
- Fit data into a schema/structure
- Increase productivity

### At Spotflix
- No long term need for testing feature data
- Can't afford to store and stream files this big
- Convert songs from `.flac` to `.ogg`
- Reorganize data from the data lake to data warehouses
- Employee table example
- Enable data scientists

## How data engineers process data

- Data manipulation, cleaning, and tidying tasks
    - that can be automated
    - that will always need to be done
- Store data in a sanely structured database
- Create views on top of the database tables
- Optimize the performance of the database
- Reject corrupt song files
- Decide what happens with missing metadata
- Separate artists and albums tables...
- ...but provide a view combining them
- Add indexes
## Scheduling

- Can apply to any task listed in data processing
- Scheduling is the glue of your system
- Holds all the pieces and organizes how they work together
- Runs tasks in a specific order and resolves all dependencies

Ways to trigger a task:

- Manually
- At a specific time
- When a condition is met (sensor scheduling)
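
Resolving dependencies essentially means finding an order in which every task runs after the tasks it depends on. The sketch below shows the idea with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The task names are hypothetical, and this is not how any particular scheduling tool is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks, each mapped to the set of tasks it depends on.
dependencies = {
    "extract_songs": set(),
    "reject_corrupt_files": {"extract_songs"},
    "convert_to_ogg": {"reject_corrupt_files"},
    "load_to_warehouse": {"convert_to_ogg"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

A real scheduler adds the triggers on top of this ordering: it decides when the first task starts, then runs the rest as their dependencies complete.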

## Batches and streams

- Batches
    - Group records at intervals
    - Often cheaper
- Streams
    - Send individual records right away
    
- Songs uploaded by artists
- Employee table
- Revenue table
- New users signing in
- Another example: online vs. offline listening
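
The difference between the two modes can be sketched in a few lines. The `process` function is a stand-in for any processing step, and the record names and batch size are arbitrary assumptions.

```python
def process(records):
    """Stand-in for a processing step; returns how many records it handled."""
    return len(records)


def batches(records, size):
    """Yield consecutive groups of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


incoming = ["signup-1", "signup-2", "signup-3", "signup-4", "signup-5"]

# Batch: group records and process each group in one call.
batch_calls = [process(batch) for batch in batches(incoming, size=2)]

# Stream: process each record individually, as soon as it arrives.
stream_calls = [process([record]) for record in incoming]

print(len(batch_calls), "batch calls vs.", len(stream_calls), "stream calls")
```

Fewer, larger calls are one reason batch processing is often cheaper, while streaming keeps the delay per record as low as possible.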

## Scheduling tools

- Apache Airflow
- Luigi


## Parallel computing

- Basis of modern data processing tools
- Necessary:
    - Mainly because of memory
    - Also for processing power
- How it works:
    - Split tasks into several smaller subtasks
    - Distribute these subtasks over several computers
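
The split-and-distribute idea can be sketched on a single machine. In this example, threads from `concurrent.futures.ThreadPoolExecutor` stand in for the several computers mentioned above, which is a simplifying assumption; the `split` and `count_words` helpers and the sample lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Subtask: count the words in one chunk of lines."""
    return sum(len(line.split()) for line in chunk)


def split(items, parts):
    """Split a list into at most `parts` chunks of roughly equal size."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


lines = ["la la la", "do re mi fa", "one two", "three"]

# Split the task into smaller subtasks...
chunks = split(lines, parts=2)

# ...distribute them over workers (threads here, computers in a real cluster)...
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# ...and combine the partial results.
total = sum(partial_counts)
print(partial_counts, total)
```

Each worker only needs its own chunk in memory, which is the memory benefit; sending chunks out and collecting results back is the communication cost.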

## Benefits and risks of parallel computing

- Employees = processing units
- Advantages
    - Extra processing power
    - Reduced memory footprint
- Disadvantages
    - Moving data incurs a cost
    - Communication time

## Cloud computing for data processing


|           Servers                            |            Servers on the cloud    |
|:--------------------------------------------:|:----------------------------------:|
| Bought                                       | Rented                             |
| Need space                                   | Don't need space                   |
| Electrical and maintenance cost              | Use just the resources we need     |
| Enough power for peak moments                | When we need them                  |
| Processing power unused at quieter times     | The closer to the user the better  |
    

## Cloud computing for data storage

- Database reliability: data replication
- Risk with sensitive data

## Major cloud vendors

<img style="float: left; " src="images/cloud-computing.png" alt="cloud-computing" width="700"/>

## Multicloud


|       Pros                             |         Cons                             |
|:---------------------------------------|-----------------------------------------:|
| Reducing reliance on a single vendor   | Cloud providers try to lock in consumers |
| Cost-efficiencies                      | Incompatibility                          |
| Local laws requiring certain data to be physically present within the country | Security and governance |
| Mitigating against disasters           |                                           |
