# Data Pipeline & Data Warehousing (DWH)
    Compiled by: Alem Fitwi 
    Binghamton, New York
    September 2020

# 1. What is a Data Pipeline (DPL)?
- A mechanism/scheme for transferring data from its source (point of creation) to the point of transformation, and then on to the point of consumption (sink).
- It is like fetching water in the old days, or like the water pipes that bring water to our homes for consumption today.

- DPL is a mechanism to transfer data from point A to point Z

   $$A\ \rightarrow\ B\ \rightarrow\ C\ \rightarrow\ D\ ...\ \rightarrow\ Z$$
   
           A  --> Data Producer
           B, C, .. --> Data Pipeline
           Z --> Data Consumer/Sink
- The intermediate points (B, C, ...) are where the following transformations take place:
    - Data Cleansing
    - Data Governance
    - Data Enrichment
    - Data Processing
<img src = './figs/dpl1.png'>
             
                         Fig 1.1 Water fetching/Water Pipelines
                         
- Analogy:
    - Data Source --> Water Source
    - Data Pipeline --> Water-pipes
    - Sinks --> Consumers
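
The A → B → C → ... → Z chain described in this section can be sketched with plain Python generators. This is only a minimal illustration: the sample records, the field names, and the stage functions (`produce`, `cleanse`, `enrich`, `sink`) are hypothetical and are not part of any particular pipeline tool.

```python
from typing import Iterable, Iterator


def produce() -> Iterator[dict]:
    """Point A: the data producer (hypothetical sample records)."""
    yield {"user": " alice ", "amount": "10.5"}
    yield {"user": "", "amount": "3"}  # invalid record: empty user
    yield {"user": "Bob", "amount": "7"}


def cleanse(records: Iterable[dict]) -> Iterator[dict]:
    """Point B: data cleansing. Trim whitespace, drop records without a user."""
    for rec in records:
        user = rec["user"].strip()
        if user:
            yield {**rec, "user": user}


def enrich(records: Iterable[dict]) -> Iterator[dict]:
    """Point C: enrichment/processing. Normalize names, parse amounts."""
    for rec in records:
        yield {"user": rec["user"].title(), "amount": float(rec["amount"])}


def sink(records: Iterable[dict]) -> list:
    """Point Z: the consumer. Here it simply collects the records."""
    return list(records)


# Data flows from the producer, through the pipeline stages, into the sink.
result = sink(enrich(cleanse(produce())))
print(result)
```

Because each stage is a generator, records flow through one at a time, much like water through a pipe, and a stage can be added or removed without touching the others.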

# 2. Types of Data Pipeline
1. **Batch**: Batch processing refers to processing a high volume of data in batches within a specific time span.
    - Batch Data Pipeline
2. **Streaming**: Stream processing refers to processing a continuous stream of data immediately as it is produced.
    - Real-Time Data Pipeline
    - Uses real-time message broker services
3. **Lambda Architecture**: Real Time + Batch
    - The best of both worlds
    
                            --------------->  Batch Layer----> Service Layer    <------>
                           |                  Master Data      Batch View 1, 2, 3       |
                           |                                                            |
                        Data Source                                                   Query
                           |                                                             |
                           |                 Speed Layer                                 |
                           |                 Real-time View                              |
                            ---------------> From T0 to T1 ... <------------------------>
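
A rough sketch of the Lambda Architecture diagram above, assuming a tiny in-memory list of events. The event data, the cut-off time `T0`, and the names `batch_view`, `SpeedLayer`, and `query` are illustrative assumptions: the batch layer recomputes totals from the master data up to `T0`, the speed layer updates totals incrementally for newer events, and a query merges both views.

```python
from collections import defaultdict

# Hypothetical (timestamp, key, value) events from the data source.
events = [
    (1, "clicks", 2), (2, "clicks", 3), (3, "views", 5),
    (4, "clicks", 1), (5, "views", 4),
]


def batch_view(all_events, up_to):
    """Batch layer: recompute totals from the master data up to time `up_to`."""
    totals = defaultdict(int)
    for ts, key, value in all_events:
        if ts <= up_to:
            totals[key] += value
    return dict(totals)


class SpeedLayer:
    """Speed layer: update totals incrementally as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, key, value):
        self.totals[key] += value


def query(batch, realtime, key):
    """Query side: merge the batch view with the real-time view."""
    return batch.get(key, 0) + realtime.get(key, 0)


T0 = 3  # the last batch run covered events up to this timestamp
batch = batch_view(events, up_to=T0)

speed = SpeedLayer()
for ts, key, value in events:
    if ts > T0:  # only events the batch layer has not processed yet
        speed.on_event(key, value)

clicks = query(batch, speed.totals, "clicks")
views = query(batch, speed.totals, "views")
print(clicks, views)
```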


                            
 <img src = './figs/dpl2.png'>
 
                                 Fig 1.2 Common Data Pipeline Look


- **ETL - Extract, Transform, & Load**: Traditional data processing model
- **Apache Kafka**: Streaming message service
    - An open-source stream-processing platform for handling real-time data feeds.
    
- **Master Data Management (MDM)** helps create a single master reference source for all critical business data, leading to fewer errors and less redundancy in business processes.
- **Data Lake**: A data lake is ideal for users who perform deep analysis. Such users include data scientists who need advanced analytical tools with capabilities such as predictive modeling and statistical analysis.
    - A data lake is a storage repository that can store a large amount of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. Holding large quantities of data in one place supports analytical performance and native integration.
    - A data lake is like a large container, very similar to a real lake fed by rivers. Just as a lake has multiple tributaries flowing in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in, in real time.
    - A data lake holds raw data in its original format until it is needed. Every data element in a data lake is given a unique identifier and tagged with a set of extended metadata tags. It offers a wide variety of analytic capabilities.
- **Data Warehouse**: The data warehouse is ideal for operational users because it is well structured and easy to use and understand.
    - A data warehouse stores data in files or folders, which helps to organize and use the data for strategic decisions. It also gives a multi-dimensional view of atomic and summary data. The important functions it needs to perform are:
        - Data Extraction
        - Data Cleaning
        - Data Transformation
        - Data Loading and Refreshing
- **Key Differences**
    - A data lake stores all data irrespective of its source and structure, whereas a data warehouse stores data as quantitative metrics with their attributes.
    - A data lake is a storage repository for huge volumes of structured, semi-structured, and unstructured data, while a data warehouse is a blend of technologies and components that enables the strategic use of data.
    - A data lake defines the schema after data is stored, whereas a data warehouse defines the schema before data is stored.
    - A data lake uses the ELT (Extract, Load, Transform) process, while a data warehouse uses the ETL (Extract, Transform, Load) process.
    - A data lake is ideal for those who want in-depth analysis, whereas a data warehouse is ideal for operational users.
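
The schema difference in the list above (schema defined before storing versus after storing) can be illustrated with a small sketch. Everything here is a simplification for illustration: the "warehouse" and "lake" are plain Python lists, and `WAREHOUSE_SCHEMA`, the sample JSON lines, and the function names are hypothetical.

```python
import json

# Hypothetical warehouse schema: column name -> type used for coercion.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}


def load_into_warehouse(table, record):
    """Schema defined BEFORE storing: validate and coerce, reject bad rows."""
    row = {}
    for column, col_type in WAREHOUSE_SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = col_type(record[column])
    table.append(row)


def load_into_lake(lake, raw_line):
    """Schema defined AFTER storing: keep the raw payload untouched."""
    lake.append(raw_line)


def read_orders_from_lake(lake):
    """Apply a schema only at read time, skipping payloads that do not fit."""
    for raw_line in lake:
        doc = json.loads(raw_line)
        if "order_id" in doc and "amount" in doc:
            yield {"order_id": int(doc["order_id"]), "amount": float(doc["amount"])}


raw_lines = [
    '{"order_id": "1", "amount": "9.99", "note": "gift"}',
    '{"event": "page_view", "url": "/home"}',
]

lake, warehouse, rejected = [], [], []
for line in raw_lines:
    load_into_lake(lake, line)  # the lake accepts everything
    try:
        load_into_warehouse(warehouse, json.loads(line))
    except ValueError:
        rejected.append(line)  # the warehouse rejects rows that break the schema

lake_orders = list(read_orders_from_lake(lake))
print(len(lake), warehouse, len(rejected), lake_orders)
```

The lake keeps both payloads and defers the work to read time, while the warehouse does the work up front and only ever holds rows that match its schema.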

### Key differences between data lakes and data warehouses:

#### Storage
- Data Lake: All data is kept irrespective of its source and structure. Data is kept in its raw form and is only transformed when it is ready to be used.
- Data Warehouse: Consists of data extracted from transactional systems, or data made up of quantitative metrics with their attributes. The data is cleaned and transformed.

#### History
- Data Lake: The big data technologies used in data lakes are relatively new.
- Data Warehouse: The data warehouse concept, unlike big data, has been in use for decades.

#### Data Capturing
- Data Lake: Captures all kinds of data (structured, semi-structured, and unstructured) in their original form from source systems.
- Data Warehouse: Captures structured information and organizes it in schemas defined for data warehouse purposes.

#### Data Timeline
- Data Lake: Can retain all data. This includes not only data that is in use but also data that might be used in the future. Data is also kept for all time, so that one can go back in time and do an analysis.
- Data Warehouse: In the data warehouse development process, significant time is spent on analyzing various data sources.

#### Users
- Data Lake: Ideal for users who perform deep analysis. Such users include data scientists who need advanced analytical tools with capabilities such as predictive modeling and statistical analysis.
- Data Warehouse: Ideal for operational users because it is well structured and easy to use and understand.

#### Storage Costs
- Data Lake: Storing data with big data technologies is relatively inexpensive compared with storing data in a data warehouse.
- Data Warehouse: Storing data in a data warehouse is costlier and more time-consuming.

#### Task
- Data Lake: Can contain all data and data types; it lets users access data before it has been transformed, cleansed, and structured.
- Data Warehouse: Can provide insights into pre-defined questions for pre-defined data types.

#### Processing Time
- Data Lake: Lets users access data before it has been transformed, cleansed, and structured. Thus, users get to their results more quickly compared to the traditional data warehouse.
- Data Warehouse: Offers insights into pre-defined questions for pre-defined data types, so any change to the data warehouse needs more time.

#### Position of Schema
- Data Lake: Typically, the schema is defined after data is stored. This offers high agility and ease of data capture but requires work at the end of the process.
- Data Warehouse: Typically, the schema is defined before data is stored. This requires work at the start of the process, but offers performance, security, and integration.

#### Data Processing
- Data Lake: Uses the ELT (Extract, Load, Transform) process.
- Data Warehouse: Uses the traditional ETL (Extract, Transform, Load) process.

#### Complaint
- Data Lake: Data is kept in its raw form. It is only transformed when it is ready to be used.
- Data Warehouse: The chief complaint against data warehouses is the difficulty, or outright inability, of making changes to them.

#### Key Benefits
- Data Lake: Users can integrate different types of data to come up with entirely new questions. Such users are unlikely to use data warehouses because they may need to go beyond a warehouse's capabilities.
- Data Warehouse: Most users in an organization are operational. These users only care about reports and key performance metrics.

# 3. Data Pipeline Designing
### Generic:
<img src = './figs/dpl3.png'>

### Specific Example:

<img src = './figs/dpl4.png'>

- **Google Cloud Data Fusion**: No programming needed (high-level)
    - Fully managed, cloud-native data integration at any scale.
    - New customers get $300 in free credits to spend on Google Cloud during the first 90 days. All customers get the first 120 hours of pipeline development per month, per account, free of charge.
    - Visual point-and-click interface enabling code-free deployment of ETL/ELT data pipelines
    - Broad library of 150+ pre-configured connectors and transformations, at no additional cost
    - Natively integrated best-in-class Google Cloud services
    - End-to-end data lineage for root cause and impact analysis
    - Built with an open source core (CDAP) for pipeline portability


- **Google Cloud Dataflow**: Requires advanced programming skills
    - Unified stream and batch data processing that's serverless, fast, and cost-effective.
    - New customers get $300 in free credits to spend on Dataflow or other Google Cloud products during the first 90 days. 


- **Google Cloud Looker**:
    - Looker is a business intelligence and big data analytics platform that helps you explore, analyze, and share real-time business analytics easily.

- **Google Cloud Data Studio**: Lets you easily gather and use all your insights from CSVs, Analytics, Google Ads, Google Sheets, BigQuery, and other sources.

- **Google Cloud AutoML**: Trains high-quality custom machine learning models with minimal effort and machine learning expertise.

- **Google Cloud Data Catalog**: A fully managed and highly scalable data discovery and metadata management service.

# 4. Common Data Pipeline Design Patterns

### Data Pipeline Components
- Data Pipeline
    - Batch
    - Streaming
    - Lambda Architecture
- DAG: Directed Acyclic Graph, flowing from upstream to downstream
- Source: point of data creation/production
- Sink: point of data consumption/usage
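
To make the DAG idea concrete, here is a small sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names and the `run_task` helper are hypothetical; a real orchestrator would run actual extraction, transformation, and loading jobs instead of just recording names.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of upstream tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load": {"transform"},
}

run_log = []


def run_task(name):
    """Stand-in for real work: just record that the task ran."""
    run_log.append(name)


def is_acyclic(graph):
    """Return True if the graph has no cycles, i.e. it is a valid DAG."""
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False


# Run every task only after all of its upstream tasks have run.
for task in TopologicalSorter(dag).static_order():
    run_task(task)

print(run_log)
```

The relative order of the two independent extract tasks is not guaranteed, only that both finish before `transform`, and that `transform` finishes before `load`.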

### ETL Design Pattern
- A very traditional Pattern
<img src = './figs/dpl5.png'>

- Traditional DWH design
- A DAG can be a combination of sub-DAGs
- DAGs are orchestrated (scheduled to extract, apply some transformation, and load)
    - Apache Airflow: for scheduling events/orchestrating
- Source: fetching the data must not impact the operational (source) system
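
A minimal ETL sketch, assuming two in-memory SQLite databases stand in for the operational source and the warehouse. The table names, columns, and sample rows are hypothetical. The point is the order of the steps: the data is transformed in application code before it is loaded, and the source is only read, never modified.

```python
import sqlite3


def extract(source):
    """Extract: read rows from the operational system with a read-only query."""
    return source.execute("SELECT id, name, amount_cents FROM orders").fetchall()


def transform(rows):
    """Transform: clean names and convert cents to currency units."""
    return [(oid, name.strip().upper(), cents / 100) for oid, name, cents in rows]


def load(warehouse, rows):
    """Load: write the already-transformed rows into the warehouse table."""
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    warehouse.commit()


# Hypothetical operational source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1050), (2, "bob", 700)],
)

# Hypothetical warehouse with a pre-defined schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, customer TEXT, amount REAL)")

load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY id").fetchall()
print(loaded)
```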

### ELT Design Pattern
<img src = './figs/dpl6.png'>

- Good Design to store data in Raw state (Data Lakes)
- Helps in CDC Design
- Transformation via SQL Queries (Business Logic)
- Needs More Computational Power <-- downside
- Suited For Cloud --> ELT is mostly used in cloud solutions

        Cloud SQL --> Cloud Storage ---> BigQuery (has massive processing resources)
        
- CDC is built on top of ELT Design pattern
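
For contrast with ETL, here is a minimal ELT sketch in which one in-memory SQLite database stands in for the cloud warehouse (an assumption for illustration; the notes mention BigQuery as the real target). The table names and rows are hypothetical. The raw data is loaded first, untouched, and the business logic then runs as SQL inside the warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse

# Extract + Load: land the source rows in a raw table without changing them.
db.execute("CREATE TABLE raw_orders (id INTEGER, name TEXT, amount_cents INTEGER)")
raw_rows = [(1, " alice ", 1050), (2, "bob", 700), (3, " alice ", 250)]
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: business logic expressed as a SQL query run inside the warehouse.
db.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT UPPER(TRIM(name)) AS customer,
           SUM(amount_cents) / 100.0 AS total_amount
    FROM raw_orders
    GROUP BY UPPER(TRIM(name))
    """
)

totals = db.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
raw_count = db.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(totals, raw_count)
```

The raw table is still available after the transformation, so a new business question can be answered later with another query instead of a new extraction.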

### CDC (Change Data Capture) 
<img src = './figs/dpl7.png'>

- Track changes @ Sources
- Get the latest record
- Based on ELT Pattern
- CDC Tools: Oracle CDC, Attunity CDC
- Read from DB Logs
- Take the record with the maximum version
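
The "take the record with the maximum version" step can be sketched as follows. The change events, their field names (`id`, `version`, `op`, `email`), and the helper functions are hypothetical; a real CDC tool would read such events from the database logs.

```python
# Hypothetical change events, as they might be read from a database log.
change_log = [
    {"id": 1, "version": 1, "op": "INSERT", "email": "a@old.example"},
    {"id": 2, "version": 1, "op": "INSERT", "email": "b@example.com"},
    {"id": 1, "version": 2, "op": "UPDATE", "email": "a@new.example"},
    {"id": 2, "version": 2, "op": "DELETE", "email": None},
    {"id": 3, "version": 1, "op": "INSERT", "email": "c@example.com"},
]


def latest_records(events):
    """Keep, for each key, only the event with the maximum version."""
    latest = {}
    for event in events:
        current = latest.get(event["id"])
        if current is None or event["version"] > current["version"]:
            latest[event["id"]] = event
    return latest


def current_state(events):
    """Build the current table state, dropping keys whose latest op is DELETE."""
    return {
        key: event["email"]
        for key, event in latest_records(events).items()
        if event["op"] != "DELETE"
    }


state = current_state(change_log)
print(state)
```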

### EtLT Design Pattern
- The lowercase 't' is a mini-transform
- Disconnected from business logic
- Performs the mini 't' at an early stage of the pipeline
- Can help with latency

<img src = './figs/dpl8.png'>

- Mini-transformations prevent garbage from moving downstream:
    - Data deduplication
    - Parsing URL parameters
    - Masking or hiding sensitive data (GDPR compliance)
    - No business logic is performed here
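
A small sketch of such a mini-transformation step, using only the standard library. The record layout (`event_id`, `url`, `email`) and the functions `mask_email` and `mini_transform` are hypothetical. Truncated SHA-256 hashing is used here purely as a simple stand-in for masking; it is an assumption for illustration and not a statement of what GDPR requires.

```python
import hashlib
from urllib.parse import parse_qs, urlparse


def mask_email(email):
    """Replace an e-mail address with a short one-way digest (illustrative only)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def mini_transform(records):
    """Deduplicate, parse URL parameters, and mask sensitive fields."""
    seen = set()
    for rec in records:
        if rec["event_id"] in seen:  # drop duplicates
            continue
        seen.add(rec["event_id"])
        parsed = urlparse(rec["url"])
        params = parse_qs(parsed.query)
        yield {
            "event_id": rec["event_id"],
            "path": parsed.path,
            "utm_source": params.get("utm_source", [None])[0],
            "user": mask_email(rec["email"]),  # the raw e-mail is not passed on
        }


raw = [
    {"event_id": "e1", "url": "https://shop.example/cart?utm_source=mail&x=1",
     "email": "a@example.com"},
    {"event_id": "e1", "url": "https://shop.example/cart?utm_source=mail&x=1",
     "email": "a@example.com"},  # duplicate delivery of the same event
    {"event_id": "e2", "url": "https://shop.example/home",
     "email": "b@example.com"},
]

clean = list(mini_transform(raw))
print(clean)
```

Note that nothing here encodes business rules; the step only tidies the data so that later stages receive clean input.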

# 5. Client-Server Architecture For Python Objects Transfer
- Serialize Python objects (list, dictionary, .mat, dataframe ...) @Tx (transmitter)
- Deserialize Python objects (list, dictionary, .mat, dataframe ...) @Rx (receiver)
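
The two steps above can be tried without any network at all. The sketch below round-trips a hypothetical `payload` dictionary through `pickle.dumps` (what the transmitter would do) and `pickle.loads` (what the receiver would do). Keep in mind that unpickling data from an untrusted source is unsafe, so this approach assumes both ends are trusted.

```python
import pickle

# Hypothetical Python object to transfer.
payload = {"ids": [1, 2, 3], "meta": {"source": "sensor-A"}, "ok": True}

# @Tx: serialize the object into bytes that can be written to a socket.
wire_bytes = pickle.dumps(payload)

# @Rx: deserialize the received bytes back into an equal Python object.
restored = pickle.loads(wire_bytes)

print(type(wire_bytes).__name__, restored == payload)
```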

### Server

In [1]:
import socket
import pickle

RECV_BYTES = 1024**2  # Maximum number of bytes read per recv() call

# Create Server Socket Object (IPv4, TCP)
serv_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Get IP and Port Tuple (loopback address, matching the client below)
host = '127.0.0.1'
port = 55555

# Bind the port and IP
serv_socket.bind((host, port))

# Listen for upcoming requests (queue up to 5 pending connections)
serv_socket.listen(5)

# Serve clients in an infinite loop; stop with Ctrl+C
try:
    while True:
        conn, addr = serv_socket.accept()
        with conn:  # Close the connection when done with this client
            # Serialize the Python object and send it (@Tx)
            msg = 'Hello'
            conn.sendall(pickle.dumps(msg))

            # Receive the client's bytes and deserialize them (@Rx)
            data = conn.recv(RECV_BYTES)
            if data:
                print(addr, pickle.loads(data))
finally:
    serv_socket.close()

### Client

In [None]:
import socket
import pickle

# Create Client Socket Object (IPv4, TCP)
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Define the port
port = 55555

# Connect to the server on the local computer
client_socket.connect(('127.0.0.1', port))

# Serialize the Python object and send it to the server (@Tx)
msg = 'Hi, from client'
client_socket.sendall(pickle.dumps(msg))

# Receive data from the server and deserialize it (@Rx)
RECV_BYTES = 1024**2
reply = pickle.loads(client_socket.recv(RECV_BYTES))
print(reply)

# Close the connection
client_socket.close()
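
One caveat with the cells above: TCP is a byte stream, so a single `recv()` call is not guaranteed to return one complete pickled object, especially for large lists or dataframes. A common workaround, sketched here as an assumption rather than as part of the original notebook, is to prefix every message with its length. The helper names (`send_obj`, `recv_exact`, `recv_obj`) are hypothetical, and `socket.socketpair()` is used only so the sketch runs in a single process without a separate server.

```python
import pickle
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian unsigned length prefix


def send_obj(sock, obj):
    """Serialize `obj` and send it as: length prefix + pickled bytes."""
    data = pickle.dumps(obj)
    sock.sendall(HEADER.pack(len(data)) + data)


def recv_exact(sock, n):
    """Read exactly `n` bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_obj(sock):
    """Read one length-prefixed message and deserialize it."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return pickle.loads(recv_exact(sock, length))


# Two connected sockets in one process stand in for the client and server.
tx, rx = socket.socketpair()
try:
    send_obj(tx, [1, 2, 3])
    send_obj(tx, {"rows": list(range(1000))})
    first = recv_obj(rx)   # message boundaries are preserved
    second = recv_obj(rx)
finally:
    tx.close()
    rx.close()

print(first, len(second["rows"]))
```

As with any use of `pickle`, this only makes sense between endpoints that trust each other.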

                                       ~END~