LinkedIn: apache-spark-essential-training-big-data-engineering

**Data Engineering**  
- Focuses on data
- Covers capture, movement, storage, security and processing
- Converts raw data into knowledge


**Data Engineers**
- Build data pipelines, applications and APIs


**Data Engineering Pipeline**
- Stages
1. Acquisition
2. Transport
3. Storage
4. Processing
5. Servicing


#### 1. Acquisition
- Format of data
- Interfaces available to get access to data
- Security (authorisation, authentication, encryption)
- Reliability
- Latency


#### 2. Transport
- Reliability and Integrity
- Security
- Latency
- Cost

#### 3. Storage
- Flexibility and ease of processing (keep data in its native form, or summarize it first)
- Schema-based or schemaless design
- High Availability / Redundancy
- Cost

#### 4. Processing
- Cleaning to remove inconsistency and bad data
- Filtering - choosing relevant data
- Enriching (data joins and denormalization)
- Aggregating
- Machine Learning
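
The processing steps above can be sketched in plain Python on a handful of records. This is a minimal illustration, not Spark code: the record fields (`user_id`, `amount`), the `USER_COUNTRY` lookup table, the `min_amount` threshold and the `process` function are hypothetical names made up for the example.

```python
from collections import defaultdict

# Hypothetical raw records; None and negative amounts stand in for bad data.
RAW_ORDERS = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 2, "amount": None},   # bad record: missing amount
    {"user_id": 1, "amount": 5.0},
    {"user_id": 3, "amount": -4.0},   # bad record: negative amount
    {"user_id": 3, "amount": 0.5},    # valid, but below the relevance threshold
    {"user_id": 2, "amount": 12.5},
]

# Hypothetical lookup table used for enrichment (a join against reference data).
USER_COUNTRY = {1: "IN", 2: "US", 3: "US"}


def process(orders, min_amount=1.0):
    """Clean, filter, enrich and aggregate orders; returns total amount per country."""
    # Cleaning: drop records with missing or negative amounts.
    cleaned = [o for o in orders if o["amount"] is not None and o["amount"] >= 0]
    # Filtering: keep only the records relevant to this analysis.
    relevant = [o for o in cleaned if o["amount"] >= min_amount]
    # Enriching: join each record with the lookup table (denormalization).
    enriched = [{**o, "country": USER_COUNTRY[o["user_id"]]} for o in relevant]
    # Aggregating: sum amounts per country.
    totals = defaultdict(float)
    for o in enriched:
        totals[o["country"]] += o["amount"]
    return dict(totals)


totals_by_country = process(RAW_ORDERS)
```

Machine learning is left out of the sketch; it would typically consume the cleaned and enriched records rather than the aggregated totals.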

#### 5. Servicing
- Latency
- Redundancy and High Availability
- Skill Levels of consumers
- Flexibility of schema
- APIs

#### Big Data is classified on the attributes below
- Volume (resources required, scalability, maintaining latency)
- Velocity (real-time event data handling, need for speed, handling lags)
- Variety (text, audio, video and images; more resources needed; serving at low latency)
- Variability (spikes in load, decoupling needed with buffering zones, maintaining latency)

#### Pipeline
- Functionality
- Speed
- Reliability
- Security
- Availability


Apache Spark - Data processing engine  
Apache Kafka - Data Acquisition and Transport Layer  
HDFS/MySQL - Storage  

**Apache Kafka Connect**
- scalable distributed pipeline for moving data
- Data Source -> Kafka -> Data Sink

# Apache Spark
- presented in the course as the best tool for data engineering  

Advantages:  
1. Built as a compute engine
2. Faster Data processing
3. Massive Horizontal scalability
4. Streaming Support
5. Machine Learning Libraries (MLlib, also usable from PySpark)
6. Third Party integrations 

Features:  
1. Spark Transformations - record-level processing in a distributed fashion, used for (1. data cleansing, 2. data validation, 3. filtering, 4. joining and enriching data, 5. aggregation)
2. Spark Actions - extract meaningful information from a massive dataset (1. metrics, 2. aggregating data to provide summaries, 3. moving data to external systems such as file systems and databases)
3. Spark Broadcast variables and accumulators - minimize the data sent across the network and gather system-wide metrics (1. lookup tables, 2. shared variables, 3. summary metrics, 4. data consolidation)
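
The idea behind broadcast variables and accumulators can be sketched in plain Python, without a Spark cluster. This is only an illustration under stated assumptions: the `COUNTRY_NAMES` lookup table, the `PARTITIONS` data and the `process_partition` function are hypothetical names invented for the example. In PySpark the corresponding entry points are `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
# Hypothetical lookup table that every worker needs (the "broadcast variable").
COUNTRY_NAMES = {"IN": "India", "US": "United States"}

# Hypothetical dataset, already split into partitions (one list per worker).
PARTITIONS = [
    ["IN", "US", "XX"],
    ["US", "US", "IN", "??"],
]


def process_partition(codes, lookup):
    """Runs on one worker: map codes to names and count unknown codes locally."""
    names, unknown = [], 0
    for code in codes:
        if code in lookup:
            names.append(lookup[code])
        else:
            unknown += 1  # local contribution to the summary metric
    return names, unknown


# "Broadcast": the lookup table is handed to each partition once,
# not attached to every individual record.
results = [process_partition(part, COUNTRY_NAMES) for part in PARTITIONS]

# "Accumulator": the driver merges the per-partition counts into one metric.
unknown_total = sum(unknown for _, unknown in results)
resolved_names = [name for names, _ in results for name in names]
```

The point of the design is that the shared table travels once per worker and only a small count travels back, instead of reference data being shipped with every record.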
 


How does Spark work? - Stages
1. Acquire Data
2. Create RDD
3. Transformation
4. Shuffle
5. Action to store

<img src="Image/spark-transformation.JPG" width="600" />

- No Shuffle - functions that run on each record in the dataset independently do not cause a shuffle (e.g., map, filter, flatMap, mapPartitions)
- Shuffle Transformations - functions that need data to be consolidated in some fashion and cross-referenced between RDDs for that consolidation (e.g., distinct, groupByKey, reduceByKey, join)
- Actions move data from the cluster back to the single driver node or out to external storage - keep the returned data minimal to avoid huge network traffic (e.g., reduce, collect, count, saveAsTextFile)
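
The difference between the three kinds of operations above can be sketched in plain Python by modelling partitions as lists. This is a toy model, not Spark's implementation: `map_partitions`, `shuffle_by_key` and `reduce_by_key` are hypothetical helpers, and the routing rule in `shuffle_by_key` is an assumption chosen only to be deterministic.

```python
from collections import defaultdict

# Hypothetical dataset of (word, count) pairs split across two partitions.
PARTITIONS = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1), ("a", 1)],
]


def map_partitions(partitions, fn):
    """No shuffle: each record is transformed inside its own partition."""
    return [[fn(record) for record in part] for part in partitions]


def shuffle_by_key(partitions, num_partitions):
    """Shuffle: route every record to the partition that owns its key."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Deterministic routing so all records with the same key meet.
            out[ord(key[0]) % num_partitions].append((key, value))
    return out


def reduce_by_key(partitions, num_partitions=2):
    """Shuffle transformation: consolidate values per key after the shuffle."""
    reduced = []
    for part in shuffle_by_key(partitions, num_partitions):
        sums = defaultdict(int)
        for key, value in part:
            sums[key] += value
        reduced.append(dict(sums))
    return reduced


# Narrow step: partitions keep their shape, no records move.
doubled = map_partitions(PARTITIONS, lambda kv: (kv[0], kv[1] * 2))
# Wide step: records with the same key are brought together.
counts = reduce_by_key(PARTITIONS)
# Action: bring the (small) per-partition results back to the driver.
collected = {k: v for part in counts for k, v in part.items()}
```

Note that only the already-reduced totals reach the driver in the last step, which is the "minimal data" advice from the list in practice.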

Lazy Evaluation  
- Spark executes transformations only when an action is executed on the resulting RDDs
- Spark optimizes execution of all statements in the batch when it executes the action.
- The more statements to execute, the better the chance to optimize  


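To take advantage of Lazy Evaluation  
- trigger execution with an action only when the result is actually needed
- chain as many transformations as possible before calling an action
- avoid debugging statements such as printing a count, because each one is an action that forces execution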
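
Lazy evaluation can be illustrated with a small plain-Python class that only records transformations and runs them when an action is called. `LazyDataset` is a hypothetical toy stand-in for an RDD, written for this note; it does not reproduce Spark's optimizer, only the deferral of work.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: only remember what to do; nothing runs yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also just recorded.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run all recorded steps in a single pass over the data.
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


calls = []  # records every time the mapping function actually runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # 0: no action has been called yet
result = pipeline.collect()       # the action triggers the whole chain
```

Because the whole chain is known before anything runs, the toy `collect` can apply the map and the filter in one pass; this is the same reason a longer chain gives Spark more room to optimize.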

What are Spark Dependencies?  
- A dependency arises when a transformation is executed and an RDD is created from another RDD
- The key question: does the transformation result in a shuffle?
    - Wide Dependency - yes, shuffle (e.g., RDD13 needs data from both worker nodes)
    - Narrow Dependency - no shuffle (e.g., RDD11 -> RDD12)
    - Wide Dependencies cause data to flow between worker nodes, which is expensive and time-consuming
    

How to optimize for Dependencies?
- do as many Narrow Dependency steps as possible before hitting a Wide Dependency
- try to group Wide Dependencies together, possibly in a single function, and do the shuffle once
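
Why narrow steps should come first can be sketched by counting how many records cross the "network" in a toy model. The data and the helpers `group_by_key` and `keep_positive` are hypothetical; the shuffle is simulated by a counter rather than real data movement.

```python
from collections import defaultdict

# Hypothetical (key, value) records spread over two partitions.
PARTITIONS = [
    [("a", 1), ("b", -3), ("a", 4), ("c", 7)],
    [("b", 2), ("c", -5), ("a", -2), ("b", 6)],
]


def group_by_key(partitions):
    """Wide step: every record is moved so values with the same key meet.

    Returns the groups and the number of records that were shuffled.
    """
    groups = defaultdict(list)
    shuffled = 0
    for part in partitions:
        for key, value in part:
            shuffled += 1
            groups[key].append(value)
    return dict(groups), shuffled


def keep_positive(partitions):
    """Narrow step: filter each partition locally, no data movement."""
    return [[(k, v) for k, v in part if v > 0] for part in partitions]


# Option A: wide step first, filter afterwards (all records are shuffled).
groups_a, shuffled_a = group_by_key(PARTITIONS)
result_a = {k: sum(v for v in vals if v > 0) for k, vals in groups_a.items()}

# Option B: narrow filter first, then the wide step (fewer records shuffled).
groups_b, shuffled_b = group_by_key(keep_positive(PARTITIONS))
result_b = {k: sum(vals) for k, vals in groups_b.items()}
```

Both options produce the same sums, but option B shuffles only the records that survive the filter.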

Accumulators
- a standard accumulator carries only a single value, such as an integer or float, between the driver program and the worker nodes
- to pass multiple values, such as a list, write your own accumulator implementation
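
A custom list accumulator can be sketched as follows. The class mirrors the `zero` / `addInPlace` method shape of PySpark's `AccumulatorParam`, but it is plain Python that does not import Spark; `ListAccumulatorParam` and `run_tasks` are hypothetical names, and the worker/driver split is only simulated.

```python
class ListAccumulatorParam:
    """Hypothetical helper with the zero/addInPlace shape of AccumulatorParam."""

    def zero(self, initial_value):
        # Starting value that each worker begins from.
        return []

    def addInPlace(self, left, right):
        # Merge two partial lists into one.
        left.extend(right)
        return left


def run_tasks(partitions, param):
    """Simulate workers collecting rejected records, then the driver merging them."""
    partials = []
    for part in partitions:
        local = param.zero([])
        for record in part:
            if record < 0:
                # Each worker adds to its own local copy.
                local = param.addInPlace(local, [record])
        partials.append(local)
    # The driver merges the per-worker partial results.
    merged = param.zero([])
    for partial in partials:
        merged = param.addInPlace(merged, partial)
    return merged


rejected = run_tasks([[3, -1, 4], [-5, 9, -2]], ListAccumulatorParam())
```

With real PySpark, an object of this shape (subclassing `AccumulatorParam`) would be passed as the `accum_param` argument of `SparkContext.accumulator`.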

**Questions**
1. What are the driver and the executors in Spark?