# Importing Libraries


In [20]:
from delta import *
from pyspark.sql import SparkSession, Row


from pyspark.sql.types import (
    IntegerType,
    LongType,
    StructField,
    StructType,
    DateType,
    DoubleType,
)

# Creating a Spark Session


In [21]:
spark = SparkSession.builder.appName("learn_spark").getOrCreate()

# Apache Spark Architecture

Apache Spark is a distributed computing framework designed for processing large-scale data across a cluster of machines. It provides high-level APIs in Java, Scala, Python, and R, making it accessible to a wide range of users. Spark is known for its in-memory processing capabilities, fault tolerance, and ease of use.

<img src="https://spark.apache.org/docs/latest/img/cluster-overview.png">


## Key Points

- **Driver Program**: The application starts with the driver program, which represents the user's Spark application. The driver program is responsible for creating a SparkContext that coordinates the execution of tasks.
- **SparkContext**: The SparkContext is the entry point to the Spark cluster. It communicates with the cluster manager to acquire resources (executors) and to schedule tasks.
- **Cluster Manager**: The cluster manager is responsible for managing resources across the cluster. It allocates resources to the Spark application and monitors their usage. Examples of cluster managers include `Apache Mesos, Hadoop YARN, and Spark's standalone cluster manager`.
- **Executors**: `Executors are processes that run on worker nodes` in the Spark cluster and are responsible for executing tasks. Each executor runs in its own Java Virtual Machine (JVM) and communicates with the driver program and the cluster manager. Each executor is assigned a fixed number of cores and a certain amount of memory. The number of executors, together with the cores per executor, determines the level of parallelism at which Spark can process data.
- **RDD (Resilient Distributed Dataset)**: RDD is the fundamental data structure in Spark. It represents a distributed collection of objects that can be processed in parallel. `RDDs are immutable and fault-tolerant`.
- **Directed Acyclic Graph (DAG)**: The execution plan of a Spark application is represented as a DAG. The DAG is a sequence of transformations and actions that Spark performs on the data. `Each stage in the DAG consists of transformations and is executed as a set of tasks`.
- **Task Scheduler**: The task scheduler, part of the SparkContext, is responsible for scheduling tasks on the executors. It takes into account the dependencies between tasks and aims to maximize parallelism.
- **Shuffle**: `A shuffle is an expensive operation where data is redistributed across the cluster`. It occurs when there is a need to exchange data between different partitions of RDDs.
- **Broadcast Variables**: Broadcast variables allow efficient distribution of read-only variables to all tasks in a Spark application. This reduces data transfer costs, especially for large variables like lookup tables.
- **Caching and Persistence**: Spark allows users to persist (cache) intermediate data in memory across multiple stages. This caching improves performance for iterative algorithms or when the same data is used multiple times.


# Spark Executors

An executor is a Spark process responsible for executing tasks on a specific node in the cluster. Each executor is assigned a fixed number of cores and a certain amount of memory. The number of executors determines the level of parallelism at which Spark can process data.

Generally, having more executors allows for better parallelism and resource utilization.
Each executor can work on a subset of data independently, which can lead to increased processing speed.
However, it’s important to strike a balance between the number of executors and the available cluster resources. If the number of executors is too high, it can lead to excessive memory usage and increased overhead due to task scheduling.

**Advantages**:

1. More executors provide increased parallelism and the ability to process data in parallel.
2. Each executor can work on a subset of data independently, leading to improved processing speed.
3. It allows for better resource utilization by distributing the workload across multiple executor processes.

**Considerations**:

1. Allocating too many executors can lead to excessive memory usage and increased overhead due to task scheduling.
2. Inefficient executor allocation can result in the underutilization of cluster resources.
3. The optimal number of executors depends on factors such as dataset size, computation complexity, and available cluster resources.


In [22]:
# spark.conf.set("spark.executor.instances", 8) # no longer works within spark application

# Spark Cores

The number of cores refers to the total number of processing units available on the machines in your Spark cluster. It represents the parallelism level at which Spark can execute tasks. Each core can handle one concurrent task.

**Effects of increasing the number of cores**

1. Spark can execute more tasks simultaneously, which can improve the overall throughput of your application.
2. Adding too many cores can also `introduce overhead due to task scheduling and inter-node communication`, especially if the cluster resources are limited.
3. The optimal number of cores depends on factors such as the size of your dataset, the complexity of your computations, and the available cluster resources.

**Advantages**

1. Increasing the number of cores allows for higher parallelism and the ability to execute more tasks simultaneously.
2. More cores can lead to improved throughput and faster processing of data.
3. It allows better utilization of available computational resources in the cluster.

**Considerations**

1. Adding too many cores without sufficient resources can lead to resource contention and performance degradation.
2. Excessive parallelism can introduce overhead due to task scheduling and inter-node communication, impacting performance.
3. The optimal number of cores depends on the size of the dataset, the complexity of computations, and available cluster resources.


In [23]:
# spark.conf.set("spark.executor.cores", 4) # no longer works within spark application
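
The two settings above are commented out because, as noted, they no longer take effect from inside a running application. A minimal sketch of supplying them at launch time instead: the values 8 and 4 come from the cells above, while the `4g` memory value and the helper names `EXECUTOR_CONF`, `spark_submit_args`, and `build_session` are assumptions made for illustration.

```python
# Illustrative values only: 8 and 4 come from the commented-out cells above,
# "4g" is an assumed placeholder for executor memory.
EXECUTOR_CONF = {
    "spark.executor.instances": "8",
    "spark.executor.cores": "4",
    "spark.executor.memory": "4g",
}


def spark_submit_args(conf):
    """Render the settings as `--conf key=value` pairs for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(conf, app_name="learn_spark"):
    """Apply the settings while building the session, before it starts."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(" ".join(spark_submit_args(EXECUTOR_CONF)))
```

Note that `getOrCreate()` returns an already-running session if one exists, so in a notebook these values are only likely to apply when they are set before the first session is created.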

# Spark Cluster Hierarchy

<img src ="http://c1.staticflickr.com/9/8854/28256131573_ec8328799b_o.png">


### Node -> Worker Node:

A node in the context of Spark typically refers to a machine in the cluster. A worker node is a node that runs Spark tasks.

### Worker Node -> Executor:

A worker node can host multiple executors, each running in its own JVM. The number of executors on a worker node is influenced by the cluster manager's configuration.

### Executor -> Cores:

An executor can have multiple cores, and each core can run a task concurrently. The number of cores per executor is a configuration parameter.

### Executor -> Task:

Tasks are executed within an executor. An executor runs multiple tasks concurrently, and `each task processes a partition of the data`.

### Parallelism:

Parallelism is determined by the number of tasks that can be executed concurrently. It is influenced by the number of partitions in RDDs, the number of cores per executor, and the number of executors.
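
The relationship between executors, cores, and partitions can be sketched with a little arithmetic. This assumes each task uses one core (Spark's default); the function names and the example numbers are hypothetical.

```python
import math


def task_slots(num_executors, cores_per_executor):
    """Tasks that can run at the same time: one per executor core."""
    return num_executors * cores_per_executor


def task_waves(num_partitions, num_executors, cores_per_executor):
    """Rounds of tasks needed to process every partition once."""
    slots = task_slots(num_executors, cores_per_executor)
    return math.ceil(num_partitions / slots)


# Hypothetical numbers: 4 executors with 3 cores each, 100 partitions.
print(task_slots(4, 3))       # 12
print(task_waves(100, 4, 3))  # 9
```

With fewer partitions than slots, some cores sit idle; with many more partitions than slots, the work is processed in several waves.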


# Configuring Spark Number of Executors and its Cores

Configuring the number of cores and executors in Apache Spark depends on several factors, including

- The characteristics of your workload,
- The available cluster resources, and
- Specific requirements of your application.

While there is no one-size-fits-all approach, here are some general guidelines to help you configure these parameters effectively:

- **Number of executors**: The number of executors per node follows from the `number of cores on each node divided by the number of cores per executor`. Multiply that by the number of nodes to get the total, as in the worked examples below.
- **Memory per executor**: The amount of `memory allocated to each executor should be based on the size of the data` that will be processed by that executor. It is important to leave some memory available for the operating system and other processes. Spark's default of 1 GB per executor (`spark.executor.memory`) is only a starting point to adjust from.
- **Number of partitions**: `The number of partitions used for shuffle operations should be at least the total number of executor cores`, so that every core has work to do. The Spark tuning guide suggests 2-3 tasks per CPU core.
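
A minimal sketch of sizing shuffle partitions from the total number of executor cores, assuming the 2-3 tasks per core suggested by the Spark tuning guide. The function name, the default multiplier, and the example cluster are illustrative assumptions.

```python
def suggested_shuffle_partitions(num_executors, cores_per_executor, tasks_per_core=3):
    """Scale the partition count with the total number of executor cores."""
    return num_executors * cores_per_executor * tasks_per_core


# Hypothetical cluster: 10 executors with 4 cores each.
partitions = suggested_shuffle_partitions(10, 4)
print(partitions)  # 120

# With a live session, the value could then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", partitions)
```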

Let’s work through how to decide on the number of executors and cores to configure in a cluster. Suppose you have a Spark cluster with 16 nodes, each having 8 cores and 32 GB of memory, a relatively large dataset of around 1 TB, and complex computations to run on it.

<img src = "https://sparkbyexamples.com/ezoimgfmt/i0.wp.com/sparkbyexamples.com/wp-content/uploads/2023/05/Executor-vs-Cores.jpg?w=263&ssl=1&ezimgfmt=ng:webp/ngcb1">

Note: For the above cluster configuration we have:

**Available Resources**

- Total cores in the cluster = 16 nodes \* 8 cores per node = 128 cores
- Total memory in the cluster = 16 nodes \* 32 GB per node = 512 GB

**Workload Characteristics**

Large dataset size and complex computations suggest that you need a high level of parallelism to process the data efficiently. Let’s assume that you want to allocate 80% of the available resources to Spark.
Now let’s analyze how to choose the number of executors and cores.
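
The resource figures used in the rest of this section can be reproduced with a few lines. This is only a sketch of the arithmetic; the variable names are made up, and the 80%-of-cores figure is derived here rather than stated in the text.

```python
NODES = 16
CORES_PER_NODE = 8
MEMORY_PER_NODE_GB = 32
SPARK_SHARE = 0.80  # fraction of the cluster allocated to Spark

total_cores = NODES * CORES_PER_NODE          # 128
total_memory_gb = NODES * MEMORY_PER_NODE_GB  # 512

# 80% of 512 GB is 409.6 GB, which the text rounds to 410 GB.
spark_memory_gb = round(total_memory_gb * SPARK_SHARE)
# 80% of 128 cores is 102.4, rounded down to whole cores.
spark_cores = int(total_cores * SPARK_SHARE)

print(total_cores, total_memory_gb, spark_memory_gb, spark_cores)
```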

## Tiny Executor Configuration

One way of configuring Spark executors and their cores is to start with a minimal configuration for each executor and increase it based on application performance.

### Executor Memory and Cores per Executor (Considering having 1 core per executor)

- Number of executors per node=8,
- Executor-memory=32/8=4GB

### Calculating the Number of Executors (To calculate the number of executors, divide the available memory by the executor memory)

- Total memory available for Spark = 80% of 512 GB = 410 GB
- Number of executors = Total memory available for Spark / Executor memory = 410 GB / 4 GB ≈ 102 executors
- Number of executors per node = Total Number of Executors/ Number of Nodes = 102/16 ≈ 6 Executors/Node

So, in this example, you would configure Spark with 102 executors, each executor having 1 core and 4 GB of memory.
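
The tiny-executor numbers above can be checked step by step. A minimal sketch using the cluster from this example; the variable names are illustrative.

```python
nodes = 16
cores_per_node = 8
memory_per_node_gb = 32
spark_memory_gb = 410   # 80% of 512 GB, rounded
cores_per_executor = 1  # the "tiny" choice

# One executor per core, so a node can hold up to 8 executors.
max_executors_per_node = cores_per_node // cores_per_executor
# Node memory split evenly across those executors.
executor_memory_gb = memory_per_node_gb // max_executors_per_node
# How many such executors fit in the memory budget.
num_executors = spark_memory_gb // executor_memory_gb
# Average executors per node under that budget.
executors_per_node = num_executors / nodes

print(max_executors_per_node, executor_memory_gb, num_executors, executors_per_node)
```

The budgeted average (about 6 per node) is lower than the 8 executors a node could physically hold, because only 80% of the memory is given to Spark.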

<img src = "https://sparkbyexamples.com/ezoimgfmt/i0.wp.com/sparkbyexamples.com/wp-content/uploads/2023/05/Executor-vs-Cores-1.jpg?w=561&ssl=1&ezimgfmt=ng:webp/ngcb1">

### Pros of Spark Tiny Executor Configuration:

- **Resource Efficiency** Tiny executors consume less memory and fewer CPU cores compared to larger configurations.
- **Increased Task Isolation** With tiny executors, each task runs in a more isolated environment. This isolation can prevent interference between tasks, reducing the chances of resource contention and improving the stability of your Spark application.
- **Task Granularity** Tiny executor configurations can be beneficial if your workload consists of a large number of small tasks. With smaller executors, Spark can allocate resources more precisely, ensuring that each task receives sufficient resources without excessive overprovisioning.

### Cons of Spark Tiny Executor Configuration:

- **Increased Overhead** Using tiny executors can introduce higher overhead due to the increased number of executor processes and task scheduling.
- **Limited Parallelism** Tiny executors have fewer cores, limiting the level of parallelism in your Spark application.
- **Potential Bottlenecks** In a tiny executor configuration, if a single task takes longer to execute than others, it can become a bottleneck for the entire application.
- **Memory Overhead** Although tiny executors consume less memory individually, the overhead of multiple executor processes can add up. This can lead to increased memory usage for managing the executor processes, potentially reducing the available memory for actual data processing.

## Fat Executor Configuration

The other way of configuring Spark executors and their cores is to use the largest possible configuration, i.e. only one executor per node, and then tune it based on application performance.

### Executor Memory and Cores per Executor: Considering having 8 cores per executor,

- Number of executors per node= number of cores for a node/ number of cores for an executor = 8/8 = 1,
- Executor-memory=32/1= 32GB

### Calculating the Number of Executors: To calculate the number of executors, divide the available memory by the executor memory:

- Total memory available for Spark = 80% of 512 GB = 410 GB
- Number of executors = Total memory available for Spark / Executor memory = 410 GB / 32 GB ≈ 12 executors
- Number of executors per node = Total Number of Executors/ Number of Nodes = 12/16 ≈ 1 Executors/Node

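So, in this example, the 80% memory budget supports about 12 executors, each executor having 8 cores and 32 GB of memory. Running one executor on every node (16 executors) would use all 512 GB and exceed that budget.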
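
For the fat configuration, the executor count depends on whether you size by the memory budget or by node count. A short sketch that computes both from the example's figures; the variable names are illustrative.

```python
nodes = 16
memory_per_node_gb = 32
spark_memory_gb = 410  # 80% of 512 GB, rounded

# One executor takes the whole node's memory.
executor_memory_gb = memory_per_node_gb

executors_by_memory_budget = spark_memory_gb // executor_memory_gb  # 12
executors_by_node_count = nodes                                     # 16

# Memory requested if every node runs one fat executor.
requested_gb = executors_by_node_count * executor_memory_gb         # 512
over_budget = requested_gb > spark_memory_gb

print(executors_by_memory_budget, executors_by_node_count, requested_gb, over_budget)
```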

<img src = "https://sparkbyexamples.com/ezoimgfmt/i0.wp.com/sparkbyexamples.com/wp-content/uploads/2023/05/Executor-vs-Cores-1.jpg?w=561&ssl=1&ezimgfmt=ng:webp/ngcb1">

### Pros of Fat Executor Configuration

- **Increased Parallelism** Fat executor configurations allocate more CPU cores and memory to each executor, resulting in improved processing speed and throughput.
- **Reduced Overhead** With fewer executor processes to manage, a fat executor configuration can reduce the overhead of task scheduling, inter-node communication, and executor coordination. This can lead to improved overall performance and resource utilization.
- **Enhanced Data Locality** `Larger executor memory sizes can accommodate more data partitions in memory`, reducing the need for data shuffling across the cluster.
- **Improved Performance for Complex Tasks** By allocating more resources to each executor, you can efficiently handle complex computations and large-scale data processing.

### Cons of Fat Executor Configuration

- **Resource Overallocation** Using fat executors can result in overallocation of resources, especially if the cluster does not have sufficient memory or CPU cores.
- **Reduced Task Isolation** With larger executor configurations, tasks have fewer executor processes to run on. This can increase the chances of resource contention and interference between tasks, potentially impacting the stability and performance of your Spark application.
- **Longer Startup Times** Fat executor configurations require more resources and may have longer startup times compared to smaller configurations.
- **Difficulty in Resource Sharing** Fat executors may not be efficient when sharing resources with other applications or services running on the same cluster. It can limit the flexibility of resource allocation and hinder the ability to run multiple applications concurrently.

## Balanced Executor Configuration

Databricks, the company founded by the creators of Spark, recommends 2-5 cores per executor as a good initial configuration for running applications smoothly, a figure arrived at after considerable trial-and-error testing of executor and core settings.

### Executor Memory and Cores per Executor: Considering having 3 cores per executor, Leaving 1 core per node for daemon processes

- Number of executors per node= (number of cores for a node – core for daemon process)/ number of cores for an executor = 7/3 ≈ 2,
- Executor-memory=Total memory per node/ number executors per node = 32/2= 16GB

### Calculating the Number of Executors: To calculate the number of executors, divide the available memory by the executor memory:

- Total memory available for Spark = 80% of 512 GB = 410 GB
- Number of executors = Total memory available for Spark / Executor memory = 410 GB / 16 GB ≈ 25 executors
- Number of executors per node = Total Number of Executors/ Number of Nodes = 25/16 ≈ 1.6, i.e. at most 2 Executors/Node

<img src = "https://sparkbyexamples.com/ezoimgfmt/i0.wp.com/sparkbyexamples.com/wp-content/uploads/2023/05/Executor-vs-Cores-2.jpg?w=561&ssl=1&ezimgfmt=ng:webp/ngcb1">

So, in this example, you would configure Spark with about 25 executors, each executor having 3 cores and 16 GB of memory. Filling every node with 2 executors would give 32 executors, but that would use all 512 GB rather than the 80% budget.
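
A minimal sketch that recomputes the balanced sizing from the stated inputs (16 nodes, 8 cores and 32 GB each, an 80% memory budget, 3 cores per executor, 1 core reserved per node). The memory budget (410 GB / 16 GB) and the per-node layout (2 executors on each of 16 nodes) give different counts, so the code reports both. The function `size_executors` and its parameter names are hypothetical, not part of any Spark API.

```python
def size_executors(nodes, cores_per_node, memory_per_node_gb,
                   cores_per_executor, reserved_cores_per_node=0,
                   spark_share=0.80):
    """Derive executor sizing from cluster shape and cores per executor."""
    usable_cores = cores_per_node - reserved_cores_per_node
    per_node = usable_cores // cores_per_executor
    executor_memory_gb = memory_per_node_gb // per_node
    budget_gb = round(nodes * memory_per_node_gb * spark_share)
    return {
        "executors_per_node": per_node,
        "executor_memory_gb": executor_memory_gb,
        # Count if every node is filled with executors.
        "executors_by_layout": per_node * nodes,
        # Count that fits inside the memory budget.
        "executors_by_memory_budget": budget_gb // executor_memory_gb,
    }


balanced = size_executors(16, 8, 32, cores_per_executor=3,
                          reserved_cores_per_node=1)
print(balanced)
```

The same function reproduces the tiny and fat cases by passing `cores_per_executor=1` or `cores_per_executor=8` with no reserved cores.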

In practice, one size does not fit all. You need to keep tuning as per cluster configuration. `But in general, the number of executor cores should be 2-5.`

### Pros of Balanced Executor Configuration:

- **Optimal Resource Utilization** A balanced executor configuration aims to evenly distribute resources across the cluster. This allows for efficient utilization of both CPU cores and memory, maximizing the overall performance of your Spark application.
- **Reasonable Parallelism** By allocating a moderate number of cores and memory to each executor, a balanced configuration strikes a balance between parallelism and resource efficiency. It can provide a good compromise between the high parallelism of small executors and the resource consumption of large executors.
- **Flexibility for Multiple Workloads** A balanced configuration allows for accommodating a variety of workloads. It can handle both small and large datasets, as well as diverse computational requirements, making it suitable for environments where multiple applications or different stages of data processing coexist.
- **Reduced Overhead** Compared to tiny executor configurations, a balanced configuration typically involves fewer executor processes. This can reduce the overhead of task scheduling, inter-node communication, and executor coordination, leading to improved performance and lower resource consumption.

### Cons of Balanced Executor Configuration:

- **Limited Scaling** A balanced executor configuration may not scale as effectively as configurations with a higher number of cores or executors. In scenarios where the workload or dataset size significantly increases, a balanced configuration may reach its limit, potentially leading to longer processing times or resource contention.
- **Trade-off in Task Isolation** While a balanced configuration can provide a reasonable level of task isolation, it may not offer the same level of isolation as smaller executor configurations. In cases where tasks have distinct resource requirements or strict isolation requirements, a balanced configuration may not be the most suitable choice.
- **Task Granularity** In situations where the workload consists of a large number of small tasks, a balanced executor configuration may not offer the same level of fine-grained task allocation as smaller executor configurations. This can lead to suboptimal resource allocation and potentially impact performance.
- **Complexity in Resource Management** Maintaining a balanced executor configuration across a dynamic cluster can be challenging. As the cluster size and resource availability change, it may require frequent adjustments to ensure the configuration remains balanced, which can add complexity to cluster management.


## Comparing the three configurations

| Configuration              | Tiny Executor                                               | Fat Executor                                                           | Balanced Executor                           |
| -------------------------- | ----------------------------------------------------------- | ---------------------------------------------------------------------- | ------------------------------------------- |
| **Executor Memory**        | Low memory allocation (e.g., 1-2 GB)                        | High memory allocation (e.g., 16-32 GB)                                | Moderate memory allocation (e.g., 4-8 GB)   |
| **Number of Cores**        | Few cores (e.g., 1-2)                                       | Many cores (e.g., 8-16)                                                | Moderate number of cores (e.g., 2-5)        |
| **Parallelism**            | Limited parallelism due to fewer cores                      | High parallelism due to many cores                                     | Balanced parallelism                        |
| **Task Granularity**       | Smaller tasks due to fewer resources                        | Larger tasks due to more resources                                     | Moderate task granularity                   |
| **Use Case**               | Suitable for lightweight tasks or constrained resources     | Suitable for memory-intensive tasks and large datasets                 | Balanced use for general workloads          |
| **Resource Utilization**   | Efficient utilization of resources, but limited parallelism | Efficient for parallel processing, but may lead to resource contention | Balanced resource utilization               |
| **Memory-Intensive Tasks** | Limited capacity for memory-intensive tasks                 | Well-suited for memory-intensive tasks                                 | Adequate capacity for moderate memory tasks |
| **CPU-Intensive Tasks**    | Suitable for tasks with low CPU requirements                | Suitable for CPU-intensive tasks                                       | Balanced for a mix of CPU and memory tasks  |


# RDD, DataFrame and Dataset

## RDD (Resilient Distributed Dataset):

**Definition:** RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel. RDDs are fault-tolerant and can recover lost data due to node failures by using lineage information.

**Characteristics:**

- **Immutability:** RDDs are immutable, meaning their content cannot be changed once created. You can transform an RDD into another RDD through transformations, but the original RDD remains unchanged.
- **Partitioning:** RDDs are divided into partitions, which are distributed across the nodes of the cluster and processed in parallel, one task per partition. Partitioning enables parallelism in distributed computing.
- **Resilience:** RDDs are fault-tolerant. If a partition is lost due to node failure, Spark can recompute it using the lineage information.
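
The three characteristics above can be illustrated with a toy model in plain Python. This is not Spark code and not how Spark is implemented. `ToyRDD` is a made-up class that only shows the idea of remembering lineage (parent plus function) so that a single partition can be rebuilt on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # list of lists for source data, else None
        self.parent = parent           # lineage: the ToyRDD this one came from
        self.fn = fn                   # lineage: the function applied to the parent

    def map(self, fn):
        # Returns a new object; the original is left unchanged (immutability).
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self._partitions)
        return self.parent.num_partitions()

    def compute_partition(self, index):
        # Source data is read directly; derived data is rebuilt from the parent.
        if self.parent is None:
            return list(self._partitions[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.compute_partition(i)]


base = ToyRDD(partitions=[[1, 2], [3, 4, 5]])
squared = base.map(lambda x: x ** 2)

print(squared.collect())             # [1, 4, 9, 16, 25]
print(squared.compute_partition(1))  # [9, 16, 25]
print(base.collect())                # [1, 2, 3, 4, 5]
```

Recomputing partition 1 alone mirrors what happens when one partition is lost: only that partition is rebuilt from its lineage, not the whole dataset.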

## DataFrame:

**Definition:** A DataFrame is a higher-level abstraction built on top of RDD. It represents a distributed collection of data organized into named columns. DataFrames provide a more structured and user-friendly API compared to RDDs, and they support various optimizations.

**Characteristics:**

- **Schema:** DataFrames have a predefined schema, meaning the data is organized into columns with specified data types. This allows Spark to perform optimizations during execution.
- **Lazy Evaluation:** Like RDDs, DataFrames use lazy evaluation, meaning transformations are not executed immediately but are scheduled to be executed later when an action is called.
- **Optimization:** DataFrames leverage the Catalyst optimizer and the Tungsten execution engine, leading to performance improvements.

## Dataset:

**Definition:** A Dataset is a distributed collection of data that provides the benefits of both RDDs and DataFrames. Datasets are strongly typed, allowing for type-safety and improved optimizations, while also providing a functional API similar to RDDs.

**Characteristics:**

- **Strong Typing:** Datasets are strongly typed, meaning that the type of data is known at compile-time. This allows for type safety and optimization of queries.
- **Functional API:** Datasets provide a functional programming API, similar to RDDs, allowing users to express complex data manipulations using transformations and actions.
- **Optimization:** Datasets benefit from the same optimizations as DataFrames, including Catalyst and Tungsten, leading to improved performance.

In summary, RDDs are the basic building blocks of Spark, DataFrames provide a higher-level, structured API for data manipulation, and Datasets combine the benefits of both RDDs and DataFrames, offering strong typing and functional programming capabilities. The choice between them depends on the specific requirements and characteristics of the data processing task at hand.

## Tabular Comparison

| Feature                 | RDD                                           | DataFrame                                         | Dataset                                                 |
| ----------------------- | --------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------- |
| **Definition**          | Resilient Distributed Dataset                 | Distributed collection of data with named columns | Typed distributed collection of data with named columns |
| **Type Safety**         | Typed in Scala/Java; dynamic in Python        | Not type-safe; schema checked at runtime          | Type-safe (typed)                                       |
| **Performance**         | Lower performance optimizations               | Optimized for performance                         | Optimized for performance                               |
| **Ease of Use**         | Lower-level API, more manual control          | Higher-level API, easier to use                   | Higher-level API, easier to use                         |
| **Interoperability**    | Supports Java, Scala, Python                  | Supports Java, Scala, Python, R                   | Supports Java, Scala                                    |
| **Compile-Time Safety** | Element types checked in Scala/Java only      | No; column and type errors surface at runtime     | Compile-time type checking (statically typed)           |
| **Optimization**        | Limited optimizations                         | Optimized Catalyst and Tungsten engine            | Optimized Catalyst and Tungsten engine                  |
| **Schema**              | No predefined schema                          | Predefined schema                                 | Predefined schema                                       |
| **Serialization**       | Java Serialization                            | Tungsten binary format                            | Tungsten binary format                                  |
| **API**                 | Functional transformations                    | Declarative SQL-like operations                   | Functional transformations                              |
| **Immutable**           | Immutable                                     | Immutable                                         | Immutable                                               |
| **Lazy Evaluation**     | Supports lazy evaluation                      | Supports lazy evaluation                          | Supports lazy evaluation                                |
| **Use Cases**           | Low-level data processing, complex algorithms | Data manipulation, exploratory data analysis      | Type-safe, high-level data processing                   |


Type safety, also known as type checking, is a programming concept that ensures that operations performed on variables or data structures are compatible with their declared types. In the context of Apache Spark and the differentiation between RDD, DataFrame, and Dataset, type safety refers to the level of assurance the system provides regarding the correctness of data types during compile-time.

**RDD (Resilient Distributed Dataset):**

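In Scala and Java an RDD is parameterized by its element type (for example `RDD[Int]`), so the compiler checks the functions you pass to it. In Python, RDDs hold arbitrary objects and nothing is checked until the code runs. In either case an RDD carries no schema, so Spark itself knows nothing about the structure of the records. `You can perform operations on RDDs without being explicitly aware of the data types.` This can lead to runtime errors if the data types are not handled correctly.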

**DataFrame:**

DataFrames have a predefined schema, but they are not type-safe at compile time. Operations on DataFrames are checked against the schema when the query is analyzed at runtime, so a misspelled column name or a type mismatch surfaces as a runtime error rather than a compilation error.

**Dataset:**

Datasets provide a high level of type safety. Datasets are strongly typed, meaning that operations on Datasets are checked at compile-time to ensure type compatibility. They offer the benefits of both RDDs and DataFrames, allowing for strong typing and optimization.


> DataFrames in Apache Spark are not strongly typed in the same sense as Datasets. While DataFrames have a predefined schema that enforces data types for each column, the enforcement is typically done at runtime rather than compile-time. This means that type checking occurs when actions are executed rather than during the compilation phase.


# Examples


## RDDs


In [24]:
data = [1, 2, 3, 4, 5]

# Distribute the local list across the cluster as an RDD
rdd = spark.sparkContext.parallelize(data)

# Perform transformations and actions on the RDD
squared_rdd = rdd.map(lambda x: x ** 2)
result = squared_rdd.reduce(lambda x, y: x + y)

# Show the result
print("RDD Result:", result)

## DataFrame


In [None]:
# Create a DataFrame from a list of Row objects
data = [Row(id=1, value=10), Row(id=2, value=20), Row(id=3, value=30)]
df = spark.createDataFrame(data)

# Keep rows with value > 15, then sum the numeric columns per id
df_result = df.select("id", "value").filter(df["value"] > 15).groupBy("id").sum()
df_result.show()

## Dataset


Because the Dataset API relies on `compile-time safety`, it is only available in the compiled JVM languages (Java and Scala), not in Python or R. The DataFrame API, by contrast, is available in all four languages supported by Spark (Java, Scala, Python, and R).
