# Apache Spark UI Tutorial
## Introduction

This tutorial will help you understand Spark's Web UI - your dashboard for monitoring and debugging Spark applications. The Spark UI provides valuable insights into your application's performance, resource usage, and execution flow.

All information in this notebook is based on the Apache Spark 3.5 Web UI guide, available at:

https://spark.apache.org/docs/latest/web-ui.html

## Getting Started with Spark UI

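The Spark UI automatically launches when you start a Spark application and is typically available at http://localhost:4040. If port 4040 is already taken (for example, by another running Spark application), Spark binds to the next free port: 4041, 4042, and so on.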

In [2]:
from pyspark.sql import SparkSession
import random

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("SparkUIJobsTab") \
    .master("local[*]") \
    .getOrCreate()

print("Spark UI is available at: http://localhost:4040")


25/05/13 23:15:24 WARN Utils: Your hostname, ashrafk resolves to a loopback address: 127.0.1.1; using 192.168.1.205 instead (on interface wlo1)
25/05/13 23:15:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/13 23:15:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark UI is available at: http://localhost:4040
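
Because the port can change, it can help to ask the running session for its UI address instead of hard-coding it. The helper below is a minimal sketch: `spark_ui_url` and its `default` argument are names invented for this illustration, while `sparkContext.uiWebUrl` is the PySpark property it relies on (assumed here to be empty when the UI is disabled).

```python
def spark_ui_url(session, default="http://localhost:4040"):
    """Return the UI address reported by a SparkSession.

    Falls back to `default` (a hypothetical placeholder) when the session
    reports no URL, which is assumed to happen when the UI is disabled.
    """
    url = session.sparkContext.uiWebUrl
    return url or default


# Usage in this notebook (needs the `spark` session created above):
# print("Spark UI is available at:", spark_ui_url(spark))
```

On a machine where the hostname does not resolve to localhost, the reported address may use the machine's IP instead, as the startup warning above suggests.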


## 1. The Jobs Tab: Understanding Your Spark Workload

The Jobs tab is your primary view into Spark's execution. A "job" is created whenever you execute an action on your data (like `collect()`, `count()`, or a write such as `df.write.save()`).

When you visit the Jobs tab, you'll see:

- A list of all jobs with their status (running, completed, or failed)
- Execution times showing how long each job took
- Breakdown of stages within each job
- Links to more detailed views


![Jobs Tab Description](res/jobs_tab.png)

### Key Components of the Jobs Tab:

- **Job ID**: Each job gets a unique numerical identifier, starting at 0 for the first job in your application. In our example, we see jobs with IDs 4, 5, 6, and 7.
- **Description**: This column shows which operation triggered the job. Notice how operations like count() and show() appear here with line numbers from your code. This helps you connect specific actions in your code to the jobs they create.
- **Duration**: How long each job took to execute, shown in a human-readable unit such as ms, s, or min. This is invaluable for identifying performance bottlenecks. In our example, durations range from 0.4 seconds to 17 seconds.
- **Stages**: Spark breaks each job into "stages" separated by shuffle boundaries. Tasks within a stage run in parallel, while a stage that depends on another must wait for it to finish. The "Succeeded/Total" format shows how many stages have completed out of the total. Notice how some jobs have just 1 stage while others have 3 stages: more complex operations typically require more stages.
- **Tasks**: These are the individual units of work distributed across your cluster. Simple jobs might have few tasks, while complex operations on large datasets can have thousands. The "Succeeded/Total" format shows completion progress.
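
The default text in the Description column (an action name plus a line number) can be hard to tell apart when many jobs look alike. One option is to set a description before triggering an action. The sketch below wraps PySpark's `SparkContext.setJobDescription`; the helper name `run_with_description` and the sample label are made up for this example.

```python
def run_with_description(sc, description, action):
    """Run `action` so that the jobs it triggers carry `description` in the Jobs tab.

    `sc` is a SparkContext (for example `spark.sparkContext`), and `action`
    is any zero-argument callable that triggers a Spark action.
    """
    sc.setJobDescription(description)
    return action()


# Usage in this notebook (needs a live session and a DataFrame `df`):
# total = run_with_description(spark.sparkContext, "Count all products", df.count)
```

The description stays attached to later jobs submitted from the same thread until you set a new one, so it is worth labeling each action explicitly.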

### Active vs. Completed Jobs:
The Jobs tab separates currently running jobs from completed ones:

- **Active Jobs**: Job #7 is still running, as indicated by the "running" status in the Tasks column and incomplete stages (0/2).
- **Completed Jobs**: Jobs #4, 5, and 6 have finished execution. You can see all their tasks have succeeded (e.g., 9/9 and 1/1) and their stages are complete.

### What This Tells Us About Spark Execution:

- Simple actions like `show()` with small results (Job #6) are quick and require minimal processing (just 0.4s and one stage).
- More complex operations (Jobs #4 and #5) require multiple stages and more tasks, which often means data shuffling between stages.
- In-progress jobs can be monitored in real time to track their execution progress.

When developing Spark applications, pay close attention to jobs with long durations or many stages, as these are prime candidates for optimization.
In the next sections, we'll look at what happens inside these jobs by examining the Stages tab and visualizations of execution.

### Let's create a code example to demonstrate:
Run this code:

In [None]:
from pyspark.sql import SparkSession
import time

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Jobs Tab Example") \
    .master("local[*]") \
    .getOrCreate()

# Create a sample DataFrame
data = [(i, f"Product_{i % 5}", i * 10) for i in range(10000)]
df = spark.createDataFrame(data, ["id", "product", "value"])

# Register as a temporary view
df.createOrReplaceTempView("products")

# Job 1: Count the total number of records
print("Job 1: Counting records")
record_count = df.count()
print(f"Total records: {record_count}")
time.sleep(1)  # Pause to separate jobs in the UI

# Job 2: Filter and aggregate
print("Job 2: Filtering and aggregating")
filtered = df.filter(df.value > 50)
filtered_count = filtered.count()
print(f"Records with value > 50: {filtered_count}")
time.sleep(1)  # Pause to separate jobs in the UI

# Job 3: Group by operation (creates a shuffle)
print("Job 3: Performing groupBy operation")
summary = df.groupBy("product").count()
summary.show()
time.sleep(1)  # Pause to separate jobs in the UI

# Job 4: SQL query execution
print("Job 4: Executing SQL query")
sql_result = spark.sql("""
    SELECT
        product,
        AVG(value) as avg_value,
        MAX(value) as max_value
    FROM products
    GROUP BY product
    ORDER BY avg_value DESC
""")
sql_result.show()

print("\nCheck the Jobs tab in the Spark UI at http://localhost:4040")

### What to Look for in the Jobs Tab:

After running this code, open the Jobs tab in the Spark UI. You'll see:

- **Multiple jobs listed**: At least one job for each of the four actions in your code. The "Job 1" to "Job 4" labels are comments in the code, not UI Job IDs (which start at 0). A single action can also show up as more than one job: `show()` may scan partitions incrementally, and with adaptive query execution (enabled by default since Spark 3.2) a query that shuffles can be submitted in several pieces.
- **Job descriptions**: Each description maps back to the action that triggered it:
  - `df.count()` (Job 1)
  - `filtered.count()` (Job 2)
  - `summary.show()` (Job 3)
  - `sql_result.show()`, which executes the SQL query (Job 4)
- **Job details**: For each job, you'll see:
  - **Duration**: How long the job took to execute
  - **Stages**: Number of stages the job was broken into
  - **Tasks**: Number of tasks executed

### Direct Connection to Your Code:

Each job in the Jobs tab corresponds to an action in your code:

- **Job 1** comes from `df.count()`: a simple action in which Spark counts each partition and then combines the partial counts.
- **Job 2** comes from `filtered.count()`: another simple job, with the `filter()` applied in the same pass as the count.
- **Job 3** comes from `summary.show()`: the `groupBy()` creates a shuffle, which typically results in multiple stages.
- **Job 4** comes from `sql_result.show()`: the most complex job, with grouping, aggregation, and sorting spread over multiple stages.

The key insight is that transformations in your code (like `filter()` and `groupBy()`) don't create jobs on their own. They are only executed when an action (like `count()` or `show()`) is called.

### Views Worth Exploring:

- **Jobs overview**: The main Jobs tab lists every job with its description, duration, and stages, showing how each action in your code creates work in Spark.
- **Job details page**: Click the description of Job 3 (the groupBy operation) to see how the single line `df.groupBy("product").count()` translates into multiple stages with different operations.
- **DAG visualization**: On the details page of the SQL query job (Job 4), expand the DAG Visualization to see the directed acyclic graph that Spark builds from your query, with nodes representing operations and edges showing data flow.
- **Event Timeline**: Expand the Event Timeline to see the chronological execution of the jobs. They run in sequence, but each may have internal parallelism.
- **Task distribution**: Open a shuffle stage of Job 4 and look at the task details to see how work is distributed across executors. Tasks that take much longer than the rest indicate data skew.

By examining the Jobs tab, you can see exactly:

- Which parts of your code trigger execution
- How complex operations are broken into stages
- How long each operation takes
- The logical progression of your Spark application

Understanding this relationship between your code and the Jobs tab helps you identify performance bottlenecks and optimize your Spark applications by focusing on the most expensive operations. The Jobs tab serves as your main window into Spark's execution model, showing the direct translation from your high-level code to Spark's distributed execution plan.
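
The same job list is also exposed as JSON by the monitoring REST API that ships with the UI, which is handy when you want to compare runs without clicking through the browser. The sketch below is one possible way to summarize it. The endpoint paths and the field names (`jobId`, `status`, `submissionTime`, `completionTime`, `stageIds`) follow the Spark monitoring documentation as I understand it, and the helper names are invented for this example, so treat both as assumptions to verify against your Spark version.

```python
import json
from datetime import datetime
from urllib.request import urlopen

# Timestamp layout assumed for the REST API, e.g. "2025-05-13T23:15:27.400GMT".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def fetch_json(url):
    """Download and decode one JSON document from the Spark REST API."""
    with urlopen(url) as response:
        return json.load(response)


def job_duration_seconds(job):
    """Seconds between submission and completion, or None while still running."""
    if "completionTime" not in job:
        return None
    start = datetime.strptime(job["submissionTime"], TIME_FORMAT)
    end = datetime.strptime(job["completionTime"], TIME_FORMAT)
    return (end - start).total_seconds()


def summarize_jobs(jobs):
    """Return (job id, status, duration in seconds, stage count) rows ordered by job id."""
    return [
        (job["jobId"], job["status"], job_duration_seconds(job), len(job.get("stageIds", [])))
        for job in sorted(jobs, key=lambda j: j["jobId"])
    ]


# Usage against a running application (adjust the port if 4040 was taken):
# base = "http://localhost:4040/api/v1"
# app_id = fetch_json(f"{base}/applications")[0]["id"]
# for row in summarize_jobs(fetch_json(f"{base}/applications/{app_id}/jobs")):
#     print(row)
```

Sorting the rows by duration instead of job id gives a quick shortlist of the jobs most worth optimizing.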

## 2. Understanding Execution Through Visualizations

### Event Timeline

The Event Timeline provides a chronological view of your Spark application's execution, showing:

- When each job, stage, and task ran
- How long each component took to execute
- Parallel execution across your cluster
- Wait times and potential bottlenecks

Pay attention to "gaps" in the timeline, which may indicate scheduling delays or resource contention. If you see tasks executing serially rather than in parallel, you might need to **adjust your partitioning**. Long-running tasks that delay an entire stage completion are often signs of data skew.


![Event Timeline](res/Event_timeLine.png)

1. **Time Progression**: The horizontal axis shows time, with timestamps marking specific points (17:42 through 17:50 in this example).

2. **Component Layers**: The timeline is divided into horizontal sections:
   - **Executors**: Shows when executors are added or removed from your application
   - **Jobs**: Shows when jobs are running, succeeded, or failed

3. **Color Coding**: Different colors represent different states:
   - **Blue** (for Executors): Added executors
   - **Pink/Red**: Removed executors or failed jobs
   - **Blue** (for Jobs): Succeeded jobs
   - **Green**: Currently running jobs

4. **Key Events Visible**:
   - At the beginning, you can see when executors were added to the application
   - Throughout the timeline, small blue bars represent completed jobs
   - On the right side, a green "count" job is currently running

### What to Look For in the Event Timeline:

The Event Timeline helps you identify:

- **Executor Lifecycle**: When executors join or leave your application
- **Job Execution Patterns**: When jobs start and finish
- **Idle Periods**: Gaps in the timeline where no activity occurs
- **Concurrency**: Multiple jobs or tasks running in parallel
- **Long-Running Operations**: Jobs that span significant portions of the timeline

While this example shows a relatively simple execution pattern, in more complex applications you would see many more parallel activities, helping you visualize how effectively your application utilizes cluster resources.

The Event Timeline is particularly valuable when diagnosing performance issues, as it can reveal execution bottlenecks, resource contention, or inefficient scheduling patterns.

You can enable zooming (checkbox at the top) to examine specific time periods in more detail, which is especially useful for complex applications with many overlapping activities.
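
Idle periods are easy to spot by eye on a short timeline, but the idea can also be written down. The function below is a small, Spark-independent sketch: it takes job intervals as `(start, end)` pairs in seconds (for instance derived from submission and completion times) and reports the stretches where nothing was running. The name `idle_gaps` and the one-second default threshold are arbitrary choices for this illustration.

```python
def idle_gaps(intervals, min_gap=1.0):
    """Return (gap_start, gap_end) pairs during which no job was running.

    `intervals` is an iterable of (start, end) pairs in seconds. Overlapping
    jobs are treated as one busy period, and gaps shorter than `min_gap`
    seconds are ignored.
    """
    gaps = []
    busy_until = None
    for start, end in sorted(intervals):
        if busy_until is not None and start - busy_until >= min_gap:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Two overlapping jobs, a two-second pause, then two jobs close together.
print(idle_gaps([(0.0, 2.0), (1.0, 3.0), (5.0, 6.0), (6.5, 7.0)]))
```

A gap is not necessarily a problem: in a notebook it is often just the time between running two cells. It becomes interesting when it appears inside what should be one continuous workload.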


## DAG Visualization: Understanding Spark's Execution Plan

The Directed Acyclic Graph (DAG) visualization provides a clear picture of how Spark structures the operations in your job. This visualization is extremely valuable for understanding the logical flow of data through your Spark application:

- Each node represents an operation (like map, filter, join)
- Arrows show data flow and dependencies
- Stages are separated by shuffle boundaries (where data needs to be redistributed)

The DAG helps you identify expensive operations like shuffle-heavy joins or cartesian products. Multiple arrows converging on a single operation often indicate a join or aggregation that could become a bottleneck. Wide transformations (those that require shuffles) create stage boundaries and are typically more expensive than narrow transformations.


![DAG Visualization](res/DAG.png)

### What You're Seeing in This DAG:

This visualization shows how a Spark job is broken down into multiple stages (Stage 10 and Stage 11) and the operations within each stage:

1. **Stage Boundaries**: Each pink box represents a distinct stage. Stages are separated at points where data must be redistributed across the cluster (called shuffle boundaries).

2. **Operations Within Stages**: The blue boxes represent specific operations:
   - **Parallelize**: Creating a distributed dataset from data
   - **Scan**: Reading data from a source
   - **Exchange**: Redistributing data across the cluster (shuffle operation)
   - **WholeStageCodegen**: Spark's optimization that compiles multiple operations into efficient bytecode
   - **MapPartitionsInternal**: An internal transformation operating on each partition

3. **Data Flow Direction**: The arrows show the direction of data flow, always moving downward within stages and then across to the next stage.

4. **Shuffle Boundary**: The curved line connecting the "Exchange" operation from Stage 10 to Stage 11 represents a shuffle - a point where data must be redistributed across the cluster. Shuffles create stage boundaries in Spark.

### Why This Matters:

The DAG visualization reveals important insights about your Spark application:

- **Performance Bottlenecks**: Exchange operations (shuffles) are often performance bottlenecks because they involve network transfer and disk I/O. In this example, there's a shuffle between Stages 10 and 11.

- **Optimization Opportunities**: The presence of "WholeStageCodegen" indicates that Spark is applying code generation optimizations to improve performance by combining multiple operations.

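- **Execution Dependencies**: You can see which operations feed into others. Within a stage, operators are pipelined: each task streams rows from "Scan" into "WholeStageCodegen (1)" without waiting for the whole scan to finish. Across the shuffle, Stage 11 cannot start until Stage 10 has written its shuffle output.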

- **Parallelization Potential**: Operations within a stage can be parallelized, but stages must execute in sequence when there are dependencies between them.

Understanding the DAG helps you reason about how your code translates into actual execution steps. When optimizing Spark applications, you'll often refer to this visualization to identify opportunities for improvement, such as reducing the number of shuffle operations or ensuring that data is properly partitioned.

In complex jobs with many stages, the DAG becomes even more valuable as it helps you visualize the execution flow that would otherwise be difficult to comprehend.
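
The shuffle boundaries drawn in the DAG also appear in the text of the physical plan as `Exchange` operators, so counting them gives a rough preview of how many stage boundaries to expect before you open the UI. The sketch below assumes that `DataFrame.explain()` prints the plan from Python (which lets `redirect_stdout` capture it) and that shuffle operators are rendered on lines starting with `Exchange`; the helper names are invented for this example.

```python
import io
import re
from contextlib import redirect_stdout

# Plan lines whose operator is a shuffle Exchange. The tree prefix is made of
# spaces, ':', '+' and '-', so BroadcastExchange lines do not match.
_EXCHANGE_LINE = re.compile(r"^[ \t:+\-]*Exchange\b", re.MULTILINE)


def plan_text(df):
    """Capture the text that df.explain() prints."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def count_shuffles(plan):
    """Count shuffle Exchange operators in a physical plan string."""
    return len(_EXCHANGE_LINE.findall(plan))


# Usage with any DataFrame in a live session, e.g. a grouped aggregation:
# print(count_shuffles(plan_text(some_dataframe)))
```

Treat the result as an estimate only: adaptive query execution can change the plan at runtime, and the DAG in the UI remains the authoritative picture.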

### Let's create a code example to demonstrate:
Run this code:

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Simple DAG Example") \
    .master("local[*]") \
    .getOrCreate()

# Create a sample DataFrame
data = [(i, f"product_{i % 3}", i * 10) for i in range(50)]
df = spark.createDataFrame(data, ["id", "product", "value"])

# Apply a series of transformations that will create a visible DAG
# Step 1: Filter the data
filtered_df = df.filter(col("value") > 20)

# Step 2: Group by product
summary_df = filtered_df.groupBy("product").agg(
    avg("value").alias("avg_value")
)

# Execute an action to materialize the DAG
print("Results after transformation:")
summary_df.show()

print("\nCheck the Spark UI at http://localhost:4040")
print("Go to the Jobs tab, click on the job, then view the DAG Visualization")

## 3. The Stages Tab: Diving Deeper Into Execution

Stages are sets of tasks that can be executed in parallel without data shuffling. The Stages tab shows:

- Detailed metrics for each stage
- Input/output data sizes
- Task distribution and execution times
- Whether stages completed successfully or failed

When examining the Stages tab, look for:

- **Task skew**: When some tasks take much longer than others in the same stage
- **Spill metrics**: Data spilled to disk indicates memory pressure
- **Shuffle read/write sizes**: Large shuffles can slow down your application
- **Input/output records**: Help identify data flow bottlenecks

Clicking on a specific stage provides a task-level view where you can see outliers and understand the distribution of work across your cluster.
This view is invaluable for understanding the actual work happening in your Spark application.
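
Task skew can be put into a single number by comparing the slowest task with the typical one, which mirrors reading the Max and Median columns of the Summary Metrics table on the stage detail page. The function below is a sketch of that comparison on a plain list of task durations. The name `skew_ratio` is invented here, and any cutoff for "too skewed" is a judgment call, not a Spark rule.

```python
from statistics import median


def skew_ratio(task_durations):
    """Ratio of the slowest task to the median task.

    Values close to 1 mean the work is evenly spread; a large value means a
    few tasks dominate the stage's running time.
    """
    if not task_durations:
        raise ValueError("need at least one task duration")
    typical = median(task_durations)
    slowest = max(task_durations)
    if typical == 0:
        return float("inf") if slowest > 0 else 1.0
    return slowest / typical


# Four similar tasks and one straggler (durations in milliseconds).
print(round(skew_ratio([100, 110, 90, 105, 900]), 1))
```

A stage like the one in the example finishes only when the 900 ms task does, so repartitioning or salting the hot key is usually more effective than adding executors.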


![Stages Tab](res/Stage_image1.png)

### Understanding the Stages Tab Structure:

The Stages tab is divided into three key sections, each showing stages in different states:

1. **Active Stages**: Currently running stages
   - In our example, Stage 13 is active with 0/4 tasks completed (4 running)
   - Shows real-time progress of execution

2. **Pending Stages**: Stages waiting for resources or dependent stages to complete
   - Stage 14 is pending with 0/200 tasks completed (200 matches the default value of `spark.sql.shuffle.partitions`)
   - These stages are queued but haven't started execution yet

3. **Completed Stages**: Stages that have finished execution
   - Our example shows multiple completed stages (5-12)
   - Provides historical performance data for analysis

### Key Metrics To Monitor:

Each stage displays several important metrics:

- **Stage ID**: Unique identifier for each stage
- **Pool Name**: The scheduler pool handling this stage (affects resource allocation)
- **Description**: The operation being performed (like "show" or "foreach")
- **Duration**: How long the stage took to execute (between 64 ms and a fraction of a second in our example)
- **Tasks**: Shows completed/total tasks and visualizes progress
- **Shuffle Read/Write**: Amount of data transferred during shuffle operations
  - Note stages with significant shuffle write (1205.5 KiB) which indicate data redistribution

### What This Tells Us About Performance:

Looking at the completed stages, we can observe:
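- Most stages complete quickly (0.3s-0.5s)
- Stage 6 finished in just 64 ms, well below the others. The stages to investigate are the opposite case: ones that take significantly longer than their neighbors are potential bottlenecks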
- Several stages have identical shuffle write sizes (1205.5 KiB), suggesting similar data volume processing
- Task counts vary (some have 1 task, others have 4), indicating different levels of parallelism

By examining these metrics, you can identify which stages contribute most to your job's overall execution time and focus your optimization efforts accordingly.

You can click on any stage ID to view more detailed information about that specific stage, including a visual representation of its internal operations.


## Stage Detail View: Understanding Operation Flow

When you click on a specific stage in the Stages tab, you'll see a detailed view of the operations within that stage, as shown below:

![Stage Detail DAG](res/stage_image2_dag.png)

### Decoding the Stage Internals:

This visualization shows the execution plan for Stage 15, revealing how data flows through various operations:

1. **Parallel Input Paths**: The stage begins with two parallel data paths, each starting with a "ShuffledRowRDD" operation followed by "Exchange" - indicating data coming from previous stages through a shuffle.

2. **Data Transformations**: Each path goes through a series of operations:
   - **MapPartitionsRDD**: Applying transformations to each partition
   - **WholeStageCodegen**: Spark's optimization that combines operations into efficient bytecode
   - **Exchange**: Points where data is redistributed across the cluster

3. **Data Flow Convergence**: The two paths merge at the "ZippedPartitionsRDD2" operation, combining data from both sources.

4. **Final Processing**: After merging, the data goes through additional mapping operations and a final exchange before completing the stage.

### Performance Insights From This View:

This detailed view reveals important aspects of Spark's execution strategy:

- **Code Generation Optimization**: Multiple "WholeStageCodegen" boxes show where Spark compiles operations together for efficiency - operations within these boundaries execute much faster.

- **Data Shuffling Points**: Each "Exchange" operation represents a potential bottleneck where data moves between executors.

- **Operation Dependencies**: The arrows show which operations must wait for others to complete, revealing the critical path through your execution.

- **Operation Numbering**: Each operation has a unique identifier (like "[35]", "[41]", etc.) that helps trace specific transformations back to your code.

This visualization is invaluable for understanding complex transformations and identifying optimization opportunities. When tuning Spark applications, you'll often examine these diagrams to spot inefficient patterns like excessive shuffling or missed opportunities for pipelining operations.

## 4. The Storage Tab: Managing Cached Data

The Storage tab provides insight into how Spark manages cached data, which is crucial for optimizing performance by avoiding redundant computations.

In the Storage tab, you can see:

- Which datasets are cached
- How much memory they're using
- Whether they're stored in memory, on disk, or both
- The fraction of the dataset that's cached

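If the "Fraction Cached" is less than 100%, some partitions are not stored at all. Either they have not been computed yet (for example, an action such as `show()` touched only a few partitions), or they did not fit in memory under a memory-only storage level and were dropped. With MEMORY_AND_DISK, partitions that don't fit in memory are written to disk instead and still count as cached. Different storage levels (like MEMORY_ONLY, MEMORY_AND_DISK, or DISK_ONLY) affect both performance and resilience.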

The "Size in Memory" vs "Size on Disk" comparison helps you understand serialization overhead and compression efficiency. Partitions that are well-distributed will show more even storage across executors.
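
The numbers behind "Fraction Cached" are simple: cached partitions divided by total partitions. The sketch below computes it from one entry of the REST API's cached-RDD listing (`/api/v1/applications/[app-id]/storage/rdd`). The field names (`name`, `storageLevel`, `numPartitions`, `numCachedPartitions`, `memoryUsed`, `diskUsed`) are my reading of that API and should be checked against your Spark version; `cache_report` is a name made up for this example.

```python
def cache_report(rdd_info):
    """Summarize one cached RDD/DataFrame entry from the storage REST endpoint."""
    total = rdd_info["numPartitions"]
    cached = rdd_info["numCachedPartitions"]
    return {
        "name": rdd_info["name"],
        "storage_level": rdd_info["storageLevel"],
        # 1.0 means every partition is stored (in memory, on disk, or both).
        "fraction_cached": cached / total if total else 0.0,
        "memory_bytes": rdd_info["memoryUsed"],
        "disk_bytes": rdd_info["diskUsed"],
    }


# Hypothetical entry: only 3 of 4 partitions have been materialized so far.
entry = {
    "name": "rdd",
    "storageLevel": "Memory Serialized 1x Replicated",
    "numPartitions": 4,
    "numCachedPartitions": 3,
    "memoryUsed": 236,
    "diskUsed": 0,
}
print(cache_report(entry)["fraction_cached"])
```

If the fraction stays below 1.0 after a full action such as `count()`, the dataset probably does not fit under the chosen storage level.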



![Storage Tab](res/storage_image1.png)
![Storage Tab](res/Storage_tab.png)

### Understanding the Storage Tab:

The Storage tab displays all cached RDDs, DataFrames, and Datasets in your application, along with key metrics that help you monitor your memory usage:

1. **RDD/Table Information**:
   - **ID**: Unique identifier for each cached object
   - **Name**: Descriptive name of the dataset (e.g., "rdd" or "LocalTableScan [count#7, name#8]")

2. **Storage Characteristics**:
   - **Storage Level**: How data is stored - shown in our example:
     - "Memory Serialized 1x Replicated": Stored in memory in serialized format
     - "Disk Serialized 1x Replicated": Stored on disk in serialized format

3. **Partition Information**:
   - **Cached Partitions**: Number of partitions stored (5 and 3 respectively)
   - **Fraction Cached**: Percentage of data actually cached (both at 100%)

4. **Size Metrics**:
   - **Size in Memory**: Space used in RAM (236.0 B for the first RDD)
   - **Size on Disk**: Space used on disk (2.1 KiB for the second RDD)

### Why the Storage Tab Is Important:

This tab helps you:

- **Verify Caching Strategy**: Confirm that your `.cache()` or `.persist()` operations worked as expected
- **Monitor Memory Usage**: Ensure you're not caching too much data and risking out-of-memory errors
- **Diagnose Performance Issues**: If a query is slow despite caching, check if data was actually cached
- **Optimize Storage Levels**: Make informed decisions about which storage level to use (memory-only, disk-only, or combined)

For larger datasets, you'll also see when data partially spills to disk or when some partitions couldn't be cached due to memory constraints.

You can click on any RDD name to see a more detailed view showing exactly how the data is distributed across executors and individual partition sizes.

For large-scale applications, proper caching can dramatically improve performance by reusing data across operations rather than recomputing it each time. The Storage tab gives you visibility into this crucial optimization.

## 5. The SQL Tab: Analyzing Query Performance

The SQL tab in Spark UI provides visibility into all Spark SQL operations in your application, including DataFrame operations, which are planned and optimized by the same Spark SQL engine as SQL queries.

![SQL Tab](res/sql_image.png)

### Understanding the SQL Tab:

The SQL tab shows a list of all queries executed in your Spark application, with crucial information about each:

1. **Query Identification**:
   - **ID**: Each query receives a unique identifier (0, 1, 2 in our example)
   - **Description**: Shows the operation executed (e.g., "count at \<console\>:26", "createGlobalTempView at \<console\>:26")

2. **Execution Timeline**:
   - **Submitted**: When the query was sent for execution
   - **Duration**: How long each query took to run (ranging from 0.3s to 2s in our example)

3. **Related Jobs**:
   - **Job IDs**: Links to the Spark jobs created to execute this query
   - Note how query ID 2 generated multiple jobs ([1][2][3][4][5]), indicating a more complex execution

4. **Details Link**:
   - The "+details" link allows you to dive deeper into each query
   - Clicking it reveals the logical and physical plans for the query

### What This Tells Us About Query Execution:

Looking at this overview, we can observe:

- The "count" operation (ID 0) took 2 seconds and generated one job
- The "createGlobalTempView" operation (ID 1) was relatively quick at 0.3 seconds
- The "show" operation (ID 2) was more complex, generating 5 different jobs and taking 2 seconds

### Why the SQL Tab Is Valuable:

The SQL tab helps you:

- **Track SQL Performance**: Identify which queries are taking the longest time
- **Connect Operations**: Link high-level DataFrame operations to the underlying Spark jobs
- **Troubleshoot Issues**: When a query performs poorly, you can examine its execution details
- **Optimize Queries**: By understanding the execution plan, you can modify your code for better performance

When you click the "+details" link for a query, you'll see the logical and physical plans, which reveal how Spark interprets and optimizes your query. This deeper view shows operations like joins, filters, and projections, along with important optimization decisions made by the Spark SQL engine.


## 6. The Environment Tab: Configuration Details

The Environment tab provides a comprehensive view of your Spark application's configuration settings, helping you understand exactly how Spark is set up to run.


![Environment Tab](res/Environment_tab.png)

### Key Components of the Environment Tab:

The Environment tab is divided into several sections, with the most important being Runtime Information and Spark Properties:

1. **Runtime Information**:
   - **Java/Scala Versions**: In our example, we see Java 1.8.0_221 and Scala 2.12.8
   - **Java Home**: Shows where the JVM is installed
   - This information helps troubleshoot compatibility issues

2. **Spark Properties**:
   - **Application Configuration**: Settings like `spark.app.id` and `spark.app.name`
   - **Execution Settings**: Properties that control how Spark runs your jobs
   - **Notable Properties in Our Example**:
     - `spark.scheduler.mode`: "FIFO" in our example
     - `spark.sql.catalogImplementation`: "in-memory"
     - `spark.master`: "local[*]" indicating local mode execution

### Important Settings to Monitor:

Several key properties in this tab directly impact application performance:

- **Resource Allocation**: While not explicitly configured in our example, you'd typically see memory settings like `spark.executor.memory` here
- **Execution Control**: The `spark.scheduler.mode` determines how concurrent jobs are handled
- **SQL Catalog**: `spark.sql.catalogImplementation` shows where Spark keeps SQL metadata such as table definitions ("in-memory" here, as opposed to a Hive metastore)

### Why This Information Matters:

The Environment tab serves several crucial purposes:

- **Troubleshooting**: When something isn't working, this is often the first place to check
- **Verification**: Confirm your configuration settings were properly applied
- **Documentation**: Provides a complete record of your application's environment
- **Optimization**: Identify opportunities to tune settings for better performance

For developers and administrators, the Environment tab is invaluable for understanding the exact conditions under which your Spark application is running. Many performance issues can be traced back to suboptimal configuration settings visible in this tab.

In production environments, you'll often compare configurations across different applications to standardize settings or identify differences that might explain performance variations.
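
Comparing two configurations by eye gets tedious quickly. The sketch below diffs two sets of Spark properties given as `(key, value)` pairs, which is the shape returned by `spark.sparkContext.getConf().getAll()` in PySpark. The function name `diff_properties` and the sample values are illustrative only.

```python
def diff_properties(left, right):
    """Return {key: (left_value, right_value)} for every property that differs.

    `left` and `right` are iterables of (key, value) pairs. A property that is
    missing on one side is reported with None for that side.
    """
    left_map, right_map = dict(left), dict(right)
    differences = {}
    for key in sorted(left_map.keys() | right_map.keys()):
        a, b = left_map.get(key), right_map.get(key)
        if a != b:
            differences[key] = (a, b)
    return differences


# Hypothetical settings from two applications.
app_a = [("spark.master", "local[*]"), ("spark.scheduler.mode", "FIFO")]
app_b = [("spark.master", "local[*]"), ("spark.scheduler.mode", "FAIR"),
         ("spark.executor.memory", "4g")]
for key, (a, b) in diff_properties(app_a, app_b).items():
    print(f"{key}: {a} -> {b}")
```

In a live session, one side of the comparison could come from `spark.sparkContext.getConf().getAll()` and the other from a saved copy of an earlier run's properties.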

## 7. The Executors Tab: Resource Utilization

The Executors tab provides insight into the worker processes that execute your Spark jobs, showing how computational resources are allocated and utilized across your cluster.


![Executors Tab](res/exector_tab.png)

### Understanding the Executors Tab:

The Executors tab is divided into two main sections:

1. **Summary Statistics**:
   - **Active/Total/Dead Executors**: In our example, there are 3 active executors and 0 dead ones
   - **Resource Allocation**: The cluster has 2 cores active and 5.9 KiB / 1.1 GiB of storage memory in use
   - **Task Metrics**: 5 completed tasks with a total task time of 4 seconds (including 0.2 seconds of garbage collection)

2. **Individual Executor Details**:
   - **Executor ID**: Each executor has a unique identifier (0, 1, and "driver")
   - **Status**: All executors show as "Active" in our example
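   - **Resource Allocation**: Executors 0 and 1 each have 1 core (which accounts for the 2 active cores in the summary; the driver row contributes none), and each row shows about 2 KiB / 366.3 MiB of storage memory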
   - **Task Distribution**: Executor 1 has completed 3 tasks, Executor 0 has completed 2 tasks
   - **Performance Metrics**: Task time and GC time help identify processing bottlenecks
   - **Logs**: Links to stdout/stderr logs and thread dumps for debugging

### Key Metrics to Monitor:

Several important indicators of application health are visible here:

- **Task Distribution**: Ideally, tasks should be evenly distributed across executors. In our example, tasks are reasonably balanced between Executors 0 and 1.
- **Memory Usage**: Storage memory shows how much data is cached on each executor
- **GC Time**: Garbage collection time as a proportion of total task time (0.2s out of 4s here) helps identify memory pressure
- **Shuffle Read/Write**: In our example, there's no shuffle activity (all 0.0 B), but in data-intensive applications, high shuffle volumes can indicate potential performance issues
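
The GC check described above is a simple ratio, shown here as a sketch over executor records shaped like the REST API's executor listing. The field names `id`, `totalDuration`, and `totalGCTime` (both in milliseconds) are my reading of that API, and the 10% threshold is a common rule of thumb used here as an assumption, not an official limit.

```python
def gc_fraction(executor):
    """Share of an executor's task time that was spent in garbage collection."""
    duration = executor["totalDuration"]
    return executor["totalGCTime"] / duration if duration else 0.0


def flag_gc_pressure(executors, threshold=0.10):
    """Return the ids of executors whose GC share exceeds `threshold`."""
    return [e["id"] for e in executors if gc_fraction(e) > threshold]


# Hypothetical records: executor "0" mirrors the 0.2s of GC in 4s of task time
# from the example above; executor "1" spends 30% of its time in GC.
executors = [
    {"id": "0", "totalDuration": 4000, "totalGCTime": 200},
    {"id": "1", "totalDuration": 1000, "totalGCTime": 300},
    {"id": "driver", "totalDuration": 0, "totalGCTime": 0},
]
print(gc_fraction(executors[0]))
print(flag_gc_pressure(executors))
```

An executor that keeps getting flagged is a candidate for more memory, fewer cached partitions, or smaller partitions per task.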

### Why the Executors Tab Is Important:

This tab helps you:

- **Monitor Resource Utilization**: Ensure executors have appropriate memory and CPU allocation
- **Identify Skew**: Detect when certain executors are processing significantly more data or tasks than others
- **Troubleshoot Failures**: Quickly access logs when executors fail or tasks encounter errors
- **Track Performance**: Monitor GC time and task execution metrics to optimize resource allocation

For production Spark applications, regularly checking the Executors tab helps ensure your cluster is properly sized and resources are efficiently utilized across all worker nodes.

In larger deployments, you might see dozens or hundreds of executors, making this view essential for identifying outliers or problematic nodes that could impact overall application performance.


## Putting It All Together: Analyzing a Complete Workflow

Let's create a more comprehensive example to demonstrate how all aspects of the Spark UI work together:


In [3]:
from pyspark.sql import SparkSession
import random

# Initialize a Spark session (if not already created)
spark = SparkSession.builder \
    .appName("SparkUIComprehensiveExample") \
    .master("local[*]") \
    .getOrCreate()

# Create example datasets
students = spark.createDataFrame([
    (i, f"Student_{i}", random.randint(14, 18))
    for i in range(1000)
], ["id", "name", "age"])

grades = spark.createDataFrame([
    (random.randint(0, 999),  # student ids run from 0 to 999
     random.choice(['Math', 'Science', 'History']),
     random.randint(60, 100))
    for _ in range(5000)
], ["student_id", "subject", "grade"])

attendance = spark.createDataFrame([
    (random.randint(0, 999),  # student ids run from 0 to 999
     random.choice(['Math', 'Science', 'History']),
     random.randint(70, 100))
    for _ in range(7000)
], ["student_id", "subject", "attendance_pct"])

# Cache one of our tables
grades.cache()
grades.count()  # Materialize the cache

# Make them available for SQL
students.createOrReplaceTempView("students")
grades.createOrReplaceTempView("grades")
attendance.createOrReplaceTempView("attendance")

# Run a complex query
result = spark.sql("""
    SELECT
        s.name,
        s.age,
        AVG(g.grade) as avg_grade,
        AVG(a.attendance_pct) as avg_attendance,
        COUNT(DISTINCT g.subject) as subjects_taken
    FROM students s
    JOIN grades g ON s.id = g.student_id
    JOIN attendance a ON s.id = a.student_id AND g.subject = a.subject
    GROUP BY s.name, s.age
    HAVING AVG(g.grade) > 80 AND AVG(a.attendance_pct) > 85
    ORDER BY avg_grade DESC, avg_attendance DESC
""")

# Show the results
result.show()

25/05/13 23:15:27 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
                                                                                

+-----------+---+-----------------+-----------------+--------------+
|       name|age|        avg_grade|   avg_attendance|subjects_taken|
+-----------+---+-----------------+-----------------+--------------+
|Student_890| 16|            100.0|             99.0|             1|
|Student_127| 15|             97.0|92.83333333333333|             1|
|Student_626| 17|             96.0|             98.0|             1|
|Student_364| 18|             95.2|             86.0|             2|
|Student_149| 18|             95.0|             97.0|             1|
|Student_419| 16|            94.25|             89.0|             2|
|Student_159| 15|             93.0|90.71428571428571|             2|
|Student_764| 18|             93.0|            86.75|             2|
|Student_927| 14|92.92307692307692|89.76923076923077|             3|
|Student_125| 18|92.85714285714286|85.42857142857143|             2|
|Student_398| 17|92.66666666666667|89.33333333333333|             2|
| Student_80| 14|             92.3

After running this example, explore:
1. The Jobs tab to see all created jobs
2. The Stages tab to analyze execution stages
3. The SQL tab to view the query plans
4. The Storage tab to see cached data
5. The Executors tab to examine resource usage
