
## SparkSession & SparkContext: The Entry Points

To interact with Spark, you use either a `SparkSession` (recommended for Spark 2.x+) or a `SparkContext`.

### SparkSession.builder.getOrCreate()
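*   **Unified Entry Point:** The primary way to start using Spark from version 2.0 onward. It consolidates the functionality of `SQLContext` and `HiveContext` and wraps a `SparkContext`. Structured Streaming is also reached through it (`spark.readStream`), while the older DStream API still requires a separate `StreamingContext`.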
*   **How it Works:**
    *   `builder`: Returns a `SparkSession.Builder` object for configuring Spark properties.
    *   `getOrCreate()`:
        *   If a `SparkSession` instance already exists, it returns the existing one.
        *   If no `SparkSession` exists, it creates a new one based on the builder's configuration.
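*   **Benefit:** Lets every part of your application share one `SparkSession` instead of creating competing ones. Note that when an existing session is returned, builder settings that are fixed at startup (such as the application name or master URL) are not re-applied.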
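
The get-or-create behavior described above can be modeled in a few lines of plain Python. This is only a toy sketch: `ToySession`, its `Builder`, and `get_or_create` are hypothetical names made up for illustration and are not PySpark's implementation. It shows why a second builder call hands back the first object and why its settings are ignored here. Real PySpark may still apply some runtime SQL settings to the existing session, which this sketch does not model.

```python
class ToySession:
    """Stand-in for a session object; NOT the real pyspark SparkSession."""

    _active = None  # the single shared instance, if one exists

    def __init__(self, options):
        self.options = dict(options)

    class Builder:
        def __init__(self):
            self._options = {}

        def config(self, key, value):
            self._options[key] = value
            return self  # returning self allows method chaining

        def get_or_create(self):
            # Create a session only if none exists; otherwise reuse it.
            if ToySession._active is None:
                ToySession._active = ToySession(self._options)
            return ToySession._active


first = ToySession.Builder().config("app.name", "first").get_or_create()
second = ToySession.Builder().config("app.name", "second").get_or_create()

print(first is second)             # True
print(second.options["app.name"])  # first
```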

### Common SparkSession Configurations

When building a `SparkSession`, you can configure various properties:

| Method                 | Description                                                                                             | Example                                        |
| :--------------------- | :------------------------------------------------------------------------------------------------------ | :--------------------------------------------- |
| `appName(name)`        | Sets a name for your application, visible in the Spark UI.                                              | `.appName("MySparkApp")`                       |
| `master(url)`          | Specifies the master URL for the cluster.                                                               | `.master("local[*]")`                          |
| `config(key, value)`   | Sets a specific Spark configuration property.                                                           | `.config("spark.executor.memory", "2g")`       |

#### `master(url)` Options:

*   `local`: Runs Spark locally with one thread.
*   `local[*]`: Runs Spark locally with as many worker threads as logical cores on your machine.
*   `local[N]`: Runs Spark locally with `N` worker threads.
*   `spark://host:port`: Connects to a standalone Spark cluster.
*   `yarn`: Connects to a YARN cluster.
*   `mesos://host:port`: Connects to a Mesos cluster (Mesos support is deprecated as of Spark 3.2).
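
To make the three `local` forms concrete, here is a small sketch of how they map to a thread count. `local_threads` is a hypothetical helper written for this note, not a Spark API, and it handles only the forms listed above. For `local[*]` it assumes that `os.cpu_count()` is a reasonable stand-in for the number of logical cores.

```python
import os
import re


def local_threads(master):
    """Return the worker-thread count implied by a local master URL.

    Hypothetical helper for illustration; returns None for non-local URLs.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None
    value = match.group(1)
    if value == "*":
        # One thread per logical core (falls back to 1 if unknown).
        return os.cpu_count() or 1
    return int(value)


for url in ["local", "local[4]", "local[*]", "yarn"]:
    print(url, "->", local_threads(url))
```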

#### Example (Python): Building a SparkSession

In [1]:
from pyspark.sql import SparkSession

# Build a SparkSession
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("local[*]") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

print("SparkSession created successfully!")

SparkSession created successfully!


In [2]:
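# Example of configuring more properties.
# getOrCreate() returns the existing session while one is active, so startup
# settings such as the app name and master URL would not change.
# Stop the first session before building one with different settings.
spark.stop()

# Note: driver memory is fixed when the driver JVM starts, so setting it here
# may have no effect in a notebook whose JVM is already running.
spark_configured = SparkSession.builder \
    .appName("ConfiguredSparkApp") \
    .master("local[4]") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.driver.memory", "1g") \
    .getOrCreate()

print("\nSpark configuration:")
print(f"App Name: {spark_configured.conf.get('spark.app.name')}")
print(f"Master: {spark_configured.conf.get('spark.master')}")
print(f"Shuffle Partitions: {spark_configured.conf.get('spark.sql.shuffle.partitions')}")

# Remember to stop the SparkSession when done
spark_configured.stop()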


Spark configuration:
App Name: ConfiguredSparkApp
Master: local[4]
Shuffle Partitions: 200



### Accessing SparkContext from SparkSession

While `SparkSession` is the primary entry point, you can still access the underlying `SparkContext` instance.

*   **How to Access:** Use `spark.sparkContext`.
*   **Role of SparkContext:**
    *   Connects to the Spark cluster.
    *   Creates Resilient Distributed Datasets (RDDs), Spark's foundational low-level data structure, on which DataFrames are built.
    *   Responsible for broadcasting variables (sending read-only data to all executors).
*   **Usage for Beginners:** Primarily used for low-level RDD operations or when you need to broadcast data, though most modern Spark tasks can be handled directly by `SparkSession` with DataFrames.

#### Example (Python): Accessing SparkContext


In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AccessSparkContext") \
    .master("local") \
    .getOrCreate()

# Access SparkContext from SparkSession
sc = spark.sparkContext

print(f"SparkContext application ID: {sc.applicationId}")
print(f"SparkContext version: {sc.version}")

SparkContext application ID: local-1759410308814
SparkContext version: 3.5.1


In [4]:
# Example of an RDD operation using sc
rdd_data = sc.parallelize([10, 20, 30])
print("RDD sum:", rdd_data.sum())

spark.stop()

RDD sum: 60



## Understanding Driver vs. Executors

In a Spark cluster, two main components facilitate distributed processing: the **driver** and **executors**.

### 1. Driver Program
*(The "Project Manager")*

*   **Location:** Runs on the machine where you launched the application in client deploy mode (the default), or on a node inside the cluster in cluster deploy mode.
*   **Key Responsibilities:**
    *   Contains the `SparkSession` and `SparkContext`.
    *   Translates user code (DataFrame transformations) into a logical plan.
    *   Converts the logical plan into an optimized physical plan (via the Catalyst Optimizer).
    *   Splits each job into a Directed Acyclic Graph (DAG) of stages and schedules the tasks of each stage on executors.
    *   Maintains information about the cluster (e.g., available executors).
    *   Collects the results that executors send back (for example, from `collect()` or `count()`).

### 2. Executors
*(The "Workers")*

*   **Location:** Worker processes that run on worker nodes in the cluster.
*   **Key Responsibilities:**
    *   Execute the actual tasks assigned by the driver.
    *   Perform computations on their assigned partition of data.
    *   Store cached data in memory or on disk (when you call `cache()` or `persist()`).
    *   Return results or status updates to the driver.
    *   Each executor can run multiple tasks concurrently.
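
The division of labor above can be imitated in plain Python. In this toy model, which is not how Spark is implemented, a thread pool plays the role of a single executor that runs up to two tasks at once, each task sums one partition, and the main program plays the driver that combines the partial results. The names `split_into_partitions` and `task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def task(partition):
    """Work done by one task: sum the values in a single partition."""
    return sum(partition)


data = list(range(1, 101))  # the numbers 1..100
partitions = split_into_partitions(data, 4)

# The pool stands in for one executor running up to 2 tasks concurrently.
with ThreadPoolExecutor(max_workers=2) as executor:
    partial_sums = list(executor.map(task, partitions))

# The "driver" combines the per-partition results.
total = sum(partial_sums)
print(partial_sums)  # [325, 950, 1575, 2200]
print(total)         # 5050
```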

### Communication Flow

1.  **User Code to Logical Plan:** Your Spark application code (e.g., DataFrame transformations) is translated into a logical plan by the driver.
2.  **Logical to Physical Plan:** The Catalyst Optimizer converts the logical plan into an optimized physical plan.
3.  **Task Scheduling:** The driver breaks down the physical plan into stages and then into individual tasks, distributing them to available executors.
4.  **Task Execution:** Executors receive tasks, process their assigned portion of the data, and return results or status updates to the driver.
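
Steps 3 and 4 can be sketched with a toy word count in plain Python. This is an illustration under stated assumptions, not Spark's scheduler: the job is split by hand into two stages, each stage runs one task per partition on a thread pool standing in for the executors, and partial results are regrouped between the stages so that every word ends up in exactly one bucket. `map_task`, `reduce_task`, and the bucketing rule are hypothetical.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

lines = ["a b a", "b c", "a c c"]
partitions = [[lines[0]], [lines[1], lines[2]]]  # 2 input partitions


def map_task(partition):
    """Stage 1 task: count words within one input partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def reduce_task(pairs):
    """Stage 2 task: add up the partial counts routed to one bucket."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


with ThreadPoolExecutor(max_workers=2) as executors:
    # Stage 1: one task per input partition.
    stage1 = list(executors.map(map_task, partitions))

    # Stage boundary: regroup partial results so each word lands in one bucket.
    num_buckets = 2
    buckets = [[] for _ in range(num_buckets)]
    for counts in stage1:
        for word, count in counts.items():
            buckets[ord(word[0]) % num_buckets].append((word, count))

    # Stage 2: one task per bucket.
    stage2 = list(executors.map(reduce_task, buckets))

# The "driver" merges the final per-bucket results.
result = {}
for part in stage2:
    result.update(part)
print(sorted(result.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```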

This distributed architecture allows Spark to process vast amounts of data in parallel, making it highly scalable.