<a href="https://colab.research.google.com/github/anjli01/PySpark-Notes/blob/main/01_Spark_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Spark Basics

*   **What is Spark?**
    *   A unified analytics engine for large-scale data processing.
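    *   Significantly faster than traditional Hadoop MapReduce for many workloads, largely because it can keep intermediate data in memory instead of writing it to disk between steps.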
    *   Provides high-level APIs in Scala, Java, Python, and R.
    *   Designed for batch processing, interactive queries, streaming, and machine learning.

### 2. Spark Entry Points: `SparkSession` & `SparkContext`

To interact with Spark, you need to establish an entry point.

*   **`SparkContext` (Legacy):**
    *   The original entry point for Spark 1.x.
    *   Responsible for connecting to the Spark cluster and managing resources.
    *   Primarily used for RDD operations.
*   **`SparkSession` (Modern & Recommended):**
    *   The unified entry point for Spark 2.x and later.
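    *   Wraps `SparkContext` and replaces the separate `SQLContext` and `HiveContext` entry points with a single object. The legacy DStream API still uses its own `StreamingContext`; Structured Streaming is reached through `spark.readStream`.
    *   Recommended for all new Spark applications: DataFrames, Datasets, SQL, and Structured Streaming are available directly, and RDDs are available through `spark.sparkContext`.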
    *   **`SparkSession.builder.getOrCreate()`**:
        *   This method is the standard way to initialize `SparkSession`.
        *   `builder`: Returns a `SparkSession.Builder` object for configuring Spark properties.
        *   `getOrCreate()`:
            *   If a `SparkSession` instance already exists, it returns the existing one.
            *   If not, it creates a new one based on the builder's configuration.
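        *   As a result, repeated calls share one `SparkSession` instead of creating a new one each time. When a session already exists, some builder settings (such as `master`) can no longer be changed and do not take effect.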

**Example (Python): Initializing SparkSession**


```python
from pyspark.sql import SparkSession

# Initialize SparkSession
# appName: A name for your application, shown in the Spark UI.
# master: 'local' runs Spark on this machine with a single worker thread;
#         'local[N]' uses N threads and 'local[*]' uses one thread per available core.
# For a cluster, this would be a URL like 'spark://host:port' or 'yarn'.
spark = SparkSession.builder \
    .appName("MySparkApplication") \
    .master("local[*]") \
    .getOrCreate()

# Access SparkContext via SparkSession (for RDD operations if needed)
sc = spark.sparkContext

print(f"SparkSession created: {spark.version}")
print(f"SparkContext created: {sc.version}")

# Stop the SparkSession when done
spark.stop()
```

```
SparkSession created: 3.5.1
SparkContext created: 3.5.1
```


### 3. Spark Data Structures: RDD, DataFrame, and Dataset

Spark offers three primary data structures, each with distinct characteristics and use cases. Understanding their differences is crucial for efficient Spark development.

**Comparison Table:**

| Feature            | RDD (Resilient Distributed Dataset)                                 | DataFrame                                                                     | Dataset (Scala/Java only)                                                     |
| :----------------- | :------------------------------------------------------------------ | :---------------------------------------------------------------------------- | :---------------------------------------------------------------------------- |
| **Abstraction Level** | Low-level, object-oriented API                                      | High-level, tabular abstraction (named columns)                               | High-level, strongly-typed objects (JVM)                                      |
| **Data Structure** | Distributed collection of objects                                   | Distributed collection of Row objects with schema                             | Distributed collection of Scala case classes or Java beans with schema        |
| **Type Safety**    | Compile-time type safety in Scala/Java (`RDD[T]`); none in Python   | No compile-time type safety (runtime errors possible if column names are wrong) | **Compile-time type safety**                                                  |
| **Optimization**   | **Less optimized** (Spark doesn't know schema, limited internal optimization) | **Highly optimized** (Leverages Catalyst Optimizer & Tungsten)              | **Highly optimized** (Leverages Catalyst Optimizer & Tungsten)              |
| **API**            | Lambda functions, map, filter, reduce                               | SQL-like API, operations on named columns                                     | Functional API on objects, operations on fields of case classes               |
| **Language Support** | Scala, Java, Python (SparkR does not expose a public RDD API)       | Scala, Java, Python, R                                                        | **Scala, Java only**                                                          |
| **Use Cases**      | Unstructured data, custom transformations, when fine-grained control is needed | Structured & semi-structured data, most common use cases, SQL queries         | Structured data where compile-time type safety is critical (Scala/Java)       |

#### 3.1. RDD: Resilient Distributed Dataset

*   **Definition:** The fundamental data structure of Spark. A fault-tolerant, immutable, distributed collection of objects that can be operated on in parallel.
*   **Key Characteristics:**
    *   **Low-level & Object-oriented:** Provides granular control, working directly with Python/Scala/Java objects.
    *   **Immutable:** Once created, you cannot change an RDD. Transformations create new RDDs.
    *   **Fault-tolerant:** Each RDD records the lineage of transformations that produced it, so lost partitions can be recomputed automatically.
    *   **Distributed:** Data is partitioned across nodes in a cluster.
    *   **Less Optimized:** Spark's internal optimizers have limited visibility into the data's internal structure because RDDs don't have a schema. This means Spark can't perform many performance optimizations automatically; the developer is responsible.
*   **When to use:** When you need very fine-grained control over your data transformations, are dealing with truly unstructured data where schema inference is not possible, or when you are implementing custom Spark functionalities. For most modern use cases, DataFrames are preferred.

**Example (Python): RDD Operations**


```python
from pyspark import SparkContext

# Initialize SparkContext
# 'local' means running Spark in local mode. "RDD_Example" is the application name.
sc = SparkContext("local", "RDD_Example")

# Create an RDD from a Python list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Perform a transformation: Multiply each element by 2
rdd_transformed = rdd.map(lambda x: x * 2)

# Collect and print the results (collect() brings data from distributed RDD to local driver)
print("RDD transformed data:", rdd_transformed.collect()) # Expected: [2, 4, 6, 8, 10]

# Stop the SparkContext
sc.stop()
```

```
RDD transformed data: [2, 4, 6, 8, 10]
```



#### 3.2. DataFrame

*   **Definition:** A distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a Pandas DataFrame.
*   **Key Characteristics:**
    *   **Schema-aware:** Has a well-defined schema (column names and types), which is crucial for Spark to understand and optimize data operations.
    *   **Optimized with Catalyst Optimizer:** This is the primary advantage. Spark's Catalyst Optimizer uses the schema information to build an optimized logical and physical plan for queries.
        *   **Rule-based optimization:** Applies rules (e.g., predicate pushdown, column pruning) to simplify and optimize the query plan.
        *   **Cost-based optimization:** Can use table and column statistics, when they have been collected and cost-based optimization is enabled, to choose a more efficient physical execution plan.
    *   **Tungsten Execution Engine:** DataFrames leverage Tungsten, which performs operations directly on serialized binary data in memory, reducing memory overhead and improving CPU utilization.
        *   **Memory Management:** Optimized allocation and deallocation.
        *   **Code Generation:** Generates efficient bytecode for operations.
        *   **Cache Locality:** Improves data access patterns.
    *   **Ease of Use & Language Agnostic:** Offers a high-level, SQL-like API that's intuitive and available across all major Spark languages (Scala, Java, Python, R).
*   **When to use:** For most Spark workloads involving structured or semi-structured data. DataFrames are generally the preferred choice due to their balance of ease of use and performance optimizations.

**Example (Python): DataFrame Operations**


```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession (getOrCreate reuses an active session if one exists)
spark = SparkSession.builder \
                    .appName("DataFrame_Example") \
                    .master("local[*]") \
                    .getOrCreate()

# Create a DataFrame from a list of tuples and a list of column names
# (the column types are inferred from the data)
data = [("Alice", 1),
        ("Bob", 2),
        ("Charlie", 3),
        ("David", 4)]
columns = ["Name", "ID"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()
```

```
Original DataFrame:
+-------+---+
|   Name| ID|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
|  David|  4|
+-------+---+
```



```python
print("\nDataFrame Schema:")
df.printSchema()
```

```
DataFrame Schema:
root
 |-- Name: string (nullable = true)
 |-- ID: long (nullable = true)
```



```python
# Perform a transformation: Filter rows where ID > 1
df_filtered = df.filter(col("ID") > 1) # Using 'col' for clarity and safety, can also use df.ID > 1

print("\nFiltered DataFrame (ID > 1):")
df_filtered.show()

# Stop the SparkSession
spark.stop()
```

```
Filtered DataFrame (ID > 1):
+-------+---+
|   Name| ID|
+-------+---+
|    Bob|  2|
|Charlie|  3|
|  David|  4|
+-------+---+
```



#### 3.3. Dataset (Scala/Java only)

*   **Definition:** A strongly-typed collection of JVM objects that combines the best features of RDDs (compile-time type safety) and DataFrames (optimization with Catalyst).
*   **Key Characteristics:**
    *   **Strongly Typed:** Provides compile-time type safety. If you try to access a non-existent column or use an incompatible type, you'll get an error at compilation time, not runtime.
    *   **Optimized with Catalyst & Tungsten:** Benefits from the same performance optimizations as DataFrames.
    *   **Encoders:** Use encoders to efficiently serialize and deserialize JVM objects to/from Spark's internal Tungsten binary format, optimizing storage and processing.
*   **Limitations:** Exclusively available in Scala and Java. Python and R do not support Datasets directly, as they are not statically typed JVM languages.
*   **When to use:** When working in Scala or Java and type safety is paramount, providing peace of mind during development and preventing runtime errors related to schema mismatches.

**Conceptual Example (Scala - illustrating the idea):**

```scala
import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder()
  .appName("Dataset_Example")
  .master("local[*]")
  .getOrCreate()

// Define a case class (strongly-typed object).
// In a compiled application, declare it at the top level, outside the method that uses it.
case class Person(name: String, age: Long)

// Create a DataFrame (e.g., from JSON, schema inferred)
val peopleDF = spark.read.json("examples/src/main/resources/people.json")
// spark.read.json expects one JSON object per line. This sketch assumes the file contains data like:
//   {"name":"Alice","age":30}
//   {"name":"Bob","age":25}

// Bring the implicit encoders into scope; `.as[Person]` does not compile without them.
import spark.implicits._

// Convert the DataFrame to a Dataset of Person objects by providing the case class.
// This is where the strong typing comes in: Spark maps DataFrame columns to Person fields by name.
val peopleDS = peopleDF.as[Person]

// Perform a transformation using object-oriented API (with compile-time type safety)
// `_.age` directly accesses the 'age' field of the `Person` object.
val filteredPeopleDS = peopleDS.filter(_.age > 25)

filteredPeopleDS.show()

spark.stop()
```

### 4. Why DataFrame is Generally Preferred

For most Spark workloads, DataFrames are the go-to choice due to a compelling combination of factors:

*   **Ease of Use:** DataFrames offer a high-level, SQL-like API that is intuitive for anyone familiar with relational databases or data manipulation libraries like Pandas. This abstracts away much of the complexity of distributed processing.
*   **Optimized Performance:** The built-in **Catalyst Optimizer** and **Tungsten Execution Engine** automatically optimize queries for significantly better performance compared to RDDs, without requiring manual optimization by the developer.
*   **Broad Language Support:** DataFrames are equally powerful and available across all major Spark-supported languages (Python, Scala, Java, R), making them versatile for diverse teams.
*   **Schema Awareness:** Having a defined schema allows Spark to perform efficient operations, detect errors earlier, and provide better tooling.