# PySpark - 5 min intro

PySpark is the Python API for Apache Spark, a distributed processing engine that offers a faster, more flexible alternative to the traditional MapReduce framework.

## RDD - Resilient Distributed Dataset

- It is the fundamental data structure in Spark.
- Think of it as a distributed collection of elements that can be processed in parallel.


---

**RDDs are fault-tolerant**, **immutable**, and can be operated on using two types of operations:

- **Transformations**: Lazily define a new RDD from an existing one (e.g., `map()`, `filter()`, `flatMap()`). Nothing is computed until an action runs.

- **Actions**: Trigger the computation and return a value to the driver program or write data to an external storage system (e.g., `collect()`, `count()`, `reduce()`).
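
To make the transformation/action split concrete without needing a cluster, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical class invented for this illustration; it is not part of PySpark and says nothing about how Spark is implemented internally. It only mirrors the two ideas above: transformations return a new object and leave the original untouched, and actions return a plain value.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not part of PySpark)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # the original data is never modified
        self._steps = tuple(steps)    # recorded transformations, not yet run

    # --- Transformations: return a new ToyRDD and compute nothing ---
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._steps + (("filter", fn),))

    # --- Actions: run the recorded steps and return a plain value ---
    def collect(self):
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:  # "filter"
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        return len(self.collect())


toy = ToyRDD(range(1, 11))
shifted = toy.map(lambda x: x + 1).filter(lambda x: x > 8)

print(shifted.collect())  # [9, 10, 11]
print(toy.count())        # 10 (the original is unchanged)
```

The real RDD API used in the cells below follows the same shape: `map()` and `filter()` hand back a new RDD, while `collect()` and `count()` hand back ordinary Python values.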

In [1]:
from pyspark import SparkContext

# Initialize a SparkContext
# This is the entry point to Spark functionality.
sc = SparkContext("local", "PySparkBasics-5min-Primer")

# Create an RDD from a list of numbers (our sample dataset)
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])



### RDD operations

In [2]:
# Multiply each element by 2
doubled = numbers.map(lambda x: x * 2)

# Print the results
print("Doubled Numbers:", doubled.collect())

Doubled Numbers: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

> `collect()` returns every element of the RDD to the driver program, so it can exhaust the driver's memory on large datasets (for our toy dataset, it's fine). To inspect a sample, use `take(n)`, which returns only the first `n` elements.

In [3]:
doubled.take(5)

[2, 4, 6, 8, 10]

In [4]:
# Keep only the numbers greater than 5; filter() returns a new RDD
filtered = numbers.filter(lambda x: x > 5)

# Print the filtered numbers
print("Numbers > 5:", filtered.collect())

Numbers > 5: [6, 7, 8, 9, 10]


In [5]:
# Sum all the numbers in the RDD
sum_numbers = numbers.reduce(lambda a, b: a + b)

# Print the sum
print("Sum of Numbers:", sum_numbers)

Sum of Numbers: 55
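
One detail worth knowing about `reduce()`: the PySpark documentation asks for a commutative and associative function, because each partition is reduced on its own before the partial results are combined. The sketch below imitates that in plain Python. `partitioned_reduce` is a hypothetical helper written for this illustration, and its simple chunking is an assumption, not how Spark actually splits data.

```python
from functools import reduce


def partitioned_reduce(data, fn, num_partitions):
    """Reduce each chunk on its own, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(fn, chunk) for chunk in chunks]
    return reduce(fn, partials)


data = list(range(1, 11))

# Addition is associative and commutative: the split does not matter.
print(partitioned_reduce(data, lambda a, b: a + b, 1))  # 55
print(partitioned_reduce(data, lambda a, b: a + b, 3))  # 55

# Subtraction is neither: the answer depends on how the data is split.
print(partitioned_reduce(data, lambda a, b: a - b, 1))  # -53
print(partitioned_reduce(data, lambda a, b: a - b, 2))  # 15
```

Sums, products, `min`, and `max` are safe choices; operations like subtraction or division can give different answers depending on the partitioning.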


In [6]:
# Create an RDD from a list of sentences
sentences = sc.parallelize([
    "PySpark is fun",
    "MapReduce is old school",
    "Spark offers more functionality"
])

# Split each sentence into words and flatten the result
words = sentences.flatMap(lambda sentence: sentence.split(" "))

# Print the words
print("Words:", words.collect())

Words: ['PySpark', 'is', 'fun', 'MapReduce', 'is', 'old', 'school', 'Spark', 'offers', 'more', 'functionality']


### Chaining transformations

In [7]:
# Chain multiple operations: square even numbers from our list
result = numbers.filter(lambda x: x % 2 == 0) \
                .map(lambda x: x * x)  # square the numbers

# Execute the pipeline and collect the results
print("Squared Even Numbers:", result.collect())

Squared Even Numbers: [4, 16, 36, 64, 100]




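> Transformations are lazy: `filter()` and `map()` only record the steps to run. An action such as `.collect()` triggers the actual computation.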
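
Spark transformations are lazy, and a rough plain-Python analogy is a generator expression: defining it does no work, and consuming it does. The sketch below is only an analogy under that assumption; the names `calls`, `double`, and `pipeline` are made up for this example and no Spark code is involved.

```python
calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


# Building the pipeline does no work yet, loosely like calling .map() on an RDD.
pipeline = (double(x) for x in range(1, 11))
print(len(calls))  # 0

# Consuming the pipeline plays the role of an action such as collect().
result = list(pipeline)
print(len(calls))  # 10
print(result)      # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```

The analogy has a limit: a generator can be consumed only once, whereas an RDD can be acted on repeatedly, with each action re-running its transformations unless the RDD has been cached.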

In [8]:
# Count the elements in the RDD (count() is an action)
numbers.count()

10