In [15]:
# Set the PySpark environment variables
import os
os.environ['SPARK_HOME'] = "/home/user5/Downloads/spark"
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'lab'
os.environ['PYSPARK_PYTHON'] = 'python'

In [16]:
from pyspark.sql import SparkSession

In [17]:
# Create a SparkSession
spark = SparkSession.builder.appName("RDD-Demo").getOrCreate()

# getOrCreate() → Reuses the existing SparkSession or creates a new one, so re-running this cell in Jupyter does not fail.


RDD = Resilient Distributed Dataset
It is Spark's core low-level data structure.

Why “Resilient”?
Because it can recover lost data automatically if a worker fails.

Why “Distributed”?
Because the data is split into partitions across the cluster.

Why “Dataset”?
Because it’s just a collection of records (like a list in Python).


How to create RDDs

In [18]:
numbers = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(numbers)

# Converts the Python list into a distributed dataset (RDD = Resilient Distributed Dataset).

#Spark splits the list into partitions so it can process data in parallel.
#spark.sparkContext is the lower-level API that works directly with RDDs.

In [19]:
# Collect action: Retrieve all elements of the RDD
rdd.collect()

[1, 2, 3, 4, 5]

In [20]:
# Create an RDD from a list of tuples
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("Alice", 40)]
rdd = spark.sparkContext.parallelize(data)

In [21]:
# Collect action: Retrieve all elements of the RDD
print("All elements of the rdd: ", rdd.collect())

All elements of the rdd:  [('Alice', 25), ('Bob', 30), ('Charlie', 35), ('Alice', 40)]


RDD Operations: Actions

In [22]:
# Count action: Count the number of elements in the RDD
count = rdd.count()
print("The total number of elements in rdd: ", count)

The total number of elements in rdd:  4


In [23]:
# First action: Retrieve the first element of the RDD
first_element = rdd.first()
print("The first element of the rdd: ", first_element)

The first element of the rdd:  ('Alice', 25)


In [24]:
# Take action: Retrieve the first n elements of the RDD
taken_elements = rdd.take(2)
print("The first two elements of the rdd: ", taken_elements)

The first two elements of the rdd:  [('Alice', 25), ('Bob', 30)]


In [27]:
# Foreach action: Print each element of the RDD (runs on the executors, so the output order is not guaranteed)
rdd.foreach(lambda x: print(x))

('Bob', 30)
('Alice', 25)
('Alice', 40)
('Charlie', 35)


RDD Operations: Transformations

In [29]:
# Map transformation: Convert name to uppercase
mapped_rdd = rdd.map(lambda x: (x[0].upper(), x[1]))
#Applies a function to each element of the RDD


In [30]:
result = mapped_rdd.collect()
print("rdd with uppercase names: ", result)

rdd with uppercase names:  [('ALICE', 25), ('BOB', 30), ('CHARLIE', 35), ('ALICE', 40)]


In [31]:
# Filter transformation: Filter records where age is greater than 30
filtered_rdd = rdd.filter(lambda x: x[1] > 30)
filtered_rdd.collect()

[('Charlie', 35), ('Alice', 40)]

In [32]:
# ReduceByKey transformation: Calculate the total age for each name
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)  # reduceByKey needs a pair RDD of (key, value) tuples
reduced_rdd.collect()

[('Charlie', 35), ('Alice', 65), ('Bob', 30)]

In [33]:
# SortBy transformation: Sort the RDD by age in descending order
sorted_rdd = rdd.sortBy(lambda x: x[1], ascending=False)
sorted_rdd.collect()

[('Alice', 40), ('Charlie', 35), ('Bob', 30), ('Alice', 25)]

Save RDDs to text file and read RDDs from text file

In [34]:
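# Save action: Save the RDD as text. "output.txt" becomes a directory of part files (one per partition),
# and the call fails if that path already exists.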
rdd.saveAsTextFile("output.txt")

In [35]:
# Create an RDD from the saved text files (each line comes back as a string)
rdd_text = spark.sparkContext.textFile("output.txt")
rdd_text.collect()

["('Bob', 30)", "('Alice', 40)", "('Alice', 25)", "('Charlie', 35)"]

In [36]:
spark.stop()

Why “Lazy”?
In Spark, transformations don’t run immediately.

They just build a plan of what to do.

The plan only runs when you call an action.


Transformations (lazy → return a new RDD)

map() → Apply a function to each element.

flatMap() → Apply a function & flatten results.

filter() → Keep only elements matching a condition.

reduceByKey() → Combine values with the same key using a function.



Actions (trigger execution → return a value to the driver or write data)

collect() → Bring all elements to the driver as a list.

count() → Count the number of elements.

first() → Get the first element.