# RDD - Resilient Distributed Datasets

## Characteristics
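Immutable - an RDD cannot be changed once created; transformations return new RDDs
Distributed - data is split into partitions that Spark processes in parallel
Resilient - lineage tracks the transformations applied to the data, so lost partitions can be recomputed (fault tolerance)
Lazily evaluated - transformations are only recorded in a plan; nothing runs until an action is called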

## Transformations
Create new RDDs from existing ones (e.g. map, filter, reduceByKey, sortBy)
Lazily evaluated; each one is added to the lineage graph

## Actions
Return results to the driver or perform side effects on an RDD (e.g. collect, count, first, take, foreach), triggering execution
Eager evaluation
Cause the actual data movement and computation
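
The split between lazy transformations and eager actions can be sketched in plain Python, without Spark. The `ToyRDD` class and the `double` helper below are hypothetical and are not how Spark is implemented: the class only records each transformation in a lineage list and runs nothing until `collect()` is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; illustrates laziness and lineage only."""

    def __init__(self, data, lineage=()):
        self._data = list(data)
        self._lineage = tuple(lineage)  # recorded (name, function) steps

    def map(self, fn):
        # Transformation: return a new ToyRDD with one more recorded step.
        return ToyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also lazy, nothing is computed here.
        return ToyRDD(self._data, self._lineage + (("filter", fn),))

    def lineage(self):
        return [name for name, _ in self._lineage]

    def collect(self):
        # Action: replay the recorded steps over the source data.
        items = self._data
        for name, fn in self._lineage:
            if name == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


calls = []  # records when the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


base = ToyRDD([1, 2, 3, 4, 5])
planned = base.map(double).filter(lambda x: x > 4)

print(planned.lineage())  # ['map', 'filter']
print(len(calls))         # 0, nothing has executed yet
result = planned.collect()
print(result)             # [6, 8, 10]
print(len(calls))         # 5, the action triggered the work
```

Because every step returns a new object, `base` is never modified, which mirrors the immutability point above.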



In [2]:
# set env
import os
os.environ['SPARK_HOME'] = "/home/cloud_user/apps/spark/current"
os.environ['PYSPARK_DRIVER_PYTHON'] = 'jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'lab'
os.environ['PYSPARK_PYTHON'] = 'python'

In [4]:
# Import PySpark
from pyspark.sql import SparkSession

In [7]:
# Create SparkSession
spark = SparkSession.builder \
    .appName("rdd-tutorial") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/25 08:01:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/01/25 08:01:36 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Create RDD

In [8]:
numbers = [1,2,3,4,5]
rdd = spark.sparkContext.parallelize(numbers)

In [10]:
# Collect action: retrieve all elements of the RDD to the driver (use only on small RDDs)
rdd.collect()

[1, 2, 3, 4, 5]

In [33]:
data = [("Alice", 25),("Bob", 30),("Charlie",35),("Fred",33),("Alice",20)]
rdd = spark.sparkContext.parallelize(data)

In [34]:
print(f"All elements of the rdd: {rdd.collect()}")

All elements of the rdd: [('Alice', 25), ('Bob', 30), ('Charlie', 35), ('Fred', 33), ('Alice', 20)]


## RDD Actions

In [35]:
# Count action: Count number of elements in an RDD
count = rdd.count()
print(f"The total number of elements in the rdd: {count}")

The total number of elements in the rdd: 5


In [36]:
# First action: returns the first item in the RDD
first = rdd.first()
print(f"The first element in the RDD: {first}")

The first element in the RDD: ('Alice', 25)


In [37]:
# Take action: return the first n elements of the RDD
taken = rdd.take(2)
print(f"Taken elements: {taken}")

Taken elements: [('Alice', 25), ('Bob', 30)]


In [38]:
# Foreach action: call a function on each element of the RDD (runs on the executors; returns nothing to the driver)
rdd.foreach(lambda x: print(f"Item: {x}"))

Item: ('Alice', 25)
Item: ('Bob', 30)
Item: ('Charlie', 35)
Item: ('Fred', 33)
Item: ('Alice', 20)


## RDD Transformations

In [39]:
# Map transformation: convert each name to uppercase
# Remember, transformations are lazy and only execute when an action is called.
mapped_rdd = rdd.map(lambda x: (x[0].upper(), x[1]))

In [40]:
result = mapped_rdd.collect()
print(f"RDD with uppercase transform: {result}")

RDD with uppercase transform: [('ALICE', 25), ('BOB', 30), ('CHARLIE', 35), ('FRED', 33), ('ALICE', 20)]


In [41]:
# Filter transformation: filter elements based on a condition
filter_gt_30 = rdd.filter(lambda x: x[1] > 30)

In [43]:
result = filter_gt_30.collect()
print(f"RDD filtered age greater than 30: {result}")

RDD filtered age greater than 30: [('Charlie', 35), ('Fred', 33)]


In [46]:
# reduceByKey transformation: merge the values of each key using the given function
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)

In [47]:
result = reduced_rdd.collect()
print(f"Total age per name: {result}")

Total age per name: [('Fred', 33), ('Alice', 45), ('Bob', 30), ('Charlie', 35)]
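
What `reduceByKey` computes can be modelled in plain Python. The `reduce_by_key` helper below is hypothetical and runs on a single machine; Spark instead combines values within each partition and then across partitions, which is why the function passed in should be associative and commutative and why the order of the result is not guaranteed.

```python
def reduce_by_key(pairs, fn):
    """Plain-Python model: fold the values of each key with fn."""
    merged = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later ones are folded in.
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("Fred", 33), ("Alice", 20)]
totals = reduce_by_key(data, lambda x, y: x + y)
print(totals)  # {'Alice': 45, 'Bob': 30, 'Charlie': 35, 'Fred': 33}
```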


In [48]:
# sortBy transformation: sort the RDD by age in descending order
sorted_rdd = rdd.sortBy(lambda x: x[1], ascending=False)
result = sorted_rdd.collect()

In [49]:
print(f"Sorted RDD: {result}")

Sorted RDD: [('Charlie', 35), ('Fred', 33), ('Bob', 30), ('Alice', 25), ('Alice', 20)]


## Save RDD to a Text File and Read It Back

In [51]:
# Save action: writes a directory of part files at this path; fails if the path already exists
rdd.saveAsTextFile("data/rdd-lab-output.txt")

In [56]:
# Read: textFile is not an action; it lazily creates an RDD with one string per line
rdd_text = spark.sparkContext.textFile("data/rdd-lab-output.txt")
print(f"First element of text file: {rdd_text.first()}")

First element of text file: ('Charlie', 35)


In [57]:
print(f"All elements of text file: {rdd_text.collect()}")

All elements of text file: ["('Charlie', 35)", "('Fred', 33)", "('Alice', 20)", "('Alice', 25)", "('Bob', 30)"]
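
The lines read back by `textFile` are strings, not tuples, as the quoted elements in the output above show. A minimal sketch of turning them back into tuples, assuming every line is the `repr` of a Python literal as it is here: `ast.literal_eval` from the standard library parses such text safely. The `lines` list below is copied from the output above so the example runs without Spark; the same function could be passed to `rdd_text.map`.

```python
import ast

# Lines as they came back from the text file (copied from the output above).
lines = [
    "('Charlie', 35)",
    "('Fred', 33)",
    "('Alice', 20)",
    "('Alice', 25)",
    "('Bob', 30)",
]

# literal_eval only accepts Python literals, so it does not execute arbitrary code.
records = [ast.literal_eval(line) for line in lines]

print(records[0])                 # ('Charlie', 35)
print(type(records[0]).__name__)  # tuple
```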


## Shutdown

In [58]:
spark.stop()