# RDD

- RDD: Resilient Distributed Dataset

- The main components of the Spark architecture are the _driver_ and the _executors_.

- There is one driver program, and one or more executors run on the cluster's slave machines.

- Spark follows a master-slave architecture.

- PySpark ships with the Standalone Cluster Manager.

- PySpark can also be configured to run on YARN and Apache Mesos.


### Driver:

- Coordinates with many executors running on various slave machines.

- The driver creates the _SparkContext_ object, which is the main entry point to a PySpark application.

- The driver breaks the application into small tasks; a task is the smallest unit of work in the application.

- Tasks are run on different executors in parallel. 

- The driver is also responsible for scheduling tasks to different executors.

### Executors:

- Slave processes.

- Run tasks.

- Can cache data in memory through the BlockManager.

- Each executor runs in its own Java Virtual Machine (JVM)

![](spark-architecture.png)
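
The division of labour between the driver and the executors can be imitated on a single machine. The sketch below is only an analogy in plain Python, not Spark code: a thread pool stands in for the executors, and the helper names `split_into_tasks` and `run_task` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tasks(data, num_tasks):
    """'Driver' side: cut the data into roughly equal chunks, one per task."""
    size = -(-len(data) // num_tasks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_task(chunk):
    """'Executor' side: each task processes only its own chunk."""
    return sum(x * x for x in chunk)

data = list(range(1, 11))
tasks = split_into_tasks(data, num_tasks=3)

# The pool plays the role of the executors; map() hands out one task per chunk.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(run_task, tasks))

# The 'driver' combines the partial results into the final answer.
total = sum(partial_results)
print(tasks)
print(total)
```

In real Spark the executors are separate JVM processes, usually on other machines, so this only mirrors the split, run in parallel, and combine pattern.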

### RDD:

- An RDD is a dataset abstraction over a distributed collection.

- When a node fails, the lost part of an RDD is recomputed.

- Only the part of the data that is needed is calculated or recalculated.

- RDDs are immutable

- Can perform two types of operations on RDD:
    1. Transformation
    2. Action

- Transformation on an RDD returns another RDD.

- Transformations are lazy; actions are evaluated eagerly.

- A transformation only records what should be done. All pending transformations are applied when the first action is called, which is why transformations are called lazy.
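
Lazy evaluation can be seen without Spark. The following is a plain-Python analogy, assuming a generator expression as a stand-in for a transformation and `list()` as a stand-in for an action; the `log` list and the `double` helper exist only for this illustration.

```python
log = []

def double(x):
    log.append(x)  # record that work actually happened
    return 2 * x

numbers = [1, 2, 3]

# "Transformation": builds a lazy generator, nothing is computed yet.
doubled = (double(x) for x in numbers)
calls_before_action = len(log)

# "Action": consuming the generator forces the computation.
result = list(doubled)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

The analogy is not perfect: a generator is exhausted after one pass, whereas an RDD can be evaluated again by every action called on it.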



In [1]:
# import pyspark libraries

import findspark
findspark.init()
import pyspark

from pyspark.sql import SparkSession
from pyspark.sql import functions

spark = SparkSession.builder.appName("TTY").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/25 00:03:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


`parallelize(c, numSlices)`:

- `parallelize()` distributes a local collection across the cluster and returns it as an RDD.

- `c`: the collection to distribute

- `numSlices`: number of partitions (chunks) to split the data into
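
To get a feel for what `numSlices` controls, here is a minimal sketch of one way to cut a list into contiguous chunks. The function `split_into_slices` is hypothetical and the boundaries Spark actually picks may differ; on a real RDD, `rdd.glom().collect()` shows the contents of each partition.

```python
def split_into_slices(collection, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size."""
    n = len(collection)
    return [
        collection[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

values = list(range(1, 22, 2))
slices = split_into_slices(values, 2)
print(slices)
```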

In [2]:
# create an RDD
l = [i for i in range(1,22,2)]

rdd = spark.sparkContext.parallelize(l, 2)
rdd.collect()

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]

In [3]:
rdd.first() # takes first element

1

In [4]:
rdd.take(3)

[1, 3, 5]

In [5]:
# num partitions 

rdd.getNumPartitions()

2

### **Convert Temperature Data**

$$^{\circ}\mathrm{C} = (^{\circ}\mathrm{F} - 32) \times \frac{5}{9}$$

In [6]:
temp_data = [59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]
temp_rdd = spark.sparkContext.parallelize(temp_data, 2)
temp_rdd.collect()

[59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]

In [7]:
def fahrenheitToCentigrade(t):
    """Convert Fahrenheit to Centigrade"""
    return (t-32) * 5/9

temp_centigrade_rdd = temp_rdd.map(fahrenheitToCentigrade)
temp_centigrade_rdd.collect()

[15.0, 14.000000000000002, 12.0, 13.0, 10.999999999999998, 12.0, 13.0]

In [8]:
# keep temperatures of 13 C or higher

filtered_temp_rdd = temp_centigrade_rdd.filter(lambda x: x >= 13)
filtered_temp_rdd.collect()

[15.0, 14.000000000000002, 13.0, 13.0]

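In [None]:
# input rows are assumed, reconstructed from the output below
str_rdd = spark.sparkContext.parallelize([['a', 'b', '2', '3'], ['i', 'j', '1', '4'], ['q', 'r', '6', '8']])
rdd1 = str_rdd.map(lambda x: [x[0], x[1], int(x[2]), int(x[3])])
rdd1.collect()

[['a', 'b', 2, 3], ['i', 'j', 1, 4], ['q', 'r', 6, 8]]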

### **Basic Manipulation**

Working with student grades

In [9]:
students_mark_sheet = [
    ["si1", "year1", 62.08, 62.4],
    ["si1", "year2", 75.94, 76.75],
    ['si2', 'year1', 68.26, 72.95],
    ['si2', 'year2', 85.49, 75.8],
    ['si3', 'year1', 75.08, 79.84],
    ['si3', 'year2', 54.98, 87.72],
    ['si4', 'year1', 50.03, 66.85],
    ['si4', 'year2', 71.26, 69.77],
    ['si5', 'year1', 52.74, 76.27],
    ['si5', 'year2', 50.39, 68.58],
    ['si6', 'year1', 74.86, 60.8],
    ['si6', 'year2', 58.29, 62.38],
    ['si7', 'year1', 63.95, 74.51],
    ['si7', 'year2', 66.69, 56.92]
]

mark_sheet_rdd = spark.sparkContext.parallelize(students_mark_sheet)
mark_sheet_rdd.collect()

[['si1', 'year1', 62.08, 62.4],
 ['si1', 'year2', 75.94, 76.75],
 ['si2', 'year1', 68.26, 72.95],
 ['si2', 'year2', 85.49, 75.8],
 ['si3', 'year1', 75.08, 79.84],
 ['si3', 'year2', 54.98, 87.72],
 ['si4', 'year1', 50.03, 66.85],
 ['si4', 'year2', 71.26, 69.77],
 ['si5', 'year1', 52.74, 76.27],
 ['si5', 'year2', 50.39, 68.58],
 ['si6', 'year1', 74.86, 60.8],
 ['si6', 'year2', 58.29, 62.38],
 ['si7', 'year1', 63.95, 74.51],
 ['si7', 'year2', 66.69, 56.92]]

**`collect()`**:

- `collect()` brings the whole RDD back to the driver.

- If the RDD is very large, the driver may run out of memory.

- `take(n)` fetches only the first n elements.
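
The difference in cost can be sketched in plain Python. This is a simplified model, assuming partitions are ordinary lists and that `take` reads them one at a time until it has enough elements; the functions below are illustrative and are not PySpark's implementation.

```python
def collect(partitions):
    """Gather every element from every partition into one list."""
    return [x for part in partitions for x in part]

def take(partitions, n):
    """Return the first n elements, reading only as many partitions as needed."""
    result, scanned = [], 0
    for part in partitions:
        if len(result) >= n:
            break
        scanned += 1
        result.extend(part[:n - len(result)])
    return result, scanned

partitions = [[1, 3, 5, 7, 9], [11, 13, 15, 17, 19, 21]]
first_three, partitions_scanned = take(partitions, 3)
print(first_three, partitions_scanned)
print(len(collect(partitions)))
```

Here `take` touches a single partition, while `collect` has to move all eleven elements to one place.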

In [10]:
mark_sheet_rdd.take(3)

[['si1', 'year1', 62.08, 62.4],
 ['si1', 'year2', 75.94, 76.75],
 ['si2', 'year1', 68.26, 72.95]]

In [11]:
# Average semester grades

avg_marks = mark_sheet_rdd.map(lambda x: [x[0], x[1], (x[2] + x[3])/2])
avg_marks.collect()

[['si1', 'year1', 62.239999999999995],
 ['si1', 'year2', 76.345],
 ['si2', 'year1', 70.605],
 ['si2', 'year2', 80.645],
 ['si3', 'year1', 77.46000000000001],
 ['si3', 'year2', 71.35],
 ['si4', 'year1', 58.44],
 ['si4', 'year2', 70.515],
 ['si5', 'year1', 64.505],
 ['si5', 'year2', 59.485],
 ['si6', 'year1', 67.83],
 ['si6', 'year2', 60.335],
 ['si7', 'year1', 69.23],
 ['si7', 'year2', 61.805]]

In [12]:
# students' average grades in the second year
second_year_marks = avg_marks.filter(lambda x: "year2" in x)
second_year_marks.collect()

[['si1', 'year2', 76.345],
 ['si2', 'year2', 80.645],
 ['si3', 'year2', 71.35],
 ['si4', 'year2', 70.515],
 ['si5', 'year2', 59.485],
 ['si6', 'year2', 60.335],
 ['si7', 'year2', 61.805]]

In [13]:
# sort second-year averages in descending order

sorted_marks_desc = second_year_marks.sortBy(keyfunc=lambda x: -x[2]) # descending order
sorted_marks_desc.collect()

[['si2', 'year2', 80.645],
 ['si1', 'year2', 76.345],
 ['si3', 'year2', 71.35],
 ['si4', 'year2', 70.515],
 ['si7', 'year2', 61.805],
 ['si6', 'year2', 60.335],
 ['si5', 'year2', 59.485]]

In [14]:
sorted_marks_asc = second_year_marks.sortBy(keyfunc=lambda x: x[2]) # ascending order
sorted_marks_asc.collect()

[['si5', 'year2', 59.485],
 ['si6', 'year2', 60.335],
 ['si7', 'year2', 61.805],
 ['si4', 'year2', 70.515],
 ['si3', 'year2', 71.35],
 ['si1', 'year2', 76.345],
 ['si2', 'year2', 80.645]]

In [15]:
# Top 3 students
sorted_marks_desc.take(3)

[['si2', 'year2', 80.645], ['si1', 'year2', 76.345], ['si3', 'year2', 71.35]]

In [16]:
# takeOrdered func
top_three_students = second_year_marks.takeOrdered(num=3, key=lambda x: -x[2])
top_three_students

[['si2', 'year2', 80.645], ['si1', 'year2', 76.345], ['si3', 'year2', 71.35]]

**`takeOrdered(num, key)`**:

- `takeOrdered` is an action.

- `num`: number of records to return.

- `key`: an optional function that computes the value each record is sorted by; results come back in ascending order of that value.

- Returns a list; because it is an action, no `collect()` is needed.
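
For comparison, the same "smallest `num` records by key" idea exists in Python's standard library as `heapq.nsmallest`. The sketch below uses it on a local copy of the second-year averages; it is a plain-Python analogy and says nothing about how Spark computes the result internally.

```python
import heapq

second_year = [
    ["si1", "year2", 76.345],
    ["si2", "year2", 80.645],
    ["si3", "year2", 71.35],
    ["si4", "year2", 70.515],
    ["si5", "year2", 59.485],
    ["si6", "year2", 60.335],
    ["si7", "year2", 61.805],
]

# Negating the key turns "smallest first" into "largest first".
top_three = heapq.nsmallest(3, second_year, key=lambda row: -row[2])
bottom_three = heapq.nsmallest(3, second_year, key=lambda row: row[2])
print(top_three)
print(bottom_three)
```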

In [17]:
# bottom three students in second year

bottom_three_students = second_year_marks.takeOrdered(num=3, key=lambda x: x[2])
bottom_three_students

[['si5', 'year2', 59.485], ['si6', 'year2', 60.335], ['si7', 'year2', 61.805]]

In [18]:
# students with an average above 80%

more_than_80 = second_year_marks.filter(lambda x: x[2] > 80)
more_than_80.collect()

[['si2', 'year2', 80.645]]

### **Set Operations on RDDs**

In [19]:
data2001 = "RIN1 RIN2 RIN3 RIN4 RIN5 RIN6 RIN7".split() # projects in year 2001
data2002 = "RIN3 RIN4 RIN7 RIN8 RIN9".split() # projects in year 2002
data2003 = "RIN4 RIN8 RIN10 RIN11 RIN12".split() # projects in year 2003

# create RDDs
rdd2001 = spark.sparkContext.parallelize(data2001,2)
rdd2002 = spark.sparkContext.parallelize(data2002, 3)
rdd2003 = spark.sparkContext.parallelize(data2003, 2)


In [20]:
# projects initiated over the three years

all_projects = rdd2001.union(rdd2002).union(rdd2003)
all_projects.collect()

['RIN1',
 'RIN2',
 'RIN3',
 'RIN4',
 'RIN5',
 'RIN6',
 'RIN7',
 'RIN3',
 'RIN4',
 'RIN7',
 'RIN8',
 'RIN9',
 'RIN4',
 'RIN8',
 'RIN10',
 'RIN11',
 'RIN12']

**Set operation**

- In PySpark, set operations on RDDs are *pseudo set operations*: `union()` keeps duplicate elements, so call `distinct()` when true set semantics are needed.
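
The same distinction exists between Python lists and sets. As a plain-Python analogy on the project data, list concatenation behaves like the union of the three RDDs, and a real set union behaves like the union followed by `distinct()`:

```python
data2001 = "RIN1 RIN2 RIN3 RIN4 RIN5 RIN6 RIN7".split()
data2002 = "RIN3 RIN4 RIN7 RIN8 RIN9".split()
data2003 = "RIN4 RIN8 RIN10 RIN11 RIN12".split()

# Concatenation keeps duplicates, like union() on RDDs.
bag_union = data2001 + data2002 + data2003

# A true set union removes them, like union() followed by distinct().
set_union = set(data2001) | set(data2002) | set(data2003)

print(len(bag_union), len(set_union))
```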



In [21]:
all_projects.distinct().count()

12

In [22]:
# or
all_projects = rdd2001.union(rdd2002).union(rdd2003).distinct()
all_projects.collect()

['RIN7',
 'RIN12',
 'RIN2',
 'RIN6',
 'RIN10',
 'RIN11',
 'RIN9',
 'RIN5',
 'RIN1',
 'RIN3',
 'RIN4',
 'RIN8']

In [23]:
# projects completed in the first year: present in 2001 but no longer in 2002
first_year_completion = rdd2001.subtract(rdd2002)
first_year_completion.collect()

['RIN2', 'RIN5', 'RIN1', 'RIN6']

In [24]:
# projects completed in the first two years
first_second_year_completion = rdd2001.union(rdd2002).subtract(rdd2003).distinct()
first_second_year_completion.collect()

['RIN7', 'RIN2', 'RIN6', 'RIN9', 'RIN5', 'RIN1', 'RIN3']

In [25]:
# projects started in 2001 that continued in 2002 but not in 2003

rdd2001.intersection(rdd2002).subtract(rdd2003).distinct().collect()

['RIN7', 'RIN3']

### **Summary Statistics**

- `count()`: number of elements

- Sum: there are two ways to find the sum, `sum()` or `reduce()`.

- Mean: the center point of the data; use `mean()`, or compute it from a `fold()` total and `count()`.

$$\text{mean} = \frac{\sum\limits_{i=1}^n x_i}{n}$$

- Variance: indicates the spread of the data around the mean; `variance()` gives the population variance and `sampleVariance()` the sample variance.

$$\text{variance} = \frac{\sum\limits_{i=1}^n (x_i - \text{mean})^{2}}{n}$$

$$\text{sample variance} = \frac{\sum\limits_{i=1}^n (x_i - \text{mean})^{2}}{n-1}$$

- Standard deviation: the square root of the variance; `stdev()` and `sampleStdev()`.

- Alternatively, `stats()` returns all of these summary statistics at once.
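
The formulas above can be checked by hand on a small list. This is a plain-Python sketch that uses only the standard library and the air velocity readings from this notebook; it applies the two variance formulas directly and cross-checks them against the `statistics` module.

```python
import math
import statistics
from functools import reduce

air_velocity_kmph = [12, 13, 15, 12, 11, 12, 11]
n = len(air_velocity_kmph)

total = reduce(lambda a, b: a + b, air_velocity_kmph)  # same result as sum()
mean = total / n

squared_dev = sum((x - mean) ** 2 for x in air_velocity_kmph)
variance = squared_dev / n               # population variance, divides by n
sample_variance = squared_dev / (n - 1)  # sample variance, divides by n - 1

# Cross-check against the standard library.
assert math.isclose(variance, statistics.pvariance(air_velocity_kmph))
assert math.isclose(sample_variance, statistics.variance(air_velocity_kmph))

print(total, mean)
print(variance, math.sqrt(variance))
print(sample_variance, math.sqrt(sample_variance))
```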


In [26]:
air_velocity_kmph = [12, 13, 15, 12, 11, 12, 11]
air_velocity_rdd = spark.sparkContext.parallelize(air_velocity_kmph)
air_velocity_rdd.collect()

[12, 13, 15, 12, 11, 12, 11]

In [27]:
# total number of data points
air_velocity_rdd.count()

7

In [28]:
# Air velocity in a day: sum
air_velocity_rdd.sum()

86

In [29]:
# mean air velocity:
print(air_velocity_rdd.sum()/air_velocity_rdd.count())

print(air_velocity_rdd.mean())

12.285714285714286
12.285714285714286

In [30]:
# variance 
air_velocity_rdd.variance()

1.63265306122449

In [31]:
# sample variance

air_velocity_rdd.sampleVariance()

1.904761904761905

In [32]:
# standard deviation
air_velocity_rdd.stdev()

1.2777531299998799

In [33]:
# sample standard deviation
air_velocity_rdd.sampleStdev()

1.3801311186847085

In [34]:
# stats
stats_dict = air_velocity_rdd.stats().asDict()  # compute the statistics once
for k, v in stats_dict.items():
    print("{}: \t{}".format(k, v))

count: 	7
mean: 	12.285714285714286
sum: 	86.0
min: 	11.0
max: 	15.0
stdev: 	1.3801311186847085
variance: 	1.904761904761905


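**`stats()`**:

- Returns a `StatCounter` object that holds the count, mean, sum, min, max, variance, and standard deviation.

- Individual statistics can be fetched from it through methods such as `mean()` and `variance()`.

- In the outputs in this notebook, `asDict()` reports the sample variance and sample standard deviation, while `stats().variance()` and `stats().stdev()` return the population values.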
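
Summary statistics like these can be gathered in one pass per partition and then merged, which is what makes them practical on distributed data. The class below is a simplified, hypothetical sketch of that idea in plain Python (a running mean and sum of squared deviations that can be combined); it is not the code behind the object `stats()` returns.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    n: int = 0
    mu: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        """Fold one value into the running statistics."""
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)
        return self

    def merge(self, other):
        """Combine the statistics of two disjoint chunks of data."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mu, self.m2 = other.n, other.mu, other.m2
            return self
        total = self.n + other.n
        delta = other.mu - self.mu
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mu += delta * other.n / total
        self.n = total
        return self

    def variance(self):
        return self.m2 / self.n

    def sample_variance(self):
        return self.m2 / (self.n - 1)

# Two "partitions" of the air velocity data, summarized independently.
part_a, part_b = RunningStats(), RunningStats()
for x in [12, 13, 15]:
    part_a.add(x)
for x in [12, 11, 12, 11]:
    part_b.add(x)

merged = part_a.merge(part_b)
print(merged.n, merged.mu, merged.variance(), merged.sample_variance())
```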

In [35]:
stats = air_velocity_rdd.stats()  # compute the statistics once and reuse them
print("mean: {}".format(stats.mean()))
print("variance: {}".format(stats.variance()))
print("st. deviation: {}".format(stats.stdev()))
print("count: {}".format(stats.count()))
print("min: {}".format(air_velocity_rdd.min()))

mean: 12.285714285714286
variance: 1.63265306122449
st. deviation: 1.2777531299998799
count: 7
min: 11
