<a href="https://colab.research.google.com/github/aayrm5/PySpark/blob/main/Working_with_RDD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Working with RDD (Resilient Distributed Dataset)**

**`Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark`**

**`Author: Amin Karami (PhD, FHEA)`**

---

**Resilient Distributed Dataset (RDD)**: The RDD is the fundamental data structure of Spark. It is a fault-tolerant (resilient), immutable, distributed collection of objects of any type.

source: https://spark.apache.org/docs/latest/rdd-programming-guide.html

source: https://spark.apache.org/docs/latest/api/python/reference/

In [1]:
########## ONLY in Colab ##########
!pip3 install pyspark
########## ONLY in Colab ##########

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=0ca4a79e32f5f62e5ba2cf919acede521c51720daae0a9eff26141b17a0a6131
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [2]:
########## ONLY in Ubuntu Machine ##########
# Load Spark engine
# !pip3 install -q findspark
# import findspark
# findspark.init()
########## ONLY in Ubuntu Machine ##########

In [3]:
# Linking with Spark
from pyspark import SparkContext, SparkConf

In [4]:
# Initializing Spark
conf = SparkConf().setAppName("RDD_practice").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc)

<SparkContext master=local[*] appName=RDD_practice>


In [5]:
sc.defaultParallelism

2

# **Part 1: Create RDDs and Basic Operations**
**There are two ways to create RDDs:**

1.   Parallelizing an existing collection in your driver program
2.   Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

In [6]:
# Generate random data:
import random
randomlist = random.sample(range(0,40), 10)
print(randomlist)

[1, 25, 2, 19, 17, 32, 29, 21, 23, 13]


In [7]:
# Create RDD:
rdd1 = sc.parallelize(randomlist, 4)

In [8]:
rdd1.getNumPartitions()

4

In [9]:
# Data distribution in partitions:
rdd1.glom().collect()

[[1, 25], [2, 19], [17, 32], [29, 21, 23, 13]]

In [10]:
# Print last partition
rdd1.glom().collect()[3]

[29, 21, 23, 13]

In [11]:
# count():
rdd1.count()

10

In [12]:
# first():
rdd1.first()

1

In [13]:
# top():
rdd1.top(5)

[32, 29, 25, 23, 21]

In [14]:
# distinct():
rdd1.distinct().collect()

[32, 1, 25, 17, 29, 21, 13, 2, 19, 23]

In [15]:
# map():
rdd1.map(lambda x: x*2).collect()

[2, 50, 4, 38, 34, 64, 58, 42, 46, 26]

In [16]:
# filter(): 
rdd1.filter(lambda x: x%2==0).collect()

[2, 32]

In [17]:
# flatMap():
rdd_flatmap = rdd1.flatMap(lambda x: [x+2, x+5])
print(rdd_flatmap.collect())
print(rdd_flatmap.reduce(lambda x,y: x+y))

[3, 6, 27, 30, 4, 7, 21, 24, 19, 22, 34, 37, 31, 34, 23, 26, 25, 28, 15, 18]
434


In [18]:
# Descriptive statistics:
print([
       rdd1.max(), rdd1.min(), rdd1.mean(), rdd1.sum(), round(rdd1.stdev(),2), rdd1.top(2)
])

[32, 1, 18.2, 182, 9.86, [32, 29]]


In [19]:
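# mapPartitions(): run a function once per partition (here, sum each partition)
def my_func(partition):
    total = 0  # avoid shadowing the built-in sum()
    for item in partition:
        total += item  # accumulate; the original `sum=+item` only kept the last item
    yield total

rdd_mapPartition = rdd1.mapPartitions(my_func)
rdd_mapPartition.collect()

[26, 21, 49, 86]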

# **Part 2: Advanced RDD Transformations and Actions**

In [None]:
# union():


In [None]:
# intersection():


In [None]:
# Find empty partitions


In [None]:
# coalesce(numPartitions):


In [None]:
# takeSample(withReplacement, num, [seed])


In [None]:
# takeOrdered(n, [ordering])


In [None]:
# reduce():


In [None]:
# reduceByKey():


In [None]:
# sortByKey():


In [None]:
# countByKey()


In [None]:
# groupByKey():


In [None]:
# lookup(key):


In [None]:
# cache:
# By default, each transformed RDD may be recomputed each time you run an action on it.
# However, you may also persist an RDD in memory using the persist (or cache) method,
# in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.


In [None]:
# Persistence (https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)
