In [1]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('RDD_Act1').getOrCreate()

In [2]:
data = [("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),("B", 60)]
inputRDD = spark.sparkContext.parallelize(data)

listRDD = spark.sparkContext.parallelize([1,2,3,4,5,3,2])

# aggregate - action

<p>Syntax (Scala signature; the PySpark call is <code>aggregate(zeroValue, seqOp, combOp)</code>):</p>

<code>
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
     (implicit arg0: ClassTag[U]): U
</code>
     

Because RDDs are partitioned, aggregate() first aggregates the elements within each partition and then combines the per-partition results into the final result. The result type U can differ from the element type T of the RDD.

It takes the following arguments:

zeroValue – Initial value of the accumulator in each partition. It is also the starting value when the partition results are combined, so it should be the neutral element of the operation. Typical choices are 0 for integer sums and an empty list for collections (Nil in Scala).

seqOp – Folds each element of a partition into that partition's running accumulator of type U.

combOp – Merges the accumulators of type U produced by the individual partitions.

In [3]:
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
agg = listRDD.aggregate(0, seqOp, combOp)
print(agg)

20


In [4]:
seqOp2 = (lambda x, y: (x[0] + y, x[1] + 1))
combOp2 = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
agg2 = listRDD.aggregate((0, 0), seqOp2, combOp2)
print(agg2)

(20, 7)
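
The two-step evaluation can be mimicked in plain Python. This is a minimal sketch, not Spark's implementation: `simulate_aggregate` is a hypothetical helper, and the split of the values into three partitions is an assumption made for illustration.

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    # Step 1: fold every partition on its own, starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Step 2: merge the per-partition results, again starting from zero_value.
    return reduce(comb_op, partials, zero_value)

# Assumed partitioning of [1, 2, 3, 4, 5, 3, 2]; Spark may split it differently.
partitions = [[1, 2, 3], [4, 5], [3, 2]]

seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # (running sum, running count)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge two (sum, count) pairs

total, count = simulate_aggregate(partitions, (0, 0), seq_op, comb_op)
print((total, count))   # (20, 7)
print(total / count)    # mean of the values
```

The pair matches the (sum, count) result printed by the cell above, and dividing one by the other gives the mean, which is the usual reason for carrying both values through the aggregation.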


# treeAggregate – action

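treeAggregate() – Aggregates the elements of this RDD in a multi-level tree pattern. It returns the same result as aggregate(); the difference is that the partial results are merged in several levels instead of all at once on the driver, which helps when the RDD has many partitions. The depth argument is the suggested depth of the tree (default 2).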

Syntax: treeAggregate(zeroValue, seqOp, combOp, depth=2)

In [5]:
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
agg = listRDD.treeAggregate(0, seqOp, combOp)
print(agg)

20
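
The idea of merging partial results level by level can be sketched in plain Python. This is only an illustration of the tree pattern, not the algorithm Spark uses internally: `tree_combine`, its `fan_in` parameter, and the per-partition sums below are all hypothetical.

```python
from functools import reduce

def tree_combine(partials, comb_op, fan_in=2):
    # Merge a non-empty list of partial results, fan_in values at a time,
    # until a single value is left. Returns (result, number of levels).
    level = list(partials)
    levels = 0
    while len(level) > 1:
        level = [reduce(comb_op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels

# Hypothetical per-partition sums for the values [1, 2], [3, 4], [5], [3, 2].
partials = [3, 7, 5, 5]

total, levels = tree_combine(partials, lambda x, y: x + y)
print(total, levels)   # 20 2
```

Each level halves the number of values still to be merged, so no single step has to combine every partition's result at once.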


# fold - action

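fold() – Aggregates the elements of each partition, and then the results of all the partitions, using one associative function and a neutral zero value. Unlike aggregate(), it takes a single function, so the result has the same type as the elements.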

In [6]:
from operator import add
foldRes = listRDD.fold(0, add)
print(foldRes)

20
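
The zero value is applied once in every partition and once more when the partition results are merged, so a non-neutral value changes the result depending on the number of partitions. The plain-Python sketch below imitates that behaviour; `simulate_fold` is a hypothetical helper and the three-way split is an assumption.

```python
from functools import reduce
from operator import add

def simulate_fold(partitions, zero_value, op):
    # zero_value seeds every partition and the final merge.
    partials = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, partials, zero_value)

one_partition = [[1, 2, 3, 4, 5, 3, 2]]
three_partitions = [[1, 2, 3], [4, 5], [3, 2]]   # assumed split

neutral_1 = simulate_fold(one_partition, 0, add)
neutral_3 = simulate_fold(three_partitions, 0, add)
skewed_1 = simulate_fold(one_partition, 1, add)
skewed_3 = simulate_fold(three_partitions, 1, add)

print(neutral_1, neutral_3)   # 20 20
print(skewed_1, skewed_3)     # 22 24
```

With 0 the answer is 20 regardless of partitioning; with 1 the extra amount is the number of partitions plus one.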


# reduce

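reduce() – Reduces the elements of the dataset using the specified binary operator, which must be commutative and associative so that the result does not depend on how the data is partitioned.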

In [7]:
redRes = listRDD.reduce(add)
print(redRes)

20
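
Why the operator has to be commutative and associative can be seen by imitating the per-partition evaluation in plain Python. This is a sketch under assumptions: `simulate_reduce` is a hypothetical helper and the partition layouts are made up for illustration.

```python
from functools import reduce
from operator import add, sub

def simulate_reduce(partitions, op):
    # Reduce each non-empty partition, then reduce the partial results.
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

one_partition = [[1, 2, 3, 4, 5, 3, 2]]
three_partitions = [[1, 2, 3], [4, 5], [3, 2]]   # assumed split

add_1 = simulate_reduce(one_partition, add)
add_3 = simulate_reduce(three_partitions, add)
sub_1 = simulate_reduce(one_partition, sub)
sub_3 = simulate_reduce(three_partitions, sub)

print(add_1, add_3)   # 20 20
print(sub_1, sub_3)   # -18 -4
```

Addition gives the same answer for both layouts, while subtraction, which is neither commutative nor associative, does not.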



collect() – Return the complete dataset as a list (an Array in Scala).

count() – Return the count of elements in the dataset.

countApprox() – Return an approximate count of the elements in the dataset. If the timeout is reached before the job finishes, the result may be incomplete.

countApproxDistinct() – Return an approximate number of distinct elements in the dataset.

countByValue() – Return a dictionary (Map[T, Long] in Scala) whose keys are the unique values in the dataset and whose values are the number of times each value occurs.


first() – Return the first element in the dataset.

top() – Return the n largest elements from the dataset, in descending order.

Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.
    
min() – Return the minimum value from the dataset.

max() – Return the maximum value from the dataset.

take() – Return the first num elements of the dataset.

takeOrdered() – Return the num smallest elements from the dataset in ascending order (or ordered by an optional key function). This is the opposite of the top() action.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.
    
takeSample() – Return a fixed-size random sample of the dataset as a list.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.
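
The difference between take(), top(), takeOrdered() and countByValue() is easy to confuse, so here is a plain-Python analogy on a local list holding the same values as listRDD. It is only an analogy and does not run on Spark; the comments name the RDD call each line is meant to mirror.

```python
import heapq
from collections import Counter

values = [1, 2, 3, 4, 5, 3, 2]   # same values as listRDD

first_three = values[:3]                      # like listRDD.take(3)
largest_three = heapq.nlargest(3, values)     # like listRDD.top(3)
smallest_three = heapq.nsmallest(3, values)   # like listRDD.takeOrdered(3)
counts = dict(Counter(values))                # like listRDD.countByValue()

print(first_three)      # [1, 2, 3]
print(largest_three)    # [5, 4, 3]
print(smallest_three)   # [1, 2, 2]
print(counts)           # {1: 1, 2: 2, 3: 2, 4: 1, 5: 1}
```

take() follows the order in which the elements are stored, while top() and takeOrdered() sort by value first.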

# GET THE NUMBER OF PARTITIONS

In [9]:
listRDD.getNumPartitions()

8

In [10]:
# For DataFrames, go through the underlying RDD:
# df.rdd.getNumPartitions()