In [1]:
# %pip freeze > requirements.txt 

# SparkContext and RDD basics



Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

### Import libraries

In [2]:
# %pip install numpy

In [3]:
from pyspark import SparkContext
import numpy as np

## Initialize a `SparkContext` (the main abstraction to the cluster)
**Note the `4` in the argument: `local[4]` runs Spark locally with 4 worker threads (ideally one per core) for this `SparkContext` object.**

In [4]:
# sc.stop() # stop the current SparkContext 

In [5]:
sc=SparkContext(master="local[4]")  

22/10/10 12:10:15 WARN Utils: Your hostname, AMRIT resolves to a loopback address: 127.0.1.1; using 172.27.198.74 instead (on interface eth0)
22/10/10 12:10:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/10 12:10:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/10 12:10:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/10/10 12:10:18 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/10/10 12:10:18 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [6]:
print(sc) #  print the SparkContext object

<SparkContext master=local[4] appName=pyspark-shell>


### Generate a list of random integers

In [7]:
lst=np.random.randint(0,10,20) # draw 20 random integers from 0 to 9 (the upper bound 10 is exclusive)

In [8]:
print(lst)

[2 3 5 8 5 1 3 4 7 1 4 3 1 6 2 7 9 7 2 9]


### Parallelize the list - this is the main operation toward distributed computing

In [9]:
A=sc.parallelize(lst) #parallelize the list to create an RDD object 

### What did we just do? We created an RDD. What is an RDD?
![](https://i.stack.imgur.com/cwrMN.png)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a **fault-tolerant collection of elements that can be operated on in parallel**. SparkContext manages the distributed data over the worker nodes through the cluster manager. 

There are two ways to create RDDs: 
* parallelizing an existing collection in your driver program, or 
* referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

We created an RDD using the former approach.

### `A` is a PySpark RDD object; we cannot access its elements directly

In [10]:
type(A)

pyspark.rdd.RDD

In [11]:
A

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

### The opposite of parallelization: `collect` brings all the distributed elements back to the head node (the driver). <br><br>Note: this is a slow process on large data, so use it sparingly.

In [12]:
A.collect() # collect the RDD into a Python list; this is slow because it moves all the data from the cluster to the driver node (head node)

[2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]

### How were the partitions created? Use the `glom` method

In [13]:
A.glom().collect()  #glom() is used to collect the elements of each partition into a list

[[2, 3, 5, 8, 5], [1, 3, 4, 7, 1], [4, 3, 1, 6, 2], [7, 9, 7, 2, 9]]

### Now stop the SC and reinitialize it with 2 cores and see what happens when you repeat the process!

In [14]:
sc.stop()

In [15]:
sc=SparkContext(master="local[2]")

22/10/10 12:10:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/10/10 12:10:22 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/10/10 12:10:22 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [16]:
A = sc.parallelize(lst)

In [17]:
A.glom().collect()

# with local[2] the default parallelism is 2, so the elements are distributed over 2 partitions

[[2, 3, 5, 8, 5, 1, 3, 4, 7, 1], [4, 3, 1, 6, 2, 7, 9, 7, 2, 9]]

**The RDD is now distributed over two chunks, not four!** 
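The two `glom` outputs above are consistent with the list being cut into contiguous, near-equal slices, one per partition. The following is a plain-Python sketch of that idea (no Spark needed); `slice_evenly` is a hypothetical helper written for illustration, and Spark's actual slicing may differ in detail.

```python
def slice_evenly(items, num_slices):
    """Split items into num_slices contiguous chunks of near-equal size."""
    items = list(items)
    n = len(items)
    # Chunk i covers indices [i*n // num_slices, (i+1)*n // num_slices).
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

lst = [2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
four_chunks = slice_evenly(lst, 4)  # mimics local[4]
two_chunks = slice_evenly(lst, 2)   # mimics local[2]
print(four_chunks)
print(two_chunks)
```

In PySpark itself, the number of partitions can also be requested explicitly through the `numSlices` argument of `sc.parallelize`, independently of the number of worker threads.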

So, let's redo the process with 4 cores again.

In [18]:
sc.stop()

In [19]:
sc = SparkContext(master="local[4]")

22/10/10 12:10:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/10/10 12:10:24 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/10/10 12:10:24 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [20]:
A = sc.parallelize(lst)

## Basic operations
### `count` the elements

In [21]:
A.count() #count the number of elements in the RDD object 

20

### The first element (`first`) and the first few elements (`take`)

In [22]:
A.first() #return the first element in the RDD object       

2

In [23]:
A.take(4) # return the first 4 elements of the RDD, similar to head() in pandas

[2, 3, 5, 8]

### Removing duplicates: Get another RDD with only the `distinct` elements

The method `RDD.distinct()` returns a new RDD that contains the distinct elements of the source RDD.

**NOTE**: This operation requires a **shuffle** in order to detect duplication across partitions. **So, it is a slow operation.**

In [24]:
A_distinct=A.distinct() # since RDD is immutable, we need to assign the distinct RDD object to a new variable

In [25]:
A_distinct.collect()

[8, 4, 5, 1, 9, 2, 6, 3, 7]
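To see why `distinct` needs a shuffle, here is a plain-Python emulation (not Spark's implementation). Duplicates of a value can sit in different input partitions, so every element is first routed to an output bucket chosen by its hash; only then can each bucket drop its own duplicates. The function name `distinct_with_shuffle` is hypothetical.

```python
def distinct_with_shuffle(partitions, num_output_partitions):
    """Emulate distinct(): hash-route every element, then dedupe per bucket."""
    buckets = [set() for _ in range(num_output_partitions)]
    for part in partitions:
        for x in part:
            # Shuffle step: equal values always land in the same bucket.
            buckets[hash(x) % num_output_partitions].add(x)
    return [sorted(b) for b in buckets]

partitions = [[2, 3, 5, 8, 5], [1, 3, 4, 7, 1], [4, 3, 1, 6, 2], [7, 9, 7, 2, 9]]
deduped = distinct_with_shuffle(partitions, 4)
print(deduped)
```

The shuffle is the expensive part: on a real cluster, routing elements to buckets means moving data between worker nodes.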

### To sum all the elements, use the `reduce` method

In [26]:
A.reduce(lambda x,y:x+y) # reduce the RDD to a single value with a binary function, e.g. ((((1+2)+3)+4)+...) within a partition; the partial results of the partitions are then combined the same way

89
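Because each partition is reduced separately and the partial results are then combined, the function passed to `reduce` should be commutative and associative. The sketch below emulates this two-stage reduction in plain Python (it is an illustration, not Spark's code; `partitioned_reduce` is a hypothetical name) and shows that addition is safe while subtraction depends on how the data is partitioned.

```python
from functools import reduce

def partitioned_reduce(partitions, op):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

one_partition = [[1, 2, 3, 4]]
two_partitions = [[1, 2], [3, 4]]

add_results = (partitioned_reduce(one_partition, add),
               partitioned_reduce(two_partitions, add))
sub_results = (partitioned_reduce(one_partition, sub),
               partitioned_reduce(two_partitions, sub))
print(add_results)  # same value for both layouts
print(sub_results)  # value changes with the layout
```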

### Or use the `sum` method directly

In [27]:
A.sum()

89

### Or using the `fold` method, which aggregates the elements of each partition, and then the results for all the partitions

In [28]:
A.fold(0,lambda x,y:x+y) # fold is similar to reduce, but it takes an initial (zero) value, here 0, as the first argument

89
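The zero value passed to `fold` should be neutral for the operation (0 for addition). A plain-Python emulation, assuming the zero value is used once per partition and once more when the partial results are merged, shows what happens otherwise. `partitioned_fold` is a hypothetical helper, not a Spark function.

```python
from functools import reduce

def partitioned_fold(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

def add(x, y):
    return x + y

partitions = [[2, 3, 5, 8, 5], [1, 3, 4, 7, 1], [4, 3, 1, 6, 2], [7, 9, 7, 2, 9]]
sum_with_zero = partitioned_fold(partitions, 0, add)
sum_with_one = partitioned_fold(partitions, 1, add)
print(sum_with_zero)  # neutral zero value: the plain sum
print(sum_with_one)   # the 1 is added once per partition plus once in the merge
```

Under this assumption, a non-neutral zero value makes the result depend on the number of partitions, which is rarely what you want.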

### Finding maximum element by `reduce`

In [29]:
A.reduce(lambda x,y: x if x > y else y) # find the maximum element in the RDD
# Within one partition, e.g. [0, 4, 8, 1, 5, 6, 2, 7, 3], the running maximum x is compared with each new element y:
# (0, 4) -> 4, (4, 8) -> 8, (8, 1) -> 8, (8, 5) -> 8, (8, 6) -> 8, (8, 2) -> 8, (8, 7) -> 8, (8, 3) -> 8
# The larger value is carried forward as x until the last element; the per-partition maxima are then compared the same way

9

### Finding longest word using `reduce`

In [30]:
words = 'These are some of the best Macintosh computers ever'.split(' ') #split the string into a list of words
wordRDD = sc.parallelize(words) #parallelize the list to create an RDD object
wordRDD.reduce(lambda w,v: w if len(w)>len(v) else v)  # as above: the lengths of the first two words are compared, the winner is compared with the next word, and so on

'computers'

## Functions/filtering over RDD
### Use `filter` to return a new RDD with elements satisfying a given predicate (lambda expression)

In [31]:
# Return RDD with elements divisible by 3
A.filter(lambda x:x%3==0).collect() # here we use collect() to form list for those element divisible by 3( after filtering)

[3, 3, 3, 6, 9, 9]

### Lambda functions are short and sweet but we can write regular Python functions to use with `reduce`

In [32]:
def largerThan(x,y):
    """
    Returns the longer of two words; on a tie, returns the word that sorts first
    """
    if len(x)> len(y):
        return x
    elif len(y) > len(x):
        return y
    else: # if the lengths are equal, return the lexicographically smaller word
        if x < y: return x 
        else: return y

    # Comparing lengths (len(x) > len(y)) is not the same as comparing strings (x > y)
    # String comparison is lexicographic and uppercase letters sort before lowercase ones,
    # so 'Macintosh' wins a tie against 'computers'

In [33]:
largerThan('apple','banana')

'banana'

In [34]:
wordRDD.reduce(largerThan)

'Macintosh'
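The lambda version earlier returned `'computers'` while `largerThan` returns `'Macintosh'`; the difference is only in how ties are broken. The sketch below reproduces both results with `functools.reduce` on a plain list, which applies the function strictly left to right. The names `keep_later_on_tie` and `keep_smaller_on_tie` are hypothetical.

```python
from functools import reduce

def keep_later_on_tie(w, v):
    # Same rule as the lambda: on equal lengths the second argument wins.
    return w if len(w) > len(v) else v

def keep_smaller_on_tie(x, y):
    # Same rule as largerThan: on equal lengths the lexicographically smaller wins.
    if len(x) != len(y):
        return x if len(x) > len(y) else y
    return min(x, y)

words = 'These are some of the best Macintosh computers ever'.split(' ')
lambda_style = reduce(keep_later_on_tie, words)
function_style = reduce(keep_smaller_on_tie, words)
print(lambda_style, function_style)
```

A tie rule that depends only on the two values, and not on which one comes first, keeps the function commutative, so the result does not depend on how the words are partitioned.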

## Sampling an RDD
* RDDs are often very large.
* **Aggregates, such as averages, can be approximated efficiently by using a sample.** This often comes in handy for operations on extremely large datasets, where a sample can tell a lot about the pattern and descriptive statistics of the data.
* Sampling is done in parallel and requires limited computation.

The method `RDD.sample(withReplacement,p)` generates a sample of the elements of the RDD, where
- `withReplacement` is a boolean flag indicating whether or not an element in the RDD can be sampled more than once.
- `p` (named `fraction` in the PySpark API) is the probability of accepting each element into the sample. Because each element is accepted independently, and sampling runs independently in each partition, the number of elements in the sample changes from sample to sample.
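Sampling without replacement can be pictured as one independent coin flip per element. The following plain-Python sketch emulates that idea (it is not Spark's sampler; `bernoulli_sample` and the seeds are assumptions made for illustration) and shows why the sample size varies even though the expected size is fixed.

```python
import random

def bernoulli_sample(items, p, seed):
    """Keep each item independently with probability p."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < p]

lst = [2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
p = 5 / 20  # expected sample size is len(lst) * p = 5
samples = [bernoulli_sample(lst, p, seed) for seed in range(4)]
for s in samples:
    print(len(s), s)
```

Passing a fixed seed makes a sample reproducible; `RDD.sample` accepts an optional `seed` argument for the same purpose.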

In [35]:
# get a sample whose expected size is m
# Note that the size of the sample is different in different runs
m=5
n=20
print('sample1=',A.sample(False,m/n).collect()) 
print('sample2=',A.sample(False,m/n).collect())
print('sample3=',A.sample(False,m/n).collect())
print('sample4=',A.sample(False,m/n).collect())

sample1= [4, 6, 2, 9]
sample2= [1, 9]
sample3= [2, 7, 7]
sample4= [5, 1, 4]


### Things to note and think about
* Each time you run the previous cell, you get a different sample, and therefore a different estimate
* The accuracy of the estimate is determined by the expected size of the sample, $n \cdot p$. Here, probability $p=\frac{m}{n}$
* See how the error changes as you vary $p$
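One way to explore the last point without a cluster is to repeat the sampling many times in plain Python and average the absolute error of the sample mean. This is only a sketch under the same coin-flip assumption as before; `sample_mean_error`, the number of trials, and the seed are all arbitrary choices.

```python
import random
from statistics import mean

def sample_mean_error(items, p, trials, seed):
    """Average absolute error of the sample mean over repeated samples."""
    rng = random.Random(seed)
    true_mean = mean(items)
    errors = []
    for _ in range(trials):
        sample = [x for x in items if rng.random() < p]
        if sample:  # skip empty samples, which have no mean
            errors.append(abs(mean(sample) - true_mean))
    return mean(errors) if errors else float('nan')

lst = [2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
for p in (0.1, 0.25, 0.5, 1.0):
    print(p, round(sample_mean_error(lst, p, trials=200, seed=0), 3))
```

With `p = 1.0` every element is kept, so the error is zero; smaller values of `p` trade accuracy for less data.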

## Basic statistics

In [36]:
print("Maximum: ",A.max())
print("Minimum: ",A.min())
print("Mean (average): ",A.mean())
print("Standard deviation: ",A.stdev())

Maximum:  9
Minimum:  1
Mean (average):  4.45
Standard deviation:  2.6167728216259047


In [37]:
A.stats()

(count: 20, mean: 4.45, stdev: 2.6167728216259047, max: 9.0, min: 1.0)
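The standard deviation printed above matches the population formula (dividing by $n$), which can be checked with the standard library on the same 20 values. This is a cross-check in plain Python, not a Spark call.

```python
from statistics import mean, pstdev, stdev

lst = [2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
population_sd = pstdev(lst)  # divides by n
sample_sd = stdev(lst)       # divides by n - 1, so it is slightly larger
print(mean(lst), population_sd, sample_sd)
```

If the sample (n - 1) version is needed on an RDD, PySpark provides `RDD.sampleStdev()`.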

## Mapping
### `map` operation with _lambda_ function

In [38]:
B=A.map(lambda x:x*x)

In [39]:
B.collect()

[4, 9, 25, 64, 25, 1, 9, 16, 49, 1, 16, 9, 1, 36, 4, 49, 81, 49, 4, 81]

### `map` operation with regular Python function

In [40]:
def square_if_odd(x):
    if x%2==1:
        return x*x
    else:
        return x

In [41]:
A.map(square_if_odd).collect()

[2, 9, 25, 8, 25, 1, 9, 4, 49, 1, 4, 9, 1, 6, 2, 49, 81, 49, 2, 81]

### The `flatMap` method returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results

In [42]:
A.flatMap(lambda x:(x,x*x)).collect()  # here result of operation  is also flattened to a single list 

[2,
 4,
 3,
 9,
 5,
 25,
 8,
 64,
 5,
 25,
 1,
 1,
 3,
 9,
 4,
 16,
 7,
 49,
 1,
 1,
 4,
 16,
 3,
 9,
 1,
 1,
 6,
 36,
 2,
 4,
 7,
 49,
 9,
 81,
 7,
 49,
 2,
 4,
 9,
 81]
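The difference between `map` and `flatMap` is easiest to see side by side. The plain-Python sketch below mirrors the two behaviours on a short list, using a list comprehension for the mapping and `itertools.chain` for the flattening; it illustrates the semantics only and does not use Spark.

```python
from itertools import chain

data = [2, 3, 5]

# map-like: one output item (here a tuple) per input element
mapped = [(x, x * x) for x in data]

# flatMap-like: the per-element results are concatenated into one flat list
flat = list(chain.from_iterable(mapped))

print(mapped)
print(flat)
```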

## Grouping and binning
### `groupBy` returns an RDD of grouped elements (iterable) as per a given group operation (function)

In [52]:
result=A.groupBy(lambda x:x%2).collect()
print(A.collect())
print(sorted(result[0][1]))
sorted([(x, sorted(y)) for (x, y) in result])

[2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
[2, 2, 2, 4, 4, 6, 8]


[(0, [2, 2, 2, 4, 4, 6, 8]), (1, [1, 1, 1, 3, 3, 3, 5, 5, 7, 7, 7, 9, 9])]
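Conceptually, `groupBy` computes a key for every element and collects the elements that share a key. A plain-Python equivalent with `defaultdict` (an illustration only; `group_by` is a hypothetical helper) reproduces the result above.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Collect items into lists keyed by key_func(item)."""
    groups = defaultdict(list)
    for x in items:
        groups[key_func(x)].append(x)
    return dict(groups)

lst = [2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
grouped = group_by(lst, lambda x: x % 2)  # key 0 = even, key 1 = odd
result = sorted((k, sorted(v)) for k, v in grouped.items())
print(result)
```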

### The `histogram` method takes a list of bin/bucket boundaries and returns a tuple with the boundaries and the number of elements in each bucket

In [44]:
A.histogram([x for x in range(0,100,10)])

([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], [20, 0, 0, 0, 0, 0, 0, 0, 0])
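All 20 values fall into the first bucket because they lie between 0 and 9. The sketch below emulates the bucket counting in plain Python, assuming each bucket is closed on the left and open on the right except the last, which is closed on both sides. `bucket_counts` is a hypothetical helper, and the finer boundaries `[0, 5, 10]` are just an example of a more informative choice for this data.

```python
from bisect import bisect_right

def bucket_counts(items, buckets):
    """Count items per bucket: [a, b) for all buckets except the last, which is [a, b]."""
    counts = [0] * (len(buckets) - 1)
    for x in items:
        if x < buckets[0] or x > buckets[-1]:
            continue  # values outside the overall range are ignored
        # Find the bucket whose left edge is <= x; clamp so that x equal to
        # the last boundary falls into the final (closed) bucket.
        i = min(bisect_right(buckets, x) - 1, len(counts) - 1)
        counts[i] += 1
    return counts

lst = [2, 3, 5, 8, 5, 1, 3, 4, 7, 1, 4, 3, 1, 6, 2, 7, 9, 7, 2, 9]
coarse = bucket_counts(lst, list(range(0, 100, 10)))
fine = bucket_counts(lst, [0, 5, 10])
print(coarse)
print(fine)
```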

## Set operations
### Create smaller RDDs to demonstrate operations on two RDDs

In [45]:
lst1=np.random.randint(0,10,3)
C=sc.parallelize(lst1)
lst2=np.random.randint(10,20,3)
D=sc.parallelize(lst2)
print("C:",C.collect())
print("D:",D.collect())

C: [2, 1, 0]
D: [16, 11, 15]


### `C+D` gives the union of the two RDDs (a concatenation that keeps duplicates, unlike a set union), not the element-wise sum

In [46]:
(C+D).collect()

[2, 1, 0, 16, 11, 15]
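Since `C` and `D` above share no values, the output cannot show how duplicates are handled. The plain-Python sketch below uses a hypothetical second list that overlaps with the first to contrast an RDD-style union (concatenation) with a true set union.

```python
c = [2, 1, 0]
d = [0, 2, 15]  # hypothetical values chosen to overlap with c

rdd_style_union = c + d               # concatenation: duplicates are kept
set_union = sorted(set(c) | set(d))   # set union: duplicates are removed

print(rdd_style_union)
print(set_union)
```

On real RDDs, chaining `.distinct()` after the union gives set-like behaviour, at the cost of a shuffle.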

### `cartesian` gives the Cartesian product: all pairs (as tuples) with one element from each RDD

In [47]:
C.cartesian(D).collect()

[(2, 16),
 (2, 11),
 (2, 15),
 (1, 16),
 (1, 11),
 (1, 15),
 (0, 16),
 (0, 11),
 (0, 15)]

### The `intersection` and `subtract` methods return an RDD of the set intersection and subtraction (difference)

In [48]:
rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
rdd1.intersection(rdd2).collect()

[1, 2, 3]

In [49]:
rdd1.subtract(rdd2).collect()

[10, 4, 5]
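The two results above can be reproduced on plain lists. This sketch assumes that `intersection` removes duplicates from its output while `subtract` keeps every occurrence of a surviving value; the helper names are hypothetical, and the order of a real Spark result is not guaranteed.

```python
def intersection(a, b):
    """Values present in both inputs, without duplicates (sorted for display)."""
    return sorted(set(a) & set(b))

def subtract(a, b):
    """Every element of a whose value does not occur in b."""
    b_values = set(b)
    return [x for x in a if x not in b_values]

rdd1 = [1, 10, 2, 3, 4, 5]
rdd2 = [1, 6, 2, 3, 7, 8]
common = intersection(rdd1, rdd2)
only_in_first = subtract(rdd1, rdd2)
print(common)
print(only_in_first)
```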

### Stop the `SparkContext` at the end

In [53]:
sc.stop()