<H3> Parallelize & Chaining </H3>

<b> Parallelize </b> converts a local Python collection into an RDD. It is useful for turning a small sample of data into an RDD so that the logic and transformations can be tested without running them against a huge file (say, a 10 TB file).

<b> Chaining </b>: We can chain Spark function calls together to write concise code. It is a good idea to separate the transformations from the actions so that more transformations can be added later if needed.

In [1]:
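from pyspark.sql import SparkSession
import getpass

# Current OS user, used to build a per-user warehouse directory.
username = getpass.getuser()

# Port '0' lets Spark pick a free port for the UI.
# master('yarn') assumes a YARN cluster is available.
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    master('yarn'). \
    getOrCreate()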

In [4]:
words = ("big","Data","Is","SUPER","Interesting","BIG","data","IS","A","Trending","technology")

In [5]:
words

('big',
 'Data',
 'Is',
 'SUPER',
 'Interesting',
 'BIG',
 'data',
 'IS',
 'A',
 'Trending',
 'technology')

In [6]:
words_rdd = spark.sparkContext.parallelize(words)

In [8]:
words_rdd.collect()

['big',
 'Data',
 'Is',
 'SUPER',
 'Interesting',
 'BIG',
 'data',
 'IS',
 'A',
 'Trending',
 'technology']

In [9]:
words_lower_rdd = words_rdd.map(lambda x: x.lower())

In [10]:
words_lower_rdd.collect()

['big',
 'data',
 'is',
 'super',
 'interesting',
 'big',
 'data',
 'is',
 'a',
 'trending',
 'technology']

In [11]:
mapped_words = words_lower_rdd.map(lambda x:(x,1))

In [12]:
word_count = mapped_words.reduceByKey(lambda x,y:x+y)

In [13]:
word_count.collect()

[('is', 2),
 ('super', 1),
 ('interesting', 1),
 ('trending', 1),
 ('technology', 1),
 ('big', 2),
 ('data', 2),
 ('a', 1)]

<H3> Chaining </H3>

In [14]:
result = spark. \
sparkContext. \
parallelize(words). \
map(lambda x:x.lower()). \
map(lambda x:(x,1)). \
reduceByKey(lambda x,y:x+y)

In [15]:
result.collect()

[('is', 2),
 ('trending', 1),
 ('technology', 1),
 ('super', 1),
 ('interesting', 1),
 ('big', 2),
 ('data', 2),
 ('a', 1)]
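
The two `collect()` outputs above contain the same pairs in a different order, because `reduceByKey` does not guarantee any ordering of its results. When testing logic on sample data, it can help to compare against a sorted, locally computed result. The following is a minimal sketch in plain Python, assuming the same `words` tuple as above; the names `expected` and `expected_pairs` are introduced here for illustration only and are not part of the notebook.

```python
from collections import Counter

# Same sample data as the `words` tuple used in the notebook.
words = ("big", "Data", "Is", "SUPER", "Interesting", "BIG",
         "data", "IS", "A", "Trending", "technology")

# Local equivalent of map(lower) -> map((x, 1)) -> reduceByKey(add).
expected = Counter(word.lower() for word in words)

# Sort the pairs so the comparison does not depend on result order.
expected_pairs = sorted(expected.items())
print(expected_pairs)

# With a live Spark session, the check could then be written as:
# assert sorted(result.collect()) == expected_pairs
```

Sorting both sides before comparing keeps the check stable across runs, since the order returned by `collect()` can vary with partitioning.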