# RDD Transformations with Examples

**First, create an RDD by reading a text file.**

In [0]:
# sc is the SparkContext that the Databricks notebook provides
rdd = sc.textFile('dbfs:/FileStore/test.txt')

**flatMap** – flatMap() applies a function to each record and then flattens the results into a new RDD. In the example below, it splits each record on spaces and flattens the resulting lists, so each record of the new RDD holds a single word.

In [0]:
# Split each line on single spaces and flatten the per-line lists into one RDD of words
rdd2 = rdd.flatMap(lambda x: x.split(" "))
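To make the flattening step concrete, here is a minimal plain-Python sketch that needs no Spark cluster. The sample `lines` are hypothetical stand-ins for records of the text file, and `itertools.chain` only imitates what flatMap() does; it is not how Spark implements it.

```python
from itertools import chain

# Hypothetical sample lines standing in for records of the text file.
lines = ["Adventures in Wonderland", "by Lewis Carroll"]

# map-style: one output element per input line, so we get a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-style: the per-line lists are concatenated into one flat sequence.
flat = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat)

# split(" ") keeps empty strings when spaces repeat; split() does not.
with_gaps = "a  b".split(" ")
without_gaps = "a  b".split()
print(with_gaps, without_gaps)
```

If the input file may contain repeated spaces, splitting with `x.split()` instead of `x.split(" ")` would likely avoid counting empty strings as words.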

**map** – map() applies a function to every record, for example to add or update a field. Its output always has the same number of records as its input.

In our word count example, map pairs each word with the value 1. The result is an RDD of key-value tuples, with the word (a string) as the key and 1 (an integer) as the value.

In [0]:
# Pair every word with an initial count of 1
rdd3 = rdd2.map(lambda x: (x, 1))
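The one-to-one nature of map() can be sketched in plain Python with a list comprehension. The `words` list here is a hypothetical sample, not the notebook's actual data.

```python
# Hypothetical sample of words, as flatMap would produce them.
words = ["This", "eBook", "is", "for", "the", "use"]

# map-style: every input word yields exactly one (word, 1) tuple.
pairs = [(word, 1) for word in words]

print(pairs)
print(len(words) == len(pairs))
```

In PySpark the result of such a map is an ordinary RDD of tuples, and key-based methods such as reduceByKey() can be called on it directly.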

In [0]:
rdd3.collect()

Out[5]: [('Project', 1),
 ('Gutenberg’s', 1),
 ('Alice’s', 1),
 ('Adventures', 1),
 ('in', 1),
 ('Wonderland', 1),
 ('by', 1),
 ('Lewis', 1),
 ('Carroll', 1),
 ('This', 1),
 ('eBook', 1),
 ('is', 1),
 ('for', 1),
 ('the', 1),
 ('use', 1),
 ('of', 1),
 ('anyone', 1),
 ('anywhere', 1),
 ('at', 1),
 ('no', 1),
 ('cost', 1),
 ('and', 1),
 ('with', 1),
 ('Alice’s', 1),
 ('Adventures', 1),
 ('in', 1),
 ('Wonderland', 1),
 ('by', 1),
 ('Lewis', 1),
 ('Carroll', 1),
 ('This', 1),
 ('eBook', 1),
 ('is', 1),
 ('for', 1),
 ('the', 1),
 ('use', 1),
 ('of', 1),
 ('anyone', 1),
 ('anywhere', 1),
 ('at', 1),
 ('no', 1),
 ('cost', 1),
 ('and', 1),
 ('with', 1),
 ('This', 1),
 ('eBook', 1),
 ('is', 1),
 ('for', 1),
 ('the', 1),
 ('use', 1),
 ('of', 1),
 ('anyone', 1),
 ('anywhere', 1),
 ('at', 1),
 ('no', 1),
 ('cost', 1),
 ('and', 1),
 ('with', 1),
 ('Project', 1),
 ('Gutenberg’s', 1),
 ('Alice’s', 1),
 ('Adventures', 1),
 ('in', 1),
 ('Wonderland', 1),
 ('by', 1),
 ('Lewis', 1),
 ('Carroll', 1),
 ... (output truncated)

**reduceByKey** – reduceByKey() merges the values for each key using the specified function. In our example, it sums the 1s attached to each word, so the resulting RDD contains each unique word together with its count.

In [0]:
# Sum the 1s for each word to get its total count
rdd4 = rdd3.reduceByKey(lambda a, b: a + b)

In [0]:
rdd4.collect()

Out[7]: [('Project', 9),
 ('Gutenberg’s', 9),
 ('Alice’s', 18),
 ('in', 18),
 ('Lewis', 18),
 ('Carroll', 18),
 ('is', 27),
 ('use', 27),
 ('of', 27),
 ('anyone', 27),
 ('anywhere', 27),
 ('at', 27),
 ('no', 27),
 ('Adventures', 18),
 ('Wonderland', 18),
 ('by', 18),
 ('This', 27),
 ('eBook', 27),
 ('for', 27),
 ('the', 27),
 ('cost', 27),
 ('and', 27),
 ('with', 27)]
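The per-key merge can be sketched in plain Python as a fold over a dictionary. The helper `reduce_by_key` and the sample `pairs` are hypothetical and exist only for illustration; Spark performs the merge within partitions and then across them, which is why the function passed to reduceByKey() should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key with a binary function (hypothetical helper)."""
    merged = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are merged in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical sample of (word, 1) pairs.
pairs = [("in", 1), ("by", 1), ("in", 1), ("Lewis", 1), ("in", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)
```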

**sortByKey** – sortByKey() sorts the elements of a key-value RDD by key. In our example, we first use map to swap each (word, count) tuple into (count, word), then apply sortByKey so that the records are sorted by the integer count. Finally, collect() with print outputs every count and its word as a key-value pair.

In [0]:
rdd5 = rdd4.map(lambda x: (x[1], x[0])).sortByKey()

# Print rdd5 result to console
print(rdd5.collect())

[(9, 'Project'), (9, 'Gutenberg’s'), (18, 'Alice’s'), (18, 'in'), (18, 'Lewis'), (18, 'Carroll'), (18, 'Adventures'), (18, 'Wonderland'), (18, 'by'), (27, 'is'), (27, 'use'), (27, 'of'), (27, 'anyone'), (27, 'anywhere'), (27, 'at'), (27, 'no'), (27, 'This'), (27, 'eBook'), (27, 'for'), (27, 'the'), (27, 'cost'), (27, 'and'), (27, 'with')]
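The swap-then-sort idea can be sketched in plain Python with `sorted`. The `counts` list is a hypothetical sample, and the sketch only imitates the ordering on a single machine, not Spark's distributed sort.

```python
# Hypothetical (word, count) pairs, as reduceByKey would produce them.
counts = [("Project", 9), ("is", 27), ("in", 18)]

# Swap each tuple so that the count becomes the key.
swapped = [(count, word) for word, count in counts]

# Sort on the key only, lowest count first.
ascending = sorted(swapped, key=lambda kv: kv[0])

# Highest count first, comparable to sortByKey(ascending=False) in PySpark.
descending = sorted(swapped, key=lambda kv: kv[0], reverse=True)

print(ascending)
print(descending)
```

For a word count, the descending order is often the more useful one, since it puts the most frequent words first.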


**filter** – filter() keeps only the records for which the given function returns True. In our example, we keep the records whose word contains the substring “an”.

In [0]:
rdd6 = rdd5.filter(lambda x: 'an' in x[1])

print(rdd6.collect())

[(18, 'Wonderland'), (27, 'anyone'), (27, 'anywhere'), (27, 'and')]
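Note that the lambda in the cell above tests substring membership (`'an' in x[1]`), which is why 'Wonderland' appears in the output. The plain-Python sketch below contrasts that test with a prefix test; the `records` list is a hypothetical sample rather than the full RDD.

```python
# Hypothetical (count, word) records, shaped like the sorted RDD.
records = [(18, "Wonderland"), (18, "Alice"), (27, "anyone"), (27, "at"), (27, "and")]

# Substring test: keeps any word that contains "an" anywhere.
contains_an = [r for r in records if "an" in r[1]]

# Prefix test: keeps only words that begin with a lowercase "a".
starts_with_a = [r for r in records if r[1].startswith("a")]

print(contains_an)
print(starts_with_a)
```

Both tests are case-sensitive, so "Alice" is excluded by the prefix test as written.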
