<a id='installing-spark'></a>
### Installing Spark

Install dependencies:


1.   Java 8
2.   Apache Spark with Hadoop
3.   Findspark (used to locate the Spark installation on the system)


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

Set Environment Variables:

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [None]:
!ls

sample_data  spark-3.1.1-bin-hadoop3.2	spark-3.1.1-bin-hadoop3.2.tgz


In [None]:
import findspark
findspark.init()  # locates Spark via SPARK_HOME and makes pyspark importable
from pyspark import SparkContext
sc = SparkContext()

In [None]:
sc = SparkContext.getOrCreate()  # reuses the running context instead of creating a second one

In [None]:
wordsList = ["abc", "apple", "apple", "orange", "watermelon", "seed", "apple_seed", "history", "happy"]
wordsRDD = sc.parallelize(wordsList)
wordsRDD.getNumPartitions()

2

In [None]:
print(wordsRDD.collect())

['abc', 'apple', 'apple', 'orange', 'watermelon', 'seed', 'apple_seed', 'history', 'happy']


In [None]:
wordsRDD = sc.parallelize(wordsList,4)
wordsRDD.getNumPartitions()

4

Is there a way to check which elements are in each partition? glom() gathers the elements of each partition into a list:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.glom.html

In [None]:
wordsRDD.glom().collect()

[['abc', 'apple'],
 ['apple', 'orange'],
 ['watermelon', 'seed'],
 ['apple_seed', 'history', 'happy']]

In [None]:
a = sc.parallelize(range(10), 5)
a.glom().collect()

[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
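
The glom() outputs above are consistent with contiguous, near-equal slicing of the input. The following plain-Python sketch models that rule without Spark. It is an assumption about the resulting layout, not PySpark's actual implementation (which also batches elements for serialization), and `slice_into_partitions` is a hypothetical helper name.

```python
def slice_into_partitions(items, num_slices):
    """Split items into num_slices contiguous chunks (simplified model)."""
    items = list(items)
    n = len(items)
    # Chunk i covers the half-open index range [i*n//k, (i+1)*n//k).
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


words = ["abc", "apple", "apple", "orange", "watermelon",
         "seed", "apple_seed", "history", "happy"]
word_parts = slice_into_partitions(words, 4)
range_parts = slice_into_partitions(range(10), 5)

print(word_parts)
print(range_parts)
```

With 9 words and 4 slices the boundaries fall at indices 0, 2, 4, 6, and 9, so the last chunk receives the extra element, as in the glom() output above.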

Can we repartition an RDD? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.repartition.html

In [None]:
a.repartition(2).glom().collect()

[[0, 1, 4, 5, 6, 7], [2, 3, 8, 9]]

In [None]:
a.repartition(2).getNumPartitions()

2

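The repartition above involved a shuffle, which is why the elements are no longer in contiguous order. coalesce(), used below, reduces the number of partitions without a full shuffle by default.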

In [None]:
a = sc.parallelize(range(12), 4)
a.glom().collect()

[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

In [None]:
a.coalesce(2).glom().collect()

[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
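
The coalesce(2) result above keeps elements in their original order because whole partitions are merged rather than redistributed. A plain-Python sketch of that idea follows. `coalesce_sim` is a hypothetical helper, and grouping neighbouring partitions is an assumption that matches the output shown; Spark's own coalescer also considers data locality, so its grouping can differ.

```python
def coalesce_sim(partitions, num_partitions):
    """Merge contiguous groups of partitions; elements are never re-hashed."""
    k = len(partitions)
    # Without a shuffle, asking for more partitions than exist changes nothing.
    if num_partitions >= k:
        return [list(p) for p in partitions]
    merged = []
    for i in range(num_partitions):
        group = partitions[i * k // num_partitions:(i + 1) * k // num_partitions]
        merged.append([x for part in group for x in part])
    return merged


parts = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
coalesced = coalesce_sim(parts, 2)
print(coalesced)
```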

In [None]:
wordsRDD.collect()

['abc',
 'apple',
 'apple',
 'orange',
 'watermelon',
 'seed',
 'apple_seed',
 'history',
 'happy']

In [None]:
rdd1 = wordsRDD.map(lambda x: (x, x[0]))
rdd1.collect()

[('abc', 'a'),
 ('apple', 'a'),
 ('apple', 'a'),
 ('orange', 'o'),
 ('watermelon', 'w'),
 ('seed', 's'),
 ('apple_seed', 'a'),
 ('history', 'h'),
 ('happy', 'h')]

In [None]:
rdd2 = rdd1.map(lambda x: (x[1], x[0]))
rdd2.collect()

[('a', 'abc'),
 ('a', 'apple'),
 ('a', 'apple'),
 ('o', 'orange'),
 ('w', 'watermelon'),
 ('s', 'seed'),
 ('a', 'apple_seed'),
 ('h', 'history'),
 ('h', 'happy')]

In [None]:
a = 'apple'
b = 'orange'
a + b

'appleorange'

In [None]:
rdd2.collect()

[('a', 'abc'),
 ('a', 'apple'),
 ('a', 'apple'),
 ('o', 'orange'),
 ('w', 'watermelon'),
 ('s', 'seed'),
 ('a', 'apple_seed'),
 ('h', 'history'),
 ('h', 'happy')]

In [None]:
rdd3 = rdd2.map(lambda x: (x[0], (x[1], 1)))
rdd3.collect()

[('a', ('abc', 1)),
 ('a', ('apple', 1)),
 ('a', ('apple', 1)),
 ('o', ('orange', 1)),
 ('w', ('watermelon', 1)),
 ('s', ('seed', 1)),
 ('a', ('apple_seed', 1)),
 ('h', ('history', 1)),
 ('h', ('happy', 1))]

In [None]:
# Keep the first word seen for each key and sum the counts
rdd3.reduceByKey(lambda a, b: (a[0], a[1] + b[1])).collect()

[('s', ('seed', 1)),
 ('a', ('abc', 4)),
 ('w', ('watermelon', 1)),
 ('h', ('history', 2)),
 ('o', ('orange', 1))]

In [None]:
a = 'apple'
a[0]

'a'

In [None]:
wordsRDD.map(lambda x: (x, 1)).collect()

[('abc', 1),
 ('apple', 1),
 ('apple', 1),
 ('orange', 1),
 ('watermelon', 1),
 ('seed', 1),
 ('apple_seed', 1),
 ('history', 1),
 ('happy', 1)]

In [None]:
wordPairs = wordsRDD.map(lambda x:x)
wordPairs.collect()

['abc',
 'apple',
 'apple',
 'orange',
 'watermelon',
 'seed',
 'apple_seed',
 'history',
 'happy']

In [None]:
wordPairs = wordsRDD.map(lambda x:(x,1))
wordPairs.collect()

[('abc', 1),
 ('apple', 1),
 ('apple', 1),
 ('orange', 1),
 ('watermelon', 1),
 ('seed', 1),
 ('apple_seed', 1),
 ('history', 1),
 ('happy', 1)]

In [None]:
wordsCount = wordPairs.reduceByKey(lambda a,b:a +b)
wordsCount.collect()

[('orange', 1),
 ('watermelon', 1),
 ('apple_seed', 1),
 ('history', 1),
 ('abc', 1),
 ('seed', 1),
 ('apple', 2),
 ('happy', 1)]
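
To see what reduceByKey does with the (word, 1) pairs above, here is a plain-Python model of its semantics. `reduce_by_key_sim` is a hypothetical helper and processes pairs in a single pass; Spark merges values within each partition first and then across partitions, which gives the same result when the function is associative and commutative.

```python
def reduce_by_key_sim(pairs, func):
    """Fold all values that share a key into one value using func."""
    merged = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; func is only
        # called once a second value for the same key shows up.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


words = ["abc", "apple", "apple", "orange", "watermelon",
         "seed", "apple_seed", "history", "happy"]
word_pairs = [(w, 1) for w in words]
counts = dict(reduce_by_key_sim(word_pairs, lambda a, b: a + b))
print(counts)
```

Note that a key with a single value never passes through the function at all, which matters whenever the function is expected to change the value's type.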

In [None]:
rdd2.collect()

[('a', 'abc'),
 ('a', 'apple'),
 ('a', 'apple'),
 ('o', 'orange'),
 ('w', 'watermelon'),
 ('s', 'seed'),
 ('a', 'apple_seed'),
 ('h', 'history'),
 ('h', 'happy')]

In [None]:
wordsGrouped = rdd2.groupByKey()
wordsGrouped.collect()

[('s', <pyspark.resultiterable.ResultIterable at 0x79b26823b7f0>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x79b26823af20>),
 ('w', <pyspark.resultiterable.ResultIterable at 0x79b26823ad70>),
 ('h', <pyspark.resultiterable.ResultIterable at 0x79b26823b310>),
 ('o', <pyspark.resultiterable.ResultIterable at 0x79b26823bb80>)]

In [None]:
rdd4 = wordsGrouped.mapValues(list)
rdd4.collect()

[('s', ['seed']),
 ('a', ['abc', 'apple', 'apple', 'apple_seed']),
 ('w', ['watermelon']),
 ('h', ['history', 'happy']),
 ('o', ['orange'])]

In [None]:
rdd5 = rdd4.map(lambda x: (x[0], len(x[1])))
rdd5.collect()

[('s', 1), ('a', 4), ('w', 1), ('h', 2), ('o', 1)]
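
The groupByKey, mapValues(list), and len steps above can be modelled in plain Python as follows. `group_by_key_sim` is a hypothetical helper that ignores partitioning; it only illustrates that grouping collects every value for a key before anything is counted.

```python
from collections import defaultdict


def group_by_key_sim(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


words = ["abc", "apple", "apple", "orange", "watermelon",
         "seed", "apple_seed", "history", "happy"]
pairs = [(w[0], w) for w in words]            # (first letter, word)
groups = group_by_key_sim(pairs)
letter_counts = {k: len(v) for k, v in groups.items()}

print(groups)
print(letter_counts)
```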

In [None]:
rdd2.collect()

[('a', 'abc'),
 ('a', 'apple'),
 ('a', 'apple'),
 ('o', 'orange'),
 ('w', 'watermelon'),
 ('s', 'seed'),
 ('a', 'apple_seed'),
 ('h', 'history'),
 ('h', 'happy')]

In [None]:
rdd2prime = rdd2.map(lambda x: (x[0], x[1]))
rdd2prime.collect()

[('a', 'abc'),
 ('a', 'apple'),
 ('a', 'apple'),
 ('o', 'orange'),
 ('w', 'watermelon'),
 ('s', 'seed'),
 ('a', 'apple_seed'),
 ('h', 'history'),
 ('h', 'happy')]

In [None]:
rdd2prime = rdd2.map(lambda x: (x[0], [x[1]]))  # wrap each value in a list so that + concatenates lists
rdd2prime.collect()

[('a', ['abc']),
 ('a', ['apple']),
 ('a', ['apple']),
 ('o', ['orange']),
 ('w', ['watermelon']),
 ('s', ['seed']),
 ('a', ['apple_seed']),
 ('h', ['history']),
 ('h', ['happy'])]

In [None]:
a = ['apple']
b = ['orange']
a+b

['apple', 'orange']

In [None]:
rdd2prime.reduceByKey(lambda a,b: a+b).collect()

[('s', ['seed']),
 ('a', ['abc', 'apple', 'apple', 'apple_seed']),
 ('w', ['watermelon']),
 ('h', ['history', 'happy']),
 ('o', ['orange'])]

In [None]:
rdd3 = rdd2prime.reduceByKey(lambda a,b: a+b)
rdd3.mapValues(lambda a: len(a)).collect()

[('s', 1), ('a', 4), ('w', 1), ('h', 2), ('o', 1)]

In [None]:
rdd3 = rdd2prime.reduceByKey(lambda a,b: a+b)
rdd3.mapValues(len).collect()

[('s', 1), ('a', 4), ('w', 1), ('h', 2), ('o', 1)]

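Can we check the grouped values?

In [None]:
wordsGrouped = wordPairs.groupByKey()  # regroup the (word, 1) pairs; the output below is keyed by word
wordsGrouped.mapValues(list).collect()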

[('orange', [1]),
 ('watermelon', [1]),
 ('apple_seed', [1]),
 ('history', [1]),
 ('abc', [1]),
 ('seed', [1]),
 ('apple', [1, 1]),
 ('happy', [1])]

In [None]:
wordCountsGrouped = wordsGrouped.map(lambda args: (args[0], len(args[1])))

In [None]:
wordCountsGrouped.collect()

[('orange', 1),
 ('watermelon', 1),
 ('apple_seed', 1),
 ('history', 1),
 ('abc', 1),
 ('seed', 1),
 ('apple', 2),
 ('happy', 1)]

In [None]:
def startWith(word):
  # Return the first character of the word (avoids shadowing the built-in str)
  return word[0]

In [None]:
startWith("apple")

'a'

In [None]:
wordPairs = wordsRDD.map(lambda x:(x,1))

Write code that converts each word into a key-value pair made of the word and its first character.

In [None]:
wordPairs2 = wordsRDD.map(lambda x:(x,startWith(x)))

In [None]:
wordPairs2.collect()

[('abc', 'a'),
 ('apple', 'a'),
 ('apple', 'a'),
 ('orange', 'o'),
 ('watermelon', 'w'),
 ('seed', 's'),
 ('apple_seed', 'a'),
 ('history', 'h'),
 ('happy', 'h')]

Now swap the key and the value, so that ('abc', 'a') becomes ('a', 'abc').

In [None]:
wordPairs3 = wordPairs2.map(lambda x: (x[1], x[0]))
wordPairs3.collect()

[('a', 'abc'),
 ('a', 'apple'),
 ('a', 'apple'),
 ('o', 'orange'),
 ('w', 'watermelon'),
 ('s', 'seed'),
 ('a', 'apple_seed'),
 ('h', 'history'),
 ('h', 'happy')]

Let's try to collect all the words that start with the same character: ('a', ['abc', 'apple', ...])

In [None]:
wordPairs3.reduceByKey(lambda a,b: a+b).collect()

[('s', 'seed'),
 ('a', 'abcappleappleapple_seed'),
 ('w', 'watermelon'),
 ('h', 'historyhappy'),
 ('o', 'orange')]

In [None]:
wordPairs3.groupByKey().collect()

[('s', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c9e110>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c9e590>),
 ('w', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c9dc60>),
 ('h', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c9f970>),
 ('o', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c9d990>)]

In [None]:
data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
rdd = sc.parallelize(data)
result = rdd.groupByKey().collect()

How about using + with lists?

In [None]:
a = [1,2,3]
b = [4,5,6]
a+b

[1, 2, 3, 4, 5, 6]

In [None]:
wordPairs3.reduceByKey(lambda a,b: [a]+[b]).collect()

[('s', 'seed'),
 ('a', [[['abc', 'apple'], 'apple'], 'apple_seed']),
 ('w', 'watermelon'),
 ('h', ['history', 'happy']),
 ('o', 'orange')]
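
The nested result above comes from the reducer returning a different type (a list) than it receives (a string): each later call wraps the previous list in another list. The plain-Python sketch below reproduces this with functools.reduce. Folding left to right is an assumption that happens to match the output shown; in Spark the order depends on how the data is partitioned.

```python
from functools import reduce

a_words = ["abc", "apple", "apple", "apple_seed"]

# Reducer takes strings but returns a list, so every step adds a level of nesting.
nested = reduce(lambda a, b: [a] + [b], a_words)

# Wrapping each value in a list first keeps input and output types the same.
flat = reduce(lambda a, b: a + b, [[w] for w in a_words])

print(nested)
print(flat)
```

Keys with only one value, such as 's', never reach the reducer, which is why they stay plain strings in the output above.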

In [None]:
wordPairs3.combineByKey(lambda v:[v],lambda x,y:x+[y],lambda x,y:x+y).collect()

[('s', ['seed']),
 ('a', ['abc', 'apple', 'apple', 'apple_seed']),
 ('w', ['watermelon']),
 ('h', ['history', 'happy']),
 ('o', ['orange'])]

combineByKey
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.combineByKey.html
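
The three functions passed to combineByKey above play different roles: the first turns the first value seen for a key into a combiner, the second adds another value to a combiner within a partition, and the third merges combiners from different partitions. A plain-Python model of these two stages follows. `combine_by_key_sim` is a hypothetical helper, and the partition layout is taken from the earlier glom() output as an assumption.

```python
def combine_by_key_sim(partitions, create_combiner, merge_value, merge_combiners):
    # Stage 1: build one combiner per key inside each partition.
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)
    # Stage 2: merge combiners for the same key across partitions.
    result = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            if key in result:
                result[key] = merge_combiners(result[key], comb)
            else:
                result[key] = comb
    return result


partitions = [
    [("a", "abc"), ("a", "apple")],
    [("a", "apple"), ("o", "orange")],
    [("w", "watermelon"), ("s", "seed")],
    [("a", "apple_seed"), ("h", "history"), ("h", "happy")],
]
grouped = combine_by_key_sim(
    partitions,
    lambda v: [v],            # createCombiner: value -> one-element list
    lambda acc, v: acc + [v], # mergeValue: append a value to a list
    lambda x, y: x + y,       # mergeCombiners: concatenate two lists
)
print(grouped)
```

Because the combiner is created explicitly, even a key with a single value ends up as a list, unlike the reduceByKey attempt with [a]+[b].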

Is there a simpler way? Wrap each value in a list first, then concatenate the lists:

In [None]:
wordPairs3.map(lambda x: (x[0], [x[1]])).reduceByKey(lambda p,q: p+q).collect()

[('s', ['seed']),
 ('a', ['abc', 'apple', 'apple', 'apple_seed']),
 ('w', ['watermelon']),
 ('h', ['history', 'happy']),
 ('o', ['orange'])]

Is there yet another way?

In [None]:
wordPairs3.groupByKey().collect()

[('s', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c2fac0>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c2e7d0>),
 ('w', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c2ee60>),
 ('h', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c2e4a0>),
 ('o', <pyspark.resultiterable.ResultIterable at 0x7a6ff9c2fb20>)]

In [None]:
wordPairs3.groupByKey().mapValues(list).collect()

[('s', ['seed']),
 ('a', ['abc', 'apple', 'apple', 'apple_seed']),
 ('w', ['watermelon']),
 ('h', ['history', 'happy']),
 ('o', ['orange'])]

In [None]:
wordPairs4 = wordPairs3.map(lambda x: (x[0], [x[1]]))