# Spark Preparation
We first check whether we are running in Google Colab. If so, we install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.5.5 with Hadoop 3, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out directly from the Colab notebook.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [1]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [11]:
if IN_COLAB:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
    !tar xf spark-3.5.5-bin-hadoop3.tgz
    !mv spark-3.5.5-bin-hadoop3 spark
    !pip install -q findspark
    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark"

# Start a Local Cluster
Call findspark.init() so that Python can locate the Spark installation, then start a local cluster. If you plan to use a remote cluster, skip findspark.init() and change spark_url accordingly.

In [12]:
import findspark
findspark.init()

In [13]:
spark_url = 'local'
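
The value of spark_url follows Spark's master URL format. A few common forms are sketched below; the host name `spark-master.example.com` is a hypothetical placeholder and does not come from this notebook.

```python
# Common Spark master URL forms (pick one and assign it to spark_url).
master_urls = {
    'single thread': 'local',      # one worker thread, as used in this notebook
    'four threads': 'local[4]',    # four worker threads on this machine
    'all cores': 'local[*]',       # one worker thread per logical core
    # Hypothetical standalone cluster; 7077 is the default standalone master port.
    'standalone cluster': 'spark://spark-master.example.com:7077',
}

spark_url = master_urls['single thread']
print(spark_url)
```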

In [14]:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession.builder\
        .master(spark_url)\
        .appName('Spark Tutorial')\
        .config('spark.ui.port', '4040')\
        .getOrCreate()

# Spark Entry Points

In [15]:
sc = spark.sparkContext

In [16]:
spark

In [17]:
sc

## Simple RDD Operations

There are two types of RDD operations: transformations and actions. A transformation is an operation applied to an RDD to create a new RDD (or one that creates a new RDD from data). An action is an operation applied to an RDD that performs a computation and sends the result back to the driver.

### Transformation Operations
- *sc.parallelize(data)*
create an RDD from data
- *rdd.filter(func)*
create a new rdd from an existing rdd, keeping only the elements for which func returns true

### Action Operations
- *rdd.count()*
count number of elements in an rdd
- *rdd.first()*
get the first element in the rdd
- *rdd.collect()*
gather all elements in the rdd into a python list
- *rdd.take(n)*
gather the first n elements in the rdd into a python list
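
In Spark, transformations are lazy: they only describe a new RDD, and nothing is computed until an action asks for a result. The sketch below models that behaviour in plain Python, without Spark, using the built-in lazy `filter` as a stand-in for *rdd.filter* and `islice` as a stand-in for *rdd.take*. The names `is_large` and `evaluated` are illustrative only.

```python
from itertools import islice

data = [1, 2, 3, 4, 5]
evaluated = []  # records which elements the predicate has actually seen

def is_large(d):
    evaluated.append(d)
    return d > 2

# "Transformation": builds a lazy pipeline; the predicate has not run yet.
filtered = filter(is_large, data)
seen_before_action = list(evaluated)

# "Action": asking for results triggers the computation.
first_two = list(islice(filtered, 2))

print(seen_before_action)  # []
print(first_two)           # [3, 4]
print(evaluated)           # [1, 2, 3, 4]
```

Only as many elements as needed are evaluated: the element 5 is never inspected because two results were already found.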

In [18]:
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

In [19]:
rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289

In [20]:
n = rdd.count()
print('count = {0}'.format(n))

count = 5


In [21]:
rdd.first()

1

In [22]:
l = rdd.collect()
print(l)

[1, 2, 3, 4, 5]


In [23]:
l = rdd.take(3)
print(l)

[1, 2, 3]


In [27]:
f_rdd = rdd.filter(lambda d: d > 2)

In [25]:
f_rdd.collect()

[3, 4, 5]

In [28]:
f_rdd.count()

3

## RDD Operations - map and reduce

- *rdd.map(func)* -- **transformation** --
create a new rdd by performing function func on each element in an rdd
- *rdd.reduce(func)* -- **action** --
aggregate all elements in an rdd using function func

Both operations apply a function to the RDD elements. The function can be given as a lambda function, and any lambda can be supplied as long as it has the right shape: for map, the function must take one input and return one output; for reduce, it must take two inputs and return one output.

In [29]:
data = ['line 1', '2', 'more lines', 'last line']

In [30]:
lines = sc.parallelize(data)

In [31]:
print(lines)

ParallelCollectionRDD[7] at readRDDFromFile at PythonRDD.scala:289


In [32]:
print(lines.collect())

['line 1', '2', 'more lines', 'last line']


Count the length of each line in the RDD and store results in a new RDD

In [33]:
lineLengths = lines.map(lambda line: len(line))
print(lineLengths.collect())

[6, 1, 10, 9]


Sum the lengths of the lines in the RDD.  Because the RDD is partitioned, this reduce operation runs in parallel.

In [34]:
totalLength = lineLengths.reduce(lambda a, b: a+b)
print(totalLength)

26
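
Because each partition is reduced separately and the partial results are then combined, the function passed to reduce should be associative and commutative. The sketch below models this in plain Python without Spark. The helper `partitioned_reduce` and the way the data is split into partitions are illustrative assumptions, not Spark's actual implementation.

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

lengths = [6, 1, 10, 9]
one_partition = [lengths]
two_partitions = [lengths[:2], lengths[2:]]

add = lambda a, b: a + b
sub = lambda a, b: a - b

# Addition is associative and commutative: the partitioning does not matter.
add_results = (partitioned_reduce(add, one_partition),
               partitioned_reduce(add, two_partitions))

# Subtraction is neither: the result depends on how the data was split.
sub_results = (partitioned_reduce(sub, one_partition),
               partitioned_reduce(sub, two_partitions))

print(add_results)  # (26, 26)
print(sub_results)  # (-14, 4)
```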


In [35]:
data = (1,2,3,4)
rdd = sc.parallelize(data)
rdd2 = rdd.map(lambda x: x*2)
print(rdd2.collect())
sum_val = rdd2.reduce(lambda a, b: a+b)
print('sum = {0}'.format(sum_val))
mul_val = rdd2.reduce(lambda a, b: a*b)
print('mul = {0}'.format(mul_val))

[2, 4, 6, 8]
sum = 20
mul = 384


## RDD Operations - aggregate

Aggregate is an action, *rdd.aggregate(zeroValue, seqOp, combOp)*, that:
- applies *seqOp* to fold the RDD elements into an accumulator that starts from *zeroValue* -- this basically transforms the elements in the RDD into the type of the output value
- and then merges the accumulated values using *combOp*

Note that reduce is a simple form of the aggregate operation.
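
The two steps can be modeled in plain Python, without Spark. This is a minimal sketch: the function `aggregate` and the split of the data into two partitions are assumptions made for illustration, and a real RDD decides its own partitioning.

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    """Plain-Python model of rdd.aggregate over a list of partitions."""
    # Step 1: fold every partition into an accumulator starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Step 2: merge the per-partition accumulators, again starting from zero_value.
    return reduce(comb_op, partials, zero_value)

partitions = [[1, 2], [3, 4]]  # hypothetical split of [1, 2, 3, 4]

sum_and_count = aggregate(
    partitions,
    (0, 0),
    lambda acc, e: (acc[0] + e, acc[1] + 1),   # seqOp: accumulator + element
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp: accumulator + accumulator
)
print(sum_and_count)  # (10, 4)
```

Note that *seqOp* receives an accumulator and an element, while *combOp* receives two accumulators. This is why the output type may differ from the element type.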

In [36]:
rdd.collect()

[1, 2, 3, 4]

The following aggregate operation is basically *rdd.reduce(lambda a, b: a+b)*, as the output value is an integer, the same type as the RDD elements.

In [37]:
rdd.aggregate(0,
              lambda zero, e: zero+e,
              lambda a, b: a+b)

10

In [38]:
rdd.aggregate(0,
              lambda zero, e: zero+1,
              lambda a, b: a+b)

4

The following aggregate operation returns an ordered pair (x, y) where
- x is the sum of all elements in RDD
- y is the count of all elements in RDD

In [39]:
rdd.aggregate((0, 0),
              lambda zero, e: (zero[0]+e, zero[1]+1),
              lambda a, b: (a[0]+b[0], a[1]+b[1]))

(10, 4)

In [40]:
lines.collect()

['line 1', '2', 'more lines', 'last line']

The following aggregate operation returns an ordered pair (x, y) where
- x is the concatenation of all elements in RDD
- y is the sum of the length of all elements in RDD

In [42]:
lines.aggregate(("", 0),
                lambda zero, e: (zero[0]+e, zero[1]+len(e)),
                lambda a, b: (a[0]+b[0], a[1]+b[1]))

('line 12more lineslast line', 26)

In [43]:
lines.collect()

['line 1', '2', 'more lines', 'last line']

In [44]:
lines.reduce(lambda s1, s2: s1+s2)

'line 12more lineslast line'

# Example: Word Count

Word Count is the "Hello World" of big data programming.  In this example, we count the number of occurrences of each word in a text file, "star-wars.txt".  Note that the code is not perfect: it does not yet properly handle punctuation, plural nouns, or verbs in the past tense.

Before running this example, make sure that the data file 'star-wars.txt' has been uploaded to the content folder of this Colab.
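
One possible way to reduce the punctuation problem mentioned above is to strip punctuation from each word before counting. The helper `clean_word` below is a hypothetical sketch, not part of the original notebook. It only removes leading and trailing ASCII punctuation and does nothing about plurals or verb tenses.

```python
import string

def clean_word(word):
    """Lowercase a word and strip ASCII punctuation from both ends."""
    return word.lower().strip(string.punctuation)

raw_words = ['ago,', 'far,', 'away...', '!!', "don't", 'Star']
cleaned = [clean_word(w) for w in raw_words]
# Drop tokens that consisted only of punctuation.
cleaned = [w for w in cleaned if w]
print(cleaned)  # ['ago', 'far', 'away', "don't", 'star']
```

On an RDD of words, the same helper could be applied with *map* followed by a *filter* that drops empty strings.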

### Reading from "star-wars.txt"

First, read the content of the file using sc.textFile().  This creates an rdd whose elements are lines in the input file.

In [45]:
sw = sc.textFile('star-wars.txt')
for line in sw.take(10):
    print('{0}: [{1}]'.format(len(line), line))

0: []
35: [                          STAR WARS]
41: [                    !! PUBLIC  VERSION !!]
2: [  ]
49: [          �A long time ago, in a galaxy far, far ]
18: [          away...�]
0: []
55: [A vast sea of stars serves as the backdrop for the main]
55: [title.  War drums echo through the heavens as a rollup ]
28: [slowly crawls into infinity.]


In [46]:
print('Total = {0} lines'.format(sw.count()))

Total = 7518 lines


### Data Cleansing

Remove all empty lines and convert every line to lowercase. Note that the filter only drops lines of length zero; lines containing only whitespace are kept.

In [47]:
nb_lines = sw.filter(lambda line: len(line) > 0)
all_lowers = nb_lines.map(lambda line: line.lower())

In [49]:
print('Non blank line = {0} lines'.format(nb_lines.count()))
for line in all_lowers.take(10):
    print('{0}: [{1}]'.format(len(line), line))

Non blank line = 4754 lines
35: [                          star wars]
41: [                    !! public  version !!]
2: [  ]
49: [          �a long time ago, in a galaxy far, far ]
18: [          away...�]
55: [a vast sea of stars serves as the backdrop for the main]
55: [title.  war drums echo through the heavens as a rollup ]
28: [slowly crawls into infinity.]
47: [          �it is a period of civil war.  rebel ]
45: [          spaceships, striking from a hidden ]
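
The output above still contains a line made up of two spaces, because `len(line) > 0` only rejects completely empty lines. A minimal sketch of a stricter test, shown on a plain Python list of made-up sample lines rather than on the RDD:

```python
sample_lines = ['', '   STAR WARS', '  ', 'A vast sea of stars']

# str.strip() removes surrounding whitespace, so a whitespace-only line
# becomes '' and its stripped length is zero.
non_blank = [line for line in sample_lines if len(line.strip()) > 0]
print(non_blank)  # ['   STAR WARS', 'A vast sea of stars']
```

The same predicate, `lambda line: len(line.strip()) > 0`, could be passed to *filter* on the RDD.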


### Data Preparation - from lines to words

We can split each line into words.  Note that if we use *map*, each element of the output RDD is a list of the words in one line.  However, if we use *flatMap*, the lists from all lines are flattened into a single RDD of words.

For a clearer understanding, let's take a look at the results of map vs. flatMap below:

In [50]:
words_map = all_lowers.map(lambda line: line.split())
for l in words_map.take(5):
    print(l)

['star', 'wars']
['!!', 'public', 'version', '!!']
[]
['�a', 'long', 'time', 'ago,', 'in', 'a', 'galaxy', 'far,', 'far']
['away...�']


In [51]:
words = all_lowers.flatMap(lambda line: line.split())
for w in words.take(10):
    print(w)

star
wars
!!
public
version
!!
�a
long
time
ago,
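
The difference between *map* and *flatMap* can also be modeled in plain Python, without Spark. In this sketch the sample lines are made up, a list comprehension stands in for *map*, and `itertools.chain.from_iterable` stands in for the flattening step of *flatMap*.

```python
from itertools import chain

sample_lines = ['star wars', '!! public  version !!', '  ']

# map: one output element per input line, so we get a list of lists.
mapped = [line.split() for line in sample_lines]

# flatMap: the per-line lists are concatenated into one flat sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in sample_lines))

print(mapped)       # [['star', 'wars'], ['!!', 'public', 'version', '!!'], []]
print(flat_mapped)  # ['star', 'wars', '!!', 'public', 'version', '!!']
```

The whitespace-only line yields an empty list under *map* but contributes nothing under *flatMap*.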


### Counting Occurrences

To count the occurrences of each word, we first transform each word into a (key, value) pair of (word, 1).

In [52]:
words.map(lambda word: (word, 1)).take(5)

[('star', 1), ('wars', 1), ('!!', 1), ('public', 1), ('version', 1)]

After this transformation, we can count the occurrences using *reduceByKey*, which performs reduce(function) over all elements with the same key.

In [53]:
mappers = words.map(lambda word: (word, 1))
counts = mappers.reduceByKey(lambda x, y: x+y)
for wc in counts.take(10):
    print(wc)

('star', 211)
('wars', 1)
('!!', 2)
('public', 1)
('version', 1)
('�a', 1)
('long', 31)
('time', 16)
('ago,', 1)
('in', 396)
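
What *reduceByKey* does can be modeled in plain Python, without Spark. The function `reduce_by_key` and the sample words below are illustrative assumptions. The real operation works per partition and shuffles data between partitions, which this sketch ignores.

```python
def reduce_by_key(pairs, func):
    """Plain-Python model of reduceByKey: combine values that share a key."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Key seen before: merge the new value into the running result.
            result[key] = func(result[key], value)
        else:
            # First value for this key is kept as-is.
            result[key] = value
    return list(result.items())

sample_words = ['star', 'wars', 'star', 'a', 'star', 'a']
pairs = [(w, 1) for w in sample_words]
word_counts = reduce_by_key(pairs, lambda x, y: x + y)
print(word_counts)  # [('star', 3), ('wars', 1), ('a', 2)]
```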


In [54]:
spark.stop()