# **Introduction to Spark**

❗Note - This is taken from Noam Cohen's [**Spark Course**](https://github.com/cnoam/spark-course/tree/master). Each notebook is accompanied by a [short video](https://panoptotech.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=ef346a37-dcee-479c-a5cd-afa800b16489).

❗If you want to learn more, see [Spark Architecture](https://panoptotech.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=0ab9daf8-d152-4656-b10f-afa800a52e2b).

## Initialization

Whenever we request a Spark session, we get the existing session if there is one, or a new one otherwise. In both cases, any configuration changes we specified are applied.
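The get-or-create idea can be sketched in plain Python, with no Spark needed. `ToySession` and `get_or_create` are hypothetical names made up for this illustration. This is not how Spark implements `getOrCreate()`, only the general pattern.

```python
class ToySession:
    """Hypothetical stand-in for a session object (not a Spark class)."""
    _active = None  # the one shared session, if it already exists

    def __init__(self):
        self.conf = {}

    @classmethod
    def get_or_create(cls, **options):
        if cls._active is None:           # nothing yet: create a new session
            cls._active = cls()
        cls._active.conf.update(options)  # apply the requested config changes
        return cls._active


first = ToySession.get_or_create(log_level="ERROR")
second = ToySession.get_or_create(eager_eval=True)

print(first is second)  # True: the second request reused the first session
print(second.conf)      # {'log_level': 'ERROR', 'eager_eval': True}
```

This is why re-running the session cell in a notebook does not start a second session.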

To set up Spark on Google Colab we install Java and PySpark. Java is needed because Spark is written in Scala and runs on the Java Virtual Machine (JVM).

In [1]:
# Install PySpark on the Colab machine - code in "Colab reference material" on Moodle.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version

openjdk version "1.8.0_442"
OpenJDK Runtime Environment (build 1.8.0_442-8u442-b06~us1-0ubuntu1~22.04-b06)
OpenJDK 64-Bit Server VM (build 25.442-b06, mixed mode)


Here, we install PySpark and findspark, which helps locate the PySpark installation on the Google Colab machine.

In [2]:
!pip install --force-reinstall pyspark==3.4
!pip install findspark

Collecting pyspark==3.4
  Using cached pyspark-3.4.0-py2.py3-none-any.whl
Collecting py4j==0.10.9.7 (from pyspark==3.4)
  Using cached py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
Using cached py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
  Attempting uninstall: pyspark
    Found existing installation: pyspark 3.4.0
    Uninstalling pyspark-3.4.0:
      Successfully uninstalled pyspark-3.4.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-spark-connect 0.5.2 requires pyspark>=3.5, but you have pyspark 3.4.0 which is incompatible.
Successfully installed py4j-0.10.9.7 pyspark-3.4.0


In [3]:
import pyspark  # PySpark: a Python API for Spark's distributed computing engine
from pyspark.sql import SparkSession
from pyspark.mllib.random import RandomRDDs
from pyspark.sql.types import *

In [4]:
spark = SparkSession.builder.appName('Tutorial 3').getOrCreate() # Creating our SparkSession
sc = spark.sparkContext
# keep only important logs
spark.sparkContext.setLogLevel("ERROR")

In [5]:
# Optional, and ONLY when running in Jupyter: displays DataFrames without explicitly calling .show()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [6]:
# see what version of Spark we are running.
spark

You should get something like
```
SparkSession - in-memory

SparkContext

Spark UI

Version           v3.2.0  << should be at least 3.2.0
Master            local[*] << local means Spark is running on one machine, '*' means it uses all the cores in this machine
AppName           Tutorial 3
```

In [7]:
import os
print(f"SparkContext default Number of partitions: {sc.defaultParallelism}")
print(f"Number of CPUs in the system: {os.cpu_count()}")

SparkContext default Number of partitions: 2
Number of CPUs in the system: 2


The Spark UI is available once the session object is created.

On a local workspace only (not on Colab), open this link to see the Spark UI:
http://localhost:4040

## Working with RDDs

In [8]:
# parallelize() will copy the python object to the JVM, then cut it into partitions according to some rule,
# and then send the partitions to the worker nodes for processing.

nums = sc.parallelize([1, 2, 3, 4, 555])
print('Type:', type(nums))
print('Count:', nums.count())
print(f"Number of partitions in nums RDD: {nums.getNumPartitions()}")

# each node runs the map() on the partitions it has.
# The collect() collects all the results (partitions of the result RDD) to the driver node,
# Then copies the data from the JVM to the python process.

print('Squared:', nums.map(lambda x: x**2).collect())

Type: <class 'pyspark.rdd.RDD'>
Count: 5
Number of partitions in nums RDD: 2
Squared: [1, 4, 9, 16, 308025]
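Here is a plain-Python sketch of the partition, map and collect steps described in the comments above. The even-split rule in `split_into_partitions` is an assumption chosen for illustration, and Spark's actual slicing rule may differ. All function names here are hypothetical.

```python
def split_into_partitions(items, num_partitions):
    """Cut a list into contiguous, roughly equal slices (assumed rule)."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def map_partition(partition, func):
    """What one worker does: apply func to every element it holds."""
    return [func(x) for x in partition]


def collect(partitions):
    """What the driver does: concatenate the partitions of the result."""
    return [x for partition in partitions for x in partition]


data = [1, 2, 3, 4, 555]
partitions = split_into_partitions(data, 2)
squared = [map_partition(p, lambda x: x ** 2) for p in partitions]
result = collect(squared)

print(partitions)  # [[1, 2], [3, 4, 555]]
print(result)      # [1, 4, 9, 16, 308025]
```

Each partition is processed independently, which is what lets Spark run the `map()` on different workers at the same time.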


In [9]:
# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.

u = RandomRDDs.normalRDD(sc, 1000000, 10)  # normalRDD is documented to take the SparkContext (sc)
u = u.map(lambda x: (x,))  # convert to tuple so we can transform into a DF

## Working with Dataframes


In [10]:
schema = StructType([StructField('c1', FloatType(), False)])
# we can move from RDD to Dataframe and back.
df = spark.createDataFrame(u, schema)

In [11]:
# each DF has a schema:
df.printSchema()
df.show(5)

root
 |-- c1: float (nullable = false)

+-----------+
|         c1|
+-----------+
| -1.3167918|
| 0.32184425|
|  1.2145109|
| -1.1810241|
|-0.17757528|
+-----------+
only showing top 5 rows
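The schema declares `c1` as `FloatType`, a 32-bit float, while Python floats are 64-bit, so values are rounded when they enter the DataFrame. The sketch below shows that rounding with the standard `struct` module. `to_float32` is a hypothetical helper, and no Spark is involved.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1
as_float32 = to_float32(value)

print(value == as_float32)             # False: some precision was lost
print(abs(value - as_float32) < 1e-7)  # True: but the error is tiny
```

If full precision matters, `DoubleType()` in the schema would be the 64-bit alternative.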



In [12]:
# Tip: to see the table in a nicer format, convert it to pandas (this collects all rows to the driver):
df.toPandas()

              c1
0      -1.316792
1       0.321844
2       1.214511
3      -1.181024
4      -0.177575
...          ...
999995  1.269174
999996  1.022593
999997 -0.519907
999998 -0.431149


In [13]:
# Get the RDD from the Dataframe
r = df.rdd
type(r)

In [14]:
# Create a simple dataframe
dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]
rdd = spark.sparkContext.parallelize(dept)

df = rdd.toDF()
df.printSchema()
df.show(truncate=True)

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

+---------+---+
|       _1| _2|
+---------+---+
|  Finance| 10|
|Marketing| 20|
|    Sales| 30|
|       IT| 40|
+---------+---+



In [15]:
# Transformation:
# create a range of M numbers.
# This is fast since it is a TRANSFORMATION.
# It is just an execution plan, so even if allocating M numbers
# would use all the memory on this machine, we will not see it now.

M = 100*1000#*1000

myRange = spark.range(M).toDF("number")
nums_doubled_df = myRange.selectExpr("(number * 2) as value")
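The "just an execution plan" idea can be sketched in plain Python. `LazyRange` is a hypothetical toy class, not a Spark API. It records the steps and computes values only when an action such as `take()` asks for them.

```python
import itertools


class LazyRange:
    """Toy stand-in for a DataFrame: stores a plan, computes nothing yet."""

    def __init__(self, n, steps=()):
        self.n = n
        self.steps = tuple(steps)

    def map(self, func):
        # Transformation: return a new plan with one more step recorded.
        return LazyRange(self.n, self.steps + (func,))

    def _rows(self):
        # Generator: values are produced one at a time, only on demand.
        for value in range(self.n):
            for step in self.steps:
                value = step(value)
            yield value

    def take(self, k):
        # Action: run the plan, but only far enough to produce k rows.
        return list(itertools.islice(self._rows(), k))


calls = []  # records every value that is actually computed


def double(x):
    calls.append(x)
    return x * 2


plan = LazyRange(100 * 1000).map(double)  # transformation: returns instantly
computed_before_action = len(calls)
first_five = plan.take(5)                 # action: now the work happens

print(computed_before_action)  # 0
print(first_five)              # [0, 2, 4, 6, 8]
print(len(calls))              # 5
```

Only 5 of the 100,000 values were ever computed, because the action asked for only 5 rows.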

In [16]:
# Actions:
# An action brings rows from the worker nodes (the executors) to the driver program.
# If the result is too large, a "Java heap space exception" will happen, and then you have to restart your kernel.
# Since we use PySpark, this data is then *copied* from the JVM to the Python runtime.
# take(5) is cheap because it fetches only the first 5 rows.

print(nums_doubled_df.take(5))

[Row(value=0), Row(value=2), Row(value=4), Row(value=6), Row(value=8)]


In [17]:
the_big_list = myRange.collect()
type(the_big_list), len(the_big_list)

(list, 100000)

In [18]:
divisBy2 = myRange.where("number % 2 = 0")
print("Count: ",divisBy2.count())
divisBy2.sort('number').show(8)

Count:  50000
+------+
|number|
+------+
|     0|
|     2|
|     4|
|     6|
|     8|
|    10|
|    12|
|    14|
+------+
only showing top 8 rows



## **Example**: Text Processing using RDDs and MapReduce (Counting "Grrahs")

In [19]:
icespice_rdd = sc.textFile('./IceSpiceLyricsClean.txt')
print('Type:', type(icespice_rdd))
print('Count (rows):', icespice_rdd.count())

Type: <class 'pyspark.rdd.RDD'>
Count (rows): 65


<hr>

In [20]:
import re

# Split rows into words AND clean them
deli_words = icespice_rdd.flatMap(lambda row: re.findall(r"\b\w+\b", row.lower()))
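To see what the regex and `flatMap` do, here is a plain-Python sketch on two made-up lines. These lines are illustrative, not taken from the lyrics file, and `tokenize` is a hypothetical helper name.

```python
import re


def tokenize(row):
    """Lowercase a row and return its word tokens."""
    return re.findall(r"\b\w+\b", row.lower())


rows = ["I'm like, GRRAH!", "Grrah..."]

# map: one output element (a list of words) per input row
mapped = [tokenize(row) for row in rows]

# flatMap: the per-row lists are flattened into a single list of words
flat = [word for row in rows for word in tokenize(row)]

print(mapped)  # [['i', 'm', 'like', 'grrah'], ['grrah']]
print(flat)    # ['i', 'm', 'like', 'grrah', 'grrah']
```

Note that the apostrophe splits "I'm" into `'i'` and `'m'`, which is why a bare `'m'` can show up as a word in the counts.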


In [21]:
# Top 10 most frequent words:
# str.casefold() is a more aggressive str.lower(), meant for caseless matching of Unicode text
deli_words.map(lambda word: (word.casefold(), 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda t: t[1], ascending=False) \
    .take(10)

[('grrah', 34),
 ('i', 29),
 ('m', 21),
 ('like', 17),
 ('she', 17),
 ('my', 12),
 ('a', 11),
 ('in', 11),
 ('to', 10),
 ('the', 10)]
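The same map, group-by-key, reduce and sort pipeline can be sketched in plain Python on a tiny made-up word list. This shows the idea behind `reduceByKey`, not Spark's distributed implementation, and the sample words are illustrative.

```python
from collections import defaultdict
from functools import reduce

words = ["grrah", "like", "Grrah", "she", "GRRAH", "like"]

# map: emit a (key, 1) pair per word
pairs = [(word.casefold(), 1) for word in words]

# shuffle: bring all values that share a key together
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# reduce: combine each key's values with the same function as in the cell above
counts = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

# sort by count, descending, and take the top 2
top = sorted(counts.items(), key=lambda t: t[1], reverse=True)[:2]
print(top)  # [('grrah', 3), ('like', 2)]

# casefold() vs lower(): they differ on some non-ASCII text
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

For plain ASCII words like the ones here, `casefold()` and `lower()` give the same result.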

In [22]:
# counting 'grrah':
deli_words.filter(lambda word: word.lower() == 'grrah').count()

34

Checking we got it right - not using Spark

In [23]:
# Counting 'grrah' - sanity check - loading into RAM (NOT SPARK!)
grrah_count = 0
for word in deli_words.take(550): # Load (at most) 550 elements into local variable
  if word == 'grrah':
    grrah_count += 1
print(f'The word \'grrah\' has appeared {grrah_count} times in the song')

The word 'grrah' has appeared 34 times in the song


# Check yourself

Try increasing M (used in `spark.range()` above) by a factor of 1000, and run the code again. What do you expect?