# 00 The basics of using Spark and Jupyter notebook


## Definitions
### Application
A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.

### SparkSession
An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession object yourself.

### Job
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()).

### Stage
Each job gets divided into smaller sets of tasks called stages that *depend* on each other.
This means that dependent stages run one after another: a stage starts only after the stages it depends on have finished.

### Task
A single unit of work or execution that will be sent to a Spark executor. Tasks within a stage can be executed in parallel.
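
The relationship between jobs, stages and tasks can be sketched in plain Python. This is only an analogy, not Spark code: `run_stage` and the sample partitions are hypothetical names, and a thread pool stands in for the executors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(task_fn, partitions, max_workers=4):
    """Run one task per partition; tasks within a stage may run in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task_fn, partitions))


# Hypothetical input: a dataset already split into three partitions.
partitions = [[1, 2], [3, 4], [5, 6]]

# Stage 1: each task squares the numbers in its own partition.
stage1_out = run_stage(lambda part: [x * x for x in part], partitions)

# Stage 2 needs the output of stage 1, so it starts only after stage 1 ends.
stage2_out = run_stage(sum, stage1_out)

# The "action" combines the per-partition results on the driver side.
job_result = sum(stage2_out)
print(stage1_out, stage2_out, job_result)
```

Stages run in order because each one consumes the previous stage's output, while the tasks inside a single stage are independent of each other.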

<hr>

Depending on the configuration, when starting the notebook that is connected to Spark, a SparkSession and SparkContext are already created.

In [1]:
if 'spark' in dir():
    print("A SparkSession is already created for you!")
else:
    print("You need to create your own SparkSession object")

You need to create your own SparkSession object


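In any case, we can ask for a Spark session with `getOrCreate()`: we get the existing session if there is one, or a new one otherwise. Configuration changes we specify are *not guaranteed* to apply to an existing session, since some options only take effect when the session is first created.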

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.mllib.random import RandomRDDs
from pyspark.sql.types import *

In [3]:
# Before getting/creating the Session, we can try to modify parameters. 
spark = SparkSession.builder.appName('00 the basics')\
    .getOrCreate()
sc = spark.sparkContext
# keep only important logs
spark.sparkContext.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/31 10:21:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
# ONLY when running in jupyter:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [5]:
# see what version of Spark we are running.
spark

You should get something like:
```
SparkSession - in-memory

SparkContext

Spark UI

Version           v3.2.0  << should be at least 3.2.0
Master            local[*] << local means Spark is running on one machine, '*' means it uses all the cores in this machine
AppName           00 the basics
```

The Spark UI is available once the session object is created.

Now open this link to see the Spark UI:
http://localhost:4040

# What is Spark?
Apache Spark is an open-source cluster computing framework.

Generalizes the MapReduce model; it can run on Hadoop clusters (YARN, HDFS) but is not built on top of Hadoop MapReduce.

Utilizes in-memory computing.

Originally developed at UC Berkeley (2009).

# Basic DataFrame operations

## The RDD - Resilient Distributed Dataset
- Spark's primary data abstraction.
- A fault-tolerant collection of elements (of any type), partitioned across the nodes of the cluster and capable of accepting parallel operations.
- Allows sharing data across multiple stages of an iterative computation.<br>
Efficiency is accomplished in two ways:
  - The partitions assigned to each worker node are kept between iterations, which avoids shuffling data.
  - RDDs are kept in memory, which avoids writing to and reading from HDFS between iterations. This is feasible because the assignment of partitions to workers is maintained from one iteration to the next.
- Immutable.
- RDD Operations: **Transformations & Actions**

### Transformations

Transformations are operations that will not be completed at the time
you write and execute the code in a cell. They only get executed
once you have called an action. An example of a transformation might be
converting an integer into a float or filtering a set of values.

### Actions

Actions are commands that Spark computes as soon as they are
executed. They run all of the preceding transformations in order
to return an actual result. An action triggers one or more jobs,
each of which consists of tasks that are executed by the workers,
in parallel where possible.


Here are some simple examples of transformations and actions. Remember,
these are not all the transformations and actions - this is just a short
sample of them. We'll get to why Apache Spark is designed this way
shortly!

| Transformations (*lazy*) | Actions |
|--------------------------|---------|
| select                  | show    |
| distinct                | count   |
| groupBy                 | collect |
| sum                     | save    |
| orderBy                 |         |
| filter                  |         |
| limit                   |         |
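
Lazy evaluation can be illustrated without Spark. The toy class below (`LazyDataset`, a hypothetical name that is not part of PySpark) records transformations and runs them only when an action is called. It is a minimal sketch of the idea, not how Spark implements it.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new object with a longer plan.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, nothing is computed here.
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan actually run.
        rows = iter(self._data)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)  # side effect that reveals when the work happens
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda v: v > 2)
calls_before_action = len(calls)  # still 0: nothing has been computed
result = ds.collect()
print(calls_before_action, result, len(calls))
```

Because the plan is known in full before anything runs, an engine built this way has room to optimize the whole chain instead of executing each step eagerly.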


## Directed Acyclic Graph (DAG)
* Vertices are RDDs, edges are transformations

* Generalization of MapReduce

* An action triggers a job; the job's DAG is divided into stages at shuffle boundaries

* This model lets Spark decide which calculations should be recomputed and which can be reused (shown as "skipped" in the user interface)

<img src="https://i.stack.imgur.com/yQf7L.png" >
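
As a rough sketch of how a chain of operations could be cut into stages, the snippet below splits a hypothetical lineage wherever an operation needs a shuffle. The operation list and the `requires_shuffle` flags are assumptions made for illustration; Spark's real scheduler works on the physical plan and is more involved.

```python
# Hypothetical lineage: (operation name, requires_shuffle)
lineage = [
    ("read", False),
    ("filter", False),
    ("groupBy", True),
    ("sum", False),
    ("orderBy", True),
    ("limit", False),
]


def split_into_stages(ops):
    """Start a new stage every time an operation requires a shuffle."""
    stages, current = [], []
    for name, requires_shuffle in ops:
        if requires_shuffle and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


stages = split_into_stages(lineage)
for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
```

Operations that stay within a partition are grouped into the same stage; operations that need data from other partitions force a stage boundary.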


## Spark Structured API
The Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files. These APIs refer to three core types of distributed collection APIs:

* Datasets (Java and Scala API only).
* DataFrames.
* SQL tables and views.


## The DataFrame

A DataFrame is the most common Structured API and simply represents a table of data with
rows and columns. The list that defines the columns and the types within those columns is called
the schema. You can think of a DataFrame as a spreadsheet with named columns.

Each column has a name and a type (e.g. StringType).

Like an RDD, a DataFrame is an immutable, in-memory, resilient, distributed collection of data.

It allows better optimizations (memory management and an optimized execution plan). DataFrames were introduced in Spark 1.3, and Spark 2.0 unified the DataFrame and Dataset APIs.


See [Data types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html)


### DataFrame vs. RDD: which is better?
Usually, DataFrame. Consider RDD if:

* Unstructured data (text, media).
* Specific execution control needed.
* Data manipulation with functional programming concepts.

In practice, a Spark DataFrame is built on top of RDDs.

## Working with RDD

In [25]:
# parallelize() will copy the python object to the JVM, then cut it into partitions according to some rule, 
# and then send the partitions to the worker nodes for processing.
nums = sc.parallelize([1, 2, 3, 4, 5])
print('Type:', type(nums))
print('Count:', nums.count())

# each node runs the map() on the partitions it has.
# The collect() collects all the results (partitions of the result RDD) to the driver node,
# Then copies the data from the JVM to the python process.
print('Squared:', nums.map(lambda x: x**2).collect())

Type: <class 'pyspark.rdd.RDD'>
Count: 5
Squared: [1, 4, 9, 16, 25]


In [8]:
import os
print(f"SparkContext default Number of partitions: {sc.defaultParallelism}")
print(f"Number of CPUs in the system: {os.cpu_count()}")
print(f"Number of partitions in nums RDD: {nums.getNumPartitions()}")

SparkContext default Number of partitions: 8
Number of CPUs in the system: 8
Number of partitions in nums RDD: 8
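
With 5 elements and 8 partitions, some partitions must be empty. The sketch below shows one position-based way to slice a list into partitions. It is an assumption based on simple index arithmetic (`slice_positions` is a hypothetical helper), and the exact placement PySpark chooses may differ; `nums.glom().collect()` can be used to inspect the real layout.

```python
def slice_positions(items, num_slices):
    """Cut a list into num_slices contiguous pieces using integer arithmetic."""
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


slices = slice_positions([1, 2, 3, 4, 5], 8)
print(slices)
print("empty partitions:", sum(1 for s in slices if not s))
```

Empty partitions still produce tasks, so for tiny datasets a smaller number of partitions is usually enough.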


In [9]:

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
# normalRDD() expects the SparkContext (sc), not the SparkSession.
u = RandomRDDs.normalRDD(sc, 1000000, 10)
u = u.map(lambda x: (x,))  # wrap each value in a tuple so we can transform the RDD into a DataFrame

## Working with DataFrame

The RDD is the basic building block, but usually we will want to use a higher-level object: the DataFrame, which wraps the RDD and exposes a convenient API.

A DataFrame always has a schema.

### The Schema
A schema is a StructType made up of a number of fields, StructFields, that have a name, type, and a Boolean flag which specifies whether that column can contain missing or null values.

When reading data from a file, the schema can be inferred automatically, at the cost of reading the data more than once.

### The Row

In Spark, each row in a DataFrame is a single record. Spark represents
this record as an object of type Row. Spark manipulates Row objects
using column expressions in order to produce usable values. Row objects
internally represent arrays of bytes. The byte array interface is never
shown to users because we only use column expressions to manipulate
them.



In [11]:
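schema = StructType([StructField('c1', FloatType(), False)])
# we can move from RDD to DataFrame and back.
# Note: FloatType is a 32-bit type, while the values in the RDD are 64-bit Python floats,
# so precision is lost here; DoubleType would keep it.
df = spark.createDataFrame(u, schema)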

# each DF has a schema:
df.printSchema()
df.show(5)

root
 |-- c1: float (nullable = false)

+-----------+
|         c1|
+-----------+
|   0.815636|
|-0.33020452|
| -0.7046485|
|  2.5385444|
| -1.1390029|
+-----------+
only showing top 5 rows
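
The schema above declares `c1` as `FloatType`, a 32-bit type, while Python floats are 64-bit. The plain-Python sketch below uses the standard `struct` module to show what a round trip through 32-bit storage does to a value. `to_float32` is a hypothetical helper, and this only illustrates the precision effect, not Spark's own conversion code.

```python
import struct


def to_float32(x):
    """Round-trip a 64-bit Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1  # not exactly representable in binary floating point
as_float32 = to_float32(value)

print("64-bit:", repr(value))
print("32-bit:", repr(as_float32))
print("difference:", abs(value - as_float32))
```

If full precision matters, declaring the column as `DoubleType()` avoids this narrowing.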



In [12]:
# get the RDD from the Dataframe
r = df.rdd
type(r)

pyspark.rdd.RDD

In [14]:
# Create a simple dataframe
dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]
rdd = spark.sparkContext.parallelize(dept)

df = rdd.toDF()
df.printSchema()
df.show(truncate=False)

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

+---------+---+
|_1       |_2 |
+---------+---+
|Finance  |10 |
|Marketing|20 |
|Sales    |30 |
|IT       |40 |
+---------+---+



In [26]:
# Transformation:
# define a DataFrame with M rows (the numbers 0 .. M-1).
# This is fast since it is a TRANSFORMATION.
# It is just an execution plan, so even if materializing M numbers
# would use all the memory on this machine, we will not see it now.
M = 100 * 1000 * 1000
myRange = spark.range(M).toDF("number")
nums_doubled_df = myRange.selectExpr("(number * 2) as value")

In [27]:
# Actions:
# Collect the dataframe from all worker nodes (the executors) to the driver program.
# if this is too large a "Java heap space exception" will happen, and then you have to restart your kernel.
# Since we use Pyspark, this data is then *copied* from the JVM to the python runtime.

the_big_list = myRange.collect()
print(nums_doubled_df.take(5))

23/02/01 13:43:02 ERROR Executor: Exception in task 2.0 in stage 38.0 (TID 136)]
java.lang.OutOfMemoryError: Java heap space
23/02/01 13:43:02 ERROR Executor: Exception in task 3.0 in stage 38.0 (TID 137)
java.lang.OutOfMemoryError: Java heap space
23/02/01 13:43:02 ERROR Executor: Exception in task 4.0 in stage 38.0 (TID 138)
java.lang.OutOfMemoryError: Java heap space
23/02/01 13:43:02 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 2.0 in stage 38.0 (TID 136),5,main]
java.lang.OutOfMemoryError: Java heap space
23/02/01 13:43:02 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 3.0 in stage 38.0 (TID 137),5,main]
java.lang.OutOfMemoryError: Java heap space
23/02/01 13:43:02 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 4.0 in stage 38.0 (TID 138),5,main]
java.lang.OutOfMemoryError: Java heap space
23/

ConnectionRefusedError: [Errno 111] Connection refused

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/conda/lib/python3.9/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/opt/conda/lib/python3.9/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/conda/lib/python3.9/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/local/spark/python/pyspark/accumulators.py", line 262, in handle
    poll(accum_updates)
  File "/usr/local/spark/python/pyspark/accumulators.py", line 235, in poll
    if func():
  File "/usr/local/spark/python/pyspark/accumulators.py", line 239, in accum_updates
    num_updates = read_int(self.rfile)
  File "/usr/local/spark/python/pyspark/serializers.py", line 564, in read_int
    raise EOFError
EOFError
----------------------------------------


In [None]:
type(the_big_list), len(the_big_list)
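
A back-of-the-envelope estimate helps explain the `OutOfMemoryError` above. The sketch below only does arithmetic. It assumes 8 bytes per value on the JVM side (a 64-bit long) and uses `sys.getsizeof` for the Python side, whose result is specific to the CPython build. The memory configured for this notebook is not shown; Spark's documented default for `spark.driver.memory` is 1g.

```python
import sys

M = 100 * 1000 * 1000   # same M as in the notebook
BYTES_PER_LONG = 8      # a 64-bit integer on the JVM side

# Raw payload only, ignoring any per-row overhead.
raw_bytes = M * BYTES_PER_LONG
print(f"raw column data: {raw_bytes / 1e9:.1f} GB")

# After collect(), every value also becomes a Python object.
py_int_bytes = sys.getsizeof(10**6)
print(f"one Python int: {py_int_bytes} bytes")
print(f"Python ints alone: about {M * py_int_bytes / 1e9:.1f} GB")
```

Even the raw payload is close to the default memory limit, and the Python-side copy is several times larger, which is why `take(n)` or `show(n)` is the safer way to peek at large results.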

In [19]:
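# Note: the output below (Count: 5000) comes from a run with a much smaller M
# (about 10,000); with the M defined above, the count would be M / 2.
divisBy2 = myRange.where("number % 2 = 0")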
print("Count: ",divisBy2.count())
divisBy2.sort('number').show(8)

Count:  5000
+------+
|number|
+------+
|     0|
|     2|
|     4|
|     6|
|     8|
|    10|
|    12|
|    14|
+------+
only showing top 8 rows



## Example: text processing using RDD

In [21]:
toxic_rdd = sc.textFile('../data/toxic.txt')
print('Type:', type(toxic_rdd))
print('Count (rows):', toxic_rdd.count())

Type: <class 'pyspark.rdd.RDD'>
Count (rows): 44


<hr>

In [22]:
# Split rows to words:
toxic_words = toxic_rdd.flatMap(lambda row: row.split())

In [23]:
# Top 10 most frequent words:
toxic_words.map(lambda word: (word.casefold(), 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda t: t[1], ascending=False) \
    .take(10)

[('a', 19),
 ('you', 18),
 ("i'm", 18),
 ("you're", 12),
 ('toxic', 11),
 ('now', 9),
 ('with', 9),
 ('i', 8),
 ('of', 8),
 ('taste', 8)]

In [24]:
# counting 'baby':
toxic_words.filter(lambda word: word.lower() == 'baby').count()

2
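
For a small input, the same word-count logic can be checked in plain Python. The sample lines below are made up (the notebook reads `../data/toxic.txt` instead), and `collections.Counter` stands in for the `map` + `reduceByKey` pair.

```python
from collections import Counter

# Hypothetical sample lines, used here instead of the real file.
lines = ["Spark is fast", "spark is lazy", "Lazy is fine"]

# flatMap: split every line into words and flatten the result.
words = [word for line in lines for word in line.split()]

# map + reduceByKey: count occurrences of each case-folded word.
counts = Counter(word.casefold() for word in words)

# sortBy + take: the most frequent words first.
top = counts.most_common(2)

# filter + count: how many times a single word appears.
lazy_count = sum(1 for word in words if word.lower() == "lazy")

print(top, lazy_count)
```

This single-process version is handy for validating the logic on a sample before running the distributed version on the full data.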

# Check yourself

Try to increase M (in the `range()` call above) by a factor of 1000, and run the code again. What do you expect to happen?