# Introduction to Spark

## 1. What is Apache Spark?

Apache Spark is an **open-source distributed computing system** that provides an interface for programming entire clusters with implicit **data parallelism** and **fault tolerance**. It was designed as a faster, more general alternative to **Hadoop MapReduce**: it extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.

**Key aspects**:
- Framework for distributed data computing.
- Designed to run on large-scale clusters with lots of data.
- Runs up to 100x faster than Hadoop MapReduce on some workloads, mainly because it keeps data in memory (RAM).
- More operations than just Map and Reduce.
- Multiple APIs, multiple programming languages:
    - Core, SQL, Streaming, GraphX, ML, MLlib, Structured Streaming, …
    - Scala (native), Java (also runs on the JVM), Python, R.
- Runs everywhere:
    - Standalone, YARN, Mesos, Kubernetes, AWS, ...
- Fault tolerance (RDD).
- Easier resource management.
- Reusable data: caching.
- Code control and analysis (DAG).
- Generic programming patterns: the same code can run in local mode or on hundreds of executors.
- Lazy evaluation: transformations and actions.

### 1.1 Architecture

Apache Spark is a distributed data processing system, and its architecture is designed to efficiently distribute and process data across a cluster of computers. Here's a simple explanation of its main components:

**1. Driver Program**: Think of this as the "master" of the application. It's where your Spark application starts and where the final results are collected. It's responsible for:
- Running the main function of the application.
- Creating the SparkContext to coordinate tasks.
- Distributing tasks across executor processes.

**2. SparkContext (SC)**: This is like the "brain" of your Spark application. Once initialized, it coordinates tasks and keeps a connection with the Spark cluster.

**3. Cluster Manager**: This can be likened to a "job dispatcher". It's not part of Spark per se, but Spark can work with several of them, like Apache Mesos, Hadoop YARN, Kubernetes, or the built-in standalone manager. Its job is to:
- Allocate resources (like memory and CPU) for Spark applications.
- Keep track of available/used resources.
  
**4. Executors**: These are like the "workers". Each executor:
- Runs on a node in the Spark cluster.
- Is responsible for executing the tasks assigned by the driver program.
- Stores data in its memory for quick access.
   
**5. Tasks**: These are the "actual work" that needs to be done. When you write a Spark application, the driver breaks down the operations into tasks that are sent to the executors. Each task:
- Works on a slice of your data.
- Runs on an executor.

![Spark Architecture Diagram](images/spark_architecture.png)

**6. Jobs, Stages, and Tasks**: 
- **Job**: A complete computation, which can be a single action like `count()` or `saveAsTextFile()`.
- **Stage**: Jobs are divided into stages. A stage is a set of transformations (e.g., filtering, mapping) that can run without moving data between partitions. Stage boundaries fall at transformations with wide dependencies (such as a reduce by key), which require shuffling data across the cluster.
- **Task**: Each stage is further divided into tasks. A task is a unit of work sent to an executor.
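
To make the stage boundaries more concrete, here is a minimal plain-Python sketch. It does not call Spark: the operation names only mirror common RDD methods, and both the `WIDE_OPS` set and the `split_into_stages` helper are hypothetical names introduced for this illustration. It assumes a simple linear chain of operations and treats every listed wide operation as a shuffle, which is a simplification.

```python
# Plain-Python illustration; nothing here calls Spark.
# Assumption: these operations always need a shuffle (a simplification).
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}

def split_into_stages(operations):
    """Group a linear chain of operations into stages, cutting at each wide one."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(op)
    return stages

pipeline = ["textFile", "flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(pipeline)
for number, stage in enumerate(stages):
    print(f"Stage {number}: {stage}")
```

In this model the narrow operations before `reduceByKey` share one stage, and everything from the shuffle onwards forms the next one.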

<img src="images/jobs_stages_tasks.png" title="Jobs, Stages, and Tasks" width="700px"/>

**7. RDD (Resilient Distributed Dataset)**: Think of these as the "backbone" of data in Spark. They are:
- Immutable distributed collections of data.
- Split into partitions, with each partition residing on a single node.
- Cacheable in memory for faster access.

**8. DataFrames and Datasets**: These are like "enhanced" RDDs. They:
- Offer more optimizations.
- Come with a schema, so you can think of them as distributed tables.

In essence, when you run a Spark application, the driver program coordinates tasks to be executed. The cluster manager allocates resources, and executors on various nodes run these tasks on slices of data. The processed data can be collected back to the driver or stored in external storage systems.
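
The flow described above can be imitated on a single machine. The sketch below is only an analogy, not Spark code: a thread pool from Python's standard library plays the role of the executors, and `make_slices` and `task` are hypothetical helpers standing in for the driver's partitioning and for the work each task performs.

```python
from concurrent.futures import ThreadPoolExecutor

def make_slices(data, num_slices):
    """Driver side: cut the data into roughly equal contiguous slices."""
    size = len(data)
    return [data[i * size // num_slices:(i + 1) * size // num_slices]
            for i in range(num_slices)]

def task(data_slice):
    """Executor side: one task processes one slice (here, a sum of squares)."""
    return sum(x * x for x in data_slice)

data = list(range(1, 11))
slices = make_slices(data, num_slices=3)

# Two worker threads play the role of two executors.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(task, slices))

# The driver combines the partial results returned by the tasks.
total = sum(partial_results)
print(slices)
print(partial_results, total)
```

There are three tasks but only two workers, so one worker handles more than one task, much as an executor runs several tasks over the life of an application.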

## 2. Components

Apache Spark has a well-defined layered architecture where all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries.

<img src="images/spark_stack.png" title="The Spark Stack" width="600px"/>

### 2.1 Spark Core

Spark Core is the underlying general execution engine. It provides in-memory computing capabilities to deliver speed, a generalized execution model, and the ability to integrate with a wide variety of data sources.

### 2.2 Spark SQL

Spark SQL is Spark's package for working with structured data. It provides a programming interface for structured data as well as relational processing with SQL.

### 2.3 Spark Streaming

Spark Streaming allows the processing of live data streams. It lets you reuse the same Spark API, and largely the same code, that you write for batch data to process real-time data.

### 2.4 MLlib

MLlib is Spark's machine learning (ML) library. It provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as tools for constructing, evaluating, and tuning ML pipelines.

### 2.5 GraphX

GraphX is Spark's API for graphs and graph-parallel computation. It provides a growing library of graph algorithms and builders to simplify graph analytics tasks.

## 3. Creating our first Spark Session

In Jupyter notebooks, you can check the documentation of Python functions and classes by prepending `?` (shows the docstring) or `??` (also shows the source code, when available) to their name:

In [None]:
from pyspark.sql import SparkSession

?SparkSession

In [None]:
spark = SparkSession.builder \
            .appName("My first Spark Session") \
            .getOrCreate()
spark

The Spark UI link in the cell output opens a graphical interface for monitoring the processes running in your Spark session, such as jobs, stages, executors and more.

## 4. Understanding RDDs (Resilient Distributed Dataset)

The RDD is the fundamental data structure of Spark. It is an immutable distributed collection of objects that can be processed in parallel. RDDs can be created from Hadoop InputFormats or by transforming other RDDs.

### Properties of RDDs:
- **Immutable**: Once created, the data they contain cannot be changed.
- **Lazy Evaluations**: Computations on RDDs are lazily evaluated, meaning that tasks are not executed until an action is called.
- **Fault Tolerant**: They track data lineage information to rebuild lost data.
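
These three properties can be imitated in a few lines of plain Python. The sketch below is a toy model, not Spark's implementation: `LazyDataset` is a hypothetical class introduced here, its transformations only record a step in a lineage, and its `collect` action replays that lineage over the untouched source data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazy, and rebuilt from its lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)   # never modified after creation
        self._lineage = lineage        # recorded transformations, in order

    def map(self, func):
        # Transformation: returns a NEW dataset and only records the step.
        return LazyDataset(self._source, self._lineage + (("map", func),))

    def filter(self, predicate):
        # Transformation: also recorded, not executed.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the whole lineage over the source data.
        items = list(self._source)
        for kind, func in self._lineage:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items

calls = []

def double(x):
    calls.append(x)   # record every real invocation
    return x * 2

numbers = LazyDataset([1, 2, 3, 4, 5])
doubled = numbers.map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)   # no call to double() has happened yet
result = doubled.collect()         # the action triggers the computation
print(calls_before_action, result, numbers.collect())
```

Because the lineage is kept, a lost result can always be recomputed from the source, which is the intuition behind RDD fault tolerance.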

### Creating an RDD:

New RDDs can be created in two ways: by loading an external dataset or by distributing a collection (such as a list) from the driver program.

In [None]:
# From a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# From a file
rdd_from_file = spark.sparkContext.textFile("datasets/numbers.txt")

<img src="images/RDD_concept.png" title="RDD concept" width="700px"/>

- The `parallelize` method distributes the data in partitions across the executors in the cluster and returns an RDD.
- The `collect` method brings all the partitions back to the driver and returns them as a single Python list.

**Warning**: Using `collect` on a large RDD can be problematic because it brings the entire dataset to the driver, which may run out of memory. Use it with caution, mainly for retrieving small results or while debugging.
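
When only a few elements are needed, fetching a bounded number is gentler on the driver than `collect()`; PySpark RDDs provide `take(n)` for that. The plain-Python sketch below gives the intuition only: the nested lists stand in for partitions, and the `take` helper is a hypothetical illustration, not Spark's implementation.

```python
def take(partitions, n):
    """Gather up to n elements, reading partitions one at a time."""
    taken = []
    partitions_read = 0
    for partition in partitions:
        if len(taken) >= n:
            break  # enough elements: the remaining partitions are never read
        partitions_read += 1
        taken.extend(partition[: n - len(taken)])
    return taken, partitions_read

# Hypothetical layout: nine elements spread over three partitions.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
first_four, partitions_read = take(partitions, 4)
print(first_four, partitions_read)
```

Only as many partitions as needed are touched, so the amount of data reaching the driver is bounded by `n` rather than by the size of the dataset.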

In [None]:
rdd.collect()

In [None]:
rdd_from_file.collect()

The `glom()` method groups the elements of each partition into a list, so collecting its result shows how the data is distributed across partitions:

In [None]:
rdd.glom().collect()

In [None]:
rdd.repartition(3).glom().collect()
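
As a mental model for the two cells above, partitions can be pictured as a list of lists. The sketch below is plain Python, not Spark: the starting layout is hypothetical, and the round-robin `repartition` helper is an assumption made for illustration, so the exact grouping Spark produces may differ.

```python
from itertools import chain

def collect(partitions):
    """Concatenate all partitions, in partition order, into one list."""
    return list(chain.from_iterable(partitions))

def repartition(partitions, num_partitions):
    """Deal the elements round-robin into `num_partitions` new partitions."""
    new_partitions = [[] for _ in range(num_partitions)]
    for position, element in enumerate(collect(partitions)):
        new_partitions[position % num_partitions].append(element)
    return new_partitions

two_partitions = [[1, 2], [3, 4, 5]]   # hypothetical layout of a 5-element dataset
three_partitions = repartition(two_partitions, 3)

print(two_partitions)     # the "glom" view before repartitioning
print(three_partitions)   # same elements, grouped differently
print(sorted(collect(three_partitions)) == sorted(collect(two_partitions)))
```

The elements are the same before and after; only their grouping into partitions (and possibly their order) changes.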

## Questions

In [None]:
!pip install ipywidgets

Run the following cell and hide its code.

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

style = {'description_width': 'initial', 'width': '500px'}

OUTPUT = widgets.Output()

def create_question(index, question, options, answer_idx):
    is_multiple = isinstance(answer_idx, list)

    question_dropdown = widgets.RadioButtons(
        options=options,
        description=f'{index}. {question}',
        disabled=False,
        layout=widgets.Layout(width='100%'),
        style=style
    ) if not is_multiple else widgets.SelectMultiple(
        options=options,
        description=f'{index}. {question} (Choose all that apply by clicking while holding CTRL)',
        disabled=False,
        layout=widgets.Layout(width='100%'),
        style=style
    )

    check_button = widgets.Button(description=f"Check question {index}")

    output = widgets.Output()

    def check_answer(button):
        with output:
            clear_output(wait=True)

            if is_multiple:
                if sorted(question_dropdown.index) == sorted(answer_idx):
                    print(f"Question {index} is correct! 🎉")
                else:
                    print(f"Question {index} is incorrect. 😢 The right answer is {', '.join([options[idx] for idx in answer_idx])}")
                return

            if question_dropdown.value == options[answer_idx]:
                print(f"Question {index} is correct! 🎉")
            else:
                print(f"Question {index} is incorrect. 😢 The right answer is {options[answer_idx]}")

    check_button.on_click(check_answer)
    
    return question_dropdown, check_button, output

questions = [
    {
        "question": "What is Apache Spark?",
        "options": [
            "A proprietary distributed computing system.",
            "An open-source machine learning framework.",
            "A graph database system.",
            "An open-source distributed computing system."
        ],
        "answer_idx": 3
    },
    {
        "question": "Which of the following is NOT a key aspect of Apache Spark?",
        "options": [
            "Framework for distributed data computing.",
            "Designed to be executed in large scale clusters with lots of data.",
            "Multi-threaded in-memory database.",
            "Code control and analysis (DAG)."
        ],
        "answer_idx": 2
    },
    {
        "question": "What does the Driver Program in Spark's architecture do?",
        "options": [
            "It is responsible for executing the tasks assigned by the executors.",
            "Runs the main function of the application.",
            "Allocates resources (like memory and CPU) for Spark applications.",
            "Works on a slice of your data."
        ],
        "answer_idx": 1
    },
    {
        "question": "Which Cluster Manager is NOT supported by Apache Spark?",
        "options": [
            "Apache Mesos",
            "Hadoop YARN",
            "Apache Kafka",
            "Kubernetes"
        ],
        "answer_idx": 2
    },
    {
        "question": "Which Spark component is responsible for processing live data streams?",
        "options": [
            "Spark Core",
            "Spark Streaming",
            "Spark SQL",
            "MLlib"
        ],
        "answer_idx": 1
    },
    {
        "question": "What are the properties of RDD?",
        "options": [
            "Immutable",
            "Mutable",
            "Lazy Evaluations",
            "Fault Tolerant"
        ],
        "answer_idx": [0, 2, 3]
    },
    {
        "question": "If I have an RDD with 1,000,000,000 elements, should I use the 'collect()' function?",
        "options": [
            "Yes",
            "No"
        ],
        "answer_idx": 1
    }
]

question_widgets = []
for i, q in enumerate(questions):
    question_widgets += create_question(i+1, q["question"], q["options"], q["answer_idx"])

# Display the widgets
display(*question_widgets)

## Code Exercises

**Exercise 1: Create a SparkSession**

In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = ???

**Exercise 2: Load the provided list of strings into an RDD**

In [None]:
str_list = ["Spark is such a cool piece of software", "I love Python", "The MapReduce model was revolutionary", "I like dogs"]

# Load list into RDD
rdd = ???