# Core Concepts

## Spark Architecture
Apache Spark is an open-source framework for processing large amounts of structured or unstructured data using cluster computing.
The main advantages of Spark are in-memory cluster computing, implicit data parallelism, and fault tolerance. For some in-memory workloads it can be up to 100x faster than Hadoop MapReduce, and it supports both in-memory caching and disk persistence.

![Spark architecture](../images/Intro_Spark_Arch.png)

### Driver
The driver is responsible for instantiating the SparkSession. It communicates with the cluster manager and requests resources (memory, CPU, executors). It also transforms Spark operations into DAGs (directed acyclic graphs) and schedules them.

### Cluster Manager
It is responsible for managing and allocating resources across the nodes of the cluster.

### Executor
An executor runs on each worker node and communicates with the driver. It executes the distributed tasks.


## Resilient Distributed Datasets
RDD (Resilient Distributed Dataset): Think of an RDD as a collection of data items split across different computers. Spark can perform calculations on these items in parallel, making it super fast for processing large amounts of data.

Shared Variables: When Spark does calculations, it usually sends a copy of the data it needs to each computer. But sometimes, we want to share data between all the computers or collect results from all of them. This is where shared variables come in.

- Broadcast Variables: These are like read-only copies of data that Spark sends to every computer once. This is useful for sharing large datasets that are used in many calculations.

- Accumulators: These are special variables that can only be added to. They're used to collect information from all the computers, like counting things or summing numbers.

In essence, RDDs are for distributing data, and shared variables help coordinate calculations across different computers in a Spark cluster.

### Creation
Parallelized collections are created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.

One important parameter for parallelized collections is the number of partitions to cut the dataset into. Spark runs one task for each partition. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).

In [None]:
from pyspark import SparkContext

sc = SparkContext("local", appName="RDD_Example")

# Create a list of numbers
data = range(40)

# Create an RDD from the list, set the number of partitions
num_partitions = 4
rdd = sc.parallelize(data, num_partitions)

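# Show the data in each partition
# glom() turns each partition into a list; collect it once instead of once per loop
partitions = rdd.glom().collect()
for i, part in enumerate(partitions):
    print(f"Data in Partition {i} -> {part}")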

### Transformations
Transformations in Spark are operations that create a new RDD from an existing one. Importantly, these operations are lazy, meaning they are not executed immediately. Instead, they are remembered and only executed when an action is triggered. This allows Spark to optimize the execution plan and improve performance.

Some common transformations: map, filter, flatMap, groupBy, join, union, distinct

In [None]:
# Example: keep only the odd numbers from the previous RDD
# The operation is not performed until an action is called
rdd_2 = rdd.filter(lambda x: x % 2 != 0)

### Actions
Actions in Spark trigger the execution of transformations and return a result to the driver program.

Some common actions: collect(), count(), first(), take(n), reduce()

In [None]:
# Trigger the computation of rdd_2 (odd numbers only) with the collect() action
# Notice that the number of partitions is inherited from rdd
for i, part in enumerate(rdd_2.glom().collect()):
    print(f"Data in Partition {i} -> {part}")

In [None]:
# Stop the SparkContext
sc.stop()

## DataFrames
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrame Transformation").getOrCreate()

# Create a DataFrame
data = [("Alice", 30), ("Bob", 35), ("Charlie", 40)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the original DataFrame
df.show()

In [None]:
# Select only the Name column
df.select("Name").show()

# Select people older than 32
df.filter(col("Age") > 32).show()

## SparkSQL

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.


In [None]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

# Use SQL syntax 
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

In [None]:
# Stop the SparkSession
spark.stop()

Next steps:

- Data Ingestion: Learn to read data from various sources (CSV, JSON, Parquet, databases, etc.) into PySpark DataFrames.

- Data Cleaning and Transformation: Master data cleaning techniques, handling missing values, outliers, and data transformations using PySpark functions.

- Data Aggregation and Analysis: Perform aggregations, grouping, and statistical analysis on large datasets using PySpark.

- Machine Learning with PySpark: Explore MLlib for basic machine learning tasks.

- Performance Optimization: Learn techniques to optimize PySpark code for performance, including caching, partitioning, and broadcasting.

- Distributed Computing Concepts: Understand concepts like partitioning, shuffling, and task scheduling.