# Exploring PySpark

This notebook covers some PySpark fundamentals, and shows example code for performing common operations.

#### 0. What is Apache Spark?

Spark is an open-source big data processing framework that improves upon earlier Hadoop MapReduce-based solutions, like Hive, by processing data **in-memory** and distributing tasks amongst multiple workers (nodes), all controlled and coordinated by a driver. This means smaller chunks of data can be read, processed, and then collected together when needed, to perform data transformation or analytics.

The nodes in a cluster are abstracted, which means the individual nodes are not addressed directly.

#### 1. So what is `PySpark`?

PySpark is the Python API for Spark, which itself is written in Scala. It gives you the benefit of using Python with Spark, which is easier to pick up than a language such as Scala. This means you can use Spark within your Python scripts to work with big data and undertake your transformations, analytics, ML etc.

#### 2. The Difference between `Local`, `Client` and `Cluster` mode

Local mode is simply running Spark locally, like on your laptop. There is no submission to a cluster of machines. It is typically only used for small, local development with limited data sizes.

In Client mode, the driver runs on the machine from where the Spark application is submitted, so it could be your local laptop for example. The driver still coordinates the work, while the executors on the cluster run the tasks. The client sends tasks to the cluster and takes results back. It's a good mode for development and/or debugging applications. It also provides easier access to logs from the application.

In Cluster mode, the driver itself runs on a node inside the cluster, alongside the workers.<br>
The driver still does the same tasks.<br>
The client machine submits the Spark application to the cluster manager, and the cluster manager takes care of running the application on the cluster.<br>
This mode is suitable for production environments where the cluster manager handles resource allocation and scheduling of tasks.
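
As a rough illustration of how the three modes differ at submission time, the sketch below builds the `spark-submit` argument list for each one. `--master` and `--deploy-mode` are real `spark-submit` options; the helper name `build_spark_submit` and the script name `my_app.py` are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex

VALID_DEPLOY_MODES = {"client", "cluster"}


def build_spark_submit(app_path, master="yarn", deploy_mode="client"):
    """Return the spark-submit argument list for the requested mode."""
    if master.startswith("local"):
        # Local mode: everything runs on this machine, so no deploy mode is passed
        return ["spark-submit", "--master", master, app_path]
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    # Client mode keeps the driver on the submitting machine;
    # cluster mode asks the cluster manager to host the driver
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app_path]


for args in (
    build_spark_submit("my_app.py", master="local[*]"),
    build_spark_submit("my_app.py", deploy_mode="client"),
    build_spark_submit("my_app.py", deploy_mode="cluster"),
):
    print(shlex.join(args))
```

Keeping the command as a list (rather than one long string) makes it easy to hand to `subprocess.run` later if you do want to launch it from Python.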

#### 3. What is Fault Tolerance?

Spark ensures fault tolerance of data through its lineage information. This allows it to recompute lost data rather than replicating that data across multiple nodes. At the core of fault tolerance are RDDs (Resilient Distributed Datasets). RDDs are low-level, immutable and fault-tolerant collections of data that can be processed in parallel across a cluster. By maintaining the lineage of transformations applied to the data, Spark can recompute just the lost partition as needed, and thus does not have to reprocess the entire dataset.

RDD lineage is represented as a DAG (Directed Acyclic Graph) of all the transformations applied to the base RDD. 
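
To make the idea of recomputing from lineage concrete, here is a toy model in plain Python. It is not how Spark is implemented; the `ToyRDD` class and its methods are hypothetical names used only to show that, when the base data and the list of transformations are kept, a single lost partition can be rebuilt without touching the others.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable base partitions plus recorded transformations."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # one list of rows per partition
        self.lineage = tuple(lineage)               # ordered (kind, function) steps

    def map(self, fn):
        # Transformations never modify data; they return a new object with a longer lineage
        return ToyRDD(self.source_partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.source_partitions, self.lineage + (("filter", fn),))

    def compute_partition(self, index):
        """Replay the lineage against one base partition only."""
        rows = list(self.source_partitions[index])
        for kind, fn in self.lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


base = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
result = base.map(lambda x: x * 10).filter(lambda x: x > 20)

# Materialise every partition, then simulate losing partition 1
materialised = {i: result.compute_partition(i) for i in range(3)}
del materialised[1]

# Only the lost partition is recomputed; partitions 0 and 2 are left alone
materialised[1] = result.compute_partition(1)
print(materialised[1])  # [40, 50, 60]
```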

#### 4. RDD vs DataFrame?

An RDD and a DataFrame are both ways of organising data used by Apache Spark. An RDD is a collection of data objects spread across multiple nodes in a Spark cluster. A DataFrame is more similar to a standard SQL database table, where a schema laid over the data organises it into columns and rows. The DataFrame API is useful because, with data laid out in columns, queries can be optimised for performance.

An RDD distributes data across partitions on multiple nodes (servers etc.) as unstructured blocks of data. As this data is immutable, it's never updated, but is recreated when changes are made.

A DataFrame can take an RDD and add the schema structure to it.<br> 
Typically a DataFrame is best used for structured data (though it can also be used for unstructured data when needed), whereas RDDs are more often used for unstructured data (where you don't know the schema your data should conform to).

`Datasets`: These offer a balance between the two, combining the typed objects of RDDs with the optimised execution of DataFrames. The typed Dataset API is available in Scala and Java, but not in PySpark.

Basically, DataFrames and Datasets are built upon RDDs, which are a core component of Spark.

#### 5. Performance : Spark 2.x vs Spark 3.x 

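It used to be the case that, in Spark 2.x, it was typically faster to use Scala over the Python or SQL APIs. Since Spark 3.x that is largely no longer true when you use DataFrames: Python and SQL queries are compiled into the same kind of optimised execution plan, so performance differences are usually negligible. The main exception is code that runs row by row in Python itself, such as Python UDFs or RDD operations, where data has to be passed between the JVM and Python workers.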

#### 6. What is Lazy Evaluation in Spark?

Lazy evaluation means that, as your code runs, Spark does not execute transformations straight away. Instead, it builds an execution plan so it can process the transformations in the most optimised way it can. Since Spark 3.0, Adaptive Query Execution can also adjust that plan at runtime, for example to handle skew in the distributed data during joins. Only when an action is called, such as `count()`, `show()`, or writing data somewhere, does Spark execute the transformations and steps built into its plan.
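
The behaviour can be mimicked in a few lines of plain Python. This is a toy model rather than Spark code: `LazyDataset` and `is_even` are hypothetical names, and the call log simply proves that no predicate runs until the action (`count()`) is invoked.

```python
class LazyDataset:
    """Toy dataset that records transformations and only runs them on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = tuple(plan)

    def filter(self, predicate):
        # Transformation: extend the plan, do no work yet
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def select(self, fn):
        # Transformation: extend the plan, do no work yet
        return LazyDataset(self._rows, self._plan + (("select", fn),))

    def count(self):
        # Action: execute every recorded step, then return a result
        rows = list(self._rows)
        for kind, fn in self._plan:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return len(rows)


calls = []  # records every time the predicate is actually evaluated


def is_even(x):
    calls.append(x)
    return x % 2 == 0


ds = LazyDataset(range(6)).filter(is_even).select(lambda x: x * x)
print(len(calls))  # 0 -> nothing has executed yet
print(ds.count())  # 3 -> the action triggers the whole plan
print(len(calls))  # 6 -> the predicate ran once per input row
```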

#### 7. What is Shuffling?

Shuffling happens when certain transformations or actions, such as joins or group-bys, need data from different nodes to be moved onto the same node as other corresponding data in order to perform the task. It is called shuffling because the data has to move between nodes across the network. Shuffling isn't always avoidable, but steps can be taken to minimise it. Techniques like bucketing and sorting (where you shuffle the data once upfront on join keys you will often use, for example a Customer ID) keep similar data on the same nodes and avoid future shuffling in your execution plan.

#### 8. YARN and why it's used
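
A small plain-Python sketch of what a shuffle does is shown below. It is a simplification: the key modulo the partition count stands in for Spark's hash partitioner, and `shuffle_by_key` plus the sample order records are hypothetical. The point is that records sharing a key end up in the same partition, and some of them have to move to get there.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) records so that equal keys share a partition.

    Returns the new partitions and the number of records that changed partition.
    """
    shuffled = defaultdict(list)
    moved = 0
    for source_index, records in enumerate(partitions):
        for key, value in records:
            target = key % num_partitions  # stand-in for a hash partitioner
            if target != source_index:
                moved += 1  # this record would cross the network
            shuffled[target].append((key, value))
    return [shuffled[i] for i in range(num_partitions)], moved


# Orders for customer 10000 start out on two different partitions
before = [
    [(10000, "order-a"), (10001, "order-b")],
    [(10000, "order-c"), (10002, "order-d")],
]
after, moved = shuffle_by_key(before, num_partitions=2)

print(after[0])  # [(10000, 'order-a'), (10000, 'order-c'), (10002, 'order-d')]
print(after[1])  # [(10001, 'order-b')]
print(moved)     # 3
```

If the data had already been laid out by customer ID (as bucketing does), `moved` would be zero for a later join or group-by on that key.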

YARN (Yet Another Resource Negotiator) is a resource management layer that serves as the cluster manager, so Spark can manage resources and schedule applications etc.<br>

Scheduling here means identifying applications and the jobs/tasks within them, working out what resources they need from the cluster, and checking whether those resources are available, as they could be in use by other competing jobs. For example, as tasks in one job finish and release resources back to the cluster, those resources can be given to another job waiting in a queue.

YARN's ResourceManager includes an ApplicationsManager, which handles the acceptance, restart and completion of applications in the cluster. Each application gets an ApplicationMaster: in cluster mode the Spark driver runs inside it, whereas in client mode the ApplicationMaster only requests resources on the driver's behalf. The ResourceManager allocates containers across various nodes in the cluster. The NodeManagers on those nodes launch the executors within those containers. The Spark driver coordinates with the executors to execute tasks. Executors process data and perform computations, storing intermediate results in memory or on disk as needed.

#### 9. Out of Memory Errors in Spark 

Out of memory errors can be viewed through two lenses: the driver and the executors.<br>

The driver can run out of memory during actions like `collect()`, which pull all the data back to the driver node, when the driver does not have enough memory specified in the configs to hold that data. Your options are to increase driver memory, or to make code changes to work with smaller subsets of the data before pulling them back to the driver.

For executor memory, there can be a few different causes, but often it comes down to YARN memory overhead. This is memory reserved on top of each executor's heap for off-heap use. It stores things like interned strings, other internal objects, and objects for non-Scala languages when using R or Python etc. So, if you see an error like `YARN killed the container due to memory limits`, this might mean you need to reconfigure the (default) setting for how much overhead memory is reserved per executor (`spark.executor.memoryOverhead`).

High concurrency errors occur when too many cores are assigned to each executor. If you have multiple cores on an executor, the memory of that executor is shared amongst the tasks running on them. This can potentially lead to memory exhaustion. A common guideline for Apache Spark is four or five cores per executor, so that a machine's capacity isn't exhausted.

Large partitions can also produce this issue. When one partition of your data is significantly larger than the others, the task processing it can exhaust the executor's memory. You may need to consider splitting larger partitions into smaller ones, or increasing the memory of the executors.
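
The arithmetic behind these two points can be sketched as follows. The defaults used here (overhead of 10% of executor memory, with a 384 MiB minimum) match the documented defaults for `spark.executor.memoryOverhead` on JVM workloads, but treat them as assumptions to check against your Spark version. `executor_memory_breakdown` is a hypothetical helper, and the per-core figure is only a rough even split: Spark shares executor memory between running tasks dynamically.

```python
def executor_memory_breakdown(executor_memory_mb, executor_cores,
                              overhead_factor=0.10, min_overhead_mb=384):
    """Estimate the container size YARN must grant and the rough heap share per core."""
    # Overhead is requested in addition to the executor heap
    overhead_mb = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    return {
        "overhead_mb": overhead_mb,
        "container_request_mb": executor_memory_mb + overhead_mb,
        # Simplification: assumes one task per core, each with an equal share of the heap
        "heap_per_core_mb": executor_memory_mb // executor_cores,
    }


# A 4G executor with 5 cores, as a worked example
print(executor_memory_breakdown(4096, 5))
# {'overhead_mb': 409, 'container_request_mb': 4505, 'heap_per_core_mb': 819}

# The same executor with 16 cores leaves far less heap per concurrent task
print(executor_memory_breakdown(4096, 16)["heap_per_core_mb"])  # 256
```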
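
One simple way to spot an oversized partition is to compare each partition's row count against the median. The sketch below works on a plain list of counts (which you might read off the Spark UI's task metrics); `find_skewed_partitions` is a hypothetical helper and the factor of 5 is an arbitrary threshold chosen for illustration.

```python
from statistics import median


def find_skewed_partitions(row_counts, factor=5.0):
    """Return the indexes of partitions holding more than `factor` times the median rows."""
    typical = median(row_counts)
    if typical <= 0:
        return []  # nothing sensible to compare against
    return [i for i, n in enumerate(row_counts) if n > factor * typical]


# Partition 3 is far larger than the rest
counts = [1000, 1200, 900, 25000, 1100]
print(find_skewed_partitions(counts))  # [3]
```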

#### 10. Debugging slow applications 

When you have a Spark application that is performing slowly, there's a good chance a bottleneck exists somewhere.<br>
You can use the Spark UI and the logs to try and identify which parts of the jobs' execution plan are taking the longest. You can then use the statistics the UI captures on those tasks to help debug.

Typically, this can help identify where one particular task is taking a long time (a larger partition compared to the rest of the data), or where lots of shuffling might be occurring. You can then go back to your code and try things like calling an action after a certain transformation, to help pinpoint issues, and then work on the code optimisation in the right place.

Equally, you may want to use the UI for the cluster manager too, because it may be that your application is not getting the resources it requires, and is simply sitting in an ACCEPTED state rather than running. The logs can help identify this.

#### 11. Starting a PySpark session interactively from the shell

So, in your shell, you can execute<br>
```
pyspark
```
which would start an interactive PySpark shell (assuming the relevant installs and configs are in place)<br>

<img src="./images/pyspark_shell.png">

You can then exit this shell using:

```
exit()
```

#### 12. Creating a local PySpark session in your Python Code

Note: a Spark session can take up to a few minutes to launch, depending on the setup being used.

In [1]:
# imports 
from pyspark.sql import SparkSession 

spark = SparkSession.builder\
            .appName("my_local_spark_session")\
            .master("local")\
            .getOrCreate() 

# let's print the details of our local spark session 
print(spark.version) 

# close the spark session 
spark.stop() 

24/05/15 14:58:11 WARN Utils: Your hostname, DCollins-Laptop1 resolves to a loopback address: 127.0.1.1; using 172.26.39.146 instead (on interface eth0)
24/05/15 14:58:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/15 14:58:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


3.4.1


#### 13. Spark Context

Spark Context establishes a connection to the spark cluster. It can connect to various cluster managers like YARN, Mesos, or a standalone Spark cluster. <br>

It's responsible for submitting jobs to the cluster, and handles the scheduling and distribution of tasks from your application to the cluster. Equally, it holds the configuration management of your spark application running on the cluster. 

It can also be used to create RDDs etc.

Basically, think of it as the traditional main entry point for Spark functionality. SparkSession (introduced in Spark 2.0) is a unified interface that combines Spark's various functionalities into a single entry point. SparkSession wraps SparkContext and provides a higher-level API for working with Spark.

In [6]:
# imports 
from pyspark.sql import SparkSession 

spark = SparkSession.builder\
            .appName("my_local_spark_session")\
            .master("local")\
            .getOrCreate() 

sc = spark.sparkContext 

print(f"Spark UI: {sc.uiWebUrl}") # Url of the Spark Web UI
print(f"Spark Application ID: {sc.applicationId}") # ID of the spark application 
print(f"Spark App Start Time: {sc.startTime}") # Start time of the application 
print(f"Spark default Parallelism: {sc.defaultParallelism}") # Default parallelism level 
print(f"Spark default min partitions: {sc.defaultMinPartitions}") # Default minimum partitions 

print("=" * 100)
# Status info
# NOTE: statusTracker is a method, so printing it without () shows the bound method itself.
# Call sc.statusTracker() to get a StatusTracker object you can query.
print(sc.statusTracker)

# close
spark.stop() 

24/05/15 12:33:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Spark UI: http://172.26.39.146:4041
Spark Application ID: local-1715772804723
Spark App Start Time: 1715772804647
Spark default Parallelism: 1
Spark default min partitions: 1
<bound method SparkContext.statusTracker of <SparkContext master=local appName=my_local_spark_session>>


#### 14. Creating a more complex Spark application where you need to specify configs and provide JAR files for additional functionality 

* You can specify configs explicitly in the session builder of your application
* You can provide things like JAR files or Python code to be sent to each node of the cluster for the application, enabling additional functionality like JDBC connections to databases, or external libraries like AWS Deequ

In [None]:
# below covers some PySpark examples, when working with a cluster etc
# imports
from pyspark.sql import SparkSession

# extra JAR files you may want to ship to the cluster
useJarFiles = [
    's3://my_bucket_location/folder1/jars/dummy1.jar',
    's3://my_bucket_location/folder1/jars/dummy2.jar'
]
jarList = ",".join(useJarFiles)  # `spark.jars` expects one comma-separated string

#Python version settings, this allows us to target an executor environment to match our driver 
pyspark_deps = "s3://user/spark/shared/lib/pyspark4.8-deps/environment.tar.gz#environment"

# build spark session 
def getSpark(
        appName: str,
        driverMemory: str = "2G",
        executorMemory: str = "4G",
        executorCores: str = "5",
        queue: str = 'default',
        addJarFiles: str = r"s3://user/spark/jdbc/jarsFiles/postgresql-42.6.0.jar"
) -> SparkSession:
    """
    Simple function to return a spark session through object `spark`.
    `addJarFiles` is a comma-separated string of JAR paths (for example `jarList` above).
    """
    spark = SparkSession\
            .builder\
            .appName(appName)\
            .enableHiveSupport()\
            .master("yarn")\
            .config("spark.driver.memory", driverMemory)\
            .config("spark.executor.memory", executorMemory)\
            .config("spark.yarn.queue", queue)\
            .config("spark.dynamicAllocation.enabled", "true")\
            .config("spark.dynamicAllocation.initialExecutors", "1")\
            .config("spark.dynamicAllocation.maxExecutors", "16")\
            .config("spark.dynamicAllocation.minExecutors", "1")\
            .config("spark.executor.cores", executorCores)\
            .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")\
            .config("spark.sql.caseSensitive", "false")\
            .config("spark.sql.parquet.writeLegacyFormat", "true")\
            .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
            .config("hive.exec.dynamic.partition.mode", "nonstrict")\
            .config("spark.shuffle.service.enabled", "true")\
            .config("spark.yarn.dist.archives", pyspark_deps)\
            .config("spark.jars", addJarFiles)\
            .getOrCreate()
    return spark


try:
    # pass addJarFiles=jarList to ship the extra JARs defined above instead of the default
    spark = getSpark(appName='Pyspark_EMR_Test') 
    print("PySpark session available through `spark` object")
except Exception as e:
    print(e)
    raise  # re-raise, as nothing below can run without a session

# show databases 
databases = spark.sql("SHOW DATABASES")
databases.toPandas() 

spark.stop()

#### 15. Reading in CSV files to a Spark DataFrame 

In [1]:
# import 
from pyspark.sql import SparkSession

# create session 
spark = SparkSession.builder\
            .appName("my_local_spark_session")\
            .master("local")\
            .getOrCreate() 
sc = spark.sparkContext 

24/05/15 15:01:33 WARN Utils: Your hostname, DCollins-Laptop1 resolves to a loopback address: 127.0.1.1; using 172.26.39.146 instead (on interface eth0)
24/05/15 15:01:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/15 15:01:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Now, from the `/data` subfolder, read in the customer CSV file to a dataframe

In [6]:
filePath = "./data/customerMasterExtract.csv"
customer_df = (spark.read
    .option("delimiter", ",") # sets the delimiter to `,`
    .option("header", "true") # informs pyspark that row 1 should be treated as the column headings of the data 
    .option("inferSchema", "true") # lets spark infer the schema of the data itself, rather than us explicitly creating a schema
    .csv(filePath) 
)

customer_df.show(5, truncate=False) 

                                                                                

+----------+---------+--------+-------------+------------------------+--------+-------------------+----------+-------------------+
|customerID|firstName|lastName|rewardsMember|emailAddress            |postcode|profession         |dob       |customerJoined     |
+----------+---------+--------+-------------+------------------------+--------+-------------------+----------+-------------------+
|10000     |Helen    |Hope    |true         |gordon49@example.com    |N16 8GZ |Quantity surveyor  |1962-09-18|2002-01-28 13:27:36|
|10001     |George   |Hill    |false        |kgrant@example.org      |E35 0TP |Graphic Designer   |1984-10-02|2023-04-22 02:00:33|
|10002     |Hollie   |Morris  |true         |singhben@example.net    |M16 9GR |Roofer             |1985-08-19|2010-04-12 19:25:23|
|10003     |Carolyn  |Johnston|true         |barnesdawn@example.org  |TR8X 4YS|Pharmacy Technician|1958-11-02|1991-06-01 05:33:14|
|10004     |Roger    |Atkins  |true         |bernardstone@example.com|S50 0TD |Arch
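
The read above relies on `inferSchema`, which makes Spark scan the file to guess the column types. As an alternative sketch, you can pass the schema explicitly: `DataFrameReader.schema()` accepts a DDL-formatted string as well as a `StructType`. The column names below come from the output above, but the types are assumptions based on the sample rows shown, and `build_ddl_schema` and `read_customers` are hypothetical helper names.

```python
# Column names taken from the customer extract; the types are assumed from the sample rows
CUSTOMER_COLUMNS = [
    ("customerID", "INT"),
    ("firstName", "STRING"),
    ("lastName", "STRING"),
    ("rewardsMember", "BOOLEAN"),
    ("emailAddress", "STRING"),
    ("postcode", "STRING"),
    ("profession", "STRING"),
    ("dob", "DATE"),
    ("customerJoined", "TIMESTAMP"),
]


def build_ddl_schema(columns):
    """Turn (name, type) pairs into a DDL schema string such as 'a INT, b STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_customers(spark, file_path):
    """Read the customer CSV with an explicit schema instead of inferSchema."""
    return (spark.read
        .option("delimiter", ",")
        .option("header", "true")
        .schema(build_ddl_schema(CUSTOMER_COLUMNS))
        .csv(file_path)
    )


print(build_ddl_schema(CUSTOMER_COLUMNS))
```

With an active session you would call `read_customers(spark, "./data/customerMasterExtract.csv")`. An explicit schema avoids the extra pass over the data that inference needs, and it stops column types changing silently if the file contents change.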

#### 16. Creating a new column with a literal value

This is where you create a new column in the data, where you want all rows to have the same value.<br>
For example, let's take the above DataFrame and add a simple column called "cardHolder", giving every customer a default value of 'Y'.

To do this, we need to import some extra PySpark methods/functions

*NOTE - In an actual script, you would do this all at the top, but for this demo, don't worry*

In [7]:
from pyspark.sql.functions import lit 

card_holder_customers = (customer_df   # specifies the base dataframe to transform from 
    .withColumn(                       # uses withColumn to create a new column 
        "cardHolder",                  # passes "cardHolder" as the new column name
        lit("Y")                       # gives each row the literal value 'Y' with the lit() method
    )
)

card_holder_customers.show(3, truncate=False) # using .show() calls an `action` which actually executes the transformation above

+----------+---------+--------+-------------+--------------------+--------+-----------------+----------+-------------------+----------+
|customerID|firstName|lastName|rewardsMember|emailAddress        |postcode|profession       |dob       |customerJoined     |cardHolder|
+----------+---------+--------+-------------+--------------------+--------+-----------------+----------+-------------------+----------+
|10000     |Helen    |Hope    |true         |gordon49@example.com|N16 8GZ |Quantity surveyor|1962-09-18|2002-01-28 13:27:36|Y         |
|10001     |George   |Hill    |false        |kgrant@example.org  |E35 0TP |Graphic Designer |1984-10-02|2023-04-22 02:00:33|Y         |
|10002     |Hollie   |Morris  |true         |singhben@example.net|M16 9GR |Roofer           |1985-08-19|2010-04-12 19:25:23|Y         |
+----------+---------+--------+-------------+--------------------+--------+-----------------+----------+-------------------+----------+
only showing top 3 rows



#### 17. Re-partition Data

We can check how many partitions our data has, and re-partition it where required.

In [8]:
# current number of partitions 
print(card_holder_customers.rdd.getNumPartitions()) 

1


Let's say we want to partition on `profession`, which has relatively low cardinality.

In [11]:
custs_partitioned_by_rewards = (
    card_holder_customers.repartition("profession")
)