In [1]:
import os
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/openjdk-17.jdk/Contents/Home"

# Import findspark to help Jupyter locate your Spark installation
import findspark

# Initialize the findspark library — sets up environment so Spark works in notebooks
findspark.init()

# Import SparkSession, the main entry point to Spark functionality
from pyspark.sql import SparkSession

spark = (SparkSession.builder # Start Spark session builder
            .appName("SparkReadJob") # Set application name
            .config("spark.sql.shuffle.partitions", 2) # Set number of shuffle partitions (e.g. for groupBy, joins)
            .config("spark.default.parallelism", 2) # Set default parallelism
            .config("spark.sql.warehouse.dir", "spark-warehouse") # Set warehouse directory
            .config("spark.driver.extraJavaOptions", "--add-opens java.base/javax.security.auth=ALL-UNNAMED")
            .config("spark.executor.extraJavaOptions", "--add-opens java.base/javax.security.auth=ALL-UNNAMED")
            .enableHiveSupport() # Enable Hive support for Spark SQL
            .master("local[2]") # Run Spark locally using 2 CPU threads
            .getOrCreate()
)     

# Print the version of Spark you're using (e.g., "4.0.0")
print("✅ Spark is ready. Version:", spark.version)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/06 19:00:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark is ready. Version: 4.0.0


### 04.02 Read Parquet Files into Spark
Read a non-partitioned Parquet file into Spark. Measure the time taken. Also look at the execution plan.

In [2]:
sales_parquet = spark\
                .read\
                .parquet("dummy_hdfs/raw_parquet")

#Display the results
sales_parquet.show(5)

#show the execution plan
print("\n--------------------------EXPLAIN--------------------------")
sales_parquet.explain(True)
print("-------------------------END EXPLAIN-----------------------\n")

                                                                                

+---+--------+--------+----------+--------+-----+---------------+
| ID|Customer| Product|      Date|Quantity| Rate|           Tags|
+---+--------+--------+----------+--------+-----+---------------+
|  1|   Apple|Keyboard|2019/11/21|       5|31.15|Discount:Urgent|
|  2|LinkedIn| Headset|2019/11/25|       5| 36.9|  Urgent:Pickup|
|  3|Facebook|Keyboard|2019/11/24|       5|49.89|           NULL|
|  4|  Google|  Webcam|2019/11/07|       4|34.21|       Discount|
|  5|LinkedIn|  Webcam|2019/11/21|       3|48.69|         Pickup|
+---+--------+--------+----------+--------+-----+---------------+
only showing top 5 rows

--------------------------EXPLAIN--------------------------
== Parsed Logical Plan ==
UnresolvedDataSource format: parquet, isStreaming: false, paths: 1 provided

== Analyzed Logical Plan ==
ID: int, Customer: string, Product: string, Date: string, Quantity: int, Rate: double, Tags: string
Relation [ID#0,Customer#1,Product#2,Date#3,Quantity#4,Rate#5,Tags#6] parquet

== Optimized

                                                                                

#### Interpreting the Spark Execution Plan for a Non-Partitioned Parquet File

- **Parsed Logical Plan**: Spark has detected a read from a data source (in this case, **Parquet**) with one provided path. This is the raw plan before schema resolution or optimization.

- **Analyzed Logical Plan**: Spark resolves column names and their types. Here, the schema is read from the Parquet file metadata, giving columns such as `ID`, `Customer`, and `Product` along with their data types. This confirms that the file was found and its schema was applied.

- **Optimized Logical Plan**: Spark applies logical optimizations (e.g., removing unnecessary projections or filters). In this case, the plan remains unchanged, as there are no filters or transformations to optimize.

- **Physical Plan**: This is the actual execution strategy Spark will use:
  - `FileScan parquet`: Spark scans the Parquet files using vectorized (batched) reads for performance.
  - `ColumnarToRow`: Converts the columnar batches produced by the vectorized reader into Spark's internal row format for further processing.
  - `PartitionFilters` is empty because the data is **not partitioned**, so there are no partition directories to prune. `PushedFilters` is empty because the query contains no filter condition that could be pushed down to the reader.

This plan confirms that Spark is performing a direct, full read of the non-partitioned Parquet file, with no pruning or predicate pushdown optimizations in place.


### 04.03 Read Partitioned Data into Spark

In [3]:
sales_partitioned = spark\
                    .read\
                    .parquet("dummy_hdfs/partitioned_parquet/*")

#Display the results
sales_partitioned.show(5)

#show the execution plan
print("\n--------------------------EXPLAIN--------------------------")
sales_partitioned.explain()
print("-------------------------END EXPLAIN-----------------------\n")

25/07/06 19:03:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: dummy_hdfs/partitioned_parquet/*.
java.io.FileNotFoundException: File dummy_hdfs/partitioned_parquet/* does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:917)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1238)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:907)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:56)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:381)
	at org.apache.spark.sql.catalyst.analysis.ResolveDataSource.org$apache$spark$sql$catalyst$analysis$ResolveDataSource$$loadV1BatchSource(ResolveDataSource.scala:143)
	at org.apache.spark.sql.catal

+---+--------+----------+--------+-----+--------------------+
| ID|Customer|      Date|Quantity| Rate|                Tags|
+---+--------+----------+--------+-----+--------------------+
|  6|  Google|2019/11/23|       5|40.58|                NULL|
|  8|  Google|2019/11/13|       1|46.79|Urgent:Discount:P...|
| 14|   Apple|2019/11/09|       4|40.27|            Discount|
| 15|   Apple|2019/11/25|       5|38.89|                NULL|
| 20|LinkedIn|2019/11/25|       4|36.77|       Urgent:Pickup|
+---+--------+----------+--------+-----+--------------------+
only showing top 5 rows

--------------------------EXPLAIN--------------------------
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [ID#30,Customer#31,Date#32,Quantity#33,Rate#34,Tags#35] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(5 paths)[file:/Users/bing/Downloads/Spark/Ex_Files_Big_Data_Analytics_Hadoop_Ap..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ID:int,Customer:strin

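Reading the partitioned Parquet data was much faster in this run. Note that the `/*` glob in the path makes Spark treat each `Product=...` directory as a separate input (the plan lists 5 paths). That is what triggers the `FileNotFoundException` warning above, and it is also why the `Product` partition column is missing from the result. Reading `dummy_hdfs/partitioned_parquet` without the glob lets Spark discover the partitions and keep `Product` as a column.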
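
The speed comparison between the two reads is easier to trust with a measurement. Below is a minimal sketch of a timing helper. `timed` is a hypothetical name, not a Spark API, and it assumes the callable you pass triggers a Spark action (for example a `count()`), since transformations alone do no work. A stand-in workload is used here so the sketch runs without a Spark session.

```python
import time

def timed(label, action):
    """Run a zero-argument callable and report how long it took.

    `timed` is a hypothetical helper, not part of PySpark.
    """
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed

# Stand-in workload so the sketch runs without a Spark session.
total, seconds = timed("sum of 0..999", lambda: sum(range(1000)))
```

In the notebook this could be called as `timed("raw parquet", lambda: spark.read.parquet("dummy_hdfs/raw_parquet").count())`, and likewise for the partitioned path. Running each read a few times is advisable, because the first one also pays for start-up and file-listing overhead.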

In [5]:
#Read specific partition only
sales_headset = spark\
                    .read\
                    .parquet("dummy_hdfs/partitioned_parquet/Product=Headset")
sales_headset.show(5)

+---+--------+----------+--------+-----+--------------------+
| ID|Customer|      Date|Quantity| Rate|                Tags|
+---+--------+----------+--------+-----+--------------------+
|  2|LinkedIn|2019/11/25|       5| 36.9|       Urgent:Pickup|
| 10|LinkedIn|2019/11/09|       2|26.91|Urgent:Discount:P...|
| 11|Facebook|2019/11/26|       5|45.84|       Urgent:Pickup|
| 12|  Google|2019/11/05|       2|41.17|     Discount:Urgent|
| 17|   Apple|2019/11/09|       4|29.98|     Discount:Urgent|
+---+--------+----------+--------+-----+--------------------+
only showing top 5 rows
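
The `Product` column is missing from this output because, in a partitioned layout, its value is stored in the directory name (`Product=Headset`) rather than inside the Parquet files. The sketch below illustrates that naming convention in plain Python. `partition_values` is a hypothetical helper written for this note; it is not how Spark implements partition discovery.

```python
from pathlib import PurePosixPath

def partition_values(path, base_path):
    """Extract key=value directory segments that lie below base_path."""
    relative = PurePosixPath(path).relative_to(base_path)
    values = {}
    for segment in relative.parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

base = "dummy_hdfs/partitioned_parquet"

# Starting from the table root, the Product segment is visible.
from_root = partition_values(f"{base}/Product=Headset", base)

# Starting from the partition directory itself, nothing is left to parse.
from_partition = partition_values(f"{base}/Product=Headset", f"{base}/Product=Headset")

print(from_root)       # {'Product': 'Headset'}
print(from_partition)  # {}
```

Spark's `basePath` read option plays the role of `base_path` here: adding `.option("basePath", "dummy_hdfs/partitioned_parquet")` before `.parquet(...)` should keep `Product` as a column even when only one partition directory is read.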


### 04.04 Read Bucketed Data into Spark

In [10]:
# Spark does not persist the Hive catalog between multiple SparkSession instances.
# You can additionally use a Hive metastore if you want the catalog to persist
# across SparkSession instances.

#Read the bucketed table directly from disk
sales_bucketed = spark\
                    .read\
                    .parquet("spark-warehouse/product_bucket_table/*")

sales_bucketed.show(5)

#Convert into a temporary view
sales_bucketed.createOrReplaceTempView("product_bucket_table")

spark.sql("SELECT * FROM product_bucket_table WHERE Product='Webcam'").show(5)

25/07/06 19:11:12 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: spark-warehouse/product_bucket_table/*.
java.io.FileNotFoundException: File spark-warehouse/product_bucket_table/* does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:917)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1238)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:907)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:56)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:381)
	at org.apache.spark.sql.catalyst.analysis.ResolveDataSource.org$apache$spark$sql$catalyst$analysis$ResolveDataSource$$loadV1BatchSource(ResolveDataSource.scala:143)
	at org.apache.spa

+---+--------+--------+----------+--------+-----+---------------+
| ID|Customer| Product|      Date|Quantity| Rate|           Tags|
+---+--------+--------+----------+--------+-----+---------------+
|  1|   Apple|Keyboard|2019/11/21|       5|31.15|Discount:Urgent|
|  3|Facebook|Keyboard|2019/11/24|       5|49.89|           NULL|
|  4|  Google|  Webcam|2019/11/07|       4|34.21|       Discount|
|  5|LinkedIn|  Webcam|2019/11/21|       3|48.69|         Pickup|
|  7|LinkedIn|  Webcam|2019/11/20|       4|37.19|           NULL|
+---+--------+--------+----------+--------+-----+---------------+
only showing top 5 rows
+---+--------+-------+----------+--------+-----+---------------+
| ID|Customer|Product|      Date|Quantity| Rate|           Tags|
+---+--------+-------+----------+--------+-----+---------------+
|  4|  Google| Webcam|2019/11/07|       4|34.21|       Discount|
|  5|LinkedIn| Webcam|2019/11/21|       3|48.69|         Pickup|
|  7|LinkedIn| Webcam|2019/11/20|       4|37.19|         

### Notes

- A Spark program runs on a driver node, which works with the cluster to execute it
- The driver can be thought of as the "brain" of the Spark application
- Executors are the workers that do the actual data processing
- A cluster can contain multiple executor nodes that execute the program in parallel
- When data is loaded, it is represented as a DataFrame or a resilient distributed dataset (RDD); in the process it is split into partitions, which are assigned to the available executors
- When a transformation is executed, the work is pushed down to the executors, which run the code locally on their partitions and produce new partitions holding the result
- For narrow transformations such as `map` or `filter`, no data moves between executors, so they can run fully in parallel
- During a shuffle, triggered by operations such as `groupBy` or a reduce by key, executors must exchange data with one another
- Finally, the result partitions are sent back to the driver node, where they are merged
- From there, the results can be stored in an external database
- Spark automatically optimizes the query plan using its built-in optimizer, Catalyst

- Spark is powerful because it supports both traditional DataFrame operations and SQL, giving users flexible ways to work with data. Its true utility lies in its ability to parallelize computation across a cluster, making it much faster and more efficient for handling large-scale data. 
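
The flow described in these notes (partition the data, transform each partition locally, shuffle for a grouped aggregation, then collect) can be imitated in plain Python. This is a toy model under stated assumptions: the three "executors" are just lists, `zlib.crc32` stands in for Spark's partitioner, and the sample rows are the first five `Product` and `Quantity` values from the sales output shown earlier.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumption: one partition per pretend executor

rows = [("Keyboard", 5), ("Headset", 5), ("Keyboard", 5), ("Webcam", 4), ("Webcam", 3)]

# 1. Load: split the rows round-robin across partitions.
partitions = [rows[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

# 2. Narrow transformation: each partition is processed on its own,
#    so no data has to move between partitions.
doubled = [[(product, qty * 2) for product, qty in part] for part in partitions]

# 3. Shuffle: rows with the same key must land in the same partition
#    before they can be summed, so data moves between partitions here.
def target_partition(key):
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

shuffled = [defaultdict(int) for _ in range(NUM_PARTITIONS)]
for part in doubled:
    for product, qty in part:
        shuffled[target_partition(product)][product] += qty

# 4. Collect: merge the per-partition results back on the "driver".
collected = {}
for part in shuffled:
    collected.update(part)

print(sorted(collected.items()))
# [('Headset', 10), ('Keyboard', 20), ('Webcam', 14)]
```

Step 2 never looks outside its own partition, which is why it parallelizes freely. Step 3 is the only place where a row may have to travel to a different partition.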

The Resilient Distributed Dataset (RDD) is the core **data structure** in Apache Spark:
- Immutable, distributed collection of objects (like a list or array)
- Spread across multiple machines in a cluster
- Designed for fault-tolerant and parallel computation
- Parquet and Avro files are *read into* RDDs or DataFrames


### Understanding Parallelism in Spark Clusters

The number of **parallel operations** (i.e., tasks that can run simultaneously) in a Spark cluster is determined by:

`# of executor nodes × # of CPU cores per executor`

For example, if you have:
- 4 executor nodes
- Each node has 5 CPU cores

Then Spark can run up to **20 parallel tasks** at a time.


#### What happens when there are more partitions?

If the **number of partitions in your data exceeds this maximum parallelism**, Spark will:
- **Max out parallelism**, running tasks in parallel up to the available cores
- **Queue the remaining tasks**, which run as earlier tasks complete

This keeps the available cores busy and makes **efficient use of cluster resources**, but it cannot exceed the system's physical limits.
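
The arithmetic above can be written out directly. In this sketch the function names are hypothetical, and it assumes every task takes the same amount of time, so tasks run in equal "waves". Real tasks vary in duration, so treat the wave count as a rough guide.

```python
import math

def max_parallel_tasks(executors, cores_per_executor):
    """Upper bound on tasks that can run at the same moment."""
    return executors * cores_per_executor

def task_waves(num_partitions, executors, cores_per_executor):
    """Number of rounds needed if every task takes equally long."""
    slots = max_parallel_tasks(executors, cores_per_executor)
    return math.ceil(num_partitions / slots)

slots = max_parallel_tasks(4, 5)
print(slots)                  # 20
print(task_waves(20, 4, 5))   # 1
print(task_waves(50, 4, 5))   # 3
```

With 50 partitions, the first two waves use all 20 slots and the third uses only 10, which is one reason partition counts are often chosen as a multiple of the available cores.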



#### Contention with Other Jobs

Keep in mind:
- Spark is not the only thing running in most real-world environments.
- If other Spark jobs or applications are running at the same time, they will **compete for the same executor and core resources**.
- This can lead to **slower execution** or **resource starvation** if not managed properly.

"Spark only executes code when an action such as Reduce or Collect is performed."
- Spark delays execution until it knows exactly what you want as a final result
- Rather than running each operation as soon as it is declared, Spark first builds and optimizes a logical plan, then selects a physical execution strategy
- Analogy: adding items into a cart (transformations) but nothing is actually delivered until we click "Place Order"
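
The cart analogy can be made concrete with a toy pipeline. `LazyPipeline` below is a hypothetical class written for illustration, not Spark's implementation: `map` and `filter` only record steps (transformations), and nothing runs until `collect` (the action) is called.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: record steps now, run them later."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new pipeline, does no work yet.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps executed, in order.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2

pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has executed
result = pipeline.collect()        # the "Place Order" moment

print(calls_before_action, result)  # 0 [6, 8]
```

Because the steps are known before anything runs, a real engine has the chance to reorder or combine them first. This toy version simply replays them in the order they were recorded.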