### Hadoop Distributed File System (HDFS)

*HDFS (Hadoop Distributed File System)* is the primary storage system used by Hadoop.

It is designed to store **very large files across multiple machines**, and to provide **high-throughput access to this data**.

Core Concepts
- **Distributed**: Files are split into blocks and stored across a cluster of machines (called **DataNodes**). Each DataNode holds a different subset of the blocks, and because of replication the same block lives on more than one node.
- **Reliable**: Each block is **replicated** (default = 3 copies) to guard against node failures.
- **Scalable**: Can handle petabytes of data by simply adding more machines.
- **Optimized for Write-Once, Read-Many**: Ideal for big data processing, not for frequent file updates.

Cluster
- A group of connected computers (nodes) that work together as if they were a single system
- Typically includes one master node and multiple worker nodes
- A single DataNode can store blocks from multiple files, and those files can come from completely different directories or data pipelines

NameNode (the brain)
- Stores **metadata**: file names, directory structure, block locations
- There is typically **one active NameNode** per cluster
- Runs on a dedicated machine (the master); only one machine acts as the NameNode at a time, with a Standby NameNode available as a backup
- Keeps everything in memory for fast access

DataNodes (the workers)
- Actually store the data blocks
- Periodically report back to the NameNode
- Read/write data on instruction from clients or MapReduce/Spark jobs

How Data is Stored

Let’s say you upload a 1GB file to HDFS with a 128MB block size:
- File is split into **8 blocks**
- Each block is replicated **3 times** across different DataNodes
- The NameNode tracks **which blocks are on which machines**
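
The bookkeeping described above can be sketched in plain Python. This is a toy model only: the `place_blocks` helper and the node names `dn1` to `dn4` are hypothetical, and the round-robin placement is an assumption made for illustration (real HDFS uses a rack-aware placement policy).

```python
import math

def place_blocks(file_size_mb, block_size_mb, datanodes, replication=3):
    """Return {block_id: [datanode, ...]} using naive round-robin placement.

    Toy stand-in for the NameNode's block map; not HDFS's real placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    block_map = {}
    for block_id in range(num_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        block_map[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map

nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
block_map = place_blocks(1024, 128, nodes)    # 1GB file, 128MB blocks

print("blocks:", len(block_map))
for block_id, replicas in block_map.items():
    print(block_id, replicas)
```

Each block ends up on three distinct nodes, and the dictionary plays the role of the NameNode's record of which blocks are on which machines.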

Why Block-Based Storage?
- Blocks simplify storage management
- Enable **parallel processing**: multiple mappers or Spark tasks can read different blocks at once
- Support **fault tolerance** via replication

Key Characteristics of HDFS

| Feature             | Description |
|---------------------|-------------|
| Block size          | Default is 128MB; larger sizes such as 256MB are often configured |
| Replication         | Default is 3 copies per block |
| Write model         | Write-once, read-many; appends allowed, no in-place edits |
| Fault tolerance     | Survives DataNode failures |
| Designed for        | Streaming large files, batch processing |

Example
If you store a 512MB CSV file:
- It becomes **4 blocks** (128MB each)
- Each block is **stored 3 times** across the cluster
- You now have **12 block replicas total**
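
The arithmetic in this example generalizes to any file size. The helper below is a small sketch; the function name `hdfs_footprint` is made up for illustration, and it assumes the last block only occupies as much space as the data it actually holds.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be partial
    block_replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication        # partial blocks are not padded
    return blocks, block_replicas, raw_storage_mb

print(hdfs_footprint(512))    # (4, 12, 1536)
print(hdfs_footprint(1024))   # (8, 24, 3072)
print(hdfs_footprint(200))    # (2, 6, 600)
```

The last call shows that a 200MB file still needs two blocks, even though the second one is only partly filled.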


In [1]:
import os
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/openjdk-17.jdk/Contents/Home"

# Import findspark to help Jupyter locate your Spark installation
import findspark

# Initialize the findspark library — sets up environment so Spark works in notebooks
findspark.init()

# Import SparkSession, the main entry point to Spark functionality
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder                              # Start Spark session builder
        .appName("MyApp")                             # Set application name
        .master("local[2]")                            # Run Spark locally using 2 CPU threads
        .config("spark.sql.shuffle.partitions", 2)    # Set number of shuffle partitions (e.g. for groupBy, joins)
        .config("spark.default.parallelism", 2)       # Set number of parallel tasks for RDD operations
        .config("spark.driver.extraJavaOptions", "--add-opens java.base/javax.security.auth=ALL-UNNAMED")
        .config("spark.executor.extraJavaOptions", "--add-opens java.base/javax.security.auth=ALL-UNNAMED")
        .getOrCreate()                                # Create or return the Spark session
)

# Print the version of Spark you're using (e.g., "4.0.0")
print("✅ Spark is ready. Version:", spark.version)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/06 19:08:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/06 19:08:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


✅ Spark is ready. Version: 4.0.0


#### Partitioning and Shuffling

- Partitioning is how Spark **splits data into smaller chunks** called partitions, which exist at the application (processing) level
- A partition is a logical chunk of data, typically held in memory
- Each partition is processed **independently and in parallel** by a task running on one core of an executor.
- In Hadoop, similar logic applies at the storage level with **HDFS blocks**.

What Triggers Partitioning?
- When reading data: for files in HDFS, Spark derives the partitions from the input splits, which typically follow the HDFS block boundaries
- When running `groupBy`, `join`, and similar operations, where Spark must 'shuffle' the data so that all rows with the same key end up in the same partition

Why Partitions Matter
- Enables **parallel processing** across cores/machines.
- Spark DataFrames and Resilient Distributed Datasets (RDDs) are automatically partitioned.
- More partitions = more parallelism (up to a point).
- Too many partitions = overhead; too few = bottlenecks.

Why Does Spark Shuffle?
- Operations like `groupBy`, `join`, and `distinct` require **reorganizing data**.
- Spark must move all rows with the same key (e.g. `"SF"`) into the same partition.
- This movement is called a **shuffle**.

Example: groupBy("city")
- Before shuffle:
    - "SF" might be split across 2 different partitions.
    - Spark **cannot compute total count of "SF"** unless they're in the same partition.

- After shuffle:
    - All "SF" rows go to the same partition.
    - Then Spark can safely compute: `"SF" → count 2`
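
The before/after picture can be simulated without Spark. The sketch below is plain Python, not Spark's implementation: `partition_for` and `shuffle_by_key` are hypothetical helpers, and `zlib.crc32` is only a deterministic stand-in for the hash function Spark uses internally.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    # Deterministic stand-in hash; Spark uses its own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle_by_key(partitions, num_partitions):
    """Move every value to the partition chosen by the hash of its key."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for city in part:
            shuffled[partition_for(city, num_partitions)].append(city)
    return shuffled

# Before the shuffle, "SF" is split across two partitions.
before = [["SF", "NY"], ["SF", "LA"]]
after = shuffle_by_key(before, num_partitions=2)

# After the shuffle, each partition can count its own keys independently.
per_partition_counts = [Counter(part) for part in after]
for index, counts in enumerate(per_partition_counts):
    print("partition", index, dict(counts))
```

Whichever partition `"SF"` hashes to, both `"SF"` rows land there, so that partition alone can report the full count of 2.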

### 03.01 Reading Files into Spark

Data can be read into Apache Spark DataFrames from a variety of data sources.

Examples:
- A flat file on a local disk
- A file from HDFS
- A Kafka topic


In this example, we will read a CSV file in an HDFS folder into a Spark DataFrame.

In [2]:
# Read the raw CSV file into a Spark DataFrame
#    Use inferSchema to infer the schema automatically from the CSV file

# Use the SparkSession to read a CSV file and create a DataFrame
raw_sales_data = (
    spark.read                              # Access the DataFrameReader object from the Spark session
        .option("inferSchema", "true")      # Automatically infer data types (e.g., int, float) from the data
        .option("header", "true")           # Use the first row of the CSV as column headers
        .csv("datasets/sales_orders.csv")   # Path to the CSV file to read into a DataFrame
)


# Print the schema for verification
raw_sales_data.printSchema()

# Print the first 5 records for verification
raw_sales_data.show(5)

print("Number of partitions:", raw_sales_data.rdd.getNumPartitions())  # Print the number of partitions in the DataFrame

root
 |-- ID: integer (nullable = true)
 |-- Customer: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Rate: double (nullable = true)
 |-- Tags: string (nullable = true)

+---+--------+--------+----------+--------+-----+---------------+
| ID|Customer| Product|      Date|Quantity| Rate|           Tags|
+---+--------+--------+----------+--------+-----+---------------+
|  1|   Apple|Keyboard|2019/11/21|       5|31.15|Discount:Urgent|
|  2|LinkedIn| Headset|2019/11/25|       5| 36.9|  Urgent:Pickup|
|  3|Facebook|Keyboard|2019/11/24|       5|49.89|           NULL|
|  4|  Google|  Webcam|2019/11/07|       4|34.21|       Discount|
|  5|LinkedIn|  Webcam|2019/11/21|       3|48.69|         Pickup|
+---+--------+--------+----------+--------+-----+---------------+
only showing top 5 rows
Number of partitions: 1


When reading a CSV file using `spark.read.csv()`, Spark automatically partitions the data under the hood. Large inputs are split into multiple partitions based on file size and configuration (e.g., the maximum partition size and the number of available cores), while a small file such as this one fits in a single partition, which is why the output above reports `Number of partitions: 1`. Partitions are processed in parallel to improve performance.

While you don’t manually define partitions in basic `.read()` calls, you can control them using methods like `.repartition()` and `.coalesce()`, or read-time settings like `spark.sql.files.maxPartitionBytes`. The data isn’t saved anywhere permanently unless you explicitly write it, but Spark manages the in-memory partitions to handle large-scale data efficiently.
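
As a rough rule of thumb, the number of read partitions for one splittable file can be estimated from its size. The helper below is a simplified sketch, not Spark's actual algorithm: `estimate_read_partitions` is a hypothetical name, and it only considers the 128MB default of `spark.sql.files.maxPartitionBytes`. Spark also weighs file-open cost and the default parallelism, so real counts may differ, especially for mid-sized files.

```python
import math

# Default value of spark.sql.files.maxPartitionBytes (128MB).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024

def estimate_read_partitions(file_size_bytes,
                             max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Simplified estimate: one partition per max_partition_bytes, at least 1."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

print(estimate_read_partitions(50 * 1024))      # small CSV -> 1
print(estimate_read_partitions(1024 ** 3))      # 1GB file  -> 8
print(estimate_read_partitions(1024 ** 3, 64 * 1024 * 1024))  # smaller limit -> 16
```

To check the real value for a DataFrame, `df.rdd.getNumPartitions()` remains the source of truth, as used in the cell above.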

### 03.02 Writing to HDFS

Write the `raw_sales_data` DataFrame into HDFS as a Parquet file. Use Parquet as the format since it enables splitting and filtering. Use GZIP as the compression codec.

On completion, verify through the filesystem that the files were written correctly.

In [3]:
# Write the raw_sales_data DataFrame to disk as a compressed Parquet file
(
    raw_sales_data.write                           # Begin writing process
        .option("compression", "gzip")             # Compress using GZIP to save space
        .mode("overwrite")                         # Overwrite output if it already exists
        .parquet("dummy_hdfs/raw_parquet")         # Output path (local or mock HDFS path)
)

This code writes the `raw_sales_data` DataFrame to disk in **Parquet format**, which is a highly efficient, column-based file format optimized for analytics. It uses **GZIP compression**, which produces small files at the cost of somewhat more CPU time than lighter codecs such as Snappy. Parquet works especially well with big data tools like Spark because it supports **splitting**, **filtering**, and **predicate pushdown**, which help Spark avoid reading unnecessary data. The `.mode("overwrite")` setting ensures that the output directory is replaced if it already exists, and the resulting files are saved to the specified path (`dummy_hdfs/raw_parquet`).
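
The filesystem check mentioned earlier can be done from Python. The sketch below assumes the output was written to a local directory (as in this notebook's mock HDFS path) and that Spark left its usual `_SUCCESS` marker; `summarize_output` is a hypothetical helper, not part of Spark.

```python
from pathlib import Path

def summarize_output(path):
    """Return (parquet_files, has_success_marker) for a Spark output directory."""
    root = Path(path)
    # Collect Parquet data files, including those inside subdirectories.
    parquet_files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*.parquet")
    )
    has_success_marker = (root / "_SUCCESS").exists()
    return parquet_files, has_success_marker

output_dir = "dummy_hdfs/raw_parquet"   # path used by the write above
if Path(output_dir).is_dir():
    files, ok = summarize_output(output_dir)
    print("data files:", files)
    print("_SUCCESS marker present:", ok)
else:
    print("Output directory not found:", output_dir)
```

With GZIP compression, the data file names are expected to end in `.gz.parquet`.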

### 03.03 Write to HDFS with partitioning

Write a partitioned Parquet file in HDFS. Partitioning is done by Product, which creates one directory per unique product in the raw CSV.

In [4]:
(
    raw_sales_data.write                        # Begin the write operation
        .option("compression", "gzip")          # Compress files using GZIP
        .partitionBy("Product")                 # Physically separate data into folders based on Product column
        .mode("overwrite")                      # Overwrite the folder if it already exists
        .parquet("dummy_hdfs/partitioned_parquet")  # Write out as Parquet files to this path
)


This code writes the `raw_sales_data` DataFrame to disk as a **partitioned Parquet file**, using **GZIP compression** for space efficiency. The `.partitionBy("Product")` command tells Spark to **organize the output files into subdirectories based on the unique values in the `Product` column**. Each subdirectory contains only the rows for that specific product, which can drastically speed up future queries and filtering operations. As before, `.mode("overwrite")` ensures that the output is replaced if it already exists, and the files are saved to the directory `dummy_hdfs/partitioned_parquet`.

Partitioned Parquet files are especially useful in large datasets, as they allow Spark to **skip entire folders** when filtering on partitioned columns — a performance feature known as **partition pruning**.
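
The directory layout and the pruning idea can be illustrated with a few lines of plain Python. This is a conceptual sketch rather than Spark's planner: `partition_dir` and `prune` are hypothetical helpers, and the product names are the ones visible in the sample output earlier in this notebook.

```python
def partition_dir(base, column, value):
    # Hive-style partition directories are named column=value.
    return f"{base}/{column}={value}"

def prune(dirs, column, wanted):
    """Keep only the directories whose partition value matches the filter."""
    suffix = f"{column}={wanted}"
    return [d for d in dirs if d.rsplit("/", 1)[-1] == suffix]

products = ["Keyboard", "Headset", "Webcam", "Mouse"]
dirs = [
    partition_dir("dummy_hdfs/partitioned_parquet", "Product", p)
    for p in products
]

# A filter such as Product = 'Mouse' only needs one of the four folders.
print(prune(dirs, "Product", "Mouse"))
# ['dummy_hdfs/partitioned_parquet/Product=Mouse']
```

The other three folders are never opened, which is where the speed-up comes from.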

### 03.04 Writing to Hive with Bucketing

Create a Bucketed Hive table for orders. Bucketing will be done by Product. It will create 3 buckets based on the hash generated by Product. Hive tables can be queried through SQL.

In [6]:

# Write the DataFrame as a bucketed Parquet table
(
    raw_sales_data.write                         # Begin write operation
        .format("parquet")                       # Write the output in Parquet format
        .bucketBy(3, "Product")                  # Hash "Product" into 3 buckets for optimized joins/filtering
        .mode("overwrite")                       # Replace the table if it already exists, so the cell can be re-run
        .saveAsTable("product_bucket_table")     # Save as a Hive-compatible table in spark-warehouse/
)

# The table's data files are persisted to disk in the spark-warehouse/ folder

# Show registered tables (including our bucketed table)
spark.sql("SHOW TABLES").show(5)

# Read the bucketed data
# Query the bucketed table; the filter on the bucketing column should benefit from bucket pruning
(
    spark.sql("""
        SELECT * FROM product_bucket_table
        WHERE Product = 'Mouse'
    """).show(5)
)


+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  default|product_bucket_table|      false|
+---------+--------------------+-----------+

+---+--------+-------+----------+--------+-----+--------------------+
| ID|Customer|Product|      Date|Quantity| Rate|                Tags|
+---+--------+-------+----------+--------+-----+--------------------+
|  6|  Google|  Mouse|2019/11/23|       5|40.58|                NULL|
|  8|  Google|  Mouse|2019/11/13|       1|46.79|Urgent:Discount:P...|
| 14|   Apple|  Mouse|2019/11/09|       4|40.27|            Discount|
| 15|   Apple|  Mouse|2019/11/25|       5|38.89|                NULL|
| 20|LinkedIn|  Mouse|2019/11/25|       4|36.77|       Urgent:Pickup|
+---+--------+-------+----------+--------+-----+--------------------+
only showing top 5 rows


This code writes the `raw_sales_data` DataFrame as a **Hive-style bucketed Parquet table** using the `Product` column. Unlike partitioning (which creates one folder per distinct value), **bucketing assigns each row to one of a fixed number of buckets based on the hash of a column**. This lets Spark optimize certain operations like joins and aggregations, because rows with the same key are already co-located in the same bucket. Rows within a bucket are only pre-sorted if you also call `.sortBy(...)`.

The table is saved in the default `spark-warehouse/` directory and registered in Spark's catalog, making it queryable using standard SQL. Bucketed tables must be written using `.saveAsTable(...)`, which means you're creating a **managed table in Spark/Hive**, not just files in a directory.
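
To make the contrast with partitioning concrete, here is a plain-Python sketch of hash bucketing. It is illustrative only: `bucket_for` is a hypothetical helper and `zlib.crc32` is a stand-in, since Spark uses a Murmur3-based hash and its real bucket ids will differ. The rows reuse IDs and products from the sample output shown earlier.

```python
import zlib

def bucket_for(value, num_buckets=3):
    # Stand-in hash; Spark's own hash yields different bucket ids.
    return zlib.crc32(value.encode("utf-8")) % num_buckets

rows = [
    (1, "Keyboard"), (2, "Headset"), (3, "Keyboard"),
    (4, "Webcam"), (5, "Webcam"), (6, "Mouse"),
]

buckets = {b: [] for b in range(3)}
for order_id, product in rows:
    buckets[bucket_for(product)].append((order_id, product))

for bucket_id, contents in buckets.items():
    print(bucket_id, contents)
```

All rows for a given product always land in the same bucket, but with four products and only three buckets, at least two products must share one. Partitioning, by comparison, would create four separate folders.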

**Apache Hive** is a data warehouse system built on top of Hadoop that allows users to run **SQL queries over large datasets stored in HDFS**. Hive organizes data into tables, supports partitioning and bucketing, and stores metadata (like schemas, table names, and partitions) in a central **Hive Metastore**.
- Hive contains a Metastore: a central database that stores table schemas and locations
- The Metastore does NOT contain the data itself; it stores the tables' metadata, how they're partitioned, and the file paths to the actual data in HDFS
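
The split between metadata and data can be mimicked with a dictionary and a file. This is a toy sketch, not how Hive is implemented: `register_table` and `scan_table` are hypothetical helpers, a temporary local CSV stands in for a data file in HDFS, and the two sample rows are taken from the output shown earlier.

```python
import csv
import os
import tempfile

def register_table(metastore, name, location, columns):
    """Record where a table's data lives and how to interpret it."""
    metastore[name] = {"location": location, "columns": columns}

def scan_table(metastore, name):
    """Look up the table in the metastore, then read rows from its data file."""
    entry = metastore[name]
    with open(entry["location"], newline="") as f:
        return [dict(zip(entry["columns"], row)) for row in csv.reader(f)]

# A temporary local file stands in for a data file stored in HDFS.
data_dir = tempfile.mkdtemp()
data_path = os.path.join(data_dir, "orders.csv")
with open(data_path, "w", newline="") as f:
    csv.writer(f).writerows([["1", "Apple", "Keyboard"], ["2", "LinkedIn", "Headset"]])

metastore = {}   # holds schema and location only, never the rows themselves
register_table(metastore, "orders", data_path, ["ID", "Customer", "Product"])

print(scan_table(metastore, "orders")[0])
# {'ID': '1', 'Customer': 'Apple', 'Product': 'Keyboard'}
```

Deleting the `metastore` entry would leave the file untouched, which mirrors how dropping metadata does not by itself change the bytes stored in HDFS.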

Even without Hive installed, **Spark can simulate Hive behavior** using its internal catalog. When you use `.saveAsTable("table_name")` in Spark:

- The DataFrame is saved as a **persistent table**, typically in the `spark-warehouse/` directory.
- The table is **registered in Spark’s internal catalog**, making it queryable via SQL (`spark.sql(...)`).
- The storage layout and format (e.g. Parquet, bucketed, partitioned) are **Hive-compatible**, meaning you could later point a real Hive Metastore to these files.

This gives Spark users the power of **Hive-style SQL tables** and **schema-on-read**, without needing to install Hive itself.

You can use `.saveAsTable()` alongside `.partitionBy()` or `.bucketBy()` to create **optimized table layouts** that benefit from partition pruning and bucketed joins.

**Can You Query HDFS Directly Using SQL?**
- No, not without a SQL engine.
- HDFS only stores data — it has **no query interface** or understanding of schemas.

**So How *Can* You Query HDFS with SQL?**

You need a **SQL engine** on top of HDFS that understands:
- How to read the files (Parquet, ORC, etc.)
- What the schema is
- How to interpret SQL

Common options:

| Tool              | Purpose                            |
|-------------------|------------------------------------|
| **Hive**          | SQL engine + metastore over HDFS   |
| **Spark SQL**     | SQL engine with in-memory schema   |
| **Trino / Presto**| Distributed SQL over Hive/HDFS     |
| **Impala / Drill**| Fast SQL on HDFS (batch/interactive) |

**Analogy**

| Layer             | Analogy                            |
|-------------------|------------------------------------|
| **HDFS**          | A hard drive full of .csv/.parquet |
| **Hive Metastore**| The table of contents + schema     |
| **SQL Engine**    | The tool (e.g. Spark) that reads data & runs SQL |

In [7]:
spark.catalog.listDatabases() # List all databases in the Spark catalog

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/bing/Downloads/Spark/Ex_Files_Big_Data_Analytics_Hadoop_Apache_Spark/Exercise%2520Files/spark-warehouse')]

25/07/05 02:47:04 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 924982 ms exceeds timeout 120000 ms
25/07/05 02:47:04 WARN SparkContext: Killing executors is not supported by current scheduler.
25/07/05 02:47:09 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:53)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:342)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:101)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:85)
	at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:81)
	at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:669)
	at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1296)
	at o