# Reading a Specific Partition in Parquet with PySpark
This notebook demonstrates how to read a specific partition from a Parquet dataset using PySpark.

In [None]:
# Step 1: Start Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read Parquet Partition").getOrCreate()

## Sample DataFrame and Partitioned Write
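Let's create a simple DataFrame and write it partitioned by the `age` column. Spark writes one subdirectory per distinct value (`age=25`, `age=28`, `age=30`) under the output path.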

In [None]:
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 28)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, schema=columns)
df.write.partitionBy("age").parquet("/tmp/output/parquet_partitioned", mode="overwrite")

## Method 1: Use `.filter()` to Load Specific Partition
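This method relies on partition pruning: because the filter is on the partition column, Spark reads only the files under `age=28` and skips the other partition directories. The `age` column is preserved in the result, since Spark infers it from the directory names.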

In [None]:
df_filtered = spark.read.parquet("/tmp/output/parquet_partitioned").filter("age = 28")
df_filtered.show()
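
One way to check that pruning is taking place is to print the physical plan. The sketch below wraps the read, the filter, and a call to `explain()` in a helper; the name `read_with_partition_filter` is hypothetical, and it assumes an existing `SparkSession` is passed in. In the printed plan, the Parquet scan node usually lists the condition under `PartitionFilters`, though the exact text varies between Spark versions.

```python
def read_with_partition_filter(spark, path, condition):
    """Read a partitioned Parquet dataset, filter it, and print the physical plan.

    `spark` is an existing SparkSession; `condition` is a SQL expression on a
    partition column, for example "age = 28". (Hypothetical helper.)
    """
    df = spark.read.parquet(path).filter(condition)
    # explain() prints the physical plan to stdout and returns None.
    df.explain()
    return df
```

In this notebook it would be called as `read_with_partition_filter(spark, "/tmp/output/parquet_partitioned", "age = 28")`.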

## Method 2: Read Specific Partition Folder Directly
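This method points Spark at a single partition directory. It can be quicker to plan on datasets with many partitions, because Spark lists files only in that directory, but the result does NOT include the partition column: Spark treats `age=28` as the dataset root, so `age` is not inferred from the path.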

In [None]:
df_direct = spark.read.parquet("/tmp/output/parquet_partitioned/age=28")
df_direct.show()

### Notes:
- `.filter()` is preferred for dynamic queries and automatic partition pruning.
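- A direct path is useful for fast access when partition values are fixed and known.
- If you read from a direct partition folder and need the partition value, you must add it yourself (for example with `withColumn`), or set the `basePath` read option to the dataset root so that Spark still infers it from the directory name.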
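
As a sketch of the last note, a direct read can keep the partition column by setting Spark's `basePath` option to the dataset root. The helpers `partition_path` and `read_partition` are hypothetical names introduced for illustration. They assume an existing `SparkSession`, Hive-style `column=value` directory names, and partition values that need no escaping in the path.

```python
def partition_path(base_path, **partition_values):
    """Build a Hive-style partition path such as <base_path>/age=28.

    For nested partitions, pass the columns in the same order that was used
    in partitionBy(). Values are not escaped. (Hypothetical helper.)
    """
    parts = [f"{column}={value}" for column, value in partition_values.items()]
    return "/".join([base_path.rstrip("/")] + parts)


def read_partition(spark, base_path, **partition_values):
    """Read one partition directory while keeping the partition column(s)."""
    return (
        spark.read
        # basePath tells Spark where partition discovery starts, so `age`
        # is still inferred from the directory name and added as a column.
        .option("basePath", base_path)
        .parquet(partition_path(base_path, **partition_values))
    )
```

With the dataset written earlier, `read_partition(spark, "/tmp/output/parquet_partitioned", age=28)` should return the same rows as the direct read in Method 2, with the `age` column included.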