# Writing Multiple Partitions in Parquet with PySpark
This notebook demonstrates how to write a DataFrame to Parquet partitioned by multiple columns (here, `country` and `year`) using PySpark, and how to read the partitioned data back.

In [2]:
# Start a Spark session (reuses an existing one if the notebook already has it)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Write Multiple Partitions").getOrCreate()

## Create a Sample DataFrame
This DataFrame simulates sales data across countries and years.

In [3]:
data = [
    ("USA", 2022, "Alice", 100),
    ("USA", 2023, "Bob", 200),
    ("Canada", 2022, "Charlie", 150),
    ("Canada", 2023, "David", 175),
    ("USA", 2022, "Eve", 120)
]
columns = ["country", "year", "name", "sales"]
df = spark.createDataFrame(data, schema=columns)
df.show()

+-------+----+-------+-----+
|country|year|   name|sales|
+-------+----+-------+-----+
|    USA|2022|  Alice|  100|
|    USA|2023|    Bob|  200|
| Canada|2022|Charlie|  150|
| Canada|2023|  David|  175|
|    USA|2022|    Eve|  120|
+-------+----+-------+-----+



## Write the DataFrame with Multiple Partitions
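This writes the data partitioned by both the `country` and `year` columns. Spark creates one directory level per partition column, in the order passed to `partitionBy`, and encodes the partition values in the directory names rather than storing them inside the Parquet files. `mode="overwrite"` replaces any existing output at the target path.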

In [4]:
df.write \
    .partitionBy("country", "year") \
    .parquet("/tmp/output/multiple_partitions", mode="overwrite")

## Output Folder Structure
Each `country`/`year` combination gets its own nested directory, which holds the Parquet part files for the matching rows:
```
/tmp/output/multiple_partitions/
  ├── country=Canada/year=2022/
  ├── country=Canada/year=2023/
  ├── country=USA/year=2022/
  └── country=USA/year=2023/
```
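To check the layout on disk, the directory tree can be walked with the standard library. This is a minimal sketch that assumes the output was written to the local filesystem (as in a local-mode session); `list_partition_dirs` is a hypothetical helper name, not part of PySpark.

```python
import os


def list_partition_dirs(root):
    """Return the leaf key=value partition directories under root, relative to root."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue
        # A leaf partition directory is named key=value and has no
        # further key=value subdirectories beneath it.
        is_partition = "=" in os.path.basename(dirpath)
        has_partition_children = any("=" in d for d in dirnames)
        if is_partition and not has_partition_children:
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Prints nothing if the path does not exist yet.
for partition in list_partition_dirs("/tmp/output/multiple_partitions"):
    print(partition)
```

On a cluster writing to HDFS or object storage, the same check would need that storage system's listing tools instead of `os.walk`.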

## Option 1: Read All Partitions (Full Dataset)

In [5]:
df_all = spark.read.parquet("/tmp/output/multiple_partitions")
df_all.show()

+-------+-----+-------+----+
|   name|sales|country|year|
+-------+-----+-------+----+
|Charlie|  150| Canada|2022|
|  Alice|  100|    USA|2022|
|  David|  175| Canada|2023|
|    Eve|  120|    USA|2022|
|    Bob|  200|    USA|2023|
+-------+-----+-------+----+
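Note that `country` and `year` now appear as the last columns. On read, Spark recovers them from the directory names (partition discovery) and appends them to the columns stored in the files. The following is a simplified, pure-Python illustration of that idea, not Spark's actual implementation; `parse_partition_path` is a hypothetical helper, and the integer-only type inference is an assumption made to keep the sketch short.

```python
def parse_partition_path(relative_path):
    """Turn 'country=USA/year=2022' into {'country': 'USA', 'year': 2022}."""
    values = {}
    for segment in relative_path.strip("/").split("/"):
        if "=" not in segment:
            continue  # skip anything that is not a key=value directory
        key, _, raw = segment.partition("=")
        # Simplified inference: try an integer, otherwise keep the string.
        try:
            values[key] = int(raw)
        except ValueError:
            values[key] = raw
    return values


print(parse_partition_path("country=USA/year=2022"))
# {'country': 'USA', 'year': 2022}
```

Spark's own inference covers more types and can be turned off with the `spark.sql.sources.partitionColumnTypeInference.enabled` setting, in which case partition columns are read as strings.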



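## Option 2: Read a Specific Partition Using `.filter()` (Partition Pruning)
Because the filter references only partition columns, Spark can skip the non-matching directories and scan only `country=USA/year=2022/`.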

In [6]:
df_filtered = spark.read.parquet("/tmp/output/multiple_partitions").filter(
    "country = 'USA' AND year = 2022"
)
df_filtered.show()

+-----+-----+-------+----+
| name|sales|country|year|
+-----+-----+-------+----+
|Alice|  100|    USA|2022|
|  Eve|  120|    USA|2022|
+-----+-----+-------+----+
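Another way to read a single partition is to point the reader at the partition directory itself. When doing so, setting the `basePath` option to the dataset root tells Spark to keep `country` and `year` as columns; without it, they would be dropped because they no longer appear below the path being read. The sketch below wraps this in two hypothetical helpers (`partition_dir` and `read_partition` are illustrative names), and assumes partition values contain no characters that Spark would escape in directory names.

```python
def partition_dir(base_path, **partition_values):
    """Build a Hive-style partition path such as base/country=USA/year=2022."""
    # Keyword order must match the order used in partitionBy().
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([base_path.rstrip("/")] + parts)


def read_partition(spark, base_path, **partition_values):
    """Read one partition directory while keeping the partition columns."""
    path = partition_dir(base_path, **partition_values)
    # basePath marks where partition discovery should start.
    return spark.read.option("basePath", base_path).parquet(path)
```

With the session from this notebook, `read_partition(spark, "/tmp/output/multiple_partitions", country="USA", year=2022).show()` should list the same two rows as the filtered read above. The `.filter()` form is usually preferable because it does not hard-code the directory layout.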

