
## Handling Complex Data Types: Arrays, Maps, Structs

Spark DataFrames can represent the semi-structured and nested data common in formats such as JSON and Parquet, using array, map, and struct column types.

### Creating Complex Types

*   `array()`: Creates an array column.
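*   `create_map()`: Creates a map column from alternating key and value columns (`map_from_entries()` and `map_from_arrays()` build maps from existing arrays). The `map()` spelling is only available in Spark SQL expressions.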
*   `struct()`: Creates a struct column (similar to a row or object within a column).

**Example (Python):**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, map_from_entries, struct, col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType

spark = SparkSession.builder.appName("ComplexTypes").getOrCreate()

# Manual schema definition for clarity
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Scores", ArrayType(IntegerType()), True),
    StructField("AddressMap", MapType(StringType(), StringType()), True), # Populated from Python dicts
    StructField("Vehicle", StructType([
        StructField("Color", StringType(), True),
        StructField("Type", StringType(), True)
    ]), True)
])

data = [
    ("Alice", [10, 20], {"city": "NY", "zip": "10001"}, ("Red", "Car")),
    ("Bob", [30], {"city": "LD"}, ("Blue", "Bike"))
]

df = spark.createDataFrame(data, schema=schema)
print("Original DataFrame with complex types:")
df.show(truncate=False)
df.printSchema()

# Another way to create complex types using functions in select
df_created = spark.createDataFrame([
    (1, "Alice"),
    (2, "Bob")
]).select(
    col("_1").alias("ID"),
    col("_2").alias("Name"),
    array(lit("Apple"), lit("Banana")).alias("Fruits"), # ArrayType
    map_from_entries(array(struct(lit("key1"), lit("value1")), struct(lit("key2"), lit("value2")))).alias("Properties"), # MapType
    struct(lit("Main St").alias("Street"), lit("Anytown").alias("City")).alias("Location") # StructType
)
print("\nDataFrame created with functions:")
df_created.show(truncate=False)
df_created.printSchema()

# spark.stop() # Uncomment to stop SparkSession

Original DataFrame with complex types:
+-----+--------+--------------------------+------------+
|Name |Scores  |AddressMap                |Vehicle     |
+-----+--------+--------------------------+------------+
|Alice|[10, 20]|{zip -> 10001, city -> NY}|{Red, Car}  |
|Bob  |[30]    |{city -> LD}              |{Blue, Bike}|
+-----+--------+--------------------------+------------+

root
 |-- Name: string (nullable = true)
 |-- Scores: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- AddressMap: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- Vehicle: struct (nullable = true)
 |    |-- Color: string (nullable = true)
 |    |-- Type: string (nullable = true)


DataFrame created with functions:
+---+-----+---------------+--------------------------------+------------------+
|ID |Name |Fruits         |Properties                      |Location          |
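+---+-----+---------------+--------------------------------+------------------+
|1  |Alice|[Apple, Banana]|{key1 -> value1, key2 -> value2}|{Main St, Anytown}|
|2  |Bob  |[Apple, Banana]|{key1 -> value1, key2 -> value2}|{Main St, Anytown}|
+---+-----+---------------+--------------------------------+------------------+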

### Exploding Arrays & Maps into Rows: `explode()` and `explode_outer()`

`explode()` turns each element of an array, or each entry of a map, into its own row. Exploding a map yields two columns per row, named `key` and `value` by default.

*   **`explode()`**: An array or map with N elements produces N rows, each repeating the values of the row's other columns.
    *   **Behavior with null/empty:** Rows whose array/map column is `null` or empty produce no output rows, so they are dropped.
*   **`explode_outer()` (Spark 2.3+)**: Like `explode()`, but keeps rows whose array/map column is `null` or empty. Each such row appears once with `null` in the exploded column, much as a `LEFT OUTER JOIN` keeps unmatched left rows.
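
The difference between the two functions can be modelled in plain Python, without Spark. The sketch below is only an illustration of the row-level behavior for array columns, not Spark's implementation; the function name `explode_rows` and its parameters are hypothetical.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def explode_rows(rows: List[Row], column: str, out: str, outer: bool = False) -> Iterator[Row]:
    """Plain-Python model of explode() / explode_outer() on an array column."""
    for row in rows:
        values = row[column]
        if not values:  # None or an empty list
            if outer:
                # explode_outer keeps the row and puts None in the new column
                yield {**row, out: None}
            # explode emits nothing here, so the row disappears
            continue
        for value in values:
            # all other columns are copied onto every generated row
            yield {**row, out: value}


rows = [
    {"Name": "Alice", "Hobbies": ["reading", "hiking"]},
    {"Name": "Bob", "Hobbies": None},    # null array
    {"Name": "Charlie", "Hobbies": []},  # empty array
]

inner = list(explode_rows(rows, "Hobbies", "Hobby"))
outer = list(explode_rows(rows, "Hobbies", "Hobby", outer=True))

print([(r["Name"], r["Hobby"]) for r in inner])
# [('Alice', 'reading'), ('Alice', 'hiking')]
print([(r["Name"], r["Hobby"]) for r in outer])
# [('Alice', 'reading'), ('Alice', 'hiking'), ('Bob', None), ('Charlie', None)]
```

Bob and Charlie only survive in the `outer=True` result, which is the behavior to expect from `explode_outer()`.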

**Example (Python):**

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer, col

spark = SparkSession.builder.appName("ExplodeArrays").getOrCreate()

data = [
    ("Alice", ["reading", "hiking", "cooking"]),
    ("Bob", ["coding", "gaming"]),
    ("Charlie", []) # Empty array
]
columns = ["Name", "Hobbies"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show(truncate=False)
df.printSchema()

# Explode the 'Hobbies' array column using explode()
print("\nDataFrame after exploding 'Hobbies' (drops empty/null):")
df.withColumn("Hobby", explode(col("Hobbies"))).show()

# What happens if a column is null or empty with explode()?
data_with_null_array = [
    ("Alice", ["reading", "hiking"]),
    ("Bob", None), # Null array
    ("Charlie", []) # Empty array
]
df_null_array = spark.createDataFrame(data_with_null_array, columns)
print("\nDataFrame with null/empty arrays:")
df_null_array.show(truncate=False)

# When exploding a null or empty array, the row for that record is dropped by default.
print("\nDataFrame after exploding with null/empty arrays (using explode - rows dropped):")
df_null_array.withColumn("Hobby", explode(col("Hobbies"))).show()

# To keep rows with null/empty arrays, use explode_outer (Spark 2.3+)
print("\nDataFrame after exploding with explode_outer (keeps rows):")
df_null_array.withColumn("Hobby", explode_outer(col("Hobbies"))).show()

# spark.stop() # Uncomment to stop SparkSession

Original DataFrame:
+-------+--------------------------+
|Name   |Hobbies                   |
+-------+--------------------------+
|Alice  |[reading, hiking, cooking]|
|Bob    |[coding, gaming]          |
|Charlie|[]                        |
+-------+--------------------------+

root
 |-- Name: string (nullable = true)
 |-- Hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)


DataFrame after exploding 'Hobbies' (drops empty/null):
+-----+--------------------+-------+
| Name|             Hobbies|  Hobby|
+-----+--------------------+-------+
|Alice|[reading, hiking,...|reading|
|Alice|[reading, hiking,...| hiking|
|Alice|[reading, hiking,...|cooking|
|  Bob|    [coding, gaming]| coding|
|  Bob|    [coding, gaming]| gaming|
+-----+--------------------+-------+


DataFrame with null/empty arrays:
+-------+-----------------+
|Name   |Hobbies          |
+-------+-----------------+
|Alice  |[reading, hiking]|
|Bob    |NULL             |
|Charlie|[]               |
+-------+-----------------+

### Flattening Nested Schemas

Flattening transforms nested structures (StructType, or ArrayType of StructType) into a flatter structure with top-level columns, making data easier to query.

*   **For `StructType`:** Access elements using dot notation (`parent.child`) and select them as new columns.
*   **For `ArrayType` of `StructType`:** First, use `explode()` to create individual rows for each struct in the array, then access elements using dot notation.

**Example (Python):**

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("FlatteningSchemas").getOrCreate()

# --- Flattening StructType ---
data_struct = [
    ("Alice", {"street": "Main St", "city": "NY"}),
    ("Bob", {"street": "Elm St", "city": "LD"})
]
schema_struct = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True)
])
df_struct = spark.createDataFrame(data_struct, schema_struct)
print("Original DataFrame with StructType:")
df_struct.show(truncate=False)
df_struct.printSchema()

print("\nFlattening StructType:")
df_flattened_struct = df_struct.select(
    col("Name"),
    col("Address.street").alias("Street"),
    col("Address.city").alias("City")
)
df_flattened_struct.show(truncate=False)
df_flattened_struct.printSchema()

# --- Flattening ArrayType of StructType ---
data_array_struct = [
    ("ProductA", [{"feature_name": "Color", "value": "Red"}, {"feature_name": "Size", "value": "M"}]),
    ("ProductB", [{"feature_name": "Weight", "value": "1kg"}])
]
schema_array_struct = StructType([
    StructField("Product", StringType(), True),
    StructField("Features", ArrayType(StructType([
        StructField("feature_name", StringType(), True),
        StructField("value", StringType(), True)
    ])), True)
])
df_array_struct = spark.createDataFrame(data_array_struct, schema_array_struct)
print("\nOriginal DataFrame with ArrayType of StructType:")
df_array_struct.show(truncate=False)
df_array_struct.printSchema()

print("\nFlattening ArrayType of StructType (requires explode first):")
df_flattened_array_struct = df_array_struct.withColumn("exploded_features", explode(col("Features"))) \
    .select(
        col("Product"),
        col("exploded_features.feature_name"),
        col("exploded_features.value")
    )
df_flattened_array_struct.show(truncate=False)
df_flattened_array_struct.printSchema()

# spark.stop() # Uncomment to stop SparkSession

Original DataFrame with StructType:
+-----+-------------+
|Name |Address      |
+-----+-------------+
|Alice|{Main St, NY}|
|Bob  |{Elm St, LD} |
+-----+-------------+

root
 |-- Name: string (nullable = true)
 |-- Address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)


Flattening StructType:
+-----+-------+----+
|Name |Street |City|
+-----+-------+----+
|Alice|Main St|NY  |
|Bob  |Elm St |LD  |
+-----+-------+----+

root
 |-- Name: string (nullable = true)
 |-- Street: string (nullable = true)
 |-- City: string (nullable = true)


Original DataFrame with ArrayType of StructType:
+--------+-------------------------+
|Product |Features                 |
+--------+-------------------------+
|ProductA|[{Color, Red}, {Size, M}]|
|ProductB|[{Weight, 1kg}]          |
+--------+-------------------------+

root
 |-- Product: string (nullable = true)
 |-- Features: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- feature_name: string (nullable = true)
 |    |    |-- value: string (nullable = true)
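
For wider or more deeply nested structs, the dot paths used when flattening can be generated from the schema instead of typed by hand. The sketch below is one way to do that, assuming the schema is available as the nested dict returned by `df.schema.jsonValue()`. The helper name `leaf_paths` and the hand-written `schema_json` are illustrative and not part of PySpark.

```python
from typing import Any, Dict, List


def leaf_paths(struct: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return dotted paths to every non-struct field of a struct schema.

    `struct` is assumed to have the shape produced by df.schema.jsonValue().
    Arrays and maps are treated as leaves; an array of structs still has to
    be exploded before its fields can be selected.
    """
    paths: List[str] = []
    for field in struct["fields"]:
        path = prefix + field["name"]
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            # recurse into nested structs, extending the dotted prefix
            paths.extend(leaf_paths(field_type, prefix=path + "."))
        else:
            paths.append(path)
    return paths


# Hand-written stand-in for df.schema.jsonValue() (an assumption for this sketch)
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "Name", "type": "string", "nullable": True, "metadata": {}},
        {"name": "Address", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "street", "type": "string", "nullable": True, "metadata": {}},
                {"name": "city", "type": "string", "nullable": True, "metadata": {}},
            ],
        }},
        {"name": "Features", "nullable": True, "metadata": {}, "type": {
            "type": "array",
            "containsNull": True,
            "elementType": {"type": "struct", "fields": [
                {"name": "feature_name", "type": "string", "nullable": True, "metadata": {}},
            ]},
        }},
    ],
}

paths = leaf_paths(schema_json)
print(paths)  # ['Name', 'Address.street', 'Address.city', 'Features']
```

With a live DataFrame, the resulting paths could be passed to `select`, for example `df.select([col(p).alias(p.replace(".", "_")) for p in leaf_paths(df.schema.jsonValue())])`. Field names that themselves contain dots would need backtick quoting, which this sketch does not handle.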

### Accessing Fields

*   **`StructType`:** Use dot notation: `col("parent_struct.child_field")`.
*   **`MapType`:**
    *   `col("map_col")["key"]`: Direct access by key.
    *   `element_at(col("map_col"), "key")`: Function form of the same lookup. With ANSI mode off, both forms return `null` when the key is missing.
*   **`ArrayType`:**
    *   `col("array_col")[index]`: Direct element access, 0-based. With ANSI mode off, an out-of-bounds index returns `null` rather than raising an error, so a wrong index can go unnoticed.
    *   `explode()`: For iterating over all elements as individual rows.
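
Bracket access and `element_at()` also index arrays differently: `col("array_col")[i]` is 0-based, while `element_at()` is 1-based and accepts negative indices that count from the end. The plain-Python model below mimics that behavior under the assumption that ANSI mode is off. With `spark.sql.ansi.enabled` set to true, out-of-range access can raise an error instead. The function names `get_item` and `element_at_model` are hypothetical.

```python
from typing import Any, Optional, Sequence


def get_item(values: Optional[Sequence[Any]], index: int) -> Any:
    """Model of col("array_col")[index]: 0-based, None when out of range."""
    if values is None or index < 0 or index >= len(values):
        return None
    return values[index]


def element_at_model(values: Optional[Sequence[Any]], index: int) -> Any:
    """Model of element_at(array_col, index): 1-based, negatives count from the end."""
    if index == 0:
        # Spark rejects index 0 for element_at on arrays
        raise ValueError("element_at indices start at 1")
    if values is None or abs(index) > len(values):
        return None
    return values[index - 1] if index > 0 else values[index]


grades = ["A", "B"]

print(get_item(grades, 0))           # A
print(get_item(grades, 5))           # None
print(element_at_model(grades, 1))   # A
print(element_at_model(grades, -1))  # B
print(element_at_model(grades, 3))   # None
```

The first element is therefore `col("Grades")[0]` or `element_at(col("Grades"), 1)`, and mixing up the two conventions silently shifts every lookup by one.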

**Example (Python):**

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, element_at
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, MapType

spark = SparkSession.builder.appName("AccessingNestedFields").getOrCreate()

data = [
    ("Alice", {"street": "Main St", "city": "NY"}, ["A", "B"], {"email": "a@ex.com", "phone": "111"}),
    ("Bob", {"street": "Elm St", "city": "LD"}, ["C"], {"email": "b@ex.com"})
]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True),
    StructField("Grades", ArrayType(StringType()), True),
    StructField("Contact", MapType(StringType(), StringType()), True)
])
df = spark.createDataFrame(data, schema=schema)
print("Original DataFrame:")
df.show(truncate=False)
df.printSchema()

# Accessing StructType fields
print("\nAccessing StructType fields:")
df.select(
    col("Name"),
    col("Address.street").alias("Street"),
    col("Address.city").alias("City")
).show()

# Accessing MapType fields
print("\nAccessing MapType fields:")
df.select(
    col("Name"),
    col("Contact.email").alias("Email"), # Dot notation also resolves a map key
    element_at(col("Contact"), "phone").alias("Phone") # Returns null for Bob, whose map has no "phone" key (ANSI mode off)
).show()

# Accessing ArrayType fields by index
print("\nAccessing ArrayType fields by index:")
df.select(
    col("Name"),
    col("Grades")[0].alias("FirstGrade") # Accessing first element
).show()

# Iterating ArrayType fields using explode
print("\nAccessing ArrayType fields using explode:")
df.withColumn("Grade", explode(col("Grades"))).show()

# spark.stop() # Uncomment to stop SparkSession

Original DataFrame:
+-----+-------------+------+---------------------------------+
|Name |Address      |Grades|Contact                          |
+-----+-------------+------+---------------------------------+
|Alice|{Main St, NY}|[A, B]|{phone -> 111, email -> a@ex.com}|
|Bob  |{Elm St, LD} |[C]   |{email -> b@ex.com}              |
+-----+-------------+------+---------------------------------+

root
 |-- Name: string (nullable = true)
 |-- Address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |-- Grades: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Contact: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)


Accessing StructType fields:
+-----+-------+----+
| Name| Street|City|
+-----+-------+----+
|Alice|Main St|  NY|
|  Bob| Elm St|  LD|
+-----+-------+----+


Accessing MapType fields:
+-----+--------+-----+
| Name|   Email|Phone|
+-----+-----