
### Data Types & Schema Fundamentals

**Why it's important:** Defining a correct schema protects data integrity, avoids the extra pass over the data that schema inference needs, and lets Spark plan storage and computation with known column types.

Spark's `pyspark.sql.types` module provides classes for programmatically defining schemas.

*   **`StructType`**: Represents the schema of a DataFrame, which is essentially a list of `StructField` objects. It defines the structure of a row.
*   **`StructField`**: Represents a column within a `StructType`. It specifies the column's name, data type, and nullability.
    *   **Arguments:**

| Argument       | Type      | Description                                                    |
| :------------- | :-------- | :------------------------------------------------------------- |
| `name`         | `string`  | The name of the column.                                        |
| `dataType`     | `DataType`| The data type of the column (e.g., `StringType()`, `IntegerType()`, `BooleanType()`). |
| `nullable`     | `boolean` | Whether the column can contain null values (`True` for nullable, `False` otherwise). **Best practice: Be explicit with `False` for critical ID columns.** |

*   **`ArrayType(elementType, containsNull)`**: Represents an array (list) of elements of a specific type.
    *   `elementType`: The data type of the elements in the array.
    *   `containsNull`: Whether the array can contain null elements.
*   **`MapType(keyType, valueType, valueContainsNull)`**: Represents a map (dictionary) with key-value pairs.
    *   `keyType`: The data type of the keys.
    *   `valueType`: The data type of the values.
    *   `valueContainsNull`: Whether the map values can be null.

#### Example: Defining a Complex Schema (Python)

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType

# Initialize SparkSession
spark = SparkSession.builder.appName("DataTypesSchema").getOrCreate()

# Define a complex schema for employee data
schema = StructType([
    StructField("employee_id", IntegerType(), False), # Not nullable, typically an ID
    StructField("name", StructType([                       # Nested StructType for name
        StructField("first", StringType(), True),
        StructField("last", StringType(), True)
    ]), True),
    StructField("skills", ArrayType(StringType(), True), True), # Array of strings for skills
    StructField("contact_info", MapType(StringType(), StringType(), True), True) # Map with string keys and values
])

# Create sample data that conforms to the schema
data = [(1, ("Alice", "Smith"), ["Python", "Spark"], {"email": "alice@example.com", "phone": "123-456-7890"}),
        (2, ("Bob", "Johnson"), ["Java", "SQL"], {"email": "bob@example.com"}),
        (3, ("Charlie", None), [], {}) # Example with null in nested struct, empty array, empty map
       ]

# Create a DataFrame with the defined schema
df = spark.createDataFrame(data, schema)

print("DataFrame with complex schema:")
df.show(truncate=False)
df.printSchema()

spark.stop()

DataFrame with complex schema:
+-----------+---------------+---------------+---------------------------------------------------+
|employee_id|name           |skills         |contact_info                                       |
+-----------+---------------+---------------+---------------------------------------------------+
|1          |{Alice, Smith} |[Python, Spark]|{phone -> 123-456-7890, email -> alice@example.com}|
|2          |{Bob, Johnson} |[Java, SQL]    |{email -> bob@example.com}                         |
|3          |{Charlie, NULL}|[]             |{}                                                 |
+-----------+---------------+---------------+---------------------------------------------------+

root
 |-- employee_id: integer (nullable = false)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- contact_info: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

---

### `inferSchema` vs. Manual Schema Definition

When reading data from external files (like CSV, JSON), Spark can either try to guess the schema (`inferSchema=True`) or you can provide it explicitly.

| Feature         | `inferSchema=True` (Automatic)                                    | Manual Schema Definition                                       |
| :-------------- | :---------------------------------------------------------------- | :------------------------------------------------------------- |
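| **Description** | Spark reads the data (all of it by default, or a fraction if `samplingRatio` is set) to guess each column's data type; column names come from the header row when `header=True`. | You explicitly define `StructType` and `StructField` for each column. |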
| **Pros**        | Convenient, less manual work, quick for exploration/ad-hoc analysis. | **Fast** (no extra pass over data), **Robust** (prevents unexpected type changes), enables early detection of mismatches. |
| **Cons**        | **Slow** (requires an extra pass), **May infer incorrect types** (e.g., all strings if data is dirty, or `StringType` for IDs with some `NULL`s). Not suitable for production. | More verbose, requires knowing the schema beforehand.           |
| **Usage**       | `spark.read.csv("path.csv", header=True, inferSchema=True)`       | `spark.read.csv("path.csv", header=True, schema=my_schema)`    |
| **Best for**    | Quick prototyping, initial data exploration.                      | **Production ETL pipelines**, ensuring data quality and pipeline stability. |

#### Example: `inferSchema` vs. Manual Schema (Python)

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("InferVsManualSchema").getOrCreate()

# Create a dummy CSV file for demonstration
csv_content = """id,name,age,city
1,Alice,30,New York
2,Bob,25,London
3,Charlie,35,Paris
4,David,NULL,Berlin
5,Eve,,London""" # Example with a truly empty age

with open("people_data.csv", "w") as f:
    f.write(csv_content)

print("--- 1. Infer Schema ---")
df_inferred = spark.read.csv("people_data.csv", header=True, inferSchema=True)
print("DataFrame with Inferred Schema:")
df_inferred.printSchema()
df_inferred.show()
# Notice: 'age' is inferred as StringType in this run. By default only empty fields are
# read as null, so the literal text 'NULL' in row 4 is an ordinary string and forces the
# whole column to string. The empty age in row 5 becomes a real null.
# Passing nullValue="NULL" to the reader tells Spark to treat that token as null.

print("\n--- 2. Define Schema Manually ---")
manual_schema = StructType([
    StructField("id", IntegerType(), False),            # id is Integer; file-based readers relax nullable=False to True
    StructField("name", StringType(), True),            # name can be string
    StructField("age", IntegerType(), True),            # age should be Integer, expecting nulls
    StructField("city", StringType(), True)             # city can be string
])

df_manual = spark.read.csv("people_data.csv", header=True, schema=manual_schema)
print("\nDataFrame with Manual Schema:")
df_manual.printSchema()
df_manual.show()

spark.stop()

--- 1. Infer Schema ---
DataFrame with Inferred Schema:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+---+-------+----+--------+
| id|   name| age|    city|
+---+-------+----+--------+
|  1|  Alice|  30|New York|
|  2|    Bob|  25|  London|
|  3|Charlie|  35|   Paris|
|  4|  David|NULL|  Berlin|
|  5|    Eve|NULL|  London|
+---+-------+----+--------+


--- 2. Define Schema Manually ---

DataFrame with Manual Schema:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)

+---+-------+----+--------+
| id|   name| age|    city|
+---+-------+----+--------+
|  1|  Alice|  30|New York|
|  2|    Bob|  25|  London|
|  3|Charlie|  35|   Paris|
|  4|  David|NULL|  Berlin|
|  5|    Eve|NULL|  London|
+---+-------+----+--------+



---

### `cast()` for Type Conversions

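`cast()` is a method on `Column` objects (usually obtained with `col()` and applied through `df.withColumn()`) that converts a column from one data type to another. It is a core tool for data cleaning and transformation.

*   **Syntax**: `col("column_name").cast(TargetDataType())`, or with a type name string such as `col("column_name").cast("int")`
*   **Behavior with invalid casts**: With ANSI mode off (`spark.sql.ansi.enabled=false`, the default in Spark 3.x), a value that cannot be cast to the target type (e.g., "abc" to `IntegerType()`) becomes `null` without raising an error. With ANSI mode on (the default starting with Spark 4.0), the same cast raises an error instead.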

#### Example: `cast()` for Type Conversions (Python)

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("TypeConversions").getOrCreate()

data = [("1", "10.5", "2023-01-15"),
        ("2", "20.7", "2023-02-20"),
        ("3", "30.9", "2023-03-25")]
columns = ["id_str", "price_str", "date_str"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame Schema:")
df.printSchema()
df.show()

# Cast 'id_str' to IntegerType
df_casted = df.withColumn("id_int", col("id_str").cast(IntegerType()))

# Cast 'price_str' to DoubleType
df_casted = df_casted.withColumn("price_double", col("price_str").cast(DoubleType()))

# Cast 'date_str' to DateType
df_casted = df_casted.withColumn("date_date", col("date_str").cast(DateType()))

print("\nDataFrame After Casts:")
df_casted.printSchema()
df_casted.show()

# Handling invalid casts:
# If a value cannot be cast, it will result in a null.
data_invalid = [("1", "abc"), ("2", "123")]
df_invalid = spark.createDataFrame(data_invalid, ["id", "num_str"])

print("\nDataFrame with potential invalid casts:")
df_invalid.show()

# Attempt to cast a non-numeric string to IntegerType
df_invalid.withColumn("num_int", col("num_str").cast(IntegerType())).show() # "abc" will become null

spark.stop()


Original DataFrame Schema:
root
 |-- id_str: string (nullable = true)
 |-- price_str: string (nullable = true)
 |-- date_str: string (nullable = true)

+------+---------+----------+
|id_str|price_str|  date_str|
+------+---------+----------+
|     1|     10.5|2023-01-15|
|     2|     20.7|2023-02-20|
|     3|     30.9|2023-03-25|
+------+---------+----------+


DataFrame After Casts:
root
 |-- id_str: string (nullable = true)
 |-- price_str: string (nullable = true)
 |-- date_str: string (nullable = true)
 |-- id_int: integer (nullable = true)
 |-- price_double: double (nullable = true)
 |-- date_date: date (nullable = true)

+------+---------+----------+------+------------+----------+
|id_str|price_str|  date_str|id_int|price_double| date_date|
+------+---------+----------+------+------------+----------+
|     1|     10.5|2023-01-15|     1|        10.5|2023-01-15|
|     2|     20.7|2023-02-20|     2|        20.7|2023-02-20|
|     3|     30.9|2023-03-25|     3|        30.9|2023-03-25|


---

### Schema Evolution Awareness

Schema evolution is the ability to handle changes in the schema of data over time. This is a common challenge in big data systems, especially with semi-structured data or streaming data.

Spark can absorb many schema changes, particularly with self-describing formats such as Parquet, although some behavior (for example, Parquet schema merging) has to be enabled explicitly.

#### Common Scenarios and How Spark Handles Them:

| Scenario            | Spark's Handling (General)                                                                             | Data Engineer's Responsibility                                     |
| :------------------ | :----------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------- |
| **Adding New Columns** | With schema merging enabled (the `mergeSchema` option for Parquet), new columns are included in the DataFrame's schema. Older records (without new columns) will have `null` values for those columns. | Ensure downstream systems can handle `null`s for new columns.      |
| **Dropping Columns** | When reading Parquet with an explicit schema, file columns that are not in the schema are ignored, and schema columns missing from a file come back as `null`. If the schema is inferred, it reflects the files that were read. | Update your reading schema if you want to explicitly exclude them. |
| **Changing Data Types** | Some widening changes (e.g., `Integer` to `Long`) may be handled, depending on the file format and Spark version. Incompatible changes (e.g., `String` to `Integer` if strings aren't valid numbers) will likely lead to `null` values or errors. | **Critical:** Anticipate and manage type changes with `cast()` or explicit schema definitions. Test thoroughly! |
| **Reordering Columns** | Spark (especially Parquet) generally handles column reordering gracefully as it identifies columns by name, not by position. | No major issues typically, but consistency is still good practice. |

#### Best Practices for Schema Evolution:

*   **Use Schema-Aware Formats**: Prefer formats like **Parquet** or **Avro** that are designed with schema evolution in mind. They store schema information with the data and handle additive changes efficiently.
*   **Define Schema Manually (Always for Production)**: While `inferSchema` is convenient for exploration, explicitly defining your schema gives you control, prevents unexpected type changes, and catches issues early.
*   **Handle Nulls Gracefully**: Be prepared for `null` values when new columns are added or when data quality issues lead to invalid casts.
*   **Version Control Schemas**: In complex data pipelines, store and version your schemas alongside your code to track changes and facilitate rollbacks.
*   **Monitor Data Quality**: Implement data quality checks (e.g., column data type checks, null checks) to detect unexpected schema changes or data type issues before they propagate.

#### Example (Conceptual):

Imagine you have a Parquet file with `(id INT, name STRING)`.
Later, new data arrives with `(id INT, name STRING, age INT)`.

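When you read both sets of files together with schema merging enabled (for example, `spark.read.option("mergeSchema", "true").parquet(...)`), the resulting DataFrame will have `(id INT, name STRING, age INT)`. The older records (which didn't have `age`) will have `null` for the `age` column. Without schema merging, Spark takes the schema from only some of the files, so `age` may be missing from the result.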

This "awareness" means that as a Data Engineer, you should anticipate these changes and design your pipeline to be resilient to them, preventing unexpected errors or data quality issues.

---