`StructType` and `StructField` With Examples

PySpark's `StructType` and `StructField` classes are used to programmatically specify the schema of a
DataFrame and to create complex columns such as nested struct, array, and map columns.
A `StructType` is a collection of `StructField` objects, each of which defines a column name, a column data type, a boolean
that specifies whether the field is nullable, and metadata. Although PySpark can infer the schema from the
data, we sometimes need to define our own column names and data types. This article explains
how to define simple, nested, and complex schemas.

1. `StructType` defines the structure of the DataFrame

PySpark provides the `StructType` class (`from pyspark.sql.types import StructType`) to define the structure of a DataFrame.
`StructType` is a collection or list of `StructField` objects.

2. `StructField` defines the metadata of the `DataFrame` column

PySpark provides the `StructField` class (`from pyspark.sql.types import StructField`) to define a column, which includes the column name
(`str`), column type (`DataType`), nullable flag (`bool`), and metadata (`dict`).

3. Using PySpark StructType and StructField with DataFrame 

While creating a PySpark DataFrame, we can specify its structure using the `StructType` and `StructField` classes. `StructType` is a collection of
`StructField` objects, each of which defines the column name, the data type, and a flag indicating whether the column is nullable. Using `StructField` we can also define a
nested struct schema, `ArrayType` for arrays, and `MapType` for key-value pairs, all of which are discussed in detail in the sections that follow.

The example below shows how to create a simple schema with `StructType` and `StructField` and apply it to a DataFrame, with sample data to support it.

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [("James", "", "Smith", "36636", "M", 3000),
        ("Michael", "Rose", "", "40288", "M", 4000),
        ("Robert","","Williams", "42114", "M", 4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)
       ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



4. Defining a nested `StructType` object

When working with DataFrames, we often need nested struct columns, which can be defined using `StructType`.
In the example below, the `name` column has the data type `StructType`, so it is a nested struct with its own `firstname`, `middlename`, and `lastname` fields.


In [2]:
structureData = [
    (("James", "", "Smith"), "36636", "M", 3100),
    (("Michael", "Rose", ""), "40288", "M", 4300),
    (("Robert", "", "Williams"), "42114", "M", 1400),
    (("Maria", "Anne", "Jones"), "39192", "F", 5500),
    (("Jen", "Mary", "Brown"), "", "F", -1)
]
structureSchema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])

df2 = spark.createDataFrame(data=structureData, schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+



5. Adding and Changing struct of the DataFrame

Using the PySpark SQL function `struct()`, we can change the structure of an existing DataFrame and add a new `StructType` column to it.
The example below copies columns from the top level into a new struct and adds a new column. The PySpark
`Column` class also provides some functions for working with `StructType` columns.

In [5]:
from pyspark.sql.functions import col, struct, when

# Pack id, gender, and salary into a new struct column "OtherInfo",
# add a derived Salary_Grade field, then drop the original top-level columns.
updatedDF = df2.withColumn(
    "OtherInfo",
    struct(
        col("id").alias("identifier"),
        col("gender").alias("gender"),
        col("salary").alias("salary"),
        when(col("salary").cast(IntegerType()) < 2000, "Low")
        .when(col("salary").cast(IntegerType()) < 4000, "Medium")
        .otherwise("High")
        .alias("Salary_Grade"),
    ),
).drop("id", "gender", "salary")

updatedDF.printSchema()
updatedDF.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- OtherInfo: struct (nullable = false)
 |    |-- identifier: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- Salary_Grade: string (nullable = false)

+--------------------+------------------------+
|name                |OtherInfo               |
+--------------------+------------------------+
|{James, , Smith}    |{36636, M, 3100, Medium}|
|{Michael, Rose, }   |{40288, M, 4300, High}  |
|{Robert, , Williams}|{42114, M, 1400, Low}   |
|{Maria, Anne, Jones}|{39192, F, 5500, High}  |
|{Jen, Mary, Brown}  |{, F, -1, Low}          |
+--------------------+------------------------+



Here, it copies `gender`, `salary`, and `id` (renamed to `identifier`) into the new struct `OtherInfo` and adds a new field, `Salary_Grade`.

6. Using SQL `ArrayType` and `MapType`

`StructType` also supports `ArrayType` and `MapType` for defining DataFrame columns that hold array and map collections, respectively.
In the example below, the column `hobbies` is defined as `ArrayType(StringType())` and `properties` is defined as `MapType(StringType(), StringType())`,
meaning that both the key and the value are strings.

In [8]:
from pyspark.sql.types import ArrayType, MapType

arrayStructureSchema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('hobbies', ArrayType(StringType()), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])

print(arrayStructureSchema)

StructType([StructField('name', StructType([StructField('firstname', StringType(), True), StructField('middlename', StringType(), True), StructField('lastname', StringType(), True)]), True), StructField('hobbies', ArrayType(StringType(), True), True), StructField('properties', MapType(StringType(), StringType(), True), True)])


7. Creating a `StructType` object from a JSON file

If you have many columns and the structure of the DataFrame changes now and then, it is good practice to load the `StructType` schema from a JSON file.
You can get the schema as a JSON string by calling `df2.schema.json()`, store it in a file, and later use that file to recreate the schema.


In [9]:
print(df2.schema.json())

{"fields":[{"metadata":{},"name":"name","nullable":true,"type":{"fields":[{"metadata":{},"name":"firstname","nullable":true,"type":"string"},{"metadata":{},"name":"middlename","nullable":true,"type":"string"},{"metadata":{},"name":"lastname","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"gender","nullable":true,"type":"string"},{"metadata":{},"name":"salary","nullable":true,"type":"integer"}],"type":"struct"}


Here is the same output of `print(df2.schema.json())`, formatted for readability:
```
{
  "fields":
    [
       {
         "metadata":{},
         "name":"name",
         "nullable":true,
         "type":
             {"fields":
                  [
                      {
                         "metadata": {},
                         "name": "firstname",
                         "nullable": true,
                         "type": "string"
                      },
                      {
                         "metadata": {},
                         "name": "middlename",
                         "nullable": true,
                         "type": "string"
                      },
                      {
                         "metadata": {},
                         "name": "lastname",
                         "nullable": true,
                         "type": "string"
                      }
                  ],
                  "type": "struct"
            }
       },
       {
          "metadata":{},
          "name": "id",
          "nullable": true,
          "type": "string"
       },
       {
          "metadata":{},
          "name": "gender",
          "nullable": true,
          "type": "string"
       },
       {
          "metadata":{},
          "name": "salary",
          "nullable": true, 
          "type": "integer"
       }
    ],
    "type": "struct"
}
```