`StructType` and `StructField` With Examples

PySpark's `StructType` and `StructField` classes are used to programmatically specify the schema of
a DataFrame and to create complex columns such as nested struct, array, and map columns.
A `StructType` is a collection of `StructField` objects, each of which defines a column name, a column data type,
a boolean that specifies whether the field is nullable, and optional metadata. Although PySpark can infer the schema
from the data, we sometimes need to define our own column names and data types. This article explains
how to define simple, nested, and complex schemas.

1. `StructType` defines the structure of the DataFrame

PySpark provides the `StructType` class (`from pyspark.sql.types import StructType`) to define the structure of a DataFrame.
A `StructType` is a collection, or list, of `StructField` objects.

2. `StructField` defines the metadata of the `DataFrame` column

PySpark provides the `StructField` class (`from pyspark.sql.types import StructField`) to define a column, including its name
(`str`), its type (`DataType`), whether it is nullable (`bool`), and its metadata (`dict`).

3. Using PySpark `StructType` and `StructField` with DataFrame

While creating a PySpark DataFrame we can specify its structure using the `StructType` and `StructField` classes. A `StructType` is a collection of
`StructField` objects, each of which defines a column name, a data type, and a flag indicating whether the column is nullable. Using `StructField` we can also add
a nested struct schema, `ArrayType` for arrays, and `MapType` for key-value pairs, which we will discuss in detail.

The example below shows a simple way to define a schema with `StructType` and `StructField` and apply it to a DataFrame, using sample data.

In [5]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [("James", "", "Smith", "36636", "M", 3000),
        ("Michael", "Rose", "", "40288", "M", 4000),
        ("Robert","","Williams", "42114", "M", 4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)
       ]

# Line continuations are not needed inside brackets.
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)


4. Defining a nested `StructType` object

While working with DataFrames we often need nested struct columns, which can be defined using `StructType`.
In the example below, the data type of the "name" column is itself a `StructType`, which makes it a nested column.


In [4]:
structureData = [
    (("James", "", "Smith"), "36636", "M", 3100),
    (("Michael", "Rose", ""), "40288", "M", 4300),
    (("Robert", "", "Williams"), "42114", "M", 1400),
    (("Maria", "Anne", "Jones"), "39192", "F", 5500),
    (("Jen", "Mary", "Brown"), "", "F", -1)
]
structureSchema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])

df2 = spark.createDataFrame(data=structureData, schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)
