#PySpark StructType & StructField

---


**PySpark's StructType and StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns. A StructType is a collection of StructField objects, each of which defines a column name, a column data type, a boolean indicating whether the field is nullable, and optional metadata.**


---

**Although PySpark can infer a schema from data, we sometimes need to define our own column names and data types. This article explains how to define simple, nested, and complex schemas.**

##1. StructType – Defines the structure of the Dataframe

---


**PySpark provides the StructType class (imported with `from pyspark.sql.types import StructType`) to define the structure of the DataFrame.**

***StructType is a collection or list of StructField objects.***

---

**PySpark printSchema() method on the DataFrame shows StructType columns as struct.**

##2. StructField – Defines the metadata of the DataFrame column


---


**PySpark provides the StructField class (imported with `from pyspark.sql.types import StructField`) to define a column. Each StructField holds the column name (a string), the column type (a DataType), a nullable flag (a boolean), and optional metadata (a dict).**

##3. Using PySpark StructType & StructField with DataFrame


---

**While creating a PySpark DataFrame, we can specify its structure using the StructType and StructField classes. As noted in the introduction, a StructType is a collection of StructField objects, each defining a column name, a data type, and a nullable flag. A StructField can also hold a nested struct schema, an ArrayType for arrays, or a MapType for key-value pairs.**

In [0]:
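from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Reuse the notebook's session if one exists, otherwise create one
spark = SparkSession.builder.getOrCreate()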


data = [("James","","Smith","36636","M",3000),
        ("Michael","Rose","","40288","M",4000),
        ("Robert","","Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)
       ]

schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True),
])


df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



##4. Defining a Nested StructType

---

**While working on DataFrame we often need to work with the nested struct column and this can be defined using StructType.**


---

***In the example below, the data type of the “name” column is itself a StructType, which makes it a nested struct.***

In [0]:
structureData = [
    (('James','','Smith'),'36636','M',3100),
    (('Michael', 'Rose', ''), '40288', 'M',4300),
    (('Robert','','Williams'),'42114','M',1400),
    (('Maria','Anne','Jones'),'39192','F',5500),
    (('Jen','Mary','Brown'),'','F',-1)
]

structureSchema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])

df2 = spark.createDataFrame(data=structureData, schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+



##5. Adding & Changing struct of the DataFrame


---

**Using the PySpark SQL function struct(), we can change the structure of an existing DataFrame and add a new StructType column to it. The example below copies columns from one structure into another and adds a new field. The PySpark Column class also provides some functions for working with StructType columns.**

In [0]:
from pyspark.sql.functions import col,struct,when

updateDF = df2.withColumn('OtherInfo',
            struct(col('id').alias('identifier'),
                   col('gender').alias('gender'),
                   col('salary').alias('salary'),
                   when(col('salary').cast(IntegerType()) < 2000, "Low")
                   .when(col('salary').cast(IntegerType())<4000, "Medium")
                   .otherwise('High').alias('Salary_Grade')                 
                )).drop('id', 'gender', 'salary')


updateDF.printSchema()
updateDF.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- OtherInfo: struct (nullable = false)
 |    |-- identifier: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- Salary_Grade: string (nullable = false)

+--------------------+------------------------+
|name                |OtherInfo               |
+--------------------+------------------------+
|{James, , Smith}    |{36636, M, 3100, Medium}|
|{Michael, Rose, }   |{40288, M, 4300, High}  |
|{Robert, , Williams}|{42114, M, 1400, Low}   |
|{Maria, Anne, Jones}|{39192, F, 5500, High}  |
|{Jen, Mary, Brown}  |{, F, -1, Low}          |
+--------------------+------------------------+



##6. Using SQL ArrayType and MapType


---


**StructType also supports ArrayType and MapType for defining DataFrame columns that hold array and map collections, respectively. In the example below, the column hobbies is defined as ArrayType(StringType) and the column properties is defined as MapType(StringType, StringType), meaning both the key and the value are strings.**

In [0]:
from pyspark.sql.types import ArrayType, MapType

arrayStructureSchema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
])),
    StructField('hobbies', ArrayType(StringType()), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])

arrayStructureDF = spark.createDataFrame(data=[], schema=arrayStructureSchema)
arrayStructureDF.printSchema()

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



##7. Creating a StructType from JSON

---

**If you have many columns and the structure of the DataFrame changes now and then, it is good practice to load the StructType schema from a JSON file. You can get the schema as JSON by calling df2.schema.json(), store it in a file, and later rebuild the schema from that file.**

In [0]:
print(df2.schema.json())

json_data = df2.schema.json()


{"fields":[{"metadata":{},"name":"name","nullable":true,"type":{"fields":[{"metadata":{},"name":"firstname","nullable":true,"type":"string"},{"metadata":{},"name":"middlename","nullable":true,"type":"string"},{"metadata":{},"name":"lastname","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"gender","nullable":true,"type":"string"},{"metadata":{},"name":"salary","nullable":true,"type":"integer"}],"type":"struct"}


**Alternatively, you could use df.schema.simpleString(), which returns a more compact schema format.**

***Now let’s parse the JSON string and use the resulting schema to create a DataFrame.***

In [0]:
import json
schemaFromJson = StructType.fromJson(json.loads(json_data))

df3 = spark.createDataFrame(data=structureData, schema=schemaFromJson)
df3.printSchema()


root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



##8. Creating a StructType from a DDL String

---


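**As with loading the structure from a JSON string, we can also create it from a DDL string by using T._parse_datatype_string(ddl_schema_string). The leading underscore marks this function as a private helper, so its behavior may change between PySpark versions.**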

In [0]:
import pyspark.sql.types as T


In [0]:
ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING,\
 `middle`: STRING>,`age` INT,`gender` STRING"

ddl_schema = T._parse_datatype_string(ddlSchemaStr)
ddl_schema

Out[275]: StructType([StructField('fullName', StructType([StructField('first', StringType(), True), StructField('last', StringType(), True), StructField('middle', StringType(), True)]), True), StructField('age', IntegerType(), True), StructField('gender', StringType(), True)])

##9. Checking if a Column Exists in a DataFrame

---

**If you want to perform checks on the metadata of a DataFrame, for example whether a column or field exists or what data type a column has, you can do this with several functions on StructType and StructField.**

In [0]:
print("firstname" in df.schema.fieldNames())
print(StructField("firstname", StringType(), True) in df.schema.fields)

True
True
