    1. StructType – Defines the structure of the DataFrame
    PySpark provides the StructType class (from pyspark.sql.types import StructType) to define the structure of a DataFrame.

    StructType is a collection or list of StructField objects.

    The printSchema() method on a DataFrame displays StructType columns as struct.

    2. StructField – Defines the metadata of the DataFrame column

    PySpark provides the StructField class (from pyspark.sql.types import StructField) to define a column. Each StructField holds the column name (string), the column type (DataType), a nullable flag (Boolean), and optional metadata (a dictionary).


    3. Using PySpark StructType & StructField with DataFrame

    When creating a PySpark DataFrame, we can specify its structure using the StructType and StructField classes. As noted in the introduction, a StructType is a collection of StructField objects, each of which defines a column's name, data type, and nullable flag. A StructField can also hold a nested struct schema, an ArrayType for arrays, or a MapType for key-value pairs, which we will discuss in detail in later sections.

    The example below shows how to define a schema with StructType and StructField and apply it to a DataFrame built from sample data.

In [3]:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Start (or reuse) a local Spark session with a single worker thread.
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

# Sample rows: (firstname, middlename, lastname, id, gender, salary).
data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

# One StructField per column: name, data type, nullable flag.
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
  ])

# Apply the explicit schema instead of letting Spark infer the types.
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



In [12]:
from pyspark.sql.types import StructType, StructField, IntegerType

# Sample rows of seven integers each.
data = [(1,23,4,5,5,3,533),
        (1,3,4,45,5,3,333),
        (1,3,4,5,5,3,333),
        (14,3,4,5,5,3,533),
        (1,3,4,5,5,33,3)]

# Schema with seven nullable integer columns.
columns = StructType([StructField("id",IntegerType(),True),
                     StructField("number",IntegerType(),True),
                     StructField('number3',IntegerType(),True),
                     StructField('number4',IntegerType(),True),
                     StructField('number5',IntegerType(),True),
                     StructField('number6',IntegerType(),True),
                     StructField('number7',IntegerType(),True)])

# Reuses the SparkSession created earlier in the notebook.
df = spark.createDataFrame(data,columns)

In [14]:
df.show()

+---+------+-------+-------+-------+-------+-------+
| id|number|number3|number4|number5|number6|number7|
+---+------+-------+-------+-------+-------+-------+
|  1|    23|      4|      5|      5|      3|    533|
|  1|     3|      4|     45|      5|      3|    333|
|  1|     3|      4|      5|      5|      3|    333|
| 14|     3|      4|      5|      5|      3|    533|
|  1|     3|      4|      5|      5|     33|      3|
+---+------+-------+-------+-------+-------+-------+

