#### **Struct Data Type**

- A **struct** is a **complex data type** that stores **multiple fields** together in a **single column**.
- The **fields** of a **struct** can have **different data types** and can themselves be **nested**.
- A **STRUCT** column contains an **ordered list** of fields, also called **entries**.
- Every row in a STRUCT column has the **same keys** (field names), because they are fixed by the column's schema.
- STRUCTs are typically used to **nest multiple columns** into a **single column**; a **nested field** can be of **any type, including other STRUCTs and LISTs** (arrays in Spark).

- How to access nested StructType columns?
- How to access complex (multi-level) nested StructType columns?
- How to convert CSV data to a struct type and access the StructType columns?

- A struct column named **address** with fields **city and zip** can be defined as follows (Scala-style notation):

      StructType(List(
       StructField("city", StringType, true),
       StructField("zip", IntegerType, true)
      ))
- In this example, **address** is the **struct** column name, **city and zip** are the **fields** of the struct column, their respective data types are **StringType and IntegerType**, and all the fields are **nullable**.

In [0]:
from pyspark.sql.functions import lit, col, to_json
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType, StringType, StructType, StructField

In [0]:
%sql
SELECT named_struct('key1', 'value1', 'key2', 42) AS s;

s
"List(value1, 42)"


#### **1) Accessing Nested Struct Columns**

**Ex 01**

In [0]:
data = [('Rajesh', ('BMW', 'ADF', 'Data Engineer', 5)),
        ('Rajasekar', ('HONDA', 'ADB', 'Developer', 8)),
        ('Harish', ('MARUTI', 'AZURE', 'Testing', 9)),
        ('Kamalesh', ('BENZ', 'PYSPARK', 'Developer', 10)),
        ('Jagadish', ('FORD', 'PYTHON', 'ADE', 3)),
        ('Arijit', ('KIA', 'DEVOPS', 'CI/CD', 4))]

schema_sub = StructType([StructField('make', StringType()), StructField('Technology', StringType()),\
                         StructField('Designation', StringType()), StructField('Experience', StringType())])
                         
schema = StructType([StructField('Name', StringType()), StructField('Description', schema_sub)])

df_nest = spark.createDataFrame(data, schema)
display(df_nest)
df_nest.printSchema()

Name,Description
Rajesh,"List(BMW, ADF, Data Engineer, 5)"
Rajasekar,"List(HONDA, ADB, Developer, 8)"
Harish,"List(MARUTI, AZURE, Testing, 9)"
Kamalesh,"List(BENZ, PYSPARK, Developer, 10)"
Jagadish,"List(FORD, PYTHON, ADE, 3)"
Arijit,"List(KIA, DEVOPS, CI/CD, 4)"


root
 |-- Name: string (nullable = true)
 |-- Description: struct (nullable = true)
 |    |-- make: string (nullable = true)
 |    |-- Technology: string (nullable = true)
 |    |-- Designation: string (nullable = true)
 |    |-- Experience: string (nullable = true)



**a) Selecting the entire Struct Column**

In [0]:
# Selecting the entire nested 'Description' struct column
df_nest.select("Description").display()

Description
"List(BMW, ADF, Data Engineer, 5)"
"List(HONDA, ADB, Developer, 8)"
"List(MARUTI, AZURE, Testing, 9)"
"List(BENZ, PYSPARK, Developer, 10)"
"List(FORD, PYTHON, ADE, 3)"
"List(KIA, DEVOPS, CI/CD, 4)"


**b) Selecting Specific Subfields of a Struct Column**

In [0]:
# Selecting 'make' subfield within the 'Description' struct column
df_nest.select("Description.make").display()

make
BMW
HONDA
MARUTI
BENZ
FORD
KIA


In [0]:
# Selecting 'Name' and the 'Technology' subfield within the 'Description' struct column
df_nest.select("Name", "Description.Technology").display()

Name,Technology
Rajesh,ADF
Rajasekar,ADB
Harish,AZURE
Kamalesh,PYSPARK
Jagadish,PYTHON
Arijit,DEVOPS


     df.select("Name", df["Description.Technology"].alias("Technology"), df["Description.Experience"].alias("Experience"))
     (or)
     df.select("Name", "Description.Technology", "Description.Experience").display()

In [0]:
# Split struct column into separate columns 
df_nest_01 = df_nest.select("name", df_nest["Description.Technology"].alias("Technology"), df_nest["Description.Experience"].alias("Experience"))
display(df_nest_01)

name,Technology,Experience
Rajesh,ADF,5
Rajasekar,ADB,8
Harish,AZURE,9
Kamalesh,PYSPARK,10
Jagadish,PYTHON,3
Arijit,DEVOPS,4


In [0]:
# Expand all attributes of "Description"
df_nest.select("*", "Description.*").display()

Name,Description,make,Technology,Designation,Experience
Rajesh,"List(BMW, ADF, Data Engineer, 5)",BMW,ADF,Data Engineer,5
Rajasekar,"List(HONDA, ADB, Developer, 8)",HONDA,ADB,Developer,8
Harish,"List(MARUTI, AZURE, Testing, 9)",MARUTI,AZURE,Testing,9
Kamalesh,"List(BENZ, PYSPARK, Developer, 10)",BENZ,PYSPARK,Developer,10
Jagadish,"List(FORD, PYTHON, ADE, 3)",FORD,PYTHON,ADE,3
Arijit,"List(KIA, DEVOPS, CI/CD, 4)",KIA,DEVOPS,CI/CD,4


**Ex 02**

In [0]:
# Data
data = [
        (("jagadish", None, "Smith", 35, 5, "buy"),"chennai","M"),
        (("Anand", "Rose", "", 30, 8, "sell"), "bangalore", "M"),
        (("Julia", "", "Williams", 25, 3, "buy"), "vizak", "F"),
        (("Mukesh", "Bhat", "Royal", 45, 8, "buy"), "madurai", "M"),
        (("Swetha", "Kumari", "Anand", 55, 15, "sell"), "mysore", "F"),
        (("Madan", "Mohan", "Nair", 22, 11, "buy"), "hyderabad", "M"),
        (("George", "", "Williams", 38, 7, "sell"), "London", "M"),
        (("Roshan", "Bhat", "", 41, 3, "buy"), "mandya", "M"),
        (("Sourabh", "Sharma", "", 27, 2, "sell"), "Nasik", "M"),
        (("Mohan", "Rao", "K", 42, 7, "buy"), "nizamabad", "M")
        ]

# Schema
schema_arr = StructType([
    StructField('Name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True),
         StructField('age', IntegerType(), True),
         StructField('experience', IntegerType(), True),
         StructField('status', StringType(), True)
         ])),
     StructField('city', StringType(), True),
     StructField('gender', StringType(), True)
     ])

# Create DataFrame
df_arr = spark.createDataFrame(data=data, schema=schema_arr)
df_arr.printSchema()
display(df_arr)

root
 |-- Name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |    |-- age: integer (nullable = true)
 |    |-- experience: integer (nullable = true)
 |    |-- status: string (nullable = true)
 |-- city: string (nullable = true)
 |-- gender: string (nullable = true)



Name,city,gender
"List(jagadish, null, Smith, 35, 5, buy)",chennai,M
"List(Anand, Rose, , 30, 8, sell)",bangalore,M
"List(Julia, , Williams, 25, 3, buy)",vizak,F
"List(Mukesh, Bhat, Royal, 45, 8, buy)",madurai,M
"List(Swetha, Kumari, Anand, 55, 15, sell)",mysore,F
"List(Madan, Mohan, Nair, 22, 11, buy)",hyderabad,M
"List(George, , Williams, 38, 7, sell)",London,M
"List(Roshan, Bhat, , 41, 3, buy)",mandya,M
"List(Sourabh, Sharma, , 27, 2, sell)",Nasik,M
"List(Mohan, Rao, K, 42, 7, buy)",nizamabad,M


In [0]:
# Select struct type
df_arr.select("Name").display()

Name
"List(jagadish, null, Smith, 35, 5, buy)"
"List(Anand, Rose, , 30, 8, sell)"
"List(Julia, , Williams, 25, 3, buy)"
"List(Mukesh, Bhat, Royal, 45, 8, buy)"
"List(Swetha, Kumari, Anand, 55, 15, sell)"
"List(Madan, Mohan, Nair, 22, 11, buy)"
"List(George, , Williams, 38, 7, sell)"
"List(Roshan, Bhat, , 41, 3, buy)"
"List(Sourabh, Sharma, , 27, 2, sell)"
"List(Mohan, Rao, K, 42, 7, buy)"


In [0]:
# Select columns from struct type
df_arr.select("name.firstname","name.lastname").display()

firstname,lastname
jagadish,Smith
Anand,
Julia,Williams
Mukesh,Royal
Swetha,Anand
Madan,Nair
George,Williams
Roshan,
Sourabh,
Mohan,K


In [0]:
# Extract all columns from struct type
df_arr.select("name.*").display()

firstname,middlename,lastname,age,experience,status
jagadish,,Smith,35,5,buy
Anand,Rose,,30,8,sell
Julia,,Williams,25,3,buy
Mukesh,Bhat,Royal,45,8,buy
Swetha,Kumari,Anand,55,15,sell
Madan,Mohan,Nair,22,11,buy
George,,Williams,38,7,sell
Roshan,Bhat,,41,3,buy
Sourabh,Sharma,,27,2,sell
Mohan,Rao,K,42,7,buy


In [0]:
# Keep the original columns and append all fields of the struct as top-level columns
df_arr.select("*", "Name.*").display()

Name,city,gender,firstname,middlename,lastname,age,experience,status
"List(jagadish, null, Smith, 35, 5, buy)",chennai,M,jagadish,,Smith,35,5,buy
"List(Anand, Rose, , 30, 8, sell)",bangalore,M,Anand,Rose,,30,8,sell
"List(Julia, , Williams, 25, 3, buy)",vizak,F,Julia,,Williams,25,3,buy
"List(Mukesh, Bhat, Royal, 45, 8, buy)",madurai,M,Mukesh,Bhat,Royal,45,8,buy
"List(Swetha, Kumari, Anand, 55, 15, sell)",mysore,F,Swetha,Kumari,Anand,55,15,sell
"List(Madan, Mohan, Nair, 22, 11, buy)",hyderabad,M,Madan,Mohan,Nair,22,11,buy
"List(George, , Williams, 38, 7, sell)",London,M,George,,Williams,38,7,sell
"List(Roshan, Bhat, , 41, 3, buy)",mandya,M,Roshan,Bhat,,41,3,buy
"List(Sourabh, Sharma, , 27, 2, sell)",Nasik,M,Sourabh,Sharma,,27,2,sell
"List(Mohan, Rao, K, 42, 7, buy)",nizamabad,M,Mohan,Rao,K,42,7,buy


**Ex 03**

In [0]:
data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
        {"id": 2, "customer_profile": {"name": "Tech", "age": 10}},
        {"id": 3, "customer_profile": {"name": "ADF", "age": 23}},
        {"id": 4, "customer_profile": {"name": "AWS", "age": 30}},
        {"id": 5, "customer_profile": {"name": "GCC", "age": 33}},
        {"id": 6, "customer_profile": {"name": "ADB", "age": 20}}]

customer_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])

df_schema = StructType([StructField("id", IntegerType(), True), StructField("customer_profile", customer_schema, False)])

df_dict = spark.createDataFrame(data, df_schema)
print(df_dict.schema)
display(df_dict)

StructType([StructField('id', IntegerType(), True), StructField('customer_profile', StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True)]), False)])


id,customer_profile
1,"List(Kontext, 3)"
2,"List(Tech, 10)"
3,"List(ADF, 23)"
4,"List(AWS, 30)"
5,"List(GCC, 33)"
6,"List(ADB, 20)"


In [0]:
# expand the StructType column
df_dict.select('*', "customer_profile.name", "customer_profile.age").display()

id,customer_profile,name,age
1,"List(Kontext, 3)",Kontext,3
2,"List(Tech, 10)",Tech,10
3,"List(ADF, 23)",ADF,23
4,"List(AWS, 30)",AWS,30
5,"List(GCC, 33)",GCC,33
6,"List(ADB, 20)",ADB,20


In [0]:
# expand all attributes of the struct with the '*' wildcard
df_dict.select('*', "customer_profile.*").display()

id,customer_profile,name,age
1,"List(Kontext, 3)",Kontext,3
2,"List(Tech, 10)",Tech,10
3,"List(ADF, 23)",ADF,23
4,"List(AWS, 30)",AWS,30
5,"List(GCC, 33)",GCC,33
6,"List(ADB, 20)",ADB,20


#### **2) Complex Nested Structures**

In [0]:
# Create data with nested structure
data = [("James", 34, ("1st Avenue", "New York", "US", ("123-456-7890", "james@example.com"))),
        ("Anna", 23, ("2nd Avenue", "San Francisco", "US", ("987-654-3210", "anna@example.com"))),
        ("Jeff", 45, ("3rd Avenue", "London", "UK", ("456-123-7890", "jeff@example.com"))),
        ("karthik", 34, ("#876", "Chennai", "India", ("9866773221", "karthik@example.com"))),
        ("Anusha", 43, ("48th Main", "Hyderabad", "India", ("9933445500", "anu@example.com"))),
        ("Jayesh", 45, ("3rd Cross", "Colombo", "SriLanka", ("8745612345", "yay@example.com"))),
        ("Paul", 38, ("River View", "Gothenberg", "Sweden", ("656-456-7890", "paul@example.com"))),
        ("Rajesh", 40, ("Cross word", "Berlin", "Germany", ("456-123-2678", "raj@example.com"))),
        ] 

# Defined schema for illustration purposes:
complex_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True),
        StructField("country", StringType(), True),
        StructField("contact", StructType([
            StructField("phone", StringType(), True),
            StructField("email", StringType(), True)
        ]), True)
    ]), True)
])

# create DataFrame using createDataFrame method 
df_nested = spark.createDataFrame(data, complex_schema)
display(df_nested)

Name,age,address
James,34,"List(1st Avenue, New York, US, List(123-456-7890, james@example.com))"
Anna,23,"List(2nd Avenue, San Francisco, US, List(987-654-3210, anna@example.com))"
Jeff,45,"List(3rd Avenue, London, UK, List(456-123-7890, jeff@example.com))"
karthik,34,"List(#876, Chennai, India, List(9866773221, karthik@example.com))"
Anusha,43,"List(48th Main, Hyderabad, India, List(9933445500, anu@example.com))"
Jayesh,45,"List(3rd Cross, Colombo, SriLanka, List(8745612345, yay@example.com))"
Paul,38,"List(River View, Gothenberg, Sweden, List(656-456-7890, paul@example.com))"
Rajesh,40,"List(Cross word, Berlin, Germany, List(456-123-2678, raj@example.com))"


In [0]:
# Navigating through multiple levels and selecting 'phone'
df_nested.select("Name","address.contact.phone").display()

Name,phone
James,123-456-7890
Anna,987-654-3210
Jeff,456-123-7890
karthik,9866773221
Anusha,9933445500
Jayesh,8745612345
Paul,656-456-7890
Rajesh,456-123-2678


In [0]:
# Navigating through multiple levels and selecting 'phone' and 'email'
df_nested.select("Name", "address.contact.phone", "address.contact.email").display()

Name,phone,email
James,123-456-7890,james@example.com
Anna,987-654-3210,anna@example.com
Jeff,456-123-7890,jeff@example.com
karthik,9866773221,karthik@example.com
Anusha,9933445500,anu@example.com
Jayesh,8745612345,yay@example.com
Paul,656-456-7890,paul@example.com
Rajesh,456-123-2678,raj@example.com


In [0]:
# Expand all columns of "contact"
df_nested.select("address.contact.*").display()

phone,email
123-456-7890,james@example.com
987-654-3210,anna@example.com
456-123-7890,jeff@example.com
9866773221,karthik@example.com
9933445500,anu@example.com
8745612345,yay@example.com
656-456-7890,paul@example.com
456-123-2678,raj@example.com


In [0]:
# Expand all columns of "address"
df_nested.select("address.*").display()

street,city,country,contact
1st Avenue,New York,US,"List(123-456-7890, james@example.com)"
2nd Avenue,San Francisco,US,"List(987-654-3210, anna@example.com)"
3rd Avenue,London,UK,"List(456-123-7890, jeff@example.com)"
#876,Chennai,India,"List(9866773221, karthik@example.com)"
48th Main,Hyderabad,India,"List(9933445500, anu@example.com)"
3rd Cross,Colombo,SriLanka,"List(8745612345, yay@example.com)"
River View,Gothenberg,Sweden,"List(656-456-7890, paul@example.com)"
Cross word,Berlin,Germany,"List(456-123-2678, raj@example.com)"


In [0]:
# Expand all columns of "address" and of the nested "contact" struct
df_nested.select("address.*", "address.contact.*").display()

street,city,country,contact,phone,email
1st Avenue,New York,US,"List(123-456-7890, james@example.com)",123-456-7890,james@example.com
2nd Avenue,San Francisco,US,"List(987-654-3210, anna@example.com)",987-654-3210,anna@example.com
3rd Avenue,London,UK,"List(456-123-7890, jeff@example.com)",456-123-7890,jeff@example.com
#876,Chennai,India,"List(9866773221, karthik@example.com)",9866773221,karthik@example.com
48th Main,Hyderabad,India,"List(9933445500, anu@example.com)",9933445500,anu@example.com
3rd Cross,Colombo,SriLanka,"List(8745612345, yay@example.com)",8745612345,yay@example.com
River View,Gothenberg,Sweden,"List(656-456-7890, paul@example.com)",656-456-7890,paul@example.com
Cross word,Berlin,Germany,"List(456-123-2678, raj@example.com)",456-123-2678,raj@example.com


In [0]:
# Expand "address" and "contact", then drop the now-redundant nested "contact" column
df_nested.select("address.*", "address.contact.*").drop('contact').display()

street,city,country,phone,email
1st Avenue,New York,US,123-456-7890,james@example.com
2nd Avenue,San Francisco,US,987-654-3210,anna@example.com
3rd Avenue,London,UK,456-123-7890,jeff@example.com
#876,Chennai,India,9866773221,karthik@example.com
48th Main,Hyderabad,India,9933445500,anu@example.com
3rd Cross,Colombo,SriLanka,8745612345,yay@example.com
River View,Gothenberg,Sweden,656-456-7890,paul@example.com
Cross word,Berlin,Germany,456-123-2678,raj@example.com


#### **3) Convert CSV to Struct**

In [0]:
df_csv = spark.read.csv("dbfs:/FileStore/tables/to_json.csv", header=True, inferSchema=True)
display(df_csv.limit(10))

Id,Nick_Name,First_Name,Last_Name,Age,Type,Description,Commodity_Index,Sensex_Category,Label_Type,Effective_Date,Start_Date,End_Date,Currency,Ticket,Name,Sex
1,admin,John,Victor,30,Grade1,Baleno,DISCOUNT,Top,average,6-Feb-23,14-Jan-23,6-Feb-23,INR,A/5 21171,"Braund, Mr. Owen Harris",male
2,everest,Paul,Irish,35,Grade2,Engine_Base,DISCOUNT,Top,average,6-Feb-23,14-Jan-23,6-Feb-23,INR,PC 17599,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female
3,moon,Erram,Rammohan,29,Enginner1,Baleno,DISCOUNT,Top,average,8-Jan-24,7-Oct-23,8-Jan-24,INR,STON/O2. 3101282,"Heikkinen, Miss. Laina",female
4,service,Stalin,Rajesh,40,Minister,Engine_Base,DISCOUNT,Top,average,8-Jan-24,7-Oct-23,8-Jan-24,INR,113803,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
5,Builder,Golla,Rajasekar,43,Builder,Suzuki Swift,DISCOUNT,Top,average,6-Mar-23,7-Feb-23,6-Mar-23,INR,373450,"Allen, Mr. William Henry",male
6,Drinker,Karjala,Hari,33,Army,Suzuki Swift,DISCOUNT,Top,average,6-Mar-23,7-Feb-23,6-Mar-23,INR,330877,"Moran, Mr. James",male
7,Army,Koyi,Damodar,37,Bettalian,Wagon R,DISCOUNT,Top,average,6-Jan-25,9-Jan-24,6-Jan-25,INR,17463,"McCarthy, Mr. Timothy J",male
8,Marketing,Vemparla,Harish,55,Manager,Engine_Base,DISCOUNT,Top,average,6-Jan-25,9-Jan-24,6-Jan-25,INR,349909,"Palsson, Master. Gosta Leonard",male
9,Politician,Devineni,Umesh,58,Senior,Creta,DISCOUNT,Top,average,6-Apr-23,7-Mar-23,6-Apr-23,INR,347742,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female
10,Minister,Ponguru,Narayana,56,Education,Brezza,DISCOUNT,Top,average,6-Apr-23,7-Mar-23,6-Apr-23,INR,237736,"Nasser, Mrs. Nicholas (Adele Achem)",female


In [0]:
# Pack all columns into a single struct column named 'pp_msg';
# distinct() drops duplicate rows and does not preserve the original row order
df_final = df_csv.select(f.struct('*').alias('pp_msg')).distinct()
display(df_final.limit(10))

pp_msg
"List(1, admin, John, Victor, 30, Grade1, Baleno, DISCOUNT, Top, average, 6-Feb-23, 14-Jan-23, 6-Feb-23, INR, A/5 21171, Braund, Mr. Owen Harris, male)"
"List(7, Army, Koyi, Damodar, 37, Bettalian, Wagon R, DISCOUNT, Top, average, 6-Jan-25, 9-Jan-24, 6-Jan-25, INR, 17463, McCarthy, Mr. Timothy J, male)"
"List(2, everest, Paul, Irish, 35, Grade2, Engine_Base, DISCOUNT, Top, average, 6-Feb-23, 14-Jan-23, 6-Feb-23, INR, PC 17599, Cumings, Mrs. John Bradley (Florence Briggs Thayer), female)"
"List(5, Builder, Golla, Rajasekar, 43, Builder, Suzuki Swift, DISCOUNT, Top, average, 6-Mar-23, 7-Feb-23, 6-Mar-23, INR, 373450, Allen, Mr. William Henry, male)"
"List(3, moon, Erram, Rammohan, 29, Enginner1, Baleno, DISCOUNT, Top, average, 8-Jan-24, 7-Oct-23, 8-Jan-24, INR, STON/O2. 3101282, Heikkinen, Miss. Laina, female)"
"List(4, service, Stalin, Rajesh, 40, Minister, Engine_Base, DISCOUNT, Top, average, 8-Jan-24, 7-Oct-23, 8-Jan-24, INR, 113803, Futrelle, Mrs. Jacques Heath (Lily May Peel), female)"
"List(8, Marketing, Vemparla, Harish, 55, Manager, Engine_Base, DISCOUNT, Top, average, 6-Jan-25, 9-Jan-24, 6-Jan-25, INR, 349909, Palsson, Master. Gosta Leonard, male)"
"List(6, Drinker, Karjala, Hari, 33, Army, Suzuki Swift, DISCOUNT, Top, average, 6-Mar-23, 7-Feb-23, 6-Mar-23, INR, 330877, Moran, Mr. James, male)"
"List(9, Politician, Devineni, Umesh, 58, Senior, Creta, DISCOUNT, Top, average, 6-Apr-23, 7-Mar-23, 6-Apr-23, INR, 347742, Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg), female)"
"List(10, Minister, Ponguru, Narayana, 56, Education, Brezza, DISCOUNT, Top, average, 6-Apr-23, 7-Mar-23, 6-Apr-23, INR, 237736, Nasser, Mrs. Nicholas (Adele Achem), female)"
