#### How to compare your DataFrame schema against an expected schema?
- A **schema validation** snippet that **compares your DataFrame schema** (`df.schema.fields`) against an **expected schema** (the `expected_schema` dictionary).

In [0]:
df = spark.read.csv("/Volumes/@azureadb/pyspark/unionby/company_level.csv", header=True, inferSchema=True)
display(df.limit(5))

start_date,product_url,category,default_group,source_target,cloud_flatform,session_id,session_name,status_name,status_type,sessions,product_id,load datetime
2025-08-25,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876543,first_visit,first_visit,Not Available,5,409516064,2025-09-02T19:10:35
2025-08-26,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876544,purchase,organic,Not Available,12,409516064,2025-09-02T19:10:36
2025-08-27,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876545,search,network,Not Available,16,409516064,2025-09-02T19:10:37
2025-08-28,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876546,search,scroll,Not Available,22,409516064,2025-09-02T19:10:38
2025-08-29,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876547,search,organic,Not Available,25,409516064,2025-09-02T19:10:39


**Schema Validation Script**

In [0]:
expected_schema = {
    "start_date": "StringType()",
    "product_url": "StringType()",
    "category": "IntegerType()",
    "cloud_flatform": "IntegerType()",
    "default_group": "StringType()",
    "source_target": "StringType()",
    "product": "StringType()",
    "product_version": "StringType()",
    "session_id": "IntegerType()",
    "session_name": "StringType()",
    "status_name": "IntegerType()",
    "status_type": "StringType()",
    "sessions": "IntegerType()",
    "load datetime": "StringType()",
    "load date": "StringType()",
    "load time": "DoubleType()"
}

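# Check every column present in the DataFrame against the expected schema.
for field in df.schema.fields:
    # .get() returns None when the column is not listed in expected_schema.
    expected_type = expected_schema.get(field.name)
    if expected_type is not None:
        # Compare the string form of the Spark type, e.g. "StringType()".
        if str(field.dataType) == expected_type:
            print(f"✅ {field.name} → matches expected type {expected_type}")
        else:
            print(f"⚠️ {field.name} → expected {expected_type}, found {field.dataType}")
    else:
        print(f"ℹ️ Extra column: {field.name}")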

⚠️ start_date → expected StringType(), found DateType()
✅ product_url → matches expected type StringType()
⚠️ category → expected IntegerType(), found StringType()
✅ default_group → matches expected type StringType()
✅ source_target → matches expected type StringType()
⚠️ cloud_flatform → expected IntegerType(), found StringType()
✅ session_id → matches expected type IntegerType()
✅ session_name → matches expected type StringType()
⚠️ status_name → expected IntegerType(), found StringType()
✅ status_type → matches expected type StringType()
✅ sessions → matches expected type IntegerType()
ℹ️ Extra column: product_id
✅ load datetime → matches expected type StringType()
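
The loop above only walks the columns that exist in `df`, so columns listed in `expected_schema` but absent from the DataFrame (here `product`, `product_version`, `load date` and `load time`) are never reported. Below is a minimal sketch of a two-way comparison. It assumes the actual schema has first been flattened into a name-to-type-string dictionary, for example with `{f.name: str(f.dataType) for f in df.schema.fields}`. The function name `compare_schemas` and the shortened sample dictionaries are hypothetical, not part of the original notebook.

```python
def compare_schemas(actual, expected):
    """Compare two {column_name: type_string} dicts.

    Returns a dict with four lists: matched, mismatched,
    extra (only in actual) and missing (only in expected).
    """
    report = {"matched": [], "mismatched": [], "extra": [], "missing": []}
    for name, found_type in actual.items():
        if name not in expected:
            report["extra"].append(name)
        elif found_type == expected[name]:
            report["matched"].append(name)
        else:
            # Keep (column, expected type, found type) for the message.
            report["mismatched"].append((name, expected[name], found_type))
    # Columns the expected schema lists but the DataFrame does not have.
    report["missing"] = [name for name in expected if name not in actual]
    return report


# Hypothetical, shortened inputs; in the notebook `actual` would come from
# {f.name: str(f.dataType) for f in df.schema.fields}.
actual = {
    "start_date": "DateType()",
    "product_url": "StringType()",
    "product_id": "IntegerType()",
}
expected = {
    "start_date": "StringType()",
    "product_url": "StringType()",
    "load date": "StringType()",
}

report = compare_schemas(actual, expected)
print(report)
```

Returning a report instead of printing makes it easy to fail a job when `report["missing"]` or `report["mismatched"]` is non-empty.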


- **df.schema.fields** gives a **list of StructField objects**, each representing one column (its name, data type, and nullable flag).

     df.schema.fields

     [
      StructField('start_date', DateType(), True),
      StructField('product_url', StringType(), True),
      StructField('category', StringType(), True),
      StructField('default_group', StringType(), True),
      StructField('source_target', StringType(), True),
      StructField('cloud_flatform', StringType(), True),
      StructField('session_id', IntegerType(), True),
      StructField('session_name', StringType(), True),
      StructField('status_name', StringType(), True),
      StructField('status_type', StringType(), True),
      StructField('sessions', IntegerType(), True),
      StructField('product_id', IntegerType(), True),
      StructField('load datetime', StringType(), True)
     ]

     for field in df.schema.fields:
         print(field)
     -------------------------------
     StructField('start_date', DateType(), True)
     StructField('product_url', StringType(), True)
     StructField('category', StringType(), True)
     StructField('default_group', StringType(), True)
     StructField('source_target', StringType(), True)
     StructField('cloud_flatform', StringType(), True)
     StructField('session_id', IntegerType(), True)
     StructField('session_name', StringType(), True)
     StructField('status_name', StringType(), True)
     StructField('status_type', StringType(), True)
     StructField('sessions', IntegerType(), True)
     StructField('product_id', IntegerType(), True)
     StructField('load datetime', StringType(), True)
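
The validation loop compares the string form of each data type (such as `DateType()` in the listing above) with the strings stored in `expected_schema`. That string form may differ between PySpark versions: older releases printed types without the trailing parentheses (for example `StringType`), which would turn every comparison into a mismatch. The sketch below shows one way to make the comparison tolerant of that difference. The helper names `normalize_type_name` and `types_match` are hypothetical, and the approach is an assumption about how you might want to handle it, not something the original snippet does.

```python
def normalize_type_name(type_string):
    """Strip whitespace and a trailing '()' so that
    'StringType()' and 'StringType' normalize to the same name."""
    name = type_string.strip()
    if name.endswith("()"):
        name = name[:-2]
    return name


def types_match(found, expected):
    """Compare two type strings after normalization."""
    return normalize_type_name(found) == normalize_type_name(expected)


# Same type, written with and without parentheses.
print(types_match("StringType", "StringType()"))
# Genuinely different types still fail the check.
print(types_match("DateType()", "StringType()"))
```

Parameterized types such as `DecimalType(10,2)` do not end in `()` and are left unchanged by this helper, so they are still compared as exact strings.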

In [0]:
for field in df.schema.fields:
    expected_type = expected_schema.get(field.name)
    print(expected_type)

StringType()
StringType()
IntegerType()
StringType()
StringType()
IntegerType()
IntegerType()
StringType()
IntegerType()
StringType()
IntegerType()
None
StringType()


     expected_schema.get(field.name)

     expected_schema.get("start_date") => "StringType()"
     expected_schema.get("product_url") => "StringType()"
     expected_schema.get("category") => "IntegerType()"
     expected_schema.get("default_group") => "StringType()"
     expected_schema.get("source_target") => "StringType()"
     expected_schema.get("cloud_flatform") => "IntegerType()"
     expected_schema.get("session_id") => "IntegerType()"
     expected_schema.get("session_name") => "StringType()"
     expected_schema.get("status_name") => "IntegerType()"
     expected_schema.get("status_type") => "StringType()"
     expected_schema.get("sessions") => "IntegerType()"
     expected_schema.get("product_id") => None   (key not in expected_schema)
     expected_schema.get("load datetime") => "StringType()"
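
`dict.get` returns `None` for a missing key instead of raising `KeyError`, which is why `product_id` ends up in the "Extra column" branch rather than stopping the loop. A small illustration in plain Python, using a shortened, hypothetical copy of the dictionary:

```python
expected_schema = {"start_date": "StringType()", "sessions": "IntegerType()"}

# Key present: .get() returns the stored value.
found = expected_schema.get("start_date")

# Key absent: .get() returns None, or the supplied default; no exception.
absent = expected_schema.get("product_id")
with_default = expected_schema.get("product_id", "<not expected>")

# Square-bracket lookup raises KeyError for the same missing key.
try:
    expected_schema["product_id"]
    raised = False
except KeyError:
    raised = True

print(found, absent, with_default, raised)
```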