**How to verify schema mismatches between two tables?**
- This notebook shows how to find which **columns are present** in **source_tbl but missing in target_tbl**.
- The check is useful when **comparing two DataFrames**, for example during **ETL validation or schema comparison**.

**Scenario:**
- You are validating **schema consistency** before loading data from one system to another:

  - **Source table:** Data extracted from a production database.
  - **Target table:** Destination schema in a data warehouse.

- If your **source table** has **additional columns** such as **'created_at', 'updated_at', or 'is_active'** that aren’t in the **target schema**, this check **identifies** them before you attempt to **merge or insert** data, which prevents load errors caused by incompatible schemas.

In [0]:
source_tbl = spark.read.csv("/Volumes/@azureadb/pyspark/unionby/company_level.csv", header=True, inferSchema=True)
display(source_tbl.limit(5))
print("Column Names: ", source_tbl.columns)
print("No of Columns: ", len(source_tbl.columns))

start_date,product_url,category,default_group,source_target,cloud_flatform,session_id,session_name,status_name,status_type,sessions,product_id,load datetime
2025-08-25,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876543,first_visit,first_visit,Not Available,5,409516064,2025-09-02T19:10:35
2025-08-26,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876544,purchase,organic,Not Available,12,409516064,2025-09-02T19:10:36
2025-08-27,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876545,search,network,Not Available,16,409516064,2025-09-02T19:10:37
2025-08-28,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876546,search,scroll,Not Available,22,409516064,2025-09-02T19:10:38
2025-08-29,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876547,search,organic,Not Available,25,409516064,2025-09-02T19:10:39


Column Names:  ['start_date', 'product_url', 'category', 'default_group', 'source_target', 'cloud_flatform', 'session_id', 'session_name', 'status_name', 'status_type', 'sessions', 'product_id', 'load datetime']
No of Columns:  13


In [0]:
target_tbl = spark.read.csv("/Volumes/@azureadb/pyspark/unionby/device_level.csv", header=True, inferSchema=True)
display(target_tbl.limit(5))
print("Column Names: ", target_tbl.columns)
print("No of Columns: ", len(target_tbl.columns))

start_date,product_url,category,default_group,source_target,cloud_flatform,status_name,status_type,sessions,product_id,load datetime
2025-08-25,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,first_visit,Not Available,55,409516064,2025-09-02T19:10:35
2025-08-26,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,organic,Not Available,12,409516064,2025-09-02T19:10:36
2025-08-27,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,network,Not Available,16,409516064,2025-09-02T19:10:37
2025-08-28,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,scroll,Not Available,22,409516064,2025-09-02T19:10:38
2025-08-29,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,organic,Not Available,25,409516064,2025-09-02T19:10:39


Column Names:  ['start_date', 'product_url', 'category', 'default_group', 'source_target', 'cloud_flatform', 'status_name', 'status_type', 'sessions', 'product_id', 'load datetime']
No of Columns:  11


**1) Source Has Extra Columns**

     source_tbl.columns = ['id', 'name', 'age', 'city']
     target_tbl.columns = ['id', 'name']
     
     Output:
     Missing columns in Target: {'age', 'city'}

| source_tbl_columns | target_tbl_columns |
|--------------------|--------------------|
|  start_date        |   start_date       |
|  product_url       |   product_url      |
|  category          |   category         | 
|  default_group     |   default_group    |
|  source_target     |   source_target    |
|  cloud_flatform    |   cloud_flatform   |
|  session_id        |                    |
|  session_name      |                    |
|  status_name       |   status_name      |
|  status_type       |   status_type      |
|  sessions          |   sessions         |
|  product_id        |   product_id       |
|  load datetime     |   load datetime    |

In [0]:
missing_cols_in_target = set(source_tbl.columns) - set(target_tbl.columns)
print(f"Missing columns in Target: {missing_cols_in_target}") 

Missing columns in Target: {'session_name', 'session_id'}


- **session_name, session_id** are present in **source_tbl** but **not in target_tbl**.
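
A Python set does not keep insertion order, which is why the printed order of the missing columns can differ from the order in the source table. If the order matters, a list comprehension is one way to keep it. This is a minimal sketch: the column lists are copied from the outputs above so that it runs without Spark, and the name `missing_in_order` is illustrative. In the notebook you would pass `source_tbl.columns` and `target_tbl.columns` instead.

```python
# Column lists copied from the notebook outputs; in the notebook itself,
# use source_tbl.columns and target_tbl.columns instead of these literals.
source_cols = ['start_date', 'product_url', 'category', 'default_group',
               'source_target', 'cloud_flatform', 'session_id', 'session_name',
               'status_name', 'status_type', 'sessions', 'product_id',
               'load datetime']
target_cols = ['start_date', 'product_url', 'category', 'default_group',
               'source_target', 'cloud_flatform', 'status_name', 'status_type',
               'sessions', 'product_id', 'load datetime']

# Build the set once so each membership test is cheap.
target_set = set(target_cols)

# Keep the source order while filtering out columns the target already has.
missing_in_order = [c for c in source_cols if c not in target_set]
print(f"Missing columns in Target (source order): {missing_in_order}")
```

With these lists the result is `['session_id', 'session_name']`, in the same order as the source table.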

In [0]:
missing_cols_in_source = set(target_tbl.columns) - set(source_tbl.columns)
print(f"Missing columns in Source: {missing_cols_in_source}") 

Missing columns in Source: set()


| target_tbl_columns | source_tbl_columns |
|--------------------|--------------------|
|  start_date        |   start_date       |
|  product_url       |   product_url      |
|  category          |   category         | 
|  default_group     |   default_group    |
|  source_target     |   source_target    |
|  cloud_flatform    |   cloud_flatform   |
|                    |   session_id       |
|                    |   session_name     |
|  status_name       |   status_name      |
|  status_type       |   status_type      |
|  sessions          |   sessions         |
|  product_id        |   product_id       |
|  load datetime     |   load datetime    |

In [0]:
def findMissingColumns(source_tbl, target_tbl):
    """Return the set of column names present in source_tbl but absent from target_tbl."""
    missing_cols = set(source_tbl.columns) - set(target_tbl.columns)
    return missing_cols

In [0]:
findMissingColumns(source_tbl, target_tbl)

{'session_id', 'session_name'}

In [0]:
findMissingColumns(target_tbl, source_tbl)

set()
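
Because the point of the check is to stop a load before it fails, the same comparison can be wrapped in a guard that raises an error when columns are missing. The sketch below is one possible shape, not part of the original notebook: `require_columns` is a hypothetical helper name, and it takes plain lists of column names so that it runs without Spark.

```python
def require_columns(source_cols, target_cols):
    """Raise ValueError if any source column is absent from the target.

    Hypothetical helper for illustration; pass source_tbl.columns and
    target_tbl.columns when calling it from the notebook.
    """
    target_set = set(target_cols)
    # A list keeps the source order, which makes the error easier to read.
    missing = [c for c in source_cols if c not in target_set]
    if missing:
        raise ValueError(f"Target is missing columns: {missing}")


# Illustrative column lists, matching the first example in this notebook.
try:
    require_columns(["id", "name", "age", "city"], ["id", "name"])
except ValueError as err:
    message = str(err)
    print(message)
```

Calling the guard just before the merge or insert step turns a schema mismatch into an explicit, readable error.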

**2) Target Has Extra Columns**

     source_tbl.columns = ["x", "y"]
     target_tbl.columns = ["x", "y", "z"]

     missing_cols_in_target = set(source_tbl.columns) - set(target_tbl.columns)
     print(f"Missing columns in Target: {missing_cols_in_target}")

     Missing columns in Target: set()

- This code finds which **columns are present** in **source_tbl but missing in target_tbl**.

- The **target** has an **extra column 'z'**, but the set difference only reports **columns missing in the target**, so **'z' is not reported**. The reverse check, `set(target_tbl.columns) - set(source_tbl.columns)`, is needed to find it.
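
To catch the extra column in this case, the operands can be swapped. The symmetric difference (`^`) goes one step further and reports columns that exist on only one side, whichever side that is. A minimal sketch using the same illustrative column names as above; the variable names `extra_in_target` and `any_mismatch` are assumptions for this example.

```python
# Illustrative column lists from case 2.
source_cols = ["x", "y"]
target_cols = ["x", "y", "z"]

# Reverse difference: columns in the target that the source does not have.
extra_in_target = set(target_cols) - set(source_cols)

# Symmetric difference: columns present on exactly one side.
any_mismatch = set(source_cols) ^ set(target_cols)

print(f"Extra columns in Target: {extra_in_target}")
print(f"Columns present on only one side: {any_mismatch}")
```

Both results are `{'z'}` here. The symmetric difference is convenient as a single pass/fail test, but it does not say which table the column belongs to.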


**3) No Missing Columns**

     source_tbl.columns = ["a", "b", "c"]
     target_tbl.columns = ["a", "b", "c"]

     missing_cols_in_target = set(source_tbl.columns) - set(target_tbl.columns)
     print(f"Missing columns in Target: {missing_cols_in_target}")

     Missing columns in Target: set()

**Both have the same columns, so the result is an empty set.**
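
Matching column names do not guarantee matching schemas, because the same column can have a different data type on each side. In PySpark, `DataFrame.dtypes` returns a list of `(column_name, type_string)` tuples, which can be compared in the same spirit. The sketch below is an assumption-laden illustration: `find_type_mismatches` is a hypothetical helper, and the dtypes are made-up values rather than output from the tables above.

```python
def find_type_mismatches(source_dtypes, target_dtypes):
    """Return {column: (source_type, target_type)} for shared columns whose types differ.

    Hypothetical helper; in the notebook, pass source_tbl.dtypes and
    target_tbl.dtypes, which are lists of (name, type) tuples.
    """
    target_types = dict(target_dtypes)
    return {
        name: (src_type, target_types[name])
        for name, src_type in source_dtypes
        if name in target_types and src_type != target_types[name]
    }


# Made-up dtypes for illustration only.
source_dtypes = [("a", "int"), ("b", "string"), ("c", "date")]
target_dtypes = [("a", "bigint"), ("b", "string"), ("c", "date")]

type_mismatches = find_type_mismatches(source_dtypes, target_dtypes)
print(f"Type mismatches: {type_mismatches}")
```

Here the column names are identical, yet column `a` is reported because its type is `int` in the source and `bigint` in the target.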
