#### UNIT TEST
- **How do you check whether a column exists in a DataFrame's schema?**

In [0]:
df = spark.read.csv("/Volumes/@azureadb/pyspark/unionby/company_level.csv", header=True, inferSchema=True)
display(df.limit(5))
print("List of Columns in df: ", df.columns)

start_date,product_url,category,default_group,source_target,cloud_flatform,session_id,session_name,status_name,status_type,sessions,product_id,load datetime
2025-08-25,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876543,first_visit,first_visit,Not Available,5,409516064,2025-09-02T19:10:35
2025-08-26,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876544,purchase,organic,Not Available,12,409516064,2025-09-02T19:10:36
2025-08-27,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876545,search,network,Not Available,16,409516064,2025-09-02T19:10:37
2025-08-28,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876546,search,scroll,Not Available,22,409516064,2025-09-02T19:10:38
2025-08-29,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876547,search,organic,Not Available,25,409516064,2025-09-02T19:10:39


List of Columns in df:  ['start_date', 'product_url', 'category', 'default_group', 'source_target', 'cloud_flatform', 'session_id', 'session_name', 'status_name', 'status_type', 'sessions', 'product_id', 'load datetime']


**1) Check if a single column exists using the in operator**

In [0]:
display('cloud_flatform' in df.columns)

True

In [0]:
display('region' in df.columns)

False

In [0]:
display('country' in df.columns)

False

In [0]:
display('sales_organisation_id' in df.columns)

False

In [0]:
if "source_target" in df.columns:
    print("\nColumn 'source_target' is available in df")
else:
    print("\nColumn 'source_target' is not available in df")


Column 'source_target' is available in df


In [0]:
if "sales_id" in df.columns:
    print("\nColumn 'sales_id' is available in df")
else:
    print("\nColumn 'sales_id' is not available in df")


Column 'sales_id' is not available in df


- The simplest method is to use the **in** operator on **df.columns**.
- **df.columns** returns a **list of column names** (e.g. **['start_date', 'product_url', 'category', ...]**).
- The **in** operator checks whether your **column name** is in that **list**.
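
One caveat worth knowing: **in** performs an exact, case-sensitive string comparison, whereas Spark by default resolves column names case-insensitively (`spark.sql.caseSensitive` is `false` unless changed). The sketch below shows one possible way to handle both cases. The helper name `has_column` and its `case_sensitive` parameter are hypothetical, and the `columns` list is a plain-Python stand-in for **df.columns**.

```python
def has_column(columns, name, case_sensitive=True):
    """Return True if `name` is among `columns` (a list such as df.columns)."""
    if case_sensitive:
        return name in columns
    # Compare lowercase forms so 'Category' matches 'category'.
    lowered = {c.lower() for c in columns}
    return name.lower() in lowered


# Stand-in for df.columns, using a few names from the DataFrame above.
columns = ["start_date", "product_url", "category", "load datetime"]

print(has_column(columns, "category"))                        # True
print(has_column(columns, "Category"))                        # False
print(has_column(columns, "Category", case_sensitive=False))  # True
```

With a real DataFrame, the call would be `has_column(df.columns, "category")`.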

**2) Check if multiple columns exist**

In [0]:
required_cols = ["region", "country", "sales_organisation_id"]

columns_exist = all(col in df.columns for col in required_cols)
display(columns_exist)

False

- **col in df.columns** → returns **True** if that column **exists**, otherwise **False**.
- **all()** → returns **True** only **if all the conditions** inside are **True**.
- ✅ **columns_exist = True** → if **all required columns exist** in the DataFrame.
- ❌ **columns_exist = False** → if **any one** of them is **missing**.
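
The same check can also be expressed with sets, which some readers find easier to scan. This is a minimal sketch, not a replacement for the **all()** version: `columns` is a plain-Python stand-in for **df.columns**, and the set difference loses the original order, so the result is sorted for stable output.

```python
# Stand-in for df.columns, using a few names from the DataFrame above.
columns = ["start_date", "product_url", "category", "session_name", "product_id"]

required = {"region", "country", "session_name"}

# issubset() is True only when every required name appears in columns.
columns_exist = required.issubset(columns)

# Set difference gives the names that are required but absent.
missing = sorted(required - set(columns))

print(columns_exist)  # False
print(missing)        # ['country', 'region']
```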

In [0]:
required_cols = ["product_url", "product_id", "session_name"]

if all(col in df.columns for col in required_cols):
    print("All required columns exist in DataFrame")
else:
    missing = [col for col in required_cols if col not in df.columns]
    print(f"Missing columns: {missing}")

All required columns exist in DataFrame


In [0]:
required_cols = ["region", "country", "sales_organisation_id"]

if all(col in df.columns for col in required_cols):
    print("All required columns exist in DataFrame")
else:
    missing = [col for col in required_cols if col not in df.columns]
    print(f"Missing columns: {missing}")

Missing columns: ['region', 'country', 'sales_organisation_id']


In [0]:
required_cols = ["region", "country", "session_name"]

if all(col in df.columns for col in required_cols):
    print("All required columns exist in DataFrame")
else:
    missing = [col for col in required_cols if col not in df.columns]
    print(f"Missing columns: {missing}")

Missing columns: ['region', 'country']


     if all(col in df.columns for col in required_cols):
- **df.columns** returns a **list of all column names** in the PySpark DataFrame df.

      Example: ["region", "country", "sales_organisation_id", "sales_amount"]

- **col in df.columns** checks whether **each column name** from **required_cols** is present in that **list**.
- The **all()** function returns:
  - **True:** if every required **column** is **present**.
  - **False:** if any **column** is **missing**.

     else:
         missing = [col for col in required_cols if col not in df.columns]
         print(f"Missing columns: {missing}")

- This creates a **new list** called **missing**.
- It **collects all columns** from **required_cols** that are **not found in df.columns**.


In [0]:
columns_to_check = ['ObjectID', 'ID', 'Name', 'Date', 'contract_source', 'currency_code', 'refunded_amount', 'start_date', 'product_url', 'category', 'default_group', 'source_target', 'cloud_flatform', 'session_id', 'session_name', 'status_name', 'status_type', 'sessions', 'product_id', 'load datetime']

# List all columns from columns_to_check that are NOT present in df.columns
[col for col in columns_to_check if col not in df.columns]

['ObjectID',
 'ID',
 'Name',
 'Date',
 'contract_source',
 'currency_code',
 'refunded_amount']

**3) Using DataFrame schema (if you need to inspect types)**
- If you also want to verify **column names** and their **data types**, use **df.schema.fields**.
- This is useful when you’re working with **nested or complex schemas**.
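
When the type matters as well as the name, one lightweight option is **df.dtypes**, which returns a list of `(column_name, type_string)` pairs such as `('session_id', 'int')`. The sketch below assumes that shape. The helper name `column_has_type` is hypothetical, and the `dtypes` list is a hand-written stand-in for part of this DataFrame's **df.dtypes**.

```python
def column_has_type(dtypes, name, expected_type):
    """Check that `name` exists in `dtypes` and has the expected type string.

    `dtypes` is a list of (column_name, type_string) pairs, the shape
    returned by df.dtypes.
    """
    # A missing column yields None, which never equals a type string.
    return dict(dtypes).get(name) == expected_type


# Stand-in for df.dtypes, covering a few columns of this DataFrame.
dtypes = [("start_date", "date"), ("category", "string"), ("session_id", "int")]

print(column_has_type(dtypes, "session_id", "int"))     # True
print(column_has_type(dtypes, "session_id", "string"))  # False
print(column_has_type(dtypes, "region", "string"))      # False
```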

In [0]:
df.schema.fields

[StructField('start_date', DateType(), True),
 StructField('product_url', StringType(), True),
 StructField('category', StringType(), True),
 StructField('default_group', StringType(), True),
 StructField('source_target', StringType(), True),
 StructField('cloud_flatform', StringType(), True),
 StructField('session_id', IntegerType(), True),
 StructField('session_name', StringType(), True),
 StructField('status_name', StringType(), True),
 StructField('status_type', StringType(), True),
 StructField('sessions', IntegerType(), True),
 StructField('product_id', IntegerType(), True),
 StructField('load datetime', StringType(), True)]

In [0]:
if any(field.name == "category" for field in df.schema.fields):
    print("\nColumn 'category' is available in df.schema.fields")
else:
    print("\nColumn 'category' is not available in df.schema.fields")


Column 'category' is available in df.schema.fields


In [0]:
columns_to_check = ['status_name', 'status_type']

for column in columns_to_check:
    if any(field.name == column for field in df.schema.fields):
        print(f"Column '{column}' is available in df.schema")
    else:
        print(f"Column '{column}' is not available in df.schema")

Column 'status_name' is available in df.schema
Column 'status_type' is available in df.schema
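
Note that **df.columns** and **df.schema.fields** list only top-level fields, so a field inside a struct column is not found by the checks above. The sketch below shows one possible way to collect dotted names for nested fields. It walks the dictionary form of a schema as produced by `df.schema.jsonValue()`. The function name `flatten_field_names` and the nested example schema are hypothetical (the DataFrame in this notebook is flat).

```python
def flatten_field_names(schema_json, prefix=""):
    """Yield dotted names for every field, descending into nested structs."""
    for field in schema_json["fields"]:
        full_name = f"{prefix}{field['name']}"
        yield full_name
        field_type = field["type"]
        # Simple types are plain strings; a nested struct is a dict.
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            yield from flatten_field_names(field_type, prefix=f"{full_name}.")


# Hypothetical nested schema (not the DataFrame above), written in the
# dictionary shape that df.schema.jsonValue() produces.
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "session_id", "type": "integer", "nullable": True, "metadata": {}},
        {
            "name": "device",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "os", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

names = list(flatten_field_names(schema_json))
print(names)                 # ['session_id', 'device', 'device.os']
print("device.os" in names)  # True
```

This sketch descends only into structs; arrays and maps of structs would need extra handling.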


In [0]:
columns_to_check = ['ObjectID', 'ID', 'Name', 'Date', 'contract_source', 'currency_code', 'refunded_amount', 'start_date', 'product_url', 'category', 'default_group', 'source_target', 'cloud_flatform', 'session_id', 'session_name', 'status_name', 'status_type', 'sessions', 'product_id', 'load datetime']

schema_field_names = [field.name for field in df.schema.fields]
non_available_columns = [col for col in columns_to_check if col not in schema_field_names]

print("Schema field names: \n", schema_field_names)
print("\nList all columns from columns_to_check that are NOT present in df.columns: \n", non_available_columns)
print("\nPrint Struct Field: \n", df.schema.fields)

if any(field.name == "sales_organisation_id" for field in df.schema.fields):
    print("\nColumn 'sales_organisation_id' is available in table_Schema.")
else:
    print("\nColumn 'sales_organisation_id' is not available in table_Schema.")

Schema field names: 
 ['start_date', 'product_url', 'category', 'default_group', 'source_target', 'cloud_flatform', 'session_id', 'session_name', 'status_name', 'status_type', 'sessions', 'product_id', 'load datetime']

List all columns from columns_to_check that are NOT present in df.columns: 
 ['ObjectID', 'ID', 'Name', 'Date', 'contract_source', 'currency_code', 'refunded_amount']

Print Struct Field: 
 [StructField('start_date', DateType(), True), StructField('product_url', StringType(), True), StructField('category', StringType(), True), StructField('default_group', StringType(), True), StructField('source_target', StringType(), True), StructField('cloud_flatform', StringType(), True), StructField('session_id', IntegerType(), True), StructField('session_name', StringType(), True), StructField('status_name', StringType(), True), StructField('status_type', StringType(), True), StructField('sessions', IntegerType(), True), StructField('product_id', IntegerType(), True), StructField('

**4) Using a function to reuse the check**

In [0]:
def check_columns(df, cols):
    missing = [col for col in cols if col not in df.columns]
    if missing:
        print(f"Missing columns: {missing}")
        return False
    print("All columns are present")
    return True

# Example usage
check_columns(df, ["region", "sales_organisation_id"])

Missing columns: ['region', 'sales_organisation_id']


False
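
In a unit test or at the start of a pipeline step, it is often more convenient for the check to fail loudly than to return **False**. The variant below is a minimal sketch of that idea. The name `require_columns` is hypothetical, and it takes a plain list of names (for example **df.columns**) so that it can be exercised without a Spark session.

```python
def require_columns(columns, required):
    """Raise ValueError listing any required names absent from `columns`."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")


# Stand-in for df.columns, using a few names from the DataFrame above.
columns = ["start_date", "sessions", "status_type", "product_id"]

# All names are present, so this call returns without raising.
require_columns(columns, ["sessions", "status_type"])

try:
    require_columns(columns, ["region", "sessions"])
except ValueError as err:
    print(err)  # Missing columns: ['region']
```

With a real DataFrame, the call would be `require_columns(df.columns, [...])`.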

**5) Check and safely select only existing columns**

- Use this when you are not sure **all columns exist** and want to **select only the valid ones**.

In [0]:
safe_cols = [col for col in ["region", "sessions", "invalid_col", "status_type"] if col in df.columns]
df.select(*safe_cols).display()

sessions,status_type
5,Not Available
12,Not Available
16,Not Available
22,Not Available
25,Not Available
4,Not Available
9,Not Available
8,Not Available
7,Not Available
6,Not Available
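
Silently dropping columns can hide upstream problems, so it may help to keep track of what was skipped. The sketch below splits the requested names into two lists. The helper name `split_columns` is hypothetical, and `columns` is a plain-Python stand-in for **df.columns**.

```python
def split_columns(columns, wanted):
    """Split `wanted` into (present, skipped), preserving the requested order."""
    present = [c for c in wanted if c in columns]
    skipped = [c for c in wanted if c not in columns]
    return present, skipped


# Stand-in for df.columns, using a few names from the DataFrame above.
columns = ["sessions", "status_type", "product_id"]

present, skipped = split_columns(
    columns, ["region", "sessions", "invalid_col", "status_type"]
)

print(present)  # ['sessions', 'status_type']
print(skipped)  # ['region', 'invalid_col']

# With a real DataFrame, the selection would then be: df.select(*present)
```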


| Task                       | Method                                                   | Notes                             |
| -------------------------- | -------------------------------------------------------- | --------------------------------- |
| Check one column           | `"col" in df.columns`                                    | ✅ Simple and fast                 |
| Check multiple columns     | `all(col in df.columns for col in cols)`                 | ✅ True only if none are missing   |
| Check with schema info     | `any(field.name == "col" for field in df.schema.fields)` | ✅ Fields also carry type info     |
| Automatically skip missing | `[c for c in cols if c in df.columns]`                   | ✅ Safe selection                  |