Implementing Type 2 handling for a Slowly Changing Dimension (SCD) is fairly complex. In Type 2, a new record is inserted with the latest values and the previous record is marked as no longer current. To keep track of the validity of records, 3 additional columns are used: effective_date, expiration_date and current_flag.

When a new record is inserted, effective_date is set to current_date, expiration_date to '9999-12-31', and current_flag to True. If a record is deleted in the source data, its expiration_date is set to current_date and current_flag to False. If a record is updated, the record holding the old values gets expiration_date set to current_date and current_flag set to False; at the same time, a new record is inserted with effective_date as current_date, expiration_date as '9999-12-31', and current_flag as True. It is possible to handle SCD Type 2 with only 2 columns, i.e. effective_date and expiration_date, but filtering records on a boolean column is easier, so I included current_flag as well. Because the business key repeats across versions of the same record, a surrogate key is created to keep a unique key column. It is used as the foreign key in fact tables and maintains the link between the fact and dimension tables.
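
The lifecycle above can be sketched in plain Python, without Spark. This is only a minimal illustration, not the notebook's implementation: the helper names `open_record`, `close_record` and `as_of` are hypothetical, the dates are made up, and it assumes a half-open validity interval (a version is valid from effective_date up to, but not including, expiration_date). It also shows how a point-in-time lookup works with only the two date columns.

```python
from datetime import date

EOW_DATE = date(9999, 12, 31)  # open-ended expiration date


def open_record(surrogate_key, attrs, today):
    # New version: valid from today, no known end, flagged as current
    return {"sk": surrogate_key, **attrs,
            "effective_date": today, "expiration_date": EOW_DATE, "current_flag": True}


def close_record(record, today):
    # Old version: keep the values, end its validity today, unset the flag
    return {**record, "expiration_date": today, "current_flag": False}


def as_of(records, business_key, when):
    # Point-in-time lookup using only effective_date and expiration_date
    for r in records:
        if r["customerid"] == business_key and r["effective_date"] <= when < r["expiration_date"]:
            return r
    return None


# Customer 2 moves from Citi to JPMC on 2024-06-01 (assumed dates)
v1 = open_record(1, {"customerid": 2, "company": "Citi"}, date(2024, 1, 1))
v1_closed = close_record(v1, date(2024, 6, 1))
v2 = open_record(2, {"customerid": 2, "company": "JPMC"}, date(2024, 6, 1))
history = [v1_closed, v2]

print(as_of(history, 2, date(2024, 3, 15))["company"])  # Citi
print(as_of(history, 2, date(2024, 6, 1))["company"])   # JPMC
```

Note that the two versions carry different surrogate keys (`sk`) while sharing the same business key, which is why fact tables reference the surrogate key.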

The implementation of SCD Type 2 using PySpark follows these steps:

`Checking Columns Presence`: Verify that all columns from the target DataFrame are present in the source DataFrame.

`Applying Hash Calculation`: Calculate a hash value based on the target columns to identify changes in data.

`Identifying New Records`: Perform a left anti-join to identify new records in the source DataFrame that do not exist in the target DataFrame.

`Performing Left Join`: Join the source and target DataFrames using a left join on the specified join keys.

`Filtering Records`: Filter the joined DataFrame to identify unchanged, updated, and obsolete records based on hash values and join keys.

`Handling New Records`: Create new records for new entries in the source DataFrame and assign appropriate SCD2 metadata.

`Handling Updated Records`: Create a new version for each changed record using the new values from the source DataFrame and assign appropriate SCD2 metadata.

`Handling Obsolete Records`: Flag obsolete records in the target DataFrame with an end date and set a flag indicating their status.

`Combining DataFrames`: Combine the new, updated, obsolete, and unchanged DataFrames to generate the final result DataFrame.
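
The anti-join, left join and hash comparison steps can be sketched in plain Python over small lists of dicts. This is a rough illustration under assumptions, not the Spark code: `row_hash` and `classify` are hypothetical names, the sample rows are invented, and a key that exists only in the target is treated as unchanged.

```python
import hashlib


def row_hash(row, cols):
    # Hash the tracked columns, joined with a separator
    joined = "|".join(str(row[c]) for c in cols)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def classify(source_rows, target_rows, key, tracked_cols):
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}

    # Left anti-join: keys in source that are absent from target
    new = [src[k] for k in src if k not in tgt]

    # Left join target -> source, then split on hash equality
    unchanged, updated = [], []
    for k, old in tgt.items():
        cur = src.get(k)
        if cur is None or row_hash(cur, tracked_cols) == row_hash(old, tracked_cols):
            unchanged.append(old)
        else:
            updated.append((old, cur))  # old version to close, new version to open
    return new, unchanged, updated


target = [{"id": 1, "company": "Amazon"}, {"id": 2, "company": "Citi"}, {"id": 6, "company": "Wipro"}]
source = [{"id": 1, "company": "Amazon"}, {"id": 2, "company": "JPMC"}, {"id": 11, "company": "Samsung"}]

new, unchanged, updated = classify(source, target, "id", ["company"])
print([r["id"] for r in new])        # [11]
print([r["id"] for r in unchanged])  # [1, 6]
print([(old["company"], cur["company"]) for old, cur in updated])  # [('Citi', 'JPMC')]
```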

In [None]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip


class SparkSessionManager:
    def __init__(self, app_name, spark_conf=None):
        self.app_name = app_name
        self.spark_conf = spark_conf if spark_conf else {}

    def create_session(self):
        # Create a SparkSession
        spark_builder = SparkSession.builder.appName(self.app_name) \
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

        # Apply Spark configuration (config() returns the builder, so reassign for clarity)
        for key, value in self.spark_conf.items():
            spark_builder = spark_builder.config(key, value)

        spark = configure_spark_with_delta_pip(spark_builder).getOrCreate()
        return spark

In [None]:
from pyspark.sql import DataFrame
from pyspark.sql.functions import concat_ws, md5, col, current_date, lit

from utils.logger import Logger
from utils.spark_session import SparkSessionManager


class SCDHandler:
    def __init__(self):
        self.spark = SparkSessionManager(self.__class__.__name__).create_session()
        self.logger = Logger(self.__class__.__name__)

    def check_columns_presence(self, source_df, target_df, metadata_cols):
        """
        Check if all columns from the target DataFrame are present in the source DataFrame.

        Args:
            source_df (pyspark.sql.DataFrame): Source DataFrame.
            target_df (pyspark.sql.DataFrame): Target DataFrame.
            metadata_cols (list): SCD2 metadata columns that the source is not expected to contain.

        Raises:
            Exception: If non-metadata columns are missing in the source DataFrame.

        Returns:
            None
        """
        cols_missing = {c for c in target_df.columns if c not in source_df.columns} - set(metadata_cols)
        if cols_missing:
            raise Exception(f"Cols missing in source DataFrame: {cols_missing}")

    def apply_hash_and_alias(self, source_df, target_df, metadata_cols) -> tuple[DataFrame, DataFrame]:
        """
        Apply hash calculation and alias to source and target DataFrames.

        Args:
            source_df (pyspark.sql.DataFrame): Source DataFrame.
            target_df (pyspark.sql.DataFrame): Target DataFrame.
            metadata_cols (list): List of metadata columns to exclude from hash calculation.

        Returns:
            tuple: Tuple containing aliased source DataFrame and aliased target DataFrame.
        """
        # Extract columns from target DataFrame excluding metadata columns
        tgt_cols = [x for x in target_df.columns if x not in metadata_cols]

        # Calculate hash expression
        hash_expr = md5(concat_ws("|", *[col(c) for c in tgt_cols]))

        # Apply hash calculation and alias to source and target DataFrames
        source_df = source_df.withColumn("hash_value", hash_expr).alias("source_df")
        target_df = target_df.withColumn("hash_value", hash_expr).alias("target_df")

        return source_df, target_df


    def scd_2(self, source_df, target_df, join_keys, metadata_cols=None) -> DataFrame:
        if metadata_cols is None:
            metadata_cols = ['eff_start_date', 'eff_end_date', 'flag']
        tgt_cols = [x for x in target_df.columns]
        self.check_columns_presence(source_df, target_df, metadata_cols)
        # Apply hash calculation and alias
        source_df, target_df = self.apply_hash_and_alias(source_df, target_df, metadata_cols)

        # Identify new records
        join_cond = [source_df[join_key] == target_df[join_key] for join_key in join_keys]
        new_df = source_df.join(target_df, join_cond, 'left_anti')

        base_df = target_df.join(source_df, join_cond, 'left')

        # Filter unchanged records or same records
        unchanged_filter_expr = " AND ".join([f"source_df.{key} IS NULL" for key in join_keys])
        unchanged_df = base_df.filter(f"({unchanged_filter_expr}) OR "
                                      f"(source_df.hash_value = target_df.hash_value)") \
            .select("target_df.*")

        # Identify updated records: key matched in source but hash differs
        delta_filter_expr = " AND ".join([f"source_df.{key} IS NOT NULL" for key in join_keys])
        updated_df = base_df.filter(f"({delta_filter_expr}) AND "
                                    f"source_df.hash_value != target_df.hash_value")

        # pick updated records from source_df for new entry
        updated_new_df = updated_df.select("source_df.*")

        # pick updated records from target_df for obsolete entry
        obsolete_df = updated_df.select("target_df.*") \
            .withColumn("eff_end_date", current_date()) \
            .withColumn("flag", lit(0))

        # union: new & updated records, then add SCD2 metadata
        delta_df = new_df.union(updated_new_df) \
            .withColumn("eff_start_date", current_date()) \
            .withColumn("eff_end_date", lit(None)) \
            .withColumn("flag", lit(1))

        # union all datasets : delta_df + obsolete_df + unchanged_df
        result_df = unchanged_df.select(tgt_cols). \
            unionByName(delta_df.select(tgt_cols)). \
            unionByName(obsolete_df.select(tgt_cols))

        return result_df

In [None]:
# Constants
DATE_FORMAT = "yyyy-MM-dd"
EOW_DATE = "9999-12-31"
KEY_LIST = ["customerid"]
type2_cols = ["CompanyName", "EmailAddress", "Phone", "ZipCode"]
scd2_cols = ["effective_date","expiration_date","current_flag"]


In [None]:
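from pyspark.sql import SparkSession
# Import only the functions used below instead of `*`, which shadows Python builtins
from pyspark.sql.functions import (
    col, concat_ws, current_date, date_format, lit, md5, row_number, when
)
from pyspark.sql.window import Window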

spark = SparkSession.builder.master("local[*]").appName("scd-type2-implementation").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

In [None]:
hist_customer_data = [
    (1,'Mr.','Manish','Agrwal','Amazon','manishk@amazone.com','+9198209371','411014'),
    (2,'Mr.','Vikash','Kumar','Citi','vikash@citi.com','+9198209372','411015'),
    (3,'Mr.','Shilpa','Sexsena','Infosys','shilpa.k@infy.com','+9198209372','411016'),
    (4,'Mr.','Rakesh','Dhaker','UBS','rd@ubs.com','+9198209372','411017'),
    (5,'Mr.','Ayush','Kapoor','Nexflix','ak@netflix.com','+9198209372','411018'),
    (6,'Mr.','Pritesh','Soni','Wipro','ps@wipro.com','+9198209372','411019'),
    (7,'Mr.','Manoj','Sain','HCL','ms@hcl.com','+9198209372','411010'),
    (8,'Ms','Nikita','Gangniak','Sony','ng@sony.com','+9198209372','411011'),
    (9,'Mr.','Shubham','Khurana','LG','sk@lg.com','+9198209372','411012'),
    (10,'Mr.','Omkar','Shrma','Samsung','omkar@samsung.com','+9198209372','411013')
]

# Modified the source data so that we can cover all scenarios.
# CustomerIDs 6 and 10 got deleted.
# New customer Rahul Jain is added with CustomerID 11.
# CustomerID 2 changed his company and email.
curr_customer_data = [
    (1,'Mr.','Manish','Agrwal','Amazon','manishk@amazone.com','+9198209371','411014'),
    (2,'Mr.','Vikash','Kumar','JPMC','vikash@jpmc.com','+9198209372','411015'),
    (3,'Ms.','Shilpa','Sexsena','Infosys','shilpa.k@infy.com','+9198209372','411016'),
    (4,'Mr.','Rakesh','Dhaker','UBS','rd@ubs.com','+9198209372','411017'),
    (5,'Mr.','Ayush','Kapoor','Nexflix','ak@netflix.com','+9198209372','411018'),
    (7,'Mr.','Manoj','Sain','HCL','ms@hcl.com','+9198209372','411010'),
    (8,'Ms.','Nikita','Gangniak','Sony','ng@sony.com','+9198209372','411011'),
    (9,'Mr.','Shubham','Khurana','LG','sk@lg.com','+9198209372','411012'),
    (11,'Mr.','Rahul','Jain','Samsung','rahul@samsung.com','+9198209372','411013')
]

# Column names shared by the history and current datasets
customer_schema = ['CustomerID','Title','FirstName','LastName','CompanyName','EmailAddress','Phone','ZipCode']


In [30]:
def column_renamer(df, suffix, append):
    """
    input:
        df: dataframe
        suffix: suffix to be appended to column name
        append: boolean value
                if true append suffix else remove suffix

    output:
        df: df with renamed columns
    """
    if append:
        new_column_names = [name + suffix for name in df.columns]
    else:
        # Strip the suffix only when it is at the end of the column name
        new_column_names = [name[:-len(suffix)] if suffix and name.endswith(suffix) else name
                            for name in df.columns]
    return df.toDF(*new_column_names)

def get_hash(df, keys_list):
    """
    input:
        df: dataframe
        keys_list: list of columns to be hashed
    output:
        df: df with hashed column
    """
    columns = [col(column) for column in keys_list]
    if columns:
        # Use a separator so ("ab", "c") and ("a", "bc") do not hash to the same value
        return df.withColumn("hash_md5", md5(concat_ws("|", *columns)))
    else:
        # md5 expects string/binary input, so hash a constant string
        return df.withColumn("hash_md5", md5(lit("1")))
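
Hashing concatenated columns has two pitfalls worth checking. Without a separator, values can shift between adjacent columns and still produce the same string, and `concat_ws` skips NULL values, so a NULL in one column can hide a change. The plain-Python sketch below mimics that behaviour with `hashlib`. The helpers `md5_concat` and `md5_null_safe` and the `<NULL>` token are hypothetical names used only for illustration.

```python
import hashlib


def md5_concat(values, sep):
    # Mimics md5(concat_ws(sep, ...)): None values are skipped
    joined = sep.join(str(v) for v in values if v is not None)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def md5_null_safe(values, sep="|", null_token="<NULL>"):
    # Replace None with a visible token so its position is preserved
    joined = sep.join(null_token if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Empty separator: different rows, same hash
print(md5_concat(["ab", "c"], "") == md5_concat(["a", "bc"], ""))    # True
# A separator fixes that case
print(md5_concat(["ab", "c"], "|") == md5_concat(["a", "bc"], "|"))  # False
# Skipped NULLs still collide even with a separator
print(md5_concat(["a", None, "b"], "|") == md5_concat(["a", "b", None], "|"))  # True
# A NULL placeholder keeps the rows distinct
print(md5_null_safe(["a", None, "b"]) == md5_null_safe(["a", "b", None]))      # False
```

In PySpark, one way to get the null-safe behaviour would be to wrap each column in `coalesce(col(c).cast("string"), lit("<NULL>"))` before passing it to `concat_ws`.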

#### Current Data

In [31]:
# From the next run onward we need to compare current data with history data
# e.g. comparing today's data with yesterday's data

# Create df_current from the in-memory sample data
df_current = spark.createDataFrame(data=curr_customer_data, schema=customer_schema)
df_current.show()


+----------+-----+---------+--------+-----------+-------------------+-----------+-------+
|CustomerID|Title|FirstName|LastName|CompanyName|       EmailAddress|      Phone|ZipCode|
+----------+-----+---------+--------+-----------+-------------------+-----------+-------+
|         1|  Mr.|   Manish|  Agrwal|     Amazon|manishk@amazone.com|+9198209371| 411014|
|         2|  Mr.|   Vikash|   Kumar|       JPMC|    vikash@jpmc.com|+9198209372| 411018|
|         3|  Ms.|   Shilpa| Sexsena|    Infosys|  shilpa.k@infy.com|+9198209372| 411016|
|         4|  Mr.|   Rakesh|  Dhaker|        UBS|         rd@ubs.com|+9198209372| 411017|
|         5|  Mr.|    Ayush|  Kapoor|    Nexflix|     ak@netflix.com|+9198209372| 411018|
|         7|  Mr.|    Manoj|    Sain|        HCL|         ms@hcl.com|+9198209372| 411010|
|         8|  Ms.|   Nikita|Gangniak|       Sony|        ng@sony.com|+9198209372| 411011|
|         9|  Mr.|  Shubham| Khurana|         LG|          sk@lg.com|+9198209372| 411012|
|        1

#### History Data

In [32]:
# During the first run, all records are loaded with effective_date = current_date(),
# expiration_date = "9999-12-31" and current_flag = True, as there is no previous day's data.
# sk_customer_id is the surrogate key for the dataframe.

window_spec = Window.orderBy("customerid")

df_history = spark.createDataFrame(data=hist_customer_data,schema=customer_schema)\
                .withColumn("sk_customer_id",row_number().over(window_spec))\
                .withColumn("effective_date",date_format(current_date(), DATE_FORMAT))\
                .withColumn("expiration_date",date_format(lit(EOW_DATE), DATE_FORMAT))\
                .withColumn("current_flag", lit(True))


df_history.show()

25/10/17 09:10:57 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+----------+-----+---------+--------+-----------+-------------------+-----------+-------+--------------+--------------+---------------+------------+
|CustomerID|Title|FirstName|LastName|CompanyName|       EmailAddress|      Phone|ZipCode|sk_customer_id|effective_date|expiration_date|current_flag|
+----------+-----+---------+--------+-----------+-------------------+-----------+-------+--------------+--------------+---------------+------------+
|         1|  Mr.|   Manish|  Agrwal|     Amazon|manishk@amazone.com|+9198209371| 411014|             1|    2025-10-17|     9999-12-31|        true|
|         2|  Mr.|   Vikash|   Kumar|       Citi|    vikash@citi.com|+9198209372| 411015|             2|    2025-10-17|     9999-12-31|        true|
|         3|  Mr.|   Shilpa| Sexsena|    Infosys|  shilpa.k@infy.com|+9198209372| 411016|             3|    2025-10-17|     9999-12-31|        true|
|         4|  Mr.|   Rakesh|  Dhaker|        UBS|         rd@ubs.com|+9198209372| 411017|             4|  

In [33]:
# Find the max value of the surrogate key in df_history
# It will be used to create surrogate keys for new and updated records
max_sk = df_history.agg({"sk_customer_id": "max"}).collect()[0][0]

# filter out open records from df_history
# we don't need to do any changes in closed records
df_history_open = df_history.where(col("current_flag"))
df_history_closed = df_history.where(col("current_flag")==lit(False))

# Generate hash for type2 columns and rename column names 
# with _history and _current as suffix 
df_history_open_hash = column_renamer(get_hash(df_history_open, type2_cols), suffix="_history", append=True)
df_current_hash = column_renamer(get_hash(df_current, type2_cols), suffix="_current", append=True)

# Apply full outer join to history_open and current dataframes
# Create a new column which will be used to flag records (conditions are checked in this order)
# 1. If hash_md5_current & hash_md5_history are same then NOCHANGE
# 2. If CustomerID_current is null then DELETE
# 3. If CustomerID_history is null then INSERT
# 4. Else UPDATE
df_merged = df_history_open_hash\
            .join(df_current_hash, col("CustomerID_current") == col("CustomerID_history"), how="full_outer")\
            .withColumn("Action", when(col("hash_md5_current") == col("hash_md5_history"), 'NOCHANGE')\
                                  .when(col("CustomerID_current").isNull(), 'DELETE')\
                                  .when(col("CustomerID_history").isNull(), 'INSERT')\
                                  .otherwise('UPDATE'))

df_merged.show()



+------------------+-------------+-----------------+----------------+-------------------+--------------------+-------------+---------------+----------------------+----------------------+-----------------------+--------------------+--------------------+------------------+-------------+-----------------+----------------+-------------------+--------------------+-------------+---------------+--------------------+--------+
|CustomerID_history|Title_history|FirstName_history|LastName_history|CompanyName_history|EmailAddress_history|Phone_history|ZipCode_history|sk_customer_id_history|effective_date_history|expiration_date_history|current_flag_history|    hash_md5_history|CustomerID_current|Title_current|FirstName_current|LastName_current|CompanyName_current|EmailAddress_current|Phone_current|ZipCode_current|    hash_md5_current|  Action|
+------------------+-------------+-----------------+----------------+-------------------+--------------------+-------------+---------------+----------------
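
The order of the conditions in the `Action` column matters because a full outer join leaves NULLs on the missing side, and a comparison involving NULL never matches the first branch. A minimal plain-Python sketch of the same decision order follows. `None` stands in for NULL, the `action` helper is a hypothetical name, and company names stand in for the hash values.

```python
def action(hash_history, hash_current):
    # A NULL on either side never counts as equal, so guard against None first
    if hash_history is not None and hash_history == hash_current:
        return "NOCHANGE"
    if hash_current is None:   # key missing from current data
        return "DELETE"
    if hash_history is None:   # key missing from history
        return "INSERT"
    return "UPDATE"            # key on both sides, values differ


history = {2: "Citi", 6: "Wipro", 9: "LG"}
current = {2: "JPMC", 9: "LG", 11: "Samsung"}

# dict.get returns None for a missing key, like the NULL side of a full outer join
labels = {k: action(history.get(k), current.get(k))
          for k in sorted(history.keys() | current.keys())}
print(labels)  # {2: 'UPDATE', 6: 'DELETE', 9: 'NOCHANGE', 11: 'INSERT'}
```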



In [34]:
window_spec  = Window.orderBy("customerid")

# Filter records with action NOCHANGE, remove suffix '_history' from column names,
# and select columns same as df_history_open
df_nochange = column_renamer(df_merged.filter(col("action") == 'NOCHANGE'), suffix="_history", append=False)\
                .select(df_history_open.columns)

df_nochange.show()



+----------+-----+---------+--------+-----------+-------------------+-----------+-------+--------------+--------------+---------------+------------+
|CustomerID|Title|FirstName|LastName|CompanyName|       EmailAddress|      Phone|ZipCode|sk_customer_id|effective_date|expiration_date|current_flag|
+----------+-----+---------+--------+-----------+-------------------+-----------+-------+--------------+--------------+---------------+------------+
|         1|  Mr.|   Manish|  Agrwal|     Amazon|manishk@amazone.com|+9198209371| 411014|             1|    2025-10-17|     9999-12-31|        true|
|         3|  Mr.|   Shilpa| Sexsena|    Infosys|  shilpa.k@infy.com|+9198209372| 411016|             3|    2025-10-17|     9999-12-31|        true|
|         4|  Mr.|   Rakesh|  Dhaker|        UBS|         rd@ubs.com|+9198209372| 411017|             4|    2025-10-17|     9999-12-31|        true|
|         5|  Mr.|    Ayush|  Kapoor|    Nexflix|     ak@netflix.com|+9198209372| 411018|             5|  

In [35]:
# Filter records with action INSERT, remove suffix '_current' from column names,
# and select columns same as df_current
# add effective date as current_date, expiration date as EOW_DATE and current flag as True
# to create the surrogate key, row numbers are generated and added to the max_sk value
df_insert = column_renamer(df_merged.filter(col("action") == 'INSERT'), suffix="_current", append=False)\
                .select(df_current.columns)\
                .withColumn("effective_date",date_format(current_date(),DATE_FORMAT))\
                .withColumn("expiration_date",date_format(lit(EOW_DATE),DATE_FORMAT))\
                .withColumn("row_number",row_number().over(window_spec))\
                .withColumn("sk_customer_id",col("row_number")+ max_sk)\
                .withColumn("current_flag", lit(True))\
                .drop("row_number")

df_insert.show()



+----------+-----+---------+--------+-----------+-----------------+-----------+-------+--------------+---------------+--------------+------------+
|CustomerID|Title|FirstName|LastName|CompanyName|     EmailAddress|      Phone|ZipCode|effective_date|expiration_date|sk_customer_id|current_flag|
+----------+-----+---------+--------+-----------+-----------------+-----------+-------+--------------+---------------+--------------+------------+
|        11|  Mr.|    Rahul|    Jain|    Samsung|rahul@samsung.com|+9198209372| 411013|    2025-10-17|     9999-12-31|            11|        true|
+----------+-----+---------+--------+-----------+-----------------+-----------+-------+--------------+---------------+--------------+------------+
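
The surrogate key logic (row number plus the current maximum) can be sketched in plain Python as follows. The function names `assign_surrogate_keys` and `next_max_sk` are hypothetical. The second helper covers a run with no inserted rows, where taking the maximum of an empty set would otherwise yield nothing to add to.

```python
def assign_surrogate_keys(rows, max_sk, order_key="customerid", sk_col="sk_customer_id"):
    # Equivalent of row_number() over an ordering, offset by the current max key
    ordered = sorted(rows, key=lambda r: r[order_key])
    return [{**r, sk_col: max_sk + i} for i, r in enumerate(ordered, start=1)]


def next_max_sk(rows, previous_max, sk_col="sk_customer_id"):
    # Keep the previous maximum when no rows were added in this step
    return max((r[sk_col] for r in rows), default=previous_max)


inserted = assign_surrogate_keys([{"customerid": 11}], max_sk=10)
print(inserted)                   # [{'customerid': 11, 'sk_customer_id': 11}]
print(next_max_sk(inserted, 10))  # 11
print(next_max_sk([], 10))        # 10
```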





In [36]:
# max_sk value is updated using max value from df_insert
# (keep the previous max_sk when df_insert is empty and max() returns None)
max_sk = df_insert.agg({"sk_customer_id": "max"}).collect()[0][0] or max_sk

# Filter records with action DELETE, remove suffix '_history' from column names,
# and select columns same as df_history_open
# set expiration date to current_date and current_flag to false
df_deleted = column_renamer(df_merged.filter(col("action") == 'DELETE'), suffix="_history", append=False)\
                .select(df_history_open.columns)\
                .withColumn("expiration_date", date_format(current_date(),DATE_FORMAT))\
                .withColumn("current_flag", lit(False))

df_deleted.show()


+----------+-----+---------+--------+-----------+-----------------+-----------+-------+--------------+--------------+---------------+------------+
|CustomerID|Title|FirstName|LastName|CompanyName|     EmailAddress|      Phone|ZipCode|sk_customer_id|effective_date|expiration_date|current_flag|
+----------+-----+---------+--------+-----------+-----------------+-----------+-------+--------------+--------------+---------------+------------+
|         6|  Mr.|  Pritesh|    Soni|      Wipro|     ps@wipro.com|+9198209372| 411019|             6|    2025-10-17|     2025-10-17|       false|
|        10|  Mr.|    Omkar|   Shrma|    Samsung|omkar@samsung.com|+9198209372| 411013|            10|    2025-10-17|     2025-10-17|       false|
+----------+-----+---------+--------+-----------+-----------------+-----------+-------+--------------+--------------+---------------+------------+



In [37]:
# First part: filter records with action UPDATE, remove suffix '_history' from column names,
# and select columns same as df_history_open
# set expiration date to current_date and current_flag to false
#
# Second part: remove suffix '_current' and select columns same as df_current
# set effective_date as current_date and expiration_date as EOW_DATE
# current_flag as true
# use similar logic to create sequential surrogate keys for updated records
# union both parts into one dataframe
df_update = column_renamer(df_merged.filter(col("action") == 'UPDATE'), suffix="_history", append=False)\
                .select(df_history_open.columns)\
                .withColumn("expiration_date", date_format(current_date(),DATE_FORMAT))\
                .withColumn("current_flag", lit(False))\
            .unionByName(
            column_renamer(df_merged.filter(col("action") == 'UPDATE'), suffix="_current", append=False)\
                .select(df_current.columns)\
                .withColumn("effective_date",date_format(current_date(),DATE_FORMAT))\
                .withColumn("expiration_date",date_format(lit(EOW_DATE),DATE_FORMAT))\
                .withColumn("row_number",row_number().over(window_spec))\
                .withColumn("sk_customer_id",col("row_number")+ max_sk)\
                .withColumn("current_flag", lit(True))\
                .drop("row_number")
                )

df_update.show()


+----------+-----+---------+--------+-----------+---------------+-----------+-------+--------------+--------------+---------------+------------+
|CustomerID|Title|FirstName|LastName|CompanyName|   EmailAddress|      Phone|ZipCode|sk_customer_id|effective_date|expiration_date|current_flag|
+----------+-----+---------+--------+-----------+---------------+-----------+-------+--------------+--------------+---------------+------------+
|         2|  Mr.|   Vikash|   Kumar|       Citi|vikash@citi.com|+9198209372| 411015|             2|    2025-10-17|     2025-10-17|       false|
|         2|  Mr.|   Vikash|   Kumar|       JPMC|vikash@jpmc.com|+9198209372| 411018|            12|    2025-10-17|     9999-12-31|        true|
+----------+-----+---------+--------+-----------+---------------+-----------+-------+--------------+--------------+---------------+------------+




In [38]:
# Create the final dataframe as the union of all intermediate dataframes
df_final = df_history_closed\
            .unionByName(df_nochange)\
            .unionByName(df_insert)\
            .unionByName(df_deleted)\
            .unionByName(df_update)

df_final.show()


+----------+-----+---------+--------+-----------+-------------------+-----------+-------+--------------+--------------+---------------+------------+
|CustomerID|Title|FirstName|LastName|CompanyName|       EmailAddress|      Phone|ZipCode|sk_customer_id|effective_date|expiration_date|current_flag|
+----------+-----+---------+--------+-----------+-------------------+-----------+-------+--------------+--------------+---------------+------------+
|         1|  Mr.|   Manish|  Agrwal|     Amazon|manishk@amazone.com|+9198209371| 411014|             1|    2025-10-17|     9999-12-31|        true|
|         3|  Mr.|   Shilpa| Sexsena|    Infosys|  shilpa.k@infy.com|+9198209372| 411016|             3|    2025-10-17|     9999-12-31|        true|
|         4|  Mr.|   Rakesh|  Dhaker|        UBS|         rd@ubs.com|+9198209372| 411017|             4|    2025-10-17|     9999-12-31|        true|
|         5|  Mr.|    Ayush|  Kapoor|    Nexflix|     ak@netflix.com|+9198209372| 411018|             5|  
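
After the union it can be worth checking two SCD Type 2 invariants: surrogate keys are unique, and each business key has at most one current row. The plain-Python sketch below runs both checks over rows represented as dicts. `validate_scd2` is a hypothetical helper, and the sample rows keep only the three columns the checks need.

```python
from collections import Counter


def validate_scd2(rows, key="CustomerID", sk="sk_customer_id", flag="current_flag"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Surrogate keys must be unique across all versions
    sk_counts = Counter(r[sk] for r in rows)
    duplicate_sks = sorted(k for k, n in sk_counts.items() if n > 1)
    if duplicate_sks:
        problems.append(f"duplicate surrogate keys: {duplicate_sks}")

    # Each business key may have at most one current version
    current_counts = Counter(r[key] for r in rows if r[flag])
    multi_current = sorted(k for k, n in current_counts.items() if n > 1)
    if multi_current:
        problems.append(f"more than one current row for keys: {multi_current}")

    return problems


good = [
    {"CustomerID": 1, "sk_customer_id": 1, "current_flag": True},
    {"CustomerID": 2, "sk_customer_id": 2, "current_flag": False},
    {"CustomerID": 2, "sk_customer_id": 12, "current_flag": True},
]
bad = good + [{"CustomerID": 2, "sk_customer_id": 12, "current_flag": True}]

print(validate_scd2(good))  # []
print(validate_scd2(bad))
```

With a Spark DataFrame the same checks could be expressed as `groupBy` counts on `sk_customer_id` and on `CustomerID` filtered by `current_flag`.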

                                                                                