# SCD Type 1 Implementation in Spark

## What is SCD Type 1
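- SCD stands for Slowly Changing Dimension.
- In Type 1, a changed attribute overwrites the existing value, so the dimension holds only the latest version of each record and no history is kept.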
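
As a minimal sketch of the overwrite behaviour, in plain Python rather than Spark: the records below are hypothetical, and dictionaries keyed by emp_id stand in for the current data and the delta.

```python
# Hypothetical data: current dimension and incoming delta, keyed by emp_id
current = {
    1: {"emp_name": "Ann", "emp_city": "Pune"},
    2: {"emp_name": "Bob", "emp_city": "Delhi"},
}
delta = {
    2: {"emp_name": "Bob", "emp_city": "Mumbai"},    # changed record
    3: {"emp_name": "Cara", "emp_city": "Chennai"},  # new record
}

# SCD Type 1: delta values overwrite current values; no history is kept
merged = {**current, **delta}

for emp_id in sorted(merged):
    print(emp_id, merged[emp_id])
```

After the merge, the old city for emp_id 2 is gone, which is the defining trade-off of Type 1.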

## INNER JOIN
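- Inner join the two dataframes to find each “emp_id” that is in both employee.csv and employee_delta.csv. These are the updated records, so we keep the values from the delta file.

In [None]:
employees_df = spark.read.csv("/FileStore/tables/employee.csv", header="true", inferSchema="true")
employees_delta_df = spark.read.csv("/FileStore/tables/employee_delta.csv", header="true", inferSchema="true")

# Identify records that are in both with an inner join.
# Select only the delta columns so the schema matches the dataframes unioned later.
emp_updated = employees_df.join(employees_delta_df, employees_df.emp_id == employees_delta_df.emp_id, 'inner') \
  .select(employees_delta_df.emp_id, employees_delta_df.emp_name, employees_delta_df.emp_city, employees_delta_df.emp_salary)

emp_updated.show()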

## LEFT OUTER JOIN
- Left outer join to identify the records that don’t need any change.

- Keep only the rows that have no match in “employee_delta.csv”, i.e. where the delta “emp_id” is null.

In [None]:
emp_no_change_df = employees_df.join(employees_delta_df, employees_df.emp_id == employees_delta_df.emp_id, 'leftouter')\
  .filter(employees_delta_df.emp_id.isNull()) \
  .select(employees_df.emp_id, employees_df.emp_name, employees_df.emp_city, employees_df.emp_salary)
 
emp_no_change_df.show()

## RIGHT OUTER JOIN
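- Right outer join to identify the new records, i.e. those that exist only in “employee_delta.csv”.

- Keep only the rows that have no match in “employee.csv”, i.e. where the “emp_id” from employees_df is null.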

In [None]:
emp_new_df = employees_df.join(employees_delta_df, employees_df.emp_id == employees_delta_df.emp_id, 'rightouter')\
  .filter(employees_df.emp_id.isNull()) \
  .select(employees_delta_df.emp_id, employees_delta_df.emp_name, employees_delta_df.emp_city, employees_delta_df.emp_salary)
 
emp_new_df.show()
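
The three joins above split the emp_id values into three disjoint groups. A plain-Python sketch of that partition, using hypothetical ids and sets in place of dataframes:

```python
# Hypothetical emp_id values from employee.csv and employee_delta.csv
employee_ids = {1, 2, 3, 4}
delta_ids = {3, 4, 5}

updated_ids = employee_ids & delta_ids    # inner join: present in both files
no_change_ids = employee_ids - delta_ids  # left outer join, delta emp_id is null
new_ids = delta_ids - employee_ids        # right outer join, employee emp_id is null

# The groups are disjoint and together cover every id exactly once
all_ids = updated_ids | no_change_ids | new_ids
print(sorted(updated_ids), sorted(no_change_ids), sorted(new_ids))
```

Because no id lands in more than one group, combining the three results cannot produce duplicate employees.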

## UNION ALL
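- Union all three dataframes (emp_updated, emp_no_change_df, and emp_new_df) to get the final values. The union matches columns by position, so all three dataframes must have the same columns in the same order.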

In [3]:
emp_final = emp_updated.unionAll(emp_no_change_df).unionAll(emp_new_df).orderBy('emp_id')
 
emp_final.show()


## functools reduce(…) function
- Alternatively, we can use the “reduce” function from the functools library. It is a higher-order function (a functional programming tool) that folds a sequence of dataframes into one by applying the union to them pairwise.

In [None]:
from functools import reduce
from pyspark.sql import DataFrame
 
def unionall(*df):
  return reduce(DataFrame.unionAll, df)
         
emp_final = unionall(emp_updated, emp_no_change_df, emp_new_df).orderBy('emp_id')
emp_final.show()
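
To see what reduce does here without a Spark session, the following sketch folds plain Python lists. The `concat` function and the row data are hypothetical stand-ins for DataFrame.unionAll and the three dataframes.

```python
from functools import reduce

def concat(left, right):
    # Stand-in for DataFrame.unionAll: append the rows of `right` to `left`
    return left + right

# Hypothetical row lists standing in for the three dataframes
parts = [[(3, "c")], [(1, "a")], [(2, "b")]]

# reduce(f, [a, b, c]) evaluates f(f(a, b), c)
combined = reduce(concat, parts)
assert combined == concat(concat(parts[0], parts[1]), parts[2])

# Sorting plays the role of orderBy('emp_id')
print(sorted(combined))
```

The same helper works for any number of inputs, which is the main reason to prefer it over chaining the calls by hand.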

## SQL – show tables

In [None]:
%sql
show tables;