#### **exceptAll**

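- The `exceptAll` function in PySpark returns the **difference between two DataFrames** while preserving `duplicates`. It returns the rows that **exist** in the **first DataFrame** but **do not appear** in the **second DataFrame**, and it counts copies rather than collapsing them: each row in the second DataFrame cancels only one matching row in the first, so any remaining **duplicate rows** stay in the result. This is the behaviour of SQL `EXCEPT ALL`.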
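
One way to picture how duplicates are handled: a row that occurs *m* times in the first DataFrame and *n* times in the second is kept max(*m* − *n*, 0) times. The sketch below models that rule in plain Python with `collections.Counter`. It illustrates the semantics only, not how Spark executes the operation; the helper name `except_all_model` and the sample rows are made up for illustration.

```python
from collections import Counter

def except_all_model(left_rows, right_rows):
    """Model exceptAll on plain lists of tuples (illustrative helper, not a Spark API)."""
    # Counter subtraction drops rows whose count would fall to zero or below,
    # so each row survives max(m - n, 0) times.
    remaining = Counter(left_rows) - Counter(right_rows)
    return sorted(remaining.elements())

left = [("a", 1), ("a", 1), ("a", 1), ("b", 2)]
right = [("a", 1), ("c", 3)]

print(except_all_model(left, right))
# [('a', 1), ('a', 1), ('b', 2)]
```

The row `("a", 1)` appears three times on the left and once on the right, so two copies remain; `("c", 3)` has nothing to cancel and is ignored.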

#### **Use Cases**

**Data Validation:**
- When performing data validation `between two datasets`, you can use exceptAll to `identify discrepancies and missing records`, even if `duplicates` exist.

**Data Cleansing:**
- During data cleansing processes, you may want to find and `remove duplicates or redundant records`. exceptAll can help identify such records, for example by comparing a DataFrame against a de-duplicated copy of itself.

**Data Synchronization:**
- When dealing with data synchronization between `different data sources or systems`, exceptAll can assist in identifying changes or discrepancies.

**Data Quality Monitoring:**
- For monitoring data quality in a `streaming or batch processing` pipeline, exceptAll can help detect `anomalies and inconsistencies`.

**Syntax**

     DataFrame.exceptAll(other)

**DataFrame:** The source DataFrame from which you want to find the difference.

**other:** The DataFrame you want to compare against.
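
Like other SQL set operations, `exceptAll` matches columns by position rather than by name, so both DataFrames need the same number of columns in the same order. A minimal sketch of a wrapper that reorders `other` before comparing is shown below. `except_all_by_name` is a hypothetical helper, not part of PySpark, and it assumes both arguments are PySpark DataFrames that share the same column names.

```python
def except_all_by_name(left, right):
    """Run left.exceptAll(right) after reordering right's columns to match left.

    Hypothetical convenience wrapper; `left` and `right` are assumed to be
    PySpark DataFrames that share the same set of column names.
    """
    if set(left.columns) != set(right.columns):
        raise ValueError("DataFrames must have the same column names")
    # exceptAll compares columns by position, so align the order explicitly.
    return left.exceptAll(right.select(*left.columns))

# Usage in a notebook where df1 and df2 already exist:
# result = except_all_by_name(df1, df2)
```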

**CASE 01:** 
- exceptAll on two DataFrames

In [0]:
# Find all the rows that exist in df1 but do not appear in df2
# Create two DataFrames
data1 = [("Ramesh", "ADF", "Grade1", 20, 5),
         ("Kamal", "ADB", "Grade2", 25, 8),
         ("Bibin", "SQL", "Grade3", 28, 3),
         ("Bharath", "Git", "Grade4", 32, 5),
         ("Ramesh", "ADF", "Grade1", 35, 2),
         ("Ramesh", "ADF", "Grade1", 38, 6),
         ("Bibin", "SQL", "Grade3", 36, 4),
         ("Bibin", "SQL", "Grade3", 23, 7),]

data2 = [("Ramesh", "ADF", "Grade1", 20, 5),
         ("Bibin", "SQL", "Grade3", 28, 3)]

columns = ["Customer", "Tech", "Level", "Age", "Experience"]

df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)

display(df1)
display(df2)

Customer,Tech,Level,Age,Experience
Ramesh,ADF,Grade1,20,5
Kamal,ADB,Grade2,25,8
Bibin,SQL,Grade3,28,3
Bharath,Git,Grade4,32,5
Ramesh,ADF,Grade1,35,2
Ramesh,ADF,Grade1,38,6
Bibin,SQL,Grade3,36,4
Bibin,SQL,Grade3,23,7


Customer,Tech,Level,Age,Experience
Ramesh,ADF,Grade1,20,5
Bibin,SQL,Grade3,28,3


In [0]:
# Use exceptAll to find the difference
result = df1.exceptAll(df2)

# Show the result
display(result)

Customer,Tech,Level,Age,Experience
Kamal,ADB,Grade2,25,8
Bharath,Git,Grade4,32,5
Ramesh,ADF,Grade1,35,2
Ramesh,ADF,Grade1,38,6
Bibin,SQL,Grade3,36,4
Bibin,SQL,Grade3,23,7


**CASE 02:** 
- exceptAll on selected columns of two DataFrames

In [0]:
# df1 and df2 are the DataFrames created in CASE 01
columns_to_compare = ["Customer", "Tech", "Level"]

# Select the specific columns from each DataFrame
df1_selected = df1.select(columns_to_compare)
df2_selected = df2.select(columns_to_compare)

# Use exceptAll to find rows in df1 that are not in df2
result_df = df1_selected.exceptAll(df2_selected)

# Display the result
display(result_df)

Customer,Tech,Level
Ramesh,ADF,Grade1
Ramesh,ADF,Grade1
Kamal,ADB,Grade2
Bibin,SQL,Grade3
Bibin,SQL,Grade3
Bharath,Git,Grade4
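
Projecting to three columns makes the rows for Ramesh and Bibin identical, and this is where `exceptAll` differs from `subtract` (SQL `EXCEPT DISTINCT`). The sketch below models both on the same projected rows in plain Python, using `Counter` for `exceptAll` and a `set` for `subtract`. It is a model of the semantics under those assumptions, not Spark code.

```python
from collections import Counter

# Rows of df1 and df2 from CASE 01, projected to (Customer, Tech, Level).
df1_rows = [("Ramesh", "ADF", "Grade1"), ("Kamal", "ADB", "Grade2"),
            ("Bibin", "SQL", "Grade3"), ("Bharath", "Git", "Grade4"),
            ("Ramesh", "ADF", "Grade1"), ("Ramesh", "ADF", "Grade1"),
            ("Bibin", "SQL", "Grade3"), ("Bibin", "SQL", "Grade3")]
df2_rows = [("Ramesh", "ADF", "Grade1"), ("Bibin", "SQL", "Grade3")]

# exceptAll model: each row in df2 cancels exactly one matching copy in df1.
except_all_rows = sorted((Counter(df1_rows) - Counter(df2_rows)).elements())

# subtract model: a row is dropped entirely if it appears in df2 at all,
# and the survivors are de-duplicated.
subtract_rows = sorted(set(df1_rows) - set(df2_rows))

print(len(except_all_rows))  # 6
print(subtract_rows)
# [('Bharath', 'Git', 'Grade4'), ('Kamal', 'ADB', 'Grade2')]
```

`exceptAll` keeps two of the three Ramesh rows and two of the three Bibin rows, matching the six rows displayed above, whereas `subtract` would keep only Kamal and Bharath.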


**CASE 03:** 
- chaining exceptAll across three DataFrames

In [0]:
df1 = spark.createDataFrame([(1, "apple"), (2, "banana"), (4, "grape"), (5, "melon"), (3, "orange")], ["id", "fruit"])
df2 = spark.createDataFrame([(1, "apple"), (2, "banana")], ["id", "fruit"])
df5 = spark.createDataFrame([(4, "grape"), (5, "melon"), (6, "watermelon")], ["id", "fruit"])

# Step 1: rows in df1 that are not in df2
df1_df2 = df1.exceptAll(df2)
display(df1_df2)

# Step 2: chain a second exceptAll to also remove the rows that appear in df5
result = df1.exceptAll(df2).exceptAll(df5)
display(result)

id,fruit
4,grape
5,melon
3,orange


id,fruit
3,orange
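
The same chaining pattern extends to any number of DataFrames by folding over them with `functools.reduce`. The sketch below is one way to write that; `except_all_many` is a hypothetical helper, and it assumes every argument is a PySpark DataFrame with the same columns in the same order.

```python
from functools import reduce

def except_all_many(base, *others):
    """Remove the rows of every DataFrame in `others` from `base`, in order.

    Hypothetical helper: equivalent to base.exceptAll(o1).exceptAll(o2)...
    """
    return reduce(lambda acc, other: acc.exceptAll(other), others, base)

# Usage with the DataFrames from this case:
# result = except_all_many(df1, df2, df5)
```

Because each step only subtracts row counts, the chained result should match `df1.exceptAll(df2.union(df5))`, since `union` keeps duplicates.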


**CASE 04:**
- How to return a `new DataFrame` containing the rows that `exist` in one DataFrame but `not` in the other DataFrame.

In [0]:
df_rev01 = spark.read.csv("/FileStore/tables/exceptAll_rev01.csv", header=True, inferSchema=True)
display(df_rev01.limit(10))

Company_Name,Cust_Id,Cust_Name,Category,Start_Date,Start_Cust_Date,End_Date,Updated_Date,Cust_Value,Cust_Type,Exchange,Location,Last_Date_UTC,Cust_Category,Index
Sony,20,Naresh,Standard,3-Feb-23,1730000000000.0,1730000000000.0,1730000000000.0,30,STD,EUR,IND,1720000000000.0,SETTL,True
Sony,21,kamal,Standard,6-Feb-23,1730000000000.0,1730000000000.0,1730000000000.0,25,STD,EUR,IND,1720000000000.0,TOI,False
Sony,22,kajal,Standard,9-Feb-23,1730000000000.0,1730000000000.0,1730000000000.0,28,STD,EUR,IND,1720000000000.0,TOI,False
Sony,23,kiran,Standard,3-Jan-24,1730000000000.0,1730000000000.0,1730000000000.0,31,STD,EUR,IND,1720000000000.0,TOI,False
Sony,24,sam,Standard,8-Jan-24,1730000000000.0,1730000000000.0,1730000000000.0,34,STD,EUR,IND,1720000000000.0,TOI,False
Sony,25,sourab,Standard,9-Jan-24,1730000000000.0,1740000000000.0,1730000000000.0,37,STD,EUR,IND,1720000000000.0,TOI,True
Sony,26,jai,Upper,3-Mar-23,1730000000000.0,1740000000000.0,1730000000000.0,40,STD,EUR,IND,1720000000000.0,TOI,True
BPL,27,sree,Upper,6-Mar-23,1730000000000.0,1730000000000.0,1730000000000.0,43,STD,EUR,IND,1720000000000.0,SETTL,True
BPL,28,sreenath,Upper,9-Mar-23,1730000000000.0,1740000000000.0,1730000000000.0,46,STD,EUR,IND,1720000000000.0,SETTL,True
BPL,29,kamaesh,Upper,3-Jan-25,1740000000000.0,1740000000000.0,1730000000000.0,49,STD,EUR,IND,1720000000000.0,SETTL,False


In [0]:
df_rev02 = spark.read.csv("/FileStore/tables/exceptAll_rev02.csv", header=True, inferSchema=True)
display(df_rev02.limit(10))

Company_Name,Cust_Id,Cust_Name,Category,Start_Date,Start_Cust_Date,End_Date,Updated_Date,Cust_Value,Cust_Type,Exchange,Location,Last_Date_UTC,Cust_Category,Index
Sony,20,Naresh,Standard,3-Feb-23,1730000000000.0,1730000000000.0,1730000000000.0,30,STD,EUR,IND,1720000000000.0,SETTL,True
Sony,21,kamal,Standard,6-Feb-23,1730000000000.0,1730000000000.0,1730000000000.0,25,STD,EUR,IND,1720000000000.0,TOI,False
Sony,22,kajal,Standard,9-Feb-23,1730000000000.0,1730000000000.0,1730000000000.0,28,STD,EUR,IND,1720000000000.0,TOI,False
Sony,23,kiran,Standard,3-Jan-24,1730000000000.0,1730000000000.0,1730000000000.0,31,STD,EUR,IND,1720000000000.0,TOI,False
Sony,32,sam,Standard,13/8/2024,1780000000000.0,1730000000000.0,1730000000000.0,45,STD,EUR,IND,1790000000000.0,TOI,True
Sony,25,sourab,Standard,9-Jan-24,1730000000000.0,1740000000000.0,1730000000000.0,37,STD,EUR,IND,1720000000000.0,TOI,True
Sony,35,jaji,Lower,3-Mar-23,1730000000000.0,1740000000000.0,1730000000000.0,50,STD,EUR,IND,1720000000000.0,TOI,False
BPL,27,sree,Upper,6-Mar-23,1730000000000.0,1730000000000.0,1730000000000.0,43,STD,EUR,IND,1720000000000.0,SETTL,True
BPL,28,sreenath,Upper,9-Mar-23,1730000000000.0,1740000000000.0,1730000000000.0,46,STD,EUR,IND,1720000000000.0,SETTL,True
BPL,29,kamaesh,Upper,3-Jan-25,1740000000000.0,1740000000000.0,1730000000000.0,49,STD,EUR,IND,1720000000000.0,SETTL,False


In [0]:
print("No of Rows in Rev01:", df_rev01.count())
print("No of distinct Rows in Rev01:", df_rev01.distinct().count())

print("\nNo of Rows in Rev02:", df_rev02.count())
print("No of distinct Rows in Rev02:", df_rev02.distinct().count())

No of Rows in Rev01: 49
No of distinct Rows in Rev01: 49

No of Rows in Rev02: 69
No of distinct Rows in Rev02: 49
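
The counts show that Rev02 holds 69 rows but only 49 distinct ones, so some rows are repeated. One way to list just the repeat copies is to subtract the de-duplicated DataFrame from the original with `exceptAll`. This is a minimal sketch; `surplus_copies` is a hypothetical helper and assumes its argument is a PySpark DataFrame.

```python
def surplus_copies(df):
    """Return the extra copies of duplicated rows in `df`.

    Hypothetical helper: a row that occurs k times in `df` occurs once in
    df.dropDuplicates(), so exceptAll leaves k - 1 copies of it.
    """
    return df.exceptAll(df.dropDuplicates())

# Usage in the notebook:
# display(surplus_copies(df_rev02))
```

Given the counts above, this should return 69 − 49 = 20 rows for `df_rev02` and no rows for `df_rev01`.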


In [0]:
# Rows present in df_rev01 that are missing from df_rev02 (duplicates are counted)
New_Records = df_rev01.exceptAll(df_rev02)

# Show the result
display(New_Records)

Company_Name,Cust_Id,Cust_Name,Category,Start_Date,Start_Cust_Date,End_Date,Updated_Date,Cust_Value,Cust_Type,Exchange,Location,Last_Date_UTC,Cust_Category,Index
BP,57,sourab,Premium,6-Mar-23,1760000000000.0,1760000000000.0,1730000000000.0,133,STD,EUR,UK,1720000000000.0,EDA,False
BP,56,sam,Premium,3-Mar-23,1760000000000.0,1760000000000.0,1730000000000.0,130,STD,EUR,UK,1720000000000.0,EDA,False
BP,55,vidish,Premium,9-Jan-24,1760000000000.0,1760000000000.0,1730000000000.0,127,STD,EUR,UK,1720000000000.0,EDA,False
Reliance,65,sweta,Lower,3-Feb-23,1710000000000.0,1720000000000.0,1730000000000.0,157,STD,EUR,SL,1720000000000.0,SETTL,True
Sony,26,jai,Upper,3-Mar-23,1730000000000.0,1740000000000.0,1730000000000.0,40,STD,EUR,IND,1720000000000.0,TOI,True
Sony,24,sam,Standard,8-Jan-24,1730000000000.0,1730000000000.0,1730000000000.0,34,STD,EUR,IND,1720000000000.0,TOI,False
Reliance,67,ramesh,Lower,9-Feb-23,1710000000000.0,1720000000000.0,1730000000000.0,163,STD,EUR,SL,1720000000000.0,SETTL,False
BP,58,jai,Premium,9-Mar-23,1760000000000.0,1760000000000.0,1730000000000.0,136,STD,EUR,UK,1720000000000.0,EDA,True
Reliance,66,narendra,Lower,6-Feb-23,1710000000000.0,1720000000000.0,1730000000000.0,160,STD,EUR,SL,1720000000000.0,SETTL,False
BPL,30,david,Upper,6-Jan-25,1740000000000.0,1740000000000.0,1730000000000.0,52,STD,EUR,IND,1720000000000.0,SETTL,False
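
`exceptAll` is directional: `df_rev01.exceptAll(df_rev02)` lists only the rows that Rev02 lacks, not the rows that Rev02 adds. For validation or synchronization it is often useful to run it both ways. The sketch below shows one way to do that; `two_way_diff` is a hypothetical helper and assumes both arguments are PySpark DataFrames with the same columns in the same order.

```python
def two_way_diff(left, right):
    """Return (only_in_left, only_in_right) using exceptAll in both directions.

    Hypothetical helper; `left` and `right` are assumed to be PySpark
    DataFrames with the same columns in the same order.
    """
    only_in_left = left.exceptAll(right)
    only_in_right = right.exceptAll(left)
    return only_in_left, only_in_right

# Usage in the notebook:
# missing_in_rev02, added_in_rev02 = two_way_diff(df_rev01, df_rev02)
# display(added_in_rev02)
```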
