#### **union(), unionAll() & unionByName()**

**Behavior across Spark versions**

- **union()**
  - Available on DataFrames since **Spark 2.0.0**. It behaves like SQL `UNION ALL`: **duplicate records are retained** in the resultant DataFrame, so remove them explicitly with **distinct()** or **dropDuplicates()**.

- **unionAll()**
  - The older name for the same operation. It is an alias for union() and also **retains duplicate records** in the resultant DataFrame.

**union():**
- Combines **two DataFrames** that have the **same column order and schema**.

    - **union() and unionAll()** transformations are used to **merge two or more DataFrames** of the **same schema or structure**.
    - The output includes `all rows from both DataFrames` and **duplicates are retained**.
    - Columns are matched by **position**, not by name. If the column counts differ, or the types at the same position are incompatible, it `raises an AnalysisException`.

**unionAll():**
  - Alias for union(); it behaves the same.
  - **unionAll()** was **deprecated** in **Spark 2.0.0** in favor of **union()**. Since **Spark 3.0** it is no longer deprecated and remains an alias for **union()**.

**unionByName():**
- Combines two DataFrames by matching **column names**, even **if column order differs**.
- Both DataFrames must still contain the **same set of column names**; by default, a missing or extra column raises an AnalysisException.
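
By default, unionByName() expects both DataFrames to carry the same set of column names. From Spark 3.1 onward it also accepts an `allowMissingColumns` flag that fills columns absent on one side with nulls. The following is a minimal sketch under that version assumption; the helper names `column_differences` and `union_by_name_tolerant` are hypothetical and not part of PySpark.

```python
def column_differences(left_cols, right_cols):
    """Return (only_in_left, only_in_right), preserving each side's order."""
    only_in_left = [c for c in left_cols if c not in right_cols]
    only_in_right = [c for c in right_cols if c not in left_cols]
    return only_in_left, only_in_right


def union_by_name_tolerant(left_df, right_df):
    """Union by column name, filling columns absent on either side with nulls."""
    # allowMissingColumns is a unionByName() parameter available from Spark 3.1.
    return left_df.unionByName(right_df, allowMissingColumns=True)
```

Calling `column_differences(left_df.columns, right_df.columns)` first shows which columns would be null-filled, which makes an unexpected schema drift easier to spot before the union runs.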

**When to Use What?**

- Use **union() or unionAll()** when **schemas and column orders** are the **same**.
- Use **unionByName()** when **column names** are the **same** but their **order** might be **different**.
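
The rule of thumb above can be turned into a small pre-flight check on the column lists (`df.columns`). This is only a sketch: the function name `union_strategy` and its return labels are hypothetical, and the comparison looks at column names only, not at data types.

```python
def union_strategy(left_cols, right_cols):
    """Suggest which union method fits two lists of column names."""
    left, right = list(left_cols), list(right_cols)
    if left == right:
        # Same names in the same order: positional union() is safe.
        return "union"
    if sorted(left) == sorted(right):
        # Same names, different order: match by name instead.
        return "unionByName"
    # Different column sets: neither method works without extra handling.
    return "mismatch"
```

A typical call would be `union_strategy(df_a.columns, df_b.columns)`, where `df_a` and `df_b` stand for any two DataFrames.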

**Key Differences**

| Function      | Same Schema | Same Column Order | Matches Column Names |
|---------------|-------------|-------------------|----------------------|
| union()       | ✅ Required | ✅ Required       | ❌ No Matching       |
| unionAll()    | ✅ Required | ✅ Required       | ❌ No Matching       |
| unionByName() | ✅ Required | ❌ Not Required   | ✅ Matches Columns   |


**Syntax**

     df1.union(df2)
     df1.unionAll(df2)
     df1.unionByName(df3)

In [0]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.master("local").getOrCreate()

# Get Spark version
print(spark.sparkContext.version)

3.5.0


In [0]:
simpleData = [("Kiran", "Sales", "AP", 890000, 24, 35000), \
              ("Mohan", "Admin", "TN", 756000, 36, 45000), \
              ("Robert", "Marketing", "KA", 567000, 33, 35000), \
              ("Swetha", "Finance", "PNB", 598000, 26, 99000), \
              ("Kamalesh", "IT", "TS", 8946000, 31, 56000), \
              ("Mathew", "Maintenance", "KL", 667000, 28, 467000), \
              ("Santhosh", "Sales", "MH", 873000, 24, 734000),\
              ("Swetha", "Finance", "PNB", 598000, 26, 99000), \
              ("Mohan", "Admin", "TN", 756000, 36, 45000)
              ]

columns= ["employee_name", "department", "state", "salary", "age", "bonus"]

df1 = spark.createDataFrame(data = simpleData, schema = columns)
df1.printSchema()
display(df1)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)



employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000


In [0]:
# Create DataFrame2
simpleData2 = [("Kailash", "Sales", "RJ", 96600, 30, 15500), \
               ("Somesh", "Finance", "UP", 88000, 22, 27800), \
               ("Jennifer", "Support", "TN", 59000, 43, 35500), \
               ("Kumar", "Marketing", "CA", 768000, 28, 945000), \
               ("Sandya", "IT", "PNB", 789000, 37, 678900), \
               ("Swaroop", "Admin", "KL", 679000, 24, 478000), \
               ("Joseph", "Finance", "DL", 789000, 29, 456700), \
               ("Rashi", "Maintenance", "TS", 467800, 23, 872300), \
               ("Krishna", "Backend", "AP", 945670, 39, 435000),\
               ("Sandya", "IT", "PNB", 789000, 37, 678900), \
               ("Swaroop", "Admin", "KL", 679000, 24, 478000)
               ]
columns2= ["employee_name", "department", "state", "salary", "age", "bonus"]

df2 = spark.createDataFrame(data = simpleData2, schema = columns2)

df2.printSchema()
display(df2)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)



employee_name,department,state,salary,age,bonus
Kailash,Sales,RJ,96600,30,15500
Somesh,Finance,UP,88000,22,27800
Jennifer,Support,TN,59000,43,35500
Kumar,Marketing,CA,768000,28,945000
Sandya,IT,PNB,789000,37,678900
Swaroop,Admin,KL,679000,24,478000
Joseph,Finance,DL,789000,29,456700
Rashi,Maintenance,TS,467800,23,872300
Krishna,Backend,AP,945670,39,435000
Sandya,IT,PNB,789000,37,678900


In [0]:
# dataframe with different order of columns as compared to df1
simpleData4 = [("Kailash", 96600, "Sales", 30, "RJ", 15500), \
               ("Somesh", 88000, "Finance", 22, "UP", 27800), \
               ("Jennifer", 59000, "Support", 43, "TN", 35500), \
               ("Kumar", 768000, "Marketing", 28, "CA", 945000), \
               ("Sandya", 789000, "IT", 37, "PNB", 678900), \
               ("Swaroop", 679000, "Admin", 24, "KL", 478000), \
               ("Joseph", 789000, "Finance", 29, "DL", 456700), \
               ("Rashi", 467800, "Maintenance", 23, "TS", 872300), \
               ("Krishna", 945670, "Backend", 39, "AP", 435000)
               ]
columns4 = ["employee_name", "salary", "department", "age", "state", "bonus"]

df4 = spark.createDataFrame(data = simpleData4, schema = columns4)

df4.printSchema()
display(df4)

root
 |-- employee_name: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- department: string (nullable = true)
 |-- age: long (nullable = true)
 |-- state: string (nullable = true)
 |-- bonus: long (nullable = true)



employee_name,salary,department,age,state,bonus
Kailash,96600,Sales,30,RJ,15500
Somesh,88000,Finance,22,UP,27800
Jennifer,59000,Support,43,TN,35500
Kumar,768000,Marketing,28,CA,945000
Sandya,789000,IT,37,PNB,678900
Swaroop,679000,Admin,24,KL,478000
Joseph,789000,Finance,29,DL,456700
Rashi,467800,Maintenance,23,TS,872300
Krishna,945670,Backend,39,AP,435000


In [0]:
# dataframe with different order of columns and count as compared to df1
simpleData5 = [("Kailash", 96600, "Sales", 30, "RJ", 15500, 12345), \
               ("Somesh", 88000, "Finance", 22, "UP", 27800, 67890), \
               ("Jennifer", 59000, "Support", 43, "TN", 35500, 14789), \
               ("Kumar", 768000, "Marketing", 28, "CA", 945000, 98765), \
               ("Sandya", 789000, "IT", 37, "PNB", 678900, 85432), \
               ("Swaroop", 679000, "Admin", 24, "KL", 478000, 74321), \
               ("Joseph", 789000, "Finance", 29, "DL", 456700, 45980), \
               ("Rashi", 467800, "Maintenance", 23, "TS", 872300, 517132), \
               ("Krishna", 945670, "Backend", 39, "AP", 435000, 560103)
               ]
columns5 = ["employee_name", "salary", "department", "age", "state", "bonus", "pincode"]

df5 = spark.createDataFrame(data = simpleData5, schema = columns5)

df5.printSchema()
display(df5)

root
 |-- employee_name: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- department: string (nullable = true)
 |-- age: long (nullable = true)
 |-- state: string (nullable = true)
 |-- bonus: long (nullable = true)
 |-- pincode: long (nullable = true)



employee_name,salary,department,age,state,bonus,pincode
Kailash,96600,Sales,30,RJ,15500,12345
Somesh,88000,Finance,22,UP,27800,67890
Jennifer,59000,Support,43,TN,35500,14789
Kumar,768000,Marketing,28,CA,945000,98765
Sandya,789000,IT,37,PNB,678900,85432
Swaroop,679000,Admin,24,KL,478000,74321
Joseph,789000,Finance,29,DL,456700,45980
Rashi,467800,Maintenance,23,TS,872300,517132
Krishna,945670,Backend,39,AP,435000,560103


**1) Merge two or more DataFrames using union**
- The union() method merges **two DataFrames** and returns a new DataFrame containing **all rows** from both, **including duplicates**. It expects:

      same column names
      same order
      same column count
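
union() itself takes a single other DataFrame, so merging more than two means chaining the calls. One common way to do that over a list is `functools.reduce`; the sketch below assumes every DataFrame in the list meets the three conditions above, and the helper name `union_many` is hypothetical.

```python
from functools import reduce


def union_many(dataframes):
    """Chain union() over a sequence of DataFrames, left to right."""
    dataframes = list(dataframes)
    if not dataframes:
        raise ValueError("union_many needs at least one DataFrame")
    # reduce applies left.union(right) pairwise: ((df_a ∪ df_b) ∪ df_c) ...
    return reduce(lambda left, right: left.union(right), dataframes)
```

Duplicates are kept at every step, exactly as with a single union() call.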


In [0]:
# union() to merge two DataFrames that have the same column names, order and count
# df1: "employee_name", "department", "state", "salary", "age", "bonus"
# df2: "employee_name", "department", "state", "salary", "age", "bonus"
unionDF = df1.union(df2)
unionDF.display()

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,Sales,RJ,96600,30,15500


In [0]:
display(unionDF.dropDuplicates())

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Kailash,Sales,RJ,96600,30,15500
Somesh,Finance,UP,88000,22,27800
Kumar,Marketing,CA,768000,28,945000


In [0]:
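# Rename one column in df2. union() matches columns by position, not by name,
# so the union still succeeds and the result keeps df1's column names.
df2_col_re = df2.withColumnRenamed("bonus", "bonus_new")
display(df2_col_re)
unionDFNew = df1.union(df2_col_re)
unionDFNew.display()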

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,Sales,RJ,96600,30,15500


In [0]:
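# Rename every column of df2 and swap the names "age" and "bonus" via a temporary name.
# union() still succeeds because it matches columns by position, not by name.
df2_mltcol_re = df2.withColumnRenamed("employee_name", "EName")\
                   .withColumnRenamed("department", "dept")\
                   .withColumnRenamed("state", "country")\
                   .withColumnRenamed("salary", "commission")\
                   .withColumnRenamed("age", "bonus_new")\
                   .withColumnRenamed("bonus", "age")\
                   .withColumnRenamed("bonus_new", "bonus")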
                   
unionDFMult = df1.union(df2_mltcol_re)
unionDFMult.display()

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,Sales,RJ,96600,30,15500


In [0]:
df3 = df2.withColumn("hike", df2.bonus*100)
display(df3)

employee_name,department,state,salary,age,bonus,hike
Kailash,Sales,RJ,96600,30,15500,1550000
Somesh,Finance,UP,88000,22,27800,2780000
Jennifer,Support,TN,59000,43,35500,3550000
Kumar,Marketing,CA,768000,28,945000,94500000
Sandya,IT,PNB,789000,37,678900,67890000
Swaroop,Admin,KL,679000,24,478000,47800000
Joseph,Finance,DL,789000,29,456700,45670000
Rashi,Maintenance,TS,467800,23,872300,87230000
Krishna,Backend,AP,945670,39,435000,43500000
Sandya,IT,PNB,789000,37,678900,67890000


In [0]:
# union() to merge two DataFrames with different column counts
# df1 and df3 have the same column names and order, but df3 has one extra column
# df1: "employee_name", "department", "state", "salary", "age", "bonus"
# df3: "employee_name", "department", "state", "salary", "age", "bonus", "hike"
unionDF1 = df1.union(df3)
unionDF1.display()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-3903420035330707>, line 5
      1 # union() to merge two DataFrames with different column counts
      2 # df1 and df3 have the same column names and order, but df3 has one extra column
      3 # df1: "employee_name", "department", "state", "salary", "age", "bonus"
      4 # df3: "employee_name", "department", "state", "salary", "age", "bonus", "hike"
----> 5 unionDF1 = df1.union(df3)
      6 unionDF1.display()

File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time

In [0]:
df3 = df3.select("employee_name", "department", "state", "salary", "age")
display(df3)

employee_name,department,state,salary,age
Kailash,Sales,RJ,96600,30
Somesh,Finance,UP,88000,22
Jennifer,Support,TN,59000,43
Kumar,Marketing,CA,768000,28
Sandya,IT,PNB,789000,37
Swaroop,Admin,KL,679000,24
Joseph,Finance,DL,789000,29
Rashi,Maintenance,TS,467800,23
Krishna,Backend,AP,945670,39
Sandya,IT,PNB,789000,37


In [0]:
# union() to merge two DataFrames with different column counts
# df1 and df3 have the same column names and order, but df3 has one column fewer than df1
# df1: "employee_name", "department", "state", "salary", "age", "bonus"
# df3: "employee_name", "department", "state", "salary", "age"
unionDF2 = df1.union(df3)
unionDF2.display()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-2742766298581904>, line 5
      1 # union() to merge two DataFrames with different column counts
      2 # df1 and df3 have the same column names and order, but df3 has one column fewer than df1
      3 # df1: "employee_name", "department", "state", "salary", "age", "bonus"
      4 # df3: "employee_name", "department", "state", "salary", "age"
----> 5 unionDF2 = df1.union(df3)
      6 unionDF2.display()

File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time.per

In [0]:
# union() to merge two DataFrames whose columns are in a different order
# columns are matched by position, so df4's values land under the wrong df1 column names
# df1: "employee_name", "department", "state", "salary", "age", "bonus"
# df4: "employee_name", "salary", "department", "age", "state", "bonus"
unionDF3 = df1.union(df4)
unionDF3.display()

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,96600,Sales,30,RJ,15500
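
The last row above shows the risk of a positional union: no error is raised, yet the salary value ends up under `department`. A small guard on `df.columns` can surface this before the union runs. The sketch below is one possible approach; `positional_mismatches` and `strict_union` are hypothetical helper names, and the check compares names only, not data types.

```python
def positional_mismatches(left_cols, right_cols):
    """List (position, left_name, right_name) wherever the names differ."""
    # zip stops at the shorter list, so extra trailing columns are not listed.
    return [
        (i, left, right)
        for i, (left, right) in enumerate(zip(left_cols, right_cols))
        if left != right
    ]


def strict_union(left_df, right_df):
    """Call union() only when both sides list identical column names in order."""
    if list(left_df.columns) != list(right_df.columns):
        diffs = positional_mismatches(left_df.columns, right_df.columns)
        raise ValueError(f"columns do not line up by position: {diffs}")
    return left_df.union(right_df)
```

For the two column lists used in this cell, `positional_mismatches(df1.columns, df4.columns)` reports positions 1 to 4, which are the four columns that were swapped around.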


**2) Merge DataFrames using unionAll**
- The DataFrame **unionAll()** method was **deprecated** in **Spark 2.0.0** in favor of **union()**. Since **Spark 3.0** it is no longer deprecated and is simply an alias for **union()**.

In [0]:
# unionAll() to merge two DataFrames
unionAllDF = df1.unionAll(df2)
unionAllDF.display()

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,Sales,RJ,96600,30,15500


**Merge without Duplicates**

- Since the union() method returns **all rows, including duplicates**, use the distinct() function to keep just one record when a **duplicate exists**.

In [0]:
# Remove duplicates after union() using distinct()
disDF = df1.union(df2).distinct()
display(disDF)

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Kailash,Sales,RJ,96600,30,15500
Somesh,Finance,UP,88000,22,27800
Kumar,Marketing,CA,768000,28,945000
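
distinct() compares entire rows. When two rows should count as duplicates based on a few key columns only, dropDuplicates() accepts a list of column names instead. The sketch below wraps both options; the helper name `union_dedup` and its `key_cols` parameter are hypothetical.

```python
def union_dedup(left_df, right_df, key_cols=None):
    """Union two DataFrames, then drop duplicate rows."""
    combined = left_df.union(right_df)
    if key_cols is None:
        # Whole-row comparison, same as the distinct() call above.
        return combined.distinct()
    # Compare only the listed columns; which of the matching rows
    # survives is not guaranteed.
    return combined.dropDuplicates(list(key_cols))
```

For example, `union_dedup(df1, df2, ["employee_name", "department"])` would keep one row per employee and department pair.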


**3) unionByName**

In [0]:
# same column names, count and order
unionByName = df1.unionByName(df2)
unionByName.display()

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,Sales,RJ,96600,30,15500


In [0]:
# unionByName() to merge two DataFrames with the same column count but a different column order
# df1: "employee_name", "department", "state", "salary", "age", "bonus"
# df4: "employee_name", "salary", "department", "age", "state", "bonus"
unionByName1 = df1.unionByName(df4)
unionByName1.display()

employee_name,department,state,salary,age,bonus
Kiran,Sales,AP,890000,24,35000
Mohan,Admin,TN,756000,36,45000
Robert,Marketing,KA,567000,33,35000
Swetha,Finance,PNB,598000,26,99000
Kamalesh,IT,TS,8946000,31,56000
Mathew,Maintenance,KL,667000,28,467000
Santhosh,Sales,MH,873000,24,734000
Swetha,Finance,PNB,598000,26,99000
Mohan,Admin,TN,756000,36,45000
Kailash,Sales,RJ,96600,30,15500


In [0]:
# unionByName() to merge two DataFrames with a different column count and a different column order
# df1: "employee_name", "department", "state", "salary", "age", "bonus"
# df5: "employee_name", "salary", "department", "age", "state", "bonus", "pincode"
unionByName2 = df1.unionByName(df5)
unionByName2.display()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-3903420035330711>, line 4
      1 # unionByName() to merge two DataFrames with a different column count and a different column order
      2 # df1: "employee_name", "department", "state", "salary", "age", "bonus"
      3 # df5: "employee_name", "salary", "department", "age", "state", "bonus", "pincode"
----> 4 unionByName2 = df1.unionByName(df5)
      5 unionByName2.display()

File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time.perf_counter()
[1;32m     46[0m [38;5;28;01