#### unionByName()

- **union()** matches columns **by position**, so both DataFrames need the **same schema** in the **same order**.
- **unionByName()** unions by **matching column names** instead of relying on **order**.
- With **allowMissingColumns=True** (available since Spark 3.1), it handles **mismatched schemas** by **filling missing columns** with **null**.
- Use it when **column order** differs **between DataFrames**.

- The **unionByName()** function in PySpark **combines two DataFrames** (or more, by chaining calls) based on their **column names**, rather than their **positional order**.
- This is the key **distinction** from the standard **union() and unionAll()** methods, which match columns **by position** and ignore **column names**, so they only give correct results when both DataFrames list their columns in the **same order**.
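
To make the position-versus-name distinction concrete, here is a minimal sketch in plain Python. It does not call Spark: the `(columns, rows)` pairs and the helpers `union_by_position` and `union_by_name` are hypothetical stand-ins that only model the matching rule, using one row each from the DataFrames built later in this notebook.

```python
# Toy model in plain Python (no Spark needed): a "frame" is (columns, rows).
# These helpers are illustrative only; they are not part of the PySpark API.

def union_by_position(left, right):
    """Append right's rows unchanged; only the column count is checked."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("column counts differ")
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Reorder each right-hand row to follow left's column order."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if set(left_cols) != set(right_cols):
        raise ValueError("column names differ")
    index = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in index) for row in right_rows]
    return left_cols, left_rows + reordered


a = (["Name", "Country", "Age"], [("Andrew", "New York", 25)])
b = (["Name", "Age", "Country"], [("Andrew", 31, "INDIA")])

by_position = union_by_position(a, b)[1]
by_name = union_by_name(a, b)[1]

print(by_position)  # second row keeps its (Name, Age, Country) order
print(by_name)      # second row is rearranged to (Name, Country, Age)
```

In the positional result the age `31` sits under `Country`, which is the kind of misalignment the Spark examples below run into.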

#### Content
- Why use **unionByName** instead of **union**?
- **unionByName**
  - same schema
  - different column order
  - different schemas
  - Union Multiple DataFrames
  - Union of two DataFrames with few common columns
  - Union of two DataFrames with completely different columns

**1) Why use unionByName instead of union?**

In [0]:
# First DataFrame
data = [("Andrew", "New York", 25),
        ("Bibin", "Chicago", 30),
        ("Alex", "Delhi", 22),
        ("Chetan", "Nasik", 23),
        ("Albert", "Kolkatta", 27)
        ]

columns = ["Name", "Country", "Age"]

df1_union = spark.createDataFrame(data, columns)
df1_union.display()

Name,Country,Age
Andrew,New York,25
Bibin,Chicago,30
Alex,Delhi,22
Chetan,Nasik,23
Albert,Kolkatta,27


In [0]:
# Second DataFrame (same column order as the first)
data = [("Abhi", "INDIA", 20),
        ("Johny", "UK", 22),
        ("Charan", "SWEDEN", 25),
        ("Harish", "NORWAY", 28),
        ("Kiran", "GERMANY", 29)
        ]

columns = ["Name", "Country", "Age"]

df2_union = spark.createDataFrame(data, columns)
df2_union.display()

Name,Country,Age
Abhi,INDIA,20
Johny,UK,22
Charan,SWEDEN,25
Harish,NORWAY,28
Kiran,GERMANY,29


In [0]:
# Order of columns are same
# df1 & df2: Name, Country, Age
df1_union.union(df2_union).display()

Name,Country,Age
Andrew,New York,25
Bibin,Chicago,30
Alex,Delhi,22
Chetan,Nasik,23
Albert,Kolkatta,27
Abhi,INDIA,20
Johny,UK,22
Charan,SWEDEN,25
Harish,NORWAY,28
Kiran,GERMANY,29


In [0]:
# Third DataFrame (Age and Country swapped)
data = [("Andrew", 31, "INDIA"),
        ("Bibin", 32, "CANADA"),
        ("Alex", 33, "USA"),
        ("Chetan", 34, "Nepal"),
        ("Albert", 35, "UK")
        ]

columns = ["Name", "Age", "Country"]

df3_union = spark.createDataFrame(data, columns)
df3_union.display()

Name,Age,Country
Andrew,31,INDIA
Bibin,32,CANADA
Alex,33,USA
Chetan,34,Nepal
Albert,35,UK


In [0]:
# Order of columns are not same
# df1: Name, Country, Age
# df3: Name, Age, Country

df1_union.union(df3_union).display()

---------------------------------------------------------------------------
NumberFormatException                     Traceback (most recent call last)
File <command-6468574538333445>, line 5
      1 # Order of columns are not same
      2 # df1: Name, Country, Age
      3 # df3: Name, Age, Country
----> 5 df1_union.union(df3_union).display()

File /databricks/python_shell/lib/dbruntime/monkey_patches.py:72, in apply_dataframe_display_patch.<locals>.df_display(df, *args, **kwargs)
     68 def df_display(df, *args, **kwargs):
     69     """
     70     df.display() is an alias for display(df). Run help(display) for more in

In [0]:
# Order of columns are not same
# df1: Name, Country, Age
# df3: Name, Age, Country

df1_union.unionByName(df3_union).display()

Name,Country,Age
Andrew,New York,25
Bibin,Chicago,30
Alex,Delhi,22
Chetan,Nasik,23
Albert,Kolkatta,27
Andrew,INDIA,31
Bibin,CANADA,32
Alex,USA,33
Chetan,Nepal,34
Albert,UK,35
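
The `union()` call on `df1_union` and `df3_union` failed while `unionByName()` succeeded. A plausible reading, offered as an assumption and not confirmed by the truncated traceback, is that positional matching lined up string values such as `"INDIA"` with a numeric column. The sketch below only pairs the two column lists by position to show where they disagree. It uses plain Python lists copied from the cells above and no Spark calls.

```python
# Column lists copied from the cells above.
left_columns = ["Name", "Country", "Age"]    # df1_union
right_columns = ["Name", "Age", "Country"]   # df3_union

# Pair the columns the way a positional union lines them up.
for position, (left, right) in enumerate(zip(left_columns, right_columns)):
    status = "ok" if left == right else "MISMATCH"
    print(position, left, "<-", right, status)

# Collect only the positions where the names disagree.
mismatched = [(l, r) for l, r in zip(left_columns, right_columns) if l != r]
print(mismatched)
```

In a notebook the same check can be run on `df1_union.columns` and `df3_union.columns` before deciding between `union()` and `unionByName()`.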


**2) Same schema**

In [0]:
# First DataFrame
data = [(1, "Andrew", 25, "F", "New York"),
        (2, "Bibin", 30, "M", "Chicago"),
        (3, "Alex", 22, "F", "Delhi"),
        (4, "Chetan", 23, "M", "Nasik"),
        (5, "Albert", 27, "F", "Kolkatta"),
        (6, "Mansoor", 32, "M", "Pune"),
        (7, "Diliph", 29, "F", "Cochin"),
        (8, "Manish", 35, "M", "Hyderabad")
        ]

columns = ["id", "name", "age", "gender", "city"]

df1 = spark.createDataFrame(data, columns)
df1.display()

id,name,age,gender,city
1,Andrew,25,F,New York
2,Bibin,30,M,Chicago
3,Alex,22,F,Delhi
4,Chetan,23,M,Nasik
5,Albert,27,F,Kolkatta
6,Mansoor,32,M,Pune
7,Diliph,29,F,Cochin
8,Manish,35,M,Hyderabad


In [0]:
# Second DataFrame (same schema as df1)
data = [(9, "Ananth", 22, "F", "Bangalore"),
        (10, "Bole", 31, "M", "Vizak"),
        (11, "Balu", 24, "F", "Delhi"),
        (12, "Farid", 28, "M", "Nasik"),
        (13, "Sony", 29, "F", "Amaravati"),
        (14, "Nitin", 35, "M", "Pune"),
        (15, "Praveen", 26, "F", "Cochin"),
        (16, "Swaroop", 37, "M", "Hyderabad")
        ]

columns = ["id", "name", "age", "gender", "city"]

df2 = spark.createDataFrame(data, columns)
df2.display()

id,name,age,gender,city
9,Ananth,22,F,Bangalore
10,Bole,31,M,Vizak
11,Balu,24,F,Delhi
12,Farid,28,M,Nasik
13,Sony,29,F,Amaravati
14,Nitin,35,M,Pune
15,Praveen,26,F,Cochin
16,Swaroop,37,M,Hyderabad


In [0]:
df_union_same = df1.unionByName(df2)
df_union_same.display()

id,name,age,gender,city
1,Andrew,25,F,New York
2,Bibin,30,M,Chicago
3,Alex,22,F,Delhi
4,Chetan,23,M,Nasik
5,Albert,27,F,Kolkatta
6,Mansoor,32,M,Pune
7,Diliph,29,F,Cochin
8,Manish,35,M,Hyderabad
9,Ananth,22,F,Bangalore
10,Bole,31,M,Vizak


**3) Different Column Order**

In [0]:
# Third DataFrame (same columns as df1, different order)
data = [("Alekhya", 9, "F", 32, "New York"),
        ("Yash", 10, "M", 35, "Chicago"),
        ("Firoj", 11, "M", 37, "Delhi"),
        ("Gowthami", 12, "F", 39, "Nasik"),
        ("Anandi", 13, "F", 41, "Kolkatta"),
        ("Manohar", 14, "M", 43, "Pune"),
        ("Deepti", 15, "F", 45, "Cochin"),
        ("Mohan", 16, "M", 47, "Hyderabad")
        ]

columns = ["name", "id", "gender", "age", "city"]

df3 = spark.createDataFrame(data, columns)
df3.display()

name,id,gender,age,city
Alekhya,9,F,32,New York
Yash,10,M,35,Chicago
Firoj,11,M,37,Delhi
Gowthami,12,F,39,Nasik
Anandi,13,F,41,Kolkatta
Manohar,14,M,43,Pune
Deepti,15,F,45,Cochin
Mohan,16,M,47,Hyderabad


In [0]:
# unionByName() matches by column names, so order doesn’t matter
df_union_diff = df1.unionByName(df3)

df_union_diff.display()

id,name,age,gender,city
1,Andrew,25,F,New York
2,Bibin,30,M,Chicago
3,Alex,22,F,Delhi
4,Chetan,23,M,Nasik
5,Albert,27,F,Kolkatta
6,Mansoor,32,M,Pune
7,Diliph,29,F,Cochin
8,Manish,35,M,Hyderabad
9,Alekhya,32,F,New York
10,Yash,35,M,Chicago


**4) Different schemas**
- allowMissingColumns=True

In [0]:
# DataFrame with an extra column (product) that df1 does not have
data = [(1, "Andrew", 25, "F", "New York", "sony"),
        (2, "Bibin", 30, "M", "Chicago", "iphone"),
        (3, "Alex", 22, "F", "Delhi", "bpl"),
        (4, "Chetan", 23, "M", "Nasik", "bmw"),
        (5, "Albert", 27, "F", "Kolkatta", "vimson"),
        (6, "Mansoor", 32, "M", "Pune", "samsung"),
        (7, "Diliph", 29, "F", "Cochin", "snowflake"),
        (8, "Manish", 35, "M", "Hyderabad", "azure")
        ]

columns = ["id", "name", "age", "gender", "city", "product"]

df5 = spark.createDataFrame(data, columns)
df5.display()

id,name,age,gender,city,product
1,Andrew,25,F,New York,sony
2,Bibin,30,M,Chicago,iphone
3,Alex,22,F,Delhi,bpl
4,Chetan,23,M,Nasik,bmw
5,Albert,27,F,Kolkatta,vimson
6,Mansoor,32,M,Pune,samsung
7,Diliph,29,F,Cochin,snowflake
8,Manish,35,M,Hyderabad,azure


In [0]:
# unionByName with allowMissingColumns=True fills missing columns with null
df_union_diff = df1.unionByName(df5, allowMissingColumns=True)
df_union_diff.display()

id,name,age,gender,city,product
1,Andrew,25,F,New York,
2,Bibin,30,M,Chicago,
3,Alex,22,F,Delhi,
4,Chetan,23,M,Nasik,
5,Albert,27,F,Kolkatta,
6,Mansoor,32,M,Pune,
7,Diliph,29,F,Cochin,
8,Manish,35,M,Hyderabad,
1,Andrew,25,F,New York,sony
2,Bibin,30,M,Chicago,iphone
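
Before relying on `allowMissingColumns=True`, it can help to know which columns will be filled with null on each side. The helper below, `null_filled_columns`, is a hypothetical name introduced for illustration. It works on plain lists of column names (such as those returned by `DataFrame.columns`) and does not touch Spark.

```python
def null_filled_columns(left_columns, right_columns):
    """Return (missing_in_left, missing_in_right).

    Each list names the columns that one side lacks and that
    allowMissingColumns=True would therefore fill with null for that side.
    """
    missing_in_left = [c for c in right_columns if c not in left_columns]
    missing_in_right = [c for c in left_columns if c not in right_columns]
    return missing_in_left, missing_in_right


# Column lists copied from df1 and df5 above.
df1_columns = ["id", "name", "age", "gender", "city"]
df5_columns = ["id", "name", "age", "gender", "city", "product"]

report = null_filled_columns(df1_columns, df5_columns)
print(report)  # (['product'], [])
```

Here only `df1` lacks `product`, which matches the empty `product` values in its rows of the output above.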


**5) Union Multiple DataFrames**

In [0]:
# Use reduce() to union multiple DataFrames safely
from functools import reduce

dfs = [df1, df2, df5]
df_union = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)

df_union.display()

id,name,age,gender,city,product
1,Andrew,25,F,New York,
2,Bibin,30,M,Chicago,
3,Alex,22,F,Delhi,
4,Chetan,23,M,Nasik,
5,Albert,27,F,Kolkatta,
6,Mansoor,32,M,Pune,
7,Diliph,29,F,Cochin,
8,Manish,35,M,Hyderabad,
9,Ananth,22,F,Bangalore,
10,Bole,31,M,Vizak,
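
The `reduce()` call folds the list pairwise: the first two DataFrames are unioned, then that result is unioned with the next, and so on. The sketch below mirrors that folding on column-name lists only, assuming the result keeps the left-hand columns first and appends right-only columns at the end, as the outputs in this notebook show. `merged_columns` is a hypothetical helper, not a Spark function.

```python
from functools import reduce


def merged_columns(left_columns, right_columns):
    """Left columns first, then columns that only the right side has."""
    extra = [c for c in right_columns if c not in left_columns]
    return list(left_columns) + extra


# Column lists copied from df1, df2 and df5 above.
schemas = [
    ["id", "name", "age", "gender", "city"],             # df1
    ["id", "name", "age", "gender", "city"],             # df2
    ["id", "name", "age", "gender", "city", "product"],  # df5
]

# reduce() applies merged_columns to (df1, df2), then to (result, df5).
final_columns = reduce(merged_columns, schemas)
print(final_columns)
```

The resulting list matches the header of the output above: `id,name,age,gender,city,product`.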


**6) Union of two DataFrames with few common columns**

In [0]:
# First DataFrame (3 columns)
df1_comm = spark.createDataFrame(
    [
        (1, 2, 3),
        (10, 20, 30),
        (100, 200, 300)
    ],
    ["col0", "col1", "col2"]
)

# Second DataFrame (4 columns, some overlap with df1)
df2_comm = spark.createDataFrame(
    [
        (4, 5, 6, 7),
        (40, 50, 60, 70),
        (400, 500, 600, 700)
    ],
    ["col1", "col2", "col3", "col4"]
)

# Union by name with allowMissingColumns=True
df_union = df1_comm.unionByName(df2_comm, allowMissingColumns=True)

# Show final result
df_union.display()

col0,col1,col2,col3,col4
1.0,2,3,,
10.0,20,30,,
100.0,200,300,,
,4,5,6.0,7.0
,40,50,60.0,70.0
,400,500,600.0,700.0
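
The null-filling behaviour for partially overlapping columns can be modelled in a few lines of plain Python. This is a sketch under the same assumptions as before: `union_by_name_allow_missing` is a hypothetical helper operating on `(columns, rows)` pairs, with `None` standing in for null. It is not how Spark implements the operation.

```python
def union_by_name_allow_missing(left, right):
    """Model unionByName(..., allowMissingColumns=True) on (columns, rows) pairs."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    # Left columns first, then right-only columns appended at the end.
    out_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]

    def widen(cols, row):
        values = dict(zip(cols, row))
        # dict.get returns None for columns this side does not have.
        return tuple(values.get(c) for c in out_cols)

    rows = [widen(left_cols, r) for r in left_rows]
    rows += [widen(right_cols, r) for r in right_rows]
    return out_cols, rows


# First row of each DataFrame from the cell above.
left = (["col0", "col1", "col2"], [(1, 2, 3)])
right = (["col1", "col2", "col3", "col4"], [(4, 5, 6, 7)])

columns, rows = union_by_name_allow_missing(left, right)
print(columns)
print(rows)
```

The `1.0`-style values in the notebook output above are likely a rendering artifact of how nullable integer columns were exported, not a type change made by the union. This is an assumption that can be checked with `df_union.printSchema()`.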


**7) Union of two DataFrames with completely different columns**

In [0]:
df1_dff_compl = spark.createDataFrame([[0, 1, 2], [3, 4, 5], [8, 9, 3], [10, 13, 15]], ["col0", "col1", "col2"])
df2_dff_compl = spark.createDataFrame([[3, 4, 5], [12, 14, 17], [20, 22, 25], [31, 35, 39]], ["col3", "col4", "col5"])

df1_dff_compl.unionByName(df2_dff_compl, allowMissingColumns=True).display()

col0,col1,col2,col3,col4,col5
0.0,1.0,2.0,,,
3.0,4.0,5.0,,,
8.0,9.0,3.0,,,
10.0,13.0,15.0,,,
,,,3.0,4.0,5.0
,,,12.0,14.0,17.0
,,,20.0,22.0,25.0
,,,31.0,35.0,39.0
