### Definition:
- union() and unionAll() combine two DataFrames that have the same schema in PySpark. Columns are matched by position, not by name.
- union() appends the rows of the second DataFrame to the rows of the first and keeps duplicates.
- unionAll() is equivalent to union() and also keeps duplicates. It is the older method name, was deprecated in Spark 2.0 in favor of union(), and the two behave the same in current versions.
- To mimic SQL UNION behavior (which removes duplicates), we use union().distinct().
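
The semantics listed above can be sketched without a Spark session by treating each DataFrame as a list of row tuples. This is only a plain-Python analogy, not PySpark code: `union_rows` and `distinct_rows` are hypothetical helper names, and the row values are borrowed from the sample data used in this notebook.

```python
# Plain-Python model of the union()/distinct() semantics; not PySpark code.
rows1 = [(1, "Dipankar", 27), (2, "Shankha", 26)]
rows2 = [(1, "Dipankar", 27), (2, "Akash", 29)]


def union_rows(left, right):
    """Model of union()/unionAll(): append rows and keep duplicates."""
    return left + right


def distinct_rows(rows):
    """Model of distinct(): drop repeated rows.

    First occurrences are kept in order here for readability;
    Spark itself gives no ordering guarantee after distinct().
    """
    return list(dict.fromkeys(rows))


combined = union_rows(rows1, rows2)
deduplicated = distinct_rows(combined)

print(len(combined), len(deduplicated))  # 4 3
```

The shared row `(1, "Dipankar", 27)` appears twice in `combined` and once in `deduplicated`, which is the same difference as between SQL UNION ALL and SQL UNION.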

In [3]:
# Import SparkSession and create (or reuse) a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkUnionFunction").getOrCreate()



In [4]:
# create sample dataframes
data1 = [
    (1, "Dipankar", 27),
    (2, "Shankha", 26),
    (3, "Tuhin, one of my higher secondary school friends, who is a cousin of another secondary school friend", 27)
]

df1 = spark.createDataFrame(data1).toDF("id", "name", "age")
df1.show()




+---+--------------------+---+
| id|                name|age|
+---+--------------------+---+
|  1|            Dipankar| 27|
|  2|             Shankha| 26|
|  3|Tuhin, one of my ...| 27|
+---+--------------------+---+




In [5]:
# create a second sample DataFrame that shares one row with df1
data2 = [
    (1, "Dipankar", 27), # duplicate row for demonstration
    (2, "Akash", 29),
    (3, "Soukarjya", 25)
]
columns = ["id", "name", "age"]

df2 = spark.createDataFrame(data2, schema=columns)
df2.show()


+---+---------+---+
| id|     name|age|
+---+---------+---+
|  1| Dipankar| 27|
|  2|    Akash| 29|
|  3|Soukarjya| 25|
+---+---------+---+




In [6]:
# union() Example - Combines rows (duplicates INCLUDED)
print("Result of union() (duplicates included): ")
df_union = df1.union(df2)
df_union.show()


Result of union() (duplicates included): 


+---+--------------------+---+
| id|                name|age|
+---+--------------------+---+
|  1|            Dipankar| 27|
|  2|             Shankha| 26|
|  3|Tuhin, one of my ...| 27|
|  1|            Dipankar| 27|
|  2|               Akash| 29|
|  3|           Soukarjya| 25|
+---+--------------------+---+




In [7]:
# union() + distinct() Example - Removes duplicates (like SQL UNION)
print("Result of union() with distinct() (duplicates removed): ")
df_union_distinct = df_union.distinct()
df_union_distinct.show()


Result of union() with distinct() (duplicates removed): 




+---+--------------------+---+
| id|                name|age|
+---+--------------------+---+
|  1|            Dipankar| 27|
|  2|             Shankha| 26|
|  3|Tuhin, one of my ...| 27|
|  2|               Akash| 29|
|  3|           Soukarjya| 25|
+---+--------------------+---+




In [8]:
# unionAll() Example - works the same as union() in modern PySpark
print("Result of unionAll() (duplicates included): ")
df_union_all = df1.unionAll(df2)
df_union_all.show()


Result of unionAll() (duplicates included): 


+---+--------------------+---+
| id|                name|age|
+---+--------------------+---+
|  1|            Dipankar| 27|
|  2|             Shankha| 26|
|  3|Tuhin, one of my ...| 27|
|  1|            Dipankar| 27|
|  2|               Akash| 29|
|  3|           Soukarjya| 25|
+---+--------------------+---+




In [9]:
# Row count comparison
print("Row count after union (with duplicates): ", df_union.count())
print("Row count after union with distinct(): ", df_union_distinct.count())
print("Row count after unionAll (duplicates included): ", df_union_all.count())



Row count after union (with duplicates):  6



Row count after union with distinct():  5




Row count after unionAll (duplicates included):  6

