### `union` in PySpark behaves like UNION ALL in SQL: it combines two DataFrames and keeps duplicate rows.

In [0]:
# Sample DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie")], ["id", "name"])

In [0]:
# UNION ALL semantics: union() alone keeps duplicate rows
union_all_df = df1.union(df2)
print("UNION ALL Result:")
union_all_df.show()

UNION ALL Result:
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  2|    Bob|
|  3|Charlie|
+---+-------+



In [0]:
# UNION semantics: chain distinct() to remove duplicate rows
union_df = df1.union(df2).distinct()
print("UNION Result (with distinct):")
union_df.show()

UNION Result (with distinct):
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+



## Things to Note

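### Schemas must be compatible :- `union` matches columns by position, not by name, so both DataFrames must have the same number of columns, in the same order. The result takes its column names from the first DataFrame.

Column order matters :- If the columns are in a different order, values can end up under the wrong column without any error. `unionByName` matches columns by name instead.
Data types must be compatible :- Spark tries to resolve mismatched types to a common type. Depending on the types involved and the Spark version and settings, unioning a column of type int with a column of type string may raise an error or be implicitly cast, so it is safer to cast explicitly before the union.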

## Using Spark SQL

In [0]:
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

In [0]:
%sql
SELECT * FROM table1
UNION ALL
SELECT * FROM table2


id,name
1,Alice
2,Bob
2,Bob
3,Charlie


In [0]:
%sql
SELECT * FROM table1
UNION
SELECT * FROM table2


id,name
1,Alice
2,Bob
3,Charlie


In [0]:
# spark.sql() returns a DataFrame; show() returns None, so call it separately
union_all = spark.sql("""
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
""")
union_all.show()

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  2|    Bob|
|  3|Charlie|
+---+-------+



In [0]:
# Keep the DataFrame in the variable, then display it
union = spark.sql("""
SELECT * FROM table1
UNION
SELECT * FROM table2
""")
union.show()

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

