- Title: Union DataFrames in PySpark
- Slug: pyspark-dataframe-union
- Date: 2019-12-20
- Category: Computer Science
- Tags: programming, Python, HPC, high performance computing, PySpark, DataFrame, union
- Author: Ben Du

In [1]:
import pandas as pd
import findspark
# Spark home is symlinked to /opt/spark for convenience
findspark.init('/opt/spark')

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName('PySpark Example').enableHiveSupport().getOrCreate()

In [6]:
df_p = pd.DataFrame(data=[
    ["Ben", 2, 30],
    ["Dan", 4, 25],
    ["Will", 1, 26],
], columns=["name", "id", "age"])
df_p

   name  id  age
0   Ben   2   30
1   Dan   4   25
2  Will   1   26


In [9]:
df1 = spark.createDataFrame(df_p)
df1.show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
| Dan|  4| 25|
|Will|  1| 26|
+----+---+---+



In [12]:
df2 = df1.filter(col("age") >= 30)
df2.show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
+----+---+---+



In [13]:
df3 = df1.filter(col("name") == "Dan")
df3.show()

+----+---+---+
|name| id|age|
+----+---+---+
| Dan|  4| 25|
+----+---+---+



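Union two PySpark DataFrames.
Note that `pyspark.sql.DataFrame.union` keeps duplicate rows: since Spark 2.0 it behaves like SQL's `UNION ALL`, so chain `.distinct()` when you need a deduplicated result. Columns are matched by position, not by name.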

In [14]:
df1.union(df2).show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
| Dan|  4| 25|
|Will|  1| 26|
| Ben|  2| 30|
+----+---+---+



Union several PySpark DataFrames at once with `functools.reduce`, which applies `DataFrame.union` pairwise from left to right.

In [21]:
from functools import reduce

reduce(DataFrame.union, [df1, df2, df3]).show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
| Dan|  4| 25|
|Will|  1| 26|
| Ben|  2| 30|
| Dan|  4| 25|
+----+---+---+
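
To see why the `reduce` call works, the sketch below uses a hypothetical stand-in class `Rows` (not part of PySpark) whose `union` method appends rows the way `DataFrame.union` does. It only illustrates that `reduce(f, [a, b, c])` is the same as `f(f(a, b), c)`, that is, `a.union(b).union(c)`; it runs without Spark.

```python
from functools import reduce


class Rows:
    """Hypothetical stand-in for a DataFrame: just a list of row tuples."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        # Append the other rows and keep duplicates, like DataFrame.union.
        return Rows(self.rows + other.rows)


a = Rows([("Ben", 2, 30), ("Dan", 4, 25), ("Will", 1, 26)])
b = Rows([("Ben", 2, 30)])
c = Rows([("Dan", 4, 25)])

# reduce folds the list left to right: Rows.union(Rows.union(a, b), c)
folded = reduce(Rows.union, [a, b, c])

# The equivalent hand-written chain.
chained = a.union(b).union(c)
```

Because the fold is strictly left to right, the rows of the first DataFrame in the list come first and the rest follow in list order, which matches the output shown above.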



## References

https://stackoverflow.com/questions/37612622/spark-unionall-multiple-dataframes

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions