
Commit 8bd3770

itholic authored and HyukjinKwon committed
[SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark
### What changes were proposed in this pull request?

This PR proposes to add new argument `allowMissingColumns` to `unionByName` for allowing users to specify whether to allow missing columns or not.

### Why are the changes needed?

To expose `allowMissingColumns` argument in Python API also. Currently this is only exposed in Scala/Java APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new examples with new argument in the docstring.

### How was this patch tested?

Doctest added and manually tested

```
$ python/run-tests --testnames pyspark.sql.dataframe
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.dataframe (35s)
Finished test(/.../python3): pyspark.sql.dataframe (35s)
Tests passed in 35 seconds
```

Closes #29657 from itholic/SPARK-32798.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
1 parent c43460c commit 8bd3770

1 file changed: +23 -3 lines changed
python/pyspark/sql/dataframe.py

Lines changed: 23 additions & 3 deletions
@@ -1548,7 +1548,7 @@ def unionAll(self, other):
         return self.union(other)
 
     @since(2.3)
-    def unionByName(self, other):
+    def unionByName(self, other, allowMissingColumns=False):
         """ Returns a new :class:`DataFrame` containing union of rows in this and another
         :class:`DataFrame`.
 
@@ -1567,8 +1567,28 @@ def unionByName(self, other):
         |   1|   2|   3|
         |   6|   4|   5|
         +----+----+----+
-        """
-        return DataFrame(self._jdf.unionByName(other._jdf), self.sql_ctx)
+
+        When the parameter `allowMissingColumns` is ``True``,
+        this function allows different set of column names between two :class:`DataFrame`\\s.
+        Missing columns at each side, will be filled with null values.
+        The missing columns at left :class:`DataFrame` will be added at the end in the schema
+        of the union result:
+
+        >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
+        >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
+        >>> df1.unionByName(df2, allowMissingColumns=True).show()
+        +----+----+----+----+
+        |col0|col1|col2|col3|
+        +----+----+----+----+
+        |   1|   2|   3|null|
+        |null|   4|   5|   6|
+        +----+----+----+----+
+
+        .. versionchanged:: 3.1.0
+            Added optional argument `allowMissingColumns` to specify whether to allow
+            missing columns.
+        """
+        return DataFrame(self._jdf.unionByName(other._jdf, allowMissingColumns), self.sql_ctx)
 
     @since(1.3)
     def intersect(self, other):
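
The column-alignment behaviour described in the new docstring can be illustrated without a Spark session. The following is a minimal pure-Python sketch of the semantics only, not Spark's implementation: the `union_by_name` function, the `Table` alias, and the `allow_missing_columns` parameter are hypothetical names introduced here, and the strict-mode check is simplified to a comparison of column-name sets.

```python
from typing import Any, List, Tuple

# Hypothetical stand-in for a DataFrame: (column names, rows).
Table = Tuple[List[str], List[List[Any]]]


def union_by_name(left: Table, right: Table,
                  allow_missing_columns: bool = False) -> Table:
    """Toy model of unionByName semantics; this is not Spark code."""
    left_cols, left_rows = left
    right_cols, right_rows = right

    # Simplified strict mode: both sides must have the same column names.
    if not allow_missing_columns and set(left_cols) != set(right_cols):
        raise ValueError("column names differ and allow_missing_columns is False")

    # Left columns keep their order; columns found only on the right
    # are appended at the end of the result schema.
    out_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]

    def align(cols: List[str], rows: List[List[Any]]) -> List[List[Any]]:
        # Reorder each row by column name, filling absent columns with None.
        index = {name: i for i, name in enumerate(cols)}
        return [[row[index[c]] if c in index else None for c in out_cols]
                for row in rows]

    return out_cols, align(left_cols, left_rows) + align(right_cols, right_rows)


# Same inputs as the doctest added in this commit.
df1 = (["col0", "col1", "col2"], [[1, 2, 3]])
df2 = (["col1", "col2", "col3"], [[4, 5, 6]])
columns, rows = union_by_name(df1, df2, allow_missing_columns=True)
print(columns)
print(rows)
```

With the doctest inputs, the sketch yields the columns `col0, col1, col2, col3` and fills the missing cells with `None`, mirroring the `null` entries in the `show()` output above. With the default `allow_missing_columns=False`, mismatched column names raise an error in this model.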
