[SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark

itholic · HyukjinKwon · commit 8bd3770552cc · 2020-09-08T09:41:02.000+09:00
### What changes were proposed in this pull request?

This PR proposes adding a new argument, `allowMissingColumns`, to `unionByName` so that users can specify whether missing columns are allowed.

### Why are the changes needed?

To expose the `allowMissingColumns` argument in the Python API as well. Currently it is only exposed in the Scala/Java APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new example with the new argument in the docstring.

### How was this patch tested?

Doctest added and manually tested:

```
$ python/run-tests --testnames pyspark.sql.dataframe
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.dataframe (35s)
Finished test(/.../python3): pyspark.sql.dataframe (35s)
Tests passed in 35 seconds
```

Closes #29657 from itholic/SPARK-32798.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
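For readers who want to reason about the new argument without a Spark session, the following is a plain-Python sketch of the semantics described in the docstring: columns are matched by name, columns missing on either side are filled with nulls (`None` here), and columns missing from the left side are appended at the end of the result schema. It is not Spark's implementation. The names `union_by_name` and `allow_missing_columns` are hypothetical, and the `ValueError` is only a stand-in for whatever error Spark reports when the column sets differ and the flag is off.

```python
def union_by_name(left_cols, left_rows, right_cols, right_rows,
                  allow_missing_columns=False):
    """Illustrative model of unionByName semantics on lists of tuples.

    Hypothetical helper for explanation only; not part of PySpark.
    """
    # Columns present on one side but absent on the other.
    missing_in_left = [c for c in right_cols if c not in left_cols]
    missing_in_right = [c for c in left_cols if c not in right_cols]

    if (missing_in_left or missing_in_right) and not allow_missing_columns:
        # Stand-in for the error Spark raises in this situation.
        raise ValueError(
            "column sets differ: %s" % sorted(missing_in_left + missing_in_right)
        )

    # The left schema comes first; columns only on the right go at the end.
    out_cols = list(left_cols) + missing_in_left

    def align(cols, rows):
        # Reorder each row to the output schema, filling gaps with None.
        index = {c: i for i, c in enumerate(cols)}
        return [
            tuple(row[index[c]] if c in index else None for c in out_cols)
            for row in rows
        ]

    return out_cols, align(left_cols, left_rows) + align(right_cols, right_rows)


# Same data as the doctest added in this PR.
cols, rows = union_by_name(
    ["col0", "col1", "col2"], [(1, 2, 3)],
    ["col1", "col2", "col3"], [(4, 5, 6)],
    allow_missing_columns=True,
)
```

With the doctest's inputs, `cols` and `rows` mirror the table shown in the new docstring example, with `None` in place of `null`.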
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
@@ -1548,7 +1548,7 @@ def unionAll(self, other):
         return self.union(other)
 
     @since(2.3)
-    def unionByName(self, other):
+    def unionByName(self, other, allowMissingColumns=False):
         """ Returns a new :class:`DataFrame` containing union of rows in this and another
         :class:`DataFrame`.
 
@@ -1567,8 +1567,28 @@ def unionByName(self, other):
         |   1|   2|   3|
         |   6|   4|   5|
         +----+----+----+
-        """
-        return DataFrame(self._jdf.unionByName(other._jdf), self.sql_ctx)
+
+        When the parameter `allowMissingColumns` is ``True``,
+        this function allows different set of column names between two :class:`DataFrame`\\s.
+        Missing columns at each side, will be filled with null values.
+        The missing columns at left :class:`DataFrame` will be added at the end in the schema
+        of the union result:
+
+        >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
+        >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
+        >>> df1.unionByName(df2, allowMissingColumns=True).show()
+        +----+----+----+----+
+        |col0|col1|col2|col3|
+        +----+----+----+----+
+        |   1|   2|   3|null|
+        |null|   4|   5|   6|
+        +----+----+----+----+
+
+        .. versionchanged:: 3.1.0
+           Added optional argument `allowMissingColumns` to specify whether to allow
+           missing columns.
+        """
+        return DataFrame(self._jdf.unionByName(other._jdf, allowMissingColumns), self.sql_ctx)
 
     @since(1.3)
     def intersect(self, other):