Branch: master
Commits on Oct 10, 2019
  1. [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now

    HyukjinKwon committed Oct 10, 2019
    ### What changes were proposed in this pull request?
    
    This PR proposes to use Arrow R 0.14.1 for now in AppVeyor to make the tests pass.
    
    ### Why are the changes needed?
    
    To make the build pass with Arrow. Setting `ARROW_PRE_0_15_IPC_FORMAT` to `1` for Arrow R 0.15 compatibility does not work.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    AppVeyor
    
    Closes #26041 from HyukjinKwon/investigate.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Oct 4, 2019
  1. [SPARK-29286][PYTHON][TESTS] Uses UTF-8 with 'replace' on errors at Python testing script

    HyukjinKwon authored and dongjoon-hyun committed Oct 4, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to let Python 2 use UTF-8, instead of ASCII, permissively replacing non-UTF-8 bytes with replacement characters in the Python testing script.
    
    ### Why are the changes needed?
    
    When Python 2 is used to run the Python testing script with `decode(encoding='ascii')`, it fails whenever non-ASCII characters are printed out.
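
    A minimal, self-contained sketch of the decoding behaviour described above (the byte string is illustrative, not taken from the actual testing script):

    ```python
    raw = b"ok \xe2\x98\x83 plus an invalid byte \xff"

    # Old behaviour (sketch): strict ASCII decoding fails on any non-ASCII byte.
    try:
        raw.decode(encoding="ascii")
    except UnicodeDecodeError as e:
        print("ascii decode failed:", e)

    # New behaviour (sketch): UTF-8 with 'replace' never raises; bytes that cannot
    # be decoded become the Unicode replacement character instead.
    print(raw.decode(encoding="utf-8", errors="replace"))
    ```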
    
    ### Does this PR introduce any user-facing change?
    
    For developers, it enables printing out non-ASCII characters.
    
    ### How was this patch tested?
    
    Jenkins will exercise it through our existing test code. Also, manually tested with UTF-8 output.
    
    Closes #26021 from HyukjinKwon/SPARK-29286.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Oct 3, 2019
  1. [SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build)

    HyukjinKwon committed Oct 3, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes:
    
    1. To use `is.data.frame` to check if it is a DataFrame.
    2. To install Arrow and test Arrow optimization in the AppVeyor build. We're currently not testing this in CI.
    
    ### Why are the changes needed?
    
    1. To support SparkR with Arrow 0.14
    2. To check if there's any regression and if it works correctly.
    
    ### Does this PR introduce any user-facing change?
    
    ```r
    df <- createDataFrame(mtcars)
    collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
    ```
    
    **Before:**
    
    ```
    Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
      invalid 'n' argument
    ```
    
    **After:**
    
    ```
       gear
    1     5
    2     5
    3     5
    4     4
    5     4
    6     4
    7     4
    8     5
    9     5
    ...
    ```
    
    ### How was this patch tested?
    
    AppVeyor
    
    Closes #25993 from HyukjinKwon/arrow-r-appveyor.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
  2. [SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan

    HyukjinKwon committed Oct 3, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to avoid the abstract classes introduced at #24965 and instead use a trait and an object (a rough analogy is sketched after the list below).
    
    - `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in
    
        **Before:**
    
        ```
        BasePythonRunner
        ├── BaseArrowPythonRunner
        │   ├── ArrowPythonRunner
        │   └── CoGroupedArrowPythonRunner
        ├── PythonRunner
        └── PythonUDFRunner
        ```
    
        **After:**
    
        ```
        └── BasePythonRunner
            ├── ArrowPythonRunner
            ├── CoGroupedArrowPythonRunner
            ├── PythonRunner
            └── PythonUDFRunner
        ```
    - `abstract class BasePandasGroupExec ` -> `object PandasGroupUtils` to decouple
    
        **Before:**
    
        ```
        └── BasePandasGroupExec
            ├── FlatMapGroupsInPandasExec
            └── FlatMapCoGroupsInPandasExec
        ```
    
        **After:**
    
        ```
        ├── FlatMapGroupsInPandasExec
        └── FlatMapCoGroupsInPandasExec
        ```
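
    A rough Python analogy of the mix-in restructuring above (illustrative only; the actual change is in Scala and the names here merely echo it):

    ```python
    class BasePythonRunner:
        """Common runner logic shared by all runners."""

    class ArrowOutputMixin:
        """Plays the role of the `PythonArrowOutput` trait: shared Arrow-output
        behaviour mixed into runners without an extra abstract base class."""
        def read_arrow_batches(self):
            return []

    # Instead of an intermediate abstract class between BasePythonRunner and the
    # Arrow runners, each Arrow runner mixes the behaviour in directly, so every
    # runner stays a direct child of BasePythonRunner.
    class ArrowRunner(BasePythonRunner, ArrowOutputMixin):
        pass

    class CoGroupedArrowRunner(BasePythonRunner, ArrowOutputMixin):
        pass
    ```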
    
    ### Why are the changes needed?
    
    The motivation is that the R code path is being matched with the Python side:
    
    **Python:**
    
    ```
    └── BasePythonRunner
        ├── ArrowPythonRunner
        ├── CoGroupedArrowPythonRunner
        ├── PythonRunner
        └── PythonUDFRunner
    ```
    
    **R:**
    
    ```
    └── BaseRRunner
        ├── ArrowRRunner
        └── RRunner
    ```
    
    I would like to match the hierarchy and decouple the other pieces for now if possible. Ideally we should deduplicate both code paths; the internal implementations are intentionally similar.
    
    The `BasePandasGroupExec` case is similar. R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs:
    
    `FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec`
    `MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec`
    
    In order to prepare for deduplication here as well, it is better to avoid changing the hierarchy on the Python side alone.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Locally tested existing tests. Jenkins tests should verify this too.
    
    Closes #25989 from HyukjinKwon/SPARK-29317.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Sep 27, 2019
  1. [SPARK-29240][PYTHON] Pass Py4J column instance to support PySpark column in element_at function

    HyukjinKwon authored and dongjoon-hyun committed Sep 27, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR makes `element_at` in PySpark able to take PySpark `Column` instances.
    
    ### Why are the changes needed?
    
    To match the Scala side. It seems this was intended but, due to a bug, was not working correctly.
    
    ### Does this PR introduce any user-facing change?
    
    Yes. See below:
    
    ```python
    from pyspark.sql import functions as F
    x = spark.createDataFrame([([1,2,3],1),([4,5,6],2),([7,8,9],3)],['list','num'])
    x.withColumn('aa',F.element_at('list',x.num.cast('int'))).show()
    ```
    
    Before:
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/functions.py", line 2059, in element_at
        return Column(sc._jvm.functions.element_at(_to_java_column(col), extraction))
      File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
      File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
      File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
      File "/.../forked/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
      File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
        raise TypeError("Column is not iterable")
    TypeError: Column is not iterable
    ```
    
    After:
    
    ```
    +---------+---+---+
    |     list|num| aa|
    +---------+---+---+
    |[1, 2, 3]|  1|  1|
    |[4, 5, 6]|  2|  5|
    |[7, 8, 9]|  3|  9|
    +---------+---+---+
    ```
    
    ### How was this patch tested?
    
    Manually tested against literal, Python native types, and PySpark column.
    
    Closes #25950 from HyukjinKwon/SPARK-29240.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Sep 22, 2019
  1. [SPARK-27463][PYTHON][FOLLOW-UP] Run the tests of Cogrouped pandas UDF

    HyukjinKwon committed Sep 22, 2019
    ### What changes were proposed in this pull request?
    This is a followup for #24981
    It seems we mistakenly didn't add `test_pandas_udf_cogrouped_map` to `modules.py`, so we don't have official test results against that PR.
    
    ```
    ...
    Starting test(python3.6): pyspark.sql.tests.test_pandas_udf
    ...
    Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_agg
    ...
    Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_map
    ...
    Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_scalar
    ...
    Starting test(python3.6): pyspark.sql.tests.test_pandas_udf_window
    Finished test(python3.6): pyspark.sql.tests.test_pandas_udf (21s)
    ...
    Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_map (49s)
    ...
    Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_window (58s)
    ...
    Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_scalar (82s)
    ...
    Finished test(python3.6): pyspark.sql.tests.test_pandas_udf_grouped_agg (105s)
    ...
    ```
    
    If tests fail, we should revert that PR.
    
    ### Why are the changes needed?
    
    Relevant tests should be run.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Jenkins tests.
    
    Closes #25890 from HyukjinKwon/SPARK-28840.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Sep 20, 2019
  1. [SPARK-29158][SQL][FOLLOW-UP] Create an actual test case under `src/test` and minor documentation correction

    HyukjinKwon authored and dongjoon-hyun committed Sep 20, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR is a followup of #25838 and proposes to create an actual test case under `src/test`. Previously, a compile-only test existed under `src/main`.
    
    Also, the wording in `SerializableConfiguration` was changed to describe only what it does (other words were removed).
    
    ### Why are the changes needed?
    
    Test code should live in `src/test`, not `src/main`. It should also test basic functionality.
    
    ### Does this PR introduce any user-facing change?
    
    No, except a minor doc change.
    
    ### How was this patch tested?
    
    Unit test was added.
    
    Closes #25867 from HyukjinKwon/SPARK-29158.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Sep 15, 2019
  1. [SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tab in Spark documentation

    HyukjinKwon authored and dongjoon-hyun committed Sep 15, 2019
    
    ### What changes were proposed in this pull request?
    
    Currently, there is no migration section for PySpark, Spark Core, and Structured Streaming.
    It is difficult for users to know what to do when they upgrade.
    
    This PR proposes to create a "Migration Guide" tab in the Spark documentation.
    
    ![Screen Shot 2019-09-11 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/64688126-ad712f80-d4c6-11e9-8672-9a2c56c05bf8.png)
    
    ![Screen Shot 2019-09-11 at 7 27 15 PM](https://user-images.githubusercontent.com/6477701/64689915-389ff480-d4ca-11e9-8c54-7f46095d0d23.png)
    
    This page will contain migration guides for Spark SQL, PySpark, SparkR, MLlib, Structured Streaming and Core. Basically it is a refactoring.
    
    There is some new information added; I will leave inline comments for easier review.
    
    1. **MLlib**
      Merge [ml-guide.html#migration-guide](https://spark.apache.org/docs/latest/ml-guide.html#migration-guide) and [ml-migration-guides.html](https://spark.apache.org/docs/latest/ml-migration-guides.html)
    
        ```
        'docs/ml-guide.md'
                ↓ Merge new/old migration guides
        'docs/ml-migration-guide.md'
        ```
    
    2. **PySpark**
      Extract PySpark specific items from https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
    
        ```
        'docs/sql-migration-guide-upgrade.md'
               ↓ Extract PySpark specific items
        'docs/pyspark-migration-guide.md'
        ```
    
    3. **SparkR**
      Move [sparkr.html#migration-guide](https://spark.apache.org/docs/latest/sparkr.html#migration-guide) into a separate file, and extract from [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html)
    
        ```
        'docs/sparkr.md'                     'docs/sql-migration-guide-upgrade.md'
         Move migration guide section ↘     ↙ Extract SparkR specific items
                         docs/sparkr-migration-guide.md
        ```
    
    4. **Core**
      Newly created at `'docs/core-migration-guide.md'`. I skimmed resolved JIRAs at 3.0.0 and found some items to note.
    
    5. **Structured Streaming**
      Newly created at `'docs/ss-migration-guide.md'`. I skimmed resolved JIRAs at 3.0.0 and found some items to note.
    
    6. **SQL**
      Merged [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html) and [sql-migration-guide-hive-compatibility.html](https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html)
        ```
        'docs/sql-migration-guide-hive-compatibility.md'     'docs/sql-migration-guide-upgrade.md'
         Move Hive compatibility section ↘                   ↙ Left over after filtering PySpark and SparkR items
                                      'docs/sql-migration-guide.md'
        ```
    
    ### Why are the changes needed?
    
    To help users in production migrate to higher versions effectively, and to detect behaviour changes or breaking changes before upgrading and/or migrating.
    
    ### Does this PR introduce any user-facing change?
    Yes, this changes Spark's documentation at https://spark.apache.org/docs/latest/index.html.
    
    ### How was this patch tested?
    
    Manually built the doc. It can be verified as below:
    
    ```bash
    cd docs
    SKIP_API=1 jekyll build
    open _site/index.html
    ```
    
    Closes #25757 from HyukjinKwon/migration-doc.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Sep 11, 2019
  1. [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type

    HyukjinKwon committed Sep 11, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to allow `bytes` as an acceptable type for binary type for `createDataFrame`.
    
    ### Why are the changes needed?
    
    `bytes` is a standard type for binary data in Python. This should be respected on the PySpark side.
    
    ### Does this PR introduce any user-facing change?
    
    Yes, _when the specified type is binary_, we will allow `bytes` as a binary type. Previously this was not allowed in either Python 2 or Python 3, as shown below:
    
    ```python
    spark.createDataFrame([[b"abcd"]], "col binary")
    ```
    
    in Python 3
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
        data = list(data)
      File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
        verify_func(obj)
      File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
        verifier(v)
      File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
        verify_acceptable_types(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types
        % (dataType, obj, type(obj))))
    TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'>
    ```
    
    in Python 2:
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
        data = list(data)
      File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
        verify_func(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
        verifier(v)
      File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
        verify_acceptable_types(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types
        % (dataType, obj, type(obj))))
    TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'>
    ```
    
    So, it won't break anything.
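
    For reference, a minimal sketch of the intended behaviour after this change, run in an existing PySpark session:

    ```python
    # After this change, bytes is accepted for a binary column (sketch).
    df = spark.createDataFrame([[b"abcd"]], "col binary")
    df.printSchema()  # col: binary (nullable = true)
    ```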
    
    ### How was this patch tested?
    
    Unit tests were added, and it was also manually tested as below.
    
    ```bash
    ./run-tests --python-executables=python2,python3 --testnames "pyspark.sql.tests.test_serde"
    ```
    
    Closes #25749 from HyukjinKwon/SPARK-29041.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Sep 5, 2019
  1. [SPARK-28272][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base

    HyukjinKwon committed Sep 5, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base.
    
    <details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
    <p>
    
    ```diff
    diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
    index f102383cb4d..eff33f280cf 100644
    --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
    +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
     -3,7 +3,7
    
     -- !query 0
    -select max(min(unique1)) from tenk1
    +select udf(max(min(unique1))) from tenk1
     -- !query 0 schema
     struct<>
     -- !query 0 output
     -12,11 +12,11  It is not allowed to use an aggregate function in the argument of another aggreg
    
     -- !query 1
    -select (select count(*)
    -        from (values (1)) t0(inner_c))
    +select udf((select udf(count(*))
    +        from (values (1)) t0(inner_c))) as col
     from (values (2),(3)) t1(outer_c)
     -- !query 1 schema
    -struct<scalarsubquery():bigint>
    +struct<col:bigint>
     -- !query 1 output
     1
     1
    ```
    
    </p>
    </details>
    
    ### Why are the changes needed?
    
    To improve test coverage in UDFs.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually tested via:
    
    ```bash
     build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql"
    ```
    
    as guided in https://issues.apache.org/jira/browse/SPARK-27921
    
    Closes #25676 from HyukjinKwon/SPARK-28272.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
  2. [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base

    HyukjinKwon committed Sep 5, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base.
    
    <details><summary>Diff comparing to 'pgSQL/aggregates_part4.sql'</summary>
    <p>
    
    ```diff
    ```
    
    </p>
    </details>
    
    ### Why are the changes needed?
    
    To improve test coverage in UDFs.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually tested via:
    
    ```bash
     build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql"
    ```
    
    as guided in https://issues.apache.org/jira/browse/SPARK-27921
    
    Closes #25677 from HyukjinKwon/SPARK-28971.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Sep 3, 2019
  1. [SPARK-28946][R][DOCS] Add some more information about building SparkR on Windows

    HyukjinKwon committed Sep 3, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR adds three more pieces of information:
    
    - Mentions that `bash` on `PATH` is required to build.
    - Specifies supported JDK and Maven versions.
    - Explicitly mentions that building on Windows is not officially supported.
    
    ### Why are the changes needed?
    
    To enable SparkR developers to work on Windows, and to describe what is needed for the AppVeyor build.
    
    ### Does this PR introduce any user-facing change?
    
    No. It just adds some information to `R/WINDOWS.md`.
    
    ### How was this patch tested?
    
    This is already being tested this way in AppVeyor. I also tested it this way myself (though long ago).
    
    Closes #25647 from HyukjinKwon/SPARK-28946.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Sep 2, 2019
  1. Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"

    HyukjinKwon committed Sep 2, 2019
    This reverts commit 3821d75.
Commits on Aug 30, 2019
  1. [SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results

    HyukjinKwon authored and dongjoon-hyun committed Aug 30, 2019
    
    ### What changes were proposed in this pull request?
    
    See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/
    
    ![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png)
    
    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475">
        ...
        <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703">
        </testcase>
        <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442">
        </testcase>
        <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33">
        </testcase>
        <system-out/>
        <system-err/>
    </testsuite>
    ```
    
    The root cause seems to be a bug in SBT: it truncates the test name based on the last dot.
    
    sbt/sbt#2949
    https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79
    
    I tried to find a better way but couldn't find one. Therefore, this PR proposes a workaround by appending the test file name to the assertion log:
    
    ```diff
      [info] - inner-join.sql *** FAILED *** (4 seconds, 306 milliseconds)
    + [info]   inner-join.sql
      [info]   Expected "1	a
      [info]   1	a
      [info]   1	b
      [info]   1[]", but got "1	a
      [info]   1	a
      [info]   1	b
      [info]   1[	b]" Result did not match for query #6
      [info]   SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377)
      [info]   org.scalatest.exceptions.TestFailedException:
      [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
    ```
    
    It will at least save us from searching the full logs to identify which test file failed after clicking the failed test.
    
    Note that this PR does not fully fix the issue but only fixes the logs for the failed tests.
    
    ### Why are the changes needed?
    To make debugging Jenkins logs easier. Otherwise, we have to open the full logs and search for which test failed.
    
    ### Does this PR introduce any user-facing change?
    It will print out the file name of failed tests in Jenkins' test reports.
    
    ### How was this patch tested?
    Manually tested but Jenkins tests are required in this PR.
    
    Now it at least shows which file it is:
    
    ![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png)
    
    Closes #25630 from HyukjinKwon/SPARK-28894-1.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
  2. [SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.2.0 and fix build profile on AppVeyor

    HyukjinKwon authored and dongjoon-hyun committed Aug 30, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to upgrade scala-maven-plugin from 3.4.4 to 4.2.0.
    
    The upgrade to 4.1.1 was reverted due to an unexpected build failure on AppVeyor.
    
    The root cause seems to be an issue specific to AppVeyor: loading the system library 'kernel32.dll' seems to be failing.
    
    ```
    Suppressed: java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.platform.win32.Kernel32
            at sbt.internal.io.WinMilli$.getHandle(Milli.scala:264)
            at sbt.internal.io.WinMilli$.getModifiedTimeNative(Milli.scala:289)
            at sbt.internal.io.WinMilli$.getModifiedTimeNative(Milli.scala:260)
            at sbt.internal.io.MilliNative.getModifiedTime(Milli.scala:61)
            at sbt.internal.io.Milli$.getModifiedTime(Milli.scala:360)
            at sbt.io.IO$.$anonfun$getModifiedTimeOrZero$1(IO.scala:1373)
            at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
            at sbt.internal.io.Retry$.liftedTree2$1(Retry.scala:38)
            at sbt.internal.io.Retry$.impl$1(Retry.scala:38)
            at sbt.internal.io.Retry$.apply(Retry.scala:52)
            at sbt.internal.io.Retry$.apply(Retry.scala:24)
            at sbt.io.IO$.getModifiedTimeOrZero(IO.scala:1373)
            at sbt.internal.inc.caching.ClasspathCache$.fromCacheOrHash$1(ClasspathCache.scala:44)
            at sbt.internal.inc.caching.ClasspathCache$.$anonfun$hashClasspath$1(ClasspathCache.scala:53)
            at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:659)
            at scala.collection.parallel.Task.$anonfun$tryLeaf$1(Tasks.scala:53)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:67)
            at scala.collection.parallel.Task.tryLeaf(Tasks.scala:56)
            at scala.collection.parallel.Task.tryLeaf$(Tasks.scala:50)
            at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
            at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.internal(Tasks.scala:170)
            ... 25 more
    ```
    
    By setting `-Djna.nosys=true`, the library is loaded directly from the jar instead of from the system.
    
    In this way, the build seems to work fine.
    
    ### Why are the changes needed?
    
    It upgrades the plugin to fix bugs and fixes the CI build.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    It was tested at #25497
    
    Closes #25633 from HyukjinKwon/SPARK-28759.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Aug 28, 2019
  1. [SPARK-28881][PYTHON][TESTS][FOLLOW-UP] Use SparkSession(SparkContext(...)) to prevent Spark conf from affecting other tests

    HyukjinKwon committed Aug 28, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to match the test with branch-2.4. See #25593 (comment)
    
    It seems using `SparkSession.builder` with a Spark conf can affect other tests, as sketched below.
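
    A minimal sketch of the pattern in the title, using an illustrative conf key rather than the exact test code:

    ```python
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # Build the session from an explicit SparkContext so the conf stays scoped to
    # this test and is torn down cleanly afterwards.
    conf = SparkConf().set("spark.driver.maxResultSize", "1m")  # illustrative conf
    sc = SparkContext("local[2]", "conf-scoped-test", conf=conf)
    spark = SparkSession(sc)
    try:
        pass  # run the test assertions against `spark` here
    finally:
        sc.stop()  # stopping the context keeps the conf from leaking into other tests
    ```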
    
    ### Why are the changes needed?
    To match branch-2.4 and to make backporting easier.
    
    ### Does this PR introduce any user-facing change?
    No.
    
    ### How was this patch tested?
    Test was fixed.
    
    Closes #25603 from HyukjinKwon/SPARK-28881-followup.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Aug 27, 2019
  1. [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize

    HyukjinKwon committed Aug 27, 2019
    
    ### What changes were proposed in this pull request?
    This PR proposes to add a test case for:
    
    ```bash
    ./bin/pyspark --conf spark.driver.maxResultSize=1m
    spark.conf.set("spark.sql.execution.arrow.enabled",True)
    ```
    
    ```python
    spark.range(10000000).toPandas()
    ```
    
    ```
    Empty DataFrame
    Columns: [id]
    Index: []
    ```
    
    which can result in partial results (see #25593 (comment)). This regression was found between Spark 2.3 and Spark 2.4, and was later accidentally fixed.
    
    ### Why are the changes needed?
    To prevent the same regression in the future.
    
    ### Does this PR introduce any user-facing change?
    No.
    
    ### How was this patch tested?
    Test was added.
    
    Closes #25594 from HyukjinKwon/SPARK-28881.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Aug 23, 2019
  1. [SPARK-28839][CORE] Avoids NPE in context cleaner when dynamic allocation and shuffle service are on

    HyukjinKwon authored and vanzin committed Aug 23, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to avoid throwing an NPE in the context cleaner when the shuffle service is on; it is kind of a small followup of #24817.
    
    It seems `shuffleIds` is set to `null` when the service is on, so shuffles are not tracked. Later, `removeShuffle` tries to remove an element from `shuffleIds`, which leads to an NPE. This PR fixes it by explicitly not sending the event (`ShuffleCleanedEvent`) in this case.
    
    See the code path below:
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/SparkContext.scala#L584
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L125
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L190
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L220-L230
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L353-L357
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L347
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L400-L406
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L475
    
    https://github.com/apache/spark/blob/cbad616d4cb0c58993a88df14b5e30778c7f7e85/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L427
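
    A rough, illustrative analogy of the guard described above, in Python pseudocode (class and method names are not the actual Scala ones):

    ```python
    class ShuffleTracker:
        def __init__(self, shuffle_service_enabled: bool):
            # When the external shuffle service is on, shuffle IDs are not tracked.
            self.shuffle_ids = None if shuffle_service_enabled else set()

        def register_shuffle(self, shuffle_id: int) -> None:
            if self.shuffle_ids is not None:
                self.shuffle_ids.add(shuffle_id)

        def clean_shuffle(self, shuffle_id: int) -> None:
            # The fix: skip emitting the cleaned event when nothing is tracked,
            # instead of dereferencing the untracked (None) state and failing.
            if self.shuffle_ids is None:
                return
            self.shuffle_ids.discard(shuffle_id)
    ```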
    
    ### Why are the changes needed?
    
    This is a bug fix.
    
    ### Does this PR introduce any user-facing change?
    
    It prevents the exception:
    
    ```
    19/08/21 06:44:01 ERROR AsyncEventQueue: Listener ExecutorMonitor threw an exception
    java.lang.NullPointerException
    	at org.apache.spark.scheduler.dynalloc.ExecutorMonitor$Tracker.removeShuffle(ExecutorMonitor.scala:479)
    	at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2(ExecutorMonitor.scala:408)
    	at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.$anonfun$cleanupShuffle$2$adapted(ExecutorMonitor.scala:407)
    	at scala.collection.Iterator.foreach(Iterator.scala:941)
    	at scala.collection.Iterator.foreach$(Iterator.scala:941)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    	at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.cleanupShuffle(ExecutorMonitor.scala:407)
    	at org.apache.spark.scheduler.dynalloc.ExecutorMonitor.onOtherEvent(ExecutorMonitor.sc
    ```
    
    ### How was this patch tested?
    
    A unit test was added.
    
    Closes #25551 from HyukjinKwon/SPARK-28839.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Commits on Aug 19, 2019
  1. Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"

    HyukjinKwon committed Aug 19, 2019
    This reverts commit 1819a6f.
  2. [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions

    HyukjinKwon committed Aug 19, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to specify minimum and maximum Java versions (see https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-portable-packages).
    
    There seems to be no standard way to specify both, given the documentation and other packages (see https://gist.github.com/glin/bd36cf1eb0c7f8b1f511e70e2fb20f8d).
    
    I found two ways from existing packages on CRAN.
    
    ```
    Package (<= 1 & > 2)
    Package (<= 1, > 2)
    ```
    
    The latter seems closer to other standard notations such as `R (>= 2.14.0), R (>= r56550)`. So I have chosen the latter way.
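
    For illustration, such a constraint could look like the following line in the package `DESCRIPTION` file (the exact field name and bounds here are assumptions, not necessarily what this PR adds):

    ```
    SystemRequirements: Java (>= 8, < 12)
    ```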
    
    ### Why are the changes needed?
    
    It seems the package might otherwise be rejected by CRAN. See #25472 (comment).
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    JDK 8
    
    ```bash
    ./build/mvn -DskipTests -Psparkr clean package
    ./R/run-tests.sh
    
    ...
    basic tests for CRAN: .............
    ...
    ```
    
    JDK 11
    
    ```bash
    ./build/mvn -DskipTests -Psparkr -Phadoop-3.2 clean package
    ./R/run-tests.sh
    
    ...
    basic tests for CRAN: .............
    ...
    ```
    
    Closes #25490 from HyukjinKwon/SPARK-28756.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Aug 16, 2019
  1. [SPARK-28755][R][TESTS] Increase tolerance in 'spark.mlp' SparkR test for JDK 11

    HyukjinKwon authored and dongjoon-hyun committed Aug 16, 2019
    
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to increase the tolerance for the exact value comparison in the `spark.mlp` test. I don't know the root cause, but some tolerance is already expected. I suspect it is not a big deal considering that all other tests pass.
    
    The values are fairly close:
    
    JDK 8:
    
    ```
    -24.28415, 107.8701, 16.86376, 1.103736, 9.244488
    ```
    
    JDK 11:
    
    ```
    -24.33892, 108.0316, 16.89082, 1.090723, 9.260533
    ```
    
    ### Why are the changes needed?
    
    To fully support JDK 11. See, for instance, #25443 and #25423 for ongoing efforts.
    
    ### Does this PR introduce any user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually tested on top of #25472 with JDK 11:
    
    ```bash
    ./build/mvn -DskipTests -Psparkr -Phadoop-3.2 package
    ./bin/sparkR
    ```
    
    ```R
    absoluteSparkPath <- function(x) {
      sparkHome <- sparkR.conf("spark.home")
      file.path(sparkHome, x)
    }
    df <- read.df(absoluteSparkPath("data/mllib/sample_multiclass_classification_data.txt"),
                  source = "libsvm")
    model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 5, 4, 3),
                       solver = "l-bfgs", maxIter = 100, tol = 0.00001, stepSize = 1, seed = 1)
    summary <- summary(model)
    head(summary$weights, 5)
    ```
    
    Closes #25478 from HyukjinKwon/SPARK-28755.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
  2. [SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11

    HyukjinKwon committed Aug 16, 2019
    
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix both tests below:
    
    ```
    ======================================================================
    FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction
        self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4))
    AssertionError: False is not true
    ```
    
    ```
    File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 386, in __main__.GaussianMixtureModel
    Failed example:
        abs(softPredicted[0] - 1.0) < 0.001
    Expected:
        True
    Got:
        False
    **********************************************************************
    File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 388, in __main__.GaussianMixtureModel
    Failed example:
        abs(softPredicted[1] - 0.0) < 0.001
    Expected:
        True
    Got:
        False
    ```
    
    to pass in JDK 11.
    
    The root cause seems to be float values being interpreted differently via Py4J. This issue was also found in #25132 before.
    
    When floats are transferred from Python to the JVM, the values are sent as-is. Python floats are not "precise" due to their own limitation: https://docs.python.org/3/tutorial/floatingpoint.html.
    For some reason, the floats from Python on JDK 8 and JDK 11 are different, which is already explicitly not guaranteed.
    
    This seems to be why only some PySpark tests involving floats fail.
    
    So, this PR fixes it by increasing the tolerance in the identified PySpark test cases, as sketched below.
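
    A minimal sketch of both points, assuming NumPy is available; the arrays and tolerances below are illustrative, not the actual test values:

    ```python
    import numpy as np

    # Python floats are inherently imprecise in their binary representation.
    print(0.1 + 0.2 == 0.3)  # False

    # Comparing with a tolerance instead of exact equality absorbs such small
    # representation differences across environments.
    jdk8_like = np.array([-24.28415, 107.8701, 16.86376])
    jdk11_like = np.array([-24.33892, 108.0316, 16.89082])
    print(np.allclose(jdk8_like, jdk11_like, atol=1e-4))  # False: tolerance too tight
    print(np.allclose(jdk8_like, jdk11_like, atol=0.5))   # True: relaxed tolerance
    ```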
    
    ### Why are the changes needed?
    
    To fully support JDK 11. See, for instance, #25443 and #25423 for ongoing efforts.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually tested as described in JIRAs:
    
    ```
    $ build/sbt -Phadoop-3.2 test:package
    $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python
    ```
    
    ```
    $ build/sbt -Phadoop-3.2 test:package
    $ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables python
    ```
    
    Closes #25475 from HyukjinKwon/SPARK-28735.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
  3. [SPARK-28578][INFRA] Improve Github pull request template

    HyukjinKwon committed Aug 16, 2019
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to improve the Github template for better and faster review iterations and better interactions between PR authors and reviewers.
    
    As suggested in the [dev mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-New-sections-in-Github-Pull-Request-description-template-td27527.html), this PR refers to the [Kubernetes PR template](https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md).
    
    Therefore, those fields are newly added:
    
    ```
    ### Why are the changes needed?
    ### Does this PR introduce any user-facing change?
    ```
    
    and some comments were added.
    
    ### Why are the changes needed?
    
    Currently, many PR descriptions are poorly formatted, which causes overhead between PR authors and reviewers.
    
    There are multiple problems caused by those poorly formatted PR descriptions:
    
    - Some PRs still have a single-line description for ±500 lines of code changes in a critical path.
    - Some PRs do not describe behaviour changes, and reviewers need to find and document them.
    - Some PRs are hard to review without an outline, but sometimes none is provided.
    - Spark is getting old and sometimes we need to trace the history deeply. Due to poorly formatted PR descriptions, it can take reading the entire code of entire commit histories to find the root cause of a bug.
    - Reviews take a while, but the number of PRs keeps growing.
    
    This PR aims to alleviate these problems.
    
    ### Does this PR introduce any user-facing change?
    
    Yes, it changes the PR template shown when PRs are opened. This PR itself uses the template it proposes.
    
    ### How was this patch tested?
    
    Manually tested via Github preview feature.
    
    Closes #25310 from HyukjinKwon/SPARK-28578.
    
    Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
    Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Aug 8, 2019
  1. [SPARK-28654][SQL] Move "Extract Python UDFs" to the last in optimizer

    HyukjinKwon authored and cloud-fan committed Aug 8, 2019
    ## What changes were proposed in this pull request?
    
    Plans after "Extract Python UDFs" are very flaky and error-prone to other rules.
    
    For instance, if we add some rules, for instance, `PushDownPredicates` in `postHocOptimizationBatches`, the test in `BatchEvalPythonExecSuite` fails:
    
    ```scala
    test("Python UDF refers to the attributes from more than one child") {
      val df = Seq(("Hello", 4)).toDF("a", "b")
      val df2 = Seq(("Hello", 4)).toDF("c", "d")
      val joinDF = df.crossJoin(df2).where("dummyPythonUDF(a, c) == dummyPythonUDF(d, c)")
      val qualifiedPlanNodes = joinDF.queryExecution.executedPlan.collect {
        case b: BatchEvalPythonExec => b
      }
      assert(qualifiedPlanNodes.size == 1)
    }
    ```
    
    ```
    Invalid PythonUDF dummyUDF(a#63, c#74), requires attributes from more than one child.
    ```
    
    This is because Python UDF extraction optimization is rolled back as below:
    
    ```
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates ===
    !Filter (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))   Join Cross, (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))
    !+- Join Cross                                         :- Project [_1#2 AS a#7, _2#3 AS b#8]
    !   :- Project [_1#2 AS a#7, _2#3 AS b#8]              :  +- LocalRelation [_1#2, _2#3]
    !   :  +- LocalRelation [_1#2, _2#3]                   +- Project [_1#13 AS c#18, _2#14 AS d#19]
    !   +- Project [_1#13 AS c#18, _2#14 AS d#19]             +- LocalRelation [_1#13, _2#14]
    !      +- LocalRelation [_1#13, _2#14]
    ```
    
    It seems we should handle the Python UDF cases last, even after the post-hoc rules.
    
    Note that this actually follows the approach of previous versions, when those were physical-plan rules (see SPARK-24721 and SPARK-12981). Those optimization rules were supposed to be placed at the end.
    
    Note that I intentionally didn't move `ExperimentalMethods` (`spark.experimental.extraStrategies`). This is an explicitly experimental API, and I wanted to keep it as a just-in-case workaround after this change for now.
    
    ## How was this patch tested?
    
    Existing tests should cover.
    
    Closes #25386 from HyukjinKwon/SPARK-28654.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commits on Aug 6, 2019
  1. [SPARK-28622][SQL][PYTHON] Rename PullOutPythonUDFInJoinCondition to ExtractPythonUDFFromJoinCondition and move to 'Extract Python UDFs'

    HyukjinKwon authored and gatorsmile committed Aug 6, 2019
    
    ## What changes were proposed in this pull request?
    
    This PR targets to rename `PullOutPythonUDFInJoinCondition` to `ExtractPythonUDFFromJoinCondition` and move to 'Extract Python UDFs' together with other Python UDF related rules.
    
    Currently, the `PullOutPythonUDFInJoinCondition` rule sits alone, outside of the other 'Extract Python UDFs' rules.
    
    Also, the name `ExtractPythonUDFFromJoinCondition` matches the existing Python UDF extraction rule names.
    
    ## How was this patch tested?
    
    Existing tests should cover.
    
    Closes #25358 from HyukjinKwon/move-python-join-rule.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
  2. [SPARK-28537][SQL][HOTFIX][FOLLOW-UP] Add supportColumnar in DebugExec

    HyukjinKwon committed Aug 6, 2019
    ## What changes were proposed in this pull request?
    
    This PR adds `supportColumnar` in `DebugExec`. It seems there was a conflict between #25274 and #25264.
    
    Currently tests are broken in Jenkins:
    
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108687/
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108688/
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108693/
    
    ```
    org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: ColumnarToRow +- InMemoryTableScan [id#356956L]       +- InMemoryRelation [id#356956L], StorageLevel(disk, memory, deserialized, 1 replicas)             +- *(1) Range (0, 5, step=1, splits=2)
    Stacktrace
    sbt.ForkMain$ForkError: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
    ColumnarToRow
    +- InMemoryTableScan [id#356956L]
          +- InMemoryRelation [id#356956L], StorageLevel(disk, memory, deserialized, 1 replicas)
                +- *(1) Range (0, 5, step=1, splits=2)
    
    	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:431)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:323)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
    ```
    
    ## How was this patch tested?
    
    Manually tested the failed test.
    
    Closes #25365 from HyukjinKwon/SPARK-28537.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Aug 1, 2019
  1. [SPARK-28568][SHUFFLE][DOCS] Make Javadoc in org.apache.spark.shuffle.api visible

    HyukjinKwon authored and dongjoon-hyun committed Aug 1, 2019
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to make Javadoc in org.apache.spark.shuffle.api visible.
    
    ## How was this patch tested?
    
    Manually built the doc and checked:
    
    ![Screen Shot 2019-08-01 at 4 48 23 PM](https://user-images.githubusercontent.com/6477701/62275587-400cc080-b47d-11e9-8fba-c4a0607093d1.png)
    
    Closes #25323 from HyukjinKwon/SPARK-28568.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
  2. [SPARK-28586][INFRA] Make merge-spark-pr script compatible with Python 3

    HyukjinKwon authored and dongjoon-hyun committed Aug 1, 2019
    ## What changes were proposed in this pull request?
    
    This PR proposes to make the `merge_spark_pr.py` script Python 3 compatible.
    
    ## How was this patch tested?
    
    Manually tested against my forked remote with the PR and JIRA below:
    
    #25321
    #25286
    https://issues.apache.org/jira/browse/SPARK-28153
    
    Closes #25322 from HyukjinKwon/merge-script.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Jul 31, 2019
  1. [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to…

    HyukjinKwon authored and cloud-fan committed Jul 31, 2019
    … support input_file_name with Python UDF)
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to use `AtomicReference` so that parent and child threads can access the same file block holder.
    
    Python UDF expressions are turned into a plan that launches a separate thread to consume the input iterator. In that child thread, the iterator calls `InputFileBlockHolder.set` before the parent does, and the parent thread is then unable to read the value:
    
    1. If the child thread happens to call `InputFileBlockHolder.set` first, before the parent's thread local has been initialized (initialization happens on the first `ThreadLocal.get()` call), the child thread ends up calling its own `initialValue` to initialize its copy.
    
    2. After that, the parent calls its own `initialValue` on its first `ThreadLocal.get()` call.
    
    3. The parent and child now hold two different references, so an update made in the child is not reflected in the parent.
    
    This PR fixes it by initializing the parent's thread local with an `AtomicReference` holding the file status, so that the same reference can be used in each task and the child thread's updates are reflected in the parent (see the sketch below).
    
    I also tried to explain this a bit more at #24958 (comment).
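    
    A plain-Scala sketch of the idea (not the Spark source; the names are illustrative only): the parent's inheritable thread local holds an `AtomicReference`, so a child thread inherits the same reference and its updates become visible to the parent.
    
    ```scala
    import java.util.concurrent.atomic.AtomicReference
    
    object FileBlockHolderSketch {
      // The thread local stores a mutable reference rather than the value itself.
      private val holder = new InheritableThreadLocal[AtomicReference[String]] {
        override def initialValue(): AtomicReference[String] = new AtomicReference("<unset>")
      }
    
      def main(args: Array[String]): Unit = {
        holder.get()  // initialize in the parent before any child thread is created
        val child = new Thread(() => holder.get().set("part-00000"))
        child.start()
        child.join()
        println(holder.get().get())  // the parent observes "part-00000"
      }
    }
    ```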
    
    ## How was this patch tested?
    
    Manually tested and unittest was added.
    
    Closes #24958 from HyukjinKwon/SPARK-28153.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commits on Jul 29, 2019
  1. [SPARK-28550][K8S][TESTS] Unset SPARK_HOME environment variable in K8…

    HyukjinKwon authored and dongjoon-hyun committed Jul 29, 2019
    …S integration preparation
    
    ## What changes were proposed in this pull request?
    
    Currently, if we run the Kubernetes integration tests with `SPARK_HOME` already set, that `SPARK_HOME` is used even when `--spark-tgz` is specified.
    
    This PR proposes to unset `SPARK_HOME` so that the docker-image-tool script detects `SPARK_HOME` itself; otherwise, it cannot point at the unpacked distribution directory as its home.
    
    ## How was this patch tested?
    
    ```bash
    export SPARK_HOME=`pwd`
    dev/make-distribution.sh --pip --tgz -Phadoop-2.7 -Pkubernetes
    resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --deploy-mode docker-for-desktop --spark-tgz $PWD/spark-*.tgz
    ```
    
    **Before:**
    
    ```
    + /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh -r docker.io/kubespark -t 650B51C8-BBED-47C9-AEAB-E66FC9A0E64E -p /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
    cp: resource-managers/kubernetes/docker/src/main/dockerfiles: No such file or directory
    cp: assembly/target/scala-2.12/jars: No such file or directory
    cp: resource-managers/kubernetes/integration-tests/tests: No such file or directory
    cp: examples/target/scala-2.12/jars/*: No such file or directory
    cp: resource-managers/kubernetes/docker/src/main/dockerfiles: No such file or directory
    cp: resource-managers/kubernetes/docker/src/main/dockerfiles: No such file or directory
    Cannot find docker image. This script must be run from a runnable distribution of Apache Spark.
    ...
    [INFO] Spark Project Kubernetes Integration Tests ......... FAILURE [  4.870 s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    ```
    
    **After:**
    
    ```
    + /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh -r docker.io/kubespark -t 2BA5883A-A0AC-4D2B-8D00-702D31B59B23 -p /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
    Sending build context to Docker daemon  250.2MB
    Step 1/15 : FROM openjdk:8-alpine
     ---> a3562aa0b991
    ...
    Successfully built 8614fb5ac279
    Successfully tagged kubespark/spark:2BA5883A-A0AC-4D2B-8D00-702D31B59B23
    ```
    
    Closes #25283 from HyukjinKwon/SPARK-28550.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commits on Jul 27, 2019
  1. [SPARK-28536][SQL][PYTHON][TESTS] Reduce shuffle partitions in Python…

    HyukjinKwon authored and dongjoon-hyun committed Jul 27, 2019
    … UDF tests in SQLQueryTestSuite
    
    ## What changes were proposed in this pull request?
    
    In Python UDF tests, the number of shuffle partitions considerably affects the testing time because each partition requires forking and communicating with external Python processes.
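    
    A minimal sketch of the idea, assuming the standard `spark.sql.shuffle.partitions` config (the exact wiring in `SQLQueryTestSuite` differs):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[2]").getOrCreate()
    // Keep shuffles tiny for UDF tests: each partition forks and talks to an
    // external Python process, so the default of 200 partitions is needlessly slow.
    spark.conf.set("spark.sql.shuffle.partitions", "4")
    ```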
    
    **Before:**
    
    ![image](https://user-images.githubusercontent.com/6477701/61989374-465c0080-b069-11e9-9936-b386d0cccf7a.png)
    
    **After: (with 4)**
    
    ![Screen Shot 2019-07-27 at 10 43 34 AM](https://user-images.githubusercontent.com/9700541/61997757-743a4880-b05b-11e9-9180-8d0976bda3bd.png)
    
    ## How was this patch tested?
    
    Manually tested locally.
    
    **Before:**
    
    ```
    [info] SQLQueryTestSuite:
    [info] - udf/udf-window.sql - Scala UDF (58 seconds, 558 milliseconds)
    [info] - udf/udf-window.sql - Regular Python UDF (58 seconds, 371 milliseconds)
    [info] - udf/udf-window.sql - Scalar Pandas UDF (1 minute, 8 seconds)
    ```
    
    **After:**
    
    ```
    [info] SQLQueryTestSuite:
    [info] - udf/udf-window.sql - Scala UDF (14 seconds, 690 milliseconds)
    [info] - udf/udf-window.sql - Regular Python UDF (10 seconds, 467 milliseconds)
    [info] - udf/udf-window.sql - Scalar Pandas UDF (10 seconds, 895 milliseconds)
    ```
    
    Closes #25271 from HyukjinKwon/SPARK-28536.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
  2. [SPARK-28441][SQL][TESTS][FOLLOW-UP] Skip Python tests if python exec…

    HyukjinKwon committed Jul 27, 2019
    …utable and pyspark library are unavailable
    
    ## What changes were proposed in this pull request?
    
    We should add `assume(shouldTestPythonUDFs)`. It may not be a big deal in general, but it can matter in other vendors' test environments. For instance, if somebody launches the tests in a minimal Docker image, they might suddenly start failing.
    
    This kind of skipping isn't new in our test base; see `TestUtils.testCommandAvailable` for instance.
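    
    A rough sketch of the pattern, with a hypothetical suite name and a placeholder availability check (assuming a recent ScalaTest; the real check would also probe for the pyspark library):
    
    ```scala
    import org.scalatest.funsuite.AnyFunSuite
    
    class PythonUDFSuiteSketch extends AnyFunSuite {
      // Placeholder check: probe for a python3 executable on the PATH.
      private def shouldTestPythonUDFs: Boolean =
        try new ProcessBuilder("python3", "--version").start().waitFor() == 0
        catch { case _: java.io.IOException => false }
    
      test("python udf round-trip") {
        // assume() cancels the test instead of failing it when the condition is false.
        assume(shouldTestPythonUDFs, "python executable or pyspark not available")
        // ... actual Python UDF assertions would go here ...
      }
    }
    ```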
    
    ## How was this patch tested?
    
    Manually tested.
    
    Closes #25272 from HyukjinKwon/SPARK-28441.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Jul 24, 2019
  1. [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoc…

    HyukjinKwon committed Jul 24, 2019
    …h in EpochTracker (to support Python UDFs)
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to use `InheritableThreadLocal` instead of `ThreadLocal` for the current epoch in `EpochTracker`. Python UDFs need separate threads to write to and read from the Python processes, and in those new threads the previously set epoch is lost.
    
    After this PR, Python UDFs can be used in Structured Streaming with the continuous mode.
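    
    A plain-Scala sketch of the difference (not the `EpochTracker` source; names are illustrative): with `InheritableThreadLocal`, a value set in the parent thread is visible in threads it spawns, whereas a plain `ThreadLocal` starts empty in every new thread.
    
    ```scala
    object EpochSketch {
      // With a plain ThreadLocal here, the child thread below would see null.
      private val epoch = new InheritableThreadLocal[java.lang.Long]
    
      def main(args: Array[String]): Unit = {
        epoch.set(42L)  // set in the "query" (parent) thread
        val worker = new Thread(() => println(s"child sees epoch = ${epoch.get()}"))
        worker.start()
        worker.join()   // prints: child sees epoch = 42
      }
    }
    ```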
    
    ## How was this patch tested?
    
    The test cases were written on top of #24945.
    Unit tests were added.
    
    Manual tests.
    
    Closes #24946 from HyukjinKwon/SPARK-27234.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Commits on Jul 22, 2019
  1. [SPARK-28321][DOCS][FOLLOW-UP] Update migration guide by 0-args Java …

    HyukjinKwon authored and cloud-fan committed Jul 22, 2019
    …UDF's internal behaviour change
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to add a note in the migration guide. See #25108 (comment)
    
    ## How was this patch tested?
    
    N/A
    
    Closes #25224 from HyukjinKwon/SPARK-28321-doc.
    
    Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
    Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commits on Jul 19, 2019
  1. [SPARK-28389][SQL][FOLLOW-UP] Use one example in 'add_months' behavio…

    HyukjinKwon committed Jul 19, 2019
    …r change at migration guide
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to add one example describing the 'add_months' behaviour change introduced by #25153.
    
    **Spark 2.4:**
    
    ```sql
    select add_months(DATE'2019-02-28', 1)
    ```
    
    ```
    +--------------------------------+
    |add_months(DATE '2019-02-28', 1)|
    +--------------------------------+
    |                      2019-03-31|
    +--------------------------------+
    ```
    
    **Current master:**
    
    ```sql
    select add_months(DATE'2019-02-28', 1)
    ```
    
    ```
    +--------------------------------+
    |add_months(DATE '2019-02-28', 1)|
    +--------------------------------+
    |                      2019-03-28|
    +--------------------------------+
    ```
    
    ## How was this patch tested?
    
    Manually tested on Spark 2.4.1 and the current master.
    
    Closes #25199 from HyukjinKwon/SPARK-28389.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>