
[SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pipeline Example has syntax errors] #15728

Closed
wants to merge 1,479 commits

Conversation

jagadeesanas2
Contributor

What changes were proposed in this pull request?

[Fix] [branch-2.0] In Python 3 there is only one integer type (`int`), which mostly behaves like the `long` type in Python 2. Since Python 3 does not accept the "L" suffix, it has been removed from all examples.

How was this patch tested?

Unit tests.

cloud-fan and others added 30 commits September 1, 2016 08:54
…ommand to handle ALTER VIEW AS

## What changes were proposed in this pull request?

Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 3 bugs:

1. SPARK-17180: ALTER VIEW AS should alter temp view if view name has no database part and temp view exists
2. SPARK-17309: ALTER VIEW AS should issue exception if view does not exist.
3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.

The root cause is that ALTER VIEW AS is quite different from CREATE VIEW, so we need a different code path to handle them. However, in `CreateViewCommand` there is no way to distinguish ALTER VIEW AS from CREATE VIEW without introducing an extra flag. Instead of doing that, a more natural approach is to separate the ALTER VIEW AS logic into a new command.
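For illustration, a minimal spark-shell sketch of the three behaviours the new command is meant to provide (table and view names here are made up, not taken from the patch):

```scala
spark.range(10).write.saveAsTable("src")                        // toy source table

// SPARK-17180: with no database qualifier, ALTER VIEW AS should alter an
// existing temp view of that name rather than touch a permanent view.
spark.sql("CREATE TEMPORARY VIEW v AS SELECT id FROM src")
spark.sql("ALTER VIEW v AS SELECT id FROM src WHERE id > 0")

// SPARK-17309: altering a view that does not exist should raise an error
// instead of silently behaving like CREATE VIEW.
spark.sql("ALTER VIEW no_such_view AS SELECT 1")                // expected to fail

// SPARK-17323: comment/properties set at creation time should survive ALTER VIEW AS.
spark.sql("CREATE VIEW pv COMMENT 'kept' AS SELECT id FROM src")
spark.sql("ALTER VIEW pv AS SELECT id FROM src WHERE id > 0")
spark.sql("DESC EXTENDED pv").show(truncate = false)            // comment should still be 'kept'
```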

backport apache#14874 to 2.0

## How was this patch tested?

new tests in SQLViewSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#14893 from cloud-fan/minor4.
## What changes were proposed in this pull request?

The usage in the original example is incorrect. This PR fixes it.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes apache#14903 from junyangq/SPARKR-FixWindowPartitionByDoc.

(cherry picked from commit d008638)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
…class defined in repl again

## What changes were proposed in this pull request?

After digging into the logs, I noticed the failure is because this test starts a local cluster with 2 executors. However, when the SparkContext is created, the executors may not be up yet. When one of the executors is not up while the job is running, the blocks won't be replicated.

This PR just adds a wait loop before running the job to fix the flaky test.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#14905 from zsxwing/SPARK-17318-2.

(cherry picked from commit 21c0a4f)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
## What changes were proposed in this pull request?

Ports apache#14841 and apache#14910 from `master` to `branch-2.0`

Jira : https://issues.apache.org/jira/browse/SPARK-17271

The planner adds an unneeded SORT operation due to a bug in the way `SortOrder` comparison is done at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
`SortOrder` needs to be compared semantically because the `Expression`s within two `SortOrder`s can be "semantically equal" without being literally equal objects.

eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`

Expression in required SortOrder:
```
      AttributeReference(
        name = "col1",
        dataType = LongType,
        nullable = false
      ) (exprId = exprId,
        qualifier = Some("a")
      )
```

Expression in child SortOrder:
```
      AttributeReference(
        name = "col1",
        dataType = LongType,
        nullable = false
      ) (exprId = exprId)
```

Notice that the output column has a qualifier while the child attribute does not, but the underlying expression is the same; hence, in this case, we can say that the child satisfies the required sort order.
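To make the idea concrete, here is a rough sketch against the (internal) branch-2.0 Catalyst API; it is only meant to illustrate the comparison, not the actual `EnsureRequirements` change:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExprId}
import org.apache.spark.sql.types.LongType

val id = ExprId(1)
val required = AttributeReference("col1", LongType, nullable = false)(exprId = id, qualifier = Some("a"))
val child    = AttributeReference("col1", LongType, nullable = false)(exprId = id)

required == child               // false: equals also compares the qualifier
required.semanticEquals(child)  // true: semantic comparison ignores such cosmetic differences
```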

This PR includes the following changes:
- Added a `semanticEquals` method to `SortOrder` so that it can compare the underlying child expressions semantically (rather than with the default `Object.equals`)
- Fixed `EnsureRequirements` to use semantic comparison of `SortOrder`

## How was this patch tested?

- Added a test case to `PlannerSuite` and ran the rest of the tests in `PlannerSuite`

Author: Tejas Patil <tejasp@fb.com>

Closes apache#14920 from tejasapatil/SPARK-17271_2.0_port.
## What changes were proposed in this pull request?

This removes partition columns from column metadata of partitions to match tables.

A change introduced in SPARK-14388 removed partition columns from the column metadata of tables, but not for partitions. This causes TableReader to believe that the schema is different between table and partition, and create an unnecessary conversion object inspector in TableReader.

## How was this patch tested?

Existing unit tests.

Author: Brian Cho <bcho@fb.com>

Closes apache#14515 from dafrista/partition-columns-metadata.

(cherry picked from commit 473d786)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
…ned exception

## What changes were proposed in this pull request?

Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer results in a `java.sql.SQLException: Method not supported` exception from `org.apache.hive.jdbc.HiveResultSetMetaData.isSigned`. Here are two user reports of this issue:

- https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0
- https://stackoverflow.com/questions/32195946/method-not-supported-in-spark

I have filed [HIVE-14684](https://issues.apache.org/jira/browse/HIVE-14684) to attempt to fix this in Hive by implementing the isSigned method, but in the meantime / for compatibility with older JDBC drivers I think we should add special-case error handling to work around this bug.

This patch updates `JDBCRDD`'s `ResultSetMetaData`-to-schema conversion to catch the "Method not supported" exception from Hive and return `isSigned = true`. I believe that this is safe because, as far as I know, Hive does not support unsigned numeric types.
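The defensive pattern is roughly the following (a sketch, not the exact `JDBCRDD` change; the matched message is the one quoted in the reports above):

```scala
import java.sql.{ResultSetMetaData, SQLException}

def isSignedSafe(metadata: ResultSetMetaData, column: Int): Boolean = {
  try {
    metadata.isSigned(column)
  } catch {
    // Hive's JDBC driver throws "Method not supported"; Hive has no unsigned
    // numeric types, so treating the column as signed is safe.
    case e: SQLException if e.getMessage == "Method not supported" => true
  }
}
```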

## How was this patch tested?

Tested manually against a Spark Thrift Server.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#14911 from JoshRosen/hive-jdbc-workaround.

(cherry picked from commit 15539e5)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
## What changes were proposed in this pull request?

SPARK-15373 (apache#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket didn't account for the renaming, so the style is now broken.

In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`.

## How was this patch tested?

manual tests.


* Before
<img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png">

* After
<img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png">

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes apache#14900 from sarutak/SPARK-17342.

(cherry picked from commit 2ab8dbd)
Signed-off-by: Sean Owen <sowen@cloudera.com>
… when collecting SparkDataFrame

## What changes were proposed in this pull request?


registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))

'data.frame':	5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y:List of 5
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2

The problem is that Spark returns the column type `decimal(10, 0)` instead of `decimal`, so `decimal(10, 0)` is not handled correctly. It should be handled as "double".

As discussed in the JIRA thread, we have two potential fixes:
1). Scala-side fix: add a new case when writing the object back. However, I can't use spark.sql.types._ in Spark core due to dependency issues, and I haven't found a way to do the type case match;

2). SparkR-side fix: add a helper function to check special types like `"decimal(10, 0)"` and replace them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future.

I open this PR to discuss the pros and cons of both approaches. If we want to do the Scala-side fix, we need to find a way to match the case of DecimalType and StructType in Spark Core.
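For reference, the mapping idea behind fix 2) amounts to something like the following (a Scala sketch only; the real helper lives on the SparkR side and the function name here is made up):

```scala
// Map "special" type strings such as "decimal(10, 0)" to a primitive type name.
def normalizeColType(colType: String): String =
  if (colType.startsWith("decimal")) "double" else colType

normalizeColType("decimal(10, 0)")  // "double"
normalizeColType("string")          // "string"
```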

## How was this patch tested?


Manual test:
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))
'data.frame':	5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y: num  2 2 2 2 2
R Unit tests

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#14613 from wangmiao1981/type.

(cherry picked from commit 0f30cde)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
…ecause of calculation error

## What changes were proposed in this pull request?

In StagePage, the executor computing time is calculated, but a calculation error can occur because it is computed by subtracting floating-point numbers.

The following capture is an example.

<img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png">
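As a toy illustration of how such a subtraction can go wrong (the numbers below are arbitrary, not taken from the UI):

```scala
val total     = 0.3
val others    = 0.1 + 0.2          // 0.30000000000000004 as an IEEE 754 double
val computing = total - others     // slightly negative instead of 0.0
computing < 0                      // true
```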

## How was this patch tested?

Manual tests.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes apache#14908 from sarutak/SPARK-17352.

(cherry picked from commit 7ee24da)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Function-related `HiveExternalCatalog` APIs do not have enough verification logic. After this PR, `HiveExternalCatalog` and `InMemoryCatalog` are consistent in their error handling.

For example, below is the exception we got when calling `renameFunction`.
```
15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException
15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException
15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'.
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
```

Improved the existing test cases to check whether the messages are right.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#14521 from gatorsmile/functionChecking.

(cherry picked from commit 247a4fa)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"

## What changes were proposed in this pull request?

Set SparkSession._instantiatedContext to None so that we can recreate the SparkSession.

## How was this patch tested?

Tested manually using the following command in pyspark shell
```
spark.stop()
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("show databases").show()
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes apache#14857 from zjffdu/SPARK-17261.

(cherry picked from commit ea66228)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
## What changes were proposed in this pull request?

Add sparkR.version() API.

```
> sparkR.version()
[1] "2.1.0-SNAPSHOT"
```

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#14935 from felixcheung/rsparksessionversion.

(cherry picked from commit 812333e)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
…when match fails

## What changes were proposed in this pull request?

Doc change - see https://issues.apache.org/jira/browse/SPARK-16324

## How was this patch tested?

manual check

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#14934 from felixcheung/regexpextractdoc.

(cherry picked from commit 419eefd)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
## What changes were proposed in this pull request?

change since version in doc

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#14939 from felixcheung/rsparkversion2.

(cherry picked from commit eac1d0e)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
…on in DataFrameWriter

Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions, so we should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.

For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces a wrong result.

Ideally, we should make all the analyzer rules idempotent, but that may require a lot of effort to double-check them one by one (and may not be easy).

An easier approach is to never feed an optimized plan into the Analyzer. This PR fixes the case for RunnableCommand: its `query` gets optimized, and during execution the passed `query` is also fed into QueryExecution again. This PR makes these `query` plans not part of the children, so they will not be optimized and analyzed again.

Right now we cannot tell whether a logical plan has been optimized or not; we could introduce a flag for that and make sure an optimized logical plan is never analyzed again.

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes apache#14797 from davies/fix_writer.

(cherry picked from commit ed9c884)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
… row groups shouldn't throw an error

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes apache#14941 from sameeragarwal/parquet-exception-2.

(cherry picked from commit a2c9acb)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
…secutive row groups shouldn't throw an error"

This reverts commit a3930c3.
## What changes were proposed in this pull request?

This PR tries to add some more explanation to `sparkR.session`. It also modifies the doc for `count` so that, when grouped into one doc, the description doesn't confuse users.

## How was this patch tested?

Manual test.

![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)

Author: Junyang Qian <junyangq@databricks.com>

Closes apache#14942 from junyangq/fixSparkRSessionDoc.

(cherry picked from commit d2fde6b)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
… type

## What changes were proposed in this pull request?

We propose to fix the Encoder type in the Dataset example

## How was this patch tested?

The PR will be tested with the current unit test cases

Author: CodingCat <zhunansjtu@gmail.com>

Closes apache#14901 from CodingCat/SPARK-17347.

(cherry picked from commit 97da410)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?
was not dropping table `parquet_t3`

## How was this patch tested?
tested `LogicalPlanToSQLSuite` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes apache#13767 from techaddict/minor-8.

(cherry picked from commit a8a35b3)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…m Hive Metastore

### What changes were proposed in this pull request?
The `comment` in the `CatalogTable` returned from Hive is always empty. We store it in a table property when creating a table, but when we retrieve the table metadata from the Hive metastore we do not rebuild it, so the `comment` stays empty.

This PR is to fix the issue.

### How was this patch tested?
Fixed the test case to verify the change.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#14550 from gatorsmile/tableComment.

(cherry picked from commit bdd5371)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…e and hive serde tables

Currently there are two inconsistencies:

1. For a data source table we only print partition names, while for a Hive table we also print the partition schema. After this PR, we will always print the schema.
2. If a column doesn't have a comment, a data source table prints an empty string while a Hive table prints null. After this PR, we will always print null.

new test in `HiveDDLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#14302 from cloud-fan/minor3.

(cherry picked from commit a2abb58)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct: the `struct.simpleString` implementation truncates the number of fields it shows (25 at most). This breaks the generation of a proper `catalogString`, and has been shown to cause errors while writing to Hive.

This PR fixes this by providing proper `catalogString` implementations for `ArrayType` and `MapType`.
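A small sketch of the failure mode (the field count is arbitrary):

```scala
import org.apache.spark.sql.types._

// A struct wide enough to trip the 25-field truncation in simpleString.
val wideStruct    = StructType((1 to 40).map(i => StructField(s"c$i", IntegerType)))
val arrayOfStruct = ArrayType(wideStruct)

// Before this patch, the array's catalog representation was derived from
// simpleString and came out truncated, which Hive cannot parse; with a
// dedicated catalogString it contains all 40 fields.
println(wideStruct.simpleString)
println(arrayOfStruct.catalogString)
```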

## How was this patch tested?
Added testing for `catalogString` to `DataTypeSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#14938 from hvanhovell/SPARK-17335.

(cherry picked from commit c2a1576)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
### What changes were proposed in this pull request?
In the latest branch-2.0, we have two test case failures due to backporting.

- test("ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.")
- test("SPARK-6212: The EXPLAIN output of CTAS only shows the analyzed plan")

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#14951 from gatorsmile/fixTestFailure.
…le bugs in CREATE TABLE LIKE command

### What changes were proposed in this pull request?
This PR is to backport apache#14531.

The existing `CREATE TABLE LIKE` command has multiple issues:

- The generated table is non-empty when the source table is a data source table. The major reason is the data source table is using the table property `path` to store the location of table contents. Currently, we keep it unchanged. Thus, we still create the same table with the same location.

- The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive is checking the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not. (See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still `EXTERNAL`.

- When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far, this does not break anything, but it could cause something wrong in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)

- The issue regarding the table `comment`. To follow what Hive does, the table comment should be cleaned, but the column comments should still be kept.

- The `INDEX` table is not supported. Thus, we should throw an exception in this case.

- `owner` should not be retained. `ToHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string to avoid the confusing output in Explain.

- Add support for temp tables

- Like Hive, we should not copy the table properties from the source table to the created table, especially for the statistics-related properties, which could be wrong in the created table.

- `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.

- When the type of source table is a view, the target table is using the default format of data source tables: `spark.sql.sources.default`.

This PR is to fix the above issues.

### How was this patch tested?
Improve the test coverage by adding more test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#14946 from gatorsmile/createTableLike20.
…e to missing otherCopyArgs

## What changes were proposed in this pull request?

`TreeNode.toJSON` requires a subclass to explicitly override `otherCopyArgs` to include curried constructor arguments; otherwise it throws an assertion error saying that the number of constructor argument values doesn't match the number of constructor argument names.

The class `MetastoreRelation` has a curried constructor parameter `client: HiveClient`, but Spark forgot to add it to the list of `otherCopyArgs`.
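A minimal, self-contained sketch of the contract (made-up class names, not the real Catalyst code):

```scala
trait FakeTreeNode extends Product {
  protected def otherCopyArgs: Seq[AnyRef] = Nil
  // A toJSON-style consumer pairs constructor parameter names with these values.
  def jsonValues: Seq[Any] = productIterator.toSeq ++ otherCopyArgs
}

// `client` sits in a second, curried parameter list, so it is invisible to
// productIterator and must be reported explicitly.
case class FakeMetastoreRelation(db: String, table: String)(client: AnyRef)
  extends FakeTreeNode {
  // Without this override, jsonValues has 2 entries for 3 declared constructor
  // parameters, which is the assertion failure described above.
  override protected def otherCopyArgs: Seq[AnyRef] = client :: Nil
}
```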

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes apache#14928 from clockfly/metastore_relation_toJSON.

(cherry picked from commit afb3d5d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…beelines

## What changes were proposed in this pull request?
A cached table (parquet/orc) couldn't be shared between beeline sessions, because the `sameResult` method used by `CacheManager` always returns false (the `sparkSession` fields are different) when comparing two `HadoopFsRelation`s from different beeline sessions. So we make `sparkSession` a curried parameter.

## How was this patch tested?
Beeline1
```
1: jdbc:hive2://localhost:10000> CACHE TABLE src_pqt;
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (5.143 seconds)
1: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                            plan                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#49, value#50]
   +- InMemoryRelation [key#49, value#50], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string>  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```

Beeline2
```
0: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                            plan                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#68, value#69]
   +- InMemoryRelation [key#68, value#69], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string>  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes apache#14913 from watermen/SPARK-17358.

(cherry picked from commit 64e826f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ption due to missing otherCopyArgs"

This reverts commit 7b1aa21.
…on due to missing otherCopyArgs

backport apache#14928 to 2.0

## What changes were proposed in this pull request?

`TreeNode.toJSON` requires a subclass to explicitly override `otherCopyArgs` to include curried constructor arguments; otherwise it throws an assertion error saying that the number of constructor argument values doesn't match the number of constructor argument names.

The class `MetastoreRelation` has a curried constructor parameter `client: HiveClient`, but Spark forgot to add it to the list of `otherCopyArgs`.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes apache#14968 from clockfly/metastore_toJSON_fix_for_spark_2.0.
hayashidac and others added 26 commits October 26, 2016 07:14
… to show https url when ssl is enabled

The Spark history server log needs to be fixed to show the https URL when SSL is enabled.

Author: chie8842 <chie@chie-no-Mac-mini.local>

Closes apache#15611 from hayashidac/SPARK-16988.

(cherry picked from commit c329a56)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
…eption when saving DF to MySQL

## What changes were proposed in this pull request?

When the next exception in a JDBC SQLException chain is null, don't attach it as the cause or as a suppressed exception.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes apache#15599 from srowen/SPARK-18022.

(cherry picked from commit 6c7d094)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…for query

## What changes were proposed in this pull request?

The functions `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive expressions. For instance, if we have two constraints of the form (where a is an alias):
`a = b, a = f(b, c)`
applying both these rules in the next iteration would infer:
`f(b, c) = f(f(b, c), c)`
As this process repeats, the iteration won't converge and the set of constraints will grow larger and larger until OOM.
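A toy illustration of the blow-up in plain Scala (string substitution only, nothing from Catalyst):

```scala
// Repeatedly substituting b -> f(b, c) into a constraint grows it without bound.
def substitute(constraint: String): String = constraint.replace("b", "f(b, c)")

Iterator.iterate("a = f(b, c)")(substitute).take(4).foreach(println)
// a = f(b, c)
// a = f(f(b, c), c)
// a = f(f(f(b, c), c), c)
// a = f(f(f(f(b, c), c), c), c)
```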

~~To fix this problem, we collect aliases from expressions and skip inferring constraints if we are about to transform an `Expression` into another one that contains it.~~
To fix this problem, we apply an additional check in `inferAdditionalConstraints`: when it is possible to generate a recursive constraint, we skip generating it.

## How was this patch tested?

Added new test cases in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#15319 from jiangxb1987/constraints.

(cherry picked from commit 3c02357)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
…rdless of warehouse dir's existence

## What changes were proposed in this pull request?
Append a trailing slash, if there isn't one already, for the sake of comparing the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to a missing trailing slash.
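The normalization amounts to something like this (helper name made up for illustration):

```scala
def withTrailingSlash(path: String): String =
  if (path.endsWith("/")) path else path + "/"

// The two spellings of the warehouse dir now compare equal:
withTrailingSlash("/user/hive/warehouse") == withTrailingSlash("/user/hive/warehouse/")  // true
```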

## How was this patch tested?
Ran unit tests and they passed.

Author: Mark Grover <mark@apache.org>

Closes apache#15623 from markgrover/spark-18093.

(cherry picked from commit 4bee954)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
This patch updates the failure handling logic so the Spark executor does not crash when it sees a LinkageError.

## How was this patch tested?
Added an end-to-end test in FailureSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes apache#13982 from petermaxlee/SPARK-16304.
## What changes were proposed in this pull request?

The `UnaryNode.getAliasedConstraints` function fails to replace all expressions with their aliases when a constraint contains more than one expression to be replaced.
For example:
```
val tr = LocalRelation('a.int, 'b.string, 'c.int)
val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
multiAlias.analyze.constraints
```
currently outputs:
```
ExpressionSet(Seq(
    IsNotNull(resolveColumn(multiAlias.analyze, "x")),
    IsNotNull(resolveColumn(multiAlias.analyze, "y"))
))
```
The constraint `resolveColumn(multiAlias.analyze, "x") === resolveColumn(multiAlias.analyze, "y") + 10` is missing.

## How was this patch tested?

Add new test cases in `ConstraintPropagationSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#15597 from jiangxb1987/alias-constraints.

(cherry picked from commit fa7d9d7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request?

There is no need to build docs for KafkaSource because users should go through the data source APIs to use it; all KafkaSource APIs are internal.

## How was this patch tested?

Verified manually.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15630 from zsxwing/kafka-unidoc.

(cherry picked from commit 7d10631)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…(branch 2.0)

## What changes were proposed in this pull request?

Backport apache#15520 to 2.0.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15646 from zsxwing/SPARK-13747-2.0.
…lementation classes

## What changes were proposed in this pull request?

This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data (see the sketch after this list). Summary of changes:
* Added a method `commit(end: Offset)` that tells the Source it is OK to discard all offsets up to `end`, inclusive.
* Changed the semantics of a `None` value for the `getBatch` method to mean "from the very beginning of the stream", as opposed to "all data present in the Source's buffer".
* Added notes that the upper layers of the system will never call `getBatch` with a start value less than the last value passed to `commit`.
* Added a `lastCommittedOffset` method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code.
* The scheduler in `StreamExecution.scala` now calls `commit` on its stream sources after marking each batch as complete in its checkpoint.
* `MemoryStream` now cleans committed batches out of its internal buffer.
* `TextSocketSource` now cleans committed batches from its internal buffer.
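A simplified sketch of the revised contract (the real trait is the internal `org.apache.spark.sql.execution.streaming.Source`; `Offset` below is just a placeholder type and the signatures are abbreviated):

```scala
import org.apache.spark.sql.DataFrame

trait BufferedSource {
  type Offset

  /** Latest offset available in this source, if any data has arrived yet. */
  def getOffset: Option[Offset]

  /** Returns the data in (start, end]. A `start` of None now means "from the
    * very beginning of the stream", not "everything currently buffered". */
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  /** Called after the scheduler has durably checkpointed a batch: everything up
    * to and including `end` may be dropped from the internal buffer, and
    * getBatch will never again be asked for data earlier than this point. */
  def commit(end: Offset): Unit
}
```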

## How was this patch tested?
Existing regression tests already exercise the new code.

Author: frreiss <frreiss@us.ibm.com>

Closes apache#14553 from frreiss/fred-16963.

(cherry picked from commit 5b27598)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…or() on dataframe produced by RunnableCommand

A short code snippet that uses toLocalIterator() on a dataframe produced by a RunnableCommand reproduces the problem. toLocalIterator() is called by the Thrift server when `spark.sql.thriftServer.incrementalCollect` is set, to handle queries that produce a large result set.

**Before**
```scala
scala> spark.sql("show databases")
res0: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> res0.toLocalIterator()
16/10/26 03:00:24 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```

**After**
```scala
scala> spark.sql("drop database databases")
res30: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show databases")
res31: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> res31.toLocalIterator().asScala foreach println
[default]
[parquet]
```
Added a test in DDLSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes apache#15642 from dilipbiswal/SPARK-18009.

(cherry picked from commit dd4f088)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR fixes checkstyle.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#15656 from yhuai/fix-format.

(cherry picked from commit d3b4831)
Signed-off-by: Yin Huai <yhuai@databricks.com>
## What changes were proposed in this pull request?

Add a maxOffsetsPerTrigger option for rate limiting, apportioned proportionally based on the volume of the different topic-partitions.
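A hedged usage sketch (broker address and topic name are made up):

```scala
// Cap the total number of records read per trigger; the cap is split across
// the topic-partitions in proportion to how much unread data each one has.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```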

## How was this patch tested?

Added unit test

Author: cody koeninger <cody@koeninger.org>

Closes apache#15527 from koeninger/SPARK-17813.

(cherry picked from commit 1042325)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…ion"

## What changes were proposed in this pull request?

A follow-up PR for apache#14553 to fix the flaky test. It's flaky because the file listing API doesn't guarantee any order in the returned list.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15661 from zsxwing/fix-StreamingQuerySuite.

(cherry picked from commit 79fd0cc)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
… throws exception

## What changes were proposed in this pull request?

Fixed the issue that ForeachSink didn't rethrow the exception.

## How was this patch tested?

The fixed unit test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15674 from zsxwing/foreach-sink-error.

(cherry picked from commit 59cccbd)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
… for Kafka 0.10 integration doc

## What changes were proposed in this pull request?

Added a Java code snippet for the Kafka 0.10 integration doc.

## How was this patch tested?

SKIP_API=1 jekyll build

## Screenshot

![kafka-doc](https://cloud.githubusercontent.com/assets/15843379/19826272/bf0d8a4c-9db8-11e6-9e40-1396723df4bc.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#15679 from lw-lin/kafka-010-examples.

(cherry picked from commit 505b927)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…eaking history server (branch 2.0)

## What changes were proposed in this pull request?

Backport apache#15663 to branch-2.0 and fixed conflicts in `ReplayListenerBus`.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15695 from zsxwing/fix-event-log-2.0.
…the files

## What changes were proposed in this pull request?

The test `when schema inference is turned on, should read partition data` should not delete files because the source may still be listing them. This PR just removes the delete actions since they are not necessary.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15699 from zsxwing/SPARK-18030.

(cherry picked from commit de3f87f)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…ion error

Enclose the --conf option value in double quotes to support multi-value configs like spark.driver.extraJavaOptions; without the quotes, the driver will fail to start.

Jenkins tests.

Tested in our production environment, plus unit tests. It is a very small change.

Author: Wang Lei <lei.wang@kongming-inc.com>

Closes apache#15643 from LeightonWong/messos-cluster.

(cherry picked from commit 9b377aa)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?
Like [Dataset.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L156), KeyValueGroupedDataset should mark its queryExecution as transient.

As mentioned in the Jira ticket, without transient we saw serialization issues like

```
Caused by: java.io.NotSerializableException: org.apache.spark.sql.execution.QueryExecution
Serialization stack:
        - object not serializable (class: org.apache.spark.sql.execution.QueryExecution, value: ==
```

## How was this patch tested?

Ran the query specified in the Jira ticket before and after the change:
```
val a = spark.createDataFrame(sc.parallelize(Seq((1, 2), (3, 4)))).as[(Int, Int)]
val grouped = a.groupByKey { x: (Int, Int) => x._1 }
val mappedGroups = grouped.mapGroups { (k, x) => (k, 1) }
val yyy = sc.broadcast(1)
val last = mappedGroups.rdd.map { xx =>
  val simpley = yyy.value
  1
}
```

Author: Ergin Seyfe <eseyfe@fb.com>

Closes apache#15706 from seyfe/keyvaluegrouped_serialization.

(cherry picked from commit 8a538c9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…indow/GroupBy

## What changes were proposed in this pull request?

Aggregation without Window/GroupBy expressions will fail in `checkAnalysis`, but the error message is a bit misleading; we should generate a more specific error message for this case.

For example,

```
spark.read.load("/some-data")
  .withColumn("date_dt", to_date($"date"))
  .withColumn("year", year($"date_dt"))
  .withColumn("week", weekofyear($"date_dt"))
  .withColumn("user_count", count($"userId"))
  .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
)
```

creates the following output:

```
org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
```

In the error message above, `randomColumn` doesn't appear in the query (actually it is added by the function `withColumn`), so the message is not enough for the user to address the problem.
## How was this patch tested?

Manually test

Before:

```
scala> spark.sql("select col, count(col) from tbl")
org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
```

After:

```
scala> spark.sql("select col, count(col) from tbl")
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;;
```

Also add new test sqls in `group-by.sql`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#15672 from jiangxb1987/groupBy-empty.

(cherry picked from commit d0272b4)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…SPARK-18114

## What changes were proposed in this pull request?

Fix style error introduced in cherry-pick of apache#15643 to branch-2.0.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes apache#15719 from srowen/SPARK-18114.2.
@felixcheung
Member

could you close this and create a PR against branch-2.0 instead of master?

@SparkQA

SparkQA commented Nov 2, 2016

Test build #67951 has finished for PR 15728 at commit 6b7796f.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
